Scrapinghub, Visual Web Scraping Software
Scrapinghub is a developer-focused, cloud-based crawling platform that provides a suite of web-scraping tools and services for building, deploying, and running web crawlers. The scalable platform comprises four major tools: Scrapy Cloud, Crawlera, Portia, and Splash.
Scrapy Cloud hosts Scrapy spiders and helps visualize, schedule, and monitor their runs. With Scrapy Cloud, users can:
- code or visually build their spiders and deploy them to the cloud
- manage spiders
- filter, analyze, and aggregate data
- review the extracted data and download it in CSV, JSON, or XML format
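The export formats listed above are plain serializations of the scraped items. As a rough sketch of what such a feed export produces, the snippet below serializes a couple of invented sample records (the field names and values are illustrative, not from any real Scrapinghub job) to JSON and CSV using only the standard library:

```python
import csv
import io
import json

# Hypothetical records, shaped like items a Scrapy spider might yield.
records = [
    {"title": "Item A", "price": "9.99"},
    {"title": "Item B", "price": "19.99"},
]

def to_json(items):
    """Serialize scraped items as a JSON feed export would."""
    return json.dumps(items, indent=2)

def to_csv(items):
    """Serialize scraped items as CSV, one row per item."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(items[0]))
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()

print(to_json(records))
print(to_csv(records))
```

On Scrapy Cloud itself, the equivalent downloads are generated for you from a job's item store; this sketch only shows the shape of the output.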
Portia is an open source visual scraping tool that runs in the browser, aimed at users with no programming knowledge. Users create a template by clicking the content they want to scrape on a sample page, and Portia builds a spider that scrapes similar pages.
With a pool of thousands of IP addresses across 50+ countries, Crawlera addresses the IP-ban problem. The tool can detect more than 130 ban types and respond appropriately to minimize bans, for example by slowing the request rate or rotating IPs. It supports both HTTP and HTTPS proxies.
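From a client's point of view, a service like Crawlera is simply an authenticated HTTP/HTTPS proxy that your requests are routed through. The sketch below builds such a proxied opener with the standard library; the endpoint host, port, and API key are placeholders, not Crawlera's real ones (those come from your Scrapinghub account):

```python
import urllib.request

# Hypothetical proxy endpoint and API key for illustration only.
PROXY_HOST = "proxy.example.com:8010"
API_KEY = "YOUR_API_KEY"

def build_proxied_opener(proxy_host, api_key):
    """Build a urllib opener that routes both HTTP and HTTPS traffic
    through an authenticated, Crawlera-style proxy."""
    proxy = urllib.request.ProxyHandler({
        "http": f"http://{api_key}:@{proxy_host}",
        "https": f"http://{api_key}:@{proxy_host}",
    })
    return urllib.request.build_opener(proxy)

opener = build_proxied_opener(PROXY_HOST, API_KEY)
# opener.open("https://example.com")  # would fetch via the proxy
```

The ban detection and IP rotation described above happen server-side; the client only sees the single proxy endpoint.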
Scrapinghub started with the success of Scrapy, and it now maintains several open source crawling projects, such as:
- Scrapely, a library for generating parsers for web scraping
- Frontera, a framework that manages a crawl's logic and policies
- MDR, a library for extracting list data from web pages
- ScrapyJS, a Scrapy middleware for Splash
- Loginform, a library for filling in login forms at specified URLs, and more
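To give a flavor of the last item, the sketch below does a much-simplified version of what a login-form-filling library must do: parse the page, collect the first form's input fields, and fill in the ones that look like username and password inputs. It uses only the standard library and heuristic field-name matching; it is not the loginform library's actual API.

```python
from html.parser import HTMLParser

class LoginFormFinder(HTMLParser):
    """Collect the <input> fields of the first <form> on a page."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.fields:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            # Keep any preset value (e.g. a hidden CSRF token).
            self.fields[attrs["name"]] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def fill_login_form(html, username, password):
    """Return the form's fields with likely user/pass inputs filled in.
    A simplified sketch of what a form-filling library automates."""
    parser = LoginFormFinder()
    parser.feed(html)
    fields = dict(parser.fields)
    for name in fields:
        lowered = name.lower()
        if "user" in lowered or "email" in lowered:
            fields[name] = username
        elif "pass" in lowered:
            fields[name] = password
    return fields

page = """<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="csrf_token" value="abc123">
</form>"""

print(fill_login_form(page, "alice", "s3cret"))
# → {'username': 'alice', 'password': 's3cret', 'csrf_token': 'abc123'}
```

Note that hidden fields such as CSRF tokens are preserved unchanged, which is essential for the submitted form to be accepted.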