Scrapy, a Fast and Powerful Web Crawling & Scraping Framework
Scrapy is a free, open-source, collaborative framework written in Python for crawling websites and extracting structured data from their pages. It can also be used for a wide range of applications such as data mining, information monitoring, historical archival, and automated testing. Scrapy runs on Linux, Mac, Windows, and BSD. The largest company sponsoring Scrapy development is Scrapinghub, which was founded by Scrapy's creators. Well-known companies such as CareerBuilder, Lyst, and Parse.ly use Scrapy in their business.
Although Scrapy was designed for web scraping, it can also be used as a general-purpose web crawler and as a tool for extracting data through APIs. The entire architecture of a Scrapy project is built around “spiders”, which act as crawlers. Users can build their spiders and deploy them to Scrapy Cloud or host them on their own servers.
Scrapy has a wide range of powerful features and extensions that make scraping easy and efficient. It makes the process of building spiders quicker and less programming-intensive. The key features of this software are the following:
- selecting and extracting data from HTML/XML sources
- robust encoding support and auto-detection
- an interactive shell console
- multiple formats for exporting extracted data such as JSON, XML, and CSV
- storing extracted data in multiple backends
- a variety of extensions and middleware for handling HTTP features, robots.txt, cookies, crawl depth restrictions, etc.