Crawl Anywhere, a Web Crawling and Document Processing Solution
Crawl Anywhere is a web crawler available as an open source project on GitHub since version 4.0. It also provides all the essential components needed to build a vertical search engine: it discovers and reads HTML pages and documents such as Office or PDF files on websites and indexes their content.
Crawl Anywhere has the following components:
- a Web Crawler that features a powerful web user interface
- a pipeline for document processing
- a Solr indexer
- a customizable full-featured application
Crawl Anywhere scrapes web sources and uses the XmlQueueWriter document handler to write each binary file (doc, pdf, and more) or HTML page as an XML file in a file system queue. The software can crawl several different sources and multiple documents simultaneously, runs on Linux and Windows, and respects robots.txt files. Crawl Anywhere is built in Java and uses a MySQL database to store source parameters and per-item crawl metadata, such as crawl status and schedule. The crawler has an administration and monitoring interface that lets administrators manage sources, monitor current crawl processes, and see the list and details of sources.
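As an illustration, a crawled item written to the file system queue might be serialized along these lines. The element names here are hypothetical; the actual schema is defined by Crawl Anywhere's XmlQueueWriter:

```xml
<!-- Hypothetical sketch of a queued crawl item; actual element names
     and structure depend on the XmlQueueWriter schema. -->
<document>
  <url>http://www.example.com/report.pdf</url>
  <contentType>application/pdf</contentType>
  <crawlStatus>fetched</crawlStatus>
  <!-- binary payloads are typically Base64-encoded when embedded in XML -->
  <content encoding="base64">JVBERi0xLjQK...</content>
</document>
```

Serializing every fetched item to a common XML envelope is what lets the downstream pipeline treat HTML pages and binary documents uniformly.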
The Pipeline, an open source document processing component, transforms and enriches the HTML pages and binary documents produced by Crawl Anywhere through a number of configurable stages, then pushes them to the Solr indexer, which reads a queue of XML documents and indexes them. Each XML document includes both the data to be indexed and instructions on how to index it.
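For reference, Solr's standard XML update format, which an indexer of this kind would ultimately produce, looks like the following. The field names are illustrative and depend on the Solr schema in use:

```xml
<!-- A Solr <add> request: each <doc> becomes one indexed document.
     Field names shown here are examples; they must match the Solr schema. -->
<add>
  <doc>
    <field name="id">http://www.example.com/report.pdf</field>
    <field name="title">Annual Report</field>
    <field name="content">Full extracted text of the document...</field>
  </doc>
</add>
```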
The Search Application offers a full set of features that let users implement their own specific search interface and immediately search the documents that have been crawled and indexed.