Nutch, an Extensible and Scalable Web Crawler
Apache Nutch is a flexible open source web crawler developed by the Apache Software Foundation to aggregate data from the web. It is popular because it can handle data at a very large scale and can be customized through a wide variety of plugins. Nutch is typically used in conjunction with other Apache products, such as Apache Hadoop for data analysis and Apache Solr, which acts as a repository for the data that Nutch collects.
Nutch has a complex architecture that can be divided into two parts: a crawler that fetches web pages and turns them into an inverted index, and a searcher that uses this index to answer users' queries. The fetcher obeys robots.txt rules and robots directives. The two components can be scaled independently of each other.
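To make the crawler/searcher split concrete, here is a minimal sketch of an inverted index in Java. It is an illustration only, not Nutch's actual index format: the class name `TinyInvertedIndex`, its methods, and the example URLs are all hypothetical. The "crawler" side adds page text under a URL, and the "searcher" side returns the URLs that contain every query term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified index: maps each term to the set of URLs containing it.
class TinyInvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Crawler side: index the text of one fetched page.
    void addDocument(String url, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
        }
    }

    // Searcher side: return the URLs that contain every term of the query.
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> urls = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(urls);   // first term seeds the result
            } else {
                result.retainAll(urls);         // later terms narrow it down
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Lowercase the text and split on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("http://a.example/", "Apache Nutch is a web crawler");
        index.addDocument("http://b.example/", "Apache Solr is a search server");
        System.out.println(index.search("apache"));
        System.out.println(index.search("web crawler"));
    }
}
```

Because the index is just a map from terms to URLs, it can be built by one process and queried by another, which is the property that lets the two halves scale separately.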
Apache Nutch is written entirely in Java, but its data formats are language-independent. Today Apache Nutch has two code bases:
- Nutch 1.x relies on Apache Hadoop data structures, with the data stored on HDFS, and is well suited to batch processing. It is a production-ready crawler that allows fine-grained configuration.
- Nutch 2.x is an emerging alternative that was inspired by 1.x but differs from it in one key area: it builds a storage abstraction on top of Apache Gora. This open source framework makes it possible to store a wide variety of data (status, time, content, parsed text, inlinks, outlinks, and more) in an extremely flexible model on NoSQL databases such as Cassandra or HBase.
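The idea of a storage abstraction can be sketched in a few lines of Java. This is not Gora's API: the names `PageRecord`, `PageStore`, and `InMemoryPageStore` are hypothetical, and the fields simply mirror the kinds of data listed above. The point is that crawler code talks only to the interface, so the back end behind it could be swapped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Small driver showing that callers depend only on the PageStore interface.
class PageStoreDemo {
    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore();

        PageRecord page = new PageRecord();
        page.status = "fetched";
        page.fetchTime = System.currentTimeMillis();
        page.parsedText = "Example page";
        page.outlinks.add("http://example.org/about");
        store.put("http://example.org/", page);

        Optional<PageRecord> found = store.get("http://example.org/");
        System.out.println(found.map(p -> p.status).orElse("unknown"));
    }
}

// Hypothetical per-URL record; field names are illustrative, not Gora's schema.
class PageRecord {
    String status;
    long fetchTime;
    String content;
    String parsedText;
    final List<String> inlinks = new ArrayList<>();
    final List<String> outlinks = new ArrayList<>();
}

// Hypothetical storage abstraction: the only type the crawler code needs to know.
interface PageStore {
    void put(String url, PageRecord page);
    Optional<PageRecord> get(String url);
}

// In-memory stand-in for a real back end such as HBase or Cassandra.
class InMemoryPageStore implements PageStore {
    private final Map<String, PageRecord> rows = new HashMap<>();

    @Override
    public void put(String url, PageRecord page) {
        rows.put(url, page);
    }

    @Override
    public Optional<PageRecord> get(String url) {
        return Optional.ofNullable(rows.get(url));
    }
}
```

A second implementation of `PageStore` backed by a real database could replace `InMemoryPageStore` without any change to the calling code.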
Apache Nutch is a robust, fault-tolerant framework that can run on a single machine or on a cluster of up to 100 machines for large-scale web crawling. Thanks to its highly modular architecture, Nutch is extensible via plugins that are activated on demand. Extension points include the URL normalizer, URL filter, parser, parse filter, index writer, indexing filter, and scoring filter, among others.
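As a rough illustration of how an extension point such as the URL filter might work, the sketch below chains several small filters behind one interface. It is a simplified model under stated assumptions, not Nutch's plugin API: `UrlFilter`, `SchemeFilter`, `SuffixFilter`, and `UrlFilterChain` are hypothetical names, and the convention that a filter returns `null` to reject a URL is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Small driver: only the filters listed here are "activated".
class UrlFilterChainDemo {
    public static void main(String[] args) {
        UrlFilterChain chain = new UrlFilterChain(Arrays.asList(
                new SchemeFilter(),
                new SuffixFilter(".jpg", ".png", ".css")));

        String[] candidates = {
            "http://example.org/index.html",
            "http://example.org/logo.png",
            "ftp://example.org/archive.zip"
        };
        for (String url : candidates) {
            String kept = chain.apply(url);
            System.out.println(url + " -> " + (kept == null ? "rejected" : "accepted"));
        }
    }
}

// Hypothetical extension point: return the URL to keep it, or null to reject it.
interface UrlFilter {
    String filter(String url);
}

// Keeps only http and https URLs.
class SchemeFilter implements UrlFilter {
    @Override
    public String filter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return (lower.startsWith("http://") || lower.startsWith("https://")) ? url : null;
    }
}

// Rejects URLs ending in one of the configured (lowercase) suffixes.
class SuffixFilter implements UrlFilter {
    private final List<String> blockedSuffixes;

    SuffixFilter(String... blockedSuffixes) {
        this.blockedSuffixes = Arrays.asList(blockedSuffixes);
    }

    @Override
    public String filter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : blockedSuffixes) {
            if (lower.endsWith(suffix)) {
                return null;
            }
        }
        return url;
    }
}

// Runs the configured filters in order and stops at the first rejection.
class UrlFilterChain {
    private final List<UrlFilter> filters;

    UrlFilterChain(List<UrlFilter> filters) {
        this.filters = filters;
    }

    String apply(String url) {
        String current = url;
        for (UrlFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

Adding a new behaviour means writing one more small class and listing it in the chain, which is the appeal of a plugin-based design: the core loop never changes.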