Large scale crawling with Apache Nutch

Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012

About myself
➢ DigitalPebble Ltd, Bristol (UK)
➢ Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Data Mining
➢ Strong focus on Open Source & Apache ecosystem
➢ Apache Nutch VP
➢ Apache Tika committer
➢ User | Contributor
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Objectives

➢ Overview of the project

➢ Nutch in a nutshell

➢ Nutch 2.x

➢ Future developments

Nutch?
➢ “Distributed framework for large scale web crawling”
– but does not have to be large scale at all
– or even on the web (file-protocol)

➢ Apache TLP since May 2010

➢ Based on Apache Hadoop

➢ Indexing and Search

Short history
➢ 2002/2003 : started by Doug Cutting & Mike Cafarella
➢ 2004 : sub-project of Lucene @Apache
➢ 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache

➢ 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache

➢ May 2010 : top-level project (TLP) at Apache
➢ June 2012 : Nutch 1.5.1
➢ Oct 2012 : Nutch 2.1

Recent Releases

[Figure: release timeline from 06/09 to 06/12. trunk (1.x): 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1; 2.x branch: 2.0, 2.1]

Community

➢ 6 active committers / PMC members
– 4 within the last 18 months

➢ Constant stream of new contributions & bug reports

➢ Steady numbers of mailing list subscribers and traffic

➢ Nutch is a very healthy 10-year-old

Why use Nutch?

➢ Usual reasons
– Mature, business-friendly license, community, ...

➢ Scalability
– Tried and tested on very large scale
– Hadoop cluster : installation and skills

➢ Features
– e.g. Index with SOLR
– PageRank implementation
– Can be extended with plugins

Not the best option when ...

➢ Hadoop based == batch processing == high latency
– No guarantee that a page will be fetched / parsed / indexed within X minutes|hours

➢ JavaScript / Ajax not supported (yet)

Use cases

➢ Crawl for IR
– Generic or vertical
– Index and Search with SOLR
– Single node to large clusters on Cloud

➢ … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML

– Mahout / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

Customer cases
[Diagram: use cases placed along two axes, Specificity (Verticality) and Scale]

Use case : BetterJobs.com
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~1M pages total
– Feeds SOLR index

Use case : SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100 TB data)
– 200+ million lists of similarities
– No indexing / search involved

Typical Nutch Steps
➢ Same in 1.x and 2.x
➢ Sequence of batch operations
1) Inject → Populates CrawlDB from seed list
2) Generate → Selects URLs to fetch in segment
3) Fetch → Fetches URLs from segment
4) Parse → Parses content (text + metadata)
5) UpdateDB → Updates CrawlDB (new URLs, new status...)
6) InvertLinks → Builds Webgraph
7) SOLRIndex → Sends docs to SOLR
8) SOLRDedup → Removes duplicate docs based on signature
➢ Repeat steps 2 to 8
➢ Or use the all-in-one crawl script
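
The batch cycle above can be sketched as a toy, in-memory loop. This is not Nutch code: the `WEB` dictionary, the function names and the status strings are hypothetical stand-ins, and in Nutch each step runs as a Hadoop job over the CrawlDB and segments. Indexing and deduplication are left out.

```python
# Toy model of the inject -> generate -> fetch/parse -> updatedb cycle.
# WEB is a hypothetical stand-in for the network: URL -> outlinks.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def inject(crawldb, seeds):
    """Add seed URLs to the crawl DB as unfetched entries."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")

def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs: one 'segment'."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]

def fetch_and_parse(segment):
    """Return the outlinks of each URL, or None when the fetch fails."""
    return {url: WEB.get(url) for url in segment}

def update_db(crawldb, parsed):
    """Record fetch results and add newly discovered URLs."""
    for url, outlinks in parsed.items():
        if outlinks is None:
            crawldb[url] = "failed"
            continue
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")

def crawl(seeds, rounds, top_n=10):
    """Inject once, then repeat generate / fetch / update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        update_db(crawldb, fetch_and_parse(segment))
    return crawldb
```

Calling `crawl(["http://a.example/"], rounds=1)` fetches only the seed and leaves its outlinks unfetched; more rounds let the crawl reach the rest of the toy web.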

Main steps

[Diagram: Seed list, CrawlDB, Segment (crawl_generate/, crawl_fetch/, content/, crawl_parse/, parse_data/, parse_text/) and LinkDB]

Frontier expansion

➢ Manual “discovery”
– Adding new URLs by hand, “seeding”

➢ Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction

[Figure: seed and frontier at iterations i=1, i=2, i=3]

[Slide courtesy of A. Bialecki]
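
Frontier expansion can be illustrated with a small breadth-first walk. This is only a sketch under assumptions: `outlinks` is a hypothetical, already-parsed link graph, and no filtering or scoring is applied to decide which links are worth following.

```python
def expand_frontier(seeds, outlinks, iterations):
    """Return the list of newly discovered URLs for each iteration.

    outlinks maps a URL to the URLs it links to (a hypothetical link
    graph standing in for fetched and parsed pages).
    """
    known = set(seeds)
    frontier = list(seeds)
    discovered = []
    for _ in range(iterations):
        new = []
        for url in frontier:
            for link in outlinks.get(url, []):
                if link not in known:
                    known.add(link)
                    new.append(link)
        discovered.append(new)
        frontier = new  # the next iteration starts from the new URLs
    return discovered

graph = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": ["e"],
}
levels = expand_frontier(["seed"], graph, iterations=3)
```

Each entry of `levels` holds the URLs first seen at that iteration, which corresponds to the rings i=1, i=2, i=3 around the seed.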

An extensible framework
➢ Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints

➢ Endpoints
– Protocol
– Parser
– HtmlParseFilter (ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter

Features

➢ Fetcher
– Multi-threaded fetcher
– Follows robots.txt
– Groups URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

➢ Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom scoring plugins

➢ Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
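
Grouping URLs per host is what makes per-site politeness possible. A minimal sketch, assuming grouping by hostname only (not domain or IP); `group_by_host` and `max_per_host` are hypothetical names and do not correspond to Nutch configuration properties.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_host(urls, max_per_host):
    """Group URLs into per-host queues, keeping at most max_per_host each."""
    queues = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs without a host part
        if len(queues[host]) < max_per_host:
            queues[host].append(url)
    return dict(queues)
```

A fetcher could then serve each queue with its own delay between requests, so that one busy host does not slow down or get hammered by the whole crawl.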

Features (cont.)

➢ Protocols
– HTTP, file, FTP, HTTPS

➢ Scheduling
– Specified or adaptive

➢ URL filters
– Regex, FSA, TLD, prefix, suffix

➢ URL normalisers
– Default, regex
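
URL normalisation and regex filtering can be sketched as follows. The rule syntax is modelled on the '+' / '-' prefixed patterns of the regex URL filter, where the first matching rule decides; the rules, the `example.org` domain and the function names are assumptions for illustration, and the normaliser covers only a few simple cases.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lower-case scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def make_regex_filter(rules):
    """Build a filter from '+regex' (accept) / '-regex' (reject) rules.

    The first rule whose regex matches decides; with no match the URL
    is rejected. The filter returns the URL, or None when rejected.
    """
    compiled = [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]

    def url_filter(url):
        for accept, pattern in compiled:
            if pattern.search(url):
                return url if accept else None
        return None

    return url_filter

RULES = [
    r"-\.(gif|jpg|png|css|js)$",                  # skip images and static assets
    r"+^https?://([a-z0-9-]+\.)*example\.org/",   # stay within one domain
]
url_filter = make_regex_filter(RULES)
```

Normalising before filtering means the rules only need to handle one spelling of each URL.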

Features (cont.)

➢ Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
➢ Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

➢ Indexing to SOLR
– Bespoke schema

Data Structures in 1.x
➢ MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
➢ CrawlDB => status of known pages

CrawlDB
MapFile : <Text,CrawlDatum>
    byte status; [fetched? unfetched? failed? redir?]
    long fetchTime;
    byte retries;
    int fetchInterval;
    float score = 1.0f;
    byte[] signature = null;
    long modifiedTime;
    org.apache.hadoop.io.MapWritable metaData;

➢ Input of : generate, index
➢ Output of : inject, update
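
For readers who think in Python, the CrawlDB entry can be mirrored as a dataclass. This is only an illustration of the field list, not the Nutch class; the plain dictionary stands in for the MapFile, and the interval being expressed in seconds is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CrawlDatum:
    """Python mirror of the fields listed above (not the Nutch class)."""
    status: int = 0                  # byte: fetched, unfetched, failed, redirect...
    fetch_time: int = 0              # long
    retries: int = 0                 # byte
    fetch_interval: int = 0          # int, assumed here to be in seconds
    score: float = 1.0
    signature: Optional[bytes] = None
    modified_time: int = 0           # long
    meta_data: dict = field(default_factory=dict)  # arbitrary key/value metadata

# Plays the role of MapFile<Text, CrawlDatum>: URL -> entry.
crawldb = {}
crawldb["http://example.org/"] = CrawlDatum(fetch_interval=30 * 24 * 3600)
```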

Data Structures 1.x

➢ Segment => round of fetching
➢ Identified by a timestamp

Segment
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>

➢ Can have multiple versions of a page in different segments

Data Structures – 1.x

➢ LinkDB => storage for Web Graph

LinkDB
MapFile : <Text,Inlinks>
Inlinks : HashSet <Inlink>
Inlink :
    String fromUrl
    String anchor

➢ Output of : invertlinks
➢ Input of : SOLRIndex
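
Link inversion turns "page → outlinks" into "page → inlinks". A minimal in-memory sketch, assuming the outlinks are already available as (target URL, anchor text) pairs; the function name and input shape are hypothetical, and the real step is a MapReduce job over the segments.

```python
from collections import defaultdict

def invert_links(outlinks):
    """Turn {from_url: [(to_url, anchor), ...]} into
    {to_url: {(from_url, anchor), ...}}."""
    linkdb = defaultdict(set)
    for from_url, links in outlinks.items():
        for to_url, anchor in links:
            linkdb[to_url].add((from_url, anchor))
    return dict(linkdb)
```

Each value plays the role of an Inlinks set: the source URL and anchor text of every link pointing at the key, which is what the indexing step can use to add anchor text to a document.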

NUTCH 2.x

➢ 2.0 released in July 2012

➢ 2.1 in October 2012

➢ Same features as 1.x
– delegation to SOLR, TIKA, MapReduce etc...

➢ Moved to table-based architecture
– Wealth of NoSQL projects in last few years

➢ Abstraction over storage layer → Apache GORA

Apache GORA

➢ http://gora.apache.org/

➢ ORM for NoSQL databases
– and limited SQL support + file based storage
➢ 0.2.1 released in August 2012
➢ DataStore implementations
– Accumulo
– Cassandra
– HBase
– Avro
– DynamoDB (soon)
– SQL

➢ Serialization with Apache AVRO
➢ Object-to-datastore mappings (backend-specific)

AVRO Schema => Java code
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
    {"name": "baseUrl", "type": ["null", "string"] },
    {"name": "status", "type": "int"},
    {"name": "fetchTime", "type": "long"},
    {"name": "prevFetchTime", "type": "long"},
    {"name": "fetchInterval", "type": "int"},
    {"name": "retriesSinceFetch", "type": "int"},
    {"name": "modifiedTime", "type": "long"},
    {"name": "protocolStatus", "type": {
        "name": "ProtocolStatus",
        "type": "record",
        "namespace": "org.apache.nutch.storage",
        "fields": [
            {"name": "code", "type": "int"},
            {"name": "args", "type": {"type": "array", "items": "string"}},
            {"name": "lastModified", "type": "long"}
        ]
    }},
    […]

Mapping file (backend specific – HBase)
<gora-orm>
    <table name="webpage">
        <family name="p" maxVersions="1"/>
        <family name="f" maxVersions="1"/>
        <family name="s" maxVersions="1"/>
        <family name="il" maxVersions="1"/>
        <family name="ol" maxVersions="1"/>
        <family name="h" maxVersions="1"/>
        <family name="mtdt" maxVersions="1"/>
        <family name="mk" maxVersions="1"/>
    </table>
    <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        […]

DataStore operations

➢ Atomic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

➢ Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

➢ Wrappers for Apache Hadoop
– GoraInputFormat | GoraOutputFormat
– GoraRecordReader | GoraRecordWriter
– GoraMapper | GoraReducer
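
The shape of these operations can be shown with a toy in-memory store. This is not the GORA API: `InMemoryStore` is hypothetical, a query is reduced to an inclusive key range, and the reversed-host keys in the usage note are an assumption made for the example.

```python
class InMemoryStore:
    """Toy key/object store exposing the operations listed above."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return the stored object, or None if the key is absent."""
        return self._rows.get(key)

    def put(self, key, obj):
        self._rows[key] = obj

    def delete(self, key):
        """Delete one row; return True if the key existed."""
        if key in self._rows:
            del self._rows[key]
            return True
        return False

    def execute(self, start_key=None, end_key=None):
        """Yield (key, obj) pairs in key order within [start_key, end_key]."""
        for key in sorted(self._rows):
            if start_key is not None and key < start_key:
                continue
            if end_key is not None and key > end_key:
                break
            yield key, self._rows[key]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete all rows in the key range and return how many were removed."""
        keys = [k for k, _ in self.execute(start_key, end_key)]
        for key in keys:
            del self._rows[key]
        return len(keys)
```

With keys such as `"com.example:http/a"`, a range query from `"com."` to `"com.\uffff"` returns every stored page under `.com` without touching the other rows.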

GORA in Nutch

➢ AVRO schema provided and Java code pre-generated

➢ Mapping files provided for backends
– can be modified if necessary

➢ Need to rebuild to get dependencies for backend
– No binary distribution of Nutch 2.x

➢ http://wiki.apache.org/nutch/Nutch2Tutorial

Benefits

➢ Storage still distributed and replicated
➢ but one big table
– status, metadata, content, text → one place
➢ Simplified logic in Nutch
– Simpler code for updating / merging information
➢ More efficient (?)
– No need to read / write entire structure to update records
– No comparison available yet + early days for GORA
➢ Easier interaction with other resources
– Third-party code just needs to use GORA and schema

Drawbacks

➢ More stuff to install and configure :-)

➢ Not as stable as Nutch 1.x

➢ Dependent on success of Gora

2.x Work in progress

➢ Stabilise backend implementations
– GORA-HBase most reliable

➢ Synchronize features with 1.x
– e.g. has ElasticSearch but missing LinkRank equivalent

➢ Filter enabled scans (GORA-119)
– No need to de-serialize the whole dataset

Future

➢ Both 1.x and 2.x in parallel
– but more frequent releases for 2.x

➢ New functionalities
– Support for SOLRCloud
– Sitemap (from Crawler Commons library)
– Canonical tag
– More indexers (e.g. ElasticSearch) + pluggable indexers?

More delegation
➢ Great deal done in recent years (SOLR, Tika)

➢ Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– Robots.txt parsing
– URL normalisation / filtering

➢ PageRank-like computations to graph library
– e.g. Apache Giraph
– Should be more efficient as well

Where to find out more?

➢ Project page : http://nutch.apache.org/
➢ Wiki : http://wiki.apache.org/nutch/
➢ Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

➢ Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

➢ Support / consulting :
– http://wiki.apache.org/nutch/Support

Questions?
