Lessons Learned in the
Development of a Web-scale
Search Engine: Nutch2 and beyond
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
Roadmap
• What is Nutch?
• What are the current versions of Nutch?
• What can it do?
• What did we do right?
• What did we do wrong?
• Where is Nutch going?
And you are?
• Apache Member involved in
– Tika (VP,PMC), Nutch (PMC), Incubator (PMC),
OODT (Mentor), SIS (Mentor), Lucy (Mentor) and
Gora (Champion)
• Architect/Developer at
NASA JPL in
Pasadena, CA
• Software
Architecture/Engineering
Prof at USC
Nutch is…
• A project originally started by Doug
Cutting
• Nutch builds upon the lower level text
indexing library and API called Lucene
• Nutch provides crawling services,
protocol services, parsing services,
content management services on top of
the indexing capability provided by
Lucene
• Allows you to stand up a web-scale infra.
Community
• Mailing lists
– User: 972 peeps
– Dev: 520 peeps
• Committers/PMC
– 8 peeps
– All 8 active: SERIOUSLY
• Releases
– 11 releases so far
– Working on 2.0
Credit: svnsearch.org
What Currently Exists?
• Version 0.6.x
– First easily deployable version
• Version 0.7.x
– Added several new features including several new parsers (MS-WORD,
PowerPoint), URLFilter extension point, first Apache release after Incubation,
mime type system
• Version 0.8.x
– Completely new underlying architecture based on Hadoop
– Parse plugins framework, multi-valued metadata container
– Parser Factory enhancement
• Version 0.9.x
– Major bug fixes
– Hadoop and Lucene library upgrades
• Version 1.0
– Flexible filter framework
– Flexible scoring
– Initial integration with Tika
– Full Search Engine functionality and capabilities, in production at large scale
(Internet Archive)
What are the recent
versions?
• Version 1.1, upgrade all Nutch
library deps (Hadoop, Tika, etc.) and
make Fetcher faster
• Version 1.2, fix some big time
bugs (NPE in distributed search),
lots of feature upgrades
– You should be using this version
Some active dev areas
• Plenty!
• Bug fixes (> 200 issues in JIRA right
now with no resolution)
• Nutch 2.0 architecture
– http://search-lucene.com/m/gbrBF1RMWk9
– Refactored Nutch architecture,
delegating to Solr, HBase, Tika, and
ORM
Why Nutch?
• Observation: Web Search is a
commodity
– Why can’t it be provided freely?
• Allows tweaking of typically “hidden” ranking
algorithms
• Allows developers to focus less on the
infrastructure (since Brin & Page’s paper, the
infrastructure is well-known), and more on
providing value-added capabilities
Why Nutch?
• Value-added capabilities
– Improving fetching speed
– Parsing and handling of the hundreds of
different content types available on the internet
– Handling different protocols for obtaining
content
– Better ranking algorithms (OPIC, PageRank)
• More or less, in Nutch, these capabilities all
map to extension points available via Nutch’s
plugin framework
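To make the ranking bullet concrete, here is a minimal sketch of PageRank computed by power iteration over a tiny hand-made link graph. The graph, the damping factor of 0.85, and the function name `pagerank` are illustrative assumptions; this is not Nutch's scoring code.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of page -> score for a graph given as page -> list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "home".
toy_graph = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(toy_graph)
best = max(scores, key=scores.get)
```

The scores sum to one, and the page with the most incoming links ends up ranked highest. Swapping in a different scoring function is the kind of tweak the scoring extension point is meant to allow.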
Nutch’s Architecture
• Nutch Core facilities
– Parsing
– Indexing
– Crawling
– Content Acquisition
– Querying
– Plugin Framework
• Nutch’s extension points
– Scoring, Parsing, Indexing, Querying,
URLFiltering
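The idea of extension points can be sketched as a small registry that maps an extension-point name to the plugins registered for it. This is only an illustration in Python; the class `PluginRegistry`, its methods, and the two filter rules are hypothetical and do not mirror Nutch's actual Java plugin API.

```python
from collections import defaultdict


class PluginRegistry:
    """Maps an extension-point name to the plugins registered for it."""

    def __init__(self):
        self._plugins = defaultdict(list)

    def register(self, extension_point):
        """Decorator that attaches a callable to the named extension point."""
        def decorator(func):
            self._plugins[extension_point].append(func)
            return func
        return decorator

    def run_filters(self, extension_point, value):
        """Pass value through every plugin in order; a plugin returning None rejects it."""
        for plugin in self._plugins[extension_point]:
            value = plugin(value)
            if value is None:
                return None
        return value


registry = PluginRegistry()


@registry.register("URLFiltering")
def skip_images(url):
    # Hypothetical rule: do not crawl common image files.
    return None if url.lower().endswith((".png", ".jpg", ".gif")) else url


@registry.register("URLFiltering")
def strip_fragment(url):
    # Hypothetical rule: drop the "#fragment" part so duplicates collapse.
    return url.split("#", 1)[0]


kept = registry.run_filters("URLFiltering", "http://example.org/docs#intro")
dropped = registry.run_filters("URLFiltering", "http://example.org/logo.PNG")
```

The core only knows the extension-point name; which filters run is decided by whatever plugins happen to be registered.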
Nutch’s Architecture
Maps to the search engine architecture proposed by Brin & Page
Real world application of
Nutch
• I work at NASA’s Jet Propulsion
Laboratory
• NASA’s Planetary Data System
– NASA’s archive for all planetary science
data collected by missions over the past
30 years
– Collected 20 TB over the past 30 years
• Increasing to over 200 TB in the next 3
years!
– Built up a catalog of all data collected
• Where does Nutch fit in?
Where does Nutch fit into
the PDS?
• PDS Management Council decided
they want “Google-like” search of the
PDS catalog
• Our plan: use Nutch to implement
capability for PDS
PDS Google-like Search
Architecture
[Diagram: search engine architecture (e.g. Nutch, Google) applied to the PDS. Labeled components: PDS Catalog, PDS-D, Existing PDS Query, PDS Extract, Crawler, PDS Parser, Indexer, Lucene Index, pds.war on a Tomcat web server, Catalog Metadata]
Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S.
Hardman, C. Mattmann
Approach
• Export PDS catalog datasets in RDF format (flat
files)
• Use nutch to crawl RDF files
– protocol-file plugin in Nutch
• Wrote our own parse-pds plugin
– Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin
– Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin
– Search the index on the fields that we want
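The parse, index, and query steps above can be sketched end to end in a few lines. This is a rough Python illustration, not the parse-pds, index-pds, or query-pds plugins themselves (those are Java Nutch plugins); the `pds:` namespace, the field names, and the sample records are all made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDS_NS = "http://example.org/pds#"  # hypothetical namespace for catalog fields

SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:pds="{PDS_NS}">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <pds:title>Mars surface images</pds:title>
    <pds:target>Mars</pds:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <pds:title>Saturn ring occultations</pds:title>
    <pds:target>Saturn</pds:target>
  </rdf:Description>
</rdf:RDF>"""


def parse_datasets(rdf_text):
    """Parse step: turn each rdf:Description into a dict of metadata fields."""
    root = ET.fromstring(rdf_text)
    datasets = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        record = {"id": desc.get(f"{{{RDF_NS}}}about")}
        for child in desc:
            field = child.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
            record[field] = (child.text or "").strip()
        datasets.append(record)
    return datasets


def build_index(datasets, fields):
    """Index step: inverted index of (field, lowercase term) -> set of dataset ids."""
    index = defaultdict(set)
    for record in datasets:
        for field in fields:
            for term in record.get(field, "").lower().split():
                index[(field, term)].add(record["id"])
    return index


def search(index, field, term):
    """Query step: look up one term in one field."""
    return sorted(index.get((field, term.lower()), set()))


datasets = parse_datasets(SAMPLE_RDF)
index = build_index(datasets, fields=["title", "target"])
hits = search(index, "target", "Mars")
```

Only the fields named in `fields` become searchable, which is the same choice the index and query plugins make for the real catalog.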
Search Interface
Results
Some Nutch History
• In the next few slides, we’ll go
through some of Nutch’s history,
including my involvement, the history
of Nutch dev, and how we came to
today
How I got involved
• In CS572: Seminar on Search Engines at USC
– Okay well it used to be called CS599, but you get the picture
• Started out by contributing RSS parsing plugin
– My final project in 599
• Moved on from there to
– NUTCH-88, redesign of the parsing framework
– NUTCH-139, Metadata container support
– NUTCH-210, Web Context application file
– And various other bug fixes, and contributions here and there
– Mailing list support
– Wiki support
• Became committer in October 2006
• Helped spin Nutch into Apache TLP, March 2010,
Nutch PMC member
The Big Yellow Elephant
• Before this guy was born
• Lots of folks interested in Nutch
Hadoop is born
(January 2008)
Credit: svnsearch.org
Post Hadoop Life
• Nutch project kind of withered
– Well more than “kind of” it did wither
– Went years between releases
• 0.8 to 1.0 took a while
• Dev Community went into
maintenance mode
– Many committers simply went inactive
• User Community deteriorated
Some Observations
• It was pretty difficult to attract new
committers
– Took too long to VOTE
them in
– They were only interested
in Hadoop type stuff
– Not many organizations were doing web-
scale search
• Existing active committers dwindled
• I was one of them!
Some Observations
• There wasn’t a plan for what to do
next
– What features to work on?
– What bugs to fix?
– Many considered Nutch to be
“production” worthy in its current form
and not a huge number of internet-scale
users so people just “put up” with its
existing issues, e.g., difficult to configure
Hadoop wasn’t the only
spinoff
• A lot of us interested in content
detection and analysis, another major
Nutch strength, went off to work on
that in some other Apache project
that I can’t remember the name of
How can Nutch reorganize?
• Strong feeling from Nutch community
that we should take whoever is left
and think about what the “next
generation” Nutch (Nutch2) would
look like
• (Several cycles of) Mailing threads
started by Andrzej Bialecki, Dennis
Kubes, Otis Gospodnetic
Initial Nutch2 fizzles
• Ended up being a lot of talk, but there
wasn’t enough interest to pick up a
shovel and help dig the hole
• But…there were interesting
things going on
– Example: Nutchbase work
from Dogacan, and Enis
What was “Nutchbase”?
• Take HBase, the Apache implementation of
Google’s “BigTable”
– Column-oriented storage, high scalability in columns
and rows
• Store Nutch Web page content
Lots of interest in Nutchbase
• But, sadly maintained as a patch for a year
or more
– NUTCH-650 Hbase integration
• Brought about some interesting thoughts
– If storage can be abstracted, what about?
• Messaging layer (JMS Nutch?)
• Parsing?
• Indexing (Solr, Lucene, you-name-it)
Post Nutch 1.0
• Nutch 1.0 release was a true “1.”-oh!
– Included production features
– Those using it were happy, b/c they had bought
into the model
– Usable, tunable
• But, how do we get
to Nutch 2.0?
A few things happen in parallel
• 1.1 Release?
– I had some free
time and was
willing to RM a
Nutch 1.1 release
to get things going
• Dogacan, Enis,
Julien and Andrzej
got interested in
moving Nutchbase
forward
– But took it to the
next level…we’ll get
back to this
• We elected a new
committer
• Julien Nioche
• Patches that had sat for years now
got committed
Oh, and Nutch became TLP
• Grabbed folks that were active in Nutch
community
• Decided to move forward with
Nutch/HBase as the de-facto platform
– No need to maintain home-grown storage
formats
– And, take it to the next level, to ORM-ness
• Decided to make Nutch a “delegator”
rather than a workhorse
– In other words…
Nutch2: “Delegator”
• Indexing/Querying?
– Solr has a lot of interest and
does tons of work in this area:
let’s use it instead of vanilla Lucene
• Parsing?
– Tika: ditto
• Storage
– Let’s use the ORM layer that some of the
Nutch committers were working on
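The "delegator" idea can be sketched as a crawler that owns only the crawl flow and hands parsing, indexing, and storage to whatever components it is given. All class and method names below are hypothetical stand-ins written in Python; they are not the Nutch, Solr, Tika, or Gora APIs.

```python
from typing import Protocol


class Parser(Protocol):
    def parse(self, raw: bytes) -> str: ...


class Indexer(Protocol):
    def index(self, url: str, text: str) -> None: ...


class Store(Protocol):
    def put(self, url: str, raw: bytes) -> None: ...


class Utf8Parser:
    """Stand-in for the parsing service."""

    def parse(self, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")


class ListIndexer:
    """Stand-in for the search service; it just remembers what it was sent."""

    def __init__(self) -> None:
        self.docs = []

    def index(self, url: str, text: str) -> None:
        self.docs.append((url, text))


class DictStore:
    """Stand-in for the storage layer."""

    def __init__(self) -> None:
        self.pages = {}

    def put(self, url: str, raw: bytes) -> None:
        self.pages[url] = raw


class Crawler:
    """Owns only the crawl flow; parsing, indexing and storage are delegated."""

    def __init__(self, parser: Parser, indexer: Indexer, store: Store) -> None:
        self.parser, self.indexer, self.store = parser, indexer, store

    def process(self, url: str, raw: bytes) -> None:
        self.store.put(url, raw)
        self.indexer.index(url, self.parser.parse(raw))


indexer, store = ListIndexer(), DictStore()
crawler = Crawler(Utf8Parser(), indexer, store)
crawler.process("http://example.org/", b"hello nutch")
```

Because the crawler depends only on the three small interfaces, any one of the stand-ins could be replaced without touching the crawl logic.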
Enter Gora:
“that ORM technology”
• Initially baked up at GitHub
• Decided to move
to the Incubator in Sept 2010
– I was contacted and asked to
champion the effort
• What is Gora?
– Uses Apache Avro to specify objects and
their schema
– ORM middleware takes Avro specs,
generates Java code – plugs for HBase,
Cassandra, in-memory SQL store, etc.
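The schema-first approach can be sketched like this: describe the object once in an Avro-style record schema, generate a class from it, and put storage behind a backend-neutral interface. This is a Python illustration under stated assumptions, not Gora's API; the schema fields, `class_from_schema`, `DataStore`, and `MemoryStore` are all hypothetical names.

```python
from abc import ABC, abstractmethod
from dataclasses import make_dataclass, asdict

# Avro-style record schema; the field names are hypothetical, not Nutch's real schema.
WEBPAGE_SCHEMA = {
    "type": "record",
    "name": "WebPage",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "content", "type": "bytes"},
        {"name": "score", "type": "float"},
    ],
}

AVRO_TO_PYTHON = {"string": str, "bytes": bytes, "float": float}


def class_from_schema(schema):
    """Generate a record class from the schema (stands in for code generation)."""
    fields = [(f["name"], AVRO_TO_PYTHON[f["type"]]) for f in schema["fields"]]
    return make_dataclass(schema["name"], fields)


class DataStore(ABC):
    """Backend-neutral interface; a real backend might wrap HBase or Cassandra."""

    @abstractmethod
    def put(self, key, record): ...

    @abstractmethod
    def get(self, key): ...


class MemoryStore(DataStore):
    """In-memory backend that keeps each record as a plain dict of columns."""

    def __init__(self, record_class):
        self._record_class = record_class
        self._rows = {}

    def put(self, key, record):
        self._rows[key] = asdict(record)

    def get(self, key):
        row = self._rows.get(key)
        return None if row is None else self._record_class(**row)


WebPage = class_from_schema(WEBPAGE_SCHEMA)
store = MemoryStore(WebPage)
store.put("http://example.org/", WebPage(url="http://example.org/", content=b"<html/>", score=1.0))
page = store.get("http://example.org/")
```

Code that reads and writes `WebPage` objects never sees which backend is underneath, so changing the store means writing one new `DataStore` subclass.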
Nutch and Gora
• Throw out all code in Nutch that had to do
with the Writable interface
– Generated now by “Web Page” schema in
Gora
– Web Page is canonical Nutch object for
storage
• Parse text, parse data, etc.
• No more web-db, crawl-db, etc.
Out with the old…
• Throw out Nutch
webapp
– Solr provides
REST-ful services
to get at
metadata/index
– We’ll add the REST
(pun) for
storage/etc.
• Throw out Lucene code
• Slowly trash existing Nutch parsers
In with the new
• Get rid of webapp
– Nutch 2.x has seen contributions of REST
web services for full crawl cycle, storage I/F
• Delegate indexing to Solr
– Nutch 1.x first appearance of SolrIndexer and
Nutch Solr schema
• Delegate parsing to Tika
– Nutch 1.1 first appearance of parse-tika
– Have been decommissioning existing parsers
• Suggested improvements to Tika during this
process
Nutch2 Architecture
Learning from our mistakes
• Maintenance
– Checking in jars made the Nutch checkout
huge (even of just the “source”)
• Now using Ivy to manage dependencies
– Patches sitting?
• Not on my watch! Encouragement to find and commit
patches that have been sitting for a while, or simply
disposition them
– People want to use Nutch code as “dep”
• Build now includes ability for RM to push to Maven
Central
NOTE: CHRIS’S OPINION SLIDE
Learning from our mistakes
• Community
– Folks contributing patches?
• Make 'em a committer
– Folks providing good testing results?
• Make 'em a committer
– Folks making good documentation?
• Make 'em a committer
– It’s the sign of a healthy Apache project if new
committers (and members) are being elected
NOTE: CHRIS’S OPINION SLIDE
Learning from our mistakes
• Configuration of Nutch is hard
– It still is
– Getting easier though
– Anyone have any great ideas or patches to
integrate with a DI framework?
– Things like Gora, Solr, etc., are making this
easier
• Providing flexible service interfaces beyond
Java APIs
– Existing work on NUTCH-932, NUTCH-931 and
NUTCH-880 is just the beginning
Interesting work going on
• I taught a class on Search Engines this
past summer
• Some neat projects that I’m working with
my students to contribute back to Apache
– Implementation of Authority/Hub scoring
– Deduplication improvements
– Clustering plugin improvements
– Work to improve Nutch-Solr-Drupal integration
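For readers unfamiliar with Authority/Hub scoring, here is a minimal sketch of the iterative hub and authority computation (often called HITS) on a toy link graph. It is a generic Python illustration, not the students' Nutch implementation; the graph and the function name `hits` are assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a graph given as page -> outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two directory pages pointing at the same resources.
toy_links = {
    "dir1": ["paper", "tool"],
    "dir2": ["paper"],
    "paper": [],
    "tool": [],
}
hub, auth = hits(toy_links)
```

Pages that are linked to by good hubs score as authorities, and pages that link to good authorities score as hubs; here "paper" is the top authority and "dir1" the top hub.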
Wrapup
• Nutch has seen tremendous highs and
lows over the years
– We’re still kicking
• The newest version of Nutch (2.0) will have
a vastly slimmed down footprint, and will
use existing successful frameworks for
heavy lifting
– Solr, Tika, Gora, Hadoop
• If you’re interested in our dev, check us out
at http://nutch.apache.org
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– mattmann@apache.org
– @chrismattmann on Twitter
Acknowledgements
• Nutch team
• Some material inspired by Andrzej
Bialecki’s talks here
• OODT team at JPL

Mais conteúdo relacionado

Mais procurados

Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at BristechJulien Nioche
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchMark Miller
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopHyunsik Choi
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA)  - SaharaOpenStack Trove Day (19 Aug 2014, Cambridge MA)  - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Saharaspinningmatt
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventGruter
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012Amazon Web Services
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 

Mais procurados (20)

Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Solr + Hadoop = Big Data Search
Solr + Hadoop = Big Data SearchSolr + Hadoop = Big Data Search
Solr + Hadoop = Big Data Search
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Big data and hadoop anupama
Big data and hadoop anupamaBig data and hadoop anupama
Big data and hadoop anupama
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA)  - SaharaOpenStack Trove Day (19 Aug 2014, Cambridge MA)  - Sahara
OpenStack Trove Day (19 Aug 2014, Cambridge MA) - Sahara
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 

Destaque

Chppd july19 cqi_vs071811c
Chppd july19 cqi_vs071811cChppd july19 cqi_vs071811c
Chppd july19 cqi_vs071811cPriti Irani
 
080808 聚成岗位价值评估培训(简版)
080808 聚成岗位价值评估培训(简版)080808 聚成岗位价值评估培训(简版)
080808 聚成岗位价值评估培训(简版)20004
 
12 HU 133 Work and Retirement
12 HU 133   Work and Retirement12 HU 133   Work and Retirement
12 HU 133 Work and RetirementDon Thompson
 
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTChris Mattmann
 
O uckan ag-kapital-post-kapital - 060510
O uckan   ag-kapital-post-kapital - 060510O uckan   ag-kapital-post-kapital - 060510
O uckan ag-kapital-post-kapital - 060510Ozgur Uckan
 
Mesaje de an nou.
Mesaje de an nou.Mesaje de an nou.
Mesaje de an nou.Nicky Nic
 
1.bucurie de sfarsit de an
1.bucurie de sfarsit de an1.bucurie de sfarsit de an
1.bucurie de sfarsit de anNicky Nic
 
Moodle course design
Moodle course designMoodle course design
Moodle course designJohn Roughley
 
By makenzie
By makenzieBy makenzie
By makenzievermigle
 
Lions project
Lions projectLions project
Lions projectvermigle
 
Washington DC Area Chapter of IAFIE Spring 2016 Newsletter
Washington DC Area Chapter of IAFIE Spring 2016 NewsletterWashington DC Area Chapter of IAFIE Spring 2016 Newsletter
Washington DC Area Chapter of IAFIE Spring 2016 NewsletterDavid Jimenez
 
《把信送给加西亚》
《把信送给加西亚》《把信送给加西亚》
《把信送给加西亚》20004
 
Accounting Perspective - JRM
Accounting Perspective - JRMAccounting Perspective - JRM
Accounting Perspective - JRMJay R Modi
 
Swot分析與生涯規劃
Swot分析與生涯規劃Swot分析與生涯規劃
Swot分析與生涯規劃20004
 
绩效考核及团队沟通
绩效考核及团队沟通绩效考核及团队沟通
绩效考核及团队沟通20004
 
Scrabbleομαδες
ScrabbleομαδεςScrabbleομαδες
Scrabbleομαδεςgymnasio
 
Parnitha 2012
Parnitha 2012Parnitha 2012
Parnitha 2012gymnasio
 
读懂人生的激励格言
读懂人生的激励格言读懂人生的激励格言
读懂人生的激励格言20004
 

Destaque (20)

Iafie europe 2017
Iafie europe 2017Iafie europe 2017
Iafie europe 2017
 
Chppd july19 cqi_vs071811c
Chppd july19 cqi_vs071811cChppd july19 cqi_vs071811c
Chppd july19 cqi_vs071811c
 
080808 聚成岗位价值评估培训(简版)
080808 聚成岗位价值评估培训(简版)080808 聚成岗位价值评估培训(简版)
080808 聚成岗位价值评估培训(简版)
 
12 HU 133 Work and Retirement
12 HU 133   Work and Retirement12 HU 133   Work and Retirement
12 HU 133 Work and Retirement
 
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
 
Connor big data
Connor big dataConnor big data
Connor big data
 
O uckan ag-kapital-post-kapital - 060510
O uckan   ag-kapital-post-kapital - 060510O uckan   ag-kapital-post-kapital - 060510
O uckan ag-kapital-post-kapital - 060510
 
Mesaje de an nou.
Mesaje de an nou.Mesaje de an nou.
Mesaje de an nou.
 
1.bucurie de sfarsit de an
1.bucurie de sfarsit de an1.bucurie de sfarsit de an
1.bucurie de sfarsit de an
 
Moodle course design
Moodle course designMoodle course design
Moodle course design
 
By makenzie
By makenzieBy makenzie
By makenzie
 
Lions project
Lions projectLions project
Lions project
 
Washington DC Area Chapter of IAFIE Spring 2016 Newsletter
Washington DC Area Chapter of IAFIE Spring 2016 NewsletterWashington DC Area Chapter of IAFIE Spring 2016 Newsletter
Washington DC Area Chapter of IAFIE Spring 2016 Newsletter
 
《把信送给加西亚》
《把信送给加西亚》《把信送给加西亚》
《把信送给加西亚》
 
Accounting Perspective - JRM
Accounting Perspective - JRMAccounting Perspective - JRM
Accounting Perspective - JRM
 
Swot分析與生涯規劃
Swot分析與生涯規劃Swot分析與生涯規劃
Swot分析與生涯規劃
 
绩效考核及团队沟通
绩效考核及团队沟通绩效考核及团队沟通
绩效考核及团队沟通
 
Scrabbleομαδες
ScrabbleομαδεςScrabbleομαδες
Scrabbleομαδες
 
Parnitha 2012
Parnitha 2012Parnitha 2012
Parnitha 2012
 
读懂人生的激励格言
读懂人生的激励格言读懂人生的激励格言
读懂人生的激励格言
 

Semelhante a Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond

OpenStack Documentation in the Open
OpenStack Documentation in the OpenOpenStack Documentation in the Open
OpenStack Documentation in the OpenAnne Gentle
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
OpenStack Doc Overview for Boot Camp
OpenStack Doc Overview for Boot CampOpenStack Doc Overview for Boot Camp
OpenStack Doc Overview for Boot CampAnne Gentle
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologiesgagravarr
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundNidhiAhuja30
 
Sonian, Open Source and Sensu
Sonian, Open Source and SensuSonian, Open Source and Sensu
Sonian, Open Source and SensuPete Cheslock
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemHellmar Becker
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!gagravarr
 

Semelhante a Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond (20)

OpenStack Documentation in the Open
OpenStack Documentation in the OpenOpenStack Documentation in the Open
OpenStack Documentation in the Open
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
OpenStack Doc Overview for Boot Camp
OpenStack Doc Overview for Boot CampOpenStack Doc Overview for Boot Camp
OpenStack Doc Overview for Boot Camp
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Be faster then rabbits
Be faster then rabbitsBe faster then rabbits
Be faster then rabbits
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
Hadoop Eco system
Hadoop Eco systemHadoop Eco system
Hadoop Eco system
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
Sonian, Open Source and Sensu
Sonian, Open Source and SensuSonian, Open Source and Sensu
Sonian, Open Source and Sensu
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking Ecosystem
 
MahoutNew
MahoutNewMahoutNew
MahoutNew
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!If You Have The Content, Then Apache Has The Technology!
If You Have The Content, Then Apache Has The Technology!
 

Mais de Chris Mattmann

Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayScalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayChris Mattmann
 
Teaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayTeaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayChris Mattmann
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Chris Mattmann
 
Supercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemSupercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemChris Mattmann
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemChris Mattmann
 
Understanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareUnderstanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareChris Mattmann
 
An Open Source Strategy for NASA
An Open Source Strategy for NASAAn Open Source Strategy for NASA
An Open Source Strategy for NASAChris Mattmann
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Chris Mattmann
 
Scientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache TikaScientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache TikaChris Mattmann
 

Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond

  • 1. Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond Chris A. Mattmann Senior Computer Scientist, NASA Jet Propulsion Laboratory Adjunct Assistant Professor, Univ. of Southern California Member, Apache Software Foundation
  • 2. Roadmap • What is Nutch? • What are the current versions of Nutch? • What can it do? • What did we do right? • What did we do wrong? • Where is Nutch going?
  • 3. And you are? • Apache Member involved in – Tika (VP,PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion) • Architect/Developer at NASA JPL in Pasadena, CA • Software Architecture/Engineering Prof at USC
  • 4. Nutch is… • A project originally started by Doug Cutting • Nutch builds upon the lower level text indexing library and API called Lucene • Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene • Allows you to stand up a web-scale infra.
  • 5. Community • Mailing lists – User: 972 peeps – Dev: 520 peeps • Committers/PMC – 8 peeps – All 8 active: SERIOUSLY • Releases – 11 releases so far – Working on 2.0 Credit: svnsearch.org
  • 6. What Currently Exists? • Version 0.6.x – First easily deployable version • Version 0.7.x – Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system • Version 0.8.x – Completely new underlying architecture based on Hadoop – Parse plugins framework, multi-valued metadata container – Parser Factory enhancement • Version 0.9.x – Major bug fixes – Hadoop, and Lucene library upgrades • Version 1.0 – Flexible filter framework – Flexible scoring – Initial integration with Tika – Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
  • 7. What are the recent versions? • Version 1.1, upgrade all Nutch library deps (Hadoop, Tika, etc.) and make Fetcher faster • Version 1.2, fix some big time bugs (NPE in distributed search), lots of feature upgrades – You should be using this version
  • 8. Some active dev areas • Plenty! • Bug fixes (> 200 issues in JIRA right now with no resolution) • Nutch 2.0 architecture – http://search-lucene.com/m/gbrBF1RMWk9 – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM
  • 9. Why Nutch? • Observation: Web Search is a commodity – Why can’t it be provided freely? • Allows tweaking of typically “hidden” ranking algorithms • Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities
  • 10. Why Nutch? • Value-added capabilities – Improving fetching speed – Parsing and handling of the hundreds of different content types available on the internet – Handling different protocols for obtaining content – Better ranking algorithms (OPIC, PageRank) • More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
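To make the "better ranking algorithms" bullet concrete, here is a minimal sketch of PageRank by power iteration on a toy link graph. It is not Nutch's scoring plugin code; the function name `pagerank`, the damping value, and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank to every outlink.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page "c" is linked to by "a", "b" and "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
best_page = max(scores, key=scores.get)
```

In a plugin-based crawler, a function like this would sit behind a scoring extension point so that it could be swapped for another algorithm without touching the rest of the pipeline.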
  • 11. Nutch’s Architecture • Nutch Core facilities – Parsing – Indexing – Crawling – Content Acquisition – Querying – Plugin Framework • Nutch’s extension points – Scoring, Parsing, Indexing, Querying, URLFiltering
  • 12. Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page
  • 13. Real world application of Nutch • I work at NASA’s Jet Propulsion Laboratory • NASA’s Planetary Data System – NASA’s archive for all planetary science data collected by missions over the past 30 years – Collected 20 TB over the past 30 years • Increasing to over 200 TB in the next 3 years! – Built up a catalog of all data collected • Where does Nutch fit in?
  • 14. Where does Nutch fit into the PDS? • PDS Management Council decide they want “Google-like” search of the PDS catalog • Our plan: use Nutch to implement capability for PDS
  • 15. PDS Google-like Search Architecture [Diagram: a search engine architecture (e.g. Nutch, Google) applied to the PDS catalog. Labeled components: PDS Catalog, PDS-D, existing PDS query, crawler, PDS extract parser, PDS parser, indexer, Lucene index, catalog metadata, and pds.war on a Tomcat web server] Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S. Hardman, C. Mattmann
  • 16. Approach • Export PDS catalog datasets in RDF format (flat files) • Use nutch to crawl RDF files – protocol-file plugin in Nutch • Wrote our own parse-pds plugin – Parse the RDF files, and then extract the metadata • Wrote our own index-pds plugin – Index the fields that we want from the parsed metadata • Wrote our own query-pds plugin – Search the index on the fields that we want
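The approach on slide 16 (crawl RDF flat files, parse out metadata, index selected fields, query those fields) can be sketched end to end in a few lines. This is only an illustration of the data flow, not the parse-pds, index-pds, or query-pds plugins themselves; the `pds` namespace, the field names, and the two sample records are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDS_NS = "http://example.org/pds#"  # hypothetical namespace, not the real PDS one

# Hypothetical stand-in for one exported RDF flat file.
SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:pds="{PDS_NS}">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <pds:title>Mars surface images</pds:title>
    <pds:target>Mars</pds:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <pds:title>Saturn ring occultation data</pds:title>
    <pds:target>Saturn</pds:target>
  </rdf:Description>
</rdf:RDF>"""


def parse_records(rdf_text):
    """Parse step: turn each rdf:Description into a {field: value} dict."""
    root = ET.fromstring(rdf_text)
    records = {}
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        doc_id = desc.get(f"{{{RDF_NS}}}about")
        records[doc_id] = {
            child.tag.split("}", 1)[1]: (child.text or "").strip() for child in desc
        }
    return records


def build_index(records, fields):
    """Index step: inverted index over only the fields we choose."""
    index = defaultdict(set)
    for doc_id, metadata in records.items():
        for field in fields:
            for token in metadata.get(field, "").lower().split():
                index[(field, token)].add(doc_id)
    return index


def search(index, field, term):
    """Query step: look up one term in one indexed field."""
    return sorted(index.get((field, term.lower()), set()))


records = parse_records(SAMPLE_RDF)
index = build_index(records, ["title", "target"])
mars_hits = search(index, "target", "Mars")
```

Each function corresponds to one plugin in the slide, which is the point of the design: the parsing, indexing, and querying stages can be replaced independently.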
  • 19. Some Nutch History • In the next few slides, we’ll go through some of Nutch’s history, including my involvement, the history of Nutch dev, and how we came to today
  • 20. How I got involved • In CS72: Seminar on Search Engines at USC – Okay well it used to be called CS599, but you get the picture • Started out by contributing RSS parsing plugin – My final project in 599 • Moved on from there to – NUTCH-88, redesign of the parsing framework – NUTCH-139, Metadata container support – NUTCH-210, Web Context application file – And various other bug fixes, and contributions here and there – Mailing list support – Wiki support • Became committer in October 2006 • Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member
  • 21. The Big Yellow Elephant • Before this guy was born • Lots of folks interested in Nutch Hadoop is born (January 2008) Credit: svnsearch.org
  • 22. Post Hadoop Life • Nutch project kind of withered – Well more than “kind of” it did wither – Went years in-between a release • 0.8 to 1.0 took a while • Dev Community went into maintenance mode – Many committers simply went inactive • User Community deteriorated
  • 23. Some Observations • It was pretty difficult to attract new committers – Took too long to VOTE them in – They were only interested in Hadoop type stuff – Not many organizations were doing web-scale search • Existing active committers dwindled • I was one of them!
  • 24. Some Observations • There wasn’t a plan for what to do next – What features to work on? – What bugs to fix? – Many considered Nutch to be “production” worthy in its current form, and there was not a huge number of internet-scale users, so people just “put up” with its existing issues, e.g., that it was difficult to configure
  • 25. Hadoop wasn’t the only spinoff • A lot of us interested in content detection and analysis, another major Nutch strength, went off to work on that in some other Apache project that I can’t remember the name of
  • 26. How can Nutch reorganize? • Strong feeling from Nutch community that we should take whoever is left and think about what the “next generation” Nutch (Nutch2) would look like • (Several cycles of) Mailing threads started by Andrzej Bialecki, Dennis Kubes, Otis Gospodnetic
  • 27. Initial Nutch2 fizzles • Ended up being a lot of talk, but there wasn’t enough interest to pick up a shovel and help dig the hole • But…there were interesting things going on – Example: Nutchbase work from Dogacan, and Enis
  • 28. What was “Nutchbase”? • Take the Apache implementation of Google’s “BigTable” – Column-oriented storage, high scalability in columns and rows • Store Nutch Web page content +
  • 29. Lots of interest in Nutchbase • But, sadly maintained as a patch for a year or more – NUTCH-650 Hbase integration • Brought about some interesting thoughts – If storage can be abstracted, what about? • Messaging layer (JMS Nutch?) • Parsing? • Indexing (Solr, Lucene, you-name-it)
  • 30. Post Nutch 1.0 • Nutch 1.0 release was a true “1.”-oh! – Included production features – Those using it were happy, b/c they had bought into the model – Useable, tuneable • But, how do we get to Nutch 2.0?
  • 31. A few things happen in parallel • 1.1 Release? – I had some free time and was willing to RM a Nutch 1.1 release to get things going • Dogacan, Enis, Julien and Andrzej got interested in moving Nutchbase forward – But took it to the next level…we’ll get back to this • We elected a new committer • Julien Nioche • Patches that had sat for years now got committed
  • 32. Oh, and Nutch became TLP • Grabbed folks that were active in Nutch community • Decided to move forward with Nutch/HBase as the de-facto platform – No need to maintain home-grown storage formats – And, take it to the next level, to ORM-ness • Decided to make Nutch a “delegator” rather than a workhorse – In other words…
  • 33. Nutch2: “Delegator” • Indexing/Querying? – Solr has a lot of interest and does tons of work in this area: let’s use it instead of vanilla Lucene • Parsing? – Tika: ditto • Storage – Let’s use the ORM layer that some of the Nutch committers were working on
  • 34. Enter Gora: “that ORM technology” • Initially baked up at Github • Decided to move to the Incubator in Sept 2010 – I was contacted and asked to champion the effort • What is Gora? – Uses Apache Avro to specify objects and their schema – ORM middleware takes Avro specs, generates Java code – plugs for HBase, Cassandra, in-memory SQL store, etc.
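As a rough illustration of the idea on slide 34 (describe the object once in an Avro-style schema, generate a class from it, and hand instances to a pluggable store), consider the sketch below. Gora itself generates Java code and targets real backends; the Python class generation, the `InMemoryStore` class, and the field names in this `WebPage` schema are assumptions made for the example and are not Gora's API or Nutch's actual schema.

```python
import json
from dataclasses import asdict, make_dataclass

# Hypothetical Avro-style record schema; field names are illustrative only.
WEB_PAGE_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "WebPage",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "content", "type": "bytes"},
    {"name": "fetch_time", "type": "long"},
    {"name": "metadata", "type": {"type": "map", "values": "string"}}
  ]
}
""")

AVRO_TO_PYTHON = {"string": str, "bytes": bytes, "long": int, "int": int, "boolean": bool}


def python_type(avro_type):
    """Map a (small subset of) Avro types to Python types."""
    if isinstance(avro_type, dict) and avro_type.get("type") == "map":
        return dict
    return AVRO_TO_PYTHON[avro_type]


def generate_class(schema):
    """Stand-in for the code generation step: build a class from the schema."""
    fields = [(f["name"], python_type(f["type"])) for f in schema["fields"]]
    return make_dataclass(schema["name"], fields)


class InMemoryStore:
    """Toy backend; a real deployment would plug in HBase, Cassandra, etc."""

    def __init__(self):
        self._rows = {}

    def put(self, key, obj):
        self._rows[key] = asdict(obj)

    def get(self, key, cls):
        return cls(**self._rows[key])


WebPage = generate_class(WEB_PAGE_SCHEMA)
store = InMemoryStore()
page = WebPage(url="http://nutch.apache.org", content=b"<html/>",
               fetch_time=0, metadata={"type": "text/html"})
store.put(page.url, page)
restored = store.get(page.url, WebPage)
```

Because callers only see `put` and `get`, swapping the storage backend does not require changes to the code that produces or consumes `WebPage` objects.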
  • 35. Nutch and Gora • Throw out all code in Nutch that had to do with the Writable interface – Generated now by “Web Page” schema in Gora – Web Page is canonical Nutch object for storage • Parse text, parse data, etc. • No more web-db, crawl-db, etc.
  • 36. Out with the old… • Throw out Nutch webapp – Solr provides REST-ful services to get at metadata/index – We’ll add the REST (pun) for storage/etc. • Throw out Lucene code • Slowly trash existing Nutch parsers
  • 37. In with the new • Get rid of webapp – Nutch 2.x has seen contributions of REST web services for full crawl cycle, storage I/F • Delegate indexing to Solr – Nutch 1.x first appearance of SolrIndexer and Nutch Solr schema • Delegate parsing to Tika – Nutch 1.1 first appearance of parse-tika – Have been decommissioning existing parsers • Suggested improvements to Tika during this process
  • 39. Learning from our mistakes • Maintenance – Checking in jars made the Nutch checkout huge (even of just the “source”) • Now using Ivy to manage dependencies – Patches sitting? • Not on my watch! Encouragement to find and commit patches that have been sitting for a while, or simply disposition them – People want to use Nutch code as “dep” • Build now includes ability for RM to push to Maven Central NOTE: CHRIS’S OPINION SLIDE
  • 40. Learning from our mistakes • Community – Folks contributing patches? • Make ’em a committer – Folks providing good testing results? • Make ’em a committer – Folks making good documentation? • Make ’em a committer – It’s the sign of a healthy Apache project if new committers (and members) are being elected NOTE: CHRIS’S OPINION SLIDE
  • 41. Learning from our mistakes • Configuration of Nutch is hard – It still is – Getting easier though – Anyone have any great ideas or patches to integrate with a DI framework? – Things like Gora, Solr, etc., are making this easier • Providing flexible service interfaces beyond Java APIs – Existing work on NUTCH-932, NUTCH-931 and NUTCH-880 is just the beginning
  • 42. Interesting work going on • I taught a class on Search Engines this past summer • Some neat projects that I’m working with my students to contribute back to Apache – Implementation of Authority/Hub scoring – Deduplication improvements – Clustering plugin improvements – Work to improve Nutch-Solr-Drupal integration
  • 43. Wrapup • Nutch has seen tremendous highs and lows over the years – We’re still kicking • The newest version of Nutch (2.0) will have a vastly slimmed down footprint, and will use existing successful frameworks for heavy lifting – Solr, Tika, Gora, Hadoop • If you’re interested in our dev, check us out at http://nutch.apache.org
  • 44. Alright, I’ll shut up now • Any questions? • THANK YOU! – mattmann@apache.org – @chrismattmann on Twitter
  • 45. Acknowledgements • Nutch team • Some material inspired from Andrzej Bialecki’s talks here • OODT team at JPL