                        Migration from FAST ESP to Lucene Solr
                                   Presented by Michael McIntosh
                               michaelm@tnrglobal.com, Oct 19th, 2011




Wednesday, October 19, 11
What will we cover?
                Core Aspects of ESP to Solr Migration
                            Migration Overview
                            Crawling Content
                            Processing Content
                            Searching Content
                            Scaling for Growth
                            Questions?
                                            © 2011 TNR Global, LLC.

Who am I?

                    • 7+ Years FAST ESP
                    • 10+ Years in Search
                    • 15+ Years in Software
                    • Early Lycos Developer
                    • I also develop brain-computer interfaces :)

Who are we?

                    • 7+ Years in Search
                    • 15+ Years in Web Dev
                    • 30+ Years in Software
                    • Focus on ESP, Solr, Lucene, and the Cloud
                    • Scalable Web & Search Solution Experts

Migration Overview


Migration Challenges

                    • Our clients depend on ESP 5.3
                    • No future support for Linux ESP
                    • We need a viable exit strategy
                    • We want a fairly painless approach
                    • How do we provide an alternative?

Migration Use Case

                            Federated Product Search
                            ...millions of parts and services...

                    • XML documents (highly-structured)
                    • PDF documents (semi-structured)
                    • HTML documents (unstructured)

Our Approach
                            Solr Search Platform (SolrSP)
                    • Custom Scalable Crawler using Heritrix
                    • Events & Queues managed with RabbitMQ
                    • Caching & Persistence supported via Riak
                    • Python pipeline replacement using Pypes
                    • Advanced Linguistics via NLTK or Rosette

Crawling Content


Crawling for ESP

                    • For XML content, our scripts query a
                            service, download resources and feed
                    • For PDF content, our scripts query a
                            database, download PDF URLs and feed
                    • For HTML, our scripts query a database,
                            download seed URLs and launch ESP’s
                            Enterprise Crawler

Crawling for Solr

                    • For XML & PDF content, the approach
                            remains the same with a different writer
                    • We tried the Nutch crawler, but found it
                            challenging to make it do what we needed
                    • We tried the crawler bundled with Lucid Works,
                            but found the exposed functionality did not
                            offer the level of flexibility we needed
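As a rough illustration of what "a different writer" can mean for XML and PDF content, the sketch below serializes already-fetched records into Solr's XML update format (an `<add>` element holding `<doc>` and `<field name="...">` elements). The function name, field names and sample record are hypothetical, and posting the result to a Solr update handler is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Serialize dict records into a Solr <add> update message."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # Multi-valued fields are written as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical part record; real field names depend on your schema.
    sample = [{"id": "part-001", "title": "Hex bolt",
               "category": ["fasteners", "bolts"]}]
    print(to_solr_add_xml(sample))
```

The fetching scripts stay as they are; only the final serialization step changes.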

Crawling with Heritrix

                    • Heritrix, created by the Internet Archive,
                            supports much of the same functionality
                            that the ESP Enterprise Crawler provides
                    • We wrapped Heritrix to provide a higher
                            level interface for service management
                    • Made it scalable and added document
                            caching via Riak to support refresh crawling
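One way to picture how a document cache supports refresh crawling: keep a content digest per URL and re-process a page only when its digest changes. This is a sketch under assumptions, not our crawler code; a plain dict stands in for the Riak bucket, and `needs_reprocessing` is a hypothetical helper.

```python
import hashlib


def needs_reprocessing(cache, url, content):
    """Return True if the content is new or changed, and update the cache."""
    digest = hashlib.sha1(content).hexdigest()
    if cache.get(url) == digest:
        return False  # unchanged since the last crawl
    cache[url] = digest
    return True


if __name__ == "__main__":
    cache = {}  # stand-in for a persistent key/value store such as Riak
    page = b"<html>part 42</html>"
    print(needs_reprocessing(cache, "http://example.com/p/42", page))
    print(needs_reprocessing(cache, "http://example.com/p/42", page))
```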

Crawler Architecture
                            [Diagram, top to bottom]
                            1. Crawl Job Request, Crawler Manager
                            2. Queue Cluster (RabbitMQ)
                            3. Heritrix Messenger (three instances)
                            4. Heritrix Crawler (three instances)
                            5. Persistence Cluster (Riak)
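The flow in this architecture can be sketched with the standard library alone, with `queue.Queue` standing in for the RabbitMQ cluster and a dict for the Riak cluster. The function names, the job format and the `fetch` callback are hypothetical; a real messenger would hand each job to a Heritrix instance instead of calling a function.

```python
import queue
import threading


def messenger(jobs, store, fetch):
    """Consume crawl jobs until a None sentinel arrives; persist results."""
    while True:
        job = jobs.get()
        if job is None:
            break
        store[job["url"]] = fetch(job["url"])


def run_crawl(urls, fetch, workers=3):
    jobs = queue.Queue()   # stand-in for the queue cluster
    store = {}             # stand-in for the persistence cluster
    threads = [threading.Thread(target=messenger, args=(jobs, store, fetch))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for url in urls:       # the crawler manager publishes one job per seed
        jobs.put({"url": url})
    for _ in threads:      # one sentinel per worker to shut down cleanly
        jobs.put(None)
    for t in threads:
        t.join()
    return store
```

Because workers only talk to the queue and the store, adding crawler nodes does not require changing the manager.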



Processing Content


Processing for ESP
                            ESP Processing is document-centric
                    • For XML, we transform, tag metadata,
                            classify content before indexing
                    • For PDF, we split pages, generate
                            thumbnails, tag metadata and classify before
                            indexing
                    • For HTML, we normalize, clean content,
                            tag metadata and classify before indexing

Processing for Solr
                              Solr Processing is field-centric
                    • Solr analyzers work on a field by field basis
                            and lack the flexible workflow ESP provides
                    • Using some Solr analyzers for now, but
                            evaluating alternatives (Rosette, NLTK)
                    • Hadoop + Cascading looks promising
                    • We use Stackless Python with Pypes to
                            make ESP stage migration less painful
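The difference between field-centric and document-centric processing can be shown with two small functions. Both are hypothetical illustrations in plain Python; neither is a Solr analyzer nor an ESP stage, and the field names and classification rule are invented for the example.

```python
def lowercase_analyzer(value):
    """Field-centric: sees one field value at a time."""
    return value.lower()


def classify_stage(doc):
    """Document-centric: may read several fields and write a new one."""
    text = (doc.get("title", "") + " " + doc.get("body", "")).lower()
    doc["category"] = "fasteners" if "bolt" in text else "other"
    return doc


if __name__ == "__main__":
    print(lowercase_analyzer("Hex BOLT"))
    print(classify_stage({"title": "Hex Bolt", "body": "M8 x 40"}))
```

A stage like `classify_stage` needs the whole document, which is why this kind of work moves into a pipeline that runs before documents reach Solr.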

Processing with Pypes
                              •   Written in Python

                              •   Easy stage migration

                              •   Very flexible & robust

                              •   Branching & Merging

                              •   Single Input, Many
                                  Outputs

                              •   Trivial to embed and
                                  extend
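To make "Single Input, Many Outputs" concrete, here is a minimal branching sketch in plain Python. It does not use the Pypes API; `route`, the port names and the predicates are hypothetical.

```python
def route(docs, predicates):
    """Send each document to every output port whose predicate matches."""
    outputs = {port: [] for port in predicates}
    for doc in docs:
        for port, accepts in predicates.items():
            if accepts(doc):
                outputs[port].append(doc)
    return outputs


if __name__ == "__main__":
    docs = [{"id": 1, "type": "pdf"}, {"id": 2, "type": "html"}]
    ports = {
        "pdf": lambda d: d["type"] == "pdf",    # e.g. page splitting, thumbnails
        "html": lambda d: d["type"] == "html",  # e.g. cleaning, normalization
        "all": lambda d: True,                  # e.g. metadata tagging
    }
    print(route(docs, ports))
```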

Processor Migration

                                ...From ESP




Processor Migration

                                ...to Pypes
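As a loose illustration of what such a migration can look like, the sketch below wraps a legacy per-document stage in a generator-based component so the stage logic is reused unchanged. `LegacyStage`, its `process` method and `as_component` are hypothetical names; they mirror neither the ESP document processor API nor the Pypes API.

```python
class LegacyStage:
    """Hypothetical stand-in for an existing per-document stage."""

    def process(self, doc):
        doc["title"] = doc.get("title", "").strip()
        return doc


def as_component(stage):
    """Wrap a per-document stage as a stream-to-stream component."""
    def component(docs):
        for doc in docs:
            yield stage.process(doc)
    return component


if __name__ == "__main__":
    clean = as_component(LegacyStage())
    print(list(clean([{"title": "  Hex bolt "}])))
```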




Searching Content


Feature Differences
                    •       ESP has robust faceting support but facets must be
                            defined at index time, unlike Solr faceting

                    •       Solr does most of the heavy lifting at query time,
                            which allows for more flexible approaches

                    •       Solr now directly supports taxonomy (hierarchical)
                            faceting functionality (for drill down categories)

                    •       Solr now supports field collapsing, which we use
                            heavily in our ESP installation to collapse result sets

                    •       ESP to Solr schema mapping is fairly straightforward
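Because Solr facets and collapses at query time, both are requested through ordinary request parameters. The sketch below builds such a query string using Solr's `facet`, `facet.field`, `group` and `group.field` parameters; the helper name and the field names are hypothetical.

```python
from urllib.parse import urlencode


def build_query(q, facet_fields=(), group_field=None):
    """Build a Solr select query string with query-time facets and grouping."""
    params = [("q", q), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", f) for f in facet_fields)
    if group_field:
        params.append(("group", "true"))
        params.append(("group.field", group_field))
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical field names for a parts catalog.
    print(build_query("bolt", ["manufacturer", "category"], "part_family"))
```

Changing which fields are faceted is a change to the request, with no reindexing needed, provided the fields are indexed.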

Search Interface
                    •       Solr has no direct equivalent to FAST Query
                            Language (FQL) but function queries look like a
                            possible option for complex queries

                    •       If you don’t have overly complex queries, the
                            edismax query parser looks like a good option

                    •       Solr doesn’t have an easily extendable search-front
                            component like ESP, but we like TwigKit for that

                    •       Default Solr stemmer isn’t as good as the ESP
                            lemmatizer, so if you need good lemmatization
                            consider Rosette Linguistics Platform or NLTK
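For queries that are not overly complex, an edismax request mostly comes down to choosing fields and boosts. The sketch below assembles the `defType`, `q` and `qf` parameters; the helper name, field names and boost values are hypothetical.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts):
    """Build edismax request parameters from a query and per-field boosts."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return urlencode({"defType": "edismax", "q": user_query, "qf": qf})


if __name__ == "__main__":
    # Hypothetical fields: weight title matches twice as much as body matches.
    print(edismax_params("hex bolt", {"title": 2, "body": 1}))
```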

Scaling for Growth


About the hardware...
                    • Solr allows you to use the familiar rows /
                            columns layout ESP uses
                    • Add shards to scale content, add search
                            slaves to scale queries
                    • We’re currently using a master/slave indexer/search
                            setup, but options are numerous
                    • We’re developing a solution to support
                            scaling at will, a pain point for ESP as well
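A minimal sketch of the "columns" side of this layout, assuming documents are routed to shards by a stable hash of their id and that searches use Solr's `shards` request parameter for distributed search. The helper names and host names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash so an id always maps to the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def shards_param(hosts):
    """Value for Solr's `shards` request parameter (distributed search)."""
    return ",".join(hosts)


if __name__ == "__main__":
    hosts = ["solr1:8983/solr", "solr2:8983/solr"]  # hypothetical shard hosts
    print(shard_for("part-001", len(hosts)))
    print(shards_param(hosts))
```

A modulo scheme like this remaps most documents when the shard count changes, which is part of why scaling at will takes extra work.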

It's not just hardware...
                    • Use Fabric to automate cluster installs, data
                            builds and deployment tasks
                    • Use Jenkins to automate, manage and track
                            Fabric tasks
                    • Use Supervisor to manage multiple services
                            running on each node
                    • Use Lucid Works for better out-of-the-box
                            stemming, alerts, services and support
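As an example of the Supervisor piece, the sketch below generates a `[program:x]` section using Supervisor's `command`, `directory`, `autostart` and `autorestart` settings. The helper name, program name, command and path are hypothetical; a deployment task could write one such section per service on each node.

```python
import configparser
import io


def supervisor_program(name, command, directory):
    """Render a Supervisor [program:<name>] section as INI text."""
    config = configparser.ConfigParser()
    config[f"program:{name}"] = {
        "command": command,
        "directory": directory,
        "autostart": "true",
        "autorestart": "true",
    }
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical command and install path.
    print(supervisor_program("solr", "java -jar start.jar", "/opt/solr/example"))
```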

Migration In a Nutshell

                    •       We now consider Solr robust enough to be a
                            viable replacement for a FAST ESP solution

                    •       You supply the glue, or work with someone like us
                            to tie the different components together

                    •       If you have many custom pipeline stages, consider
                            using Pypes to ease your initial ESP migration

                    •       Fully supported versions of Solr, with the latest
                            cutting-edge features, are available via Lucid Works

Resources
                       Lucid Works   http://www.lucidimagination.com/
                         Rosette     http://www.basistech.com/lucene/
                         Heritrix    http://crawler.archive.org/
                         TwigKit     http://twigkit.com/
                           Pypes     https://bitbucket.org/diji/pypes/
                            Riak     http://basho.com/
                           NLTK      http://www.nltk.org/
                        RabbitMQ     http://www.rabbitmq.com/
                        Cascading    http://www.cascading.org/
                           Fabric    http://fabfile.org/
                          Jenkins    http://jenkins-ci.org/
                        Supervisor   http://supervisord.org/

Questions?
                    • Contact Us!
                     • Website: http://www.tnrglobal.com
                     • E-Mail: fast2solr@tnrglobal.com
                     • Phone: 001-413-425-1499

                      Thank you for your time!
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Migrating from FAST ESP to Lucene Solr

  • 1. Migration from FAST ESP to Lucene Solr Presented by Michael McIntosh michaelm@tnrglobal.com, Oct 19th, 2011 Wednesday, October 19, 11
  • 2. What will we cover? Core Aspects of ESP to Solr Migration Migration Overview Crawling Content Processing Content Searching Content Scaling for Growth Questions? © 2011 TNR Global, LLC. Wednesday, October 19, 11
  • 3. Who am I? • 7+ Years FAST ESP • 10+ Years in Search • 15+ Years in Software • Early Lycos Developer • I also develop brain-computer interfaces :)
  • 4. Who are we? • 7+ Years in Search • 15+ Years in Web Dev • 30+ Years in Software • Focus on ESP, Solr, Lucene, and the Cloud • Scalable Web & Search Solution Experts
  • 5. Migration Overview
  • 6. Migration Challenges • Our clients depend on ESP 5.3 • No future support for Linux ESP • We need a viable exit strategy • We want a fairly painless approach • How do we provide an alternative?
  • 7. Migration Use Case Federated Product Search ...millions of parts and services... • XML documents (highly-structured) • PDF documents (semi-structured) • HTML documents (unstructured)
  • 8. Our Approach Solr Search Platform (SolrSP) • Custom Scalable Crawler using Heritrix • Events & Queues managed with RabbitMQ • Caching & Persistence supported via Riak • Python pipeline replacement using Pypes • Advanced Linguistics via NLTK or Rosette
  • 9. Crawling Content
  • 10. Crawling for ESP • For XML content, our scripts query a service, download resources and feed • For PDF content, our scripts query a database, download PDF URLs and feed • For HTML, our scripts query a database, download seed URLs and launch ESP’s Enterprise Crawler
  • 11. Crawling for Solr • For XML & PDF content, the approach remains the same with a different writer • We tried the Nutch crawler, but found it challenging to make it do what we needed • We tried the crawler bundled with Lucid Works, but found that the exposed functionality did not offer the flexibility we needed
  • 12. Crawling with Heritrix • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides • We wrapped Heritrix to provide a higher level interface for service management • Made it scalable and added document caching via Riak to support refresh crawling
  • 13. Crawler Architecture (diagram; components shown: Crawl Job Request, Crawler Manager, Queue Cluster (RabbitMQ), three Heritrix Messengers, three Heritrix Crawlers, Persistence Cluster (Riak))
  • 14. Processing Content
  • 15. Processing for ESP ESP Processing is document-centric • For XML, we transform, tag metadata, classify content before indexing • For PDF, we split pages, generate thumbnails, tag metadata and classify before indexing • For HTML, we normalize, clean content, tag metadata and classify before indexing
  • 16. Processing for Solr Solr Processing is field-centric • Solr analyzers work on a field-by-field basis and lack the flexible workflow ESP provides • We are using some Solr analyzers for now, but evaluating alternatives (Rosette, NLTK) • Hadoop + Cascading looks promising • We use Stackless Python with Pypes to make ESP stage migration less painful
  • 17. Processing with Pypes • Written in Python • Easy stage migration • Very flexible & robust • Branching & Merging • Single Input, Many Outputs • Trivial to embed and extend
  • 18. Processor Migration ...From ESP
  • 19. Processor Migration ...to Pypes
  • 20. Searching Content
  • 21. Feature Differences • ESP has robust faceting support, but facets must be defined at index time, unlike Solr faceting • Solr does most of the heavy lifting at query time, which allows for more flexible approaches • Solr now directly supports taxonomy (hierarchical) faceting functionality (for drill down categories) • Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets • ESP to Solr schema mapping is fairly straightforward
  • 22. Search Interface • Solr has no direct equivalent to FAST Query Language (FQL) but function queries look like a possible option for complex queries • If you don’t have overly complex queries, the edismax query parser looks like a good option • Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that • Default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization consider Rosette Linguistics Platform or NLTK
  • 23. Scaling for Growth
  • 24. About the hardware... • Solr allows you to use the familiar rows / columns layout ESP uses • Add shards to scale content, add search slaves to scale queries • We’re currently using a master/slave indexer/search setup, but options are numerous • We’re developing a solution to support scaling at will, a pain point for ESP as well
  • 25. It’s not just hardware... • Use Fabric to automate cluster installs, data builds and deployment tasks • Use Jenkins to automate, manage and track Fabric tasks • Use Supervisor to manage multiple services running on each node • Use Lucid Works for better out-of-the-box stemming, alerts, services and support
  • 26. Migration In a Nutshell • We now consider Solr robust enough to be a viable replacement of a FAST ESP solution • You supply the glue, or work with someone like us to tie the different components together • If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration • Fully supported versions of Solr are available via Lucid Works using latest cutting edge features
  • 27. Resources • Lucid Works: http://www.lucidimagination.com/ • Rosette: http://www.basistech.com/lucene/ • Heritrix: http://crawler.archive.org/ • TwigKit: http://twigkit.com/ • Pypes: https://bitbucket.org/diji/pypes/ • Riak: http://basho.com/ • NLTK: http://www.nltk.org/ • RabbitMQ: http://www.rabbitmq.com/ • Cascading: http://www.cascading.org/ • Fabric: http://fabfile.org/ • Jenkins: http://jenkins-ci.org/ • Supervisor: http://supervisord.org/
  • 28. Questions? • Contact Us! • Website: http://www.tnrglobal.com • E-Mail: fast2solr@tnrglobal.com • Phone: 001-413-425-1499 Thank you for your time!
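
Slides 21, 22 and 24 note that Solr does its faceting, field collapsing, edismax parsing and shard fan-out at query time. As a rough illustration only, the sketch below builds a single Solr select URL that switches those features on through standard request parameters (`facet`, `facet.field`, `group`, `group.field`, `defType`, `qf`, `shards`). The host names, the field names (`title`, `body`, `brand`, `part_number`) and the helper `build_solr_query` are assumptions made for the example; they do not come from the deck.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, user_query, facet_fields=(), group_field=None, shards=()):
    """Return a Solr /select URL; every feature is requested at query time."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),      # edismax parser for simple user queries
        ("qf", "title^2 body"),      # hypothetical fields, title boosted
        ("wt", "json"),
    ]
    if facet_fields:
        # Facets are chosen per request, not fixed at index time.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if group_field:
        # Field collapsing (result grouping) on one field.
        params.append(("group", "true"))
        params.append(("group.field", group_field))
    if shards:
        # Distributed search: comma-separated list of shard addresses.
        params.append(("shards", ",".join(shards)))
    return base_url + "/select?" + urlencode(params)


url = build_solr_query(
    "http://localhost:8983/solr",            # hypothetical Solr endpoint
    "brake pads",
    facet_fields=["brand"],
    group_field="part_number",
    shards=["solr1:8983/solr", "solr2:8983/solr"],
)
print(url)
```

Because everything is a request parameter, a different facet or grouping field can be tried without reindexing, which is the flexibility the feature comparison points at.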
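
Slides 15 to 17 describe document-centric processing, where each stage receives a whole document and may emit several (for example, splitting a PDF into pages before tagging metadata). The sketch below shows that idea in plain Python. It is not the Pypes API; the stage functions, the document keys and `run_pipeline` are hypothetical names chosen for the illustration.

```python
def split_pages(doc):
    """Single input, many outputs: emit one document per page."""
    for number, text in enumerate(doc["pages"], start=1):
        yield {"id": f"{doc['id']}-p{number}", "source": doc["id"], "text": text}


def tag_metadata(doc):
    """Attach a simple metadata field to each page document."""
    yield dict(doc, length=len(doc["text"]))


def run_pipeline(docs, stages):
    """Pass every document through each stage in order."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


result = run_pipeline(
    [{"id": "manual", "pages": ["a", "bcd"]}],   # made-up input document
    [split_pages, tag_metadata],
)
print(result)
```

Each stage sees the full document rather than a single field, which is the contrast with Solr's field-by-field analyzers drawn on slide 16.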