Scaling search at Trovit with Solr and Hadoop
Marc Sturlese, Trovit
marc@trovit.com, 19 October 2011




My Background
          Marc Sturlese
          Trovit
          Software engineer focused on R&D
          Responsible for search and scalability




Agenda
          What is Trovit? Why Solr and Hadoop?
          The data workflow
          Distributed indexing strategy
          Moving indexing features out of Solr
            • Text analysis
            • Deduplication
          Performance
          Questions



What is Trovit?
       Search engine for classified ads
       Tech company located in Barcelona
       Started 5 years ago in just a single country
       Now it’s in 33 countries and 4 business
        categories
       The main purpose is to serve good quality results to
        the end user as fast as possible




What is Trovit?




Why Solr and Hadoop?
       Started as a custom Lucene search server
       Solr is very extensible and has a great
        community, so we made the move
       The datastore was MySQL, with a custom Data
        Import Handler for indexing
       Scaling up was not the way!
       Sharded MySQL strategies are hard to maintain
       Hadoop seemed a good fit




The data workflow
       Documents are crunched by a pipeline of
        MapReduce jobs
       Stats are saved for each pipeline phase to keep
        track of what happens at every moment
       Hive is used to generate those stats




The data workflow
       Pipeline overview: Incoming data → Ad processor → Diff (t vs. t-1) →
        Expiration → Deduplication → Indexing. Stats are generated at each phase.



The data workflow
       Deployment overview: Incoming data → Pipeline → Index repository on HDFS.
        Whole indexes are copied from HDFS to local disk and deployed to all slaves
        via multicast; fresh updates produced by a “minimal data processor” are
        rsynced to the slaves as index updates.



The data workflow
       Indexes are constantly rebuilt from scratch, keeping
        the desired number of segments for good search
        performance
       The “minimal data processor” allows fresh data to
        appear in the search results
       HDFS makes backups really convenient
       The multicast system allows sending indexes to all
        slaves at the same time. The only limit is your
        bandwidth



Distributed indexing strategy
      First looked at SOLR-1301
      It extends InputFormat, allowing just one index per
       reducer
      Good for generating huge indexes, building one
       shard per reducer
      To finish in minimal time, shards should have very
       similar sizes
      Reduce-side indexing seemed the way to go, but...
       our indexes differ a lot in size depending on the
       country and vertical

Distributed indexing strategy
      Single monolithic indexes or shards
      Another approach: 2-phase indexing (2
       sequential MapReduce jobs)
           • Partial indexing: generate lots of “micro
             indexes” for each monolithic or sharded index
           • Merge: group all the “micro indexes” and
             merge them to get the production indexes




Distributed indexing strategy
       2-phase indexing overview: HDFS serialized data → Partial indexer
        (micro indexes) → Merger (production indexes)
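As a rough illustration of how the two sequential MapReduce jobs could be wired together, here is a minimal Hadoop driver sketch. The class names, input/output paths, and the idea of listing micro-index paths in a text file are assumptions for illustration; this is not Trovit's actual code, and output key/value classes and input formats are omitted.

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   // Hypothetical driver for the 2-phase indexing pipeline (illustrative only).
   public class TwoPhaseIndexingDriver {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();

       // Phase 1: build "micro indexes" from the serialized ads stored in HDFS
       Job partial = new Job(conf, "partial-indexing");
       partial.setJarByClass(TwoPhaseIndexingDriver.class);
       partial.setMapperClass(MicroIndexMapper.class);     // assigns a micro index to each ad
       partial.setReducerClass(MicroIndexReducer.class);   // builds and uploads each micro index
       FileInputFormat.addInputPath(partial, new Path("/data/ads-serialized"));
       FileOutputFormat.setOutputPath(partial, new Path("/tmp/partial-indexing-out"));
       if (!partial.waitForCompletion(true)) {
         System.exit(1);
       }

       // Phase 2: merge the micro indexes into the production (monolithic or sharded) indexes
       Job merge = new Job(conf, "index-merge");
       merge.setJarByClass(TwoPhaseIndexingDriver.class);
       merge.setMapperClass(MergeMapper.class);             // groups micro-index paths per index
       merge.setReducerClass(MergeReducer.class);           // merges them with plain Lucene
       FileInputFormat.addInputPath(merge, new Path("/indexes/micro-index-paths.txt"));
       FileOutputFormat.setOutputPath(merge, new Path("/tmp/index-merge-out"));
       System.exit(merge.waitForCompletion(true) ? 0 : 1);
     }
   }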



Distributed indexing - Partial generation
       Map reads the serialized data and emits it, grouping
        by micro index
       Reduce receives the ads grouped as a “micro index”
        and builds it
       An Embedded Solr Server is used for indexing and
        optimizing
       Solr core configuration is stored in HDFS
       Indexes are built on local disk and then
        uploaded to HDFS



Distributed indexing - Partial generation

     MAP
       Input:  K: id; V: Ad
       Code:   Assign micro index to Ad
       Output: K: microIndexId; V: Ad

     REDUCE
       Input:  K: microIndexId; V: AdList<>
       Code:   Build index
       Output: K: Null; V: Message
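As a sketch only, the reduce side of the partial-generation job might look roughly like this with SolrJ's EmbeddedSolrServer (Solr 3.x API). The solr home path, core name, field mapping and Writable types are assumptions; copying the core configuration from HDFS and uploading the finished index back to HDFS are not shown.

   import java.io.File;
   import java.io.IOException;
   import org.apache.hadoop.io.NullWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
   import org.apache.solr.common.SolrInputDocument;
   import org.apache.solr.core.CoreContainer;

   // Hypothetical reducer: builds one optimized "micro index" per key with an
   // embedded Solr core running inside the reduce task (illustrative only).
   public class MicroIndexReducer extends Reducer<Text, Text, NullWritable, Text> {

     private CoreContainer cores;
     private EmbeddedSolrServer solr;

     @Override
     protected void setup(Context context) throws IOException {
       try {
         // Core configuration previously copied from HDFS to a local solr home
         File solrHome = new File("/local/solr-home");            // illustrative path
         cores = new CoreContainer(solrHome.getAbsolutePath(),
             new File(solrHome, "solr.xml"));
         solr = new EmbeddedSolrServer(cores, "ads");             // illustrative core name
       } catch (Exception e) {
         throw new IOException(e);
       }
     }

     @Override
     protected void reduce(Text microIndexId, Iterable<Text> ads, Context context)
         throws IOException, InterruptedException {
       try {
         for (Text ad : ads) {
           // Real code deserializes the ad and maps its fields onto the Solr schema
           SolrInputDocument doc = new SolrInputDocument();
           doc.addField("content", ad.toString());
           solr.add(doc);
         }
         solr.commit();
         solr.optimize();   // one optimized micro index, built on local disk
         // ...upload the local index directory to HDFS here (not shown)
         context.write(NullWritable.get(), new Text("built micro index " + microIndexId));
       } catch (Exception e) {
         throw new IOException(e);
       }
     }

     @Override
     protected void cleanup(Context context) {
       cores.shutdown();
     }
   }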



Distributed indexing - Merge phase
       Merging is done in plain Lucene
       Map reads a list of the micro-index paths and emits
        them, grouping per shard or monolithic index
       Reduce receives the list and does the proper
        merge
       Partial indexes are downloaded to local disk,
        merged into a single one and uploaded back to
        HDFS
       Since Lucene 3.1, addIndexes(Directory) uses copy,
        so the merge can be very fast
Distributed indexing - Merge phase

     MAP
       Input:  K: lineNum; V: MicroIndexPath
       Code:   Get index name
       Output: K: indexName; V: MicroIndexPath

     REDUCE
       Input:  K: indexName; V: MicroIndexPathList<>
       Code:   Merge micro indexes
       Output: K: Null; V: Message
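A minimal plain-Lucene sketch of what that merge amounts to, assuming the micro indexes have already been downloaded from HDFS to local directories; the class name, paths and analyzer choice are illustrative.

   import java.io.File;
   import java.io.IOException;
   import java.util.List;
   import org.apache.lucene.analysis.WhitespaceAnalyzer;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   import org.apache.lucene.util.Version;

   public class MicroIndexMerger {

     // Merge a set of local micro-index directories into one production index.
     // Since Lucene 3.1, addIndexes(Directory...) copies segments instead of re-indexing.
     public static void mergeMicroIndexes(File targetDir, List<File> microIndexDirs)
         throws IOException {
       IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_31,
           new WhitespaceAnalyzer(Version.LUCENE_31));   // analyzer is irrelevant for a copy merge
       IndexWriter writer = new IndexWriter(FSDirectory.open(targetDir), config);

       Directory[] sources = new Directory[microIndexDirs.size()];
       for (int i = 0; i < sources.length; i++) {
         sources[i] = FSDirectory.open(microIndexDirs.get(i));
       }

       writer.addIndexes(sources);   // copies all micro indexes into the target index
       writer.close();               // the merged index is then uploaded back to HDFS (not shown)
     }
   }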



Distributed indexing strategy
      Pros:
           • Highly scalable
           • Allows indexes of very different sizes while keeping
             good performance
           • Easy to manage
      Cons:
           • Time spent uploading to and downloading from HDFS
             before an index gets into production




Moving features out of Solr
      Useful when you have to deal with lots of data
      Text processing with Solr and Hadoop
      Distributing Solr Deduplication




Text processing with Solr and Hadoop
       Solr has many powerful analyzers already
        implemented
       Mahout tokenizes text using plain Lucene and
        Hadoop
       The setup method of a Map instantiates the
        Analyzer
       A Map receives serialized data, which is
        processed using Solr analyzers
       The Analyzer can receive configuration parameters
        from a job-site.xml file


Text processing with Solr and Hadoop
          // Init Solr analyzer (typically done once, in the Mapper's setup method)
          final List<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();

          // Word delimiter filter, configured from the Hadoop job configuration
          final TokenFilterFactory wordDelimiter = new WordDelimiterFilterFactory();
          Map<String, String> args = new HashMap<String, String>();
          args.put("generateWordParts", conf.get(WORD_PARTS));
          args.put("splitOnNumerics", conf.get(NUMERIC_SPLIT));
          wordDelimiter.init(args);

          // Accent folding and lowercasing
          final TokenFilterFactory accent = new ISOLatin1AccentFilterFactory();
          final TokenFilterFactory lowerCase = new LowerCaseFilterFactory();

          filters.add(wordDelimiter);
          filters.add(accent);
          filters.add(lowerCase);

          // Chain the tokenizer and the filters into a single Solr analyzer
          final TokenizerFactory tokenizer = new StandardTokenizerFactory();
          analyzer = new TokenizerChain(null, tokenizer,
              filters.toArray(new TokenFilterFactory[filters.size()]));



Text processing with Solr and Hadoop
               // Tokenizing text with the analyzer built above
               ...
               HashSet<String> tokens = new HashSet<String>();
               TokenStream stream = analyzer.reusableTokenStream(fieldName,
                   new StringReader(fieldValue));
               TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);

               // Collect the distinct terms produced for this field value
               while (stream.incrementToken()) {
                   tokens.add(termAtt.term());
               }
               return tokens;
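As a sketch of how these two snippets fit into a MapReduce job (this wiring is not shown in the slides): the analyzer is built once in the mapper's setup and the tokenizing code runs per record. Here buildAnalyzer() and tokenize() stand for the two snippets above, and the class name, field name and value types are illustrative assumptions.

   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.lucene.analysis.Analyzer;

   // Hypothetical mapper: builds the Solr analyzer once per task and emits one
   // record per distinct token found in the ad text (illustrative only).
   public class TokenizeMapper extends Mapper<LongWritable, Text, Text, Text> {

     private Analyzer analyzer;

     @Override
     protected void setup(Context context) {
       Configuration conf = context.getConfiguration();
       // buildAnalyzer() wraps the "init Solr analyzer" snippet above,
       // reading generateWordParts / splitOnNumerics from the job configuration
       analyzer = buildAnalyzer(conf);
     }

     @Override
     protected void map(LongWritable key, Text ad, Context context)
         throws IOException, InterruptedException {
       // tokenize() wraps the reusableTokenStream snippet above
       for (String token : tokenize("description", ad.toString())) {
         context.write(new Text(token), ad);
       }
     }
   }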




Distributed deduplication
       Compute near duplicates in a distributed
        environment
       Map receives serialized ads and emits them, building
        the key with Solr’s TextProfileSignature
       Reduce receives the duplicate ads grouped together.
        There, you decide what to keep
       The field names used to compute the signature are
        received as configuration parameters from a
        job-site.xml file



Distributed deduplication

     MAP
       Input:  K: id; V: Ad
       Code:   Build signature
       Output: K: signature; V: Ad

     REDUCE
       Input:  K: signature; V: AdList<>
       Code:   Dups logic
       Output: K: id; V: Ad
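A minimal sketch of the map side, assuming the Solr 3.x Signature API (init / add / getSignature) and Text-serialized ads. The signature parameters, field handling and class name are illustrative; the real job reads the field names to fingerprint from job-site.xml.

   import java.io.IOException;
   import java.math.BigInteger;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.solr.common.params.ModifiableSolrParams;
   import org.apache.solr.update.processor.TextProfileSignature;

   // Hypothetical mapper: emits each ad keyed by its TextProfileSignature so that
   // near-duplicates land in the same reducer group (illustrative only).
   public class DedupSignatureMapper extends Mapper<LongWritable, Text, Text, Text> {

     @Override
     protected void map(LongWritable key, Text ad, Context context)
         throws IOException, InterruptedException {
       TextProfileSignature sig = new TextProfileSignature();

       ModifiableSolrParams params = new ModifiableSolrParams();
       params.set("quantRate", "0.01");   // illustrative values, not Trovit's settings
       params.set("minTokenLen", "2");
       sig.init(params);

       // Real code would feed only the configured fields; here the whole ad is used
       sig.add(ad.toString());
       String signature = new BigInteger(1, sig.getSignature()).toString(16);

       context.write(new Text(signature), ad);
     }
   }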



Performance: Setting the merge factor
       Used by LogByteSizeMergePolicy and older merge
        policies
       When indexing, it tells Lucene how many
        segments can be created before a merge
        happens
       A very low value will keep the index almost
        optimized. Good for search performance, but
        indexing will be slower
       A high value will generate lots of files. Indexing
        will be faster, but search requests will not be
       New versions of Solr default to
        TieredMergePolicy, which doesn’t use it
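For reference, a minimal Lucene 3.x sketch of setting the merge factor explicitly (classes are from org.apache.lucene; the value 10 is simply Lucene's default, not a Trovit setting). In Solr 3.x the same knob is the <mergeFactor> setting in solrconfig.xml.

       // A low merge factor keeps few segments (faster search, slower indexing);
       // a high one keeps many segments (faster indexing, slower search).
       LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
       mergePolicy.setMergeFactor(10);   // Lucene's default; tune per index

       IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_31,
           new StandardAnalyzer(Version.LUCENE_31));
       config.setMergePolicy(mergePolicy);

       IndexWriter writer = new IndexWriter(
           FSDirectory.open(new File("/path/to/index")), config);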
?
Contact
       Thank you for your attention
       Marc Sturlese
            • marc@trovit.com
            • www.trovit.com




