SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
Grouping & Joining
                Martijn van Groningen
                martijn.vangroningen@searchworkings.com
                Lucene Committer & PMC Member




Thursday, May 17, 2012
Grouping & Joining

       Overview
      ‣ Background



      ‣ Joining



      ‣ Result grouping



      ‣ Conclusion




                          Searchworkings.org - The online search community   2
Thursday, May 17, 2012
Background

       Lucene’s model
      ‣ Lucene is document based.



      ‣ Lucene doesn’t store information about relations between documents.



      ‣ Data often holds relations.



      ‣ Good free text search over relational data.




                          Searchworkings.org - The online search community    3
Thursday, May 17, 2012
Background

       Example
         ‣ Product

               ‣ Name

               ‣ Description

         ‣ Product-item

               ‣ Color

               ‣ Size

               ‣ Price



         ‣ Goal: Show the most applicable product based on product-item criteria.
                               Searchworkings.org - The online search community   4
Thursday, May 17, 2012
Background

       Common Lucene solutions
      ‣ Compound documents.

            ‣ May result in documents with many fields.

      ‣ Subsequent searches.

            ‣ May cause a lot network overhead.




      ‣ Non Lucene based approach:

            ‣ If free text search isn’t very important use a relational database.


                             Searchworkings.org - The online search community       5
Thursday, May 17, 2012
Background

       Example domain
      ‣ Compound Product & Product-items document.

      ‣ Each product-item has its own field prefix.




                          Searchworkings.org - The online search community   6
Thursday, May 17, 2012
Background

       Different solutions
      ‣ Lucene offers solutions to have a 'relational' like search.

            ‣ Parent child



      ‣ Grouping & joining aren't naturally supported.

            ‣ All the solutions do increase the search time.



      ‣ Some scenarios grouping and joining isn't the right solution.




                             Searchworkings.org - The online search community   7
Thursday, May 17, 2012
Joining
                Modelling relations




Thursday, May 17, 2012
Joining

       Introduction
      ‣ Support for parent child like search from Lucene 3.4

            ‣ Not a SQL join.



      ‣ The parent and each children are stored as documents.



      ‣ Two types:

            ‣ Index time join

            ‣ Query time join


                                Searchworkings.org - The online search community   9
Thursday, May 17, 2012
Joining

       Index time join
      ‣ Two block join queries:

            ‣ ToParentBlockJoinQuery

            ‣ ToChildBlockJoinQuery



      ‣ One Lucene collector:

            ‣ ToParentBlockJoinCollector



      ‣ Index time join requires block indexing.


                            Searchworkings.org - The online search community   10
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Atomically adding documents.

            ‣ A block of documents.



      ‣ Each document gets sequentially assigned Lucene document id.



      ‣ IndexWriter#addDocuments(docs);




                            Searchworkings.org - The online search community   11
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Index doesn't record blocks.

            ‣ Segment merging doesn’t re-order documents in a segment.



      ‣ App is responsible for identifying block documents.

            ‣ Marking the last document in a block.



      ‣ Adding a document to a block requires you to reindex the whole block.

      ‣ Removing a document from a block doesn’t requires reindexing a block.


                            Searchworkings.org - The online search community    12
Thursday, May 17, 2012
Joining

       Example domain
      ‣ Parent is the last document in a block.




                          Searchworkings.org - The online search community   13
Thursday, May 17, 2012
Joining

       Block indexing

                             Marking parent documents




                         Searchworkings.org - The online search community   14
Thursday, May 17, 2012
Joining

       Block indexing




                                                                            Add block




                                                                             Add block



                         Searchworkings.org - The online search community                15
Thursday, May 17, 2012
Joining

       ToParentBlockJoinQuery
      ‣ Parent filter marks the parent documents.



      ‣ Child query is executed in the parent space.




      ‣ ToChildBlockJoinQuery works in the opposite direction.
                         Searchworkings.org - The online search community   16
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ Query time joining is executed in two phases and is field based:

            ‣ fromField

            ‣ toField



      ‣ Doesn’t require block indexing.




                          Searchworkings.org - The online search community   17
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ First phase collects all the terms in the fromField for the documents
        that match with the original query.

            ‣ Currently doesn’t take the score from original query into account.



      ‣ The second phase returns the documents that match with the collected
        terms from the previous phase in the toField.



      ‣ Two different implementations:

            ‣ JoinUtil - Lucene (≥ 3.6)

            ‣ Join query parser - Solr (trunk)
                             Searchworkings.org - The online search community      18
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                           Referrer the product id.
                         Searchworkings.org - The online search community   19
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                         Searchworkings.org - The online search community   20
Thursday, May 17, 2012
Joining

       Query time joining




      ‣ Result will contain one product.

      ‣ Possible to join over two indices.




                          Searchworkings.org - The online search community   21
Thursday, May 17, 2012
Joining

       Final thoughts
      ‣ Joining module has good solutions to model parent child relations.



      ‣ Use block join if you care about scoring.

            ‣ Frequent updates can be problematic.

      ‣ Use query time join for parent child filtering.

            ‣ Query time join is slower than index time join.



      ‣ Mostly a Lucene feature only.

      ‣ All code is annotated as experimental.
                             Searchworkings.org - The online search community   22
Thursday, May 17, 2012
Result grouping
                Previously known as Field Collapsing.




Thursday, May 17, 2012
Result grouping

       Introduction
      ‣ Group matching documents that share a common property.



      ‣ Search hit represents a group.

            ‣ Facet counts & total hit count represent groups.



      ‣ Per group collect information

            ‣ Most relevant document.

            ‣ Top three documents.

            ‣ Aggregated counts
                             Searchworkings.org - The online search community   24
Thursday, May 17, 2012
Result grouping

       Usages
      ‣ Group documents by a shared property

            ‣ Product-item by product id (Parent child)



      ‣ Collapse similar looking documents

            ‣ E.g. all results from the Wikipedia domains.



      ‣ Remove duplicates from the search result.

            ‣ Based on a field that contains a hash


                             Searchworkings.org - The online search community   25
Thursday, May 17, 2012
Result grouping

       Example domain




      ‣ Each Product-item is a document, but includes the product data.


                         Searchworkings.org - The online search community   26
Thursday, May 17, 2012
Result grouping

       Implementation
      ‣ Result grouping implemented with Lucene collectors.

            ‣ Module in trunk and a contrib in 3.x versions.



      ‣ Two pass result grouping.

            ‣ Grouping by indexed field, function or doc values.



      ‣ Single pass result grouping.

            ‣ Requires block indexing.


                             Searchworkings.org - The online search community   27
Thursday, May 17, 2012
Result grouping

       Two pass implementation
      ‣ First pass collects the top N groups.

            ‣ Per group: group value + sort value



      ‣ Second pass collects data for each top group.

            ‣ The top N documents per group.

            ‣ Possible other aggregated information.



      ‣ Second pass search ignores all documents outside topN groups.


                            Searchworkings.org - The online search community   28
Thursday, May 17, 2012
Result grouping

       Result grouping - Indexing




                         Searchworkings.org - The online search community   29
Thursday, May 17, 2012
Result grouping

       Result grouping - Searching




                         Searchworkings.org - The online search community   30
Thursday, May 17, 2012
Result grouping

       Result grouping made easier
     ‣ GroupingSearch




     ‣ Solr

         ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

     ‣ Many more options:

         ‣ http://wiki.apache.org/solr/FieldCollapsing
                            Searchworkings.org - The online search community     31
Thursday, May 17, 2012
Result grouping

       Parent child result
      ‣ TopGroups - Equivalent to TopDocs.

            ‣ Hit count

            ‣ Group count

            ‣ Groups

                  ‣ Top documents



      ‣ Facet and total count can represent groups instead of documents.

            ‣ But requires more query time.


                              Searchworkings.org - The online search community   32
Thursday, May 17, 2012
Conclusion
                Compare...




Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Result grouping

            ‣ + Distributed support & Parent child relation as hit.

            ‣ - Parent data duplication

            ‣ - Impact on query time



      ‣ Joining

            ‣ + Fast & no data duplication

            ‣ - Index time join not optimal for updates

            ‣ - Query time join is limited.
                              Searchworkings.org - The online search community   34
Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Compound documents.

            ‣ + Fast and works out-of-the box with all features.

            ‣ - Not flexible when it comes to updates.

            ‣ - Document granularity is set in stone.




                             Searchworkings.org - The online search community   35
Thursday, May 17, 2012
Any questions?




                                          36

Thursday, May 17, 2012
Extra slides
                We have time left!




Thursday, May 17, 2012
Conclusion

       Future work
       ‣ Higher level parent-child API.

             ‣ Needs to cover search & indexing.



       ‣ Joining

             ‣ Distributed support.

             ‣ Represent a hit as a parent child relation in the search result.



       ‣ Result grouping

             ‣ Aggregated grouped information like: sum, avg, min, max etc.
                              Searchworkings.org - The online search community    38
Thursday, May 17, 2012
Joining

       ToParentBlockJoinCollector




         ‣ TopGroups contains a group per top N parent document.

         ‣ Each group contains a parent and child documents.

                          Searchworkings.org - The online search community   39
Thursday, May 17, 2012
Result grouping

       Groups & facet counts
      ‣ Faceting and result grouping are different features.

            ‣ But are often used together!



      ‣ Facet counts can be based on:

            ‣ Found documents.

            ‣ Found groups.

            ‣ Combination of facet value and group.



      ‣ All options are supported in Solr.
                              Searchworkings.org - The online search community   40
Thursday, May 17, 2012
Result grouping

       Doc values
      ‣ Doc values / Column Stride values



      ‣ Prevents the creation of expensive data structures in FieldCache.



      ‣ Inverted index is meant for free text search.



      ‣ All grouping collectors have doc values based implementations!




                          Searchworkings.org - The online search community   41
Thursday, May 17, 2012

Mais conteúdo relacionado

Mais procurados

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsAlexander Korotkov
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Vinay Kumar Chella
 
MongoDB Fundamentals
MongoDB FundamentalsMongoDB Fundamentals
MongoDB FundamentalsMongoDB
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiJukka Zitting
 
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at ScaleSmart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at ScaleDatabricks
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB
 
[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL TuningPgDay.Seoul
 
JCR - Java Content Repositories
JCR - Java Content RepositoriesJCR - Java Content Repositories
JCR - Java Content RepositoriesCarsten Ziegeler
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitJukka Zitting
 
MongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB
 
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective IndexingMongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective IndexingMongoDB
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedis Labs
 
Creating Continuously Up to Date Materialized Aggregates
Creating Continuously Up to Date Materialized AggregatesCreating Continuously Up to Date Materialized Aggregates
Creating Continuously Up to Date Materialized AggregatesEDB
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerMydbops
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMongoDB
 
State transfer With Galera
State transfer With GaleraState transfer With Galera
State transfer With GaleraMydbops
 

Mais procurados (20)

Solving PostgreSQL wicked problems
Solving PostgreSQL wicked problemsSolving PostgreSQL wicked problems
Solving PostgreSQL wicked problems
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0Live traffic capture and replay in cassandra 4.0
Live traffic capture and replay in cassandra 4.0
 
MongoDB Fundamentals
MongoDB FundamentalsMongoDB Fundamentals
MongoDB Fundamentals
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
 
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at ScaleSmart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
 
[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning[Pgday.Seoul 2020] SQL Tuning
[Pgday.Seoul 2020] SQL Tuning
 
JCR - Java Content Repositories
JCR - Java Content RepositoriesJCR - Java Content Repositories
JCR - Java Content Repositories
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
 
MongoDB Aggregation Performance
MongoDB Aggregation PerformanceMongoDB Aggregation Performance
MongoDB Aggregation Performance
 
Introduction to JCR
Introduction to JCR Introduction to JCR
Introduction to JCR
 
MongoDB 101
MongoDB 101MongoDB 101
MongoDB 101
 
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective IndexingMongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory Optimization
 
Creating Continuously Up to Date Materialized Aggregates
Creating Continuously Up to Date Materialized AggregatesCreating Continuously Up to Date Materialized Aggregates
Creating Continuously Up to Date Materialized Aggregates
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizer
 
Sql query patterns, optimized
Sql query patterns, optimizedSql query patterns, optimized
Sql query patterns, optimized
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
State transfer With Galera
State transfer With GaleraState transfer With Galera
State transfer With Galera
 

Semelhante a Grouping and Joining in Lucene/Solr

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDBPablo Godel
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesPrateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy colodaKBTNHKU
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystemChris Mills
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionGudmundur Thorisson
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Pablo Godel
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing VU University Amsterdam
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsOlafGoerlitz
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentMartin Kaltenböck
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentialsreeep
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataPrateek Jain
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS projectRafael C. Jimenez
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...Cheryl McKinnon
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedB. Hamilton
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
 

Semelhante a Grouping and Joining in Lucene/Solr (20)

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDB
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
 
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and QueryingPrateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
 
PhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek JainPhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy coloda
 
NoTube: Models & Semantics
NoTube: Models & SemanticsNoTube: Models & Semantics
NoTube: Models & Semantics
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystem
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout session
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable development
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentials
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS project
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar Updated
 
No More SQL
No More SQLNo More SQL
No More SQL
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge Graphs
 

Mais de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Último

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Último (20)

Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

Grouping and Joining in Lucene/Solr

  • 1. Grouping & Joining Martijn van Groningen martijn.vangroningen@searchworkings.com Lucene Committer & PMC Member Thursday, May 17, 2012
  • 2. Grouping & Joining Overview ‣ Background ‣ Joining ‣ Result grouping ‣ Conclusion Searchworkings.org - The online search community 2 Thursday, May 17, 2012
  • 3. Background Lucene’s model ‣ Lucene is document based. ‣ Lucene doesn’t store information about relations between documents. ‣ Data often holds relations. ‣ Good free text search over relational data. Searchworkings.org - The online search community 3 Thursday, May 17, 2012
  • 4. Background Example ‣ Product ‣ Name ‣ Description ‣ Product-item ‣ Color ‣ Size ‣ Price ‣ Goal: Show the most applicable product based on product-item criteria. Searchworkings.org - The online search community 4 Thursday, May 17, 2012
  • 5. Background Common Lucene solutions ‣ Compound documents. ‣ May result in documents with many fields. ‣ Subsequent searches. ‣ May cause a lot network overhead. ‣ Non Lucene based approach: ‣ If free text search isn’t very important use a relational database. Searchworkings.org - The online search community 5 Thursday, May 17, 2012
  • 6. Background Example domain ‣ Compound Product & Product-items document. ‣ Each product-item has its own field prefix. Searchworkings.org - The online search community 6 Thursday, May 17, 2012
  • 7. Background Different solutions ‣ Lucene offers solutions to have a 'relational' like search. ‣ Parent child ‣ Grouping & joining aren't naturally supported. ‣ All the solutions do increase the search time. ‣ Some scenarios grouping and joining isn't the right solution. Searchworkings.org - The online search community 7 Thursday, May 17, 2012
  • 8. Joining Modelling relations Thursday, May 17, 2012
  • 9. Joining Introduction ‣ Support for parent child like search from Lucene 3.4 ‣ Not a SQL join. ‣ The parent and each children are stored as documents. ‣ Two types: ‣ Index time join ‣ Query time join Searchworkings.org - The online search community 9 Thursday, May 17, 2012
  • 10. Joining Index time join ‣ Two block join queries: ‣ ToParentBlockJoinQuery ‣ ToChildBlockJoinQuery ‣ One Lucene collector: ‣ ToParentBlockJoinCollector ‣ Index time join requires block indexing. Searchworkings.org - The online search community 10 Thursday, May 17, 2012
  • 11. Joining Block indexing ‣ Atomically adding documents. ‣ A block of documents. ‣ Each document gets sequentially assigned Lucene document id. ‣ IndexWriter#addDocuments(docs); Searchworkings.org - The online search community 11 Thursday, May 17, 2012
  • 12. Joining Block indexing ‣ Index doesn't record blocks. ‣ Segment merging doesn’t re-order documents in a segment. ‣ App is responsible for identifying block documents. ‣ Marking the last document in a block. ‣ Adding a document to a block requires you to reindex the whole block. ‣ Removing a document from a block doesn’t requires reindexing a block. Searchworkings.org - The online search community 12 Thursday, May 17, 2012
  • 13. Joining Example domain ‣ Parent is the last document in a block. Searchworkings.org - The online search community 13 Thursday, May 17, 2012
  • 14. Joining Block indexing Marking parent documents Searchworkings.org - The online search community 14 Thursday, May 17, 2012
  • 15. Joining Block indexing Add block Add block Searchworkings.org - The online search community 15 Thursday, May 17, 2012
  • 16. Joining ToParentBlockJoinQuery ‣ Parent filter marks the parent documents. ‣ Child query is executed in the parent space. ‣ ToChildBlockJoinQuery works in the opposite direction. Searchworkings.org - The online search community 16 Thursday, May 17, 2012
  • 17. Joining Query time joining ‣ Query time joining is executed in two phases and is field based: ‣ fromField ‣ toField ‣ Doesn’t require block indexing. Searchworkings.org - The online search community 17 Thursday, May 17, 2012
  • 18. Joining Query time joining ‣ First phase collects all the terms in the fromField for the documents that match with the original query. ‣ Currently doesn’t take the score from original query into account. ‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField. ‣ Two different implementations: ‣ JoinUtil - Lucene (≥ 3.6) ‣ Join query parser - Solr (trunk) Searchworkings.org - The online search community 18 Thursday, May 17, 2012
  • 19. Joining Query time joining - Indexing Referrer the product id. Searchworkings.org - The online search community 19 Thursday, May 17, 2012
  • 20. Joining Query time joining - Indexing Searchworkings.org - The online search community 20 Thursday, May 17, 2012
  • 21. Joining Query time joining ‣ Result will contain one product. ‣ Possible to join over two indices. Searchworkings.org - The online search community 21 Thursday, May 17, 2012
  • 22. Joining Final thoughts ‣ Joining module has good solutions to model parent child relations. ‣ Use block join if you care about scoring. ‣ Frequent updates can be problematic. ‣ Use query time join for parent child filtering. ‣ Query time join is slower than index time join. ‣ Mostly a Lucene feature only. ‣ All code is annotated as experimental. Searchworkings.org - The online search community 22 Thursday, May 17, 2012
  • 23. Result grouping Previously known as Field Collapsing. Thursday, May 17, 2012
  • 24. Result grouping Introduction ‣ Group matching documents that share a common property. ‣ Search hit represents a group. ‣ Facet counts & total hit count represent groups. ‣ Per group collect information ‣ Most relevant document. ‣ Top three documents. ‣ Aggregated counts Searchworkings.org - The online search community 24 Thursday, May 17, 2012
  • 25. Result grouping Usages ‣ Group documents by a shared property ‣ Product-item by product id (Parent child) ‣ Collapse similar looking documents ‣ E.g. all results from the Wikipedia domains. ‣ Remove duplicates from the search result. ‣ Based on a field that contains a hash Searchworkings.org - The online search community 25 Thursday, May 17, 2012
  • 26. Result grouping Example domain ‣ Each Product-item is a document, but includes the product data. Searchworkings.org - The online search community 26 Thursday, May 17, 2012
  • 27. Result grouping Implementation ‣ Result grouping implemented with Lucene collectors. ‣ Module in trunk and a contrib in 3.x versions. ‣ Two pass result grouping. ‣ Grouping by indexed field, function or doc values. ‣ Single pass result grouping. ‣ Requires block indexing. Searchworkings.org - The online search community 27 Thursday, May 17, 2012
  • 28. Result grouping Two pass implementation ‣ First pass collects the top N groups. ‣ Per group: group value + sort value ‣ Second pass collects data for each top group. ‣ The top N documents per group. ‣ Possible other aggregated information. ‣ Second pass search ignores all documents outside topN groups. Searchworkings.org - The online search community 28 Thursday, May 17, 2012
  • 29. Result grouping Result grouping - Indexing Searchworkings.org - The online search community 29 Thursday, May 17, 2012
  • 30. Result grouping Result grouping - Searching Searchworkings.org - The online search community 30 Thursday, May 17, 2012
  • 31. Result grouping Result grouping made easier ‣ GroupingSearch ‣ Solr ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id ‣ Many more options: ‣ http://wiki.apache.org/solr/FieldCollapsing Searchworkings.org - The online search community 31 Thursday, May 17, 2012
  • 32. Result grouping Parent child result ‣ TopGroups - Equivalent to TopDocs. ‣ Hit count ‣ Group count ‣ Groups ‣ Top documents ‣ Facet and total count can represent groups instead of documents. ‣ But requires more query time. Searchworkings.org - The online search community 32 Thursday, May 17, 2012
  • 33. Conclusion Compare... Thursday, May 17, 2012
  • 34. Conclusion Compare the parent child solutions ‣ Result grouping ‣ + Distributed support & Parent child relation as hit. ‣ - Parent data duplication ‣ - Impact on query time ‣ Joining ‣ + Fast & no data duplication ‣ - Index time join not optimal for updates ‣ - Query time join is limited. Searchworkings.org - The online search community 34 Thursday, May 17, 2012
  • 35. Conclusion Compare the parent child solutions ‣ Compound documents. ‣ + Fast and works out-of-the box with all features. ‣ - Not flexible when it comes to updates. ‣ - Document granularity is set in stone. Searchworkings.org - The online search community 35 Thursday, May 17, 2012
  • 36. Any questions? 36 Thursday, May 17, 2012
  • 37. Extra slides We have time left! Thursday, May 17, 2012
  • 38. Conclusion Future work ‣ Higher level parent-child API. ‣ Needs to cover search & indexing. ‣ Joining ‣ Distributed support. ‣ Represent a hit as a parent child relation in the search result. ‣ Result grouping ‣ Aggregated grouped information like: sum, avg, min, max etc. Searchworkings.org - The online search community 38 Thursday, May 17, 2012
  • 39. Joining ToParentBlockJoinCollector ‣ TopGroups contains a group per top N parent document. ‣ Each group contains a parent and child documents. Searchworkings.org - The online search community 39 Thursday, May 17, 2012
  • 40. Result grouping Groups & facet counts ‣ Faceting and result grouping are different features. ‣ But are often used together! ‣ Facet counts can be based on: ‣ Found documents. ‣ Found groups. ‣ Combination of facet value and group. ‣ All options are supported in Solr. Searchworkings.org - The online search community 40 Thursday, May 17, 2012
  • 41. Result grouping Doc values ‣ Doc values / Column Stride values ‣ Prevents the creation of expensive data structures in FieldCache. ‣ Inverted index is meant for free text search. ‣ All grouping collectors have doc values based implementations! Searchworkings.org - The online search community 41 Thursday, May 17, 2012