SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
Extending Solr: Building a Cloud-like
   Knowledge Discovery Platform


       Trey Grainger,,CareerBuilder
Overview
CareerBuilder’s Cloud-like Knowledge Discovery Platform
   • Scalable approaches to multi-lingual text
     analysis (with research study)
       Multiple fields vs Multiple Cores vs Single Field

   • Custom Scoring
       Payloads and on-the-fly bucket scoring
       Implementing a keyword spamming penalty

   • Solr as a Cloud Service
       Scalable, customizable search for everybody

   • Knowledge Discovery & Data Analytics
My background
 Trey Grainger
   • Search Technology Development Team Lead
     @ CareerBuilder.com


 Relevant Background:
   • Search & Recommendations
   • High-volume, N-tier Architectures
   • NLP, Relevancy Tuning, user group testing & machine
     learning


 Fun Side Project:
   • Founder and Site Architect @ Celiaccess.com
CareerBuilder’s Search Scale
 Over 1 million new jobs each month
 Over 40 million resumes
 ~150 globally distributed search servers
  (in the U.S., Europe, & Asia)
 Several thousand unique, dynamically generated
  indexes
 Over a million searches an hour
 >100 Million Search Documents
Job Search
Resume Search
Talent Network Search
Auto-Complete
Geo-spatial Search
Recommendations
 We classify all content (Jobs, Resumes, etc.) and index
  the classified content into Solr

 We use a combination of collaborative filtering and
  classification techniques

 We utilize a custom scorer and payloads to apply
  higher bucket weights to more relevant content

 Recommendations are real-time and largely driven by
  search
Job Recommendations
Resume Recommendations
Multi-lingual Analysis
 Approach 1: Different Field Per Language
   •   Advantages:
         Simple, easiest to implement
   •   Disadvantages:
         My require keeping duplicate copies of your text per language
         If searching across each field (dismax style), slows search down, especially if
          handling many languages


 Approach 2: Different Solr Core per language
   Each core has your field defined with a different Analyzer chain
   specific to that core’s language
   •   Advantages:
         Searching can be completely language-agnostic and additional overhead to search
          more languages simultaneously is negligible
   •   Disadvantages:
         Multi-lingual documents require indexing to multiple cores, potentially messing up
          relevancy and adding complexity
         Have to write your own language-dependent sharding
         If you don’t already have distributed search, this adds complexity and overhead
Multi-lingual Analysis
   Approach 3: All languages in one field
         •   Advantages:
               Only one field needed regardless of number of languages
               Avoids a field explosion or a Solr core explosion as you scale to handle more languages

         •   Disadvantages:
               Can end up with some “noise” in the index if you process most text in lots of languages
                (especially if stemming and not lemmatizing)
               Currently requires writing your own Tokenizer or Filter


   Strategy:
         •   1) Copy token stream and create a stemmer/lemmatizer for each language
             2) Pass the original into each stemmer/lemmatizer
             3) Stack the outputs of each stemmer/lemmatizer

Input:
Output:
Multi-lingual Analysis
 Case Study: Stemming vs. Lemmatization
  •   Example: dries >> dri vs dries >> dry




                          Measuring Recall Overlap Between Options

  Take-away: Lemmatization allows you to greatly increase recall while
  preserving the precision you lose with stemming

  i.e. English shows 92% increase in recall using Lemmatization with
  minimal impact on precision
Custom Scoring
   Search Terms can be boosted differently:
     •   q=web^2 development^5 AND jobtitle:(software engineer)^10


   Some Fields can be weighted (scored) higher than others
     •   i.e. Field1^10, Field2^5, Field3^2, Field 4^.01

   Content within Fields can be boosted differently
     •   design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten[3] / years[3] / experience[3] /
         careerbuilder [2] / design [2], …
         Field1: bucket=[1] boost=10; Field2: bucket=[2] boost=1.5; Field3: bucket=[] weight=1; Field4: bucket=3
         weight=1.5


     •   We can pass in a parameter to solr at query time specifying the boost to apply to each
         bucket i.e. …&bucketWeights=1:10;2:1.5;3:1.5

   You can also do index-time boosting, but this reduces your ability to do query-side
    relevancy experiments and requires norms to always be on

   By making all scoring parameters overridable at query time, we are able to do A / B
    testing to consistently improve our relevancy model
Stopping Keyword Spamming
 We already subclass PayloadTermQuery and tie in custom scoring
  for our buckets weights

 For each payload “bucket” (or across all buckets), we can count
  the number of hits and penalize the score if a particular keyword
  appears too many times

 Payload scoring then essentially becomes
    •   BucketBoost(payloadBucket) * HitMap(#hitsPerbucket)


 By adjusting our HitMap function, we can thus generate any kind of
  relevancy curve for how much each additional term adds to (or
  subtracts from) the relevancy score for that document
    •   ex: Bell curve, Linear, Bi-linear, Linear with drop-off, custom map, etc.
CareerBuilder’s Search Cloud
 Goals:
  • Make search easy to use and accessible to all engineers (not
    just the search team)

  • Allow schema changes without mucking with solr (on hundreds
    of servers)

  • Make solr installs generic and independent of any particular
    implementation
Creating a virtual search engine
 3 Main Cloud Actions: Index, Search, Delete
Creating a virtual search engine
 Creating a Schema
Creating a virtual search engine
 Creating a Document




 Processing Results
  •   A QueryResult object comes back from the SearchEngine.Search method with all of
      the main types (search records, facets, meta info, etc) parsed out into objects



 Behind the Scenes:
  •   We have a distributed architecture handling queuing all documents to
      appropriate datacenters, feeding the clusters, and load-balancing
      searches between all available clusters for the given search pool.
Knowledge Discovery & Data Analytics
Knowledge Discovery & Data Analytics
Knowledge Discovery & Data Analytics




                                       25
Knowledge Discovery & Data Analytics
Knowledge Discovery & Data Analytics
Knowledge Discovery & Data Analytics
Clustering: Nursing
Clustering: .Net
Clustering: Hyperion Developer
Take Aways
 Know how your linguistics affect precision and recall
  and choose wisely; know how to tweak for your domain.

 A flexible software api that turn Solr into a SAAS type
  cloud app can greatly increase agility and adoption of
  search.

 Search isn’t just about finding and navigating content…
  it can be used to learn from and create it, as well.
Contact
 Trey Grainger
       • trey.grainger@careerbuilder.com
       • http://www.careerbuilder.com

Mais conteúdo relacionado

Mais procurados

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 
Building a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsBuilding a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsOpenSource Connections
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
1. python for data science
1. python for data science1. python for data science
1. python for data scienceDr Nisha Arora
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetLucidworks
 
Eventually Elasticsearch: Eventual Consistency in the Real World
Eventually Elasticsearch: Eventual Consistency in the Real WorldEventually Elasticsearch: Eventual Consistency in the Real World
Eventually Elasticsearch: Eventual Consistency in the Real WorldBeyondTrees
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildSujit Pal
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedBeyondTrees
 
Use cases for cassandra in federal and state government
Use cases for cassandra in federal and state governmentUse cases for cassandra in federal and state government
Use cases for cassandra in federal and state governmentOpenSource Connections
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSujit Pal
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
Elasticsearch Introduction to Data model, Search & Aggregations
Elasticsearch Introduction to Data model, Search & AggregationsElasticsearch Introduction to Data model, Search & Aggregations
Elasticsearch Introduction to Data model, Search & AggregationsAlaa Elhadba
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSujit Pal
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowDatabricks
 
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkWebinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkLucidworks
 
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...Naoki (Neo) SATO
 
Search domain basics
Search domain basicsSearch domain basics
Search domain basicspmanvi
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 

Mais procurados (20)

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Building a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patentsBuilding a lightweight discovery interface for Chinese patents
Building a lightweight discovery interface for Chinese patents
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
1. python for data science
1. python for data science1. python for data science
1. python for data science
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Eventually Elasticsearch: Eventual Consistency in the Real World
Eventually Elasticsearch: Eventual Consistency in the Real WorldEventually Elasticsearch: Eventual Consistency in the Real World
Eventually Elasticsearch: Eventual Consistency in the Real World
 
Learning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search GuildLearning to Rank Presentation (v2) at LexisNexis Search Guild
Learning to Rank Presentation (v2) at LexisNexis Search Guild
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Use cases for cassandra in federal and state government
Use cases for cassandra in federal and state governmentUse cases for cassandra in federal and state government
Use cases for cassandra in federal and state government
 
Search summit-2018-ltr-presentation
Search summit-2018-ltr-presentationSearch summit-2018-ltr-presentation
Search summit-2018-ltr-presentation
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Elasticsearch Introduction to Data model, Search & Aggregations
Elasticsearch Introduction to Data model, Search & AggregationsElasticsearch Introduction to Data model, Search & Aggregations
Elasticsearch Introduction to Data model, Search & Aggregations
 
Search summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slidesSearch summit-2018-content-engineering-slides
Search summit-2018-content-engineering-slides
 
Best Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflowBest Practices for Hyperparameter Tuning with MLflow
Best Practices for Hyperparameter Tuning with MLflow
 
Solr 101
Solr 101Solr 101
Solr 101
 
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & SparkWebinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
 
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
[db tech showcase Tokyo 2019] Azure Cosmos DB Deep Dive ~ Partitioning, Globa...
 
Search domain basics
Search domain basicsSearch domain basics
Search domain basics
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 

Semelhante a Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platform, By Trey Grainger

Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformTrey Grainger
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Lucidworks
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Hady Elsahar
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search TrainingCloudera, Inc.
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesInductive Automation
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesInductive Automation
 
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...Marina Peregud
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorizationAndreas Loupasakis
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization Warply
 
Software design with Domain-driven design
Software design with Domain-driven design Software design with Domain-driven design
Software design with Domain-driven design Allan Mangune
 
Azure CosmosDb - Where we are
Azure CosmosDb - Where we areAzure CosmosDb - Where we are
Azure CosmosDb - Where we areMarco Parenzan
 
Distributed teams
Distributed teamsDistributed teams
Distributed teamsKush Shah
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewLucidworks
 

Semelhante a Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platform, By Trey Grainger (20)

Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best Practices
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best Practices
 
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...
Антон Бойко "Разделяй и властвуй — набор практик для построения масштабируемо...
 
Automated product categorization
Automated product categorizationAutomated product categorization
Automated product categorization
 
Automated product categorization
Automated product categorization   Automated product categorization
Automated product categorization
 
Software design with Domain-driven design
Software design with Domain-driven design Software design with Domain-driven design
Software design with Domain-driven design
 
Azure CosmosDb - Where we are
Azure CosmosDb - Where we areAzure CosmosDb - Where we are
Azure CosmosDb - Where we are
 
Distributed teams
Distributed teamsDistributed teams
Distributed teams
 
Distributed_teams
Distributed_teamsDistributed_teams
Distributed_teams
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 

Mais de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Último (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platform, By Trey Grainger

  • 1. Extending Solr: Building a Cloud-like Knowledge Discovery Platform Trey Grainger,,CareerBuilder
  • 2. Overview CareerBuilder’s Cloud-like Knowledge Discovery Platform • Scalable approaches to multi-lingual text analysis (with research study)  Multiple fields vs Multiple Cores vs Single Field • Custom Scoring  Payloads and on-the-fly bucket scoring  Implementing a keyword spamming penalty • Solr as a Cloud Service  Scalable, customizable search for everybody • Knowledge Discovery & Data Analytics
  • 3. My background  Trey Grainger • Search Technology Development Team Lead @ CareerBuilder.com  Relevant Background: • Search & Recommendations • High-volume, N-tier Architectures • NLP, Relevancy Tuning, user group testing & machine learning  Fun Side Project: • Founder and Site Architect @ Celiaccess.com
  • 4. CareerBuilder’s Search Scale  Over 1 million new jobs each month  Over 40 million resumes  ~150 globally distributed search servers (in the U.S., Europe, & Asia)  Several thousand unique, dynamically generated indexes  Over a million searches an hour  >100 Million Search Documents
  • 10. Recommendations  We classify all content (Jobs, Resumes, etc.) and index the classified content into Solr  We use a combination of collaborative filtering and classification techniques  We utilize a custom scorer and payloads to apply higher bucket weights to more relevant content  Recommendations are real-time and largely driven by search
  • 13. Multi-lingual Analysis  Approach 1: Different Field Per Language • Advantages:  Simple, easiest to implement • Disadvantages:  My require keeping duplicate copies of your text per language  If searching across each field (dismax style), slows search down, especially if handling many languages  Approach 2: Different Solr Core per language Each core has your field defined with a different Analyzer chain specific to that core’s language • Advantages:  Searching can be completely language-agnostic and additional overhead to search more languages simultaneously is negligible • Disadvantages:  Multi-lingual documents require indexing to multiple cores, potentially messing up relevancy and adding complexity  Have to write your own language-dependent sharding  If you don’t already have distributed search, this adds complexity and overhead
  • 14. Multi-lingual Analysis  Approach 3: All languages in one field • Advantages:  Only one field needed regardless of number of languages  Avoids a field explosion or a Solr core explosion as you scale to handle more languages • Disadvantages:  Can end up with some “noise” in the index if you process most text in lots of languages (especially if stemming and not lemmatizing)  Currently requires writing your own Tokenizer or Filter  Strategy: • 1) Copy token stream and create a stemmer/lemmatizer for each language 2) Pass the original into each stemmer/lemmatizer 3) Stack the outputs of each stemmer/lemmatizer Input: Output:
  • 15. Multi-lingual Analysis  Case Study: Stemming vs. Lemmatization • Example: dries >> dri vs dries >> dry Measuring Recall Overlap Between Options Take-away: Lemmatization allows you to greatly increase recall while preserving the precision you lose with stemming i.e. English shows 92% increase in recall using Lemmatization with minimal impact on precision
  • 16. Custom Scoring  Search Terms can be boosted differently: • q=web^2 development^5 AND jobtitle:(software engineer)^10  Some Fields can be weighted (scored) higher than others • i.e. Field1^10, Field2^5, Field3^2, Field 4^.01  Content within Fields can be boosted differently • design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten[3] / years[3] / experience[3] / careerbuilder [2] / design [2], … Field1: bucket=[1] boost=10; Field2: bucket=[2] boost=1.5; Field3: bucket=[] weight=1; Field4: bucket=3 weight=1.5 • We can pass in a parameter to solr at query time specifying the boost to apply to each bucket i.e. …&bucketWeights=1:10;2:1.5;3:1.5  You can also do index-time boosting, but this reduces your ability to do query-side relevancy experiments and requires norms to always be on  By making all scoring parameters overridable at query time, we are able to do A / B testing to consistently improve our relevancy model
  • 17. Stopping Keyword Spamming  We already subclass PayloadTermQuery and tie in custom scoring for our buckets weights  For each payload “bucket” (or across all buckets), we can count the number of hits and penalize the score if a particular keyword appears too many times  Payload scoring then essentially becomes • BucketBoost(payloadBucket) * HitMap(#hitsPerbucket)  By adjusting our HitMap function, we can thus generate any kind of relevancy curve for how much each additional term adds to (or subtracts from) the relevancy score for that document • ex: Bell curve, Linear, Bi-linear, Linear with drop-off, custom map, etc.
  • 18. CareerBuilder’s Search Cloud  Goals: • Make search easy to use and accessible to all engineers (not just the search team) • Allow schema changes without mucking with solr (on hundreds of servers) • Make solr installs generic and independent of any particular implementation
  • 19. Creating a virtual search engine  3 Main Cloud Actions: Index, Search, Delete
  • 20. Creating a virtual search engine  Creating a Schema
  • 21. Creating a virtual search engine  Creating a Document  Processing Results • A QueryResult object comes back from the SearchEngine.Search method with all of the main types (search records, facets, meta info, etc) parsed out into objects  Behind the Scenes: • We have a distributed architecture handling queuing all documents to appropriate datacenters, feeding the clusters, and load-balancing searches between all available clusters for the given search pool.
  • 22. Knowledge Discovery & Data Analytics
  • 23. Knowledge Discovery & Data Analytics
  • 24. Knowledge Discovery & Data Analytics 25
  • 25. Knowledge Discovery & Data Analytics
  • 26. Knowledge Discovery & Data Analytics
  • 27. Knowledge Discovery & Data Analytics
  • 31. Take Aways  Know how your linguistics affect precision and recall and choose wisely; know how to tweak for your domain.  A flexible software api that turn Solr into a SAAS type cloud app can greatly increase agility and adoption of search.  Search isn’t just about finding and navigating content… it can be used to learn from and create it, as well.
  • 32. Contact  Trey Grainger • trey.grainger@careerbuilder.com • http://www.careerbuilder.com