Lucene And Solr Document
Classification
Alessandro Benedetti, Software Engineer, Sease Ltd.
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Passionate about Semantic, NLP and Machine Learning technologies
● Beach Volleyball Player & Snowboarder
Who I am
● Classification
● Lucene Approach
● Solr Integration
● Demo
● Extensions
● Future Work
Agenda
“Classification is the problem of identifying
to which of a set of categories
(sub-populations) a new observation
belongs, on the basis of a training set of data
containing observations (or instances)
whose category membership is known.”
Wikipedia
Classification
● E-mail spam filter
● Document categorization
● Sexually explicit content detection
● Medical diagnosis
● E-commerce
● Language identification
Real World Use Cases
● Supervised learning
● Labelled training samples
● Documents modelled as feature vectors
● Term occurrences as features
● Model predicts the label of unseen documents
Basics Of Text Classification
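To make "documents modelled as feature vectors" concrete, here is a minimal sketch in Python. It assumes a naive tokenizer (lowercasing and splitting on non-alphanumerics), which is far simpler than a real Lucene analysis chain; the function name `term_vector` is hypothetical.

```python
import re
from collections import Counter


def term_vector(text):
    """Turn raw text into a bag-of-words vector: term -> occurrence count.

    Assumption: a crude tokenizer stands in for proper text analysis.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


vec = term_vector("The Force is strong; the Force is with you")
print(vec["force"], vec["strong"], vec["jedi"])
```

Terms that never occur simply have a count of zero, so every document can be compared over the same vocabulary.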
Apache Lucene
Apache Lucene™
is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search,
especially cross-platform.
Apache Lucene is an open source project available for free download.
● A Lucene index has complex data structures
● Many organizations already have indexes in place
● Pre-existing data can be used to classify
● No need to train a model from a separate training set
● From training set to inverted index
Apache Lucene For Classification
● Advanced configurable text analysis
● Term frequencies
● Term positions
● Document frequencies
● Norms
● Part-of-speech tags and custom payloads
Apache Lucene For Classification
● Given an index with labelled documents
● Each document has a class field
● Given an unknown document in input
● Given a set of relevant fields
● Search the top K most similar documents
● Fetch the classes from the retrieved documents
● Return most occurring class(es)
● Class ranking in retrieved documents is important!
K Nearest Neighbours
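The voting step can be sketched as follows. This is an illustration only, not the Lucene implementation: it assumes the search has already returned the neighbours as `(class, similarity score)` pairs, best match first, and it weights each vote by the score so that the ranking of the retrieved documents matters. The function name `knn_classify` and the sample scores are hypothetical.

```python
from collections import defaultdict


def knn_classify(neighbours, k, weighted=True):
    """Pick a class from the top-k neighbours.

    neighbours: list of (class_label, similarity_score), best match first.
    weighted=True sums similarity scores per class (assumed weighting scheme);
    weighted=False is a plain majority vote.
    """
    votes = defaultdict(float)
    for label, score in neighbours[:k]:
        votes[label] += score if weighted else 1.0
    if not votes:
        return None  # nothing retrieved, no class can be assigned
    return max(votes, key=votes.get)


hits = [("star-wars", 9.1), ("harry-potter", 4.0),
        ("harry-potter", 3.5), ("star-wars", 1.0)]
print(knn_classify(hits, k=3))                  # score-weighted vote
print(knn_classify(hits, k=3, weighted=False))  # plain majority vote
```

With these sample hits the two strategies disagree: one very close neighbour outweighs two weaker ones only when the score is taken into account.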
● KNN uses Lucene More Like This
● Lucene query component
● Extract interesting terms* from the input document fields
● Build a Lucene query
● Run the query against the search index
● Resulting documents are “the similar documents”
* an interesting term is a term:
- occurring frequently in the seed document (high term frequency)
- but quite rare in the corpus (high inverse document frequency)
More Like This
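A rough sketch of how interesting terms could be selected. It assumes a simplified tf * idf score with `idf = log(N / df)`; the actual scoring in Lucene's More Like This differs in detail. The `min_tf` and `min_df` thresholds are hypothetical parameters that loosely mirror the `knn.minTf` and `knn.minDf` settings shown in the configuration slides.

```python
import math
from collections import Counter


def interesting_terms(seed_tokens, doc_freq, num_docs,
                      max_terms=3, min_tf=1, min_df=1):
    """Rank the seed document's terms by tf * idf and keep the best ones.

    seed_tokens: tokens of the input document field
    doc_freq:    term -> number of indexed documents containing it
    num_docs:    total number of documents in the index
    """
    tf = Counter(seed_tokens)
    scored = []
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        # Skip terms that are too rare in the seed or (almost) absent from the corpus.
        if freq < min_tf or df < max(min_df, 1):
            continue
        idf = math.log(num_docs / df)
        scored.append((freq * idf, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_terms]]


seed = ["lightsaber", "the", "the", "the", "lightsaber", "jedi"]
df = {"the": 1000, "lightsaber": 10, "jedi": 20}
print(interesting_terms(seed, df, num_docs=1000, max_terms=2))
```

The very frequent word "the" scores zero because it appears in every document, while the rarer "lightsaber" and "jedi" are kept to build the query.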
Assumptions
● Term occurrences are probabilistically independent features
● Term positions are irrelevant (bag of words)
Calculate the probability score of each available class c
● Prior: P(c) = #docs in class c / #docs
● Likelihood: P(d|c) = P(t1, t2, ..., tn|c) = P(t1|c) * P(t2|c) * … * P(tn|c)
where, for a given term t,
P(t|c) = (TF(t) in documents of class c + 1) / (#terms in all documents of class c + #docs of class c)
Assign the top-scoring class
Naive Bayes Classifier
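The scoring above can be sketched in a few lines. This is an illustrative in-memory version, not the Lucene classifier: it assumes the training documents are already tokenized, keeps the counts in plain dictionaries instead of an inverted index, and sums logarithms rather than multiplying probabilities to avoid numeric underflow. The names `train` and `classify` and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: iterable of (tokens, class_label)."""
    docs_per_class = Counter()
    term_freq = defaultdict(Counter)  # class -> term -> occurrences
    for tokens, label in labelled_docs:
        docs_per_class[label] += 1
        term_freq[label].update(tokens)
    return docs_per_class, term_freq


def classify(tokens, docs_per_class, term_freq):
    """Return the class with the highest log(prior) + log(likelihood)."""
    total_docs = sum(docs_per_class.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in docs_per_class.items():
        score = math.log(n_docs / total_docs)  # prior
        # Denominator as stated on the slide: #terms in class + #docs in class.
        denominator = sum(term_freq[label].values()) + n_docs
        for t in tokens:
            score += math.log((term_freq[label][t] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    (["jedi", "lightsaber", "force"], "star-wars"),
    (["force", "sith", "jedi"], "star-wars"),
    (["wand", "hogwarts", "magic"], "harry-potter"),
])
print(classify(["jedi", "force"], *model))
print(classify(["wand"], *model))
```

The "+ 1" in the numerator keeps an unseen term from zeroing out the whole product.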
● Documents are the Lucene unit of information
● Documents are a map field -> value
● Each field may be analysed differently
(different tokenization and token filtering)
● Each field may have a different weight for the classification
(affecting the similarity score differently)
Document Classification
Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project.
Its major features include powerful full-text search, hit highlighting, faceted
search and analytics, rich document parsing, geospatial search, extensive
REST APIs as well as parallel SQL.
Apache Solr
Index Time Integration - SOLR-7739
● Ingest the document
● Assign the class
● Set the class as a field value
● Index the document
Request Handler Integration (TO DO) - SOLR-7738
Return an assigned class :
● Given a text and a field
● Given an input document
● Given an indexed document id
Solr Integration
● Pipeline of processors
● Each single document flows through the chain
● Each processor is executed once
● Last processor triggers the update command
Update Request Processor Chain
● Update Component
● Configurable Singleton Factory
● Single instance per request thread
● Process a single Document
● SolrCloud compatible*
* Pre processor / Post processor
Update Request Processor
● Access the Index Reader
● A Lucene Document Classifier is instantiated
● A class is assigned by the classifier
● A new field is added to the original Document, with the class
● The document goes through the next processing steps
Classification Update Request Processor
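The flow of a document through the chain can be pictured with a small stand-alone sketch. It does not use the Solr API: documents are plain dictionaries, the classifier is any callable, and the classes `AssignClassProcessor` and `IndexProcessor` are hypothetical stand-ins for the classification processor and the final run-update step. Skipping documents that already carry a class is an assumption made here for illustration.

```python
class AssignClassProcessor:
    """Adds a class field to the document using the supplied classifier."""

    def __init__(self, classifier, class_field):
        self.classifier = classifier
        self.class_field = class_field

    def process(self, doc):
        # Assumption: a document that already has a class is left untouched.
        if self.class_field not in doc:
            doc[self.class_field] = self.classifier(doc)
        return doc


class IndexProcessor:
    """Last step of the chain: stores the document (stand-in for indexing)."""

    def __init__(self):
        self.index = []

    def process(self, doc):
        self.index.append(doc)
        return doc


def run_chain(processors, doc):
    """Pass one document through every processor, in order, exactly once."""
    for processor in processors:
        doc = processor.process(doc)
    return doc


indexer = IndexProcessor()
chain = [AssignClassProcessor(lambda d: "star-wars", "cat"), indexer]
run_chain(chain, {"id": "1", "title": "Who trained Yoda?"})
run_chain(chain, {"id": "2", "title": "Wand lore", "cat": "harry-potter"})
print(indexer.index)
```

Placing the classification step before the indexing step is what makes the assigned class part of the stored document.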
...
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="update.chain">classification</str>
  </lst>
</initParams>
...
Solrconfig.xml - Update Handler
...
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    ...
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...
Solrconfig.xml - Chain configuration
<processor class="solr.ClassificationUpdateProcessorFactory">
  <str name="inputFields">title^1.5,content,author</str>
  <str name="classField">cat</str>
  <str name="algorithm">knn</str>
  <str name="knn.k">20</str>
  <str name="knn.minTf">1</str>
  <str name="knn.minDf">5</str>
</processor>
N.B. classField must be stored
Solrconfig.xml - K nearest neighbour classifier config
<processor class="solr.ClassificationUpdateProcessorFactory">
  <str name="inputFields">title^1.5,content,author</str>
  <str name="classField">cat</str>
  <str name="algorithm">bayes</str>
</processor>
N.B. classField must be indexed (take care of analysis)
Solrconfig.xml - Naive Bayes classifier config
● Lucene >= 6.0
● Solr >= 6.1
● Classification needs a training set:
an index with initially human-assigned classes is required
Solr Classification - Important Notes
● Sci-Fi StackExchange dataset
● Roughly 18,000 questions and answers
● Roughly 6,000 tagged
● 70% training set + 30% test set
Solr Classification - Demo
● Index the training set documents
(this is our ground truth)
● Index the test set
(classification will happen automatically at indexing time)
● Evaluate the test set
(a simple Java app verifies that the automatically assigned classes
match the expected ones)
Solr Classification - Demo
For each class c:
● True Positive: c is predicted and c is the actual class
● False Positive: c is predicted but c is not the actual class
● True Negative: c is not predicted and c is not the actual class
● False Negative: c is not predicted but c is the actual class
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Solr Classification - System Evaluation Metrics
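As a quick check of the two formulas, the sketch below recomputes the per-class figures from the star-wars counts reported in the demo output for the full dataset with MaxOutputClasses 1 (TP 59, FP 75, FN 7). The helper names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp, fp):
    """Share of predictions for a class that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of the documents of a class that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Counts for {star-wars}, full dataset, MaxOutputClasses 1
print(round(precision(59, 75), 4))  # 59 / 134
print(round(recall(59, 7), 4))      # 59 / 66
```

These values agree with the 0.4403 precision and 0.8939 recall printed by the evaluation app for that class.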
MaxOutputClasses 1
[System Global Accuracy]0.5095676824946846
[System Globel Recall]0.2686846038863976
TP{star-wars}59
FP{star-wars}75
FN{star-wars}7
[Precision (of predicted)]{star-wars}0.44029850746268656
[Recall for class)]{star-wars}0.8939393939393939
TP{harry-potter}147
FP{harry-potter}137
FN{harry-potter}3
[Precision (of predicted)]{harry-potter}0.5176056338028169
[Recall for class]{harry-potter}0.98
Solr Classification - Demo - Full Dataset
MaxOutputClasses 5
[System Global Accuracy]0.20481927710843373
[System Globel Recall]0.5399850523168909
TP{star-wars}66
FP{star-wars}400
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.14163090128755365
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}584
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.20435967302452315
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Full Dataset
MaxOutputClasses 1
[System Global Accuracy]0.9907407407407407
[System Globel Recall]0.6750788643533123
TP{star-wars}64
FP{star-wars}0
FN{star-wars}2
[Precision (of predicted)]{star-wars}1.0
[Recall for class)]{star-wars}0.9696969696969697
TP{harry-potter}150
FP{harry-potter}2
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.9868421052631579
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
MaxOutputClasses 5
[System Global Accuracy]0.24259259259259258
[System Globel Recall]0.8264984227129337
TP{star-wars}66
FP{star-wars}52
FN{star-wars}0
[Precision (of predicted)]{star-wars}0.559322033898305
[Recall for class)]{star-wars}1.0
TP{harry-potter}150
FP{harry-potter}48
FN{harry-potter}0
[Precision (of predicted)]{harry-potter}0.7575757575757576
[Recall for class]{harry-potter}1.0
Solr Classification - Demo - Partial Dataset
Multi classes support
● Class field may be multi valued
● Assign multiple classes
● Not only the top scoring but top N (parameter)
Split human/auto assigned classes
● classTrainingField
● classOutputField
Default : use the same field
Solr Classification - Extensions SOLR-8871
Classification Context Filtering
● Reduce the document space to consider -> reduce the training set
● Useful when only a subset of the index may be interesting for
classification
● Consider only the human labelled documents as training data
Solr Classification - Extensions SOLR-8871
Individual Field Weighting
● When classifying, each field has a different importance
(e.g. title vs content)
● Set a different boost per field
● KNN compatible
● Bayes compatible
Solr Classification - Extensions SOLR-8871
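The per-field boost uses the `field^boost` notation seen in the `inputFields` setting (`title^1.5,content,author`). A minimal sketch of how such a string could be turned into weights follows; the helper name `parse_input_fields` is hypothetical, and treating a missing boost as 1.0 is an assumption.

```python
def parse_input_fields(spec):
    """Parse 'title^1.5,content,author' into {field: boost}.

    Assumption: a field without an explicit boost gets weight 1.0.
    """
    boosts = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, boost = item.partition("^")
        boosts[name] = float(boost) if boost else 1.0
    return boosts


weights = parse_input_fields("title^1.5,content,author")
print(weights)
```

A classifier can then multiply each field's contribution to the score by its weight, so a match in the title counts more than the same match in the content.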
● Numeric field support (KNN)
(Euclidean distance based)
● Lat/lon support (KNN)
(geo distance based)
● SolrCloud support
(use the entire sharded index as training set)
Solr Classification - Future Work
Questions?
● Special thanks to Tommaso Teofili,
the Apache committer who followed the development and made the
contributions possible.
● And to the audience :)

 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Apache Lucene/Solr Document Classification

  • 1. Lucene And Solr Document Classification Alessandro Benedetti, Software Engineer, Sease Ltd.
  • 2. Alessandro Benedetti ● Search Consultant ● R&D Software Engineer ● Master in Computer Science ● Apache Lucene/Solr Enthusiast ● Passionate about Semantic, NLP and Machine Learning Technologies ● Beach Volleyball Player & Snowboarder Who I am
  • 3. ● Classification ● Lucene Approach ● Solr Integration ● Demo ● Extensions ● Future Work Agenda
  • 4. “Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.” Wikipedia Classification
  • 5. ● E-mail spam filter ● Document categorization ● Sexually explicit content detection ● Medical diagnosis ● E-commerce ● Language identification Real World Use Cases
  • 6. ● Supervised learning ● Labelled training samples ● Documents modelled as feature vectors ● Term occurrences as features ● Model predicts the label of unseen documents Basics Of Text Classification
  • 7. Apache Lucene Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 8. ● A Lucene index has complex data structures ● Many organizations already have indexes in place ● Pre-existing data can be used to classify ● No need to train a model from a separate training set ● From training set to inverted index Apache Lucene For Classification
  • 9. ● Advanced configurable text analysis ● Term frequencies ● Term positions ● Document frequencies ● Norms ● Part of speech tags and custom payload Apache Lucene For Classification
  • 10. ● Given an index with labelled documents ● Each document has a class field ● Given an unknown input document ● Given a set of relevant fields ● Search the top K most similar documents ● Fetch the classes from the retrieved documents ● Return the most frequently occurring class(es) ● Class ranking in retrieved documents is important! K Nearest Neighbours
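
The voting step of the slide above can be sketched in plain Java. This is a minimal illustration, not Lucene's classifier: the class `KnnVoteSketch` and its method are hypothetical names, and it assumes the similarity search has already returned the class labels of the retrieved documents, best match first. Ties are resolved in favour of the class that appears earlier in the ranking, which is one simple way to honour the point that ranking matters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: ranks classes by how often they occur among the top-K neighbours. */
class KnnVoteSketch {

    /**
     * @param rankedNeighbourClasses class labels of the retrieved documents, best match first
     * @param k number of neighbours to consider
     * @return classes ordered by number of occurrences in the top K (ties keep retrieval order)
     */
    static List<String> rankClasses(List<String> rankedNeighbourClasses, int k) {
        // LinkedHashMap remembers the order in which each class was first seen
        Map<String, Integer> counts = new LinkedHashMap<>();
        int limit = Math.min(k, rankedNeighbourClasses.size());
        for (int i = 0; i < limit; i++) {
            counts.merge(rankedNeighbourClasses.get(i), 1, Integer::sum);
        }
        List<String> classes = new ArrayList<>(counts.keySet());
        // List.sort is stable, so equally frequent classes keep their first-seen order
        classes.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return classes;
    }

    public static void main(String[] args) {
        List<String> neighbours =
                List.of("star-wars", "harry-potter", "star-wars", "tolkien", "star-wars");
        // The first element is the assigned class; the rest are runner-up classes
        System.out.println(rankClasses(neighbours, 5));
    }
}
```

Returning the whole ordered list rather than a single label also covers the case where more than one class is wanted.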
  • 11. ● KNN uses Lucene More Like This ● Lucene query component ● Extract interesting terms* from the input document fields ● Build a Lucene query ● Run the query against the search index ● Resulting documents are “the similar documents” * an interesting term is a term: - occurring frequently in the seed document (high term frequency) - but quite rare in the corpus (high inverse document frequency) More Like This
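
The notion of an "interesting term" can be made concrete with a small sketch. It assumes a textbook TF * IDF score with IDF = ln(numDocs / docFreq); this is an illustrative formulation, not necessarily the exact scoring used by Lucene's More Like This. The class and parameter names are hypothetical, with `minTf` and `minDf` playing the role of frequency thresholds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: selects terms that are frequent in the seed document but rare in the corpus. */
class InterestingTermsSketch {

    static List<String> interestingTerms(List<String> seedTokens,
                                         Map<String, Integer> docFreq,
                                         int numDocs, int minTf, int minDf, int maxTerms) {
        // Term frequency inside the seed document
        Map<String, Integer> tf = new HashMap<>();
        for (String token : seedTokens) {
            tf.merge(token, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Skip terms below the thresholds, and terms unknown to the corpus
            if (df == 0 || e.getValue() < minTf || df < minDf) {
                continue;
            }
            double idf = Math.log((double) numDocs / df); // assumed IDF formulation
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        List<String> seed = List.of("jedi", "jedi", "the", "the", "the", "lightsaber");
        Map<String, Integer> docFreq = Map.of("jedi", 10, "the", 1000, "lightsaber", 5);
        // "the" occurs in every document, so its IDF is 0 and it ranks last
        System.out.println(interestingTerms(seed, docFreq, 1000, 1, 5, 2));
    }
}
```

The selected terms would then become the clauses of the query that is run against the index.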
  • 12. Assumptions ● Term occurrences are probabilistically independent features ● Term positions are irrelevant (bag of words) Calculate the probability score of each available class c ● Prior: P(c) = #docs in class c / #docs ● Likelihood: P(d|c) = P(t1, t2, ..., tn|c) = P(t1|c) * P(t2|c) * … * P(tn|c), where for a given term t, P(t|c) = (TF(t) in documents of class c + 1) / (#terms in all documents of class c + #docs of class c) Assign the top scoring class Naive Bayes Classifier
  • 13. ● Documents are the Lucene unit of information ● Documents are a map field -> value ● Each field may be analysed differently (different tokenization and token filtering) ● Each field may have a different weight for the classification (affecting the similarity score differently) Document Classification
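
The scoring described on the slide above can be sketched as follows. The code applies the prior and the smoothed per-term likelihood exactly as written on the slide, but sums logarithms instead of multiplying probabilities to avoid numeric underflow on long documents (a standard trick, not something the slide states). `NaiveBayesSketch` and `ClassStats` are hypothetical names; in a real setup the statistics would be read from the index.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the slide's Naive Bayes scoring, computed in log space. */
class NaiveBayesSketch {

    /** Per-class statistics (hypothetical holder; a real classifier would read these from the index). */
    static final class ClassStats {
        final int docsInClass;
        final long termsInClass;
        final Map<String, Integer> termFreq;

        ClassStats(int docsInClass, long termsInClass, Map<String, Integer> termFreq) {
            this.docsInClass = docsInClass;
            this.termsInClass = termsInClass;
            this.termFreq = termFreq;
        }
    }

    /** log P(c) + sum over terms of log P(t|c), with P(t|c) = (TF + 1) / (#terms in c + #docs in c). */
    static double logScore(List<String> tokens, ClassStats stats, int totalDocs) {
        double score = Math.log((double) stats.docsInClass / totalDocs); // prior
        for (String token : tokens) {
            int tf = stats.termFreq.getOrDefault(token, 0);
            score += Math.log((tf + 1.0) / (stats.termsInClass + stats.docsInClass));
        }
        return score;
    }

    /** Returns the class with the highest score, or null if no classes are given. */
    static String classify(List<String> tokens, Map<String, ClassStats> statsByClass, int totalDocs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, ClassStats> e : statsByClass.entrySet()) {
            double score = logScore(tokens, e.getValue(), totalDocs);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, ClassStats> stats = Map.of(
                "star-wars", new ClassStats(2, 10, Map.of("jedi", 4)),
                "harry-potter", new ClassStats(2, 10, Map.of("wand", 5)));
        System.out.println(classify(List.of("jedi", "jedi"), stats, 4));
    }
}
```

The `+ 1` in the numerator keeps a term unseen in a class from driving the whole product to zero.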
  • 14. Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Apache Solr
  • 15. Index Time Integration - SOLR-7739 ● Ingest the document ● Assign the class ● Set the class as a field value ● Index the document Request Handler Integration (TO DO) - SOLR-7738 Return an assigned class : ● Given a text and a field ● Given an input document ● Given an indexed document id Solr Integration
  • 16. ● Pipeline of processors ● Each single document flows through the chain ● Each processor is executed once ● Last processor triggers the update command Update Request Processor Chain
  • 17. ● Update Component ● Configurable Singleton Factory ● Single instance per request thread ● Process a single Document ● SolrCloud compatible* * Pre processor / Post processor Update Request Processor
  • 18. ● Access the Index Reader ● A Lucene Document Classifier is instantiated ● A class is assigned by the classifier ● A new field is added to the original Document, with the class ● The document goes through the next processing steps Classification Update Request Processor
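
The flow of a document through the chain can be illustrated with a toy model. Nothing below is Solr's API: the `Processor` interface, the document-as-map representation and the in-memory "index" are assumptions made for the sketch, and the classifier is passed in as a plain function. The sketch also assumes that a document arriving with a class already set (for example a human-labelled training document) is left untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model of a processor chain; not the Solr UpdateRequestProcessor API. */
class ProcessorChainSketch {

    /** One step of the chain, free to modify the document in place. */
    interface Processor {
        void process(Map<String, Object> doc);
    }

    /** Adds the predicted class as a field, unless the document already carries one (assumption). */
    static Processor classification(String classField,
                                    Function<Map<String, Object>, String> classifier) {
        return doc -> {
            if (!doc.containsKey(classField)) {
                doc.put(classField, classifier.apply(doc));
            }
        };
    }

    /** Last step: stores a copy of the document in the (in-memory) index. */
    static Processor runUpdate(List<Map<String, Object>> index) {
        return doc -> index.add(new HashMap<>(doc));
    }

    /** Each processor is executed once, in order, on the same document. */
    static void runChain(List<Processor> chain, Map<String, Object> doc) {
        for (Processor processor : chain) {
            processor.process(doc);
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> index = new ArrayList<>();
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Why did Anakin turn to the dark side?");
        // Stand-in classifier that always answers "star-wars"
        runChain(List.of(classification("cat", d -> "star-wars"), runUpdate(index)), doc);
        System.out.println(index);
    }
}
```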
  • 19. ... <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse"> <lst name="defaults"> <str name="df">text</str> <str name="update.chain">classification</str> </lst> </initParams> ... Solrconfig.xml - Update Handler
  • 20. ... <updateRequestProcessorChain name="classification"> <processor class="solr.ClassificationUpdateProcessorFactory"> ... </processor> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> ... Solrconfig.xml - Chain configuration
  • 21. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">knn</str> <str name="knn.k">20</str> <str name="knn.minTf">1</str> <str name="knn.minDf">5</str> </processor> N.B. classField must be stored Solrconfig.xml - K nearest neighbour classifier config
  • 22. <processor class="solr.ClassificationUpdateProcessorFactory"> <str name="inputFields">title^1.5,content,author</str> <str name="classField">cat</str> <str name="algorithm">bayes</str> </processor> N.B. classField must be Indexed (take care of analysis) Solrconfig.xml - Naive Bayes classifier config
  • 23. ● Lucene >= 6.0 ● Solr >= 6.1 ● Classification needs a training set -> An index with human-assigned classes is required to start with Solr Classification - Important Notes
  • 24. ● Sci-Fi StackExchange dataset ● Roughly 18,000 questions and answers ● Roughly 6,000 tagged ● 70% training set + 30% test set Solr Classification - Demo
  • 25. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple Java app verifies that the automatically assigned classes are consistent with what is expected) Solr Classification - Demo
  • 26. ● True Positive: Predicted class == actual class ● False Positive: Predicted class != actual class ● True Negative: Not predicted class != actual class ● False Negative: Not predicted class == actual class Precision = TP / (TP + FP) Recall = TP / (TP + FN) Solr Classification - System Evaluation Metrics
  • 27. ● Index the training set documents (this is our ground truth) ● Index the test set (classification will happen automatically at indexing time) ● Evaluate the test set (a simple Java app verifies that the automatically assigned classes are consistent with what is expected) Solr Classification - Demo
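
As a worked example of the two formulas, the sketch below computes precision and recall from raw counts. The class name is hypothetical, and returning 0 when a denominator is zero is a convention chosen for the sketch. The counts used in `main` (TP = 59, FP = 75, FN = 7) are the `star-wars` figures from the full-dataset demo run with one output class, which give a precision of about 0.440 and a recall of about 0.894.

```java
/** Hypothetical helper computing per-class precision and recall from raw counts. */
class PrecisionRecallSketch {

    /** Precision = TP / (TP + FP); defined as 0 here when nothing was predicted. */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Recall = TP / (TP + FN); defined as 0 here when the class never occurs. */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        int tp = 59, fp = 75, fn = 7; // star-wars, full dataset, MaxOutputClasses 1
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("recall    = " + recall(tp, fn));
    }
}
```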
  • 28. MaxOutputClasses 1 [System Global Accuracy]0.5095676824946846 [System Globel Recall]0.2686846038863976 TP{star-wars}59 FP{star-wars}75 FN{star-wars}7 [Precision (of predicted)]{star-wars}0.44029850746268656 [Recall for class)]{star-wars}0.8939393939393939 TP{harry-potter}147 FP{harry-potter}137 FN{harry-potter}3 [Precision (of predicted)]{harry-potter}0.5176056338028169 [Recall for class]{harry-potter}0.98 Solr Classification - Demo - Full Dataset
  • 29. MaxOutputClasses 5 [System Global Accuracy]0.20481927710843373 [System Globel Recall]0.5399850523168909 TP{star-wars}66 FP{star-wars}400 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.14163090128755365 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}584 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.20435967302452315 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Full Dataset
  • 30. MaxOutputClasses 1 [System Global Accuracy]0.9907407407407407 [System Globel Recall]0.6750788643533123 TP{star-wars}64 FP{star-wars}0 FN{star-wars}2 [Precision (of predicted)]{star-wars}1.0 [Recall for class)]{star-wars}0.9696969696969697 TP{harry-potter}150 FP{harry-potter}2 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.9868421052631579 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 31. MaxOutputClasses 5 [System Global Accuracy]0.24259259259259258 [System Globel Recall]0.8264984227129337 TP{star-wars}66 FP{star-wars}52 FN{star-wars}0 [Precision (of predicted)]{star-wars}0.559322033898305 [Recall for class)]{star-wars}1.0 TP{harry-potter}150 FP{harry-potter}48 FN{harry-potter}0 [Precision (of predicted)]{harry-potter}0.7575757575757576 [Recall for class]{harry-potter}1.0 Solr Classification - Demo - Partial Dataset
  • 32. Multi classes support ● Class field may be multi valued ● Assign multiple classes ● Not only the top scoring but top N (parameter) Split human/auto assigned classes ● classTrainingField ● classOutputField Default : use the same field Solr Classification - Extensions SOLR-8871
  • 33. Classification Context Filtering ● Reduce the document space to consider -> reduce the training set ● Useful when only a subset of the index may be interesting for classification ● Consider only the human labelled documents as training data Solr Classification - Extensions SOLR-8871
  • 34. Individual Field Weighting ● When classifying, each field has a different importance, e.g. title vs content ● Set a different boost per field ● KNN compatible ● Bayes compatible Solr Classification - Extensions SOLR-8871
  • 35. ● Numeric field support (KNN) (Euclidean distance based) ● Lat/lon support (KNN) (geo distance based) ● SolrCloud support (use the entire sharded index as training set) Solr Classification - Future Work
  • 37. ● Special thanks to Tommaso Teofili, Apache committer, who followed the development and made the contributions possible. ● And to the audience :)
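
Per-field boosts are written in the processor configuration as a comma-separated list such as `title^1.5,content,author`. The sketch below shows one way such a string could be turned into a field-to-boost map. It is not Solr's parser: the class name is hypothetical, and treating a field without an explicit `^boost` as having boost 1.0 is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for an inputFields string such as "title^1.5,content,author". */
class FieldBoostSketch {

    static Map<String, Float> parse(String inputFields) {
        // LinkedHashMap keeps the fields in the order they were configured
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (String spec : inputFields.split(",")) {
            String trimmed = spec.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int caret = trimmed.indexOf('^');
            if (caret < 0) {
                boosts.put(trimmed, 1.0f); // assumed default boost
            } else {
                String field = trimmed.substring(0, caret);
                float boost = Float.parseFloat(trimmed.substring(caret + 1));
                boosts.put(field, boost);
            }
        }
        return boosts;
    }

    public static void main(String[] args) {
        System.out.println(parse("title^1.5,content,author"));
    }
}
```

With the map in hand, a match on `title` would contribute 1.5 times as much to the similarity score as an equally strong match on `content`.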