SlideShare uma empresa Scribd logo
1 de 30
Baixar para ler offline
Text categorization with
   Lucene and Solr
     Tommaso Teofili
     tommaso [at] apache [dot] org




                                     1
About me
l      ASF member having fun with:
  l    Lucene / Solr
  l    Hama
  l    UIMA
  l    Stanbol
  l    … some others
l      SW engineer @ Adobe R&D




                                      2
Agenda
l    Classification
l    Lucene classification module
l    Solr text categorization services
l    Conclusions




                                          3
Classification
l    Let the algorithm assign one or more labels
      (classes) to some item given some
      previous knowledge
      l    Spam filter
      l    Tagging system
      l    Digit recognition system
      l    Text categorization
      l    etc.




                                           4
Classification?
why with Lucene?




                   5
The short story
l      Lucene already has a lot of features for
        common information retrieval needs
  l    Postings
  l    Term vectors
  l    Statistics
  l    Positions
  l    TF / IDF
  l    maybe Payloads
  l    etc.
l      We may avoid bringing in new components
        to do classification just leveraging what we
        get for free from Lucene
                                              6
The (slightly) longer story #1
l      While playing with NLP stuff
l      Need to implement a naïve bayes classifier
  l    Not possible to plug in stuff requiring touching
        the architecture
  l    Not really interested in (near) real time
        performance
l      Iteration 1
  l    Plain in memory Java stuff
l      Iteration 2
  l    Same stuff but using Lucene instead of loading
        things into memory
         l    Too much faster J
                                                    7
The (slightly) longer story #2
l      So I realized
  l    Lucene has so many features stored you can
        take advantage of for free
  l    Therefore writing the classification algorithm is
        relatively simple
  l    In many cases you’re just not adding anything to
        the architecture
         l    Your Lucene index was already there for searching
  l    Lucene index is, to some extent, already a model
        which we just need to “query” with the proper
        algorithm
  l    And it is fast enough
                                                           8
Lucene classification module
l      Work in progress on trunk
l      LUCENE-4345
  l    Establishing classification API
  l    With currently two implementations
         l    Naïve bayes
         l    K nearest neighbor




                                             9
Lucene classification module
l      Classifier API
  l    Training
  l    void train(atomicReader, contentField,
        classField, analyzer) throws IOException

         l    atomicReader : the reader on the Lucene index to
               use for classification
               l    still unsure if IR’d be better
         l    textFieldName : the name of the field which contains
               documents’ texts
         l    classFieldName : the name of the field which contains
               the class assigned to existing documents
         l    analyzer : the item used for analyzing the unseen
               texts
                                                            10
Lucene classification module
l      Classifier API
  l    Classifying
  l    ClassificationResult assignClass(String text)
        throws IOException
         l    text: the unseen text of the document to classify
         l    ClassificationResult : the object containing the
               assigned class along with the related score




                                                              11
K Nearest neighbor classifier
l    Fairly simple classification algorithm
l    Given some new unseen item
l    I search in my knowledge base the k items
      which are nearer to the new one
l    I get the k classes assigned to the k
      nearest items
l    I assign to the new item the class that is
      most frequent in the k returned items



                                           12
K Nearest neighbor classifier
l      How can we do this in Lucene?
  l    We have VSM for representing documents as
        vectors and eventually find distances
  l    Lucene MoreLikeThis module can do a lot for it
  l    Given a new document
         l    It’s represented as a MoreLikeThisQuery which filters
               out too frequent words and helps on keeping only the
               relevant tokens for finding the neighbors
         l    The query is executed returning only the first k results
         l    The result is then browsed in order to find the most
               frequent class and that is then assigned with a score
               of classFreq / k


                                                               13
Naïve Bayes classifier
l      Slightly more complicated
l      Based on probabilities
l      C = argmax( P(d|c) * P(c) )
  l    P(d|c) : likelihood
  l    P(c) : prior
  l    With some assumptions:
         l    bag of words assumption: positions don't matter
         l    conditional independence: the feature probabilities
               are independent given a class




                                                             14
Naïve Bayes classifier
l      Prior calculation is easy
  l    It’s the relative frequency of each class
        #docsWithClassC / #docs
l      Likelihood is easy too because of the bag
        of words assumption
  l    P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c)
  l    So we just need probabilities of single terms
         l    P(x|c) := (tf of x in documents with class c + 1)/
               (#terms in docs with class c + #docs)




                                                                15
Naïve Bayes classifier
l      Does the bag of words assumption affect
        the classifier’s precision?
  l    Yes in theory
         l    in text documents (nearby) words are strictly
               correlated
  l    Not always in practice
         l    depending on your index data it may or not have an
               impact




                                                               16
Using different indexes
l      The Classifier API makes usage of an
        AtomicReader to get the data for the
        training
l      It must not be the very same index used for
        every day index / search
  l    For performance reasons
  l    For enhancing classifier effectiveness
         l    Using more specific analyzers
         l    Indexing data in a different way
                l    e.g. one big document for each class and use kNN (with a
                      small k) or TF-IDF similarity



                                                                       17
Things to consider - bootstrapping
l      How are your first documents classified?
  l    Manually
         l    Categories are already there in the documents
         l    Someone is explicitly charged to do that (e.g. article
               authors) at some point in time
  l    (semi) automatically
         l    Using some existing service / library
                l    With or without human supervision

  l    In either case the classifier needs something to
        be fed with to be effective



                                                               18
Things to consider – tokenizing
l      How are your content field tokenized?
  l    Whitespace
         l    It doesn’t work for each language
  l    Standard
  l    Sentence
  l    What about using N-Grams?
  l    What about using Shingles?




                                                   19
Things to consider - filtering
l      Some words may / should be filtered while
  l    Training
  l    Classifying
l      Often
  l    Stopwords
  l    Punctuation
  l    Not relevant PoS tagged tokens



                                            20
Raw benchmarking
l      Tried both algorithms on ~1M docs index
  l    Naïve bayes is affected by the # of classes
  l    kNN is affected by k being large
l      None of them took more than 1-2m to train
        even with great number of classes or large
        k values




                                                  21
From Lucene to Solr
l      The Lucene classifiers can be easily used
        in Solr
  l    As specific search services
         l    A classification based more like this
  l    While indexing
         l    For automatic text categorization




                                                       22
Classification based MLT
l      Use case:
  l    “give me all the documents that belong to the
        same category of a new not indexed document”
  l    Slightly different from basic MLT since it does
        not return the nearest docs
  l    That is useful if the user doesn’t want / need to
        index the document and still want to find all the
        other documents of the same category, whatever
        this category means


                                                   23
Classification based MLT
l      ClassificationMLTHandler
  l    String d = req.getParams().get(DOC);
  l    ClassificationResult r = classifier.assignClass(d);
  l    String c = r.getAssignedClass();
  l    req.getSearcher().search(new TermQuery(new
        Term(classFieldName, c)), rows);




                                                    24
Automatic text categorization
l      Once a doc reaches Solr
l      We can use the Lucene classifiers to
        automate assigning document’s category
l      We can leverage existing Solr facilites for
        enhancing the indexing pipeline
  l    An UpdateChain can be decorated with one or
        more UpdateRequestProcessors




                                               25
Automatic text categorization
l      Configuration
  l    <updateRequestProcessorChain name=“ctgr">
  l    <processor
        class=”solr.CategorizationUpdateRequestProces
        sorFactory">
  l    <processor
        class="solr.RunUpdateProcessorFactory" />
  l    </updateRequestProcessorChain>



                                               26
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      void processAdd(AddUpdateCommand
        cmd) throws IOException
         l    String text = solrInputDocument.getFieldValue(“text”);
         l    String class = classifier.assignClass(text);
         l    solrInputDocument.addField(“cat”, class);
  l    Every now and then need to retrain to get latest
        stuff in the current index, but that can be done in
        the background without affecting performances


                                                             27
Automatic text categorization
l      CategorizationUpdateRequestProcessor
l      Finer grained control
  l    Use automatic text categorization only if a value
        does not exist for the “cat” field
  l    Add the classifier output class to the “cat” field
        only if it’s above a certain score




                                                      28
Wrap up
l      Simple classifiers with no or little effort
l      No architecture change
l      Both available to Lucene and Solr
l      Still reasonably fast
l      A lot more can be done
  l    Implement a MaxEnt Lucene based classifier
         l    which takes into account words correlation



                                                            29
Thanks!




          30

Mais conteúdo relacionado

Mais procurados

Content based image retrieval(cbir)
Content based image retrieval(cbir)Content based image retrieval(cbir)
Content based image retrieval(cbir)paddu123
 
Machine Learning - Object Detection and Classification
Machine Learning - Object Detection and ClassificationMachine Learning - Object Detection and Classification
Machine Learning - Object Detection and ClassificationVikas Jain
 
Neural networks and deep learning
Neural networks and deep learningNeural networks and deep learning
Neural networks and deep learningJörgen Sandig
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Sebastian Ruder
 
DIP Erosion Dilation
DIP Erosion DilationDIP Erosion Dilation
DIP Erosion DilationAntony Vigil
 
Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lithium
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extractionskylian
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisSupervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisTharindu Kumara
 
Lecture 2&3 Computer vision image formation ,filters&edge detection
Lecture 2&3 Computer vision image formation ,filters&edge detectionLecture 2&3 Computer vision image formation ,filters&edge detection
Lecture 2&3 Computer vision image formation ,filters&edge detectioncairo university
 
Protecting user data in profile matching social networks
Protecting user data in profile matching social networksProtecting user data in profile matching social networks
Protecting user data in profile matching social networksVenkat Projects
 
Neocognitron
NeocognitronNeocognitron
NeocognitronESCOM
 
Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Kavita Ganesan
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Basit Rafiq
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 

Mais procurados (20)

Content based image retrieval(cbir)
Content based image retrieval(cbir)Content based image retrieval(cbir)
Content based image retrieval(cbir)
 
Search@flipkart
Search@flipkartSearch@flipkart
Search@flipkart
 
Cnn
CnnCnn
Cnn
 
Machine Learning - Object Detection and Classification
Machine Learning - Object Detection and ClassificationMachine Learning - Object Detection and Classification
Machine Learning - Object Detection and Classification
 
Neural networks and deep learning
Neural networks and deep learningNeural networks and deep learning
Neural networks and deep learning
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
DIP Erosion Dilation
DIP Erosion DilationDIP Erosion Dilation
DIP Erosion Dilation
 
Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)Lightweight Natural Language Processing (NLP)
Lightweight Natural Language Processing (NLP)
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extraction
 
Lstm
LstmLstm
Lstm
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
OpenNLP demo
OpenNLP demoOpenNLP demo
OpenNLP demo
 
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment AnalysisSupervised Learning Based Approach to Aspect Based Sentiment Analysis
Supervised Learning Based Approach to Aspect Based Sentiment Analysis
 
Lecture 2&3 Computer vision image formation ,filters&edge detection
Lecture 2&3 Computer vision image formation ,filters&edge detectionLecture 2&3 Computer vision image formation ,filters&edge detection
Lecture 2&3 Computer vision image formation ,filters&edge detection
 
Protecting user data in profile matching social networks
Protecting user data in profile matching social networksProtecting user data in profile matching social networks
Protecting user data in profile matching social networks
 
Neocognitron
NeocognitronNeocognitron
Neocognitron
 
Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)Opinion Mining Tutorial (Sentiment Analysis)
Opinion Mining Tutorial (Sentiment Analysis)
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 

Semelhante a Text categorization with Lucene and Solr

Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptxSAICHARANREDDYN
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture OneAngelo Corsaro
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in pythonnitamhaske
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python ProgrammingKAUSHAL KUMAR JHA
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave ImplicitMartin Odersky
 
Object Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxObject Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxethiouniverse
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and ReasonableKevlin Henney
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxabhishek364864
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersHoffman Lab
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learningStanley Wang
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2Vishal Biyani
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptBill Buchan
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP Max Kleiner
 

Semelhante a Text categorization with Lucene and Solr (20)

Oop basic concepts
Oop basic conceptsOop basic concepts
Oop basic concepts
 
Java Notes
Java NotesJava Notes
Java Notes
 
Object Oriented Programming.pptx
Object Oriented Programming.pptxObject Oriented Programming.pptx
Object Oriented Programming.pptx
 
Programming in Scala - Lecture One
Programming in Scala - Lecture OneProgramming in Scala - Lecture One
Programming in Scala - Lecture One
 
Object oriented programming in python
Object oriented programming in pythonObject oriented programming in python
Object oriented programming in python
 
Proyecto
ProyectoProyecto
Proyecto
 
Summer Training Project On Python Programming
Summer Training Project On Python ProgrammingSummer Training Project On Python Programming
Summer Training Project On Python Programming
 
What To Leave Implicit
What To Leave ImplicitWhat To Leave Implicit
What To Leave Implicit
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
Object Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptxObject Oriented Programming Tutorial.pptx
Object Oriented Programming Tutorial.pptx
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
Conventional and Reasonable
Conventional and ReasonableConventional and Reasonable
Conventional and Reasonable
 
PRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptxPRESENTATION ON PYTHON.pptx
PRESENTATION ON PYTHON.pptx
 
Torch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitionersTorch: a scientific computing framework for machine-learning practitioners
Torch: a scientific computing framework for machine-learning practitioners
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learning
 
Learning puppet chapter 2
Learning puppet chapter 2Learning puppet chapter 2
Learning puppet chapter 2
 
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScriptLotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
Lotusphere 2007 BP301 Advanced Object Oriented Programming for LotusScript
 
EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP EKON 12 Running OpenLDAP
EKON 12 Running OpenLDAP
 

Mais de Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

Mais de Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Último

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Último (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Text categorization with Lucene and Solr

  • 1. Text categorization with Lucene and Solr Tommaso Teofili tommaso [at] apache [dot] org 1
  • 2. About me l  ASF member having fun with: l  Lucene / Solr l  Hama l  UIMA l  Stanbol l  … some others l  SW engineer @ Adobe R&D 2
  • 3. Agenda l  Classification l  Lucene classification module l  Solr text categorization services l  Conclusions 3
  • 4. Classification l  Let the algorithm assign one or more labels (classes) to some item given some previous knowledge l  Spam filter l  Tagging system l  Digit recognition system l  Text categorization l  etc. 4
  • 6. The short story l  Lucene already has a lot of features for common information retrieval needs l  Postings l  Term vectors l  Statistics l  Positions l  TF / IDF l  maybe Payloads l  etc. l  We may avoid bringing in new components to do classification just leveraging what we get for free from Lucene 6
  • 7. The (slightly) longer story #1 l  While playing with NLP stuff l  Need to implement a naïve bayes classifier l  Not possible to plug in stuff requiring touching the architecture l  Not really interested in (near) real time performance l  Iteration 1 l  Plain in memory Java stuff l  Iteration 2 l  Same stuff but using Lucene instead of loading things into memory l  Too much faster J 7
  • 8. The (slightly) longer story #2 l  So I realized l  Lucene has so many features stored you can take advantage of for free l  Therefore writing the classification algorithm is relatively simple l  In many cases you’re just not adding anything to the architecture l  Your Lucene index was already there for searching l  Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm l  And it is fast enough 8
  • 9. Lucene classification module l  Work in progress on trunk l  LUCENE-4345 l  Establishing classification API l  With currently two implementations l  Naïve bayes l  K nearest neighbor 9
  • 10. Lucene classification module l  Classifier API l  Training l  void train(atomicReader, contentField, classField, analyzer) throws IOException l  atomicReader : the reader on the Lucene index to use for classification l  still unsure if IR’d be better l  textFieldName : the name of the field which contains documents’ texts l  classFieldName : the name of the field which contains the class assigned to existing documents l  analyzer : the item used for analyzing the unseen texts 10
  • 11. Lucene classification module l  Classifier API l  Classifying l  ClassificationResult assignClass(String text) throws IOException l  text: the unseen text of the document to classify l  ClassificationResult : the object containing the assigned class along with the related score 11
  • 12. K Nearest neighbor classifier l  Fairly simple classification algorithm l  Given some new unseen item l  I search in my knowledge base the k items which are nearer to the new one l  I get the k classes assigned to the k nearest items l  I assign to the new item the class that is most frequent in the k returned items 12
  • 13. K Nearest neighbor classifier l  How can we do this in Lucene? l  We have VSM for representing documents as vectors and eventually find distances l  Lucene MoreLikeThis module can do a lot for it l  Given a new document l  It’s represented as a MoreLikeThisQuery which filters out too frequent words and helps on keeping only the relevant tokens for finding the neighbors l  The query is executed returning only the first k results l  The result is then browsed in order to find the most frequent class and that is then assigned with a score of classFreq / k 13
  • 14. Naïve Bayes classifier l  Slightly more complicated l  Based on probabilities l  C = argmax( P(d|c) * P(c) ) l  P(d|c) : likelihood l  P(c) : prior l  With some assumptions: l  bag of words assumption: positions don't matter l  conditional independence: the feature probabilities are independent given a class 14
  • 15. Naïve Bayes classifier l  Prior calculation is easy l  It’s the relative frequency of each class #docsWithClassC / #docs l  Likelihood is easy too because of the bag of words assumption l  P(d|c) := P(x1,..,xn|c) == P(x1|c)*...P(xn|c) l  So we just need probabilities of single terms l  P(x|c) := (tf of x in documents with class c + 1)/ (#terms in docs with class c + #docs) 15
  • 16. Naïve Bayes classifier l  Does the bag of words assumption affect the classifier’s precision? l  Yes in theory l  in text documents (nearby) words are strictly correlated l  Not always in practice l  depending on your index data it may or not have an impact 16
  • 17. Using different indexes l  The Classifier API makes usage of an AtomicReader to get the data for the training l  It must not be the very same index used for every day index / search l  For performance reasons l  For enhancing classifier effectiveness l  Using more specific analyzers l  Indexing data in a different way l  e.g. one big document for each class and use kNN (with a small k) or TF-IDF similarity 17
  • 18. Things to consider - bootstrapping l  How are your first documents classified? l  Manually l  Categories are already there in the documents l  Someone is explicitly charged to do that (e.g. article authors) at some point in time l  (semi) automatically l  Using some existing service / library l  With or without human supervision l  In either case the classifier needs something to be fed with to be effective 18
  • 19. Things to consider – tokenizing l  How are your content field tokenized? l  Whitespace l  It doesn’t work for each language l  Standard l  Sentence l  What about using N-Grams? l  What about using Shingles? 19
  • 20. Things to consider - filtering l  Some words may / should be filtered while l  Training l  Classifying l  Often l  Stopwords l  Punctuation l  Not relevant PoS tagged tokens 20
  • 21. Raw benchmarking l  Tried both algorithms on ~1M docs index l  Naïve bayes is affected by the # of classes l  kNN is affected by k being large l  None of them took more than 1-2m to train even with great number of classes or large k values 21
  • 22. From Lucene to Solr l  The Lucene classifiers can be easily used in Solr l  As specific search services l  A classification based more like this l  While indexing l  For automatic text categorization 22
  • 23. Classification based MLT l  Use case: l  “give me all the documents that belong to the same category of a new not indexed document” l  Slightly different from basic MLT since it does not return the nearest docs l  That is useful if the user doesn’t want / need to index the document and still want to find all the other documents of the same category, whatever this category means 23
  • 24. Classification based MLT l  ClassificationMLTHandler l  String d = req.getParams().get(DOC); l  ClassificationResult r = classifier.assignClass(d); l  String c = r.getAssignedClass(); l  req.getSearcher().search(new TermQuery(new Term(classFieldName, c)), rows); 24
  • 25. Automatic text categorization l  Once a doc reaches Solr l  We can use the Lucene classifiers to automate assigning document’s category l  We can leverage existing Solr facilites for enhancing the indexing pipeline l  An UpdateChain can be decorated with one or more UpdateRequestProcessors 25
  • 26. Automatic text categorization l  Configuration l  <updateRequestProcessorChain name=“ctgr"> l  <processor class=”solr.CategorizationUpdateRequestProces sorFactory"> l  <processor class="solr.RunUpdateProcessorFactory" /> l  </updateRequestProcessorChain> 26
  • 27. Automatic text categorization l  CategorizationUpdateRequestProcessor l  void processAdd(AddUpdateCommand cmd) throws IOException l  String text = solrInputDocument.getFieldValue(“text”); l  String class = classifier.assignClass(text); l  solrInputDocument.addField(“cat”, class); l  Every now and then need to retrain to get latest stuff in the current index, but that can be done in the background without affecting performances 27
  • 28. Automatic text categorization l  CategorizationUpdateRequestProcessor l  Finer grained control l  Use automatic text categorization only if a value does not exist for the “cat” field l  Add the classifier output class to the “cat” field only if it’s above a certain score 28
  • 29. Wrap up l  Simple classifiers with no or little effort l  No architecture change l  Both available to Lucene and Solr l  Still reasonably fast l  A lot more can be done l  Implement a MaxEnt Lucene based classifier l  which takes into account words correlation 29
  • 30. Thanks! 30