INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE
Alex Lin
Senior Architect
Intelligent Mining
alin at IntelligentMinining.com
Overview of Retrieval Models
  Boolean Retrieval
  Vector Space Model
  Probabilistic Model
  Language Model
Boolean Retrieval
  Example query: lincoln AND NOT (car AND automobile)
  The earliest model, and still in use today
  Results are very easy to explain to users
  Computationally very efficient
  Major drawback: no sophisticated ranking algorithm
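As a minimal sketch of how the example query above could be evaluated, the snippet below uses Python sets as posting lists. The index contents, the document ids, and the helper names (`postings`, `AND`, `NOT`) are made up for illustration and are not from the slides.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5},
}
all_docs = {1, 2, 3, 4, 5}


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


def AND(a, b):
    """Documents present in both posting sets."""
    return a & b


def NOT(a):
    """Documents in the collection that are not in the posting set."""
    return all_docs - a


# lincoln AND NOT (car AND automobile)
result = AND(postings("lincoln"), NOT(AND(postings("car"), postings("automobile"))))
print(sorted(result))  # [1, 2, 4]
```

Every matching document is returned with no ordering among them, which is the ranking drawback the slide points out.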
Vector Space Model
  [Figure: Doc1, Doc2, and Query drawn as vectors in a space whose axes are terms (Term2, Term3)]

  $$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$

  Major flaw: it gives no guidance on how weighting and ranking algorithms relate to relevance
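The cosine measure above can be sketched in a few lines of Python. The function name `cosine` and the sample vectors are illustrative assumptions; the term weights could be raw counts, TF-IDF values, or anything else, which is exactly the freedom the slide criticises.

```python
import math


def cosine(d, q):
    """Cosine of the angle between a document vector d and a query vector q.

    Both are equal-length sequences of term weights.
    Returns 0.0 if either vector is all zeros.
    """
    dot = sum(dj * qj for dj, qj in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d) * sum(x * x for x in q))
    return dot / norm if norm else 0.0


doc1 = [1.0, 2.0, 0.0]   # hypothetical weights for three terms
doc2 = [0.0, 1.0, 3.0]
query = [1.0, 1.0, 0.0]

# Rank the documents by similarity to the query, best first.
ranking = sorted(
    [("Doc1", cosine(doc1, query)), ("Doc2", cosine(doc2, query))],
    key=lambda pair: pair[1],
    reverse=True,
)
```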
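Probabilistic Retrieval Model
  [Figure: a document D belongs to the Relevant class with probability P(R|D) or to the Non-Relevant class with probability P(NR|D)]

  Bayes' Rule:
  $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$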
Probabilistic Retrieval Model
  $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}$$

  If $P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR)$, then classify D as relevant
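The decision rule can be written directly as a comparison, since P(D) appears in both posteriors and cancels. This is only a sketch: the function name `is_relevant` and the probabilities passed to it are invented for illustration, and P(NR) is assumed to be 1 − P(R).

```python
def is_relevant(p_d_given_r, p_d_given_nr, p_r):
    """Classify D as relevant if P(D|R) P(R) > P(D|NR) P(NR).

    p_r is the prior P(R); P(NR) is taken to be 1 - P(R).
    """
    p_nr = 1.0 - p_r
    return p_d_given_r * p_r > p_d_given_nr * p_nr


# Relevant documents are rare (prior 0.1), so the likelihood under R must be
# several times larger than under NR before the document is accepted.
strong_evidence = is_relevant(p_d_given_r=0.02, p_d_given_nr=0.001, p_r=0.1)
weak_evidence = is_relevant(p_d_given_r=0.02, p_d_given_nr=0.01, p_r=0.1)
```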
Estimate P(D|R) and P(D|NR)
  Define $D = (d_1, d_2, \ldots, d_t)$, then

  $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$

  $$P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$

  Binary Independence Model: term independence + binary features in documents
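Under the Binary Independence Model each d_i is 0 or 1, so the product above becomes a product of "term present" and "term absent" probabilities. The sketch below assumes a list `p` holding the probability that each term occurs in the given class; the name `bim_likelihood` and the numbers are illustrative only.

```python
import math


def bim_likelihood(d, p):
    """P(D | class) under the Binary Independence Model.

    d: binary vector, d[i] is 1 if term i occurs in the document, else 0.
    p: p[i] is the probability that term i occurs in a document of the class.
    """
    return math.prod(pi if di else 1.0 - pi for di, pi in zip(d, p))


d = [1, 0, 1]
p_relevant = [0.8, 0.3, 0.5]      # hypothetical P(d_i = 1 | R)
p_nonrelevant = [0.2, 0.3, 0.1]   # hypothetical P(d_i = 1 | NR)

likelihood_r = bim_likelihood(d, p_relevant)       # 0.8 * 0.7 * 0.5
likelihood_nr = bim_likelihood(d, p_nonrelevant)   # 0.2 * 0.7 * 0.1
```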
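Likelihood Ratio
  Likelihood ratio: classify D as relevant if

  $$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$

  p_i: probability that term i occurs in a document from the relevant set
  s_i: probability that term i occurs in a document from the non-relevant set

  $$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$

  Taking logs and dropping the factor that is the same for every document gives a rank-equivalent score:

  $$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$

  Restricting the sum to query terms and estimating $p_i = (r_i + 0.5)/(R + 1)$ and $s_i = (n_i - r_i + 0.5)/(N - R + 1)$ from counts:

  $$\sum_{i:d_i=q_i=1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$

  N: total number of documents in the collection
  n_i: number of documents that contain term i
  r_i: number of relevant documents that contain term i
  R: total number of relevant documents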
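The per-term weight in the final sum can be computed from the four counts alone. In the sketch below the function name `term_weight` and the example counts are assumptions for illustration; N and n_i are taken to be collection-wide counts, which is what the denominator N − n_i − R + r_i requires.

```python
import math


def term_weight(r_i, n_i, R, N):
    """Smoothed log-odds weight of one query term.

    r_i: relevant documents containing the term
    n_i: all documents containing the term
    R:   relevant documents
    N:   all documents in the collection
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    nonrelevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / nonrelevant_odds)


# With no relevance information (R = r_i = 0) the weight reduces to an
# IDF-like quantity: log((N - n_i + 0.5) / (n_i + 0.5)).
idf_like = term_weight(r_i=0, n_i=10, R=0, N=100)

# A term concentrated in the relevant set gets a larger weight.
concentrated = term_weight(r_i=8, n_i=10, R=10, N=100)
scattered = term_weight(r_i=1, n_i=10, R=10, N=100)
```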
Combine with BM25 Ranking Algorithm
  BM25 extends the scoring function of the binary independence model to include document and query term weights.
  It performs very well in TREC experiments.

  $$R(q,D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}$$

  $$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$

  k_1, k_2, b: tuning parameters
  f_i: frequency of term i in the document
  qf_i: frequency of term i in the query
  dl: document length
  avgdl: average document length in the data set
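A compact sketch of the BM25 formula above follows. The function name `bm25_score`, the dictionary-based inputs, and the defaults k1 = 1.2, k2 = 100, b = 0.75 are assumptions chosen for illustration (they are commonly quoted starting values, not values given in the slides). When no relevance information is supplied, R and every r_i default to 0.

```python
import math


def bm25_score(query_tf, doc_tf, dl, avgdl, n, N, r=None, R=0,
               k1=1.2, k2=100.0, b=0.75):
    """BM25 score of one document for one query.

    query_tf: term -> frequency in the query (qf_i)
    doc_tf:   term -> frequency in the document (f_i)
    n:        term -> number of documents containing the term (n_i)
    r:        term -> number of relevant documents containing the term (r_i)
    """
    r = r or {}
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        ri = r.get(term, 0)
        ni = n.get(term, 0)
        idf = math.log(((ri + 0.5) / (R - ri + 0.5)) /
                       ((ni - ri + 0.5) / (N - ni - R + ri + 0.5)))
        score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score


# Hypothetical collection statistics for a two-term query.
score = bm25_score(
    query_tf={"president": 1, "lincoln": 1},
    doc_tf={"president": 15, "lincoln": 25},
    dl=90, avgdl=100,
    n={"president": 40000, "lincoln": 300},
    N=500000,
)
```

The rarer term ("lincoln" in this made-up collection) dominates the score because its log-odds factor is much larger, while the term-frequency factor saturates as f_i grows.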
Weighted Fields Boolean Search
  [Table: one row per document (doc-id 1 ... n), with columns field0, field1, ..., text]

  $$R(q,D) = \sum_{i \in q} \sum_{f \in fields} w_f\, m_i$$
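The double sum can be sketched as below. The slides do not define m_i, so this sketch assumes it is a match indicator: 1 if query term i occurs in field f, 0 otherwise. The function name, field names, weights, and document are all hypothetical.

```python
def weighted_field_score(query_terms, doc_fields, weights):
    """Sum w_f over every (query term, field) pair where the term matches.

    query_terms: list of lowercase query terms
    doc_fields:  field name -> field text
    weights:     field name -> weight w_f
    Assumption: m_i is 1 when the term occurs in the field, else 0.
    """
    score = 0.0
    for term in query_terms:
        for field, w in weights.items():
            tokens = doc_fields.get(field, "").lower().split()
            m = 1 if term in tokens else 0
            score += w * m
    return score


doc = {"field0": "Buzz Lightyear", "text": "toy story character buzz"}
weights = {"field0": 3.0, "text": 1.0}  # hand-picked, not learned

score = weighted_field_score(["buzz", "lightyear"], doc, weights)
# "buzz" matches both fields (3.0 + 1.0); "lightyear" matches field0 only (3.0).
```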
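Apply Probabilistic Knowledge into Fields
  [Table: one row per document (doc-id 1 ... n), with columns field0, field1, ..., Text, marked with a gradient from Higher (field0) to Lower (Text); example row: field0 = Lightyear, field1 = Buzz]

  [Figure: a document D belongs to the Relevant class with probability P(R|D) or to the Non-Relevant class with probability P(NR|D)]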
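Use the Knowledge during Ranking
  [Table: one row per document (doc-id 1 ... n), with columns field0, field1, ..., Text; example row: field0 = Lightyear, field1 = Buzz]

  The goal is:

  $$\log P(D \mid R) = \log \prod_{i=1}^{t} P(d_i \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$

  Learnable: the field weights $w_f$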
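The slides do not say how the weights are learned. One possible approach, shown here purely as an illustrative assumption and not as the author's method, is to reuse the smoothed log-odds estimate from the likelihood-ratio slide at the field level: count how often a match in field f co-occurs with a relevant judgment (for example a click in the search log). The function name `field_weights` and the log format are hypothetical.

```python
import math


def field_weights(log, fields):
    """Estimate one weight per field from relevance feedback.

    log:    list of (matched_fields, relevant) pairs, where matched_fields is
            the set of fields in which the query matched the document and
            relevant is True if the user judged (e.g. clicked) the result.
    fields: field names to estimate weights for.
    """
    N = len(log)
    R = sum(1 for _, rel in log if rel)
    weights = {}
    for f in fields:
        n_f = sum(1 for matched, _ in log if f in matched)
        r_f = sum(1 for matched, rel in log if rel and f in matched)
        p = (r_f + 0.5) / (R + 1.0)             # P(match in f | relevant)
        s = (n_f - r_f + 0.5) / (N - R + 1.0)   # P(match in f | non-relevant)
        weights[f] = math.log(p * (1 - s) / (s * (1 - p)))
    return weights


# Tiny made-up search log: matches in field0 tend to be judged relevant.
log = [
    ({"field0"}, True),
    ({"field0", "text"}, True),
    ({"text"}, False),
    ({"text"}, False),
]
w = field_weights(log, ["field0", "text"])
```

On this toy log field0 ends up with a positive weight and text with a negative one, which matches the "higher to lower" gradient idea, but a real system would need far more feedback data than four rows.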
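Comparison of Approaches
  TF-IDF:

  $$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$

  BM25 term-frequency part:

  $$R_{bm25}(q,D) = \frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i} \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$

  BM25 with relevance information:

  $$R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$

  Weighted fields, with the field weights taking the place of the IDF part:

  $$R(q,D) = \sum_{i \in q} \underbrace{\left( \sum_{f \in F} w_f\, m_i \right)}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1)\, f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$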
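The last formula can be sketched by multiplying a field-weight sum by the BM25 term-frequency factor for each query term. Several choices below are assumptions rather than anything stated in the slides: m_i is treated as a 0/1 match indicator per field, f_i and dl are counted over all fields together, and k1 = 1.2, k2 = 100, b = 0.75 are typical defaults. All names and data are hypothetical.

```python
def bm25_tf(f, qf, dl, avgdl, k1=1.2, k2=100.0, b=0.75):
    """Term-frequency part of BM25 for one term."""
    K = k1 * ((1 - b) + b * dl / avgdl)
    return ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))


def field_weighted_score(query_tf, doc_fields, weights, avgdl):
    """Sum over query terms of (field-weight sum) * (BM25 TF part)."""
    tokens = {f: text.lower().split() for f, text in doc_fields.items()}
    dl = sum(len(t) for t in tokens.values())  # length over all fields
    score = 0.0
    for term, qf in query_tf.items():
        f_i = sum(t.count(term) for t in tokens.values())
        if f_i == 0:
            continue
        # Field-weight sum: add w_f for every field in which the term matches.
        field_part = sum(w for f, w in weights.items()
                         if term in tokens.get(f, []))
        score += field_part * bm25_tf(f_i, qf, dl, avgdl)
    return score


doc = {"field0": "buzz lightyear", "text": "buzz is a toy"}
weights = {"field0": 3.0, "text": 1.0}
score = field_weighted_score({"buzz": 1}, doc, weights, avgdl=6)
```

Compared with plain BM25, the only change is that the collection-statistics factor is replaced by weights tied to where the term matched, so the index does not need document frequencies at query time.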
Other Considerations
  This is not a formal model
  Requires user relevance feedback (search logs)
  Harder to handle real-time search queries
  How to prevent love/hate attacks
Thank you
