SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
The Future of Search in
                                     Plone
                                          Sally Kleinfeldt
                                            and friends
                                  Plone Conference, San Francisco
                                        November 3, 2011




Tuesday, November 29, 2011
Motivation


                             •   Raise awareness

                             •   Promote discussion

                             •   Forge consensus




Tuesday, November 29, 2011
Agenda


                             •   Introduction to IR concepts

                             •   Description of Solr and ZCatalog

                             •   Discussion




Tuesday, November 29, 2011
IR 101




Tuesday, November 29, 2011
IR 101

                             •   Transformations

                             •   Terms

                             •   Models

                             •   Measures




Tuesday, November 29, 2011
IR 101
                                     Transformations
                             •   Turn binary, HTML, or other document
                                 formats into fields and strings

                             •   Parse the strings into a set of terms

                             •   Build indexes of the terms specific to the IR
                                 model used

                             •   Queries are parsed into query operators and
                                 strings, which are parsed into terms




Tuesday, November 29, 2011
IR 101
                                     String => Terms
                             •   Tokenization - locate word boundaries

                             •   Normalization - remove capitals and diacritics

                             •   Stopping - remove stop words (a, of, on,
                                 the...)

                             •   Stemming - reduce to word stems (walks,
                                 walking => walk)

                             •   Recognizers - concepts, parts of speech,
                                 names, locations...

                             •   Must be identical for documents and queries


Tuesday, November 29, 2011
IR 101
                                              Terms

                             •   Application specific

                             •   Words or phrases

                             •   IR models assign weights to terms in
                                 documents




Tuesday, November 29, 2011
IR 101
                                       Term Weighting
                             •   Simplest:Yes/No Boolean value

                             •   Better: Term Frequency - # occurrences

                             •   More meaningful: tf-idf

                                 •   Term Freq * Inverse Document Freq

                                 •   How many documents contain the term?

                                 •   Increase weight of rare terms and vice
                                     versa




Tuesday, November 29, 2011
IR 101
                                      Boolean Model
                             •   First and most adopted

                             •   Based on Boolean logic + set theory

                             •   Does a document contain query terms - Y/N

                             •   Intuitive, easy to implement

                             •   No ranking, special query language, too many
                                 or too few results

                             •   Typical for library systems




Tuesday, November 29, 2011
IR 101
                                 Vector Space Models
                             •   Represent documents and queries as vectors
                                 of terms

                             •   Term values are weighted - by count or tf-idf

                             •   Use vector operations to compare
                                 documents with queries

                             •   Relevance score based on cosine of angle
                                 between doc/query vectors




Tuesday, November 29, 2011
IR 101
                                 Probabilistic Models
                             •   Compute probability that a document is
                                 relevant to a query

                             •   Relevance ranking functions range from
                                 simple to complex

                             •   Sophisticated ranking functions include

                                 •   Okapi BM25 (uses tf and idf)

                                 •   Machine learning formulas (use training
                                     data)




Tuesday, November 29, 2011
IR 101
                             Extending the Models
                             •   Many many refinements possible

                                 •   Term interdependencies

                                 •   Fuzzy sets

                                 •   Semantic analysis, link analysis

                                 •   Combining models (Extended Boolean)

                             •   The best search engines represent thousands
                                 of engineering hours




Tuesday, November 29, 2011
IR 101
                                              Measures
                             •   Search engine results are measured against:

                                 •   Precision - Percent of results that are
                                     relevant

                                 •   Recall - Percent of relevant results that are
                                     returned

                                 •   F-Score - Harmonic mean of precision and
                                     recall




Tuesday, November 29, 2011
ZCatalog and Solr




Tuesday, November 29, 2011
ZCatalog
                             •   Zope/Plone search engine

                             •   Full text and field searching

                             •   Probabilistic model using Okapi BM25

                             •   OOTB ZCTextIndex very simple

                             •   TextIndexNG adds multilingual, better parsing
                                 components, binary transforms, synonyms




Tuesday, November 29, 2011
Solr
                             •   Popular open source enterprise search
                                 platform

                             •   Eliminating smaller commercial search
                                 companies

                             •   Java, based on Lucene Java search library,
                                 sophisticated vector space ++ model

                             •   RESTful APIs

                             •   Large, active community

                             •   Powers Twitter, Wikipedia, Netflix...



Tuesday, November 29, 2011
What does Solr have
                             that ZCatalog Doesn’t?
                             •   Better relevance ranking

                             •   More search features: snippets, hit
                                 highlighting, spelling suggestions, synonyms,
                                 more like this, faceted search

                             •   More configurable: stop words, field
                                 boosting, parsing components

                             •   An army of engineers working on it




Tuesday, November 29, 2011
Plone + Solr
                                              Today
                             •   Two add-ons available

                                 •   collective.solr - Intercepts catalog queries
                                     and dispatches them to Solr

                                 •   alm.solrindex - adds a new index type to
                                     the catalog, SolrIndex

                             •   Plus a buildout recipe:
                                 collective.recipe.solrinstance




Tuesday, November 29, 2011
Conclusions from
                             Conference Discussion




Tuesday, November 29, 2011
Why Does Plone
                                     Need Solr?
                             •   Certain types of projects need it, for features
                                 or because ZCatalog can’t scale to very large
                                 sites

                             •   We need it to keep up with the enterprise
                                 CMS pack




Tuesday, November 29, 2011
Points of Agreement
                             •   It will be impossible to completely replace
                                 ZCatalog with Solr

                                 •   Solr indexing will never be transactional

                                 •   Removing ZCatalog from Zope would be
                                     very difficult

                                 •   Tackle small, focused ZCatalog
                                     improvements when possible - like
                                     improving indexing interface




Tuesday, November 29, 2011
Points of Agreement

                             •   Navigation and search should be handled
                                 separately

                                 •   Navigation needs to be transactional,
                                     search does not

                                 •   Split out a catalog used for navigation from
                                     the general catalog

                                 •   Explore a non-catalog utility to support
                                     navigation, optimize for speed




Tuesday, November 29, 2011
Points of Agreement
                             •   Treating Solr integration simply as ZCatalog
                                 replacement does not take best advantage of
                                 Solr features

                                 •   ZCatalog can’t represent the richness of
                                     Solr, focus on the Solr API

                                 •   Take advantage of spelling suggestions,
                                     facets, results snippets with hit highlighting,
                                     synonyms, more like this, etc.

                                 •   Provide Solr indexing, field weighting, etc.
                                     configuration choices in the control panel



Tuesday, November 29, 2011
Points of Agreement
                             •   Neither of the current Solr add-ons provides
                                 the best foundation for the future

                                 •   But they’ve taught us how to do things
                                     better

                             •   Non-Solr approaches to improved Plone
                                 search should be deprecated

                                 •   Andreas Jung is not planning improvements
                                     to TextIndexNG!




Tuesday, November 29, 2011
Points of Agreement


                             •   Stop investing in ZCatalog as a search engine,
                                 Solr is the future




Tuesday, November 29, 2011
Plone + Solr
                                            Roadmap
                             •   Short term: Make Solr integration easy with
                                 an approved add-on (like LDAP)

                                 •   Build on what we’ve learned and create a
                                     better add-on to replace collective.solr and
                                     alm.solrindex

                                 •   Who wants to sponsor a sprint?




Tuesday, November 29, 2011
Plone + Solr
                                            Roadmap
                             •   Long term: Ship Solr integration with Plone,
                                 but don’t require Solr

                                 •   Solr has a lot of overhead and is not always
                                     needed

                                 •   But using it should be as easy as answering
                                     yes to a “Build with Solr?” installation
                                     option




Tuesday, November 29, 2011

Mais conteúdo relacionado

Mais de Jazkarta, Inc.

Traveling through time and place with Plone
Traveling through time and place with PloneTraveling through time and place with Plone
Traveling through time and place with PloneJazkarta, Inc.
 
Questions: A Form Library for Python with SurveyJS Frontend
Questions: A Form Library for Python with SurveyJS FrontendQuestions: A Form Library for Python with SurveyJS Frontend
Questions: A Form Library for Python with SurveyJS FrontendJazkarta, Inc.
 
The User Experience: Editing Composite Pages in Plone 6 and Beyond
The User Experience: Editing Composite Pages in Plone 6 and BeyondThe User Experience: Editing Composite Pages in Plone 6 and Beyond
The User Experience: Editing Composite Pages in Plone 6 and BeyondJazkarta, Inc.
 
WTA and Plone After 13 Years
WTA and Plone After 13 YearsWTA and Plone After 13 Years
WTA and Plone After 13 YearsJazkarta, Inc.
 
Collaborating With Orchid Data
Collaborating With Orchid DataCollaborating With Orchid Data
Collaborating With Orchid DataJazkarta, Inc.
 
Spend a Week Hacking in Sorrento!
Spend a Week Hacking in Sorrento!Spend a Week Hacking in Sorrento!
Spend a Week Hacking in Sorrento!Jazkarta, Inc.
 
Plone 5 Upgrades In Real Life
Plone 5 Upgrades In Real LifePlone 5 Upgrades In Real Life
Plone 5 Upgrades In Real LifeJazkarta, Inc.
 
Accessibility in Plone: The Good, the Bad, and the Ugly
Accessibility in Plone: The Good, the Bad, and the UglyAccessibility in Plone: The Good, the Bad, and the Ugly
Accessibility in Plone: The Good, the Bad, and the UglyJazkarta, Inc.
 
Getting Paid Without GetPaid
Getting Paid Without GetPaidGetting Paid Without GetPaid
Getting Paid Without GetPaidJazkarta, Inc.
 
An Open Source Platform for Social Science Research
An Open Source Platform for Social Science ResearchAn Open Source Platform for Social Science Research
An Open Source Platform for Social Science ResearchJazkarta, Inc.
 
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...Jazkarta, Inc.
 
Anatomy of a Large Website Project
Anatomy of a Large Website ProjectAnatomy of a Large Website Project
Anatomy of a Large Website ProjectJazkarta, Inc.
 
Anatomy of a Large Website Project - With Presenter Notes
Anatomy of a Large Website Project - With Presenter NotesAnatomy of a Large Website Project - With Presenter Notes
Anatomy of a Large Website Project - With Presenter NotesJazkarta, Inc.
 
Plone Hosting: A Panel Discussion
Plone Hosting: A Panel DiscussionPlone Hosting: A Panel Discussion
Plone Hosting: A Panel DiscussionJazkarta, Inc.
 
Academic Websites in Plone
Academic Websites in PloneAcademic Websites in Plone
Academic Websites in PloneJazkarta, Inc.
 
Online exhibits in Plone
Online exhibits in PloneOnline exhibits in Plone
Online exhibits in PloneJazkarta, Inc.
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and MaintenanceJazkarta, Inc.
 

Mais de Jazkarta, Inc. (20)

Traveling through time and place with Plone
Traveling through time and place with PloneTraveling through time and place with Plone
Traveling through time and place with Plone
 
Questions: A Form Library for Python with SurveyJS Frontend
Questions: A Form Library for Python with SurveyJS FrontendQuestions: A Form Library for Python with SurveyJS Frontend
Questions: A Form Library for Python with SurveyJS Frontend
 
The User Experience: Editing Composite Pages in Plone 6 and Beyond
The User Experience: Editing Composite Pages in Plone 6 and BeyondThe User Experience: Editing Composite Pages in Plone 6 and Beyond
The User Experience: Editing Composite Pages in Plone 6 and Beyond
 
WTA and Plone After 13 Years
WTA and Plone After 13 YearsWTA and Plone After 13 Years
WTA and Plone After 13 Years
 
Collaborating With Orchid Data
Collaborating With Orchid DataCollaborating With Orchid Data
Collaborating With Orchid Data
 
Spend a Week Hacking in Sorrento!
Spend a Week Hacking in Sorrento!Spend a Week Hacking in Sorrento!
Spend a Week Hacking in Sorrento!
 
Plone 5 Upgrades In Real Life
Plone 5 Upgrades In Real LifePlone 5 Upgrades In Real Life
Plone 5 Upgrades In Real Life
 
Accessibility in Plone: The Good, the Bad, and the Ugly
Accessibility in Plone: The Good, the Bad, and the UglyAccessibility in Plone: The Good, the Bad, and the Ugly
Accessibility in Plone: The Good, the Bad, and the Ugly
 
Getting Paid Without GetPaid
Getting Paid Without GetPaidGetting Paid Without GetPaid
Getting Paid Without GetPaid
 
An Open Source Platform for Social Science Research
An Open Source Platform for Social Science ResearchAn Open Source Platform for Social Science Research
An Open Source Platform for Social Science Research
 
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...
For the Love of Volunteers! How Do You Choose the Right Technology to Manage ...
 
Anatomy of a Large Website Project
Anatomy of a Large Website ProjectAnatomy of a Large Website Project
Anatomy of a Large Website Project
 
Anatomy of a Large Website Project - With Presenter Notes
Anatomy of a Large Website Project - With Presenter NotesAnatomy of a Large Website Project - With Presenter Notes
Anatomy of a Large Website Project - With Presenter Notes
 
Plone Hosting: A Panel Discussion
Plone Hosting: A Panel DiscussionPlone Hosting: A Panel Discussion
Plone Hosting: A Panel Discussion
 
Plone+Salesforce
Plone+SalesforcePlone+Salesforce
Plone+Salesforce
 
Academic Websites in Plone
Academic Websites in PloneAcademic Websites in Plone
Academic Websites in Plone
 
Plone
PlonePlone
Plone
 
Online exhibits in Plone
Online exhibits in PloneOnline exhibits in Plone
Online exhibits in Plone
 
ZODB Tips and Tricks
ZODB Tips and TricksZODB Tips and Tricks
ZODB Tips and Tricks
 
Pyramid Deployment and Maintenance
Pyramid Deployment and MaintenancePyramid Deployment and Maintenance
Pyramid Deployment and Maintenance
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

The Future of Search in Plone

  • 1. The Future of Search in Plone Sally Kleinfeldt and friends Plone Conference, San Francisco November 3, 2011 Tuesday, November 29, 2011
  • 2. Motivation • Raise awareness • Promote discussion • Forge consensus Tuesday, November 29, 2011
  • 3. Agenda • Introduction to IR concepts • Description of Solr and ZCatalog • Discussion Tuesday, November 29, 2011
  • 5. IR 101 • Transformations • Terms • Models • Measures Tuesday, November 29, 2011
  • 6. IR 101 Transformations • Turn binary, HTML, or other document formats into fields and strings • Parse the strings into a set of terms • Build indexes of the terms specific to the IR model used • Queries are parsed into query operators and strings, which are parsed into terms Tuesday, November 29, 2011
  • 7. IR 101 String => Terms • Tokenization - locate word boundaries • Normalization - remove capitals and diacritics • Stopping - remove stop words (a, of, on, the...) • Stemming - reduce to word stems (walks, walking => walk) • Recognizers - concepts, parts of speech, names, locations... • Must be identical for documents and queries Tuesday, November 29, 2011
  • 8. IR 101 Terms • Application specific • Words or phrases • IR models assign weights to terms in documents Tuesday, November 29, 2011
  • 9. IR 101 Term Weighting • Simplest:Yes/No Boolean value • Better: Term Frequency - # occurrences • More meaningful: tf-idf • Term Freq * Inverse Document Freq • How many documents contain the term? • Increase weight of rare terms and vice versa Tuesday, November 29, 2011
  • 10. IR 101 Boolean Model • First and most adopted • Based on Boolean logic + set theory • Does a document contain query terms - Y/N • Intuitive, easy to implement • No ranking, special query language, too many or too few results • Typical for library systems Tuesday, November 29, 2011
  • 11. IR 101 Vector Space Models • Represent documents and queries as vectors of terms • Term values are weighted - by count or tf-idf • Use vector operations to compare documents with queries • Relevance score based on cosine of angle between doc/query vectors Tuesday, November 29, 2011
  • 12. IR 101 Probabilistic Models • Compute probability that a document is relevant to a query • Relevance ranking functions range from simple to complex • Sophisticated ranking functions include • Okapi BM25 (uses tf and idf) • Machine learning formulas (use training data) Tuesday, November 29, 2011
  • 13. IR 101 Extending the Models • Many many refinements possible • Term interdependencies • Fuzzy sets • Semantic analysis, link analysis • Combining models (Extended Boolean) • The best search engines represent thousands of engineering hours Tuesday, November 29, 2011
  • 14. IR 101 Measures • Search engine results are measured against: • Precision - Percent of results that are relevant • Recall - Percent of relevant results that are returned • F-Score - Harmonic mean of precision and recall Tuesday, November 29, 2011
  • 15. ZCatalog and Solr Tuesday, November 29, 2011
  • 16. ZCatalog • Zope/Plone search engine • Full text and field searching • Probabilistic model using Okapi BM25 • OOTB ZCTextIndex very simple • TextIndexNG adds multilingual, better parsing components, binary transforms, synonyms Tuesday, November 29, 2011
  • 17. Solr • Popular open source enterprise search platform • Eliminating smaller commercial search companies • Java, based on Lucene Java search library, sophisticated vector space ++ model • RESTful APIs • Large, active community • Powers Twitter, Wikipedia, Netflix... Tuesday, November 29, 2011
  • 18. What does Solr have that ZCatalog Doesn’t? • Better relevance ranking • More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search • More configurable: stop words, field boosting, parsing components • An army of engineers working on it Tuesday, November 29, 2011
  • 19. Plone + Solr Today • Two add-ons available • collective.solr - Intercepts catalog queries and dispatches them to Solr • alm.solrindex - adds a new index type to the catalog, SolrIndex • Plus a buildout recipe: collective.recipe.solrinstance Tuesday, November 29, 2011
  • 20. Conclusions from Conference Discussion Tuesday, November 29, 2011
  • 21. Why Does Plone Need Solr? • Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites • We need it to keep up with the enterprise CMS pack Tuesday, November 29, 2011
  • 22. Points of Agreement • It will be impossible to completely replace ZCatalog with Solr • Solr indexing will never be transactional • Removing ZCatalog from Zope would be very difficult • Tackle small, focused ZCatalog improvements when possible - like improving indexing interface Tuesday, November 29, 2011
  • 23. Points of Agreement • Navigation and search should be handled separately • Navigation needs to be transactional, search does not • Split out a catalog used for navigation from the general catalog • Explore a non-catalog utility to support navigation, optimize for speed Tuesday, November 29, 2011
  • 24. Points of Agreement • Treating Solr integration simply as ZCatalog replacement does not take best advantage of Solr features • ZCatalog can’t represent the richness of Solr, focus on the Solr API • Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc. • Provide Solr indexing, field weighting, etc. configuration choices in the control panel Tuesday, November 29, 2011
  • 25. Points of Agreement • Neither of the current Solr add-ons provides the best foundation for the future • But they’ve taught us how to do things better • Non-Solr approaches to improved Plone search should be deprecated • Andreas Jung is not planning improvements to TextIndexNG! Tuesday, November 29, 2011
  • 26. Points of Agreement • Stop investing in ZCatalog as a search engine, Solr is the future Tuesday, November 29, 2011
  • 27. Plone + Solr Roadmap • Short term: Make Solr integration easy with an approved add-on (like LDAP) • Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex • Who wants to sponsor a sprint? Tuesday, November 29, 2011
  • 28. Plone + Solr Roadmap • Long term: Ship Solr integration with Plone, but don’t require Solr • Solr has a lot of overhead and is not always needed • But using it should be as easy as answering yes to a “Build with Solr?” installation option Tuesday, November 29, 2011