Smart Search and Beyond
Who?

            Chris Davenport
       Production Leadership Team




Solving the search problem




Old Joomla Search Sucks!
 ‣ Cannot rank by relevance across content types
 ‣ Only very crude filtering
 ‣ Can be slow to search



Table of Contents
01   Smart Search so far
02   Smart Search in action
03   Smart Search under the hood
04   Smart Search tips and tricks
05   Smart Search where next?




A Short History
 ‣ Old Joomla Search
  • Introduced in Mambo
  • Largely unchanged since

 ‣ JXTended Finder for Joomla 1.5
 ‣ Finder Integration Working Group
  • Smart Search for Joomla 2.5

 ‣ Search Working Group



Smart Search for Joomla 2.5
 ‣ Separate index
 ‣ Auto-completion
 ‣ Faceted search
 ‣ Relevancy ordering
 ‣ Did you mean?
 ‣ ...and more besides



Auto-completion




Another example




Under the hood




A problem in two halves




First half: Indexing


 Raw data → INDEX




Second half: Querying


 Search queries → INDEX → Search results




Search results
Search results are rendered purely from
data in the index, not the raw data.




Indexing

 ‣ Parsing
 ‣ Tokenisation
 ‣ Token aggregation
 ‣ Filtration
 ‣ Stemming
 ‣ Analysis
 ‣ Term weighting
 ‣ Classification




Terms index




Parsing
 ‣ Extract plain text from raw data
  • HTML, RTF supported out-of-the-box
  • PDF, MS Word could be supported

 ‣ For example, HTML
  • Essentially the same as PHP strip_tags
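
As a rough illustration of the parsing step, here is a minimal Python sketch that drops tags and keeps only text nodes, in the spirit of PHP's strip_tags. It is not the Smart Search parser; the class and function names (TextExtractor, extract_text) are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and attribute values are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, never for attribute values.
        self.chunks.append(data)


def extract_text(html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(extract_text('<p>Hello <a href="x" title="not indexed">world</a>!</p>'))
```

Note that the text inside the title attribute never reaches the output, which is the same kind of behaviour the tips later in the deck warn about.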




Tokenisation
 ‣ Fold to lowercase
 ‣ Special handling for plus, dash, comma,
   dot and quotes
 ‣ Remove non-alphanumerics
 ‣ Replace multiple spaces with one space
 ‣ Special support for Chinese
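
The generic steps above can be sketched in a few lines of Python. This is a simplified illustration only: it covers lowercasing, removal of non-alphanumerics and space collapsing, and deliberately leaves out the special handling for plus, dash, comma, dot and quotes and the Chinese support, whose exact rules the slide does not give. The function name tokenise is an assumption.

```python
import re


def tokenise(text):
    """Lowercase, strip non-alphanumerics, collapse spaces, split into tokens."""
    text = text.lower()
    # Replace every character that is not a letter, digit or whitespace
    # (underscore included) with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Replace multiple spaces with one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []


if __name__ == "__main__":
    print(tokenise("On a clear disk, you can SEEK   forever!"))
```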




Token aggregation
On a clear disk you can seek forever

1 word    2 words        3 words
on        on a           on a clear
a         a clear        a clear disk
clear     clear disk     clear disk you
disk      disk you       disk you can
you       you can        you can seek
can       can seek       can seek forever
seek      seek forever
forever
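
The example above amounts to generating every run of one, two and three consecutive tokens. A minimal Python sketch of that idea follows; the function name aggregate and the max_len parameter are assumptions made for illustration, with 3 chosen to match the phrases shown on the slide.

```python
def aggregate(tokens, max_len=3):
    """Return every run of 1..max_len consecutive tokens as a phrase."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases


if __name__ == "__main__":
    tokens = "on a clear disk you can seek forever".split()
    for phrase in aggregate(tokens):
        print(phrase)
```

For the eight tokens in the example this yields 8 + 7 + 6 = 21 phrases, which matches the number of entries on the slide.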


Filtration
 ‣ “Stop word removal”
  • Not removed, just given a low weight

 ‣ jos_finder_terms_common
 ‣ English only
  • Other languages need to add their common
    words to the table




Stemming
fishing, fished, fisher, fish → fish




Stemming
 ‣ “Snowball” is used by default
  • Danish, German, English, Spanish, Finnish,
    French, Hungarian, Italian, Norwegian, Dutch,
    Portuguese, Romanian, Russian, Swedish and
    Turkish
  • BUT it requires a PHP extension

 ‣ “English only” uses a pure PHP stemmer
  • Recommended for all English sites
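
To make the idea of stemming concrete, here is a deliberately naive suffix-stripping sketch in Python. It is neither Snowball nor the pure PHP English stemmer mentioned above; the suffix list and the minimum stem length of three characters are assumptions chosen only so that the "fish" example works.

```python
# Illustrative suffix list only; real stemmers use far richer rule sets.
SUFFIXES = ("ing", "ed", "er")


def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("fishing", "fished", "fisher", "fish"):
        print(word, "->", naive_stem(word))
```

Because all four words reduce to the same stem, a search for any one of them can match content containing the others.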



Morphological analysis
 ‣ Currently uses Soundex
 ‣ Not used in search as such
 ‣ Used for the “Did you mean?” feature
 ‣ If no search results found, then...
  • Match on Soundex code
  • Return nearest term/phrase by Levenshtein
    distance
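
The "Did you mean?" steps above can be sketched as follows. This is a self-contained Python illustration, not the Smart Search code (PHP has soundex() and levenshtein() built in). The function did_you_mean and the sample term list are hypothetical, and the Soundex variant shown is the common American one.

```python
# Soundex digit for each consonant; vowels, y, h and w have no code.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def did_you_mean(query, indexed_terms):
    """Closest indexed term sharing the query's Soundex code, or None."""
    code = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == code]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))


if __name__ == "__main__":
    print(did_you_mean("jomla", ["joomla", "search", "jungle"]))
```

Soundex narrows the candidates cheaply, and the edit distance then picks the closest spelling among them.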



Term weighting

Context         Multiplier
Title           1.7
Text            0.7
Meta            1.2
Path            2.0
Miscellaneous   0.3
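
A small Python sketch shows how such multipliers might influence ranking. The multiplier values come from the table above; the scoring rule (occurrences in each context times the multiplier, summed) is an assumption for illustration, since the slide does not say how Smart Search combines the multiplier with term frequency.

```python
# Multipliers taken from the table above.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def weighted_score(occurrences):
    """Sum of (occurrence count x context multiplier); hypothetical rule."""
    return sum(MULTIPLIERS[context] * count
               for context, count in occurrences.items())


if __name__ == "__main__":
    # One hit in the title outweighs two hits in the body text.
    print(weighted_score({"title": 1}) > weighted_score({"text": 2}))
```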




Classification




Taxonomies
 ‣ “Content maps” in Administrator
 ‣ Basis for faceted search
 ‣ Multi-level taxonomies not fully supported (yet)




Taxonomies - drop-downs




Taxonomies - checkboxes




Taxonomies - links




Database ERD




Smart Search Plug-ins
/plugins
  /content
    /finder
  /finder
    /categories
    /contacts
    /content
    /newsfeeds
    /weblinks
  /system
    /highlight




Smart Search Plug-ins
content/finder           finder/[type]
onContentBeforeSave      onFinderBeforeSave
onContentAfterSave       onFinderAfterSave
onContentAfterDelete     onFinderAfterDelete
onContentChangeState     onFinderChangeState
onCategoryChangeState    onFinderCategoryChangeState
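
One plausible reading of the pairing above is that the content/finder plug-in listens for the standard content events and re-dispatches each one to the finder/[type] plug-ins under the matching finder event name. The Python sketch below models that reading only; the event names are from the slide, while forward_event, the listener objects and the dispatch mechanism are assumptions and not Joomla's API.

```python
# Content event -> finder event, as paired on the slide.
EVENT_MAP = {
    "onContentBeforeSave": "onFinderBeforeSave",
    "onContentAfterSave": "onFinderAfterSave",
    "onContentAfterDelete": "onFinderAfterDelete",
    "onContentChangeState": "onFinderChangeState",
    "onCategoryChangeState": "onFinderCategoryChangeState",
}


def forward_event(content_event, finder_plugins, *args):
    """Call the matching finder handler on every plug-in that defines it."""
    finder_event = EVENT_MAP.get(content_event)
    if finder_event is None:
        return []  # not an event the indexer cares about
    return [getattr(plugin, finder_event)(*args)
            for plugin in finder_plugins
            if hasattr(plugin, finder_event)]


class ArticlePlugin:
    """Hypothetical finder plug-in that re-indexes an item after save."""

    def onFinderAfterSave(self, context, item_id):
        return f"reindex {context} #{item_id}"


if __name__ == "__main__":
    print(forward_event("onContentAfterSave", [ArticlePlugin()], "article", 42))
```

Under this reading, content that is changed without firing the standard content events never reaches the indexer, which is one of the cases where a purge and re-index is needed.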




Query parsing
Feature               URI argument      Query string
Terms                 q=Some+text       Some text
Phrases               q="Some+text"     "Some text"
Logical operators     q=This+and+that   This and that
Before a date         d1=2012-05-16     before:2012-05-16
After a date          d2=2012-05-18     after:2012-05-18
Content type filter   t[]=98233         type:Articles
Taxonomy filter       t[]=30922         author:Chris Davenport
Static filter         f=2
Highlight             qh=Some+text
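
The URI arguments in the table can be assembled with any URL library. The Python sketch below builds only the query-string part; the parameter names (q, d1, d2, t[], f) are from the table, while the helper name build_search_query is hypothetical and the site path the query would be appended to is left out because the slide does not give it.

```python
from urllib.parse import urlencode, parse_qs


def build_search_query(terms, before=None, after=None,
                       filters=(), static_filter=None):
    """Return a URL-encoded query string using the arguments in the table."""
    params = [("q", terms)]
    if before:
        params.append(("d1", before))   # before a date
    if after:
        params.append(("d2", after))    # after a date
    for node_id in filters:
        params.append(("t[]", str(node_id)))  # content type / taxonomy filters
    if static_filter is not None:
        params.append(("f", str(static_filter)))
    return urlencode(params)


if __name__ == "__main__":
    query = build_search_query("Some text", before="2012-05-16",
                               filters=[98233, 30922])
    print(query)
    print(parse_qs(query))
```

Note that urlencode percent-encodes the square brackets in t[] and any double quotes around a phrase; they decode back to the forms shown in the table.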




Results rendering
 ‣ com_finder
  • search
    ‣ default.php           Search results page
    ‣ form.php
    ‣ default_results.php

    ‣ default_result.php    For custom types
    ‣ default_[type].php

 ‣ mod_finder
    ‣ default.php           Search module


Layout overrides example




Alternative override




Tips and tricks
 ‣ HTML Parser
  • Invalid HTML can confuse the parser
  • Invalid UTF8 is ignored
  • Text in attributes is ignored




When to do a purge
 ‣ Indexing is incremental, so most of the time you don't need to.
 ‣ Purge and re-index after:
  • Changes to taxonomies that do not involve changes to content items
  • Changes to term weights
  • Changing the stemmer
  • Changes to content items that do not trigger the standard content events
 ‣ IMPORTANT
  • If you have static filters, they will be lost when you do a purge.




Tuning Smart Search
 ‣ Use the CLI for indexing
  • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing

 ‣ Out of memory issues
  • Please report out of memory issues so we can understand them better.
  • Reduce batch size
    ‣ Default is 50. Drop it to 5 or even 1.

  • Terms per batch
    ‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
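
To see why a smaller batch size helps with memory, consider this generic Python sketch of batched processing. It is not the Smart Search indexer; the default of 50 is taken from the slide, and the batches helper is a made-up name. Only one batch of items is held and processed at a time, so a smaller batch means a lower peak at the cost of more round trips.

```python
def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    content_ids = list(range(1, 8))
    for batch in batches(content_ids, size=3):
        # Each batch would be parsed, tokenised and written to the index
        # before the next one is loaded.
        print(batch)
```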



Where next?




Search Working Group
 ‣ Meeting at J and Beyond
  • 19 May 2012 11:30 AM

 ‣ Stable ready for merge July 2012
 ‣ Joomla 3.0 release September 2012
 ‣ Meeting at Joomla World Conference
  • San Jose, California, November 2012




Improved language support
 ‣ Improve common word support
 ‣ Improve stemmer support
  • Native PHP stemmers?

 ‣ Improve morphological coding
  • Non-English alternatives to Soundex

 ‣ Mixed language content items
  • Language tagging of tokens/terms?


Other possibilities
 ‣ Preserve static filters on purge/index
 ‣ Decouple indexing via message queues
 ‣ Easier support for range queries
 ‣ Search logging via JLog
 ‣ Variable-length token aggregation
 ‣ Multi-level taxonomies
 ‣ Add parsers for PDF, MS Word

Search API
 ‣ Very important going forward
 ‣ Too big a leap for Joomla 3.0
 ‣ Develop in parallel during 3.x cycle
 ‣ Use in Smart Search for Joomla 4.0




Documentation


http://docs.joomla.org/Category:Smart_Search




Questions?




Don't forget


   Search Working Group
         Meeting
    Saturday 19 May 2012
          11:30 AM




Image Credits

 Haystack - Mark Duncan CC-BY-SA 2.0 Generic
 http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg

 Under the hood - ilovebutter CC-BY 2.0 Generic
 http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg

 Child sucking thumb - Thahira CC-BY-SA 3.0 Unported
 http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg

 Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain
 http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg

 Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic
 http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg

 Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain
 http://commons.wikimedia.org/wiki/File:Index_Pages.jpg

 Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain
 http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG

 Linnaeus taxonomy - Public domain
 http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png

 All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.

JAB2012 Smart Search Presentation

  • 1. Smart Search and Beyond
  • 2. Who? Chris Davenport Production Leadership Team Smart Search and Beyond
  • 3. Solving the search problem Smart Search and Beyond
  • 4. Old Joomla Search Sucks! Cannot rank by relevance across content types Only very crude filtering Can be slow to search Smart Search and Beyond
  • 5. Table of Contents 01 Smart Search so far 02 Smart Search in action 03 Smart Search under the hood 04 Smart Search tips and tricks 05 Smart Search where next? Smart Search and Beyond
  • 6. A Short History ‣ Old Joomla Search • Introduced in Mambo • Largely unchanged since ‣ JXTended Finder for Joomla 1.5 ‣ Finder Integration Working Group • Smart Search for Joomla 2.5 ‣ Search Working Group Smart Search and Beyond
  • 7. Smart Search for Joomla 2.5 ‣ Separate index ‣ Auto-completion ‣ Facetted search ‣ Relevancy ordering ‣ Did you mean? ‣ ...and more besides Smart Search and Beyond
  • 8. Table of Contents 01 Smart Search so far 02 Smart Search in action 03 Smart Search under the hood 04 Smart Search tips and tricks 05 Smart Search where next? Smart Search and Beyond
  • 12. Table of Contents 01 Smart Search so far 02 Smart Search in action 03 Smart Search under the hood 04 Smart Search tips and tricks 05 Smart Search where next? Smart Search and Beyond
  • 13. Under the hood Smart Search and Beyond
  • 14. A problem in two halves Smart Search and Beyond
  • 15. First half: Indexing INDEX Raw data Smart Search and Beyond
  • 16. Second half: Querying Search INDEX Search queries results Smart Search and Beyond
  • 17. Search results Search results are rendered purely from data in the index, not the raw data. Smart Search and Beyond
  • 19. Indexing Parsing Stemming Tokenisation Analysis Token aggregation Term weighting Filtration Classification Smart Search and Beyond
  • 21. Parsing ‣ Extract plain text from raw data • HTML, RTF supported out-of-the-box • PDF, MS Word could be supported ‣ For example, HTML • Essentially the same as PHP strip_tags Smart Search and Beyond
  • 22. Tokenisation ‣ Fold to lowercase ‣ Special handling for plus, dash, comma, dot and quotes ‣ Remove non-alphanumerics ‣ Replace multiple spaces with one space ‣ Special support for Chinese Smart Search and Beyond
  • 23. Token aggregation On a clear disk you can seek forever on a clear on a a clear clear disk on a clear a clear disk clear disk you disk you can disk you you can can seek disk you can you can seek can seek forever seek forever seek forever Smart Search and Beyond
  • 24. Filtration ‣ “Stop word removal” • Not removed, just given a low weight ‣ jos_finder_terms_common ‣ English only • Other languages need to add their common words to the table Smart Search and Beyond
  • 25. Stemming fishing fished fish fisher fish Smart Search and Beyond
  • 26. Stemming ‣ “Snowball” is used by default • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish • BUT it requires PHP extension ‣ “English only” uses a pure PHP stemmer • Recommended for all English sites Smart Search and Beyond
  • 27. Morphological analysis ‣ Currently uses Soundex ‣ Not used in search as such ‣ Used for the “Did you mean?” feature ‣ If no search results found, then... • Match on Soundex code • Return nearest term/phrase by Levenshtein distance Smart Search and Beyond
  • 28. Term weighting Context Multiplier Title 1.7 Text 0.7 Meta 1.2 Path 2.0 Miscellaneous 0.3 Smart Search and Beyond
  • 30. Taxonomies ‣ “Content maps” in Administrator ‣ Basis for facetted search ‣ Multi-level taxonomies not fully supported (yet) Smart Search and Beyond
  • 31. Taxonomies - drop-downs Smart Search and Beyond
  • 32. Taxonomies - checkboxes Smart Search and Beyond
  • 33. Taxonomies - links Smart Search and Beyond
  • 35. Smart Search Plug-ins /plugins /content /finder /system /finder /categories /highlight /contacts /content /newsfeeds /weblinks Smart Search and Beyond
  • 36. Smart Search Plug-ins
      content/finder              finder/[type]
      onContentBeforeSave         onFinderBeforeSave
      onContentAfterSave          onFinderAfterSave
      onContentAfterDelete        onFinderAfterDelete
      onContentChangeState        onFinderChangeState
      onCategoryChangeState       onFinderCategoryChangeState
  • 37. Query parsing
                              URI argument       Query string
      Terms                   q=Some+text        Some text
      Phrases                 q="Some+text"      "Some text"
      Logical operators       q=This+and+that    This and that
      Before a date           d1=2012-05-16      before:2012-05-16
      After a date            d2=2012-05-18      after:2012-05-18
      Content type filter     t[]=98233          type:Articles
      Taxonomy filter         t[]=30922          author:Chris Davenport
      Static filter           f=2
      Highlight               qh=Some+text
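As a rough illustration of the URI arguments on this slide, the following Python sketch pulls them out of a query string with the standard library. The argument names (`q`, `d1`, `d2`, `t[]`, `f`, `qh`) come from the slide. The function name `read_search_args` and the shape of the returned dictionary are hypothetical, and this says nothing about how com_finder itself parses the request.

```python
from urllib.parse import parse_qs


def read_search_args(query_string: str) -> dict:
    """Collect the Smart Search URI arguments from a raw query string."""
    args = parse_qs(query_string)  # "+" is decoded to a space

    def first(key):
        # parse_qs returns a list per key; take the first value, if any.
        return args.get(key, [None])[0]

    return {
        "query": first("q"),
        "before": first("d1"),
        "after": first("d2"),
        # t[] can repeat: one entry per content type or taxonomy filter.
        "taxonomy_ids": [int(value) for value in args.get("t[]", [])],
        "static_filter": first("f"),
        "highlight": first("qh"),
    }


parsed = read_search_args("q=Some+text&d1=2012-05-16&t[]=98233&t[]=30922&f=2")
```

Here `parsed["query"]` is `"Some text"`, `parsed["before"]` is `"2012-05-16"` and `parsed["taxonomy_ids"]` holds both filter ids.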
  • 38. Results rendering
      ‣ com_finder
        • search (search results page)
          ‣ default.php
          ‣ form.php
          ‣ default_results.php
          ‣ default_result.php
          ‣ default_[type].php (for custom types)
      ‣ mod_finder (search module)
        ‣ default.php
  • 39. Layout overrides example
  • 41. Table of Contents 01 Smart Search so far 02 Smart Search in action 03 Smart Search under the hood 04 Smart Search tips and tricks 05 Smart Search where next?
  • 42. Tips and tricks
  • 43. Tips and tricks ‣ HTML Parser • Invalid HTML can confuse the parser • Invalid UTF-8 is ignored • Text in attributes is ignored
  • 44. When to do a purge ‣ Indexing is incremental, so most of the time you don't need to. ‣ You do need to purge and re-index after: • Changes to taxonomies that do not involve changes to content items • Changes to term weights • Changing the stemmer • Changes to content items that do not trigger the standard content events ‣ IMPORTANT • If you have static filters they will be lost when you do a purge.
  • 45. Tuning Smart Search ‣ Use the CLI for indexing • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing ‣ Out of memory issues • Please report out of memory issues so we can understand them better. • Reduce batch size: default is 50; drop it to 5 or even 1. • Terms per batch: can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
  • 46. Table of Contents 01 Smart Search so far 02 Smart Search in action 03 Smart Search under the hood 04 Smart Search tips and tricks 05 Smart Search where next?
  • 48. Search Working Group ‣ Meeting at J and Beyond • 19 May 2012 11:30 AM ‣ Stable ready for merge July 2012 ‣ Joomla 3.0 release September 2012 ‣ Meeting at Joomla World Conference • San Jose, California, November 2012
  • 49. Improved language support ‣ Improve common word support ‣ Improve stemmer support • Native PHP stemmers? ‣ Improve morphological coding • Non-English alternatives to Soundex ‣ Mixed language content items • Language tagging of tokens/terms?
  • 50. Other possibilities ‣ Preserve static filters on purge/index ‣ Decouple indexing via message queues ‣ Easier support for range queries ‣ Search logging via JLog ‣ Variable-length token aggregation ‣ Multi-level taxonomies ‣ Add parsers for PDF, MS Word
  • 51. Search API ‣ Very important going forward ‣ Too big a leap for Joomla 3.0 ‣ Develop in parallel during 3.x cycle ‣ Use in Smart Search for Joomla 4.0
  • 54. Don't forget Search Working Group Meeting Saturday 19 May 2012 11:30 AM
  • 55. Image Credits
      ‣ Haystack - Mark Duncan, CC-BY-SA 2.0 Generic
        http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
      ‣ Under the hood - ilovebutter, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
      ‣ Child sucking thumb - Thahira, CC-BY-SA 3.0 Unported
        http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
      ‣ Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing, Public domain
        http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
      ‣ Magician - Kellar: Levitation, magician poster, ca. 1894, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
      ‣ Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.), Public domain
        http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
      ‣ Twenty Questions - DuMont Television/Rosen Studios, New York-photographer, Public domain
        http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
      ‣ Linnaeus taxonomy - Public domain
        http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
      ‣ All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.