4. Old Joomla Search Sucks!
‣ Cannot rank by relevance across content types
‣ Only very crude filtering
‣ Can be slow to search
Smart Search and Beyond
5. Table of Contents
01 Smart Search so far
02 Smart Search in action
03 Smart Search under the hood
04 Smart Search tips and tricks
05 Smart Search where next?
6. A Short History
‣ Old Joomla Search
• Introduced in Mambo
• Largely unchanged since
‣ JXTended Finder for Joomla 1.5
‣ Finder Integration Working Group
• Smart Search for Joomla 2.5
‣ Search Working Group
7. Smart Search for Joomla 2.5
‣ Separate index
‣ Auto-completion
‣ Faceted search
‣ Relevancy ordering
‣ Did you mean?
‣ ...and more besides
21. Parsing
‣ Extract plain text from raw data
• HTML, RTF supported out-of-the-box
• PDF, MS Word could be supported
‣ For example, HTML
• Essentially the same as PHP strip_tags
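The HTML case can be illustrated with a minimal Python sketch. This is not the Smart Search parser itself; `TextExtractor` and `html_to_text` are hypothetical names, and the behaviour is only assumed to be roughly comparable to PHP strip_tags: tags and their attributes are dropped, text nodes are kept.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and discards tags and their attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.chunks.append(data)


def html_to_text(html):
    """Return the plain text contained in an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text('<p>Smart <b>Search</b></p>'))
```

Because only text nodes are collected, anything stored in an attribute (an image alt text, for example) never reaches the output.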
22. Tokenisation
‣ Fold to lowercase
‣ Special handling for plus, dash, comma, dot and quotes
‣ Remove non-alphanumerics
‣ Replace multiple spaces with one space
‣ Special support for Chinese
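A rough Python sketch of three of these steps (case folding, dropping non-alphanumerics, collapsing spaces). The function name `tokenise` is hypothetical. The special handling for plus, dash, comma, dot and quotes and the Chinese support are not reproduced, because the slide does not say how they work. Replacing punctuation with a space rather than deleting it outright is also an assumption, made so that neighbouring words do not run together.

```python
import re


def tokenise(text):
    """Split text into lowercase alphanumeric tokens."""
    text = text.lower()
    # Assumption: punctuation (and underscores) become spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []


if __name__ == "__main__":
    print(tokenise("On a CLEAR   disk, you can seek forever!"))
```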
23. Token aggregation
On a clear disk you can seek forever
Single words: on | a | clear | disk | you | can | seek | forever
Pairs: on a | a clear | clear disk | disk you | you can | can seek | seek forever
Triples: on a clear | a clear disk | clear disk you | disk you can | you can seek | can seek forever
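The aggregation shown on this slide amounts to generating every run of one, two and three consecutive tokens. A minimal Python sketch, with the hypothetical name `aggregate` and the maximum length of three taken from the example sentence:

```python
def aggregate(tokens, max_len=3):
    """Return every run of 1..max_len consecutive tokens as a phrase."""
    phrases = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[start:start + n]))
    return phrases


if __name__ == "__main__":
    words = "on a clear disk you can seek forever".split()
    for phrase in aggregate(words):
        print(phrase)
```

For the eight-word example this yields 8 single words, 7 pairs and 6 triples.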
24. Filtration
‣ “Stop word removal”
• Not removed, just given a low weight
‣ jos_finder_terms_common
‣ English only
• Other languages need to add their common words to the table
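The "low weight rather than removal" idea might be sketched in Python as follows. The word list and both weight values are made-up placeholders; the real common words live in the jos_finder_terms_common table and the real weights are not given on the slide.

```python
# Hypothetical stand-in for the contents of jos_finder_terms_common.
COMMON_WORDS = {"a", "on", "you", "can"}

# Hypothetical weights, chosen only for illustration.
COMMON_WEIGHT = 0.1
NORMAL_WEIGHT = 1.0


def weigh_tokens(tokens):
    """Keep every token, but give common words a low weight."""
    return [
        (token, COMMON_WEIGHT if token in COMMON_WORDS else NORMAL_WEIGHT)
        for token in tokens
    ]


if __name__ == "__main__":
    print(weigh_tokens(["on", "a", "clear", "disk"]))
```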
26. Stemming
‣ “Snowball” is used by default
• Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
• BUT it requires a PHP extension
‣ “English only” uses a pure PHP stemmer
• Recommended for all English sites
27. Morphological analysis
‣ Currently uses Soundex
‣ Not used in search as such
‣ Used for the “Did you mean?” feature
‣ If no search results found, then...
• Match on Soundex code
• Return nearest term/phrase by Levenshtein distance
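The two-step lookup behind "Did you mean?" can be sketched in Python. This is an illustration under assumptions, not the Smart Search code: `soundex` implements the standard American Soundex rules, `levenshtein` is the usual dynamic-programming edit distance, and `did_you_mean` and its list of known terms are hypothetical.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Four-character American Soundex code, or '' for empty input."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def did_you_mean(query, known_terms):
    """Suggest the known term that sounds alike and is closest in spelling."""
    target = soundex(query)
    candidates = [t for t in known_terms if soundex(t) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query.lower(), t.lower()))


if __name__ == "__main__":
    print(did_you_mean("serch", ["search", "sharks", "smart", "sort"]))
```

Matching on the Soundex code first keeps the candidate list short, so the more expensive edit-distance comparison only runs on terms that already sound similar.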
28. Term weighting
Context         Multiplier
Title           1.7
Text            0.7
Meta            1.2
Path            2.0
Miscellaneous   0.3
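One way to read the multipliers is that each occurrence of a term counts for more or less depending on where it appears. The Python sketch below uses the values from the table, but the way they are combined (a simple sum of count times base weight times multiplier) is an assumption, and `weighted_score` is a hypothetical name.

```python
# Multipliers taken from the slide.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def weighted_score(occurrences, base_weight=1.0):
    """Score a term from a mapping of context name to occurrence count."""
    return sum(
        count * base_weight * MULTIPLIERS[context]
        for context, count in occurrences.items()
    )


if __name__ == "__main__":
    # One hit in the title versus two hits in the body text.
    print(weighted_score({"title": 1}))
    print(weighted_score({"text": 2}))
```

Under this reading a single match in the title outweighs two matches in the body text.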
37. Query parsing
                     URI argument       Query string
Terms                q=Some+text        Some text
Phrases              q="Some+text"      "Some text"
Logical operators    q=This+and+that    This and that
Before a date        d1=2012-05-16      before:2012-05-16
After a date         d2=2012-05-18      after:2012-05-18
Content type filter  t[]=98233          type:Articles
Taxonomy filter      t[]=30922          author:Chris Davenport
Static filter        f=2                -
Highlight            qh=Some+text       -
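To make the query string column concrete, here is a small Python sketch that pulls date filters and quoted phrases out of a typed query. It is only an illustration: `parse_query` is a hypothetical function, the type: and taxonomy filters are deliberately left out (the slide does not say how values containing spaces are delimited), and logical operators are simply left among the terms.

```python
import re

DATE_FILTER = re.compile(r"\b(before|after):(\d{4}-\d{2}-\d{2})")
PHRASE = re.compile(r'"([^"]+)"')


def parse_query(query):
    """Split a typed query into date filters, quoted phrases and plain terms."""
    parsed = {"before": None, "after": None, "phrases": [], "terms": []}

    def take_date(match):
        parsed[match.group(1)] = match.group(2)
        return " "  # remove the filter from the remaining text

    def take_phrase(match):
        parsed["phrases"].append(match.group(1))
        return " "

    rest = DATE_FILTER.sub(take_date, query)
    rest = PHRASE.sub(take_phrase, rest)
    parsed["terms"] = rest.split()
    return parsed


if __name__ == "__main__":
    print(parse_query('joomla "smart search" after:2012-05-18'))
```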
43. Tips and tricks
‣ HTML Parser
• Invalid HTML can confuse the parser
• Invalid UTF8 is ignored
• Text in attributes is ignored
44. When to do a purge
‣ Indexing is incremental, so most of the time you don't need to.
‣ Do a purge after:
• Changes to taxonomies that do not involve changes to content items
• Changes to term weights
• Changing the stemmer
• Changes to content items that do not trigger the standard content events
‣ IMPORTANT
• If you have static filters they will be lost when you do a purge.
45. Tuning Smart Search
‣ Use the CLI for indexing
• http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
‣ Out of memory issues
• Please report out of memory issues so we can understand them better.
• Reduce batch size
‣ Default is 50. Drop it to 5 or even 1.
• Terms per batch
‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
48. Search Working Group
‣ Meeting at J and Beyond
• 19 May 2012 11:30 AM
‣ Stable ready for merge July 2012
‣ Joomla 3.0 release September 2012
‣ Meeting at Joomla World Conference
• San Jose, California, November 2012
49. Improved language support
‣ Improve common word support
‣ Improve stemmer support
• Native PHP stemmers?
‣ Improve morphological coding
• Non-English alternatives to Soundex
‣ Mixed language content items
• Language tagging of tokens/terms?
50. Other possibilities
‣ Preserve static filters on purge/index
‣ Decouple indexing via message queues
‣ Easier support for range queries
‣ Search logging via JLog
‣ Variable-length token aggregation
‣ Multi-level taxonomies
‣ Add parsers for PDF, MS Word
51. Search API
‣ Very important going forward
‣ Too big a leap for Joomla 3.0
‣ Develop in parallel during 3.x cycle
‣ Use in Smart Search for Joomla 4.0
54. Don't forget
Search Working Group Meeting
Saturday 19 May 2012, 11:30 AM
55. Image Credits
Haystack - Mark Duncan CC-BY-SA 2.0 Generic
http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
Under the hood - ilovebutter CC-BY 2.0 Generic
http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
Child sucking thumb - Thahira CC-BY-SA 3.0 Unported
http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain
http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic
http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain
http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain
http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
Linnaeus taxonomy - Public domain
http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.