Before Google, before search, heck, even before SQL, search and retrieve meant one thing: the library. And you think you have a lot of noisy data in crusty formats to search? Even if you don't have 100 million books in your catalog, Solr applications for library data offer practical, general purpose solutions to some of the knottiest search problems.
1. Practical Search with Solr:
Beyond just Looking it Up
29 April 2010
Bess Sadler, Stanford University Library
Naomi Dushay, Stanford University Library
Tom Burton-West, the Hathi Trust Project
2. Agenda
(Slides posted at the end of this presentation; full replay available within ~48 hours of the live webcast)
Introductions
What, the data’s dirty? Bess Sadler
Clean data is easy to search and browse. However, you probably
don’t have clean data.
Queries are not obvious: Naomi Dushay
Browsing ordered lists; Dismax; When simple search is not enough
Big, Bigger, Biggest: Tom Burton-West
Large scale issues: Phrase queries and common words; OCR
Q&A
Lucid Imagination, Inc. – http://www.lucidimagination.com
3. About the Presenters
Bess Sadler, Stanford University Library
Senior Software Engineer at Stanford University Library, and co-founder
of Blacklight (http://projectblacklight.org), formerly the Chief Architect
for the Online Library Environment at the University of Virginia. www-sul.stanford.edu
Naomi Dushay, Stanford University Library
Senior Software Engineer at Stanford University Library, expert in digital
library research; formerly a member of the core infrastructure team of
the National Science Digital Library. www-sul.stanford.edu
Tom Burton-West, the Hathi Trust Project
Information Retrieval Programmer in the University of Michigan’s Digital
Library Production Service; works on the Hathi Trust Large Scale Search
project and blogs about it at www.hathitrust.org/blogs.
4. What, the data’s dirty?
Bess Sadler, Stanford University Library
Clean data is easy to search and browse. However, you probably don’t have clean data.
5. Before we begin, you should know
Some basics around Solr that we won’t cover
HTTP GETs and POSTs
Search Index is not a DBMS
XML
Strong Data Typing
We’ll refer to these in the talk; if you’re unfamiliar with them,
see: bit.ly/practical-solr for some quick definitions of these
terms
6. Mapping Library Data Types:
Your data is not as different as you think it is

Library: Books; Personal Name; Publication; combined facets such as Book, Video, Journal, Newspaper, Physical Artifacts, Digital artifacts
Engineering: Specs; Concept; Formal Documentation; test results, analog data files, media files, data sets, rich documents; SKUs
Health Care: Research papers; Disease Types; Journals; test results, analog data files, media files, patient records
Intellectual Property: Patents; Mechanisms; Filing and Disclosure Docs; authors, titles, prior art, assignees, claims, descriptions, figures
Legal: Contracts; Parties; Rulings and court documents; exhibits, photos, criminal evidence, emails/e-discovery

Other domains (pharmaceutical, manufacturing, etc.) are similar in the diversity of document types and data types within the documents.
7. Data is weird
Not Normal: The data is not always in the fields or places you
expect, even when you have a detailed spec.
Local practices differ
Practices change over time
Sometimes stuff is just wrong (but remember: it’s better to be
consistent than right)
Be prepared for cleanup – indexing your data is going to uncover a
lot of problems you never knew about before
Formats are not necessarily optimized for discovery
For example: PDFs are optimized for presentation, not discovery;
putting them into a discovery system presents its own challenges.
8. Using Solr Cell (aka Extracting Request Handler)/PDFBox
Good news examples
Please, God, just some metadata!
When we got lucky, we had another source of the metadata
Bad News examples
Typography
Text inset boxes
It’s only a little easier than OCR …
Advanced PDFBox options only work
when there is a lot of consistency
9. Search vs. Browse
Search:
More focused -- the user is looking for a known item, or has a
specific question to be answered. (e.g., a citation, a part
number, a specific judicial ruling, “that book by Steinbeck”)
Browse:
The user has a generalized, nebulous information need that
they will refine as they interact with a collection of resources.
(e.g., finding a good book to read, shopping for accessories,
keeping current in one’s field)
10. Search Challenges
Relevancy – indexing the full text isn’t good enough
Fielded search – context is meaningful (“Cook” example)
Fielded search – will data be where you expect it to be?
Users don’t speak your jargon:
“indian cooking” is “Cookery Indic”
Stemming -- Nature/Naturalism
How do you know you have your relevancy rankings right?
You ask!
11. Why Browsing Is Important
Search is not enough
What is a facet?
Here is how it works in Solr
Here is why your users will like you for doing it.
More challenges related to browsing coming up…
12. Queries are not obvious
Naomi Dushay, Stanford University Library
Browsing Ordered Lists; A Little About Dismax; and When Simple Search Is Not Enough
13. Candidates for Browsing
One strategy for data that is not normalized is browsing ordered lists.
Names (Employees, Customers, Students, Authors)
Part Numbers
When Spelling is Unclear: uighur, uyghur, uyghar, uigher
Strings of Both Letters and Digits, such as SKUs, Part Numbers, Invoice Numbers, Transaction Record Numbers
Addresses in Sequence
Titles (Books, …)
14. Some Values are Easily Ordered
Numeric Values
Dates (if normalized)
Some Letter Tokens (e.g. categories)
15. Values Difficult to Sort Lexically
Digits in non-numeric context
lexical sort of numbers: 1, 111, 20, 222, 8 …
“A715C74”
“The Princess and the Pea”
“Sir Isaac Newton”
“Die Fledermaus”
piña vs. pina
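The digit problem is easy to see in a couple of lines of Python (used here purely for illustration):

```python
# Lexicographic sorting compares digit strings character by character,
# so "111" lands before "20"; sorting on the numeric value fixes it.
values = ["1", "111", "20", "222", "8"]

lex = sorted(values)           # character-by-character comparison
num = sorted(values, key=int)  # compare as integers

print(lex)  # ['1', '111', '20', '222', '8']
print(num)  # ['1', '8', '20', '111', '222']
```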
16. Call Numbers are Difficult to Sort Lexically
(applies to SKUs, Part Numbers, non-uniform serial numbers across domains, etc.)
Letters combined with digits
Some digits are decimals, some are integers
Inconsistent punctuation
Suffixes to be ignored for sorting purposes
Examples:
A7 .L3 .V2
A7 .L3 V2
A7 .L3 V.2
A7 .L3 1902 V.2
M5 .L3 2000 .K2 1880
M5 .L3 .K451 V.5
M5 .L3 K2 D MAJ 1880
M5 .L3 K2 OP.7:NO.6 1880
A7 .L3 1902 V2 TANEYTOWN
M5 .L3 K2 .Q2 MD:CRAP0*DMA 1981
17. Normalization for Sorting is a Process
It might not need to be perfect.
The cycle: programmatically normalize data -> assess sorted output (humans and automated tests find dirty data) -> clean up data -> repeat.
18. Basic Sorting Normalization Strategies
Normalize Letter Case (e.g. all lowercase)
Leading Spaces (can use zeros for digits; space works)
Trailing Spaces
Skip Ignored Characters (“The Fly”, “Ms. Jane Doe”)
Numbers sorted as an Integer (leading spaces/zeros),
vs. as a Decimal (trailing spaces/zeros)?
Normalization should
accommodate dirty data
whenever practical.
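A minimal sketch of several of these strategies in Python (the function name and the fixed padding width are illustrative assumptions; skipping ignored leading words like “The” is not shown):

```python
import unicodedata

def sort_key(raw: str, int_width: int = 10) -> str:
    """Build a sortable key: lowercase, fold accents, trim outer spaces,
    and left-pad runs of digits with zeros so numbers sort as integers."""
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))  # piña -> pina
    s = s.lower().strip()
    out, i = [], 0
    while i < len(s):
        if s[i].isdigit():
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            out.append(s[i:j].rjust(int_width, "0"))  # "8" -> "0000000008"
            i = j
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

With keys like these, an ordinary string sort puts “A8” before “A20” before “A111”.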
20. Weird Values Happen
ZDVD 4971
MFILM 24 REEL 5
Shelved by title
XX(123457)
call # varies
no call number
21. Solr Performance Issue: Query Time Sorting
q=sortfield:["666" TO *]&rows=10
Will sort ALL of the sortfield values at query time.
Response time is abysmal for sortfields with huge numbers of values.
Try this: Terms Component
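A TermsComponent request can be assembled as a plain query string; a sketch (the host, handler name, and the example shelfkey value are assumptions to adapt to your deployment):

```python
from urllib.parse import urlencode

# Host, core, and handler path are assumptions; adjust to your deployment.
SOLR = "http://localhost:8983/solr/select"

params = {
    "qt": "terms",                # route to a TermsComponent request handler
    "terms.fl": "shelfkey",       # the lexically sorted sort-key field
    "terms.lower": "m5 .l3 2000", # start the browse list at this value
    "terms.lower.incl": "true",
    "terms.limit": 10,            # page size
}
url = SOLR + "?" + urlencode(params)
```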
23. TermsComponent queries the part of the index that is already lexically sorted for each field:
/solr/alphaTerms?
terms.fl=shelfkey&
terms.lower=lc+hc++0337.000000+f0.500000+f0.512000&
per_page=10
24. Now that I Have Terms, How do I get the Documents?
/solr/select?
q=sortkey:("a++67+mn++4" OR "a++67+mp+85")
&qt=standard
(URL-encode if you need to)
25. Sortfield Value : Document NOT always 1:1
1:Many One Sortfield Value – Multiple Documents
One product, multiple generations of user manuals
One court case, multiple briefing and disclosure documents
Many:1 One Document – Multiple Sortfield Values
Which value are you going to pick for the browsing list?
Allow user to select in UI, if possible
26. What About Browsing Before the Known Sort Value?
Show n entries before and n entries after the matching value.
[Photo of fiction book spines: http://hayward-ca.gov/refreshyourlife/wp-content/uploads/2009/07/fiction-spines.jpg]
27. Create Reverse Sortkey
Use a simple character mapping to reverse the sort order:

IF SORTKEY HAS    REVERSEKEY GETS
0                 Z
1                 Y
…                 …
9                 Q
A                 P
…                 …
Z                 0
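This mapping is just a character translation; a Python sketch, assuming sort keys drawn from digits and uppercase letters (any other character passes through unchanged):

```python
import string

# Map 0-9A-Z onto its own reversal (0<->Z, 1<->Y, ..., 9<->Q, A<->P, ...),
# so an ascending sort on the reverse key walks the browse list backwards.
ALPHABET = string.digits + string.ascii_uppercase
REVERSE = str.maketrans(ALPHABET, ALPHABET[::-1])

def reverse_key(sortkey: str) -> str:
    """The mapping is an involution: applying it twice restores the key."""
    return sortkey.translate(REVERSE)
```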
29. A Little About Dismax
30. Solr QueryParsing Strategies

FEATURE                                               LUCENE   DISMAX
Boolean                                               √
Each Text Box -> Groups of Index Fields
  (“Author”, “Title”, “Subject” searches)             √        √
Each Text Box -> Complex Boosting Equation            yuck     √
Multiple Text Boxes                                   yuck     √
Multiple Query Words Match Across Fields                       √
Boosting Matches                                      simple   √
31. Dismax (disjunction max) Query Parser:
Some of My Favorite Things
Assign boost values for field matching at query time BUT
complex boosting formulae can reside in solrconfig.xml
Index can be neutral; assign query time boosting to fields for
different types of queries
Easy to boost exact phrase matches higher than query terms
scattered across document.
Tune how many query words MUST match, and what the
other matching thresholds/parameters might be
http://wiki.apache.org/solr/DisMaxRequestHandler
32. Example Dismax Request Handler
<!-- author search request handler -->
<requestHandler name="search_author" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<!-- require 4 or more terms to match … -->
<str name="mm">4<-1 4<90%</str>
<!-- boost formula -->
<str name="qf">author_unstem^10 author native_script_author</str>
<!-- boost phrase matches -->
<str name="pf">author_unstem^100 author^10 native_script_author^10</str>
…
http://wiki.apache.org/solr/DisMaxRequestHandler
33. Sometimes,
Simple Search + Facets
is Not Enough
34. WHEN isn’t it enough?
Pay attention to user feedback
Study Search Logs
Queries without results
35. Our Users Also Asked for:
Boolean
Targeting a particular (group of) fields
“… combined searching feature so that I can specify the author
and title.”
(author) Mozart (title) sonata 21 – not a book about Mozart’s
sonatas
“I often search publisher AND year, or publisher AND place of
publication, and occasionally need all three terms in
combination.”
(publisher) “Little, Brown & Co” – not “The Little Brown Jug”
Plaintiff, Defendant, Attorney – all?
36. Search Form has More Than One Text Box
Want Features of Dismax
Need Way to Boost Appropriately for Each Text Box
Need Way to Combine Text Boxes
37. Local Params
LocalParams allow additional, localized instructions to be sent as part of the query.
Ways to Parse Query Terms
Send in Non-Default Values for Variables
Use Variables Declared in Request Handler That Don’t Map To QueryParser Arguments
http://wiki.apache.org/solr/LocalParams
39. Using LocalParams Variables
Text boxes combined with AND
_query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" AND
_query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
Text boxes combined with OR
_query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" OR
_query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
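Building these nested _query_ strings by hand is error-prone; a small helper sketch in Python (the qf_title/pf_title variable names are examples and must match those declared in your request handler's defaults):

```python
def nested_query(alias: str, terms: str) -> str:
    """Wrap one text box's terms in a {!dismax} nested query clause.

    `alias` selects the per-box qf/pf variables (e.g. qf_title, pf_title)
    declared in the request handler's defaults in solrconfig.xml.
    """
    return '_query_:"{!dismax qf=$qf_%s pf=$pf_%s}%s"' % (alias, alias, terms)

# Combine two text boxes with AND (use " OR " for the other mode)
q = nested_query("title", "title terms") + " AND " + nested_query("author", "author terms")
```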
40. Note: DISMAX doesn’t do Boolean within the text boxes; there are workarounds:
edismax (Solr 1.5)
faking it:
http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf
41. My Favorite Places To Find Information
LucidImagination Search
http://www.lucidimagination.com/search/
(NOT a coerced statement!)
Solr wikis
http://wiki.apache.org/solr/FrontPage
42. Big, Bigger, Biggest
Tom Burton-West, Hathi Trust Project
Large scale issues:
Phrase queries and common words
OCR
43. Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that
will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
Must scale to 20 million full-text volumes
Very long documents compared to
most large-scale search applications
Multilingual collection
OCR quality varies
44. Index Size, Caching, and Memory
Our documents average about 300 pages
which is about 700KB of OCR.
Our 5 million document index is between 2 and 3 terabytes.
About 300 GB per million documents
Large index means disk I/O is bottleneck
Tradeoff JVM vs OS memory
Solr uses OS memory (disk I/O caching) for caching of postings
Memory available for disk I/O caching has most impact on response
time (assuming adequate cache warming)
Fitting entire index in memory not feasible with terabyte size index
45. Response time varies with query
Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
46. Slowest 5% of queries
The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect the response time of other queries:
Cache pollution
Contention for resources
Slowest queries are phrase queries containing common words.
[Chart: response time at the 95th percentile, in seconds (log scale), for query numbers 940 through 1,000]
47. Query processing
Phrase queries use the position index (Boolean queries do not).
The position index accounts for 85% of index size.
The position list for a common word such as “the” can be many GB in size.
This causes lots of disk I/O.
Solr depends on the operating system’s disk cache to reduce disk I/O for words that occur in more than one query.
I/O from phrase queries containing common words pollutes that cache.
48. Slow Queries
Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
4 MB read for the Boolean query; 9,000+ MB read for the phrase query.

WORD        NUMBER OF DOCUMENTS   POSTINGS LIST (MB)   TERM OCCURRENCES (MILLIONS)   POSITION LIST (MB)
the         800,000               0.8                  4,351                         4,351
of          892,000               0.89                 2,795                         2,795
and         769,000               0.77                 1,870                         1,870
literature  435,000               0.44                 9                             9
generation  414,000               0.41                 5                             5
lives       432,000               0.43                 5                             5
beat        278,000               0.28                 1                             1
TOTAL                             4.02                                               9,036
49. Why not use Stop Words?
The word “the” occurs more than 4 billion times in our 1 million
document index.
Removing “stop” words (“the”, “of” etc.) not desirable for our use cases.
Couldn’t search for many phrases
“to be or not to be”
“the who”
“man in the moon” vs. “man on the moon”
Stop words in one language are content words in another language
German stop words “war” and “die” are content words in English
English stop words “is” and “by” are content words (“ice” and “village”)
in Swedish
50. “CommonGrams”
Ported Nutch “CommonGrams” algorithm to Solr
Create Bi-Grams selectively for any two word sequence containing
common terms
Slowest query: “The lives and literature of the beat generation”
“the-lives” “lives-and”
“and-literature” “literature-of”
“of-the” “the-beat” “generation”
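The query-time behavior can be sketched as follows (a simplified stand-in for the ported filter, with a toy common-word list; the real list should come from analyzing your own corpus and slow-query logs):

```python
# Toy common-word list; illustrative only.
COMMON = {"the", "of", "and", "a", "an", "in", "to"}

def common_grams_query(tokens):
    """Replace each adjacent pair containing a common word with a bigram.

    A plain token survives only if it forms no bigram with either
    neighbor, mirroring the query-time CommonGrams behavior.
    """
    out = []
    for i, tok in enumerate(tokens):
        pairs_with_next = i + 1 < len(tokens) and (
            tok in COMMON or tokens[i + 1] in COMMON)
        pairs_with_prev = i > 0 and (
            tok in COMMON or tokens[i - 1] in COMMON)
        if pairs_with_next:
            out.append(tok + "-" + tokens[i + 1])
        elif not pairs_with_prev:
            out.append(tok)
    return out
```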
51. Standard index vs. CommonGrams

Standard Index
WORD        OCCURRENCES IN CORPUS (MILLIONS)   DOCS (THOUSANDS)
the         2,013                              386
of          1,299                              440
and         855                                376
literature  4                                  210
lives       2                                  194
generation  2                                  199
beat        0.6                                130
TOTAL       4,176

CommonGrams
TERM            OCCURRENCES IN CORPUS (MILLIONS)   DOCS (THOUSANDS)
of-the          446                                396
generation      2.42                               262
the-lives       0.36                               128
literature-of   0.35                               103
lives-and       0.25                               115
and-literature  0.24                               77
the-beat        0.06                               26
TOTAL           450
52. Comparison of Response Time (ms)

                AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
Standard Index  459       32       146    6,784   120,595
CommonGrams     68        3        71     2,226   7,800
53. Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and
discovered additional “common words” to be added to our list.
We used Solr Admin panel to run our slowest queries from our
logs with the “debug” flag checked.
We discovered that words such as “l’art” were being split into
two token phrase queries.
We used the Solr Admin Analysis tool and determined that the
analyzer we were using was the culprit.
54. Other issues
We broke Solr … temporarily
Dirty OCR in combination with over 200 languages creates
indexes with over 2.4 billion unique terms
Solr/Lucene index size was limited to 2.1 Billion unique terms
Patched: Now it’s 274 Billion
Dirty OCR is difficult to remove without removing “good” words.
Because Solr/Lucene tii/tis index uses pointers into the frequency
and position files we suspect that the performance impact is
minimal compared to disk I/O demands, but we will be testing
soon.
55. Q&A
Download these slides at
http://bit.ly/practical-solr
On demand replay is
available within 24-48
hours of the live webcast