Before Google, before search, heck, even before SQL, search and retrieve meant one thing: the library. And you think you have a lot of noisy data in crusty formats to search? Even if you don't have 100 million books in your catalog, Solr applications for library data offer practical, general purpose solutions to some of the knottiest search problems.
1. Practical Search with Solr:
Beyond just Looking it Up
29 April 2010
Bess Sadler, Stanford University Library
Naomi Dushay, Stanford University Library
Tom Burton-West, the Hathi Trust Project
2. Agenda
(Slides posted at the end of this presentation; full replay available within ~48 hours of the live webcast)
Introductions
What, the data’s dirty? Bess Sadler
Clean data is easy to search and browse. However, you probably
don’t have clean data.
Queries are not obvious: Naomi Dushay
Browsing ordered lists; Dismax; When simple search is not enough
Big, Bigger, Biggest: Tom Burton-West
Large scale issues: Phrase queries and common words; OCR
Q&A
Lucid Imagination, Inc. – http://www.lucidimagination.com
3. About the Presenters
Bess Sadler, Stanford University Library
Senior Software Engineer at Stanford University Library, and co-founder
of Blacklight (http://projectblacklight.org), formerly the Chief Architect
for the Online Library Environment at the University of Virginia. www-sul.stanford.edu
Naomi Dushay, Stanford University Library
Senior Software Engineer at Stanford University Library, expert in digital
library research; formerly a member of the core infrastructure team of
the National Science Digital Library. www-sul.stanford.edu
Tom Burton-West, the Hathi Trust Project
Information Retrieval Programmer in the University of Michigan’s Digital
Library Production Service; works on the Hathi Trust Large Scale Search
project and blogs about it at www.hathitrust.org/blogs.
4. What, the data’s dirty?
Bess Sadler, Stanford University Library
Clean data is easy to search and browse. However, you probably don’t have clean data.
5. Before we begin, you should know
Some basics around Solr that we won’t cover
HTTP GETs and POSTs
Search Index is not a DBMS
XML
Strong Data Typing
We’ll refer to these in the talk; if you’re unfamiliar with them,
see: bit.ly/practical-solr for some quick definitions of these
terms
6. Mapping Library Data Types:
Your data is not as different as you think it is

Library: Books; Personal Name; Publication; combined facets such as Book, Video, Journal, Newspaper, Physical Artifacts, Digital artifacts
Engineering: Specs; Concept; Formal Documentation; test results, analog data files, media files, data sets, rich documents; SKUs
Health Care: Research papers; Disease Types; Journals; test results, analog data files, media files, patient records
Intellectual Property: Patents; Mechanisms; Filing and Disclosure Docs; authors, titles, prior art, assignees, claims, descriptions, figures
Legal: Contracts; Parties; Rulings and court documents; exhibits, photos, criminal evidence, emails/e-discovery

Other domains (pharmaceutical, manufacturing, etc.) are similar in the diversity of document types and data types within the documents.
7. Data is weird
Not Normal: The data is not always in the fields or places you
expect, even when you have a detailed spec.
Local practices differ
Practices change over time
Sometimes stuff is just wrong (but remember: it’s better to be
consistent than right)
Be prepared for cleanup – indexing your data is going to uncover a
lot of problems you never knew about before
Formats are not necessarily optimized for discovery
For example: PDFs are optimized for presentation, not discovery;
putting them into a discovery system presents its own challenges.
8. Using Solr Cell (aka Extracting Request Handler)/PDFBox
Good news examples
Please, God, just some metadata!
When we got lucky, we had another source of the metadata
Bad News examples
Typography
Text inset boxes
It’s only a little easier than OCR …
Advanced PDFBox options only work
when there is a lot of consistency
9. Search vs. Browse
Search:
More focused -- the user is looking for a known item, or has a
specific question to be answered. (e.g., a citation, a part
number, a specific judicial ruling, “that book by Steinbeck”)
Browse:
The user has a generalized, nebulous information need that
they will refine as they interact with a collection of resources.
(e.g., finding a good book to read, shopping for accessories,
keeping current in one’s field)
10. Search Challenges
Relevancy – indexing the full text isn’t good enough
Fielded search – context is meaningful (“Cook” example)
Fielded search – will data be where you expect it to be?
Users don’t speak your jargon:
“indian cooking” is “Cookery Indic”
Stemming -- Nature/Naturalism
How do you know you have your relevancy rankings right?
You ask!
11. Why Browsing Is Important
Search is not enough
What is a facet?
Here is how it works in Solr
Here is why your users will like you for doing it.
More challenges related to browsing coming up…
12. Queries are not obvious
Naomi Dushay, Stanford University Library
Browsing Ordered Lists; A Little About Dismax; and When Simple Search Is Not Enough
13. Candidates for Browsing
One strategy for data that is not normalized is browsing ordered lists.
Names (Employees, Customers, Students, Authors)
Part Numbers
When Spelling is Unclear: uighur, uyghur, uyghar, uigher
Strings of Both Letters and Digits, such as SKUs, Part Numbers, Invoice Numbers, Transaction Record Numbers
Addresses in Sequence
Titles (Books, …)
14. Some Values are Easily Ordered
Numeric Values
Dates (if normalized)
Some Letter Tokens (e.g. categories)
15. Values Difficult to Sort Lexically
Digits in non-numeric context
lexical sort of numbers: 1, 111, 20, 222, 8 …
“A715C74”
“The Princess and the Pea”
“Sir Isaac Newton”
“Die Fledermaus”
piña vs. pina
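The digit problem is easy to see in a couple of lines of Python (used here purely for illustration):

```python
# Lexicographic sorting compares digit strings character by character,
# so "111" lands before "20"; sorting on the numeric value fixes it.
values = ["1", "111", "20", "222", "8"]

lex = sorted(values)           # character-by-character comparison
num = sorted(values, key=int)  # compare as integers

print(lex)  # ['1', '111', '20', '222', '8']
print(num)  # ['1', '8', '20', '111', '222']
```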
16. Call Numbers are Difficult to Sort Lexically
(applies to SKUs, Part Numbers, non-uniform serial numbers across domains, etc.)
Letters combined with digits
Some digits are decimals, some are integers
Inconsistent punctuation
Suffixes to be ignored for sorting purposes
Examples:
A7 .L3 .V2
A7 .L3 V2
A7 .L3 V.2
A7 .L3 1902 V.2
M5 .L3 2000 .K2 1880
M5 .L3 .K451 V.5
M5 .L3 K2 D MAJ 1880
M5 .L3 K2 OP.7:NO.6 1880
A7 .L3 1902 V2 TANEYTOWN
M5 .L3 K2 .Q2 MD:CRAP0*DMA 1981
17. Normalization for Sorting is a Process
It might not need to be perfect.
The cycle: programmatically normalize data -> assess sorted output (humans and automated tests find dirty data) -> clean up data -> repeat.
18. Basic Sorting Normalization Strategies
Normalize Letter Case (e.g. all lowercase)
Leading Spaces (can use zeros for digits; space works)
Trailing Spaces
Skip Ignored Characters (“The Fly”, “Ms. Jane Doe”)
Numbers sorted as an Integer (leading spaces/zeros),
vs. as a Decimal (trailing spaces/zeros)?
Normalization should
accommodate dirty data
whenever practical.
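A minimal sketch of several of these strategies in Python (the function name and the fixed padding width are illustrative assumptions; skipping ignored leading words like “The” is not shown):

```python
import unicodedata

def sort_key(raw: str, int_width: int = 10) -> str:
    """Build a sortable key: lowercase, fold accents, trim outer spaces,
    and left-pad runs of digits with zeros so numbers sort as integers."""
    s = unicodedata.normalize("NFKD", raw)
    s = "".join(c for c in s if not unicodedata.combining(c))  # piña -> pina
    s = s.lower().strip()
    out, i = [], 0
    while i < len(s):
        if s[i].isdigit():
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            out.append(s[i:j].rjust(int_width, "0"))  # "8" -> "0000000008"
            i = j
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

With keys like these, an ordinary string sort puts “A8” before “A20” before “A111”.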
20. Weird Values Happen
ZDVD 4971
MFILM 24 REEL 5
Shelved by title
XX(123457)
call # varies
no call number
21. Solr Performance Issue: Query Time Sorting
q=sortfield:["666" TO *]&rows=10
Will sort ALL of the sortfield values at query time.
Response time is abysmal for sortfields with huge numbers of values.
Try this: Terms Component
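A TermsComponent request can be assembled as a plain query string; a sketch (the host, handler name, and the example shelfkey value are assumptions to adapt to your deployment):

```python
from urllib.parse import urlencode

# Host, core, and handler path are assumptions; adjust to your deployment.
SOLR = "http://localhost:8983/solr/select"

params = {
    "qt": "terms",                # route to a TermsComponent request handler
    "terms.fl": "shelfkey",       # the lexically sorted sort-key field
    "terms.lower": "m5 .l3 2000", # start the browse list at this value
    "terms.lower.incl": "true",
    "terms.limit": 10,            # page size
}
url = SOLR + "?" + urlencode(params)
```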
23. TermsComponent queries the part of the index that is already lexically sorted for each field:
/solr/alphaTerms?
terms.fl=shelfkey&
terms.lower=lc+hc++0337.000000+f0.500000+f0.512000&
per_page=10
24. Now that I Have Terms, How do I get the Documents?
/solr/select?
q=sortkey:("a++67+mn++4" OR "a++67+mp+85")
&qt=standard
(URL-encode if you need to)
25. Sortfield Value : Document NOT always 1:1
1:Many One Sortfield Value – Multiple Documents
One product, multiple generations of user manuals
One court case, multiple briefing and disclosure documents
Many:1 One Document – Multiple Sortfield Values
Which value are you going to pick for the browsing list?
Allow user to select in UI, if possible
26. What About Browsing Before the Known Sort Value?
Show n entries before and n entries after the matching value.
[Photo of fiction book spines: http://hayward-ca.gov/refreshyourlife/wp-content/uploads/2009/07/fiction-spines.jpg]
27. Create Reverse Sortkey
Use a simple character mapping to reverse the sort order:

IF SORTKEY HAS    REVERSEKEY GETS
0                 Z
1                 Y
…                 …
9                 Q
A                 P
…                 …
Z                 0
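This mapping is just a character translation; a Python sketch, assuming sort keys drawn from digits and uppercase letters (any other character passes through unchanged):

```python
import string

# Map 0-9A-Z onto its own reversal (0<->Z, 1<->Y, ..., 9<->Q, A<->P, ...),
# so an ascending sort on the reverse key walks the browse list backwards.
ALPHABET = string.digits + string.ascii_uppercase
REVERSE = str.maketrans(ALPHABET, ALPHABET[::-1])

def reverse_key(sortkey: str) -> str:
    """The mapping is an involution: applying it twice restores the key."""
    return sortkey.translate(REVERSE)
```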
29. A Little About Dismax
30. Solr QueryParsing Strategies

FEATURE                                               LUCENE   DISMAX
Boolean                                               √
Each Text Box -> Groups of Index Fields
  (“Author”, “Title”, “Subject” searches)             √        √
Each Text Box -> Complex Boosting Equation            yuck     √
Multiple Text Boxes                                   yuck     √
Multiple Query Words Match Across Fields                       √
Boosting Matches                                      simple   √
31. Dismax (disjunction max) Query Parser:
Some of My Favorite Things
Assign boost values for field matching at query time BUT
complex boosting formulae can reside in solrconfig.xml
Index can be neutral; assign query time boosting to fields for
different types of queries
Easy to boost exact phrase matches higher than query terms
scattered across document.
Tune how many query words MUST match, and what the
other matching thresholds/parameters might be
http://wiki.apache.org/solr/DisMaxRequestHandler
32. Example Dismax Request Handler
<!-- author search request handler -->
<requestHandler name="search_author" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<!-- require 4 or more terms to match … -->
<str name="mm">4<-1 4<90%</str>
<!-- boost formula -->
<str name="qf">author_unstem^10 author native_script_author</str>
<!-- boost phrase matches -->
<str name="pf">author_unstem^100 author^10 native_script_author^10</str>
…
http://wiki.apache.org/solr/DisMaxRequestHandler
33. Sometimes,
Simple Search + Facets
is Not Enough
34. WHEN isn’t it enough?
Pay attention to user feedback
Study Search Logs
Queries without results
35. Our Users Also Asked for:
Boolean
Targeting a particular (group of) fields
“… combined searching feature so that I can specify the author
and title.”
(author) Mozart (title) sonata 21 – not a book about Mozart’s
sonatas
“I often search publisher AND year, or publisher AND place of
publication, and occasionally need all three terms in
combination.”
(publisher) “Little, Brown & Co” – not “The Little Brown Jug”
Plaintiff, Defendant, Attorney – all?
36. Search Form has More Than One Text Box
Want Features of Dismax
Need Way to Boost Appropriately for Each Text Box
Need Way to Combine Text Boxes
37. Local Params
LocalParams allow additional, localized instructions to be sent as part of the query.
Ways to Parse Query Terms
Send in Non-Default Values for Variables
Use Variables Declared in Request Handler That Don’t Map To QueryParser Arguments
http://wiki.apache.org/solr/LocalParams
39. Using LocalParams Variables
Text boxes combined with AND
_query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" AND
_query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
Text boxes combined with OR
_query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" OR
_query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
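Building these nested _query_ strings by hand is error-prone; a small helper sketch in Python (the qf_title/pf_title variable names are examples and must match those declared in your request handler's defaults):

```python
def nested_query(alias: str, terms: str) -> str:
    """Wrap one text box's terms in a {!dismax} nested query clause.

    `alias` selects the per-box qf/pf variables (e.g. qf_title, pf_title)
    declared in the request handler's defaults in solrconfig.xml.
    """
    return '_query_:"{!dismax qf=$qf_%s pf=$pf_%s}%s"' % (alias, alias, terms)

# Combine two text boxes with AND (use " OR " for the other mode)
q = nested_query("title", "title terms") + " AND " + nested_query("author", "author terms")
```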
40. Note: DISMAX doesn’t do Boolean within the text boxes; there are workarounds:
edismax (Solr 1.5)
faking it:
http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf
41. My Favorite Places To Find Information
LucidImagination Search
http://www.lucidimagination.com/search/
(NOT a coerced statement!)
Solr wikis
http://wiki.apache.org/solr/FrontPage
42. Big, Bigger, Biggest
Tom Burton-West, Hathi Trust Project
Large scale issues:
Phrase queries and common words
OCR
43. Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that
will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
Must scale to 20 million full-text volumes
Very long documents compared to
most large-scale search applications
Multilingual collection
OCR quality varies
44. Index Size, Caching, and Memory
Our documents average about 300 pages
which is about 700KB of OCR.
Our 5 million document index is between 2 and 3 terabytes.
About 300 GB per million documents
Large index means disk I/O is bottleneck
Tradeoff JVM vs OS memory
Solr uses OS memory (disk I/O caching) for caching of postings
Memory available for disk I/O caching has most impact on response
time (assuming adequate cache warming)
Fitting entire index in memory not feasible with terabyte size index
45. Response time varies with query
Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
46. Slowest 5% of queries
The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect the response time of other queries:
Cache pollution
Contention for resources
Slowest queries are phrase queries containing common words.
[Chart: response time at the 95th percentile, in seconds (log scale), for query numbers 940 through 1,000]
47. Query processing
Phrase queries use the position index (Boolean queries do not).
The position index accounts for 85% of index size.
The position list for a common word such as “the” can be many GB in size.
This causes lots of disk I/O.
Solr depends on the operating system’s disk cache to reduce disk I/O for words that occur in more than one query.
I/O from phrase queries containing common words pollutes that cache.
48. Slow Queries
Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
4 MB read for the Boolean query; 9,000+ MB read for the phrase query.

WORD        NUMBER OF DOCUMENTS   POSTINGS LIST (MB)   TERM OCCURRENCES (MILLIONS)   POSITION LIST (MB)
the         800,000               0.8                  4,351                         4,351
of          892,000               0.89                 2,795                         2,795
and         769,000               0.77                 1,870                         1,870
literature  435,000               0.44                 9                             9
generation  414,000               0.41                 5                             5
lives       432,000               0.43                 5                             5
beat        278,000               0.28                 1                             1
TOTAL                             4.02                                               9,036
49. Why not use Stop Words?
The word “the” occurs more than 4 billion times in our 1 million
document index.
Removing “stop” words (“the”, “of” etc.) not desirable for our use cases.
Couldn’t search for many phrases
“to be or not to be”
“the who”
“man in the moon” vs. “man on the moon”
Stop words in one language are content words in another language
German stop words “war” and “die” are content words in English
English stop words “is” and “by” are content words (“ice” and “village”)
in Swedish
50. “CommonGrams”
Ported Nutch “CommonGrams” algorithm to Solr
Create Bi-Grams selectively for any two word sequence containing
common terms
Slowest query: “The lives and literature of the beat generation”
“the-lives” “lives-and”
“and-literature” “literature-of”
“of-the” “the-beat” “generation”
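The query-time behavior can be sketched as follows (a simplified stand-in for the ported filter, with a toy common-word list; the real list should come from analyzing your own corpus and slow-query logs):

```python
# Toy common-word list; illustrative only.
COMMON = {"the", "of", "and", "a", "an", "in", "to"}

def common_grams_query(tokens):
    """Replace each adjacent pair containing a common word with a bigram.

    A plain token survives only if it forms no bigram with either
    neighbor, mirroring the query-time CommonGrams behavior.
    """
    out = []
    for i, tok in enumerate(tokens):
        pairs_with_next = i + 1 < len(tokens) and (
            tok in COMMON or tokens[i + 1] in COMMON)
        pairs_with_prev = i > 0 and (
            tok in COMMON or tokens[i - 1] in COMMON)
        if pairs_with_next:
            out.append(tok + "-" + tokens[i + 1])
        elif not pairs_with_prev:
            out.append(tok)
    return out
```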
51. Standard index vs. CommonGrams

Standard Index
WORD        OCCURRENCES IN CORPUS (MILLIONS)   DOCS (THOUSANDS)
the         2,013                              386
of          1,299                              440
and         855                                376
literature  4                                  210
lives       2                                  194
generation  2                                  199
beat        0.6                                130
TOTAL       4,176

CommonGrams
TERM            OCCURRENCES IN CORPUS (MILLIONS)   DOCS (THOUSANDS)
of-the          446                                396
generation      2.42                               262
the-lives       0.36                               128
literature-of   0.35                               103
lives-and       0.25                               115
and-literature  0.24                               77
the-beat        0.06                               26
TOTAL           450
52. Comparison of Response Time (ms)

                AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
Standard Index  459       32       146    6,784   120,595
CommonGrams     68        3        71     2,226   7,800
53. Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and
discovered additional “common words” to be added to our list.
We used Solr Admin panel to run our slowest queries from our
logs with the “debug” flag checked.
We discovered that words such as “l’art” were being split into
two token phrase queries.
We used the Solr Admin Analysis tool and determined that the
analyzer we were using was the culprit.
54. Other issues
We broke Solr … temporarily
Dirty OCR in combination with over 200 languages creates
indexes with over 2.4 billion unique terms
Solr/Lucene index size was limited to 2.1 Billion unique terms
Patched: Now it’s 274 Billion
Dirty OCR is difficult to remove without removing “good” words.
Because Solr/Lucene tii/tis index uses pointers into the frequency
and position files we suspect that the performance impact is
minimal compared to disk I/O demands, but we will be testing
soon.
55. Q&A
Download these slides at
http://bit.ly/practical-solr
On demand replay is
available within 24-48
hours of the live webcast