2. Intro
• My Background
• Your Background
• Brief History of Lucene
• Goals for Tutorial
– Understand Lucene core capabilities
– Real examples, real code, real data
• Ask Questions!!!!!
3. Schedule
1. 10-10:10 Introducing Lucene and Search
2. 10:10-12 Indexing, Analysis, Searching, Performance
3. 12-12:05 Break
4. 12:05-1 More on Indexing, Analysis, Searching, Performance
5. 1-2:30 Lunch
6. 2:30-2:40 Recap, Questions, Content
7. 2:40-4 Class Example
8. 4-4:20 Break
9. 4:20-5 Class Example
10. 5-5:20 Lucene Contributions (time permitting)
11. 5:20-5:25 Open Discussion (time permitting)
12. 5:25-5:30 Resources/Wrap Up
4. Lucene is…
• NOT a crawler
– See Nutch
• NOT an application
– See PoweredBy on the Wiki
• NOT a library for doing Google PageRank
or other link analysis algorithms
– See Nutch
• A library for enabling text based search
5. A Few Words about Solr
• HTTP-based Search Server
• XML Configuration
• XML, JSON, Ruby, PHP, Java support
• Caching, Replication
• Many, many nice features that Lucene users
need
• http://lucene.apache.org/solr
6. Search Basics
• Goal: Identify documents that
are similar to input query
• Lucene uses a modified Vector
Space Model (VSM)
– Boolean + VSM
– TF-IDF
– The words in the document
and the query each define a
Vector in an n-dimensional
space
– Sim(q, d) = cos Θ, where Θ is the angle between the query vector and the document vector
– In Lucene, the boolean approach restricts which documents get scored
– dj = <w1,j, w2,j, …, wn,j>
– q = <w1,q, w2,q, …, wn,q>
– w = weight assigned to a term
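The cosine similarity above can be sketched in plain Java. The weight arrays here are hypothetical term weights (e.g., TF-IDF values), not Lucene API:

```java
public class CosineSim {
    // Cosine of the angle between a query vector and a document vector,
    // where each entry is the weight of one term in the n-dimensional space.
    static double sim(double[] q, double[] d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        double[] q = {1.0, 0.5, 0.0};
        double[] d = {1.0, 0.5, 0.0};
        System.out.println(sim(q, d)); // identical vectors -> 1.0
    }
}
```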
7. Indexing
• Process of preparing and adding text to
Lucene
– Optimized for searching
• Key Point: Lucene only indexes Strings
– What does this mean?
• Lucene doesn’t care about XML, Word, PDF, etc.
– There are many good open source extractors available
• It’s our job to convert whatever file format we have
into something Lucene can use
8. Indexing Classes
• Analyzer
– Creates tokens using a Tokenizer and filters
them through zero or more TokenFilters
• IndexWriter
– Responsible for converting text into internal
Lucene format
9. Indexing Classes
• Directory
– Where the Index is stored
– RAMDirectory, FSDirectory, others
• Document
– A collection of Fields
– Can be boosted
• Field
– Free text, keywords, dates, etc.
– Defines attributes for storing, indexing
– Can be boosted
– Field Constructors and parameters
• Open up Fieldable and Field in IDE
10. How to Index
• Create IndexWriter
• For each input
– Create a Document
– Add Fields to the Document
– Add the Document to the IndexWriter
• Close the IndexWriter
• Optimize (optional)
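The steps above can be sketched against the Lucene 2.3 API. Field names and content are illustrative, not from the Reuters collection:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SimpleIndexer {
    public static int index(String[] bodies) throws Exception {
        Directory dir = new RAMDirectory();
        // true == create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        for (int i = 0; i < bodies.length; i++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(i),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", bodies[i],
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize(); // optional: merge segments for faster searching
        int numDocs = writer.docCount();
        writer.close();
        return numDocs;
    }
}
```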
11. Task 1.a
• From the Boot Camp Files, use the basic.ReutersIndexer
skeleton to start
• Index the small Reuters Collection using the
IndexWriter, a Directory and
StandardAnalyzer
– Boost every 10th Document by 3
• Questions to Answer:
– What Fields should I define?
– What attributes should each Field have?
• What Fields should OMIT_NORMS?
– Pick a field to boost and give a reason why you think it should be
boosted
13. Searching
• Key Classes:
– Searcher
• Provides methods for searching
• Take a moment to look at the Searcher class declaration
• IndexSearcher, MultiSearcher,
ParallelMultiSearcher
– IndexReader
• Loads a snapshot of the index into memory for searching
– Hits
• Storage/caching of results from searching
– QueryParser
• JavaCC grammar for creating Lucene Queries
• http://lucene.apache.org/java/docs/queryparsersyntax.html
– Query
• Logical representation of program’s information need
14. Query Parsing
• Basic syntax:
title:hockey +(body:stanley AND body:cup)
• OR/AND must be uppercase
• Default operator is OR (can be changed)
• Supports fairly advanced syntax, see the website
– http://lucene.apache.org/java/docs/queryparsersyntax.html
• Doesn’t always play nice, so beware
– Many applications construct queries programmatically
or restrict syntax
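A minimal parse-and-search sketch with the 2.3 API; the default field name "body" and the stored "id" field are assumptions carried over from an indexing step like the one earlier:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class SimpleSearcher {
    public static int search(Directory dir, String queryText) throws Exception {
        IndexSearcher searcher = new IndexSearcher(dir);
        // "body" is the default field used when a clause has no field prefix
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        Query query = parser.parse(queryText);
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + " " + hits.doc(i).get("id"));
        }
        int n = hits.length();
        searcher.close();
        return n;
    }
}
```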
15. Task 1.b
• Using the ReutersIndexerTest.java skeleton in the boot
camp files
– Search your newly created index using queries you develop
– Delete a Document by the doc id
• Hints:
– Use an IndexSearcher
– Create a Query using the QueryParser
– Display the results from the Hits
• Questions:
– What is the default field for the QueryParser?
– What Analyzer to use?
16. Task 1 Results
• Locks
– Lucene maintains locks on files to prevent
index corruption
– Located in same directory as index
• Scores from Hits are normalized
– Scores across queries are NOT comparable
• Lucene 2.3 has some transactional semantics for indexing, but it is not a DB
17. Deletion and Updates
• Deletions can be a bit confusing
– Both IndexReader and IndexWriter
have delete methods
• Updates are always a delete and an add
• Updates are always a delete and an add
– Yes, that is a repeat!
– Nature of data structures used in search
18. Analysis
• Analysis is the process of creating Tokens to be indexed
• Analysis is usually done to improve results overall, but it
comes with a price
• Lucene comes with many different Analyzers,
Tokenizers and TokenFilters, each with their own
goals
– See contrib/analyzers
• StandardAnalyzer is included with the core JAR and
does a good job for most English and Latin-based tasks
• Oftentimes you want the same content analyzed in different ways
• Consider a catch-all Field in addition to other Fields
20. Indexing in a Nutshell
• For each Document
– For each Field to be tokenized
• Create the tokens using the specified Tokenizer
– Tokens consist of a String, position, type and offset information
• Pass the tokens through the chained TokenFilters where
they can be changed or removed
• Add the end result to the inverted index
• Position information can be altered
– Useful when removing words or to prevent phrases
from matching
22. Tokenization
• Split words into Tokens to be processed
• Tokenization is fairly straightforward for
most languages that use a space for word
segmentation
– More difficult for some East Asian languages
– See the CJK Analyzer
23. Modifying Tokens
• TokenFilters are used to alter the token
stream to be indexed
• Common tasks:
– Remove stopwords
– Lower case
– Stem/Normalize (e.g., Wi-Fi -> Wi Fi)
– Add Synonyms
• StandardAnalyzer does things that you may
not want
24. Custom Analyzers
• Solution: write your own Analyzer
• Better solution: write a configurable
Analyzer so you only need one Analyzer
that you can easily change for your projects
– See Solr
• Tokenizers and TokenFilters must
be newly constructed for each input
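One possible shape for such a configurable Analyzer, using the 2.3 API. The constructor-injected stop word set is the "configuration" here; this is a sketch, not Solr's implementation:

```java
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ConfigurableAnalyzer extends Analyzer {
    private final Set stopWords;

    public ConfigurableAnalyzer(Set stopWords) {
        this.stopWords = stopWords;
    }

    // Called per field; the Tokenizer and TokenFilters are newly
    // constructed for each input, as required
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        if (stopWords != null) {
            stream = new StopFilter(stream, stopWords);
        }
        return stream;
    }
}
```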
25. Special Cases
• Dates and numbers need special treatment to be
searchable
– o.a.l.document.DateTools
– org.apache.solr.util.NumberUtils
• Altering Position Information
– Increase Position Gap between sentences to prevent
phrases from crossing sentence boundaries
– Index synonyms at the same position so query can
match regardless of synonym used
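For the date case above, a sketch of DateTools usage; the day resolution is chosen for illustration:

```java
import java.util.Date;
import org.apache.lucene.document.DateTools;

public class DateFieldExample {
    public static String toIndexable(Date date) {
        // Truncate to day resolution so the term is short and range-friendly
        return DateTools.dateToString(date, DateTools.Resolution.DAY);
    }

    public static Date fromIndexable(String s) throws Exception {
        return DateTools.stringToDate(s);
    }
}
```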
27. Indexing Performance
• Behind the Scenes
– Lucene indexes Documents into memory
– At certain trigger points, in-memory segments are flushed to the Directory
– Segments are periodically merged
• Lucene 2.3 has significant performance
improvements
28. IndexWriter Performance
Factors
• maxBufferedDocs
– Number of Documents buffered in memory before a new segment is flushed
– Usually, Larger == faster, but more RAM
• mergeFactor
– How often segments are merged
– Smaller == less RAM, better for incremental updates
– Larger == faster, better for batch indexing
• maxFieldLength
– Limit the number of terms in a Document
29. Lucene 2.3 IndexWriter
Changes
• setRAMBufferSizeMB
– New model for automagically controlling indexing
factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs and
setMergeFactor
• Takes storage and term vectors out of the merge
process
• Turn off auto-commit if there are stored fields and
term vectors
• Provides significant performance increase
30. Index Threading
• IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
– Index to separate Directory instances
– Merge using IndexWriter.addIndexes
– Could also distribute and collect
31. Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2
and trunk (2.3)
– contrib/benchmark/conf:
• indexing.alg
• indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
33. Searching
• Earlier we touched on basics of search
using the QueryParser
• Now look at:
– Searcher/IndexReader Lifecycle
– Query classes
– More details on the QueryParser
– Filters
– Sorting
34. Lifecycle
• Recall that the IndexReader loads a snapshot
of index into memory
– This means updates made since loading the index will
not be seen
• Business rules are needed to define how often to
reload the index, if at all
– IndexReader.isCurrent() can help
• Loading an index is an expensive operation
– Do not open a Searcher/IndexReader for every
search
35. Query Classes
• TermQuery is basis for all non-span queries
• BooleanQuery combines multiple Query
instances as clauses
– should
– required
• PhraseQuery finds terms occurring near each
other, position-wise
– “slop” is the edit distance between two terms
• Take 2-3 minutes to explore Query
implementations
36. Spans
• Spans provide information about where
matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery
classes
– SpanNearQuery useful for doing phrase
matching
37. QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
– Better to think in terms of required (+ operator) and not
allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their
own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of
the QP
38. Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a
Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single
term that can be used for comparison
• The SortField defines the different sort types
available
– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE,
DOC
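A sketch of sorting with the 2.3 API; the "rating" field name and the secondary sort by score are assumptions:

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortExample {
    public static Hits searchSorted(IndexSearcher searcher, Query query)
            throws Exception {
        // Sort by rating descending, breaking ties by score
        Sort sort = new Sort(new SortField[] {
            new SortField("rating", SortField.INT, true), // true == descending
            SortField.FIELD_SCORE
        });
        return searcher.search(query, sort);
    }
}
```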
39. Sorting II
• Look at Searcher, Sort and
SortField
• Custom sorting is done with a
SortComparatorSource
• Sorting can be very expensive
– Terms are cached in the FieldCache
• SortFilterTest.java example
40. Filters
• Filters restrict the search space to a
subset of Documents
• Use Cases
– Search within a Search
– Restrict by date
– Rating
– Security
– Author
41. Filter Classes
• QueryWrapperFilter (QueryFilter)
– Restrict to subset of Documents that match a Query
• RangeFilter
– Restrict to Documents that fall within a range
– Better alternative to RangeQuery
• CachingWrapperFilter
– Wrap another Filter and provide caching
• SortFilterTest.java example
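The three Filter classes above might be constructed like this with the 2.3 API; the field names and the yyyyMMdd date format (from DateTools day resolution) are assumptions:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.RangeFilter;
import org.apache.lucene.search.TermQuery;

public class FilterExamples {
    // Restrict results to Documents with "computer" in the title
    public static Filter titleFilter() {
        return new QueryWrapperFilter(new TermQuery(new Term("title", "computer")));
    }

    // Restrict to a single day; dates indexed as yyyyMMdd terms
    public static Filter dateFilter() {
        return new RangeFilter("date", "19870226", "19870226", true, true);
    }

    // Wrap another Filter so its bit set is cached per IndexReader
    public static Filter cached(Filter f) {
        return new CachingWrapperFilter(f);
    }
}
```

A filter is then passed alongside the query, e.g. `searcher.search(query, FilterExamples.titleFilter())`.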
42. Expert Results
• Searcher has several “expert” methods
– Hits is not always what you need due to:
• Caching
• Normalized Scores
• Re-execution of the Query as results are accessed
• HitCollector allows low-level access to all
Documents as they are scored
• TopDocs represents top n docs that match
– TopDocsTest in examples
43. Searchers
• MultiSearcher
– Search over multiple Searchables, including remote
• MultiReader
– Not a Searcher, but can be used with
IndexSearcher to achieve same results for local
indexes
• ParallelMultiSearcher
– Like MultiSearcher, but threaded
• RemoteSearchable
– RMI based remote searching
• Look at MultiSearcherTest in example
code
44. Search Performance
• Search speed is based on a number of factors:
– Query Type(s)
– Query Size
– Analysis
– Occurrences of Query Terms
– Optimize
– Index Size
– Index type (RAMDirectory, other)
– Usual Suspects
• CPU
• Memory
• I/O
• Business Needs
45. Query Types
• Be careful with WildcardQuery as it rewrites
to a BooleanQuery containing all the terms
that match the wildcards
• Avoid starting a WildcardQuery with wildcard
• Use ConstantScoreRangeQuery instead of
RangeQuery
• Be careful with range queries and dates
– User mailing list and Wiki have useful tips for
optimizing date handling
46. Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same
terms
• Disambiguation
– May be useful when doing synonym expansion
– Difficult to automate and may be slower
– Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
– Use most important words
– “Important” can be defined in a number of ways
47. Usual Suspects
• CPU
– Profile your application
• Memory
– Examine your heap size, garbage collection approach
• I/O
– Cache your Searcher
• Define business logic for refreshing based on indexing needs
– Warm your Searcher before going live -- See Solr
• Business Needs
– Do you really need to support Wildcards?
– What about date range queries down to the millisecond?
48. Explanations
• explain(Query, int) method is
useful for understanding why a Document
scored the way it did
• ExplainsTest in sample code
• Open Luke and try some queries and then
use the “explain” button
49. FieldSelector
• Prior to version 2.1, Lucene always loaded all
Fields in a Document
• FieldSelector API addition allows Lucene to
skip large Fields
– Options: Load, Lazy Load, No Load, Load and Break,
Load for Merge, Size, Size and Break
• Makes storage of original content more viable
without large cost of loading it when not used
• FieldSelectorTest in example code
50. Scoring and Similarity
• Lucene has sophisticated scoring
mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight and Scorer classes
51. Affecting Relevance
• FunctionQuery from Solr (variation in
Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• HitCollector
• Take 5 to examine these
54. Next Up
• Dealing with Content
– File Formats
– Extraction
• Large Task
• Miscellaneous
• Wrapping Up
55. File Formats
• Several open source libraries and projects exist for extracting content to use in Lucene
– PDF: PDFBox
• http://www.pdfbox.org/
– Word: POI, Open Office, TextMining
• http://www.textmining.org/textmining.zip
– XML: SAX or Pull parser
– HTML: Neko, Jtidy
• http://people.apache.org/~andyc/neko/doc/html/
• http://jtidy.sourceforge.net/
• Tika
– http://incubator.apache.org/tika/
• Aperture
– http://aperture.sourceforge.net
56. Aperture Basics
• Crawlers
• Data Connectors
• Extraction Wrappers
– POI, PDFBox, HTML, XML, etc.
• http://aperture.wiki.sourceforge.net/Extractors
will give you info on what comes back from
Aperture
• LuceneApertureCallbackHandler
in example code
57. Large Task
• Using the skeleton files in the
com.lucenebootcamp.training.full package:
– Get some content:
• Web, file system
• Different file formats
– Index it
• Plan out your fields, boosts, field properties
• Support updates and deletes
• Optional:
– How fast can you make it go? Divide and conquer?
Multithreaded?
58. Large Task
• Search Content
– Allow for arbitrary user queries across multiple
Fields via command line or simple web interface
– How fast can you make it?
• Support:
– Sort
– Filter
– Explains
• How much slower is it to retrieve an explanation?
59. Large Task
• Document Retrieval
– Display/write out one or more documents
– Support FieldSelector
60. Large Task
• Optional Tasks
– Hit Highlighting using contrib/Highlighter
– Multithreaded indexing and Search
– Explore other Field construction options
• Binary fields, term vectors
– Use Lucene trunk version and try out some of the
changes in indexing
– Try out Solr or Nutch at http://lucene.apache.org/
• What do they offer that Lucene Java doesn’t that you might need?
61. Large Task Metadata
– Pair up if you want
– Ask questions
– 2 hours
– Use Luke to check your index!
– Explore other parts of Lucene that you are
interested in
– Be prepared to discuss/share with the class
63. Term Information
• TermEnum gives access to terms and how many
Documents they occur in
– IndexReader.terms()
– IndexReader.termPositions()
• TermDocs gives access to the frequency of a
term in a Document
– IndexReader.termDocs()
• Term Vectors give access to term frequency
information in a given Document
– IndexReader.getTermFreqVector
• TermsTest in sample code
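Iterating all terms and their document frequencies, as in the TermEnum half of Task 6, might look like this sketch against the 2.3 API:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermDumper {
    public static int dump(IndexReader reader) throws Exception {
        TermEnum terms = reader.terms();
        int unique = 0;
        while (terms.next()) {
            Term term = terms.term();
            // docFreq() == number of Documents containing this term
            System.out.println(term.field() + ":" + term.text()
                    + " " + terms.docFreq());
            unique++;
        }
        terms.close();
        return unique;
    }
}
```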
64. Lucene Contributions
• Many people have generously contributed code to
help solve common problems
• These are in contrib directory of the source
• Popular:
– Analyzers
– Highlighter
– Queries and MoreLikeThis
– Snowball Stemmers
– Spellchecker
65. Open Discussion
• Multilingual Best Practices
– UNICODE
– One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr
66. Resources
• http://lucene.apache.org/
• http://en.wikipedia.org/wiki/Vector_space_model
• Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
• Lucene In Action by Hatcher and Gospodnetić
• Wiki
• Mailing Lists
– java-user@lucene.apache.org
• Discussions on how to use Lucene
– java-dev@lucene.apache.org
• Discussions on how to develop Lucene
• Issue Tracking
– https://issues.apache.org/jira/secure/Dashboard.jspa
• We always welcome patches
– Ask on the mailing list before reporting a bug
68. Finally…
• Please take the time to fill out a survey to
help me improve this training
– Located in base directory of source
– Email it to me at trainer@lucenebootcamp.com
• There are several Lucene related talks on
Friday
70. Task 2
• Take 10-15 minutes, pair up, and write an
Analyzer and Unit Test
– Examine results in Luke
– Run some searches
• Ideas:
– Combine existing Tokenizers and TokenFilters
– Normalize abbreviations
– Filter out all words beginning with the letter A
– Identify/Mark sentences
• Questions:
– What would help improve search results?
71. Task 2 Results
• Share what you did and why
• Improving Results (in most cases)
– Stemming
– Ignore Case
– Stopword Removal
– Synonyms
– Pay attention to business needs
72. Grab Bag
• Accessing Term Information
– TermEnum
– TermDocs
– Term Vectors
• FieldSelector
• Scoring and Similarity
• File Formats
73. Task 6
• Count and print all the unique terms in the
index and their frequencies
– Notes:
• Half of the class write it using TermEnum and
TermDocs
• Other Half write it using Term Vectors
• Time your Task
• Only count the title and body content
74. Task 6 Results
• Term Vector approach is faster on smaller
collections
• TermEnum approach is faster on larger
collections
75. Task 4
• Re-index your collection
– Add in a “rating” field that randomly assigns a number
between 0 and 9
• Write searches to sort by
• Date
• Title
• Rating, Date, Doc Id
• A Custom Sort
• Questions
– How do you sort on the title?
– How do you sort on multiple Fields?
77. Task 5
• Create and search using Filters to:
– Restrict to all docs written on Feb. 26, 1987
– Restrict to all docs with the word “computer”
in title
• Also:
– Create a Filter where the length of the body +
title is greater than X
78. Task 5 Results
• Solr has more advanced Filter
mechanisms that may be worth using
• Cache filters
79. Task 7
• Pair up if you like and take 30-40 minutes to:
– Pick two file formats to work on
– Identify content in that format
• Can you index contents on your hard drive?
• Project Gutenberg, Creative Commons, Wikipedia
• Combine w/ Reuters collection
– Extract the content and index it using the appropriate
library
– Store the content as a Field
– Search the content
– Load Documents with and without
FieldSelector and measure performance
80. Task 7 (cont.)
• Include score and explanation in results
• Dump results to XML or HTML
• Be prepared to share with class what you did
– What libraries did you use?
– What content did you use?
– What is your Document structure?
– What issues did you have?