2. Introductions
Presenter
Architect/Development Team Leader @Trasys Greece
Java EE projects for European Agencies
IR (Information Retrieval)
The tracing and recovery of specific information from stored
data
IR is interdisciplinary, based on computer science,
mathematics, library science, information science, information
architecture, cognitive psychology, linguistics, and statistics.
Lucene
Open Source – Apache Software License
(http://lucene.apache.org)
Founder: Doug Cutting
0.01 released in March 2000 (on SourceForge)
1.2 released in June 2002 (first Apache Jakarta release)
Became its own top-level Apache project in 2005
Current version is 3.1
3. More Lucene Intro…
Lucene is a high-performance, scalable IR
library (not a ready-to-use application)
A number of full-featured search applications
are built on top (more later…)
Lucene ports and bindings in many other
programming environments incl. Perl,
Python, Ruby, C/C++, PHP and C# (.NET)
Lucene "Powered By" apps (a few of
many): LinkedIn, Apple, MySpace, Eclipse
IDE, MS Outlook, Atlassian (JIRA). See
more @ http://wiki.apache.org/lucene-java/PoweredBy
4. Components of a Search
Application (1/4)
Acquire Content
Gather and scope the content
e.g. from the web with a spider
or crawler, a CMS, a Database
or a file system
Projects helping
Solr: handles RDBMS and XML
feeds and rich documents
through Tika integration
Nutch: web crawler - sister
project at Apache
Grub: open source web crawler
5. Components of a Search
Application (2/4)
Build document
Define the document
The atomic unit of indexing and searching
Has fields
De-normalization involved
Projects helping: Usually the
same frameworks cover both this
and the previous step
Compass and its evolution,
ElasticSearch
Hibernate Search
DBSight
Oracle/Lucene Integration
6. Components of a Search
Application (3/4)
Analyze Document
Handled by Analyzers
Built-in and contributed
Built with tokenizers and token
filters
Index Document
Through Lucene API or your
framework of choice
Search User
Interface/Render Results
Application specific
7. Components of a Search
Application (4/4)
Query Builder
Lucene provides one
Frameworks provide extensions but also
the application itself e.g. advanced
search
Run Query
Retrieve documents running the query
built
Three common theoretical models
Boolean model
Vector space model
Probabilistic model
Administration
e.g. tuning options
Analytics
reporting
8. How Lucene models content
Documents
Fields
Denormalization of content
Flexible Schema
Inverted Index
10. Basic Indexing
Adding documents
RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
Deleting and updating documents (see the sketch after this list)
Field options
Store
Analyze
Norms
Term vectors
Boost
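A minimal sketch of deleting and updating, assuming an open IndexWriter on the same index as above (note that WhitespaceAnalyzer keeps the original case, so terms must be matched exactly as indexed):
// Delete every document whose "post" field contains the term "Saturday"
writer.deleteDocuments(new Term("post", "Saturday"));
// Update = an atomic delete-by-term plus add of the new document
Document updated = new Document();
updated.add(new Field("post", "the JHUG meeting moved to Sunday",
    Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("post", "JHUG"), updated);
writer.commit();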
11. Scoring – The formula
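The factors defined below combine, approximately (following the form of Lucene's Similarity documentation), as:

score(q, d) = coord(q, d) · queryNorm(q) · Σ over t in q of [ tf(t in d) · idf(t)² · boost(t.field in d) · lengthNorm(t.field in d) ]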
tf(t in d): Term frequency factor for the term (t) in the document
(d), i.e. how many times the term t occurs in the document.
idf(t): Inverse document frequency of the term: a measure of how
“unique” the term is. Very common terms have a low idf; very
rare terms have a high idf.
boost(t.field in d): Field & Document boost, as set during indexing.
This may be used to statically boost certain fields and certain
documents over others.
lengthNorm(t.field in d): Normalization value of a field, given the
number of terms within the field. This value is computed during
indexing and stored in the index norms. Shorter fields (fewer
tokens) get a bigger boost from this factor.
coord(q, d): Coordination factor, based on the number of query
terms the document contains. The coordination factor gives an
AND-like boost to documents that contain more of the search
terms than other documents
queryNorm(q): Normalization value for a query, given the sum of
the squared weights of each of the query terms.
12. Querying – the API
Variety of Query class implementations
TermQuery
PhraseQuery
TermRangeQuery
NumericRangeQuery
PrefixQuery
BooleanQuery
WildcardQuery
FuzzyQuery
MatchAllDocsQuery
…
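A sketch of combining these programmatically through BooleanQuery (the field names here are invented for the example):
BooleanQuery query = new BooleanQuery();
// Required clause: behaves like AND
query.add(new TermQuery(new Term("subject", "trachea")),
    BooleanClause.Occur.MUST);
// Optional clause: behaves like OR
query.add(NumericRangeQuery.newIntRange("year", 2010, 2011, true, true),
    BooleanClause.Occur.SHOULD);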
13. Querying - Example
private void indexSingleFieldDocs(Field[] fields) throws Exception {
IndexWriter writer = new IndexWriter(directory,
new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < fields.length; i++) {
Document doc = new Document();
doc.add(fields[i]);
writer.addDocument(doc);
}
writer.optimize();
writer.close();
}
public void wildcard() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
  IndexSearcher searcher = new IndexSearcher(directory, true);
  // ?ild* matches "wild", "mild" and "mildew", but not "child"
  Query query = new WildcardQuery(new Term("contents", "?ild*"));
  TopDocs matches = searcher.search(query, 10);
  searcher.close();
}
14. Querying - QueryParser
// Lucene 3.x: QueryParser also takes a Version argument
Query query = new QueryParser(Version.LUCENE_31, "subject",
    analyzer).parse("(clinical OR ethics) AND methodology");
trachea AND esophagus (both terms required)
The default join condition is OR, e.g. trachea esophagus
cough AND (trachea OR esophagus) (grouping with parentheses)
trachea NOT esophagus (excludes documents matching esophagus)
full_title:trachea (search within a specific field)
"trachea disease" (phrase query)
"trachea disease"~5 (phrase with a slop of 5)
is_gender_male:y (flag field search)
[2010-01-01 TO 2010-07-01] (range query)
esophaguz~ (fuzzy query)
Trachea^5 esophagus (boosts trachea by 5)
15. Analyzers - Internals
At Indexing and querying time
Inside an analyzer
Operates on a TokenStream
A token has a text value and metadata like
Start and end character offsets
Token type
Position increment
Optionally application specific bit flags and byte[]
payload
TokenStream is abstract; Tokenizer and TokenFilter
are its concrete subclasses
A Tokenizer reads chars and produces tokens
A TokenFilter ingests tokens and produces new ones
They implement the composite pattern, forming a
chain in which each wraps the previous one
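A minimal sketch of such a chain (assuming the Lucene 3.x API; some of these constructors changed slightly across 3.x releases):
public class ChainAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // The Tokenizer reads characters and produces the initial tokens
    TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
    // Each TokenFilter wraps the previous stream, forming the chain
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return stream;
  }
}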
16. Analyzers – building blocks
Analyzers can be created by combining token streams (Order is
important)
Building blocks provided in core
CharTokenizer
WhitespaceTokenizer
KeywordTokenizer
LetterTokenizer
LowerCaseTokenizer
SinkTokenizer
StandardTokenizer
LowerCaseFilter
StopFilter
PorterStemFilter
TeeTokenFilter
ASCIIFoldingFilter
CachingTokenFilter
LengthFilter
StandardFilter
17. Analyzers - core
WhitespaceAnalyzer: splits tokens at
whitespace
SimpleAnalyzer: divides text at non-letter
characters and lowercases
StopAnalyzer: divides text at non-letter
characters, lowercases, and removes stop words
KeywordAnalyzer: treats the entire text as a single
token
StandardAnalyzer: tokenizes based on a
sophisticated grammar that recognizes email
addresses, acronyms, Chinese/Japanese/Korean
characters, alphanumerics, and more;
lowercases and removes stop words
18. Analyzers – Example (1/2)
Analyzing “The JHUG meeting is on this Saturday"
WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]
SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]
StopAnalyzer:
[jhug] [meeting] [saturday]
StandardAnalyzer:
[jhug] [meeting] [saturday]
20. Analyzers – Beyond the built-in
Language-specific analyzers, under contrib/analyzers
language-specific stemming and stop-word removal
Sounds Like analyzer e.g. MetaphoneReplacementAnalyzer
that transforms terms to their phonetic roots
SynonymAnalyzer
Nutch Analysis: bigrams for stop words
Stemming analysis
The PorterStemFilter. It stems words using the Porter
stemming algorithm created by Dr. Martin Porter, and it's
best defined in his own words:
The Porter stemming algorithm (or 'Porter stemmer') is a process
for removing the commoner morphological and inflexional endings
from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up
Information Retrieval systems.
SnowballAnalyzer: Stemming for many European
languages
21. Filters
Narrow the search space
Overloaded search methods that
accept Filter instances
Examples
TermRangeFilter
NumericRangeFilter
PrefixFilter
QueryWrapperFilter
SpanQueryFilter
ChainedFilter
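For example (a sketch; the "published" field is an assumption), a TermRangeFilter can be passed to one of the overloaded search methods:
// Restrict matches to documents whose "published" term falls in the range
Filter filter = new TermRangeFilter("published",
    "2010-01-01", "2010-12-31", true, true);
TopDocs hits = searcher.search(query, filter, 10);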
22. Example: Filters for Security
Constraints known at indexing time
Index the constraint as a field
Search wrapping a TermQuery on the constraint
field with a QueryWrapperFilter
Factor in information at search time
A custom filter
Filter will access an external privilege store that will
provide some means of identifying documents in
the index e.g. a unique term with regard to
permissions
Return a DocIdSet to Lucene. Bit positions match
the document numbers. Enabled bits mean the
document for that position is available to be
searched against the query; unset bits mean the
documents won't be considered in the search
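A sketch of such a custom filter (Lucene 3.x Filter API; the "owner" field and the single-term privilege check are simplifying assumptions standing in for the external store lookup):
public class SecurityFilter extends Filter {
  private final String user;
  public SecurityFilter(String user) { this.user = user; }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    // One bit per document; a set bit means the document may be searched
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    // Assumed unique term identifying the documents this user may see
    TermDocs docs = reader.termDocs(new Term("owner", user));
    while (docs.next()) {
      bits.set(docs.doc());
    }
    docs.close();
    return bits; // OpenBitSet is a DocIdSet
  }
}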
23. Internals - Concurrency
Any number of IndexReaders open
Only one IndexWriter at a time
Locking with write lock file
IndexReaders may be open while the
index is being changed by an
IndexWriter
IndexSearchers use underlying
IndexReaders
A searcher sees changes only when the writer
commits and its underlying reader is reopened
Both are thread-safe, thread-friendly classes
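For example (a sketch; reopen() returns a new reader only if the index has changed):
IndexReader current = reader.reopen();
if (current != reader) {
  reader.close();                       // release the old point-in-time view
  reader = current;
  searcher = new IndexSearcher(reader); // now sees the committed changes
}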
24. Internals - Indexing concepts
Index is made up from segment files
Deleting documents does not actually delete; it
only marks them for deletion
Index writes are buffered and flushed periodically
Segments need to be merged
Automatically by the IndexWriter
Explicit calls to optimize
There is the notion of commit (as you would
expect), which has 4 steps
Flush buffered documents and deletions
Sync files; force OS to write to stable storage of the
underlying I/O system
Write and sync the segments_N file
Remove old commits
25. Internals - Transactions
Two-phase commit is supported
prepareCommit performs steps 1, 2 and
most of 3
Lucene implements the ACID
transactional model
Atomicity: all or nothing commit
Consistency: e.g. update will mean both
delete and add
Isolation: IndexReaders cannot see what
has not been committed
Durability: Index is not corrupted and
persists in storage
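In code the two-phase commit looks roughly like this (IndexWriter API; error handling reduced to a sketch):
try {
  writer.prepareCommit(); // steps 1, 2 and most of 3; nothing is visible yet
  // ... commit other resources here, e.g. a database transaction ...
  writer.commit();        // completes the commit; changes become visible
} catch (Exception e) {
  writer.rollback();      // discards everything since the last commit
}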
26. Architectures
Cluster nodes that share a remote file system
index
Slower than local
Possible limitations due to client side caching
(Samba, NFS, AFP) or stale file handles (NFS)
Index in database
Much slower
Separate write and read indexes (replication)
Relies on the IndexDeletionPolicy feature of Lucene
Out of the box in Solr and ElasticSearch
Autonomous search servers (e.g. Solr,
ElasticSearch)
Loose coupling through JSON or XML
30. Cool extra features - Spellchecking
You will need a dictionary of valid words
You could use the unique terms in your index
Given the dictionary you could
Use a sounds-like algorithm such as Soundex or Metaphone
Or use n-grams
E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel; as
4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for
squirel would match 5 grams, with 2 shared between the
3-grams and 4-grams. This would score high!
Maybe use the Levenshtein distance
To present or not to present (the suggestion)
Other ideas
Use the rest of the terms in the query to bias
Maybe combine distance with frequency of term
Check result numbers in initial and corrected searches
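A sketch with the contrib SpellChecker class (the "contents" field and the directories are assumptions for the example), building the dictionary from the unique terms of an existing index:
SpellChecker spell = new SpellChecker(new RAMDirectory());
IndexReader reader = IndexReader.open(indexDirectory, true);
// Use the unique terms of an indexed field as the dictionary
spell.indexDictionary(new LuceneDictionary(reader, "contents"));
String[] suggestions = spell.suggestSimilar("squirel", 5);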
31. Even More features
Sorting
Use a field for sorting instead of relevance e.g. when you use the MatchAllDocsQuery
Beware it uses FieldCache which resides in RAM!
SpanQueries
Restrict matches by the distance between terms (span); see the sketch after this list
Family of queries like SpanNearQuery or SpanOrQuery and others
Synonyms
Injection during indexing or during searching?
Leverage a synonyms knowledge base; a good strategy is to convert it into an index
Key thing is to understand that synonyms must be injected on the same position increments
A MultiPhraseQuery is appropriate for searching when synonyms are injected at search time
Spatial Searches
Answer to the query "Greek Restaurants Near Me"
An efficient technique is to use grids
Assign non-unique grid numbers to areas (e.g. in a mercator space)
Index documents with a field containing the grid numbers that match their longitude and latitude
MoreLikeThis
One use of term vectors
Storing term vectors during indexing allows faster MoreLikeThis searches
Function Queries
e.g. add boosts for fields at search time
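A small sketch of the span family (field and terms invented for the example):
SpanQuery greek = new SpanTermQuery(new Term("contents", "greek"));
SpanQuery restaurants = new SpanTermQuery(new Term("contents", "restaurants"));
// Match the two terms within 3 positions of each other, in order
SpanNearQuery near = new SpanNearQuery(
    new SpanQuery[] { greek, restaurants }, 3, true);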
32. Some last things to bear in mind
It would be wise to back up your index
You can have hot backups (supported through the
SnapshotDeletionPolicy)
Performance has some trade-offs
search latency
indexing throughput
near real time results
index replication
index optimization
Resource consumption
Disk space
File descriptors
Memory
'Luke' is a really handy tool
You can repair a corrupted index (corrupted
segments are just lost… D'oh!)
33. Resources
Book: Lucene in Action
Solr:
http://lucene.apache.org/solr/
Vector Space Model:
http://en.wikipedia.org/wiki/Vector_Space_Model
IR Links:
http://wiki.apache.org/lucene-java/InformationRetrieval