2. Introductions
Presenter
Architect/Development Team Leader @Trasys Greece
Java EE projects for European Agencies
IR (Information Retrieval)
The tracing and recovery of specific information from stored
data
IR is interdisciplinary, based on computer science,
mathematics, library science, information science, information
architecture, cognitive psychology, linguistics, and statistics.
Lucene
Open Source – Apache Software License
(http://lucene.apache.org)
Founder: Doug Cutting
0.01 released in March 2000 (on SourceForge)
1.2 released in June 2002 (first Apache Jakarta release)
Became its own top-level Apache project in 2005
Current version is 3.1
3. More Lucene Intro…
Lucene is a high-performance, scalable IR
library (not a ready-to-use application)
A number of full-featured search applications
are built on top (more later…)
Lucene ports and bindings in many other
programming environments incl. Perl,
Python, Ruby, C/C++, PHP and C# (.NET)
Lucene "Powered By" apps (a few of
many): LinkedIn, Apple, MySpace, Eclipse
IDE, MS Outlook, Atlassian (JIRA). See
more @ http://wiki.apache.org/lucene-java/PoweredBy
4. Components of a Search
Application (1/4)
Acquire Content
Gather and scope the content
e.g. from the web with a spider
or crawler, a CMS, a Database
or a file system
Projects helping
Solr: handles RDBMS and XML
feeds and rich documents
through Tika integration
Nutch: web crawler - sister
project at Apache
Grub: open source web crawler
5. Components of a Search
Application (2/4)
Build document
Define the document
The atomic unit of indexing and searching
Has fields
De-normalization involved
Projects helping: Usually the
same frameworks cover both this
and the previous step
Compass and its evolution,
ElasticSearch
Hibernate Search
DBSight
Oracle/Lucene Integration
6. Components of a Search
Application (3/4)
Analyze Document
Handled by Analyzers
Built-in and contributed
Built with tokenizers and token
filters
Index Document
Through Lucene API or your
framework of choice
Search User
Interface/Render Results
Application specific
7. Components of a Search
Application (4/4)
Query Builder
Lucene provides one
Frameworks provide extensions but also
the application itself e.g. advanced
search
Run Query
Retrieve documents running the query
built
Three common theoretical models
Boolean model
Vector space model
Probabilistic model
Administration
e.g. tuning options
Analytics
reporting
8. How Lucene models content
Documents
Fields
Denormalization of content
Flexible Schema
Inverted Index
10. Basic Indexing
Adding documents
RAMDirectory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory,
    new WhitespaceAnalyzer(),
    IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("post",
    "the JHUG meeting is on this Saturday",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();
Deleting and updating documents (see the sketch after this list)
Field options
Store
Analyze
Norms
Term vectors
Boost
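A minimal sketch of deleting and updating, assuming an open IndexWriter on the same index as above (note that WhitespaceAnalyzer keeps the original case, so terms must be matched exactly as indexed):
// Delete every document whose "post" field contains the term "Saturday"
writer.deleteDocuments(new Term("post", "Saturday"));
// Update = an atomic delete-by-term plus add of the new document
Document updated = new Document();
updated.add(new Field("post", "the JHUG meeting moved to Sunday",
    Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("post", "JHUG"), updated);
writer.commit();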
11. Scoring – The formula
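The factors defined below combine, approximately (following the form of Lucene's Similarity documentation), as:

score(q, d) = coord(q, d) · queryNorm(q) · Σ over t in q of [ tf(t in d) · idf(t)² · boost(t.field in d) · lengthNorm(t.field in d) ]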
tf(t in d): Term frequency factor for the term (t) in the document
(d), i.e. how many times the term t occurs in the document.
idf(t): Inverse document frequency of the term: a measure of how
“unique” the term is. Very common terms have a low idf; very
rare terms have a high idf.
boost(t.field in d): Field & Document boost, as set during indexing.
This may be used to statically boost certain fields and certain
documents over others.
lengthNorm(t.field in d): Normalization value of a field, given the
number of terms within the field. This value is computed during
indexing and stored in the index norms. Shorter fields (fewer
tokens) get a bigger boost from this factor.
coord(q, d): Coordination factor, based on the number of query
terms the document contains. The coordination factor gives an
AND-like boost to documents that contain more of the search
terms than other documents
queryNorm(q): Normalization value for a query, given the sum of
the squared weights of each of the query terms.
12. Querying – the API
Variety of Query class implementations
TermQuery
PhraseQuery
TermRangeQuery
NumericRangeQuery
PrefixQuery
BooleanQuery
WildcardQuery
FuzzyQuery
MatchAllDocsQuery
…
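A sketch of combining these programmatically through BooleanQuery (the field names here are invented for the example):
BooleanQuery query = new BooleanQuery();
// Required clause: behaves like AND
query.add(new TermQuery(new Term("subject", "trachea")),
    BooleanClause.Occur.MUST);
// Optional clause: behaves like OR
query.add(NumericRangeQuery.newIntRange("year", 2010, 2011, true, true),
    BooleanClause.Occur.SHOULD);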
13. Querying - Example
private void indexSingleFieldDocs(Field[] fields) throws Exception {
IndexWriter writer = new IndexWriter(directory,
new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < fields.length; i++) {
Document doc = new Document();
doc.add(fields[i]);
writer.addDocument(doc);
}
writer.optimize();
writer.close();
}
public void wildcard() throws Exception {
  indexSingleFieldDocs(new Field[] {
      new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
      new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
  IndexSearcher searcher = new IndexSearcher(directory, true);
  // ?ild* matches "wild", "mild" and "mildew", but not "child"
  Query query = new WildcardQuery(new Term("contents", "?ild*"));
  TopDocs matches = searcher.search(query, 10);
  searcher.close();
}
14. Querying - QueryParser
// Lucene 3.x: QueryParser also takes a Version argument
Query query = new QueryParser(Version.LUCENE_31, "subject",
    analyzer).parse("(clinical OR ethics) AND methodology");
trachea AND esophagus (both terms required)
The default join condition is OR, e.g. trachea esophagus
cough AND (trachea OR esophagus) (grouping with parentheses)
trachea NOT esophagus (excludes documents matching esophagus)
full_title:trachea (search within a specific field)
"trachea disease" (phrase query)
"trachea disease"~5 (phrase with a slop of 5)
is_gender_male:y (flag field search)
[2010-01-01 TO 2010-07-01] (range query)
esophaguz~ (fuzzy query)
Trachea^5 esophagus (boosts trachea by 5)
15. Analyzers - Internals
At Indexing and querying time
Inside an analyzer
Operates on a TokenStream
A token has a text value and metadata like
Start and end character offsets
Token type
Position increment
Optionally application specific bit flags and byte[]
payload
TokenStream is abstract; Tokenizer and TokenFilter
are its concrete subclasses
A Tokenizer reads chars and produces tokens
A TokenFilter ingests tokens and produces new ones
They implement the composite pattern, forming a
chain in which each wraps the previous one
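A minimal sketch of such a chain (assuming the Lucene 3.x API; some of these constructors changed slightly across 3.x releases):
public class ChainAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // The Tokenizer reads characters and produces the initial tokens
    TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
    // Each TokenFilter wraps the previous stream, forming the chain
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return stream;
  }
}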
16. Analyzers – building blocks
Analyzers can be created by combining token streams (Order is
important)
Building blocks provided in core
CharTokenizer
WhitespaceTokenizer
KeywordTokenizer
LetterTokenizer
LowerCaseTokenizer
SinkTokenizer
StandardTokenizer
LowerCaseFilter
StopFilter
PorterStemFilter
TeeTokenFilter
ASCIIFoldingFilter
CachingTokenFilter
LengthFilter
StandardFilter
17. Analyzers - core
WhitespaceAnalyzer: splits tokens at
whitespace
SimpleAnalyzer: divides text at non-letter
characters and lowercases
StopAnalyzer: divides text at non-letter
characters, lowercases, and removes stop words
KeywordAnalyzer: treats the entire text as a single
token
StandardAnalyzer: tokenizes based on a
sophisticated grammar that recognizes email
addresses, acronyms, Chinese/Japanese/Korean
characters, alphanumerics, and more;
lowercases and removes stop words
18. Analyzers – Example (1/2)
Analyzing “The JHUG meeting is on this Saturday"
WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]
SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]
StopAnalyzer:
[jhug] [meeting] [saturday]
StandardAnalyzer:
[jhug] [meeting] [saturday]
20. Analyzers – Beyond the built-in
Language-specific analyzers, under contrib/analyzers
language-specific stemming and stop-word removal
Sounds Like analyzer e.g. MetaphoneReplacementAnalyzer
that transforms terms to their phonetic roots
SynonymAnalyzer
Nutch Analysis: bigrams for stop words
Stemming analysis
The PorterStemFilter. It stems words using the Porter
stemming algorithm created by Dr. Martin Porter, and it's
best defined in his own words:
The Porter stemming algorithm (or 'Porter stemmer') is a process
for removing the commoner morphological and inflexional endings
from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up
Information Retrieval systems.
SnowballAnalyzer: Stemming for many European
languages
21. Filters
Narrow the search space
Overloaded search methods that
accept Filter instances
Examples
TermRangeFilter
NumericRangeFilter
PrefixFilter
QueryWrapperFilter
SpanQueryFilter
ChainedFilter
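For example (a sketch; the "published" field is an assumption), a TermRangeFilter can be passed to one of the overloaded search methods:
// Restrict matches to documents whose "published" term falls in the range
Filter filter = new TermRangeFilter("published",
    "2010-01-01", "2010-12-31", true, true);
TopDocs hits = searcher.search(query, filter, 10);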
22. Example: Filters for Security
Constraints known at indexing time
Index the constraint as a field
Search wrapping a TermQuery on the constraint
field with a QueryWrapperFilter
Factor in information at search time
A custom filter
Filter will access an external privilege store that will
provide some means of identifying documents in
the index e.g. a unique term with regard to
permissions
Return a DocIdSet to Lucene. Bit positions match
the document numbers. Enabled bits mean the
document for that position is available to be
searched against the query; unset bits mean the
documents won't be considered in the search
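A sketch of such a custom filter (Lucene 3.x Filter API; the "owner" field and the single-term privilege check are simplifying assumptions standing in for the external store lookup):
public class SecurityFilter extends Filter {
  private final String user;
  public SecurityFilter(String user) { this.user = user; }

  @Override
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    // One bit per document; a set bit means the document may be searched
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    // Assumed unique term identifying the documents this user may see
    TermDocs docs = reader.termDocs(new Term("owner", user));
    while (docs.next()) {
      bits.set(docs.doc());
    }
    docs.close();
    return bits; // OpenBitSet is a DocIdSet
  }
}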
23. Internals - Concurrency
Any number of IndexReaders open
Only one IndexWriter at a time
Locking with write lock file
IndexReaders may be open while the
index is being changed by an
IndexWriter
IndexSearchers use underlying
IndexReaders
A searcher sees changes only when the writer
commits and its underlying reader is reopened
Both are thread-safe, thread-friendly classes
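For example (a sketch; reopen() returns a new reader only if the index has changed):
IndexReader current = reader.reopen();
if (current != reader) {
  reader.close();                       // release the old point-in-time view
  reader = current;
  searcher = new IndexSearcher(reader); // now sees the committed changes
}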
24. Internals - Indexing concepts
Index is made up from segment files
Deleting documents does not actually delete; it
only marks them for deletion
Index writes are buffered and flushed periodically
Segments need to be merged
Automatically by the IndexWriter
Explicit calls to optimize
There is the notion of commit (as you would
expect), which has 4 steps
Flush buffered documents and deletions
Sync files; force OS to write to stable storage of the
underlying I/O system
Write and sync the segments_N file
Remove old commits
25. Internals - Transactions
Two-phase commit is supported
prepareCommit performs steps 1, 2 and
most of 3
Lucene implements the ACID
transactional model
Atomicity: all or nothing commit
Consistency: e.g. update will mean both
delete and add
Isolation: IndexReaders cannot see what
has not been committed
Durability: Index is not corrupted and
persists in storage
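In code the two-phase commit looks roughly like this (IndexWriter API; error handling reduced to a sketch):
try {
  writer.prepareCommit(); // steps 1, 2 and most of 3; nothing is visible yet
  // ... commit other resources here, e.g. a database transaction ...
  writer.commit();        // completes the commit; changes become visible
} catch (Exception e) {
  writer.rollback();      // discards everything since the last commit
}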
26. Architectures
Cluster nodes that share a remote file system
index
Slower than local
Possible limitations due to client side caching
(Samba, NFS, AFP) or stale file handles (NFS)
Index in database
Much slower
Separate write and read indexes (replication)
Relies on the IndexDeletionPolicy feature of Lucene
Out of the box in Solr and ElasticSearch
Autonomous search servers (e.g. Solr,
ElasticSearch)
Loose coupling through JSON or XML
30. Cool extra features - Spellchecking
You will need a dictionary of valid words
You could use the unique terms in your index
Given the dictionary you could
Use a sounds-like algorithm such as Soundex or Metaphone
Or use n-grams
E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel; as
4-grams squi, quir, uirr, irre, rrel. Mistakenly searching for
squirel would match 5 grams, with 2 shared between the
3-grams and 4-grams. This would score high!
Maybe use the Levenshtein distance
To present or not to present (the suggestion)
Other ideas
Use the rest of the terms in the query to bias
Maybe combine distance with frequency of term
Check result numbers in initial and corrected searches
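A sketch with the contrib SpellChecker class (the "contents" field and the directories are assumptions for the example), building the dictionary from the unique terms of an existing index:
SpellChecker spell = new SpellChecker(new RAMDirectory());
IndexReader reader = IndexReader.open(indexDirectory, true);
// Use the unique terms of an indexed field as the dictionary
spell.indexDictionary(new LuceneDictionary(reader, "contents"));
String[] suggestions = spell.suggestSimilar("squirel", 5);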
31. Even More features
Sorting
Use a field for sorting instead of relevance e.g. when you use the MatchAllDocsQuery
Beware it uses FieldCache which resides in RAM!
SpanQueries
Restrict matches by the distance between terms (span); see the sketch after this list
Family of queries like SpanNearQuery or SpanOrQuery and others
Synonyms
Injection during indexing or during searching?
Leverage a synonyms knowledge base; a good strategy is to convert it into an index
Key thing is to understand that synonyms must be injected on the same position increments
A MultiPhraseQuery is appropriate for searching when synonyms are injected at search time
Spatial Searches
Answer to the query "Greek Restaurants Near Me"
An efficient technique is to use grids
Assign non-unique grid numbers to areas (e.g. in a mercator space)
Index documents with a field containing the grid numbers that match their longitude and latitude
MoreLikeThis
One use of term vectors
Storing term vectors during indexing allows faster MoreLikeThis searches
Function Queries
e.g. add boosts for fields at search time
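A small sketch of the span family (field and terms invented for the example):
SpanQuery greek = new SpanTermQuery(new Term("contents", "greek"));
SpanQuery restaurants = new SpanTermQuery(new Term("contents", "restaurants"));
// Match the two terms within 3 positions of each other, in order
SpanNearQuery near = new SpanNearQuery(
    new SpanQuery[] { greek, restaurants }, 3, true);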
32. Some last things to bear in mind
It would be wise to back up your index
You can have hot backups (supported through the
SnapshotDeletionPolicy)
Performance has some trade-offs
search latency
indexing throughput
near real time results
index replication
index optimization
Resource consumption
Disk space
File descriptors
Memory
'Luke' is a really handy tool
You can repair a corrupted index (corrupted
segments are just lost… D'oh!)
33. Resources
Book: Lucene in Action
Solr:
http://lucene.apache.org/solr/
Vector Space Model:
http://en.wikipedia.org/wiki/Vector_Space_Model
IR Links:
http://wiki.apache.org/lucene-java/InformationRetrieval