Introduction to Information Retrieval with Lucene
By Stylianos Gkorilas
Introductions

- Presenter
  - Architect/Development Team Leader @Trasys Greece
  - Java EE projects for European Agencies
- IR (Information Retrieval)
  - The tracing and recovery of specific information from stored data
  - IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics.

Lucene

- Open Source – Apache Software License (http://lucene.apache.org)
- Founder: Doug Cutting
- Release 0.01 in March 2000 (SourceForge)
- Release 1.2 in June 2002 (first Apache Jakarta release)
- Became its own top-level Apache project in 2005
- Current version is 3.1

More Lucene Intro…

- Lucene is a high-performance, scalable IR library (not a ready-to-use application)
- A number of full-featured search applications are built on top of it (more later…)
- Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET)
- Lucene "Powered By" apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucene-java/PoweredBy

Components of a Search Application (1/4)

- Acquire Content
  - Gather and scope the content
    - e.g. from the web with a spider or crawler, a CMS, a database or a file system
  - Projects helping
    - Solr: handles RDBMS and XML feeds, and rich documents through Tika integration
    - Nutch: web crawler, a sister project at Apache
    - Grub: open source web crawler

Components of a Search Application (2/4)

- Build document
  - Define the document
    - The unit of the search engine
    - Has fields
    - De-normalization involved
  - Projects helping: usually the same frameworks cover both this and the previous step
    - Compass and its evolution, ElasticSearch
    - Hibernate Search
    - DBSight
    - Oracle/Lucene Integration

Components of a Search Application (3/4)

- Analyze Document
  - Handled by Analyzers
    - Built-in and contributed
    - Built with tokenizers and token filters
- Index Document
  - Through the Lucene API or your framework of choice
- Search User Interface/Render Results
  - Application specific

Components of a Search Application (4/4)

- Query Builder
  - Lucene provides one
  - Frameworks provide extensions, and so can the application itself, e.g. advanced search
- Run Query
  - Retrieve documents by running the query that was built
  - Three common theoretical models
    - Boolean model
    - Vector space model
    - Probabilistic model
- Administration
  - e.g. tuning options
- Analytics
  - Reporting

How Lucene models content

- Documents
- Fields
- Denormalization of content
- Flexible Schema
- Inverted Index

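The inverted index idea can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's data structure: the class name InvertedIndexSketch and its methods are made up for this example, and terms are simply lowercased and split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: each term maps to the sorted ids of the documents that contain it.
// Hypothetical class for illustration; Lucene's real index is segment-based and far richer.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Stores the text and indexes its lowercased, whitespace-separated terms.
    // Returns the new document's id (its insertion position).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
            }
        }
        return docId;
    }

    // Ids of the documents containing the term; empty if the term was never indexed.
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<Integer>() : ids;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("the JHUG meeting is on this Saturday");
        index.add("the meeting was moved");
        System.out.println("meeting  -> " + index.lookup("meeting"));
        System.out.println("saturday -> " + index.lookup("saturday"));
    }
}
```

A lookup goes from a term straight to the documents that contain it, without scanning document text, which is why search cost depends on the number of matching postings rather than on the size of the collection.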
Basic Lucene Classes

- Indexing
  - IndexWriter
  - Directory
  - Analyzer
  - Document
  - Field
- Searching
  - IndexSearcher
  - Query
  - TopDocs
  - Term
  - QueryParser

Basic Indexing

- Adding documents

    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("post",
        "the JHUG meeting is on this Saturday",
        Field.Store.YES,
        Field.Index.ANALYZED));
    writer.addDocument(doc);  // without this call the document never reaches the index
    writer.close();           // commits pending changes and releases the write lock

- Deleting and updating documents
- Field options
  - Store
  - Analyze
  - Norms
  - Term vectors
  - Boost

Scoring – The formula

- tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document.
- idf(t): Inverse document frequency of the term: a measure of how "unique" the term is. Very common terms have a low idf; very rare terms have a high idf.
- boost(t.field in d): Field & Document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others.
- lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor.
- coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents.
- queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.

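To make the factors concrete, here is a minimal sketch in plain Java. It assumes the default formulas of Lucene 3.x's DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(numTerms)) and combines them as coord × queryNorm × Σ(tf × idf² × boost × lengthNorm) with all boosts set to 1. The class ScoringSketch is hypothetical, and the lossy one-byte encoding Lucene applies to stored norms is not modelled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the scoring factors for single-field documents, all boosts = 1.
// Assumption: formulas follow Lucene 3.x DefaultSimilarity; this is not Lucene code.
class ScoringSketch {
    static double tf(int freq) { return Math.sqrt(freq); }

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    static double coord(int overlap, int maxOverlap) { return overlap / (double) maxOverlap; }

    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Scores one tokenized document against a tokenized query; corpus supplies docFreq.
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double sumOfSquaredWeights = 0.0;
        double sum = 0.0;
        int overlap = 0;
        for (String term : query) {
            int docFreq = 0;
            for (List<String> d : corpus) {
                if (d.contains(term)) docFreq++;
            }
            double idf = idf(docFreq, corpus.size());
            sumOfSquaredWeights += idf * idf;
            int freq = Collections.frequency(doc, term);
            if (freq > 0) {
                overlap++;
                sum += tf(freq) * idf * idf * lengthNorm(doc.size());
            }
        }
        if (overlap == 0) return 0.0; // no query term matched
        return coord(overlap, query.size()) * queryNorm(sumOfSquaredWeights) * sum;
    }

    public static void main(String[] args) {
        List<String> d0 = Arrays.asList("lucene", "search", "library", "java");
        List<String> d1 = Arrays.asList("lucene", "is", "very", "fast");
        List<List<String>> corpus = Arrays.asList(d0, d1);
        List<String> query = Arrays.asList("lucene", "search");
        System.out.println("d0: " + score(query, d0, corpus));
        System.out.println("d1: " + score(query, d1, corpus));
    }
}
```

In this example d0 contains both query terms and d1 only one, so d0 gains from coord and from the rarer term "search" carrying a higher idf.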
Querying – the API

- Variety of Query class implementations
  - TermQuery
  - PhraseQuery
  - TermRangeQuery
  - NumericRangeQuery
  - PrefixQuery
  - BooleanQuery
  - WildcardQuery
  - FuzzyQuery
  - MatchAllDocsQuery
  - …

Querying - Example

    // 'directory' is assumed to be a Directory field of the enclosing class.
    private void indexSingleFieldDocs(Field[] fields) throws Exception {
        IndexWriter writer = new IndexWriter(directory,
            new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
        // One document per field, so each term lives in its own document.
        for (int i = 0; i < fields.length; i++) {
            Document doc = new Document();
            doc.add(fields[i]);
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }

    public void wildcard() throws Exception {
        indexSingleFieldDocs(new Field[] {
            new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
            new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED) });
        IndexSearcher searcher = new IndexSearcher(directory, true);
        // '?' matches exactly one character, '*' matches zero or more:
        // wild, mild and mildew match; child does not.
        Query query = new WildcardQuery(new Term("contents", "?ild*"));
        TopDocs matches = searcher.search(query, 10);
        searcher.close();
    }

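The wildcard semantics used above ('?' for exactly one character, '*' for zero or more) can be checked without an index. The sketch below translates such a pattern into a java.util.regex.Pattern; it only illustrates the matching rules. WildcardSketch is a hypothetical helper, and nothing here claims that Lucene implements WildcardQuery this way.

```java
import java.util.regex.Pattern;

// Illustration of wildcard matching rules: '?' = one character, '*' = zero or more.
// Hypothetical helper; not how Lucene evaluates WildcardQuery internally.
class WildcardSketch {
    // Converts a wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True when the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"wild", "child", "mild", "mildew"}) {
            System.out.println(term + " -> " + matches("?ild*", term));
        }
    }
}
```

The term "child" fails because '?' consumes only the "c", leaving "hild", which does not start with "ild".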
Querying - QueryParser

    // Lucene 3.x QueryParser constructors take a Version argument.
    Query query = new QueryParser(Version.LUCENE_30, "subject", analyzer)
        .parse("(clinical OR ethics) AND methodology");

Query syntax examples

    trachea AND esophagus
    trachea esophagus            (the default join condition is OR)
    cough AND (trachea OR esophagus)
    trachea NOT esophagus
    full_title:trachea
    "trachea disease"
    "trachea disease"~5
    is_gender_male:y
    [2010-01-01 TO 2010-07-01]
    esophaguz~
    Trachea^5 esophagus

Analyzers - Internals

- Used at both indexing and querying time
- Inside an analyzer
  - Operates on a TokenStream
  - A token has a text value and metadata like
    - Start and end character offsets
    - Token type
    - Position increment
    - Optionally, application-specific bit flags and a byte[] payload
  - TokenStream is abstract; Tokenizer and TokenFilter are its two kinds of subclass
    - A Tokenizer reads chars and produces tokens
    - A TokenFilter ingests tokens and produces new ones
    - The composite pattern is implemented: they wrap one another to form a chain

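The tokenizer-plus-filters chain can be sketched with plain lists of strings. This is a simplification under stated assumptions: real TokenStreams are pulled lazily and carry offsets, types and position increments, none of which is modelled here. The class AnalyzerChainSketch and its small stop list are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: a tokenizer produces tokens, each filter consumes and re-emits them.
// Hypothetical class; tokens are plain strings without offsets or positions.
class AnalyzerChainSketch {
    // Assumed stop list for this example only; Lucene ships its own, longer list.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "is", "on", "this"));

    // "Tokenizer": reads characters and produces tokens, here by splitting at whitespace.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // "Token filter": lowercases every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // "Token filter": drops tokens found in the stop list (case-sensitive comparison).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // Chain in the usual order: tokenize, lowercase, then remove stop words.
    static List<String> lowerThenStop(String text) {
        return removeStopWords(lowerCase(whitespaceTokenize(text)));
    }

    // Same blocks in the other order: "The" survives because it is not yet lowercased.
    static List<String> stopThenLower(String text) {
        return lowerCase(removeStopWords(whitespaceTokenize(text)));
    }

    public static void main(String[] args) {
        String text = "The JHUG meeting is on this Saturday";
        System.out.println(lowerThenStop(text));
        System.out.println(stopThenLower(text));
    }
}
```

Running the two chains on the same sentence shows why the order of the building blocks matters: with the stop filter first, the capitalised "The" slips through and is lowercased afterwards.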
Analyzers – building blocks

- Analyzers can be created by combining token streams (order is important)
- Building blocks provided in core
  - CharTokenizer
  - WhitespaceTokenizer
  - KeywordTokenizer
  - LetterTokenizer
  - LowerCaseTokenizer
  - SinkTokenizer
  - StandardTokenizer
  - LowerCaseFilter
  - StopFilter
  - PorterStemFilter
  - TeeTokenFilter
  - ASCIIFoldingFilter
  - CachingTokenFilter
  - LengthFilter
  - StandardFilter

Analyzers - core

- WhitespaceAnalyzer: Splits tokens at whitespace
- SimpleAnalyzer: Divides text at non-letter characters and lowercases
- StopAnalyzer: Divides text at non-letter characters, lowercases, and removes stop words
- KeywordAnalyzer: Treats entire text as a single token
- StandardAnalyzer: Tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words

Analyzers – Example (1/2)
Analyzing "The JHUG meeting is on this Saturday"
WhitespaceAnalyzer:
[The] [JHUG] [meeting] [is] [on] [this] [Saturday]
SimpleAnalyzer:
[the] [jhug] [meeting] [is] [on] [this] [saturday]
StopAnalyzer:
[jhug] [meeting] [saturday]
StandardAnalyzer:
[jhug] [meeting] [saturday]
Analyzers – Example (2/2)
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
Analyzers – Beyond the built in

- Language-specific analyzers, under contrib/analyzers
  - Language-specific stemming and stop-word removal
- Sounds-like analyzers, e.g. MetaphoneReplacementAnalyzer, which transforms terms to their phonetic roots
- SynonymAnalyzer
- Nutch Analysis: bigrams for stop words
- Stemming analysis
  - The PorterStemFilter. It stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words:
    "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."
  - SnowballAnalyzer: Stemming for many European languages

Filters

- Narrow the search space
- Overloaded search methods that accept Filter instances
- Examples
  - TermRangeFilter
  - NumericRangeFilter
  - PrefixFilter
  - QueryWrapperFilter
  - SpanQueryFilter
  - ChainedFilter

Example: Filters for Security

- Constraints known at indexing time
  - Index the constraint as a field
  - Search by wrapping a TermQuery on the constraint field with a QueryWrapperFilter
- Factor in information at search time
  - A custom filter
  - The filter will access an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions
  - Return a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document for that position is available to be searched against the query; unset bits mean the documents won't be considered in the search

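The bit-set idea behind such a custom filter can be sketched with java.util.BitSet. This is a stand-alone illustration and does not use Lucene's Filter or DocIdSet classes. The class SecurityFilterSketch, the per-document group array and the user's group set are all hypothetical stand-ins for the external privilege store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration of "bit positions match document numbers" for a permission filter.
// Hypothetical class; a real Lucene filter would return a DocIdSet instead.
class SecurityFilterSketch {
    // docGroups[i] is the group allowed to see document i (assumed privilege data).
    // Returns a bit set with bit i enabled when the user may search document i.
    static BitSet allowedDocs(String[] docGroups, Set<String> userGroups) {
        BitSet bits = new BitSet(docGroups.length);
        for (int docId = 0; docId < docGroups.length; docId++) {
            if (userGroups.contains(docGroups[docId])) {
                bits.set(docId);
            }
        }
        return bits;
    }

    // Keeps only the query matches whose bit is enabled.
    static List<Integer> applyFilter(int[] matchingDocIds, BitSet allowed) {
        List<Integer> visible = new ArrayList<Integer>();
        for (int docId : matchingDocIds) {
            if (allowed.get(docId)) visible.add(docId);
        }
        return visible;
    }

    public static void main(String[] args) {
        String[] docGroups = {"hr", "engineering", "hr", "public"};
        Set<String> userGroups = new HashSet<String>(Arrays.asList("engineering", "public"));
        BitSet allowed = allowedDocs(docGroups, userGroups);
        System.out.println("allowed bits: " + allowed);
        System.out.println("visible hits: " + applyFilter(new int[] {0, 1, 2, 3}, allowed));
    }
}
```

Documents with an unset bit never reach the result list, whatever the query matched.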
Internals - Concurrency

- Any number of IndexReaders may be open
  - IndexSearchers use underlying IndexReaders
- Only one IndexWriter at a time
  - Locking with a write lock file
- IndexReaders may be open while the index is being changed by an IndexWriter
  - A reader will see changes only when the writer commits and the reader is reopened
- Both are thread safe/friendly classes

Internals - Indexing concepts

- The index is made up of segment files
- Deleting documents does not actually delete them; it only marks them for deletion
- Index writes are buffered and flushed periodically
- Segments need to be merged
  - Automatically by the IndexWriter
  - Explicit calls to optimize
- There is the notion of commit (as you would expect), which has 4 steps
  1. Flush buffered documents and deletions
  2. Sync files; force the OS to write to stable storage of the underlying I/O system
  3. Write and sync the segments_N file
  4. Remove old commits

Internals - Transactions

- Two-phase commit is supported
  - prepareCommit performs commit steps 1, 2 and most of 3
- Lucene implements the ACID transactional model
  - Atomicity: all-or-nothing commit
  - Consistency: e.g. an update means both a delete and an add
  - Isolation: IndexReaders cannot see what has not been committed
  - Durability: the index is not corrupted and persists in storage

Architectures

- Cluster nodes that share a remote file system index
  - Slower than local
  - Possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS)
- Index in database
  - Much slower
- Separate write and read indexes (replication)
  - Relies on the IndexDeletionPolicy feature of Lucene
  - Out of the box in Solr and ElasticSearch
- Autonomous search servers (e.g. Solr, ElasticSearch)
  - Loose coupling through JSON or XML

Frameworks – Compass Document definition via JPA mapping

    <compass-core-mapping package="eu.emea.eudract.model.entity">
      <class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
        <id name="ctaIdentificationId">
          <meta-data>cta_id</meta-data>
        </id>
        <dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name</dynamic-meta-data>
        <property name="fullTitle">
          <meta-data>cta_full_title</meta-data>
        </property>
        <property name="sponsorProtocolVersionDate">
          <meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
        </property>
        <property name="isResubmission">
          <meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
        </property>
        <component name="eudractNumber" />
      </class>
      <class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
        <property name="eudractNumberId">
          <meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
          <meta-data>eudract_number</meta-data>
        </property>
        <property name="paediatricClinicalTrial">
          <meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial</meta-data>
        </property>
      </class>
      .....
    </compass-core-mapping>

Frameworks – Solr Document definition via DB mapping

    <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document name="products">
        <entity name="item" query="select * from item">
          <field column="ID" name="id" />
          <field column="NAME" name="name" />
          <field column="MANU" name="manu" />
          <field column="WEIGHT" name="weight" />
          <field column="PRICE" name="price" />
          <field column="POPULARITY" name="popularity" />
          <field column="INSTOCK" name="inStock" />
          <field column="INCLUDES" name="includes" />
          <entity name="feature" query="select description from feature where item_id='${item.ID}'">
            <field name="features" column="description" />
          </entity>
          <entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
            <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
              <field column="description" name="cat" />
            </entity>
          </entity>
        </entity>
      </document>
    </dataConfig>

Frameworks – Compass/Lucene Configuration

    <compass name="default">
      <setting name="compass.transaction.managerLookup">
        org.compass.core.transaction.manager.OC4J</setting>
      <setting name="compass.transaction.factory">
        org.compass.core.transaction.JTASyncTransactionFactory</setting>
      <setting name="compass.transaction.lockPollInterval">400</setting>
      <setting name="compass.transaction.lockTimeout">90</setting>
      <setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
      <!--<setting name="compass.engine.connection">
        jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
      <!--<setting name="compass.engine.store.jdbc.connection.provider.class">-->
      <!--org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider-->
      <!--</setting>-->
      <!--<setting name="compass.engine.ramBufferSize">512</setting>-->
      <!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
      <setting name="compass.converter.dashHandlingConverter.type">
        eu.emea.eudract.compasssearch.DashHandlingConverter
      </setting>
      <setting name="compass.converter.shortToYesNoNaConverter.type">
        eu.emea.eudract.compasssearch.ShortToYesNoNaConverter
      </setting>
      <setting name="compass.converter.shortToPerDayOrTotalConverter.type">
        eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter
      </setting>
      <setting name="compass.engine.store.jdbc.dialect">
        org.apache.lucene.store.jdbc.dialect.OracleDialect
      </setting>
      <setting name="compass.engine.analyzer.default.type">
        org.apache.lucene.analysis.standard.StandardAnalyzer
      </setting>
    </compass>

Cool extra features - Spellchecking

- You will need a dictionary of valid words
  - You could use the unique terms in your index
- Given the dictionary you could
  - Use a sounds-like algorithm like Soundex or Metaphone
  - Or use n-grams
    - E.g. squirrel as 3-grams is squ, qui, uir, irr, rre, rel. As 4-grams it is squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would still match four of the 3-grams (squ, qui, uir, rel) and two of the 4-grams (squi, quir). This would score high!
- To present or not to present (the suggestion)
  - Maybe use the Levenshtein distance
- Other ideas
  - Use the rest of the terms in the query to bias
  - Maybe combine distance with frequency of term
  - Check result numbers in initial and corrected searches

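Both spellchecking ingredients mentioned above, n-grams and the Levenshtein distance, fit in a short plain-Java sketch. It is written from the standard textbook definitions and is not Lucene's spellchecker code; the class name SpellcheckSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Textbook n-gram generation and Levenshtein edit distance.
// Hypothetical helper class for illustration only.
class SpellcheckSketch {
    // All contiguous substrings of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Minimum number of single-character insertions, deletions and substitutions
    // needed to turn a into b, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(ngrams("squirrel", 3));
        System.out.println(ngrams("squirrel", 4));
        System.out.println(levenshtein("squirel", "squirrel"));
    }
}
```

A candidate suggestion found through shared n-grams could then be kept or dropped depending on how small its edit distance to the typed word is; where to put that threshold is an application decision.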
Even More features

- Sorting
  - Use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery
  - Beware: it uses FieldCache, which resides in RAM!
- SpanQueries
  - Distance between terms (span)
  - Family of queries like SpanNearQuery or SpanOrQuery and others
- Synonyms
  - Injection during indexing or during searching?
    - A MultiPhraseQuery is appropriate for injection at search time
    - Injection during indexing will allow faster searches
  - Leverage a synonyms knowledge base
    - A good strategy is to convert it into an index
  - The key thing to understand is that synonyms must be injected at the same position increments
- Spatial Searches
  - Answer to the query "Greek Restaurants Near Me"
  - An efficient technique is to use grids
    - Assign non-unique grid numbers to areas (e.g. in a Mercator space)
    - Index documents with a field containing the grid numbers that match their longitude and latitude
- MoreLikeThis
  - One use of term vectors
- Function Queries
  - e.g. add boosts for fields at search time

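The grid technique listed under Spatial Searches can be illustrated with a very small sketch. It assumes a plain latitude/longitude grid with square cells measured in degrees, which is simpler than the Mercator projection the slide mentions. The class GridSketch and the "row_col" cell naming are invented for this example.

```java
// Maps a coordinate to the id of the grid cell that contains it.
// Assumption: equal-sized cells in degrees, not a Mercator projection.
class GridSketch {
    // Returns "row_col" for the cell containing the point.
    // Latitude is shifted by 90 and longitude by 180 so that indexes are non-negative.
    static String cell(double latitude, double longitude, double cellSizeDegrees) {
        long row = (long) Math.floor((latitude + 90.0) / cellSizeDegrees);
        long col = (long) Math.floor((longitude + 180.0) / cellSizeDegrees);
        return row + "_" + col;
    }

    public static void main(String[] args) {
        // Two points in the same one-degree cell and one further north.
        System.out.println(cell(37.98, 23.73, 1.0));
        System.out.println(cell(37.50, 23.10, 1.0));
        System.out.println(cell(40.64, 22.94, 1.0));
    }
}
```

Indexing the cell id as a field turns "near me" into an ordinary term match on the user's own cell. A point close to a cell border has neighbours in adjacent cells, so a practical search would query those cells as well.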
Some last things to bear in mind

- It would be wise to back up your index
  - You can have hot backups (supported through the SnapshotDeletionPolicy)
- Performance has some trade-offs
  - Search latency
  - Indexing throughput
  - Near real time results
  - Index replication
  - Index optimization
- Resource consumption
  - Disk space
  - File descriptors
  - Memory
- "Luke" is a really handy tool
- You can repair a corrupted index (corrupted segments are just lost… D'oh!)

Resources

- Book: Lucene in Action
- Solr: http://lucene.apache.org/solr/
- Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model
- IR Links: http://wiki.apache.org/lucene-java/InformationRetrieval

That’s it

Questions?

 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

IR with lucene

  • 1. Introduction to Information Retrieval with Lucene By Stylianos Gkorilas
  • 2. Introductions. Presenter: Architect/Development Team Leader @Trasys Greece; Java EE projects for European Agencies. IR (Information Retrieval): the tracing and recovery of specific information from stored data. IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, and statistics. Lucene: Open Source – Apache Software License (http://lucene.apache.org); founder: Doug Cutting; 0.01 release in March 2000 (SourceForge); 1.2 release in June 2002 (first Apache Jakarta release); its own top-level Apache project in 2005; current version is 3.1.
  • 3. More Lucene Intro… Lucene is a high-performance, scalable IR library (not a ready-to-use application). A number of full-featured search applications are built on top (more later…). Lucene ports and bindings exist in many other programming environments, incl. Perl, Python, Ruby, C/C++, PHP and C# (.NET). Lucene 'Powered By' apps (a few of many): LinkedIn, Apple, MySpace, Eclipse IDE, MS Outlook, Atlassian (JIRA). See more @ http://wiki.apache.org/lucenejava/PoweredBy
  • 4. Components of a Search Application (1/4). Acquire Content: gather and scope the content, e.g. from the web with a spider or crawler, a CMS, a database or a file system. Projects helping: Solr (handles RDBMS and XML feeds, and rich documents through Tika integration); Nutch (web crawler, sister project at Apache); Grub (open source web crawler).
  • 5. Components of a Search Application (2/4). Build document: define the document (the unit of the search engine; has fields; de-normalization involved). Projects helping (usually the same frameworks cover both this and the previous step): Compass and its evolution, ElasticSearch; Hibernate Search; DBSight; Oracle/Lucene Integration.
  • 6. Components of a Search Application (3/4). Analyze Document: handled by Analyzers (built-in and contributed), which are built with tokenizers and token filters. Index Document: through the Lucene API or your framework of choice. Search User Interface/Render Results: application specific.
  • 7. Components of a Search Application (4/4). Query Builder: Lucene provides one; frameworks provide extensions, but so does the application itself, e.g. advanced search. Run Query: retrieve documents by running the query built; three common theoretical models: Boolean model, Vector space model, Probabilistic model. Administration: e.g. tuning options. Analytics: reporting.
  • 8. How Lucene models content: Documents; Fields; Denormalization of content; Flexible Schema; Inverted Index.
  • 9. Basic Lucene Classes. Indexing: IndexWriter, Directory, Analyzer, Document, Field. Searching: IndexSearcher, Query, TopDocs, Term, QueryParser.
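To make the "inverted index" item on slide 8 concrete, here is a minimal plain-Java sketch that maps each term to the sorted set of document ids containing it. It does not use Lucene; the class name `InvertedIndexSketch` and its methods are hypothetical, and the whitespace-plus-lowercase tokenization is an assumption chosen for brevity.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: not Lucene's actual index structure.
class InvertedIndexSketch {
    // term -> ids of the documents that contain the term (the "postings")
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<String, SortedSet<Integer>>();
    private int nextDocId = 0;

    // Tokenizes on whitespace, lowercases, and records the doc id under each term.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    // Looks the term up directly instead of scanning every document.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : docs;
    }
}
```

The point of the structure is that a lookup cost depends on the number of distinct terms, not on the number of documents scanned.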
  • 10. Basic Indexing. Adding documents:
        // In-memory index; WhitespaceAnalyzer splits the text at whitespace only
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("post", "the JHUG meeting is on this Saturday", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);  // hand the document to the writer
        writer.close();           // commit the changes and release the write lock
    Deleting and updating documents. Field options: Store, Analyze, Norms, Term vectors, Boost.
  • 11. Scoring – The formula. tf(t in d): Term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document. idf(t): Inverse document frequency of the term: a measure of how "unique" the term is. Very common terms have a low idf; very rare terms have a high idf. boost(t.field in d): Field and document boost, as set during indexing. This may be used to statically boost certain fields and certain documents over others. lengthNorm(t.field in d): Normalization value of a field, given the number of terms within the field. This value is computed during indexing and stored in the index norms. Shorter fields (fewer tokens) get a bigger boost from this factor. coord(q, d): Coordination factor, based on the number of query terms the document contains. The coordination factor gives an AND-like boost to documents that contain more of the search terms than other documents. queryNorm(q): Normalization value for a query, given the sum of the squared weights of each of the query terms.
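The formula itself is not part of this transcript (it was presumably an image on the slide). Assuming the slide shows the practical scoring function of Lucene 3.x's default `Similarity`, the factors listed on slide 11 would combine roughly as follows. The query-time term boost is left out here because the slide does not list it.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t.\mathrm{field} \in d)\cdot \mathrm{lengthNorm}(t.\mathrm{field} \in d) \Big)
\]
```

Under that same assumption, the default implementation computes tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), and lengthNorm as 1 / sqrt(number of terms in the field).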
  • 12. Querying – the API. Variety of Query class implementations: TermQuery, PhraseQuery, TermRangeQuery, NumericRangeQuery, PrefixQuery, BooleanQuery, WildcardQuery, FuzzyQuery, MatchAllDocsQuery, …
  • 13. Querying - Example
        // Indexes each given field as its own single-field document
        private void indexSingleFieldDocs(Field[] fields) throws Exception {
            IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
            for (int i = 0; i < fields.length; i++) {
                Document doc = new Document();
                doc.add(fields[i]);
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
        }

        public void wildcard() throws Exception {
            indexSingleFieldDocs(new Field[] {
                new Field("contents", "wild", Field.Store.YES, Field.Index.ANALYZED),
                new Field("contents", "child", Field.Store.YES, Field.Index.ANALYZED),
                new Field("contents", "mild", Field.Store.YES, Field.Index.ANALYZED),
                new Field("contents", "mildew", Field.Store.YES, Field.Index.ANALYZED)
            });
            IndexSearcher searcher = new IndexSearcher(directory, true);
            // '?' matches exactly one character and '*' matches zero or more,
            // so "?ild*" matches wild, mild and mildew, but not child
            Query query = new WildcardQuery(new Term("contents", "?ild*"));
            TopDocs matches = searcher.search(query, 10);
        }
  • 14. Querying - QueryParser
        Query query = new QueryParser("subject", analyzer).parse("(clinical OR ethics) AND methodology");
    Query syntax examples:
        trachea AND esophagus
        trachea esophagus (the default join condition is OR)
        cough AND (trachea OR esophagus)
        trachea NOT esophagus
        full_title:trachea
        "trachea disease"
        "trachea disease"~5
        is_gender_male:y
        [2010-01-01 TO 2010-07-01]
        esophaguz~
        Trachea^5 esophagus
  • 15. Analyzers - Internals. Used at indexing and querying time. Inside an analyzer: operates on a TokenStream. A token has a text value and metadata like start and end character offsets, token type, position increment, and optionally application-specific bit flags and a byte[] payload. TokenStream is abstract; Tokenizer and TokenFilter are the concrete ones: a Tokenizer reads chars and produces tokens; a TokenFilter ingests tokens and produces new ones; the composite pattern is implemented and they form a chain of one another.
  • 16. Analyzers – building blocks. Analyzers can be created by combining token streams (order is important). Building blocks provided in core: CharTokenizer, WhitespaceTokenizer, KeywordTokenizer, LetterTokenizer, LowerCaseTokenizer, SinkTokenizer, StandardTokenizer, LowerCaseFilter, StopFilter, PorterStemFilter, TeeTokenFilter, ASCIIFoldingFilter, CachingTokenFilter, LengthFilter, StandardFilter.
  • 17. Analyzers - core. WhitespaceAnalyzer: splits tokens at whitespace. SimpleAnalyzer: divides text at non-letter characters and lowercases. StopAnalyzer: divides text at non-letter characters, lowercases, and removes stop words. KeywordAnalyzer: treats the entire text as a single token. StandardAnalyzer: tokenizes based on a sophisticated grammar that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases and removes stop words.
  • 18. Analyzers – Example (1/2) Analyzing "The JHUG meeting is on this Saturday" WhitespaceAnalyzer: [The] [JHUG] [meeting] [is] [on] [this] [Saturday] SimpleAnalyzer: [the] [jhug] [meeting] [is] [on] [this] [saturday] StopAnalyzer: [jhug] [meeting] [saturday] StandardAnalyzer: [jhug] [meeting] [saturday]
  • 19. Analyzers – Example (2/2) Analyzing "XY&Z Corporation - xyz@example.com" WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com] SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
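The tokenizer-plus-filter chain behind the Whitespace, Simple and Stop analyzer outputs on slides 18 and 19 can be imitated in plain Java. This is a hedged sketch without Lucene on the classpath: `AnalyzerSketch` and its method names are hypothetical, and the stop-word list is assumed to mirror Lucene's default English set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for Lucene's tokenizers and token filters.
class AnalyzerSketch {
    // Assumed to match Lucene's default English stop words.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
        "the", "their", "then", "there", "these", "they", "this", "to", "was",
        "will", "with"));

    // Tokenizer: split at whitespace only (WhitespaceAnalyzer behaviour).
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Tokenizer: a token is a maximal run of letters.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Filter: lowercase every token.
    static List<String> lowerCaseFilter(List<String> in) {
        List<String> out = new ArrayList<String>();
        for (String t : in) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter: drop stop words.
    static List<String> stopFilter(List<String> in) {
        List<String> out = new ArrayList<String>();
        for (String t : in) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // SimpleAnalyzer-like chain: letter tokenizer, then lowercase filter.
    static List<String> simpleAnalyze(String text) {
        return lowerCaseFilter(letterTokenize(text));
    }

    // StopAnalyzer-like chain: the simple chain followed by the stop filter.
    static List<String> stopAnalyze(String text) {
        return stopFilter(simpleAnalyze(text));
    }
}
```

The order of the chain matters: the stop filter only works here because it runs after lowercasing, since the stop-word set is lowercase.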
  • 20. Analyzers – Beyond the built in. Language-specific analyzers, under contrib/analyzers: language-specific stemming and stop-word removal. Sounds Like analyzer, e.g. MetaphoneReplacementAnalyzer, which transforms terms to their phonetic roots. SynonymAnalyzer. Nutch Analysis: bigrams for stop words. Stemming analysis: the PorterStemFilter stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it's best defined in his own words: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems." SnowballAnalyzer: stemming for many European languages.
  • 21. Filters. Narrow the search space. Overloaded search methods accept Filter instances. Examples: TermRangeFilter, NumericRangeFilter, PrefixFilter, QueryWrapperFilter, SpanQueryFilter, ChainedFilter.
  • 22. Example: Filters for Security. Constraints known at indexing time: index the constraint as a field; search by wrapping a TermQuery on the constraint field with a QueryWrapperFilter. Factor in information at search time: use a custom filter. The filter will access an external privilege store that provides some means of identifying documents in the index, e.g. a unique term with regard to permissions, and returns a DocIdSet to Lucene. Bit positions match the document numbers. Enabled bits mean the document for that position is available to be searched against the query; unset bits mean the documents won't be considered in the search.
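The bit-per-document idea on slide 22 can be sketched with `java.util.BitSet`. This is not Lucene's `Filter` API: `SecurityFilterSketch`, the owner-per-document array standing in for the external privilege store, and the method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of a permission filter expressed as a bit set.
class SecurityFilterSketch {
    // Sets bit i when document i belongs to an owner the user may see.
    // ownerByDocId plays the role of the external privilege store.
    static BitSet allowedDocs(String[] ownerByDocId, Set<String> allowedOwners) {
        BitSet bits = new BitSet(ownerByDocId.length);
        for (int docId = 0; docId < ownerByDocId.length; docId++) {
            if (allowedOwners.contains(ownerByDocId[docId])) {
                bits.set(docId);
            }
        }
        return bits;
    }

    // Keeps only the query hits whose bit is enabled; unset bits are never returned.
    static List<Integer> applyFilter(int[] queryHits, BitSet allowed) {
        List<Integer> visible = new ArrayList<Integer>();
        for (int docId : queryHits) {
            if (allowed.get(docId)) {
                visible.add(docId);
            }
        }
        return visible;
    }
}
```

Because the bit set is independent of the query, it can be computed once per user and reused across searches.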
  • 23. Internals - Concurrency. Any number of IndexReaders may be open (IndexSearchers use underlying IndexReaders). Only one IndexWriter at a time (locking with a write lock file). IndexReaders may be open while the index is being changed by an IndexWriter; a reader will see the changes only after the writer commits and the reader is reopened. Both are thread safe/friendly classes.
  • 24. Internals - Indexing concepts. The index is made up of segment files. Deleting documents does not actually delete them; it only marks them for deletion. Index writes are buffered and flushed periodically. Segments need to be merged: automatically by the IndexWriter, or through explicit calls to optimize. There is the notion of commit (as you would expect), which has 4 steps: (1) flush buffered documents and deletions; (2) sync files, i.e. force the OS to write to stable storage of the underlying I/O system; (3) write and sync the segments_N file; (4) remove old commits.
  • 25. Internals - Transactions. Two-phase commit is supported: prepareCommit performs steps 1, 2 and most of 3. Lucene implements the ACID transactional model. Atomicity: all-or-nothing commit. Consistency: e.g. an update will mean both delete and add. Isolation: IndexReaders cannot see what has not been committed. Durability: the index is not corrupted and persists in storage.
  • 26. Architectures. Cluster nodes that share a remote file system index: slower than local; possible limitations due to client-side caching (Samba, NFS, AFP) or stale file handles (NFS). Index in database: much slower. Separate write and read indexes (replication): relies on the IndexDeletionPolicy feature of Lucene; out of the box in Solr and ElasticSearch. Autonomous search servers (e.g. Solr, ElasticSearch): loose coupling through JSON or XML.
  • 27. Frameworks – Compass. Document definition via JPA mapping
        <compass-core-mapping package="eu.emea.eudract.model.entity">
          <class name="cta.sectiona.CtaIdentification" alias="cta" root="true" support-unmarshall="false">
            <id name="ctaIdentificationId">
              <meta-data>cta_id</meta-data>
            </id>
            <dynamic-meta-data name="ncaName" converter="jexl" store="yes">data.submissionOrg.name</dynamic-meta-data>
            <property name="fullTitle">
              <meta-data>cta_full_title</meta-data>
            </property>
            <property name="sponsorProtocolVersionDate">
              <meta-data format="yyyy-MM-dd" store="no">cta_sponsor_protocol_version_date</meta-data>
            </property>
            <property name="isResubmission">
              <meta-data converter="shortToYesNoNaConverter" store="no">cta_is_resubmission</meta-data>
            </property>
            <component name="eudractNumber" />
          </class>
          <class name="eudractnumber.EudractNumber" alias="eudract_number" root="false">
            <property name="eudractNumberId">
              <meta-data converter="dashHandlingConverter" store="no">filteredEudractNumberId</meta-data>
              <meta-data>eudract_number</meta-data>
            </property>
            <property name="paediatricClinicalTrial">
              <meta-data converter="shortToYesNoNaConverter" store="no">paediatric_clinical_trial</meta-data>
            </property>
          </class>
          .....
        </compass-core-mapping>
  • 28. Frameworks – Solr. Document definition via DB mapping
        <dataConfig>
          <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
          <document name="products">
            <entity name="item" query="select * from item">
              <field column="ID" name="id" />
              <field column="NAME" name="name" />
              <field column="MANU" name="manu" />
              <field column="WEIGHT" name="weight" />
              <field column="PRICE" name="price" />
              <field column="POPULARITY" name="popularity" />
              <field column="INSTOCK" name="inStock" />
              <field column="INCLUDES" name="includes" />
              <entity name="feature" query="select description from feature where item_id='${item.ID}'">
                <field name="features" column="description" />
              </entity>
              <entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
                <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
                  <field column="description" name="cat" />
                </entity>
              </entity>
            </entity>
          </document>
        </dataConfig>
  • 29. Frameworks – Compass/Lucene Configuration
        <compass name="default">
          <setting name="compass.transaction.managerLookup">org.compass.core.transaction.manager.OC4J</setting>
          <setting name="compass.transaction.factory">org.compass.core.transaction.JTASyncTransactionFactory</setting>
          <setting name="compass.transaction.lockPollInterval">400</setting>
          <setting name="compass.transaction.lockTimeout">90</setting>
          <setting name="compass.engine.connection">file://P:/Tmp/stelinio</setting>
          <!--<setting name="compass.engine.connection"> jdbc://jdbc/EudractV8DataSourceSecure</setting>-->
          <!--<setting name="compass.engine.store.jdbc.connection.provider.class">-->
          <!--org.compass.core.lucene.engine.store.jdbc.JndiDataSourceProvider-->
          <!--</setting>-->
          <!--<setting name="compass.engine.ramBufferSize">512</setting>-->
          <!--<setting name="compass.engine.maxBufferedDocs">-1</setting>-->
          <setting name="compass.converter.dashHandlingConverter.type">eu.emea.eudract.compasssearch.DashHandlingConverter</setting>
          <setting name="compass.converter.shortToYesNoNaConverter.type">eu.emea.eudract.compasssearch.ShortToYesNoNaConverter</setting>
          <setting name="compass.converter.shortToPerDayOrTotalConverter.type">eu.emea.eudract.compasssearch.ShortToPerDayOrTotalConverter</setting>
          <setting name="compass.engine.store.jdbc.dialect">org.apache.lucene.store.jdbc.dialect.OracleDialect</setting>
          <setting name="compass.engine.analyzer.default.type">org.apache.lucene.analysis.standard.StandardAnalyzer</setting>
        </compass>
  • 30. Cool extra features - Spellchecking. You will need a dictionary of valid words; you could use the unique terms in your index. Given the dictionary you could: use a "sounds like" algorithm such as Soundex or Metaphone; or use n-grams. E.g. squirrel as a 3gram is squ, qui, uir, irr, rre, rel. As a 4gram squi, quir, uirr, irre, rrel. Mistakenly searching for squirel would match 5 grams, with 2 shared between the 3grams and 4grams. This would score high! To present or not to present (the suggestion): maybe use the Levenshtein distance. Other ideas: use the rest of the terms in the query to bias; maybe combine distance with frequency of term; check result numbers in initial and corrected searches.
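The two building blocks named on slide 30, n-gram generation and Levenshtein distance, can be sketched in a few lines of plain Java. `SpellcheckSketch` and its method names are hypothetical, and this is not Lucene's spellchecker module, which also weights start and end grams.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
class SpellcheckSketch {
    // Returns every contiguous substring of length n, in order of appearance.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Classic dynamic-programming edit distance (insert, delete, substitute),
    // keeping only two rows of the table at a time.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A typical use would be to collect candidate words through shared n-grams first, then decide whether to present a suggestion by checking that its edit distance to the typed word is small.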
  • 31. Even More features. Sorting: use a field for sorting instead of relevance, e.g. when you use the MatchAllDocsQuery; beware, it uses FieldCache, which resides in RAM! SpanQueries: distance between terms (span); a family of queries like SpanNearQuery, SpanOrQuery and others. Synonyms: injection during indexing or during searching? A MultiPhraseQuery is appropriate at search time; injecting during indexing allows faster searches. Leverage a synonyms knowledge base; a good strategy is to convert it into an index. The key thing to understand is that synonyms must be injected at the same position increments. Spatial Searches: answer the query "Greek Restaurants Near Me". An efficient technique is to use grids: assign non-unique grid numbers to areas (e.g. in a Mercator space) and index documents with a field containing the grid numbers that match their longitude and latitude. MoreLikeThis: one use of term vectors. Function Queries: e.g. add boosts for fields at search time.
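The grid technique for spatial searches on slide 31 can be illustrated with a minimal sketch. It assumes a plain latitude/longitude grid with a fixed cell size in degrees rather than a Mercator projection; `GridSketch`, `gridCell` and the `row_col` identifier format are hypothetical names.

```java
// Hypothetical illustration of grid-based spatial indexing.
class GridSketch {
    // Maps a coordinate to the id of the grid cell that contains it.
    // Shifting by 90 and 180 keeps row and column numbers non-negative.
    static String gridCell(double lat, double lon, double cellSizeDegrees) {
        long row = (long) Math.floor((lat + 90.0) / cellSizeDegrees);
        long col = (long) Math.floor((lon + 180.0) / cellSizeDegrees);
        return row + "_" + col;
    }
}
```

Each document would be indexed with its cell id in a dedicated field; a "near me" search then becomes a term query on the user's cell (and, in practice, on the neighbouring cells as well), optionally followed by an exact distance check.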
  • 32. Some last things to bear in mind. It would be wise to back up your index; you can have hot backups (supported through the SnapshotDeletionPolicy). Performance has some trade-offs: search latency, indexing throughput, near-real-time results, index replication, index optimization. Resource consumption: disk space, file descriptors, memory. 'Luke' is a really handy tool. You can repair a corrupted index (corrupted segments are just lost… D'oh!).
  • 33. Resources. Book: Lucene in Action. Solr: http://lucene.apache.org/solr/ Vector Space Model: http://en.wikipedia.org/wiki/Vector_Space_Model IR Links: http://wiki.apache.org/lucenejava/InformationRetrieval