Full Text Search with Lucene

Full Text Search
David LeBer
Align Software Inc.

How?

• Wild card database queries

• Database implementations

• Third party search engines

• Text indexing libraries

Wild Card Queries

SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'

• Easy


• Slow

• Hard to optimize

• Difficult to rank
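
To make the drawbacks above concrete, here is a minimal sketch in plain Java of what a leading-wildcard LIKE amounts to when no index can help: every row is inspected, and each row either matches or does not. The class and method names are hypothetical, and the case-insensitive comparison is an assumption (real behaviour depends on the database collation).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of an unindexed LIKE '%term%' scan.
final class WildcardScan {
    // Inspects every row, so cost grows with the total amount of stored text.
    static List<Integer> matchingRows(List<String> rows, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i); // a hit is only yes/no: there is no score to rank by
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("WebObjects and Lucene", "Apple iTunes", "Searching with Lucene");
        System.out.println(matchingRows(rows, "lucene")); // prints [0, 2]
    }
}
```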

Database Implementations

• MySQL FULLTEXT index and MATCH queries

• PostgreSQL tsvector & tsquery


• Fairly Easy


• Database specific SQL

• May include additional limitations
(e.g. MySQL: MyISAM tables only)

• Functionality defined by the DB engine

Third Party Search Engines

• Google indexing / searching of your content


• Easy

• Matches user expectations


• Content must be available for indexing

• Loss of control

• Enhances the Google hegemony

Text Indexing Library

• Lucene


• Complete control

• Database independent

• Flexible search behaviour

• Ranked results


• Adds complexity

• Additional query language

• Parallel index

Lucene Overview

• Open Source - part of the Apache Project

• Very flexible

• Wickedly fast

• Index based

Lucene : Installing

• Add the Lucene jars to your classpath

• Use ERIndexing

Lucene : Tasks

• Indexing

• Searching

Indexing : Steps

• Conversion (to plain text)

• Analysis (clean and convert the text to tokens)

• Index (save the tokens to the index)
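
The three steps can be sketched with a toy inverted index in plain Java. This is not how Lucene is implemented; it only shows the idea of mapping tokens back to the documents that contain them. It assumes conversion has already produced plain text, and every name in it (TinyInvertedIndex, analyze, add, lookup) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: token -> ids of the documents containing it.
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private int nextId = 0;

    // Analysis: lower-case, then split on anything that is not a letter or digit.
    static List<String> analyze(String plainText) {
        List<String> tokens = new ArrayList<>();
        for (String t : plainText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index: record each token against the id of the new document.
    int add(String plainText) {
        int id = nextId++;
        for (String token : analyze(plainText)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Lookup is a single map access instead of a scan over all documents.
    Set<Integer> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Full Text Search with Lucene");
        index.add("Lucene and WebObjects");
        System.out.println(index.lookup("lucene"));     // prints [0, 1]
        System.out.println(index.lookup("webobjects")); // prints [1]
    }
}
```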

Indexing : Parts

• Index - either file or memory based

• Document - represents a unique object added to the index

• Field - identifies a chunk of data in the document

Indexing : Classes

• IndexWriter

• Directory

• Analyzer

• Document

• Field

Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
// Open a file-system based index at that location
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// 'true' creates a new index, overwriting any existing one
IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);

Indexing : Field Parameters

• Stored or not

• Analyzed or not, with and without norms

• Include position, offset, and term frequency

Indexing : Analyzers

• SimpleAnalyzer

• StopAnalyzer

• StandardAnalyzer

• ...
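
The difference between the first two analyzers can be approximated in plain Java. SimpleAnalyzer splits text at non-letters and lower-cases it; StopAnalyzer does the same and then drops common words. The sketch below only imitates that behaviour: the class name is hypothetical and the stop list is an illustrative assumption, not Lucene's default set. StandardAnalyzer uses a more elaborate grammar-based tokenizer and is not imitated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of two analyzers; not Lucene code.
final class AnalyzerSketch {
    // Illustrative stop list only (an assumption, not Lucene's default).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "with");

    // Roughly SimpleAnalyzer: split at non-letters and lower-case.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Roughly StopAnalyzer: the same tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Search-Engine of WebObjects 5";
        System.out.println(simple(text)); // prints [the, search, engine, of, webobjects]
        System.out.println(stop(text));   // prints [search, engine, webobjects]
    }
}
```

The same analyzer should normally be used for indexing and for parsing queries, otherwise the tokens in the query may not line up with the tokens in the index.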

Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);

Indexing : Fun with indexes

• Multiple Access

Searching : Steps

• Clean the user input

• Create a Query

• Query the Index

• Return the results
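
One common form of "cleaning" is escaping characters that the query syntax treats as operators, so that raw user input cannot produce a parse error or an unintended query. Lucene provides QueryParser.escape(String) for this; the plain-Java stand-in below only shows the idea. The class name is hypothetical and trimming the input is an assumption.

```java
// Hypothetical stand-in for query-syntax escaping; in real code prefer QueryParser.escape.
final class QueryInput {
    // Characters with special meaning in the Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Trims the input and puts a backslash in front of every special character.
    static String escape(String userInput) {
        StringBuilder out = new StringBuilder();
        for (char c : userInput.trim().toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(" title:(web*) ")); // prints title\:\(web\*\)
    }
}
```

Escaping everything also disables the field, wildcard and boolean syntax for the user, so whether to escape or to pass the input through is a design choice for each search box.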

Searching : Search Classes

• IndexReader

• IndexSearcher

• Query

• QueryParser

• TopDocs / ScoreDoc

• Document

Searching : Query Types

• TermQuery

• RangeQuery

• PrefixQuery

• BooleanQuery

• PhraseQuery

• WildcardQuery

• FuzzyQuery

Searching : QueryParser

• 'webobjects' - contains an exact match - TermQuery

• 'webobjects apple', 'webobjects OR apple' - an OR Query

• +webobjects +apple / webobjects AND apple - an AND Query

• title:webobjects - Contains the term in title field

• title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes

• (webobjects OR apple) AND iTunes

Searching : QueryParser

• title:"apple webobjects" - Phrase Query

• title:"apple webobjects"~5 - slop of 5

• webobj* - Prefix Query

• webobjicts~ - Fuzzy Query

• lastmodified:[1/1/10 TO 1/1/11] - Range Query
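
The fuzzy example works because Lucene's FuzzyQuery compares terms by Levenshtein (edit) distance. The plain-Java sketch below computes that distance for the misspelling used above; it illustrates the measure only and is not Lucene's implementation. The class name is hypothetical.

```java
// Levenshtein distance: the number of single-character inserts, deletes or
// substitutions needed to turn one string into the other.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("webobjicts", "webobjects")); // prints 1
    }
}
```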

Performing a Search

Query query = ... // assume exists
// 'true' opens the index read-only
IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
// Collect the 10 highest-scoring hits
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_29,
    "content", analyzer);
Query query = queryParser.parse(queryString);

“The more times a query term appears in a
document relative to the number of times the term
appears in all the documents in the collection, the
more relevant that document is to the query”
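
The idea in this quote is usually called TF-IDF. The sketch below is a simplified textbook form in plain Java, not Lucene's actual scoring formula, which includes further factors such as length normalisation and boosts. It measures rarity by the number of documents containing the term, which is a common stand-in for the collection-wide count the quote mentions. All names are hypothetical.

```java
import java.util.List;

// Simplified TF-IDF: score = (occurrences in this document) * (rarity in the collection).
final class TfIdfSketch {
    // Term frequency: how often the term occurs in one document.
    static long tf(List<String> docTokens, String term) {
        return docTokens.stream().filter(term::equals).count();
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double idf(List<List<String>> collection, String term) {
        long docsWithTerm = collection.stream().filter(d -> d.contains(term)).count();
        return 1.0 + Math.log((double) collection.size() / (1 + docsWithTerm));
    }

    static double score(List<String> doc, List<List<String>> collection, String term) {
        return tf(doc, term) * idf(collection, term);
    }

    public static void main(String[] args) {
        List<String> d0 = List.of("lucene", "lucene", "search");
        List<String> d1 = List.of("lucene", "index");
        List<String> d2 = List.of("apple", "search");
        List<List<String>> all = List.of(d0, d1, d2);
        // d0 mentions "lucene" twice, so it outranks d1 for that term.
        System.out.println(score(d0, all, "lucene") > score(d1, all, "lucene")); // prints true
    }
}
```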

Boost

• While Indexing

• Document

• Field

• While Searching

• Query

ERIndexing : Strengths

• Hides some of the complexity of integrating Lucene with WO

• Offers lots of utility and helper methods

• Speaks WebObjects collection classes

• Simplifies index creation

ERIndexing : Weaknesses

• Hides some of the complexity of integrating Lucene with WO

• Not fully baked

• Auto indexing may be dangerous

Beyond Lucene

• Solr

• Compass

• ElasticSearch

Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/
