SlideShare uma empresa Scribd logo
1 de 50
Baixar para ler offline
Full Text Search
David LeBer
Align Software Inc.
What is full text search?
How?

•   Wild card database queries

•   Database implementations

•   Third party search engines

•   Text indexing libraries
Wild Card Queries

SELECT FROM 'SOME_TABLE' WHERE 'SOME_COLUMN' LIKE '%Some String%'
Wild Card Queries



•   Easy
Wild Card Queries


•   Slow

•   Hard to optimize

•   Difficult to rank
Database Implementations


•   MySQL FULLTEXT index and MATCH queries

•   PostgreSQL tsvector & tsquery
Database Implementations



•   Fairly Easy
Database Implementations

•   Database specific SQL

•   May include additional limitations
    (i.e: MySQL - MyISAM tables only)

•   Functionality define by the DB engine
Third Party Search Engines



•   Google indexing / searching of your content
Third Party Search Engines


•   Easy

•   Matches user expectations
Third Party Search Engines


•   Content must be available for indexing

•   Loss of control

•   Enhances the Google hegemony
Text Indexing Library



•   Lucene
Text Indexing Library

•   Complete control

•   Database independent

•   Flexible search behaviour

•   Ranked results
Text Indexing Library


•   Adds complexity

•   Additional query language

•   Parallel index
Lucene Overview

•   Open Source - part of the Apache Project

•   Very flexible

•   Wickedly fast

•   Index based
Lucene : Installing


•   Add the Lucene jars to your classpath

•   Use ERIndexing
Lucene : Tasks


•   Indexing

•   Searching
Indexing
What is Indexing?
Indexing : Steps


•   Conversion (to plain text)

•   Analysis (clean and convert the text to tokens)

•   Index (save the tokens to the index)
Indexing : Parts


•   Index - either file or memory based

•   Document - represents a unique object added to the index

•   Field - identifies a chunk of data in the document
Indexing : Classes

•   IndexWriter

•   Directory

•   Analyzer

•   Document

•   Field
Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
IndexWriter indexWriter = new IndexWriter(index, analyzer, true,
                                IndexWriter.MaxFieldLength.UNLIMITED);
Indexing : Field Parameters


•   Stored or not

•   Analyzed or not, with and without norms

•   Include position, offset, and term frequency
Indexing : Analyzers

•   SimpleAnalyzer

•   StopAnalyzer

•   StandardAnalyzer

•   ...
Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
                            Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);
Indexing : Fun with indexes



•   Multiple Access
Searching
What is Searching
Searching : Steps

•   Clean the user input

•   Create a Query

•   Query the Index

•   Return the results
Searching : Search Classes
•   IndexReader

•   IndexSearcher

•   Query

•   QueryParser

•   TopDocs/ScoreDocs

•   Document
Searching : QueryTypes
•   TermQuery

•   RangeQuery

•   PrefixQuery

•   BooleanQuery

•   PhraseQuery

•   WildCardQuery

•   FuzzyQuery
Searching : QueryParser
•   'webobjects' - contains an exact match - TermQuery

•   'webobjects apple', 'webobjects OR apple' - an OR Query

•   +webobjects +apple / webobjects AND apple - an AND Query

•   title:webobjects - Contains the term in title field

•   title:webobjects -subject:iTunes / title:webobjects AND NOT
    subject:iTunes

•   (webobjects OR apple) AND iTunes
Searching : QueryParser

•   title:"apple webobjects" - Phrase Query

•   title:"apple webobjects"~5 - slop of 5

•   webobj* - Prefix Query

•   webobjicts~ - Fuzzy Query

•   lastmodified:[1/1/10 TO 1/1/11] - Range Query
Performing a Search

Query q = ... // assume exists
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_2.9,
                                          "content", analyzer);
Query query = queryParser.parse(queryString);
Demo
Scoring
“The more times a query term appears in a
document relative to the number of times the term
 appears in all the documents in the collection, the
   more relevant that document is to the query”
Boost

•   While Indexing

    •   Document

    •   Field

•   While Searching

    •   Query
Luke
Demo
ERIndexing
ERIndexing : Strengths

•   Hides some of the complexity of integrating Lucene with WO

•   Offers lots of utility and helper methods

•   Speaks WebObjects collection classes

•   Simplifies index creation
ERIndexing : Weaknesses


•   Hides some of the complexity of integrating Lucene with WO

•   Not fully baked

•   Auto indexing may be dangerous
Demo
Beyond Lucene


•   Solr

•   Compass

•   ElasticSearch
Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/

Mais conteúdo relacionado

Mais procurados

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingabial
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsOpenSource Connections
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!Alex Kursov
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Netgramana
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
Full text search
Full text searchFull text search
Full text searchdeleteman
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tikaJukka Zitting
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?gagravarr
 

Mais procurados (19)

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Apache Lucene Basics
Apache Lucene BasicsApache Lucene Basics
Apache Lucene Basics
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Azure search
Azure searchAzure search
Azure search
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Full text search
Full text searchFull text search
Full text search
 
Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
What's new with Apache Tika?
What's new with Apache Tika?What's new with Apache Tika?
What's new with Apache Tika?
 

Semelhante a Full Text Search with Lucene

Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Ferret A Ruby Search Engine
Ferret A Ruby Search EngineFerret A Ruby Search Engine
Ferret A Ruby Search Engineelliando dias
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBAndrew Siemer
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced editionAlexander Tokarev
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHPMike Lively
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 

Semelhante a Full Text Search with Lucene (20)

Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Illuminating Lucene.Net
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Ferret A Ruby Search Engine
Ferret A Ruby Search EngineFerret A Ruby Search Engine
Ferret A Ruby Search Engine
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDB
 
Tagging search solution design Advanced edition
Tagging search solution design Advanced editionTagging search solution design Advanced edition
Tagging search solution design Advanced edition
 
Using Sphinx for Search in PHP
Using Sphinx for Search in PHPUsing Sphinx for Search in PHP
Using Sphinx for Search in PHP
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 

Mais de WO Community

In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engineWO Community
 
Using Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsUsing Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsWO Community
 
Build and deployment
Build and deploymentBuild and deployment
Build and deploymentWO Community
 
Reenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSReenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSWO Community
 
Chaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldChaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldWO Community
 
D2W Stateful Controllers
D2W Stateful ControllersD2W Stateful Controllers
D2W Stateful ControllersWO Community
 
Deploying WO on Windows
Deploying WO on WindowsDeploying WO on Windows
Deploying WO on WindowsWO Community
 
Unit Testing with WOUnit
Unit Testing with WOUnitUnit Testing with WOUnit
Unit Testing with WOUnitWO Community
 
Apache Cayenne for WO Devs
Apache Cayenne for WO DevsApache Cayenne for WO Devs
Apache Cayenne for WO DevsWO Community
 
Advanced Apache Cayenne
Advanced Apache CayenneAdvanced Apache Cayenne
Advanced Apache CayenneWO Community
 
Migrating existing Projects to Wonder
Migrating existing Projects to WonderMigrating existing Projects to Wonder
Migrating existing Projects to WonderWO Community
 
iOS for ERREST - alternative version
iOS for ERREST - alternative versioniOS for ERREST - alternative version
iOS for ERREST - alternative versionWO Community
 
"Framework Principal" pattern
"Framework Principal" pattern"Framework Principal" pattern
"Framework Principal" patternWO Community
 
Filtering data with D2W
Filtering data with D2W Filtering data with D2W
Filtering data with D2W WO Community
 
Localizing your apps for multibyte languages
Localizing your apps for multibyte languagesLocalizing your apps for multibyte languages
Localizing your apps for multibyte languagesWO Community
 

Mais de WO Community (20)

KAAccessControl
KAAccessControlKAAccessControl
KAAccessControl
 
In memory OLAP engine
In memory OLAP engineIn memory OLAP engine
In memory OLAP engine
 
Using Nagios to monitor your WO systems
Using Nagios to monitor your WO systemsUsing Nagios to monitor your WO systems
Using Nagios to monitor your WO systems
 
Build and deployment
Build and deploymentBuild and deployment
Build and deployment
 
High availability
High availabilityHigh availability
High availability
 
Reenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWSReenabling SOAP using ERJaxWS
Reenabling SOAP using ERJaxWS
 
Chaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real WorldChaining the Beast - Testing Wonder Applications in the Real World
Chaining the Beast - Testing Wonder Applications in the Real World
 
D2W Stateful Controllers
D2W Stateful ControllersD2W Stateful Controllers
D2W Stateful Controllers
 
Deploying WO on Windows
Deploying WO on WindowsDeploying WO on Windows
Deploying WO on Windows
 
Unit Testing with WOUnit
Unit Testing with WOUnitUnit Testing with WOUnit
Unit Testing with WOUnit
 
Life outside WO
Life outside WOLife outside WO
Life outside WO
 
Apache Cayenne for WO Devs
Apache Cayenne for WO DevsApache Cayenne for WO Devs
Apache Cayenne for WO Devs
 
Advanced Apache Cayenne
Advanced Apache CayenneAdvanced Apache Cayenne
Advanced Apache Cayenne
 
Migrating existing Projects to Wonder
Migrating existing Projects to WonderMigrating existing Projects to Wonder
Migrating existing Projects to Wonder
 
iOS for ERREST - alternative version
iOS for ERREST - alternative versioniOS for ERREST - alternative version
iOS for ERREST - alternative version
 
iOS for ERREST
iOS for ERRESTiOS for ERREST
iOS for ERREST
 
"Framework Principal" pattern
"Framework Principal" pattern"Framework Principal" pattern
"Framework Principal" pattern
 
Filtering data with D2W
Filtering data with D2W Filtering data with D2W
Filtering data with D2W
 
WOver
WOverWOver
WOver
 
Localizing your apps for multibyte languages
Localizing your apps for multibyte languagesLocalizing your apps for multibyte languages
Localizing your apps for multibyte languages
 

Último

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Último (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Full Text Search with Lucene

  • 1. Full Text Search David LeBer Align Software Inc.
  • 2. What is full text search?
  • 3.
  • 4. How? • Wild card database queries • Database implementations • Third party search engines • Text indexing libraries
  • 5. Wild Card Queries SELECT FROM 'SOME_TABLE' WHERE 'SOME_COLUMN' LIKE '%Some String%'
  • 7. Wild Card Queries • Slow • Hard to optimize • Difficult to rank
  • 8. Database Implementations • MySQL FULLTEXT index and MATCH queries • PostgreSQL tsvector & tsquery
  • 10. Database Implementations • Database specific SQL • May include additional limitations (i.e: MySQL - MyISAM tables only) • Functionality define by the DB engine
  • 11. Third Party Search Engines • Google indexing / searching of your content
  • 12. Third Party Search Engines • Easy • Matches user expectations
  • 13. Third Party Search Engines • Content must be available for indexing • Loss of control • Enhances the Google hegemony
  • 15. Text Indexing Library • Complete control • Database independent • Flexible search behaviour • Ranked results
  • 16. Text Indexing Library • Adds complexity • Additional query language • Parallel index
  • 17. Lucene Overview • Open Source - part of the Apache Project • Very flexible • Wickedly fast • Index based
  • 18. Lucene : Installing • Add the Lucene jars to your classpath • Use ERIndexing
  • 19. Lucene : Tasks • Indexing • Searching
  • 22. Indexing : Steps • Conversion (to plain text) • Analysis (clean and convert the text to tokens) • Index (save the tokens to the index)
  • 23. Indexing : Parts • Index - either file or memory based • Document - represents a unique object added to the index • Field - identifies a chunk of data in the document
  • 24. Indexing : Classes • IndexWriter • Directory • Analyzer • Document • Field
  • 25. Creating an Index URL indexDirectoryURL = ... // assume exists File indexFile = new File(indexDirectoryURL.getPath()); FSDirectory indexDirectory = FSDirectory.open(indexFile); StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); IndexWriter indexWriter = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
  • 26. Indexing : Field Parameters • Stored or not • Analyzed or not, with and without norms • Include position, offset, and term frequency
  • 27. Indexing : Analyzers • SimpleAnalyzer • StopAnalyzer • StandardAnalyzer • ...
  • 28. Adding a Document String value = ... // assume exists Document doc = new Document(); Field docField = new Field("title", value, Field.Store.YES, Field.Index.ANALYZED); doc.add(docField); ... indexWriter.addDocument(doc);
  • 29. Indexing : Fun with indexes • Multiple Access
  • 32. Searching : Steps • Clean the user input • Create a Query • Query the Index • Return the results
  • 33. Searching : Search Classes • IndexReader • IndexSearcher • Query • QueryParser • TopDocs/ScoreDocs • Document
  • 34. Searching : QueryTypes • TermQuery • RangeQuery • PrefixQuery • BooleanQuery • PhraseQuery • WildCardQuery • FuzzyQuery
  • 35. Searching : QueryParser • 'webobjects' - contains an exact match - TermQuery • 'webobjects apple', 'webobjects OR apple' - an OR Query • +webobjects +apple / webobjects AND apple - an AND Query • title:webobjects - Contains the term in title field • title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes • (webobjects OR apple) AND iTunes
  • 36. Searching : QueryParser • title:"apple webobjects" - Phrase Query • title:"apple webobjects"~5 - slop of 5 • webobj* - Prefix Query • webobjicts~ - Fuzzy Query • lastmodified:[1/1/10 TO 1/1/11] - Range Query
  • 37. Performing a Search Query q = ... // assume exists IndexSearcher searcher = new IndexSearcher(index, true); TopScoreDocCollector collector = TopScoreDocCollector.create(10, true); searcher.search(query, collector); ScoreDoc[] hits = collector.topDocs().scoreDocs;
  • 38. Using a QueryParser QueryParser queryParser = new QueryParser(Version.LUCENE_2.9, "content", analyzer); Query query = queryParser.parse(queryString);
  • 39. Demo
  • 41. “The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
  • 42. Boost • While Indexing • Document • Field • While Searching • Query
  • 43. Luke
  • 44. Demo
  • 46. ERIndexing : Strengths • Hides some of the complexity of integrating Lucene with WO • Offers lots of utility and helper methods • Speaks WebObjects collection classes • Simplifies index creation
  • 47. ERIndexing : Weaknesses • Hides some of the complexity of integrating Lucene with WO • Not fully baked • Auto indexing may be dangerous
  • 48. Demo
  • 49. Beyond Lucene • Solr • Compass • ElasticSearch
  • 50. Q&A Lucene: http://lucene.apache.org Luke: http://code.google.com/p/luke/ Solr: http://lucene.apache.org/solr/ Compass: http://www.compass-project.org/overview.html ElasticSearch: http://www.elasticsearch.com/