Illuminating Lucene.Net:
Bringing Full-Text Search to Light
W. Dean Thrasher
14 May 2013
Agenda
• About the presenter
• About Lucene.Net
– What it is
– What it does
– How it works
– Who uses it
– Why you should care
More Agenda
• Core concepts
– Lucene structure
– Luke
– Terminology
• Code examples
• Things to know
• Recap
• References
W. Dean Thrasher
Dean.thrasher@infovark.com
www.infovark.com
www.linkedin.com/in/deanthrasher
@DThrasher
@infovark
BACKGROUND
What is Lucene.Net?
Lucene.Net is a port of the Lucene search engine
library, written in C# and targeted at .NET
runtime users.
What is Lucene?
Apache Lucene is a high-performance, full-featured
text search engine library written entirely in Java.
Apache Lucene is an open source project
available for free download.
History
1997 – Doug Cutting begins the Lucene project
2000 – First open source release
2002 – First Apache Jakarta release
2005 – Lucene becomes a top-level project
2006 – Lucene.Net gets Apache incubation status
2010 – Lucene.Net orphaned by original committers
2011 – Lucene.Net reaccepted into Apache Incubator
2012 – Lucene.Net graduates from the Incubator
Why you should care
You want to provide
customers with a
“Google-like” search
experience
You want to tune
incoming queries or
results ranking
You want better
performance than SQL
“like” searches
You want to avoid
deploying a separate
search tool with your
website or application
What does it do?
• Allows you to index and search vast amounts
of text quickly
• Provides a powerful query syntax
• Integrates into applications easily
How it works
• Lucene uses an inverted index
– Maps terms to the documents that contain them
• Lucene manages its index
– Stores the index in memory or on disk
– Allows documents to be added or removed
• Makes an index for each document
• Merges the index with a set of other indices
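The inverted index idea can be sketched in a few lines of Python. This is only an illustration of the data structure, not Lucene's implementation: the function name, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "frankly my dear",
    2: "my dear Scarlett",
    3: "gone with the wind",
}
index = build_inverted_index(docs)
print(index["dear"])  # [1, 2]
print(index["wind"])  # [3]
```

Looking up a term is a single dictionary access, which is why this layout beats scanning every document for a match.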
Who uses Lucene.Net?
• Stack Overflow
• RavenDB
• Sitecore
• Orchard
• MindTouch
• Umbraco
• Sitefinity
• SubText
CONCEPTS
Differences between Java and .Net
The Lucene.Net API:
• Lags a few steps behind the Java version of
Lucene
• Takes advantage of advanced .Net features not
found in Java
But it:
• Preserves the core Lucene concepts
• Maintains indexes that are compatible with the
Java version
Logical Index Storage
• Field – a name/value pair
• Document – a sequence of fields
• Index – a collection of documents
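The three logical levels above can be modeled roughly as follows. The class names mirror the slide's terms, but the code is a hypothetical sketch and not the Lucene.Net types; the `get` and `add` helpers and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """A name/value pair."""
    name: str
    value: str


@dataclass
class Document:
    """A sequence of fields."""
    fields: List[Field] = field(default_factory=list)

    def get(self, name: str) -> Optional[str]:
        """Return the first value stored under `name`, or None if absent."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None


@dataclass
class Index:
    """A collection of documents; the list position acts as the document id."""
    documents: List[Document] = field(default_factory=list)

    def add(self, doc: Document) -> int:
        self.documents.append(doc)
        return len(self.documents) - 1


doc = Document([Field("Id", "42"), Field("Path", "books/gwtw.txt")])
index = Index()
doc_id = index.add(doc)
```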
Physical Index Storage
• Lucene generates a
series of files within a
single directory
• Moving an index is a
copy-and-paste
operation
• You can compress or zip
an index to archive it
Luke
• Lucene Index Toolbox
• Built in Java, but can
read Lucene.Net
indexes
• http://code.google.com/p/luke/
Analyzers and Tokens
• Analyzers take strings of text and break them
into tokens
• Tokens are chunks of text and associated
metadata
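As a rough illustration of "chunks of text and associated metadata", the sketch below splits a string into tokens and records each token's position and character offsets. The `Token` tuple and the `analyze` function are hypothetical names, and splitting on non-word characters is an assumed rule, not the behavior of any particular Lucene analyzer.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    position: int  # ordinal of the token within the text
    start: int     # character offset where the token begins
    end: int       # character offset just past the token


def analyze(text):
    """Split on non-word characters and lowercase each token."""
    return [
        Token(m.group().lower(), pos, m.start(), m.end())
        for pos, m in enumerate(re.finditer(r"\w+", text))
    ]


tokens = analyze("Frankly, my dear")
for token in tokens:
    print(token)
```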
Terms, Queries and Hits
• Terms – the basic unit for searching. A field
name and a value to seek.
• Queries – combine terms to form search
criteria
• Hits – a ranked list of pointers to documents
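A small sketch of how the three ideas fit together: a term is a (field, value) pair, a query here is simply a list of terms that must all match, and hits are document ids ranked by score. The scoring rule (sum of raw term counts) is an assumption chosen for brevity; Lucene's real relevance scoring is more involved.

```python
from collections import Counter

# Hypothetical documents: each maps field names to already-lowercased text.
docs = {
    0: {"title": "gone with the wind", "body": "frankly my dear my dear"},
    1: {"title": "dear diary", "body": "my dear friend"},
    2: {"title": "war and peace", "body": "well prince"},
}


def term_frequency(doc, term):
    """Count occurrences of a (field, value) term in one document."""
    field_name, value = term
    return Counter(doc.get(field_name, "").split())[value]


def search(terms):
    """Return (doc_id, score) hits for documents containing every term, best first."""
    hits = []
    for doc_id, doc in docs.items():
        counts = [term_frequency(doc, t) for t in terms]
        if all(counts):
            hits.append((doc_id, sum(counts)))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


hits = search([("body", "dear"), ("body", "my")])
print(hits)  # [(0, 4), (1, 2)]
```

Note that hits carry only ids and scores; the caller fetches the documents themselves afterwards, which matches the "pointers to documents" wording on the slide.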
CODE EXAMPLES
Create documents demo
• IndexWriter
• Directory
• Analyzer
• Document
• Field
Read documents demo
• IndexReader
• Term
• Query
• Hits
Update documents demo
• IndexWriter
• Document
• Term
Delete documents demo
• IndexWriter
• Query
• Term
Search demo
• IndexSearcher
• QueryParser
• Query
• Term
• TopDocs
• ScoreDoc
THINGS TO KNOW
Transactional Lucene
• Lucene supports ACID commits to its indexes
• Lucene uses the Commit and Rollback syntax,
much like relational databases.
• Source:
http://blog.mikemccandless.com/2012/03/transactional-lucene.html
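The commit and rollback semantics can be pictured with a toy writer that buffers changes until they are committed. This is a sketch of the behavior only; the `TransactionalIndex` class and its methods are hypothetical and say nothing about how Lucene implements durability or isolation.

```python
class TransactionalIndex:
    """Toy writer: changes stay pending until commit() makes them visible."""

    def __init__(self):
        self._committed = {}
        self._pending = {}

    def add_document(self, doc_id, text):
        self._pending[doc_id] = text

    def commit(self):
        """Publish all pending documents to readers."""
        self._committed.update(self._pending)
        self._pending.clear()

    def rollback(self):
        """Discard everything added since the last commit."""
        self._pending.clear()

    def search(self, word):
        """Readers only ever see committed documents."""
        return sorted(d for d, t in self._committed.items() if word in t.split())


index = TransactionalIndex()
index.add_document(1, "frankly my dear")
index.commit()
index.add_document(2, "my dear scarlett")
index.rollback()
print(index.search("dear"))  # [1]
```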
Lucene index types
FSDirectory
• Stores indexed documents
on disk
• Persists data across sessions
• Best choice for most
applications
Your first choice
RAMDirectory
• Stores indexed documents
in memory
• Entire index must fit into
available memory
• Does not persist data
• Faster than FSDirectory
Useful for unit testing
Precalculation
• How you store things in Lucene matters –
choose field options and analyzers carefully
• The way you retrieve information determines
how it should be stored
• Smaller indexes give you better performance
Field.Store
Yes – stores the text in its original form
No – the original text is not preserved
Field.Index
• No – the field is not indexed, so it is not
searchable
• Not analyzed – the text is treated as a single
unit and indexed whole
• Analyzed – the text is broken down into
tokens and indexed
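The effect of the store and index options can be sketched with two dictionaries, one holding original text and one holding postings. The `add_field` function, its string option values, and the sample data are assumptions for illustration; they are not the Lucene.Net `Field` constructor.

```python
from collections import defaultdict

postings = defaultdict(set)  # (field name, term) -> ids of matching documents
stored = defaultdict(dict)   # document id -> {field name: original text}


def add_field(doc_id, name, value, store, index):
    """Record one field; `index` is "no", "not_analyzed", or "analyzed"."""
    if store:
        stored[doc_id][name] = value           # original text is retrievable
    if index == "not_analyzed":
        postings[(name, value)].add(doc_id)    # whole value is a single term
    elif index == "analyzed":
        for token in value.lower().split():    # each token becomes a term
            postings[(name, token)].add(doc_id)


add_field(1, "Id", "A-42", store=True, index="not_analyzed")
add_field(1, "Content", "Frankly my dear", store=False, index="analyzed")

print(postings.get(("Content", "dear")))  # {1}
print(stored[1])                          # {'Id': 'A-42'}
```

The content field is searchable by word but its text cannot be read back, while the id can be read back and matched only as an exact value.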
Field.TermVector
• No – Does not store term vectors
• Yes – Stores the term vectors of each
document (terms and number of occurrences)
• With Positions Offsets – Term vector, token
position and offset information
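A term vector with positions and offsets might look like the structure below. The `term_vector` function and its dictionary layout are hypothetical, and the tokenization rule is an assumption; the point is only to show what extra information each option keeps.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Return {term: {"freq", "positions", "offsets"}} for one field's text."""
    vector = defaultdict(lambda: {"freq": 0, "positions": [], "offsets": []})
    for pos, m in enumerate(re.finditer(r"\w+", text)):
        entry = vector[m.group().lower()]
        entry["freq"] += 1
        entry["positions"].append(pos)                 # token ordinal
        entry["offsets"].append((m.start(), m.end()))  # character span
    return dict(vector)


vector = term_vector("my dear, my dear")
print(vector["dear"])
```

Positions and offsets are what features such as hit highlighting need, but they also make the index larger, so they are worth enabling only on fields that use them.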
Field types indexing options
Field     Stored  Analyzed      Vectored
Id        Yes     Not analyzed  No
Modified  Yes     Not analyzed  No
Path      Yes     Analyzed      No
Content   No      Analyzed      With Positions Offsets
An example of storing fields related to files on your computer.
Analyzers
• Break apart text into tokens; each token gets
indexed separately
• Remove stop words
• Decide how to handle punctuation
• Handle languages and case sensitivity
• You can create your own by building from
scratch or chaining existing analyzers
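Chaining can be pictured as a tokenizer followed by a series of token filters. The sketch below is plain Python with hypothetical function names and an illustrative stop-word list; it is not the Lucene.Net analyzer API.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "with"}  # illustrative list only


def tokenize(text):
    """Keep runs of letters, digits, and apostrophes; drop other punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def make_analyzer(tokenizer, *filters):
    """Chain a tokenizer with any number of token filters, applied in order."""
    def analyzer(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyzer


analyze = make_analyzer(tokenize, lowercase, remove_stop_words)
print(analyze("Gone with the Wind"))  # ['gone', 'wind']
```

Filter order matters: the stop-word filter sits after the lowercase filter so that "The" and "the" are both removed.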
Types of Queries
• TermQuery
• PhraseQuery
• RangeQuery
• PrefixQuery, WildcardQuery
• FuzzyQuery
• Use BooleanQuery to combine them
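Combining queries comes down to set operations on the document ids each clause matches. The sketch below assumes a hypothetical `postings` table and a simplified `boolean_query` helper: required clauses are intersected, optional clauses are unioned only when nothing is required, and prohibited clauses are subtracted. Real Lucene also uses optional clauses to adjust scores, which this example ignores.

```python
# Hypothetical postings: term -> ids of the documents that contain it.
postings = {
    "scarlett": {1, 2, 5},
    "dear": {2, 3, 5},
    "voldemort": {5, 9},
}


def boolean_query(must=(), should=(), must_not=()):
    """Resolve a boolean combination of single-term clauses to sorted doc ids."""
    if must:
        result = set.intersection(*(postings.get(t, set()) for t in must))
    else:
        result = set().union(*(postings.get(t, set()) for t in should))
    for term in must_not:
        result -= postings.get(term, set())
    return sorted(result)


print(boolean_query(must=["scarlett", "dear"], must_not=["voldemort"]))  # [2]
print(boolean_query(should=["scarlett", "voldemort"]))  # [1, 2, 5, 9]
```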
Query syntax
Query Type     Purpose                                          Sample
TermQuery      Single word query                                scarlett
PhraseQuery    Matches terms in order                           "frankly my dear"
RangeQuery     Matches documents between the terms              [1861 TO 1865]
                                                                {1861 TO 1865}
WildcardQuery  Lightweight regex-like term matching             Atl*
                                                                D?m?
PrefixQuery    Matches terms that begin with the string         War*
FuzzyQuery     Closeness matching                               cry~
BooleanQuery   Combines other queries into complex expressions  Scarlett AND "frankly my dear" -voldemort
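To make the prefix, wildcard, and fuzzy rows concrete, the sketch below matches patterns against a small list of indexed terms. Everything here is an assumption for illustration: the term list is invented, the terms are lowercase on the premise that an analyzer lowercased them at index time, and fuzzy matching is approximated with a plain Levenshtein distance rather than Lucene's own similarity measure.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms, already lowercased by an analyzer.
terms = ["atlanta", "atlas", "cry", "dam", "dim", "dime", "try", "war", "warm"]


def prefix_match(prefix):
    return [t for t in terms if t.startswith(prefix)]


def wildcard_match(pattern):
    """`*` matches any run of characters, `?` matches exactly one."""
    return [t for t in terms if fnmatchcase(t, pattern)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_match(word, max_edits=1):
    return [t for t in terms if edit_distance(t, word) <= max_edits]


print(prefix_match("war"))     # ['war', 'warm']
print(wildcard_match("atl*"))  # ['atlanta', 'atlas']
print(wildcard_match("d?m?"))  # ['dime']
print(fuzzy_match("cry"))      # ['cry', 'try']
```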
Query, Filter, and Sort
• Lucene.Net can handle all three
• Default sort is by relevance
• Prefer queries to filters – they perform better
Using Dispose()
Linq Providers
• LINQ to Lucene
• http://linqtolucene.codeplex.com/
• Lucene.Net.Linq
• https://github.com/themotleyfool/Lucene.Net.Linq
• Chris Eldredge
• MotleyFool
Recap
• Why would I use a search engine?
• Why would I use Lucene.Net?
• How would I add Lucene.Net to my project?
– Web
– Desktop
• Where could I go to learn more?
• When can I buy Dean a beer?
REFERENCES
Web References
• Lucene.Net – http://lucenenet.apache.org
• Solr – http://lucene.apache.org/solr
• Wikipedia
– http://en.wikipedia.org/wiki/Lucene
– http://en.wikipedia.org/wiki/Search_engine_indexing
• Academic discussions
– http://lucene.sourceforge.net/talks/pisa/
– http://lucene.sourceforge.net/talks/inktomi/
Books
• Lucene in Action,
Second Edition
• Michael McCandless,
Erik Hatcher, Otis
Gospodnetić
• Manning Publications
• July 2010
• http://www.manning.com/hatcher3/
Books
• Taming Text
• Grant S. Ingersoll,
Thomas S. Morton,
Andrew L. Farris
• Manning Publications
• January 2013
• http://www.manning.com/ingersoll/
Books
• Introduction to
Information Retrieval
• Christopher D. Manning,
Prabhakar Raghavan,
Hinrich Schütze
• Cambridge University Press
• 2008
• http://www-nlp.stanford.edu/IR-book/
Presentations
• http://www.slideshare.net/nitin_stephens/lucene-basics
Blogs
• http://blog.mikemccandless.com/
Sample Files
All the literature shown in the code samples
comes from Project Gutenberg.
http://www.gutenberg.org/

Illuminating Lucene.Net

  • 1. Illuminating Lucene.Net: Bringing Full-Text Search to Light W. Dean Thrasher 14 May 2013
  • 2. Agenda • About the presenter • About Lucene.Net – What it is – What it does – How it works – Who uses it – Why you should care
  • 3. More Agenda • Core concepts – Lucene structure – Luke – Terminology • Code examples • Things to know • Recap • References
  • 6. What is Lucene.Net? Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
  • 7. What is Lucene? Apache Lucene is a high-performance, full- featured text search engine library written entirely in Java. Apache Lucene is an open source project available for free download.
  • 8. History 1997 – Lucene project began by Doug Cutting 2000 – First open source release 2002 – First Apache Jakarta release 2005 – Lucene becomes a top-level project 2006 – Lucene.Net gets Apache incubation status 2010 – Lucene.Net orphaned by original committers 2011 – Lucene.Net reaccepted into Apache Incubator 2012 – Lucene.Net graduates from the Incubator
  • 9. Why you should care You want to provide customers with a “Google-like” search experience You want to tune incoming queries or results ranking You want better performance than SQL “like” searches You want to avoid deploying a separate search tool with your website or application
  • 10. What does it do? • Allows you to index and search vast amounts of text quickly • Provides a powerful query syntax • Integrates into applications easily
  • 11. How it works • Lucene uses an inverted index – Maps terms to the documents that contain them • Lucene manages its index – Stores the index in memory or on disk – Allows documents to be added or removed • Makes an index for each document • Merges the index with a set of other indices
  • 12. Who uses Lucene.Net? • Stackoverflow • RavenDB • Sitecore • Orchard • MindTouch • Umbraco • Sitefinity • SubText
  • 14. Differences between Java and .Net The Lucene.Net API: • Lags a few steps behind the Java version of Lucene • Takes advantage of advanced .Net features not found in Java But it: • Preserves the core Lucene concepts • Maintains indexes that are compatible with the Java version
  • 15. Logical Index Storage • Field – a name/value pair • Document – a sequence of fields • Index – a collection of documents
  • 16. Physical Index Storage • Lucene generates a series of files within a single directory • Moving an index is a copy-and-paste operation • You can compress or zip an index to archive it
  • 17. Luke • Lucene Index Toolbox • Built in Java, but can read Lucene.Net indexes • http://code.google.com /p/luke/
  • 18. Analyzers and Tokens • Analyzers take strings of text and break them into tokens • Tokens are chunks of text and associated metadata
  • 19. Terms, Queries and Hits • Terms – the basic unit for searching. A field name and a value to seek. • Queries – combine terms to form search criteria • Hits – a ranked list of pointers to documents
  • 21. Create documents demo • IndexWriter • Directory • Analyzer • Document • Field
  • 22. Read documents demo • IndexReader • Term • Query • Hits
  • 23. Update documents demo • IndexWriter • Document • Term
  • 24. Delete documents demo • IndexWriter • Query • Term
  • 25. Search demo • IndexSearcher • QueryParser • Query • Term • TopDocs • ScoreDoc
  • 27. Transactional Lucene • Lucene supports ACID commits to its indexes • Lucene uses the Commit and Rollback syntax, much like relational databases. • Source: http://blog.mikemccandless.com/2012/03/transactional-lucene.html
  • 28. Lucene index types • FSDirectory (your first choice) – Stores indexed documents on disk – Persists data across sessions – Best choice for most applications • RAMDirectory (useful for unit testing) – Stores indexed documents in memory – Entire index must fit into available memory – Does not persist data – Faster than FSDirectory
  • 29. Precalculation • How you store things in Lucene matters – choose field options and analyzers carefully • The way you retrieve information determines how it should be stored • Smaller indexes give you better performance
  • 30. Field.Store • Yes – stores the text in its original form • No – the original text is not preserved
  • 31. Field.Index • No – the field is not indexed, so it is not searchable • Not analyzed – the text is treated as single unit and indexed whole • Analyzed – the text is broken down into tokens and indexed
  • 32. Field.TermVector • No – Does not store term vectors • Yes – Stores the term vectors of each document (terms and number of occurrences) • With Positions Offsets – Term vector, token position and offset information
  • 33. Field types indexing options (Field: Stored / Analyzed / Vectored) – Id: Yes / Not analyzed / No – Modified: Yes / Not analyzed / No – Path: Yes / Analyzed / No – Content: No / Analyzed / With Positions Offsets • An example of storing fields related to files on your computer.
  • 34. Analyzers • Break apart text into tokens; each token gets indexed separately • Remove stop words • Decide how to handle punctuation • Handle languages and case sensitivity • You can create your own by building from scratch or chaining existing analyzers
  • 35. Types of Queries • TermQuery • PhraseQuery • RangeQuery • PrefixQuery, WildcardQuery • FuzzyQuery • Use BooleanQuery to combine them
  • 36. Query syntax (query type: purpose; sample) • TermQuery: single-word query; scarlett • PhraseQuery: matches terms in order; “frankly my dear” • RangeQuery: matches documents between the terms; [1861 TO 1865] (inclusive), {1861 TO 1865} (exclusive) • WildcardQuery: lightweight regex-like term matching; Atl* D?m? • PrefixQuery: matches terms that begin with the string; War* • FuzzyQuery: closeness matching; cry~ • BooleanQuery: combines other queries into complex expressions; Scarlett AND “frankly my dear” -voldemort
  • 37. Query, Filter, and Sort • Lucene.Net can handle all three • Default sort is by relevance • Prefer queries to filters – they perform better
  • 39. Linq Providers • LINQ to Lucene • http://linqtolucene.codeplex.com/ • Lucene.Net.Linq • https://github.com/themotleyfool/Lucene.Net.Linq • Chris Eldredge • MotleyFool
  • 40. Recap • Why would I use a search engine? • Why would I use Lucene.Net? • How would I add Lucene.Net to my project? – Web – Desktop • Where could I go to learn more? • When can I buy Dean a beer?
  • 42. Web References • Lucene.Net – http://lucenenet.apache.org • Solr – http://lucene.apache.org/solr • Wikipedia – http://en.wikipedia.org/wiki/Lucene – http://en.wikipedia.org/wiki/Search_engine_indexing • Academic discussions – http://lucene.sourceforge.net/talks/pisa/ – http://lucene.sourceforge.net/talks/inktomi/
  • 43. Books • Lucene in Action, Second Edition • Michael McCandless, Erik Hatcher, Otis Gospodnetić • Manning Publications • July 2010 • http://www.manning.com/hatcher3/
  • 44. Books • Taming Text • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris • Manning Publications • January 2013 • http://www.manning.com/ingersoll/
  • 45. Books • Introduction to Information Retrieval • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze • Cambridge University Press • 2008 • http://www-nlp.stanford.edu/IR-book/
  • 48. Sample Files All the literature shown in the code samples comes from Project Gutenberg. http://www.gutenberg.org/
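
The "How it works", "Analyzers and Tokens", and "Terms, Queries and Hits" slides describe an inverted index that maps terms to the documents containing them. The following is a minimal sketch of that idea in plain Python, not the Lucene.Net API: the `analyze` function, the `ToyIndex` class, the stop-word list, and the sample sentences are all assumptions made up for illustration, and the hits come back unranked, whereas Lucene sorts by relevance by default.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "and", "my", "of", "the", "to"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


class ToyIndex:
    """Toy inverted index: maps (field, token) to the ids of matching documents."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        # A document is a set of named fields; each field value is analyzed.
        for name, value in fields.items():
            for token in analyze(value):
                self.postings[(name, token)].add(doc_id)

    def term_query(self, field, value):
        # A term is a field name plus a value to seek.
        return sorted(self.postings.get((field, value.lower()), set()))

    def boolean_and(self, terms):
        # Combine term queries: keep only documents that match every term.
        sets = [self.postings.get((f, v.lower()), set()) for f, v in terms]
        return sorted(set.intersection(*sets)) if sets else []


index = ToyIndex()
index.add(1, {"content": "Frankly, my dear, I don't give a damn."})
index.add(2, {"content": "Scarlett O'Hara was not beautiful."})
index.add(3, {"content": "Tomorrow is another day, said Scarlett."})

scarlett_hits = index.term_query("content", "Scarlett")
both_hits = index.boolean_and([("content", "scarlett"), ("content", "day")])
```

Looking up a term is a single dictionary access no matter how many documents are indexed, which is the basic reason an inverted index tends to beat scanning every row with a SQL "like" search.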

Editor's Notes

  1. Egad, the PUNishment! Well, at least I didn’t have a boring “Introduction to Lucene.NET” title.
  2. Oooh, an agenda. Aren’t I organized?
  3. Please send me an email to get in touch with me. Keep up with what I’m doing on the Infovark website or on my LinkedIn profile. I’ve listed my Twitter handles – personal and work – but I rarely log into Twitter for any length of time. Send me a private message if you want to get my attention on Twitter.
  4. Doug Cutting had written search engines in other languages, but he wanted to teach himself Java, so the Lucene project began. Although he started building a commercial venture around the project, he decided that he preferred writing code to running a business. He open sourced the code in 2000. Lucene was adopted by the Apache Software Foundation in 2001. Lucene.Net, which began as an independent port of Lucene, was accepted by the ASF in 2006. In 2010, Lucene.Net hit a rough patch, but thanks to the efforts of the Alt.Net community, it was reintroduced to the Apache Incubator. In 2012, it graduated from the Incubator and became a full-fledged Apache project.
  5. Inverted index: maps terms to the documents that contain them. Terms may include metadata to improve ranking. Terms may include position data for proximity searches.
  6. These are a few examples of websites, applications, and platforms that use Lucene.Net. If I included those that use Lucene, the Java version, the list would be huge. Even if you don’t use Lucene.Net directly, chances are good that you use something that does. Lucene has become a foundational technology for many of the tools and sites we use today, but not many folks working on the Microsoft side are familiar with it. Some prominent Java examples include LinkedIn, Twitter, IBM’s OmniFind, and many more.
  7. The .Net version is catching up with the Java version, but it remains nearly a full version behind. The .Net API is much nicer to work with, having good collections and generics support. Tools that interact with a Lucene index will work regardless of the Lucene library that created it.
  8. Although we’ll be working with the Lucene.NET API tonight, many of the concepts you’ll hear will apply to any search engine, though the specific terminology may differ a little. Let’s review some basic definitions we’ll use throughout the rest of the presentation. Index – a collection of documents. Document – a sequence of fields. Field – a string name/value pair.
  9. Luke is one of the ugliest applications I’ve ever seen, but it’s extremely useful. It exposes just about every aspect of the Lucene API, so it makes a great test-bed for trying out different ideas.
  10. Analyzer – breaks field values into tokens. Token – a tuple consisting of a chunk of text and its associated metadata. Tokens are the raw bits that get indexed. (Tokens and terms are closely related.)
  11. Query – a way to ask a question of an index. Term – a tuple containing a field and a value to seek.
  12. Here are some of the key classes used to add documents to the index. I really ought to add some details to the slide for folks who can’t see the code sample.
  13. Updating is a fairly new operation in the Lucene.Net API. Under the hood, it’s doing a Delete operation then an Add operation.
  14. Did you know that you can use an IndexReader to update and delete documents, too? Yes, but I don’t recommend it. This is one of the parts of the API that’s getting revised in the near future.
  15. Unlike a relational database, there’s no “normal form” to guide you when structuring a Lucene index. The key thing to remember is that the
  16. Keeping the original text within the Lucene index is convenient, but can vastly increase the size of your indexes.
  17. Term Vector Yes
  18. Just an example of how you might combine the flags when adding fields to a document.
  19. TermQuery – retrieves documents by a key. PrefixQuery – matches the start of a string value. RangeQuery – searches starting at one term and ending at another (useful for date searches). BooleanQuery – lets you combine other queries using AND, OR, NOT operations. PhraseQuery – finds terms a specified distance from one another. FuzzyQuery – matches terms similar to a specified term.
  20. Examples of query syntax.
  21. Some odds and ends on Queries, filters and sorting.
  22. We can finally dispose of our Lucene objects in versions 2.9.4 and later. If you’re using older versions, you must remember to try/finally the FSDirectory and IndexWriter. Remember that it’s much more efficient to add a bunch of documents within a single using statement than to open a new IndexWriter each time.
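
Notes 13 and 22 make two points that a small sketch may help illustrate: an update is a delete followed by an add, and it pays to batch many changes inside one writer session. The code below is a toy model in plain Python, not the Lucene.Net IndexWriter. The `ToyWriter` class, its method names, and the file names are hypothetical, and Python's `with` block stands in for the C# using statement under the assumption that a clean exit commits and an exception rolls back.

```python
class ToyWriter:
    """Toy stand-in for an index writer; NOT the Lucene.Net IndexWriter API."""

    def __init__(self, store):
        self.store = store          # committed documents (a plain list)
        self.pending = list(store)  # working copy, invisible until commit

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.store[:] = self.pending  # commit on a clean exit
        return False  # on error, pending changes are dropped (rollback)

    def add_document(self, doc):
        self.pending.append(doc)

    def delete_documents(self, field, value):
        # Remove every pending document whose field matches the term.
        self.pending = [d for d in self.pending if d.get(field) != value]

    def update_document(self, field, value, new_doc):
        # An update is simply a delete followed by an add.
        self.delete_documents(field, value)
        self.add_document(new_doc)


store = []
# One writer session for the whole batch, rather than one per document.
with ToyWriter(store) as writer:
    writer.add_document({"id": "1", "path": "gone_with_the_wind.txt"})
    writer.add_document({"id": "2", "path": "war_and_peace.txt"})
    writer.update_document("id", "1", {"id": "1", "path": "gwtw.txt"})
```

Because the update deletes first, the replaced document moves to the end of the list here, a reminder that a document's position in the index should not be treated as a stable identifier.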