Lucene Introduction
Otis Gospodnetic, Sematext Int’l
@otisg [email_address] http://jroller.com/otis http://sematext.com/
About Otis
What is Lucene?
What Lucene Ain’t
The Lucene Family
Integration [diagram; labels: Data Source, Data Source, Gather, Parse, Make Doc, Index, Index, Search, Search App (e.g. webapp), Search UI]
Integration: Rich Doc Indexing [diagram; labels: HTML, PDF, MS Word, PDF, Gather, Parse with Tika, Make Doc, Index, Index]
Lucene Strengths
Query Types
Query Syntax
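Code: FS Indexer

    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Lucene 2.9/3.0-era API
    public class Indexer {
        private IndexWriter writer;

        public Indexer(String indexDir) throws IOException {
            Directory dir = FSDirectory.open(new File(indexDir));
            // true = create a new index, replacing any existing one
            writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
        }

        public void close() throws IOException {
            writer.close();
        }

        public void index(String dataDir, FileFilter filter) throws Exception {
            // A null filter accepts every entry in the directory
            File[] files = new File(dataDir).listFiles(filter);
            for (File f : files) {
                if (!f.isFile()) continue; // skip subdirectories
                Document doc = new Document();
                // Tokenized from the Reader, not stored
                doc.add(new Field("contents", new FileReader(f)));
                // Stored verbatim and indexed as a single term
                doc.add(new Field("filename", f.getName(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
        }
    }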
Indexing Pipeline [diagram; labels: Document, add, Document Writer, Tokenizer, TokenFilter, Inverted Index]
Indexer Pipeline: Analysis (Source: Lucene in Action)
Analysis in Action

"The quick brown fox jumped over the lazy dogs"
    WhitespaceAnalyzer : [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
    SimpleAnalyzer     : [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
    StopAnalyzer       : [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
    StandardAnalyzer   : [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

"XY&Z Corporation - xyz@example.com"
    WhitespaceAnalyzer : [XY&Z] [Corporation] [-] [xyz@example.com]
    SimpleAnalyzer     : [xy] [z] [corporation] [xyz] [example] [com]
    StopAnalyzer       : [xy] [z] [corporation] [xyz] [example] [com]
    StandardAnalyzer   : [xy&z] [corporation] [xyz@example.com]
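
The differences between the first three analyzers above can be imitated in a few lines of plain Java. This is only a sketch of the observable behaviour, not Lucene's implementation: the class name `AnalyzerSketch`, its method names, and the tiny stop list are all assumptions made for illustration (Lucene's own stop list is longer), and StandardAnalyzer's grammar-based handling of e-mail addresses and company names is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the analyzer outputs listed above; NOT Lucene code.
final class AnalyzerSketch {

    // Hypothetical, deliberately small stop list used only for this sketch.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: break at every non-letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the SimpleAnalyzer tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String s = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(s));
        System.out.println(simple(s));
        System.out.println(stop("The quick brown fox jumped over the lazy dogs"));
    }
}
```

The point to take away is that the analyzer chosen at indexing time decides which terms exist in the index, so the same analyzer (or a compatible one) is normally used when parsing queries.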
Field Options
Inverted Index [figure; source: developer.apple.com]
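
Since the figure itself is not reproduced here, the idea behind an inverted index can be sketched in plain Java: a map from each term to the set of documents that contain it. This is a toy under stated assumptions, not Lucene's on-disk format; the class `InvertedIndexSketch` and its methods `add` and `lookup` are hypothetical names, and tokenization is a naive lowercase split on non-alphanumeric characters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory inverted index; illustrative only, NOT Lucene's data structure.
final class InvertedIndexSketch {

    // term -> sorted ids of the documents containing that term (the "postings")
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text naively and records each term against a new doc id.
    int add(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of documents containing the term (empty set if none).
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : ids;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("The quick brown fox");
        index.add("The lazy dogs");
        index.add("Quick dogs");
        System.out.println("quick -> " + index.lookup("quick"));
        System.out.println("dogs  -> " + index.lookup("dogs"));
    }
}
```

Looking up a term is a single map access rather than a scan over every document, which is the property that makes term queries fast.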
Index Directory
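Code: Searcher

    import java.io.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.util.Version;

    public class Searcher {
        public void search(String indexDir, String q)
                throws IOException, ParseException {
            Directory dir = FSDirectory.open(new File(indexDir));
            IndexSearcher is = new IndexSearcher(dir, true); // true = read-only
            // Version-taking constructor: available in 2.9 and required in 3.0
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                    new StandardAnalyzer(Version.LUCENE_CURRENT));
            Query query = parser.parse(q);
            TopDocs hits = is.search(query, 10); // top 10 hits by score
            System.err.println("Found " + hits.totalHits + " document(s)");
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                ScoreDoc scoreDoc = hits.scoreDocs[i];
                Document doc = is.doc(scoreDoc.doc); // load stored fields
                System.out.println(doc.get("filename"));
            }
            is.close();
        }
    }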
Code: Doc Deletion
Code: Doc Updates

Via IndexWriter facade:

    void updateDocument(Term term, Document doc, Analyzer analyzer)
        Updates a document by first deleting the document(s) containing term and then adding the new document.

    void updateDocument(Term term, Document doc)
        Updates a document by first deleting the document(s) containing term and then adding the new document.
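
The delete-then-add semantics described in the Javadoc above can be made concrete with a small plain-Java model. This is a sketch under assumptions, not Lucene code: `UpdateSketch`, its `doc` helper, and the `id`/`title` field names are hypothetical, and a "term" is reduced to an exact field/value match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of update = delete all matches + add; illustrative only, NOT Lucene.
final class UpdateSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Removes every document whose field equals value; returns how many were removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // Mirrors the documented behaviour: delete all matches first, then add.
    void updateDocument(String field, String value, Map<String, String> doc) {
        deleteDocuments(field, value);
        addDocument(doc);
    }

    int size() {
        return docs.size();
    }

    List<String> titles() {
        List<String> out = new ArrayList<>();
        for (Map<String, String> d : docs) out.add(d.get("title"));
        return out;
    }

    // Hypothetical helper that builds a two-field document.
    static Map<String, String> doc(String id, String title) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("title", title);
        return d;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.addDocument(doc("1", "old title"));
        index.addDocument(doc("2", "other"));
        index.updateDocument("id", "1", doc("1", "new title"));
        System.out.println(index.titles());
    }
}
```

Because every document matching the term is deleted, the term passed to an update should identify exactly one document, for example a unique id field indexed as a single untokenized term.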
Pitfalls
Performance Tips
Lucene 2.9 & 3.0
Community
[email_address] [email_address]
"I posted, went to get a sandwich, and came back to see two answers. The change works, and I can get the fix into production today. This list is magic."
Resources
Contact