Introduction to Open Source Search with Apache Lucene and Solr
Grant Ingersoll
The How Many Game
How many of you:
  • Have taken a class in Information Retrieval (IR)?
  • Are doing work/research in IR?
  • Have heard of or are using Lucene?
  • Have heard of or are using Solr?
  • Are doing work on core IR algorithms such as compression techniques or scoring?
  • Are doing UI/Application work/research as they relate to search?
Topics
  • Brief Bio
  • Search 101 (skip?)
  • What is:
      • Apache Lucene
      • Apache Solr
  • What can they do?
      • Features and functionality
      • Intangibles
  • What’s new in Lucene and Solr?
  • How can they help my research/work/____?
Brief Bio
  • Apache Lucene/Solr Committer
  • Apache Mahout co-founder
      • Scalable Machine Learning
  • Co-founder of Lucid Imagination
      • http://www.lucidimagination.com
  • Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
  • Co-author of the upcoming “Taming Text” (Manning Publications)
      • http://www.manning.com/ingersoll
Search 101
  • Search tools are designed for dealing with fuzzy data/questions
      • Work well with structured and unstructured data
      • Perform well when dealing with large volumes of data
      • Many apps don’t need the limits that databases place on content
      • Search fits well alongside a DB too
  • Given a user’s information need (query), find and, optionally, score content relevant to that need
      • Many different ways to solve this problem, each with tradeoffs
  • What does “relevant” mean?
Search 101
  • Relevance
      • Vector Space Model (VSM) for relevance
      • Common across many search engines
      • Apache Lucene is a highly optimized implementation of the VSM
  • Indexing
      • Finds and maps terms and documents
      • Conceptually similar to a book index
      • At the heart of fast search/retrieve
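To make the two ideas on this slide concrete, here is a minimal Python sketch of an inverted index (the "book index" mapping terms to documents) and a TF-IDF cosine ranking in the spirit of the Vector Space Model. It is a toy under stated assumptions, not Lucene's implementation: the function names, the whitespace tokenizer, and the `log(N/df) + 1` weighting are all choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real analysis.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, num_docs):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    def idf(term):
        # Rarer terms get a higher weight.
        return math.log(num_docs / len(index[term])) + 1.0

    # Squared length of every document vector, needed for the cosine.
    norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(term)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * weight) ** 2

    # Query vector, restricted to terms that occur in the index.
    q_weights = {t: tf * idf(t)
                 for t, tf in Counter(tokenize(query)).items() if t in index}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Only documents listed in the postings of a query term are touched,
    # which is why the inverted index makes retrieval fast.
    scores = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, tf in index[term].items():
            scores[doc_id] += q_weight * tf * idf(term)

    ranked = [(doc_id, dot / (q_norm * math.sqrt(norms[doc_id])))
              for doc_id, dot in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

A call such as `search("lucene search", build_index(docs), len(docs))` returns `(doc_id, score)` pairs, best match first; documents sharing no term with the query are never scored.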
Apache Lucene in a Nutshell
  • http://lucene.apache.org/java
  • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
  • Fast and efficient scoring and indexing algorithms
  • Lots of contributions to make common tasks easier:
      • Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
  • Most widely deployed search library on the planet
Lucene Basics
  • Content is modeled via Documents and Fields
      • Content can be text, integers, floats, dates, custom
      • Analysis can be employed to alter content before indexing
  • Searches are supported through a wide range of Query options
      • Keyword Terms
      • Phrases
      • Wildcards
      • Many, many more
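The idea that analysis alters content before indexing can be sketched as a chain of small token filters. The Python below is only an illustration of the concept, not Lucene's Analyzer API: the stopword list, the filter names, and the crude plural-stripping "stemmer" are assumptions made for the example.

```python
import re

# Assumption: a tiny illustrative stopword list, not the one any engine ships.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in"})


def tokenize(text):
    # Split into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    # Strip one trailing "s" from longer tokens; a stand-in for a real stemmer.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase, remove_stopwords, naive_stem)):
    """Tokenize, then pass the tokens through each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens
```

Because the chain is just an ordered tuple of functions, swapping or dropping a step (for example `analyze(text, filters=(lowercase,))`) changes which terms end up in the index, and the same chain would normally be applied to queries as well.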
Apache Solr in a Nutshell
  • http://lucene.apache.org/solr
  • Lucene-based Search Server + other features and functionality
  • Access Lucene over HTTP:
      • Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
  • Most programming tasks in Lucene are configuration tasks in Solr
  • Faceting (guided navigation, filters, etc.)
  • Replication and distributed search support
  • Lucene Best Practices
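Since Solr is reached over plain HTTP, a client mostly needs to build a URL. The sketch below assembles a request for Solr's `select` handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The helper name, the base URL (which in many installations also includes a core or collection name), and the `cat` field are assumptions for illustration; no request is sent.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, facet_fields=()):
    """Build a query URL for a Solr select handler (hypothetical helper)."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local example server and field name.
url = solr_select_url("http://localhost:8983/solr", "ipod",
                      rows=5, facet_fields=["cat"])
```

Fetching that URL with any HTTP client (for example `urllib.request.urlopen`) against a running server should return a JSON response; the facet parameters are what drive guided navigation.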
A small sampling of Lucene/Solr-Powered Sites
  • Buy.com
Features and Functionality
Quick Solr/Lucene Demo
  • Pre-reqs: Apache Ant 1.7.x, Subversion (SVN)
  • Command Line 1:
      svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
      cd solr-trunk/solr/
      ant example
      cd example
      java -Dsolr.clustering.enabled=true -jar start.jar
  • Command Line 2:
      cd exampledocs; java -jar post.jar *.xml
  • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Other Features
  • Data Import Handler
      • Database, Mail, RSS, etc.
  • Rich document support via Apache Tika
      • PDF, MS Office, Images, etc.
  • Replication for high query volume
  • Distributed search for large indexes
      • Production systems with 1B+ documents
  • Configurable Analysis chain and other extension points
      • Total control over tokenization, stemming, etc.
Intangibles
  • Open Source
      • Flexible, non-restrictive license
      • Apache License v2: non-viral
      • “Do what you want with the software, just don’t claim you wrote it”
  • Large community willing to help
  • Great place to learn about real world IR systems
  • Many books and other documentation
      • Lucene in Action by Hatcher, McCandless and Gospodnetic
What’s New?
  • https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
  • https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
  • Codecs
      • Pluggable Index Formats
      • Provide different index compression techniques
  • Stats to enable alternate scoring approaches
      • BM25, Lang. Modeling, etc. (more work to be done here)
  • Faster
      • Java Strings are slow; convert to use byte arrays
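To show why index statistics matter for alternate scoring, here is a minimal Python sketch of BM25 in one common textbook form. It needs exactly the kind of collection-level numbers the slide alludes to: document frequency, document count, and average document length. The function name, argument layout, and the defaults `k1=1.2`, `b=0.75` are assumptions for illustration and say nothing about how Lucene exposes these statistics.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with a textbook BM25 variant.

    doc_freqs maps a term to the number of documents containing it.
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from this document or from the collection
        # Rarer terms contribute more.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates, and long documents are penalised via b.
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_part
    return score
```

With the same term frequency, a shorter document scores higher than a longer one, which is the length normalisation that plain term counting lacks.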
Other New Items
  • Many new Analyzers (tokenizers, etc.)
      • Richer Language support (Hindi, Indonesian, Arabic, …)
  • Richer Geospatial (Local) Search capabilities
      • Score, filter, sort by distance
      • http://wiki.apache.org/solr/SpatialSearch
  • Results Grouping
      • Group Related Results
      • http://wiki.apache.org/solr/FieldCollapsing
  • More Faceting Capabilities
      • Pivot
      • New underlying algorithms
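As a rough picture of what "filter, sort by distance" means, the Python sketch below computes great-circle distances with the haversine formula, keeps documents within a radius, and orders them nearest first. It is not Solr's spatial implementation; the `lat`/`lon` field names, the helper names, and the spherical-Earth radius are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, assuming a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_and_sorted(docs, lat, lon, max_km):
    """Keep docs within max_km of (lat, lon), nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    hits = [(dist, d) for dist, d in hits if dist <= max_km]
    hits.sort(key=lambda hit: hit[0])
    return hits
```

The computed distance could equally be folded into a relevance score instead of being used only as a filter or sort key.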
How can Lucene/Solr help me?
Job Trends
  • http://www.indeed.com
Other Things that Can Help
  • Nutch
      • Crawling
      • http://nutch.apache.org
  • Mahout
      • Machine learning (clustering, classification, others)
      • http://mahout.apache.org
  • OpenNLP
      • Part of Speech, Parsers, Named Entity Recognition
      • http://incubator.apache.org/opennlp
  • Open Relevance Project
      • Relevance Judgments
      • http://lucene.apache.org/openrelevance
Resources
  • http://lucene.apache.org
  • http://www.lucidimagination.com
  • {java-user|solr-user}@lucene.apache.org
  • @gsingers
  • http://www.slideshare.net/gsingers
  • grant@lucidimagination.com

Intro to Apache Lucene and Solr

  • 1. Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll
  • 2. The How Many Game
      How many of you:
      • Have taken a class in Information Retrieval (IR)?
      • Are doing work/research in IR?
      • Have heard of or are using Lucene?
      • Have heard of or are using Solr?
      • Are doing work on core IR algorithms such as compression techniques or scoring?
      • Are doing UI/Application work/research as they relate to search?
  • 3. Topics
      • Brief Bio
      • Search 101 (skip?)
      • What is:
          • Apache Lucene
          • Apache Solr
      • What can they do?
          • Features and functionality
          • Intangibles
      • What’s new in Lucene and Solr?
      • How can they help my research/work/____?
  • 4. Brief Bio
      • Apache Lucene/Solr Committer
      • Apache Mahout co-founder
          • Scalable Machine Learning
      • Co-founder of Lucid Imagination
          • http://www.lucidimagination.com
      • Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
      • Co-author of the upcoming “Taming Text” (Manning Publications)
          • http://www.manning.com/ingersoll
  • 5. Search 101
      • Search tools are designed for dealing with fuzzy data/questions
          • Work well with structured and unstructured data
          • Perform well when dealing with large volumes of data
          • Many apps don’t need the limits that databases place on content
          • Search fits well alongside a DB too
      • Given a user’s information need (query), find and, optionally, score content relevant to that need
          • Many different ways to solve this problem, each with tradeoffs
      • What does “relevant” mean?
  • 6. Search 101
      • Relevance
          • Vector Space Model (VSM) for relevance
          • Common across many search engines
          • Apache Lucene is a highly optimized implementation of the VSM
      • Indexing
          • Finds and maps terms and documents
          • Conceptually similar to a book index
          • At the heart of fast search/retrieve
  • 7. Apache Lucene in a Nutshell
      • http://lucene.apache.org/java
      • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
      • Fast and efficient scoring and indexing algorithms
      • Lots of contributions to make common tasks easier:
          • Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
      • Most widely deployed search library on the planet
  • 8. Lucene Basics
      • Content is modeled via Documents and Fields
          • Content can be text, integers, floats, dates, custom
          • Analysis can be employed to alter content before indexing
      • Searches are supported through a wide range of Query options
          • Keyword
          • Terms
          • Phrases
          • Wildcards
          • Many, many more
  • 9. Apache Solr in a Nutshell
      • http://lucene.apache.org/solr
      • Lucene-based Search Server + other features and functionality
      • Access Lucene over HTTP:
          • Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
      • Most programming tasks in Lucene are configuration tasks in Solr
      • Faceting (guided navigation, filters, etc.)
      • Replication and distributed search support
      • Lucene Best Practices
  • 10. A small sampling of Lucene/Solr-Powered Sites: Buy.com
  • 12. Quick Solr/Lucene Demo
      • Pre-reqs: Apache Ant 1.7.x, Subversion (SVN)
      • Command Line 1:
          svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
          cd solr-trunk/solr/
          ant example
          cd example
          java -Dsolr.clustering.enabled=true -jar start.jar
      • Command Line 2:
          cd exampledocs; java -jar post.jar *.xml
      • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
  • 13. Other Features
      • Data Import Handler
          • Database, Mail, RSS, etc.
      • Rich document support via Apache Tika
          • PDF, MS Office, Images, etc.
      • Replication for high query volume
      • Distributed search for large indexes
          • Production systems with 1B+ documents
      • Configurable Analysis chain and other extension points
          • Total control over tokenization, stemming, etc.
  • 14. Intangibles
      • Open Source
      • Flexible, non-restrictive license
          • Apache License v2 – non-viral
          • “Do what you want with the software, just don’t claim you wrote it”
      • Large community willing to help
      • Great place to learn about real-world IR systems
      • Many books and other documentation
          • Lucene in Action by Hatcher, McCandless and Gospodnetic
  • 15. What’s New?
      • https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
      • https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
      • Codecs
          • Pluggable index formats
          • Provide different index compression techniques
      • Stats to enable alternate scoring approaches
          • BM25, language modeling, etc.
          • More work to be done here
      • Faster
          • Java Strings are slow; convert to use byte arrays
  • 16. Other New Items
      • Many new Analyzers (tokenizers, etc.)
          • Richer language support (Hindi, Indonesian, Arabic, …)
      • Richer Geospatial (Local) Search capabilities
          • Score, filter, sort by distance
          • http://wiki.apache.org/solr/SpatialSearch
      • Results Grouping
          • Group related results
          • http://wiki.apache.org/solr/FieldCollapsing
      • More Faceting Capabilities
          • Pivot
          • New underlying algorithms
  • 19. Other Things that Can Help
      • Nutch
          • Crawling
          • http://nutch.apache.org
      • Mahout
          • Machine learning (clustering, classification, others)
          • http://mahout.apache.org
      • OpenNLP
          • Part of Speech, Parsers, Named Entity Recognition
          • http://incubator.apache.org/opennlp
      • Open Relevance Project
          • Relevance Judgments
          • http://lucene.apache.org/openrelevance
  • 20. Resources
      • http://lucene.apache.org
      • http://www.lucidimagination.com
      • {java-user|solr-user}@lucene.apache.org
      • @gsingers
      • http://www.slideshare.net/gsingers
      • grant@lucidimagination.com
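
The "Search 101" slides describe indexing as mapping terms to documents, much like a book index, and relevance as a Vector Space Model score. As a rough illustration only, here is a minimal Python sketch of that idea. It is not Lucene's implementation or its scoring formula: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the simplified TF-IDF weighting are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Toy analysis step (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by a simplified TF-IDF sum over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        # Rarer terms get a higher weight (inverse document frequency).
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += math.sqrt(tf) * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample documents, made up for this sketch.
docs = {
    1: "apache lucene is a search library",
    2: "apache solr is a search server built on lucene",
    3: "apache mahout does machine learning",
}
index = build_index(docs)
results = search(index, len(docs), "solr search")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Document 2 ranks first because it matches both query terms and "solr" is the rarer of the two. Document 3 never appears in the results because the lookup only touches the postings of the query terms, which is what makes an inverted index fast.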

Editor's Notes

  1. Rather than talk you through a lot of the features and functionality, let me show you
  2. Do this. Example queries: ipod; 184-pin DDR. Cover: querying, scoring, faceting, clustering, function queries, spatial, grouping, more like this, indexing
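
Because Solr is accessed over HTTP, the example queries in the note above can be expressed as plain URLs. The following is a small Python sketch that only builds such a URL and does not contact a server. The helper name `solr_query_url`, the `select` handler path, and the `cat` facet field are assumptions for illustration; `q`, `wt`, `facet`, `facet.field`, and `debugQuery` are standard Solr request parameters, and the base URL matches the demo slide.

```python
from urllib.parse import urlencode


def solr_query_url(base_url, handler, query, facet_fields=(), debug=False):
    """Build a Solr query URL (hypothetical helper; no network access)."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if debug:
        # Asks Solr to include scoring explanations in the response.
        params.append(("debugQuery", "true"))
    return f"{base_url}/{handler}?{urlencode(params)}"


# "cat" is an assumed field name; use a field from your own schema.
url = solr_query_url(
    "http://localhost:8983/solr",
    "select",
    "184-pin DDR",
    facet_fields=["cat"],
    debug=True,
)
print(url)
```

With the demo server from slide 12 running, the printed URL could be opened in a browser or fetched with any HTTP client, provided the handler and field exist in that Solr configuration.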