Apache Solr
Oberseminar, 12.06.2015
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Péter Király, pkiraly@gwdg.de
What is Apache Solr?
Solr is the popular, blazing-fast, open source enterprise
search platform built on Apache Lucene
History in one minute
● 1999: Doug Cutting published Lucene
● 2004: Yonik Seeley published Solr
● 2006: Apache project (2007: TLP)
● 2009: LucidWorks company
● 2010: Merge of Lucene and Solr
● 2011: 3.1
● 2012: 4.0
● 2015: 5.0
“Sister” projects
● Nutch: web scale search engine
● Tika: document parser
● Hadoop: distributed storage and data processing
● Elasticsearch: alternative to Solr
● forks/ports of Lucene
● client libraries and tools (Luke index viewer)
Main features I
● Faceted navigation
● Hit highlighting
● Query language
● Schema-less mode and Schema REST API
● JSON, XML, PHP, Ruby, Python, XSLT,
Velocity and custom Java binary outputs
● HTML administration interface
Main features II
● Replication to other Solr servers
● Distributed search through sharding
● Search results clustering based on Carrot2
● Extensible through plugins
● Relevance boosting via functions
● Caching - queries, filters, and documents
● Embeddable in a Java Application
Main features III
● Geo-spatial search, including multiple
points per document and polygons
● Automated management of large clusters
through ZooKeeper
● Function queries
● Field Collapsing and grouping
● Auto-suggest
Inverted index
Original documents:
Doc # Content field
1 A Fun Guide to Cooking
2 Decorating Your Home
3 How to Raise a Child
4 Buying a New Car
Inverted index
Index structure
Term        Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7
a           0    1    1    1    0    0    0
becoming    0    0    0    0    1    0    0
beginner’s  0    0    0    0    0    1    0
buy         0    0    1    0    0    0    0
stored as a bit vector
stored as a reference to a tree structure
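The idea of an inverted index can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores its index: it assumes whitespace tokenization and lowercasing only, and the function name `build_inverted_index` is made up for this example. The documents are the four titles from the previous slide.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
}

index = build_inverted_index(docs)
print(index["a"])     # IDs of the documents that contain the term "a"
print(index["home"])  # IDs of the documents that contain the term "home"
```

A lookup is then a dictionary access on the term instead of a scan over every document.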
Indexing
Document ~ RDBMS record
Fields (key-value structure):
● types (text, numeric, date, point, custom)
● indexed, stored, multiple, required
● field name patterns (prefixes, suffixes, such
as *_tx)
● special fields (identifier, _version_)
Indexing
formats: JSON, XML, binary, RDBMS, ...
connections: file, Data Import Handler, API
sharding (separating documents into multiple
parts)
denormalized documents - (almost) no JOIN ;-(
copy field
catch all field (contains everything)
A document example (XML)
<doc>
<field name="id">F8V7067-APL-KIT</field> <!-- string -->
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field> <!-- text -->
<field name="cat">electronics</field>
<field name="cat">connector</field> <!-- multivalued -->
<field name="price">19.95</field> <!-- float -->
<field name="inStock">false</field> <!-- boolean -->
<field name="store">45.18014,-93.87741</field> <!-- geo point -->
<field name="manufacturedate_dt">2005-08-01T16:30:25Z</field> <!-- date -->
</doc>
A document example (JSON)
{
"id": "F8V7067-APL-KIT",
"name": "Belkin Mobile Power Cord for iPod w/ Dock",
"cat": ["electronics", "connector"],
"price":19.95,
"inStock":false,
"store": "45.18014,-93.87741",
"manufacturedate_dt": "2005-08-01T16:30:25Z"
}
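A JSON document like this is typically sent to Solr's update handler over HTTP. The sketch below only builds the request and does not send it. The host, port and collection name in `SOLR_UPDATE_URL` are assumptions for illustration, and `build_update_request` is a hypothetical helper, not part of any Solr client library.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port and collection name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/example/update?commit=true"

doc = {
    "id": "F8V7067-APL-KIT",
    "name": "Belkin Mobile Power Cord for iPod w/ Dock",
    "cat": ["electronics", "connector"],
    "price": 19.95,
    "inStock": False,
    "store": "45.18014,-93.87741",
    "manufacturedate_dt": "2005-08-01T16:30:25Z",
}

def build_update_request(url, documents):
    """Wrap a list of documents in a JSON POST request for the update handler."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_update_request(SOLR_UPDATE_URL, [doc])
# To actually send it to a running server:
# urllib.request.urlopen(request)
```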
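A document example (SolrJ library)
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// SolrJ 4.x class names; the URL is elided on the original slide
SolrServer solr = new HttpSolrServer("http://…");
SolrInputDocument doc = new SolrInputDocument();
doc.setField("id", "F8V7067-APL-KIT");
doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
...
solr.add(doc);
solr.commit(true, true); // waitFlush, waitSearcher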
Text analysis chain
1) character filters: preprocess text
pattern replace, ASCII folding, HTML stripping
2) tokenizers: split text into smaller units
whitespace, lowercase, word delimiter, standard
3) token filters: examine/modify/eliminate tokens
stemming, lowercase, stop words, ...
Text analysis chain
<fieldType name="my-text-type" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-FoldToASCII.txt" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.StopFilterFactory" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Text analysis result
Input: #Yummm :) Drinking a latte at Caffé Grecco in SF’s historic North Beach… Learning text analysis
Output: "#yumm", "drink", "latte", "caffe", "grecco", "sf"/"san francisco", "historic", "north", "beach", "learn", "text", "analysis"
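The three stages of the analysis chain can be imitated in plain Python to see how each one changes the text. This is a rough sketch under stated assumptions, not Solr's implementation: the stop word list is invented, accent folding stands in for the ASCII-folding character filter, and there is no stemming, so "drinking" is not reduced to "drink". Unlike the XML configuration shown earlier, it lowercases before removing stop words.

```python
import unicodedata

STOP_WORDS = {"a", "at", "in"}  # assumed stop word list, for illustration only

def fold_to_ascii(text):
    """Character filter: strip accents, e.g. 'Caffé' becomes 'Caffe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    folded = fold_to_ascii(text)                 # 1) character filter
    tokens = folded.split()                      # 2) whitespace tokenizer
    tokens = [token.lower() for token in tokens] # 3a) lowercase token filter
    return [token for token in tokens if token not in STOP_WORDS]  # 3b) stop filter

tokens = analyze("Drinking a latte at Caffé Grecco")
print(tokens)
```

The same chain has to run on the query side as well, otherwise a query for "Caffé" would not match the indexed token "caffe".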
Performing queries
1) user enters a query (+ specifies other
components)
2) query handler
3) analysis (similar to the one used in indexing)
4) run search
5) adding components
6) serialization (XML, JSON etc.)
Lucene query language
● *:* (→ everything)
● gwdg
● name:gwdg
● name:admin*
● h?ld (→ hold, held)
● name:administrator~ (fuzzy match → administrator, administration)
● name:Gesellschaft~0.6 (similarity measure)
Lucene query language
● name:Max AND name:Planck
● name:Max OR name:Planck
● name:Max NOT name:Planck
● name:"Max Planck"
● name:("Max Planck" OR Gesellschaft)
● "Max Planck"~3 (within 3 words)
→ also matches "Planck Max", "Max Ludwig Planck"
Lucene query language
● max planck^10 (weighting)
● price:[10 TO 20] (→ 10..20)
● price:{10 TO 20} (exclusive bounds; for integers → 11..19)
● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
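The difference between square brackets (inclusive) and curly braces (exclusive) can be checked with a small Python predicate. This is only an illustration of the bound semantics; `in_range` is a made-up helper and has nothing to do with how Lucene evaluates range queries internally.

```python
def in_range(value, low, high, include_low=True, include_high=True):
    """Return True if value lies between low and high with the given bound types."""
    above = value >= low if include_low else value > low
    below = value <= high if include_high else value < high
    return above and below

prices = range(9, 22)

# price:[10 TO 20] keeps both end points
inclusive = [p for p in prices if in_range(p, 10, 20)]

# price:{10 TO 20} drops both end points
exclusive = [p for p in prices if in_range(p, 10, 20, include_low=False, include_high=False)]

print(inclusive[0], inclusive[-1])
print(exclusive[0], exclusive[-1])
```

For non-integer fields the exclusive form still matches values such as 10.5, so "11..19" only describes the integer case.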
Date mathematics
indexing with hour granularity
"born": "2012-05-22T09:30:22Z/HOUR"
search by relative time range, e.g. last month:
born:[NOW/DAY-1MONTH TO NOW/DAY]
keywords:
SECOND, MINUTE, HOUR, DAY, MONTH, YEAR
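What the rounding (`/HOUR`, `/DAY`) and the `-1MONTH` offset do to a timestamp can be mimicked in Python. This sketch is an approximation for illustration: it handles only two units, the helper names are invented, and clamping the day when the previous month is shorter is an assumption about the calendar arithmetic. The fixed timestamp from the slide stands in for `NOW`.

```python
import calendar
from datetime import datetime, timezone

def round_down(moment, unit):
    """Truncate a datetime to the start of the unit, like /HOUR or /DAY."""
    if unit == "HOUR":
        return moment.replace(minute=0, second=0, microsecond=0)
    if unit == "DAY":
        return moment.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit in this sketch: {unit}")

def minus_one_month(moment):
    """Step back one calendar month, clamping the day to the month's length."""
    year, month = (moment.year, moment.month - 1) if moment.month > 1 else (moment.year - 1, 12)
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)

def to_solr(moment):
    """Format a UTC datetime in the ISO form used in the slides."""
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

born = datetime(2012, 5, 22, 9, 30, 22, tzinfo=timezone.utc)

indexed_value = to_solr(round_down(born, "HOUR"))   # "…Z/HOUR" at index time
range_end = round_down(born, "DAY")                 # NOW/DAY, with born standing in for NOW
range_start = minus_one_month(range_end)            # NOW/DAY-1MONTH

print(indexed_value)
print(to_solr(range_start), "TO", to_solr(range_end))
```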
Faceted search
Facets give the user an overview of the
content and help with browsing without entering
search terms (search theorists: browsing and
searching are equally important).
● term/field facet: list terms and counts
● query facet: run queries, return counts
● range facet: split range into pieces
Term facets
&facet=true
&facet.field=TYPE
"facet_fields":{
"TYPE":[
"IMAGE", 25334764,
"TEXT", 16990647,
"VIDEO", 702787,
"SOUND", 558825,
"3D", 21303
]
http://europeana.eu - Europeana portal
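The term facet response is a flat list that alternates term and count, so a client usually pairs the entries up before displaying them. A minimal sketch, using the values from the response above; `pair_up` is a made-up helper name.

```python
def pair_up(flat):
    """Turn ["IMAGE", 25334764, "TEXT", 16990647, ...] into {"IMAGE": 25334764, ...}."""
    return dict(zip(flat[0::2], flat[1::2]))

facet_fields = {
    "TYPE": [
        "IMAGE", 25334764,
        "TEXT", 16990647,
        "VIDEO", 702787,
        "SOUND", 558825,
        "3D", 21303,
    ]
}

counts = pair_up(facet_fields["TYPE"])
for term, count in counts.items():
    print(f"{term}: {count}")
```

Because Python dictionaries keep insertion order, the terms stay in the order Solr returned them (here: by descending count).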
Term facet
Additional parameters:
● limit, offset → for pagination
● sort (by index or count) → alphabetically or by frequency
● mincount → filter out less frequent terms
● missing → number of documents without a value in this field
● prefix → such as “http” to display URLs only
● f.[field name].facet.[parameter] → overrides the general setting for that field
Query facets
&facet=true&
facet.query=price:[* TO 5}&
facet.query=price:[5 TO 10}&
facet.query=price:[10 TO 20}&
facet.query=price:[20 TO 50}&
facet.query=price:[50 TO *]
"facet_counts":{
"facet_queries":{
"price:[* TO 5}":6,
"price:[5 TO 10}":5,
"price:[10 TO 20}":3,
"price:[20 TO 50}":6,
"price:[50 TO *]":0
},
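The list of `facet.query` parameters does not have to be written by hand; it can be generated from the bucket boundaries. The sketch below reproduces the five queries from the request above. The helper name `bucket_queries` is hypothetical, and the choice of an inclusive lower bound with an exclusive upper bound simply follows the slide.

```python
from urllib.parse import urlencode

def bucket_queries(field, boundaries):
    """Build half-open range queries between consecutive boundaries, open-ended at both ends."""
    edges = ["*"] + [str(b) for b in boundaries] + ["*"]
    queries = []
    for i in range(len(edges) - 1):
        is_last = i == len(edges) - 2
        closing = "]" if is_last else "}"  # only the last bucket includes its upper end
        queries.append(f"{field}:[{edges[i]} TO {edges[i + 1]}{closing}")
    return queries

queries = bucket_queries("price", [5, 10, 20, 50])
params = [("facet", "true")] + [("facet.query", q) for q in queries]
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL

for q in queries:
    print(q)
```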
Query facets (zooming)
From centuries to years
http://pcu.bage.es/ Catálogo Colectivo de las Bibliotecas de la Administración General del Estado
Range facet
&facet=true&
facet.range=price&
facet.range.start=0&
facet.range.end=50&
facet.range.gap=5
"facet_ranges":{
"price":{
"counts":[
"0.0", 6, "5.0", 5,
"10.0", 0, "15.0", 3,
"20.0", 2, "25.0", 2,
"30.0", 1, "35.0", 0,
"40.0", 0, "45.0", 1
],
"gap":5.0,"start":0.0,"end":50.0
}}}}
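How the `start`, `end` and `gap` parameters turn into buckets can be imitated on the client side. This is a sketch of the bucketing logic only: the price list is invented for the example (it is not the data behind the response above), and each bucket is treated as lower-inclusive and upper-exclusive.

```python
def range_facet(values, start, end, gap):
    """Count values per bucket [lower, lower + gap), keyed by the bucket's lower bound."""
    counts = {}
    lower = start
    while lower < end:
        upper = min(lower + gap, end)
        counts[lower] = sum(1 for value in values if lower <= value < upper)
        lower = upper
    return counts

prices = [0.99, 4.5, 7.0, 19.95, 22.0, 49.99]  # invented sample data

counts = range_facet(prices, start=0.0, end=50.0, gap=5.0)
for lower, count in counts.items():
    print(lower, count)
```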
Hit highlighting
?...&hl=true
&hl.fl=name
&hl.simple.pre=<em>
&hl.simple.post=</em>
"highlighting": {
"SP2514N": { ← ID
"name": [
"<em>SpinPoint P120</em> SP2514N - hard drive - 250 GB - ATA-133"]}}
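The effect of `hl.simple.pre` and `hl.simple.post` is to wrap matches in the stored field value with the given markers. A toy version in Python, assuming plain whole-word, case-insensitive matching; this is not Solr's highlighter, and the query terms are chosen only for the example.

```python
import re

def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of the given terms in pre/post markers."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: pre + match.group(0) + post, text)

name = "SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"
snippet = highlight(name, ["hard", "drive"])
print(snippet)
```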
More like this… (similar documents)
MoreLikeThis (mlt) handler parameters:
● doc ID
● fields
● boost
● limit
● min length and freq
http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
More like this (alternative solution)
(DATA_PROVIDER:("NIOD")^0.2 OR what:("IMAGE" OR "Amerikaanse
Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening,
Wederopbouw, Dorpen")^0.8) NOT europeana_id:"/2021622/11607
Multilingual search
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
<charFilter class="solr.PersianCharFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_stop.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" />
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
Multilingual search strategies
● Separate fields by language
→ title_en:horse OR title_de:horse OR title_hu:horse
● Separate collections (core, shard) per language
each core has its own language settings but the same field names
→ /select?shards=.../english,.../spanish,.../french
&q=title:horse
● All languages in one field (from Solr 5.0)
→ title:(es|escuela OR en,es,de|school OR school)
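For the first strategy (separate fields per language), the client has to expand the user's term into one clause per language field. A minimal sketch, assuming the `<field>_<language code>` naming used on the slide; `per_language_query` is a made-up helper.

```python
def per_language_query(field, term, languages):
    """Expand one term into an OR query over the language-specific fields."""
    return " OR ".join(f"{field}_{lang}:{term}" for lang in languages)

query = per_language_query("title", "horse", ["en", "de", "hu"])
print(query)
```

Each `title_xx` field can then have its own analysis chain (stemmer, stop words) for that language.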
Multilingual search
query → translation API → rewritten query
horse → (Hauspferd OR Ló OR Paard OR …)
Relevancy
The most important concepts:
● Term frequency (tf) - how often a particular term appears in a matching
document
● Inverse document frequency (idf) - how “rare” a search term is, inverse
of the document frequency (how many total documents the search term
appears within)
● field normalization factor (field norm) - a combination of factors
describing the importance of a particular field on a per-document basis
Relevancy
score(q,d) = Σ_t ( tf(t in d) × idf(t)^2 × t.getBoost() × norm(t,d) ) × coord(q,d) × queryNorm(q)
where
t = term; d = document; q = query; f = field
tf(t in d) = (num. of term occurrences in document)^(1/2)
norm(t,d) = d.getBoost() × lengthNorm(f) × f.getBoost()
idf(t) = 1 + log(numDocs / (docFreq + 1))
coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
sumOfSquaredWeights = q.getBoost()^2 × Σ_t (idf(t) × t.getBoost())^2
see: Solr in Action, p. 67
Debug
?...&debug=true
...
"debug":{
"rawquerystring":"hard drive",
"querystring":"hard drive",
"parsedquery":"text:hard text:drive",
"parsedquery_toString":"text:hard text:drive",
Debug
"explain":{
"6H500F0":"
1.209934 = (MATCH) sum of:
  0.6588537 = (MATCH) weight(text:hard in 2) [DefaultSimilarity], result of:
    0.6588537 = score(doc=2,freq=2.0), product of:
      0.73792744 = queryWeight, product of:
        3.3671236 = idf(docFreq=2, maxDocs=32)
        0.21915662 = queryNorm
      0.8928435 = fieldWeight in 2, product of:
        1.4142135 = tf(freq=2.0), with freq of:
          2.0 = termFreq=2.0
        3.3671236 = idf(docFreq=2, maxDocs=32)
        ...
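The numbers in this explain output can be recomputed from the tf and idf definitions on the Relevancy slide, which is a useful sanity check when reading debug output. The sketch below takes `queryNorm` and `fieldWeight` as given from the output (the factor after idf inside fieldWeight is elided there, so it is not reconstructed here) and assumes the natural logarithm for idf.

```python
import math

def idf(doc_freq, num_docs):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), natural logarithm assumed."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def tf(freq):
    """tf(t in d) = square root of the term frequency."""
    return math.sqrt(freq)

# Values copied from the explain output for the term "hard"
query_norm = 0.21915662
field_weight = 0.8928435

idf_hard = idf(doc_freq=2, num_docs=32)
tf_hard = tf(2.0)
query_weight = idf_hard * query_norm      # "queryWeight, product of: idf, queryNorm"
term_score = query_weight * field_weight  # "score(doc=2,freq=2.0), product of: ..."

print(round(idf_hard, 7), round(tf_hard, 7))
print(round(query_weight, 7), round(term_score, 7))
```

The results agree with the explain output up to the single-precision rounding Lucene uses internally.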
References
● http://lucene.apache.org/solr/
● Grainger & Potter: Solr in Action
● https://lucidworks.com/blog/
● http://blog.sematext.com/
● http://solr.pl/
● https://www.packtpub.com/all?search=solr
● http://www.slideshare.net/treygrainger
Happy searching!
40

Mais conteúdo relacionado

Mais procurados

Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
MongoDB
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
MongoDB
 

Mais procurados (20)

Querying Nested JSON Data Using N1QL and Couchbase
Querying Nested JSON Data Using N1QL and CouchbaseQuerying Nested JSON Data Using N1QL and Couchbase
Querying Nested JSON Data Using N1QL and Couchbase
 
MongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced AggregationMongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced Aggregation
 
OSDC 2012 | Building a first application on MongoDB by Ross Lawley
OSDC 2012 | Building a first application on MongoDB by Ross LawleyOSDC 2012 | Building a first application on MongoDB by Ross Lawley
OSDC 2012 | Building a first application on MongoDB by Ross Lawley
 
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)
 
Full metal mongo
Full metal mongoFull metal mongo
Full metal mongo
 
Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB Data Processing and Aggregation with MongoDB
Data Processing and Aggregation with MongoDB
 
The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
Couchbase N1QL: Language & Architecture Overview.
Couchbase N1QL: Language & Architecture Overview.Couchbase N1QL: Language & Architecture Overview.
Couchbase N1QL: Language & Architecture Overview.
 
Oracle Developer Day, 20 October 2009, Oracle De Meern, Holland: Oracle Datab...
Oracle Developer Day, 20 October 2009, Oracle De Meern, Holland: Oracle Datab...Oracle Developer Day, 20 October 2009, Oracle De Meern, Holland: Oracle Datab...
Oracle Developer Day, 20 October 2009, Oracle De Meern, Holland: Oracle Datab...
 
UKOUG Tech14 - Using Database In-Memory Column Store with Complex Datatypes
UKOUG Tech14 - Using Database In-Memory Column Store with Complex DatatypesUKOUG Tech14 - Using Database In-Memory Column Store with Complex Datatypes
UKOUG Tech14 - Using Database In-Memory Column Store with Complex Datatypes
 
Starting with JSON Path Expressions in Oracle 12.1.0.2
Starting with JSON Path Expressions in Oracle 12.1.0.2Starting with JSON Path Expressions in Oracle 12.1.0.2
Starting with JSON Path Expressions in Oracle 12.1.0.2
 
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...
 
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
 
Indexing and Performance Tuning
Indexing and Performance TuningIndexing and Performance Tuning
Indexing and Performance Tuning
 
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSON
 
Json in 18c and 19c
Json in 18c and 19cJson in 18c and 19c
Json in 18c and 19c
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 

Semelhante a Apache solr

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 

Semelhante a Apache solr (20)

Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
Using Search API, Search API Solr and Facets in Drupal 8
Using Search API, Search API Solr and Facets in Drupal 8Using Search API, Search API Solr and Facets in Drupal 8
Using Search API, Search API Solr and Facets in Drupal 8
 
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve contentOpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
 
Elastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approachElastic search and Symfony3 - A practical approach
Elastic search and Symfony3 - A practical approach
 
Solr5
Solr5Solr5
Solr5
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overview
 
Solr 3.1 and beyond
Solr 3.1 and beyondSolr 3.1 and beyond
Solr 3.1 and beyond
 
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
Rebuilding Solr 6 examples - layer by layer (LuceneSolrRevolution 2016)
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
 
Solr introduction
Solr introductionSolr introduction
Solr introduction
 
Confluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & LearnConfluent & MongoDB APAC Lunch & Learn
Confluent & MongoDB APAC Lunch & Learn
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Oracle by Muhammad Iqbal
Oracle by Muhammad IqbalOracle by Muhammad Iqbal
Oracle by Muhammad Iqbal
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Oslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alphaOslo Solr MeetUp March 2012 - Solr4 alpha
Oslo Solr MeetUp March 2012 - Solr4 alpha
 
Big Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREBig Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWARE
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Elasticsearch - SEARCH & ANALYZE DATA IN REAL TIME
Elasticsearch - SEARCH & ANALYZE DATA IN REAL TIMEElasticsearch - SEARCH & ANALYZE DATA IN REAL TIME
Elasticsearch - SEARCH & ANALYZE DATA IN REAL TIME
 

Mais de Péter Király

Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Péter Király
 

Mais de Péter Király (20)

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)
 
Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)
 
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
 
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
 
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
 

Último

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 

Último (20)

%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 

Apache solr

  • 1. Apache Solr Oberseminar, 12.06.2015 Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen Péter Király, pkiraly@gwdg.de
  • 2. What is Apache Solr? Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene 2
  • 3. ● 1999: Doug Cutting published Lucene ● 2004: Yonik Seeley published Solr ● 2006: Apache project (2007: TLP) ● 2009: LucidWorks company ● 2010: Merge of Lucene and Solr ● 2011: 3.1 ● 2012: 4.0 ● 2015: 5.0 History in one minute 3
  • 4. “Sister” projects ● Nutch: web scale search engine ● Tika: document parser ● Hadoop: distributes storage and data processing ● Elasticsearch: alternative to Solr ● forks/ports of Lucene ● client libraries and tools (Luke index viewer) 4
  • 5. Main features I ● Faceted navigation ● Hit highlighting ● Query language ● Schema-less mode and Schema REST API ● JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary outputs ● HTML administration interface 5
  • 6. Main features II ● Replication to other Solr servers ● Distributed search through sharding ● Search results clustering based on Carrot2 ● Extensible through plugins ● Relevance boosting via functions ● Caching - queries, filters, and documents ● Embeddable in a Java Application 6
  • 7. Main features III
      ● Geo-spatial search, including multiple points per document and polygons
      ● Automated management of large clusters through ZooKeeper
      ● Function queries
      ● Field collapsing and grouping
      ● Auto-suggest
  • 8. Inverted index: original documents
      Doc #  Content field
      1      A Fun Guide to Cooking
      2      Decorating Your Home
      3      How to Raise a Child
      4      Buying a New Car
  • 9. Inverted index: index structure
      Term        Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7
      a           0     1     1     1     0     0     0
      becoming    0     0     0     0     1     0     0
      beginner’s  0     0     0     0     0     1     0
      buy         0     0     1     0     0     0     0
      Callouts on the slide: "stored as a bit vector"; "stored as reference to a tree structure"
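To make the idea concrete, here is a minimal sketch of an inverted index built from the four documents on the "original documents" slide. It assumes naive whitespace tokenization and lowercasing, so its postings need not match the table above row for row, and Lucene's real on-disk structures are far more compact. The helper names `build_inverted_index` and `search_all` are hypothetical.

```python
from collections import defaultdict

# The four example documents from the "original documents" slide.
DOCS = {
    1: "A Fun Guide to Cooking",
    2: "Decorating Your Home",
    3: "How to Raise a Child",
    4: "Buying a New Car",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, *terms):
    """Return the ids of documents that contain every given term (boolean AND)."""
    postings = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


INDEX = build_inverted_index(DOCS)
```

A lookup such as `search_all(INDEX, "a", "to")` only intersects two posting lists; it never scans the document texts, which is what makes this structure fast.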
  • 10. Indexing
      Document ≈ RDBMS record
      Fields (key-value structure):
      ● types (text, numeric, date, point, custom)
      ● properties: indexed, stored, multiple (multivalued), required
      ● field name patterns (prefixes, suffixes, such as *_tx)
      ● special fields (identifier, _version_)
  • 11. Indexing
      ● formats: JSON, XML, binary, RDBMS, ...
      ● connections: file, Data Import Handler, API
      ● sharding (separating documents into multiple parts)
      ● denormalized documents: (almost) no JOIN ;-(
      ● copy field
      ● catch-all field (contains everything)
  • 12. A document example (XML)
      <doc>
        <field name="id">F8V7067-APL-KIT</field>                                  <!-- string -->
        <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>      <!-- text -->
        <field name="cat">electronics</field>
        <field name="cat">connector</field>                                       <!-- multivalue -->
        <field name="price">19.95</field>                                         <!-- float -->
        <field name="inStock">false</field>                                       <!-- boolean -->
        <field name="store">45.18014,-93.87741</field>                            <!-- geo point -->
        <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>             <!-- date -->
      </doc>
  • 13. A document example (JSON)
      {
        "id": "F8V7067-APL-KIT",
        "name": "Belkin Mobile Power Cord for iPod w/ Dock",
        "cat": ["electronics", "connector"],
        "price": 19.95,
        "inStock": false,
        "store": "45.18014,-93.87741",
        "manufacturedate_dt": "2005-08-01T16:30:25Z"
      }
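As a rough sketch, the JSON document above could be sent to Solr's `/update` handler as a JSON array of documents. The host, port and core name (`mycore`) below are assumptions for illustration, and the request is only built, not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: a local Solr instance with a core called "mycore".
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"

DOC = {
    "id": "F8V7067-APL-KIT",
    "name": "Belkin Mobile Power Cord for iPod w/ Dock",
    "cat": ["electronics", "connector"],
    "price": 19.95,
    "inStock": False,
    "store": "45.18014,-93.87741",
    "manufacturedate_dt": "2005-08-01T16:30:25Z",
}


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request carrying a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request([DOC])
# To actually index the document: urllib.request.urlopen(request)
```

The `commit=true` parameter makes the document searchable immediately; for bulk loads it is usually cheaper to commit once at the end.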
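  • 14. A document example (SolrJ library)
      import org.apache.solr.client.solrj.SolrServer;
      import org.apache.solr.client.solrj.impl.HttpSolrServer;
      import org.apache.solr.common.SolrInputDocument;
      // Connect to the Solr server (the URL is elided on the slide)
      SolrServer solr = new HttpSolrServer("http://…");
      // Build the document field by field
      SolrInputDocument doc = new SolrInputDocument();
      doc.setField("id", "F8V7067-APL-KIT");
      doc.setField("name", "Belkin Mobile Power Cord for iPod w/ Dock");
      ...
      // Send the document, then commit (waitFlush = true, waitSearcher = true)
      solr.add(doc);
      solr.commit(true, true);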
  • 15. Text analysis chain
      1) character filters: preprocess the text
         pattern replace, ASCII folding, HTML stripping
      2) tokenizers: split the text into smaller units
         whitespace, lowercase, word delimiter, standard
      3) token filters: examine, modify or eliminate tokens
         stemming, lowercase, stop words
  • 16. Text analysis chain
      <fieldType name="my-text-type" class="solr.TextField">
        <analyzer type="index">
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt" />
          <tokenizer class="solr.WhitespaceTokenizerFactory" />
          <filter class="solr.StopFilterFactory" words="stopwords.txt" />
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 17. Text analysis result
      Input:  #Yummm :) Drinking a latte at Caffé Grecco in SF’s historic North Beach… Learning text analysis
      Output: "#yumm", "drink", "latte", "caffe", "grecco", "sf"/"san francisco", "historic", "north", "beach", "learn", "text", "analysis"
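The three stages of the analysis chain can be imitated in a few lines. This is only a sketch of the idea, not Solr's implementation: the stop-word list is a tiny assumed example, lowercasing is applied before stop-word removal for simplicity, and there is no stemming or synonym handling, so the output is cruder than the result shown above.

```python
import unicodedata

# Assumption: a tiny illustrative stop-word list, not Solr's stopwords.txt.
STOP_WORDS = {"a", "at", "in", "the"}


def fold_to_ascii(text):
    """Character filter: strip diacritics, e.g. 'Caffé' becomes 'Caffe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def whitespace_tokenize(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Token filter: drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Run character filter, tokenizer and token filters in order."""
    tokens = whitespace_tokenize(fold_to_ascii(text))
    return stop_filter(lowercase_filter(tokens))
```

Because the same `analyze` function would be applied to both documents and queries, a search for "caffé" and a document containing "Caffe" end up with the same token.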
  • 18. Performing queries
      1) the user enters a query (and may specify other components)
      2) the query handler receives the request
      3) analysis (similar to the chain used at indexing)
      4) the search runs
      5) additional components are applied
      6) serialization (XML, JSON etc.)
  • 19. Lucene query language
      ● *:* (→ everything)
      ● gwdg
      ● name:gwdg
      ● name:admin*
      ● h?ld (→ hold, held)
      ● name:administrator~ (fuzzy search → -tor, -tion)
      ● name:Gesellschaft~0.6 (similarity measure)
  • 20. Lucene query language
      ● name:Max AND name:Planck
      ● name:Max OR name:Planck
      ● name:Max NOT name:Planck
      ● name:"Max Planck"
      ● name:("Max Planck" OR Gesellschaft)
      ● "Max Planck"~3 (within 3 words) → also matches "Planck Max", "Max Ludwig Planck"
  • 21. Lucene query language
      ● max planck^10 (weighting)
      ● price:[10 TO 20] (inclusive bounds → 10..20)
      ● price:{10 TO 20} (exclusive bounds → 11..19 for integers)
      ● born:[1900-01-01T00:00:00Z TO 1949-12-31T23:59:59Z] (date range)
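Queries like the ones above contain spaces, quotes, brackets and colons, so they have to be URL-encoded before being sent over HTTP. A minimal sketch, assuming a local Solr core named `mycore` (the URL is an assumption; `q`, `rows` and `wt` are standard Solr request parameters):

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local Solr instance with a core called "mycore".
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

QUERY = 'name:"Max Planck"~3 AND price:[10 TO 20]'


def build_select_url(query, rows=10, base_url=SOLR_SELECT_URL):
    """Return a select URL with the Lucene query safely percent-encoded."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


url = build_select_url(QUERY)
# Decoding the query string again gives back the original query unchanged.
decoded = parse_qs(urlsplit(url).query)
```

Letting `urlencode` do the escaping avoids hand-written replacements that tend to miss characters such as `"` or `[`.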
  • 22. Date mathematics
      Indexing with hour granularity:
        "born": "2012-05-22T09:30:22Z/HOUR"
      Searching by a relative time range, e.g. the last month:
        born:[NOW/DAY-1MONTH TO NOW/DAY]
      Keywords: MINUTE, HOUR, DAY, WEEK, MONTH, YEAR
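To show what such expressions mean, here is a small sketch of an evaluator for a subset of the syntax: `/UNIT` rounds down to the start of the unit and `+N UNIT` or `-N UNIT` shifts the date. It supports only HOUR, DAY and MONTH, takes the value of NOW as an argument, and clamps the day when a month is shorter; it is an illustration, not Solr's own parser, and `evaluate_date_math` is a hypothetical name.

```python
import calendar
import re
from datetime import datetime, timedelta, timezone

# One step of an expression: either "/UNIT" (round down) or "+N UNIT" / "-N UNIT" (shift).
STEP = re.compile(r"/(HOUR|DAY|MONTH)|([+-]\d+)(HOUR|DAY|MONTH)S?")


def round_down(dt, unit):
    """Truncate dt to the start of the given unit."""
    if unit == "HOUR":
        return dt.replace(minute=0, second=0, microsecond=0)
    if unit == "DAY":
        return dt.replace(hour=0, minute=0, second=0, microsecond=0)
    return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)  # MONTH


def shift(dt, amount, unit):
    """Add amount (possibly negative) units to dt."""
    if unit == "HOUR":
        return dt + timedelta(hours=amount)
    if unit == "DAY":
        return dt + timedelta(days=amount)
    # MONTH: move by whole months and clamp the day to the month's length.
    year, month0 = divmod(dt.year * 12 + (dt.month - 1) + amount, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return dt.replace(year=year, month=month0 + 1, day=min(dt.day, last_day))


def evaluate_date_math(expression, now):
    """Evaluate expressions such as 'NOW/DAY-1MONTH' relative to now."""
    if not expression.startswith("NOW"):
        raise ValueError("expression must start with NOW")
    result, position = now, len("NOW")
    while position < len(expression):
        match = STEP.match(expression, position)
        if match is None:
            raise ValueError("cannot parse: " + expression[position:])
        if match.group(1):
            result = round_down(result, match.group(1))
        else:
            result = shift(result, int(match.group(2)), match.group(3))
        position = match.end()
    return result


NOW = datetime(2012, 5, 22, 9, 30, 22, tzinfo=timezone.utc)
```

Steps are applied left to right, so `NOW/DAY-1MONTH` first truncates to midnight and then goes back one month.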
  • 23. Faceted search
      Facets give users an overview of the content and help them browse without entering search terms (search theorists hold that browsing and searching are equally important).
      ● term/field facet: list terms and counts
      ● query facet: run queries, return counts
      ● range facet: split a range into pieces
  • 24. Term facets
      &facet=true
      &facet.field=TYPE
      "facet_fields":{
        "TYPE":[
          "IMAGE", 25334764,
          "TEXT", 16990647,
          "VIDEO", 702787,
          "SOUND", 558825,
          "3D", 21303
        ]
      http://europeana.eu - Europeana portal
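Note that the response stores each facet as one flat list in which terms and counts alternate. A short sketch of turning that list into (term, count) pairs on the client side, using the values from the slide (`facet_pairs` is a hypothetical helper name):

```python
# Fragment of the facet response shown on the slide.
RESPONSE_FRAGMENT = {
    "facet_fields": {
        "TYPE": [
            "IMAGE", 25334764,
            "TEXT", 16990647,
            "VIDEO", 702787,
            "SOUND", 558825,
            "3D", 21303,
        ]
    }
}


def facet_pairs(flat):
    """Pair up a flat [term, count, term, count, ...] list, keeping its order."""
    return list(zip(flat[::2], flat[1::2]))


type_counts = facet_pairs(RESPONSE_FRAGMENT["facet_fields"]["TYPE"])
```

Keeping a list of pairs rather than a plain dictionary preserves the order in which Solr sorted the facet values.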
  • 25. Term facet
      Additional parameters:
      ● limit, offset → for pagination
      ● sort (by index or count) → alphabetical or by frequency
      ● mincount → hide terms that occur less often than this
      ● missing → also count the documents that have no value in this field
      ● prefix → such as "http" to display URLs only
      ● f.[facet name].facet.[parameter] → overrides the general setting for a single facet
  • 26. Query facets
      &facet=true&
      facet.query=price:[* TO 5}&
      facet.query=price:[5 TO 10}&
      facet.query=price:[10 TO 20}&
      facet.query=price:[20 TO 50}&
      facet.query=price:[50 TO *]
      "facet_counts":{
        "facet_queries":{
          "price:[* TO 5}":6,
          "price:[5 TO 10}":5,
          "price:[10 TO 20}":3,
          "price:[20 TO 50}":6,
          "price:[50 TO *]":0
        },
  • 27. Query facets (zooming)
      From centuries to years
      http://pcu.bage.es/ - Catálogo Colectivo de las Bibliotecas de la Administración General del Estado
  • 28. Range facet
      &facet=true&
      facet.range=price&
      facet.range.start=0&
      facet.range.end=50&
      facet.range.gap=5
      "facet_ranges":{
        "price":{
          "counts":[
            "0.0", 6,
            "5.0", 5,
            "10.0", 0,
            "15.0", 3,
            "20.0", 2,
            "25.0", 2,
            "30.0", 1,
            "35.0", 0,
            "40.0", 0,
            "45.0", 1
          ],
          "gap":5.0,"start":0.0,"end":50.0
      }}}}
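What the range facet computes can be sketched on the client side: the interval from start to end is cut into buckets of width gap, and each value is counted in the bucket whose lower bound it reaches, with the lower bound inclusive and the upper bound exclusive. The price list below is made up for illustration and is not the data behind the slide.

```python
def range_facet(values, start, end, gap):
    """Count values per bucket [lower, lower + gap) between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[lower] = 0
        lower += gap
    for value in values:
        if start <= value < end:
            # Lower bound of the bucket this value falls into.
            bucket = start + ((value - start) // gap) * gap
            counts[bucket] += 1
    return counts


# Assumption: hypothetical prices, chosen to hit bucket edges.
PRICES = [0.5, 4.99, 5.0, 19.95, 49.0, 50.0, 75.0]
price_counts = range_facet(PRICES, start=0.0, end=50.0, gap=5.0)
```

Values at or beyond the end of the range (50.0 and 75.0 here) are not counted in any bucket, which matches the half-open intervals used in the query facets with `[5 TO 10}`.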
  • 29. Hit highlighting
      ?...&hl=true
      &hl.fl=name
      &hl.simple.pre=<em>
      &hl.simple.post=</em>
      "highlighting": {
        "SP2514N": {        ← ID
          "name": [
            "<em>SpinPoint P120 </em> SP2514N - hard drive - 250 GB - ATA-133"]}
  • 30. More like this… (similar documents)
      Parameters of the mlt (more like this) handler:
      ● doc ID
      ● fields
      ● boost
      ● limit
      ● min length and freq
      http://catalog.lib.kyushu-u.ac.jp/en/ - Kyushu University library catalog
  • 31. More like this (alternative solution)
      (DATA_PROVIDER:("NIOD")^0.2 OR
       what:("IMAGE" OR "Amerikaanse Strijdkrachten" OR "Luchtmacht" OR "Steden - Zie ook: Ruimtelijke ordening, Wederopbouw, Dorpen")^0.8)
      NOT europeana_id:"/2021622/11607
  • 32. Multilingual search
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.ArabicNormalizationFilterFactory"/>
      <filter class="solr.ArabicStemFilterFactory"/>
      <charFilter class="solr.PersianCharFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_stop.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" />
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian" />
  • 33. Multilingual search strategies
      ● Separate fields per language
        → title_en:horse OR title_de:horse OR title_hu:horse
      ● Separate collections (core, shard) per language; every core has its own language settings but the same field names
        → /select?shards=.../english,.../spanish,.../french&q=title:horse
      ● All languages in one field (from Solr 5.0)
        → title:(es|escuela OR en,es,de|school OR school)
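For the first strategy, the query over the language-specific fields can be generated rather than typed by hand. A minimal sketch, assuming fields are named with a language-code suffix as in the example above (`per_language_query` is a hypothetical helper):

```python
def per_language_query(field, term, languages):
    """Build 'field_xx:term OR field_yy:term ...' for the given language codes."""
    return " OR ".join(f"{field}_{language}:{term}" for language in languages)


horse_query = per_language_query("title", "horse", ["en", "de", "hu"])
```

The term is inserted as given; a real application would also escape Lucene's special characters and quote multi-word terms.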
  • 34. Multilingual search
      query → translation API → rewritten query
      horse → (Hauspferd OR Ló OR Paard OR …)
  • 35. Relevancy
      The most important concepts:
      ● Term frequency (tf): how often a particular term appears in a matching document
      ● Inverse document frequency (idf): how "rare" a search term is; the inverse of the document frequency (the number of documents in which the search term appears)
      ● Field normalization factor (field norm): a combination of factors describing the importance of a particular field on a per-document basis
  • 36. Relevancy
      score(q,d) = Σ_t (tf(t in d) × idf(t)^2 × t.getBoost() × norm(t,d)) × coord(q,d) × queryNorm(q)
      where t = term; d = document; q = query; f = field
      tf(t in d) = (number of occurrences of t in d)^(1/2)
      norm(t,d) = d.getBoost() × lengthNorm(f) × f.getBoost()
      idf(t) = 1 + log(numDocs / (docFreq + 1))
      coord(q,d) = numTermsInDocumentFromQuery / numTermsInQuery
      queryNorm(q) = 1 / sumOfSquaredWeights^(1/2)
      sumOfSquaredWeights = q.getBoost()^2 × Σ_t (idf(t) × t.getBoost())^2
      see: Solr in Action, p. 67
  • 38. debug
      "explain":{
        "6H500F0":"
      1.209934 = (MATCH) sum of:
        0.6588537 = (MATCH) weight(text:hard in 2) [DefaultSimilarity], result of:
          0.6588537 = score(doc=2,freq=2.0), product of:
            0.73792744 = queryWeight, product of:
              3.3671236 = idf(docFreq=2, maxDocs=32)
              0.21915662 = queryNorm
            0.8928435 = fieldWeight in 2, product of:
              1.4142135 = tf(freq=2.0), with freq of:
                2.0 = termFreq=2.0
              3.3671236 = idf(docFreq=2, maxDocs=32)
              ...
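The numbers in this explain output can be checked against the tf and idf definitions from the relevancy slide. The sketch below recomputes them in Python, assuming the natural logarithm for idf. The queryNorm value is copied from the output; the fieldNorm value is not visible in the truncated output, so the 0.1875 used here is an assumption inferred from fieldWeight / (tf × idf).

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(freq):
    """Term frequency factor: square root of the raw term frequency."""
    return math.sqrt(freq)


QUERY_NORM = 0.21915662  # copied from the explain output
FIELD_NORM = 0.1875      # assumption: inferred, hidden behind the "..." in the output

term_idf = idf(doc_freq=2, num_docs=32)             # explain shows 3.3671236
query_weight = term_idf * QUERY_NORM                # explain shows 0.73792744
field_weight = tf(2.0) * term_idf * FIELD_NORM      # explain shows 0.8928435
term_score = query_weight * field_weight            # explain shows 0.6588537
```

Since idf appears once in queryWeight and once in fieldWeight, their product contains idf squared, which is where the idf(t)^2 factor in the score formula comes from.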
  • 39. References
      ● http://lucene.apache.org/solr/
      ● Grainger & Potter: Solr in Action
      ● https://lucidworks.com/blog/
      ● http://blog.sematext.com/
      ● http://solr.pl/
      ● https://www.packtpub.com/all?search=solr
      ● http://www.slideshare.net/treygrainger