Dealing with a search engine in your application - a Solr approach for beginners

Elaine Naomi Watanabe

Elaine Naomi Watanabe
Full-stack developer
(Playax)
Master's degree in Computer Science
(IME-USP)
Passionate about:
Web Development, Agile,
Cloud Computing, DevOps,
NoSQL and RDBMS

ANALYZING BILLIONS OF DATA POINTS TO HELP
ARTISTS AND MUSIC PROFESSIONALS
DEVELOP THEIR AUDIENCE
BIG DATA + MUSIC + TECH = <3

SPOILER ALERT
Searching Problem: Introduction
Information Retrieval: Basic concepts
Apache Solr: How to configure
Sunspot Gem: Integrating with Ruby on Rails
Next Steps: Including references

HOW TO SEARCH LIKE GOOGLE?
or did you mean...

IMAGINE IN OUR CONTEXT…
A LIST OF SONGS…
TITLES, ARTISTS, LYRICS...

"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
A small sample of songs...

Is a SQL LIKE statement enough?
SEARCH TERM: "IMAGINE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%' OR
artist LIKE '%IMAGINE%' OR
lyrics LIKE '%IMAGINE%';

"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
Searching for "Imagine"

Is a SQL LIKE statement really enough?
SEARCH TERM: "IMAGINE PEOPLE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%PEOPLE%' OR
artist LIKE '%IMAGINE%PEOPLE%' OR
lyrics LIKE '%IMAGINE%PEOPLE%';
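
To see where this pattern breaks down, here is a minimal Ruby sketch that emulates the LIKE query in memory. The SONGS array and the like_to_regexp, like_search and titles helpers are hypothetical names made up for this illustration, and the matching is assumed to be case-insensitive, as it is with a case-insensitive database collation.

```ruby
# Hypothetical in-memory sample mirroring the three songs above.
SONGS = [
  { title: "Imagine", artist: "John Lennon" },
  { title: "Radioactive", artist: "Imagine Dragons" },
  { title: "Welcome to the Jungle", artist: "Guns N' Roses" }
].freeze

# Convert a SQL LIKE pattern ('%' = any run of characters) into a Regexp.
# Assumption: case-insensitive matching.
def like_to_regexp(pattern)
  parts = pattern.split("%", -1).map { |part| Regexp.escape(part) }
  Regexp.new("\\A#{parts.join('.*')}\\z", Regexp::IGNORECASE | Regexp::MULTILINE)
end

# Emulates: title LIKE '%A%B%' OR artist LIKE '%A%B%'
def like_search(term)
  pattern = like_to_regexp("%#{term.split.join('%')}%")
  SONGS.select { |song| song.values.any? { |value| value.match?(pattern) } }
end

def titles(term)
  like_search(term).map { |song| song[:title] }
end

p titles("Imagine")          # single term: matches title or artist
p titles("Imagine Dragons")  # terms in the stored order
p titles("Dragons Imagine")  # same terms, reversed order
p titles("Imagine John")     # terms spread over two columns
```

Reversing the words or spreading them across two columns is already enough to lose the match, because each column is compared against the whole pattern.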

USER x YOUR APP
A BUG SEARCH TOOL

When a LIKE statement is not enough...
SEARCH TERMS:
"Dragons Imagine"
"Imagine John"
"Imagine JONH" <- TYPO!

When Yahoo! Answers is the only solution...
Ueca tudi diango...
tanananananananaann nisss
♬ welcome to the jungle,
watch it bring you to your knees ♬
(╯°▽°)╯ ︵ ┻━┻

Unstructured data
Large number of documents

IN THE PAST...
Listing all documents that match a search query
was enough…
However, in the Big Data era…

NOWADAYS …
Ranking documents by their relevance
to a search query is the most important goal.

TOKENIZATION: Tokens ~> Words
"A list of words!" ~> A | list | of | words
Tokens: semantic units

"DON'T PANIC"
Is it enough to remove punctuation and spaces?
DON'T PANIC ~> DON | T | PANIC
How to tokenize contractions?
don't = do not
"do not panic" ~> do | not | panic
Are they the same tokens?
Are all of them semantic units?
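
A minimal Ruby sketch of the tokenization question above. The tokenize helper and the tiny CONTRACTIONS table are hypothetical and only illustrate the idea; lowercasing the input is an extra assumption made here, and a real analyzer needs far more rules.

```ruby
# Hypothetical contraction table; a real analyzer would need a much larger one.
CONTRACTIONS = { "don't" => "do not", "can't" => "can not" }.freeze

# Split text into lowercase alphanumeric tokens.
# With expand_contractions, known contractions are rewritten first.
def tokenize(text, expand_contractions: false)
  text = text.downcase
  if expand_contractions
    CONTRACTIONS.each { |short, long| text = text.gsub(short, long) }
  end
  text.scan(/[[:alnum:]]+/)
end

p tokenize("A list of words!")
p tokenize("DON'T PANIC")
p tokenize("DON'T PANIC", expand_contractions: true)
p tokenize("you-know-who")
```

Stripping punctuation alone leaves a stray "t" token, while expanding the contraction first yields the same tokens as "do not panic".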

"Imagine Dragons"
One token: Imagine Dragons
or
Two tokens: Imagine | Dragons

"you-know-who"
One token: you-know-who
or
Three tokens: you | know | who

"Fullstack developer"
"Full-stack dev"
"Full stack developer"
Possible tokens:
Fullstack | developer
Full | stack | developer
Full-stack | developer
For a user, these terms should return the same documents, shouldn't they?

How to deal with numbers and abbreviations?
30 seconds to Mars = Thirty seconds to Mars
November, 18th, 2017 = 2017-11-18
SP = São Paulo

About the semantics of the original term and its normalized token...
Kaminari: is it a gem, or "thunder" in Japanese?
Windows: is it the plural of window, or
about the company?

音楽
ONGAKU
おんがく
SAME LANGUAGE, SAME PRONUNCIATION
DIFFERENT ALPHABETS

STOP WORDS: extremely common words
In English:
a, an, the, and, or, are, as, at, by,
for, from, of ...
In Portuguese:
um, uma, a, o, as, os, é, são, por, de,
da, do, se …

STOP WORDS: extremely common words
"A list of words" ~> list | words
(meaningful tokens)

Stop words, diacritics, case folding...
Case folding normalization:
HELLO WORLD, Hello World, hello world ~> hello | world
Diacritics removal:
naive, naïve ~> naive
Stop word removal:
roses are red, red roses ~> roses | red
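
The three normalization steps can be sketched in Ruby as follows. The normalize helper and the short STOP_WORDS list (taken from the English examples earlier in the deck) are assumptions for illustration, not the analyzer chain that Solr applies.

```ruby
# Assumed stop word list, copied from the English examples in this deck.
STOP_WORDS = %w[a an the and or are as at by for from of].freeze

# Remove diacritics: decompose characters (NFD), then drop the combining marks.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Diacritics removal, case folding, tokenization, stop word removal.
def normalize(text)
  strip_diacritics(text)
    .downcase
    .scan(/[[:alpha:]]+/)
    .reject { |token| STOP_WORDS.include?(token) }
end

p normalize("HELLO World")
p normalize("na\u00EFve")      # "naïve"
p normalize("roses are red")
p normalize("pe\u00F1a")       # "peña" collapses into another word
```

The last call shows the cost of the technique: once diacritics are removed, two different words become the same token.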

When not to normalize tokens…
Strings composed solely of stop words:
The Who (a band)
Se (Brazilian song, by Djavan)
Different meanings for words with and without diacritics:
In Spanish:
peña means a "cliff"
pena means "sorrow"
When not to set all characters to lowercase:
General Motors
Windows
Apple

LEMMATIZATION / STEMMING
To reduce a token to its base form

LEMMATIZATION: based on a vocabulary
English:
am, are, is ~> be
car, cars, car's, cars' ~> car
Portuguese:
sou, somos, foi, é ~> ser
carros, carro ~> carro

STEMMING
Heuristic process that chops off the ends of words
cats ~> cat
ponies ~> poni
It increases the number of returned documents (recall),
but harms precision...

STEMMING
Heuristic process that chops off the ends of words
Portuguese:
amor, amores, amora ~> amor
amor means "love"; amora is a Brazilian berry
English:
operating system ~> operat | system
Not so meaningful tokens
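
As a small illustration of the "chop off the ends" heuristic, the sketch below implements only the plural rules from step 1a of the Porter stemmer (sses ~> ss, ies ~> i, ss ~> ss, s ~> nothing). The stem_plural name is made up for this example, and a full stemmer has several more steps, so treat it as a toy.

```ruby
# Porter stemmer, step 1a only: plural suffix rules.
# The first matching rule wins, so the order of the branches matters.
def stem_plural(word)
  case word
  when /sses\z/ then word.sub(/sses\z/, "ss")
  when /ies\z/  then word.sub(/ies\z/, "i")
  when /ss\z/   then word
  when /s\z/    then word.chomp("s")
  else word
  end
end

%w[cats ponies caresses caress system].each do |word|
  puts "#{word} ~> #{stem_plural(word)}"
end
```

Rules like these know nothing about vocabulary, which is why "ponies" ends up as the non-word "poni".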

SYNONYMS
bike = bicycle
indivíduo = pessoa

Bag of words:
List of keywords
Ordering of words is ignored!
e.g.
Imagine Dragons
Dragons Imagine
Phrase queries:
Order matters!
Restrict searches
e.g.
"Imagine Dragons"

RELEVANCE
term frequency (tf)
number of occurrences of a term in a document
inverse document frequency (idf)
how rare a term is across all indexed documents

RELEVANCE
tf-idf = tf x idf
function that balances how frequent a term is in a
document against how rare the term is in the collection
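
A minimal Ruby sketch of the tf x idf product. It assumes one common variant, idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term; search engines use their own variants and extra normalization. The DOCS collection and the method names are hypothetical.

```ruby
# Tiny hypothetical collection: each document is already tokenized.
DOCS = [
  %w[imagine all the people living life in peace],
  %w[i am radioactive radioactive],
  %w[welcome to the jungle]
].freeze

# tf: number of occurrences of the term in one document.
def term_frequency(term, tokens)
  tokens.count(term)
end

# idf (assumed variant): ln(N / df); 0.0 when no document contains the term.
def inverse_document_frequency(term, docs)
  containing = docs.count { |tokens| tokens.include?(term) }
  return 0.0 if containing.zero?

  Math.log(docs.size.to_f / containing)
end

def tf_idf(term, tokens, docs)
  term_frequency(term, tokens) * inverse_document_frequency(term, docs)
end

puts tf_idf("radioactive", DOCS[1], DOCS) # frequent in the document, rare in the collection
puts tf_idf("the", DOCS[0], DOCS)         # common across the collection
```

A term that is repeated in one document and absent from the others scores much higher than a term that appears almost everywhere.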

Boolean Model
Probabilistic Model
PageRank
...

Evaluation method
We need a test dataset with:
1. A document collection
2. A collection of queries
3. A set of relevance judgments: for each query, a list of relevant and
non-relevant documents
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

ACCURACY
(TP + TN) / (TP + FP + TN + FN)

PRECISION
TP / (TP + FP)
# Correct Matches / # Total Results Returned

RECALL
TP / (TP + FN)
# Correct Matches / (# Correct Matches + # Missed Matches)

F1 SCORE
2 * (PRECISION * RECALL) / (PRECISION + RECALL)
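
The metrics above can be checked with a few lines of Ruby. The counts in the example (8 true positives, 2 false positives, 8 false negatives) are made up for illustration, and the sketch does not guard against empty result sets (division by zero).

```ruby
# Precision: share of returned documents that are relevant.
def precision(true_pos, false_pos)
  true_pos.to_f / (true_pos + false_pos)
end

# Recall: share of relevant documents that were returned.
def recall(true_pos, false_neg)
  true_pos.to_f / (true_pos + false_neg)
end

# F1: harmonic mean of precision and recall.
def f1_score(prec, rec)
  2 * prec * rec / (prec + rec)
end

# Hypothetical query: 10 results returned, 8 of them relevant,
# and 8 relevant documents were missed.
prec = precision(8, 2)
rec = recall(8, 8)
puts format("precision=%.2f recall=%.2f f1=%.2f", prec, rec, f1_score(prec, rec))
```

Because F1 is a harmonic mean, a model cannot get a high score by maximizing only one of the two measures.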

When is a model good enough for an app?
You can choose the model with the best F1 score, for
example.
However, there is no universal solution
It is an incremental process
You should tune it based on users' information needs
Usability tests are also a good way to evaluate a model

Adding a search
engine to my app

FULL TEXT SEARCH IN MARIADB...
CREATE TABLE `songs` (
`id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
`title` varchar(300),
`artist` varchar(255),
`genre` varchar(255),
`lyrics` text
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);
FTS

-- MATCH (title, artist, lyrics) needs a FULLTEXT index over exactly these columns
CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
-- Natural language mode (the default mode)
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL
LANGUAGE MODE);
-- Boolean mode
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE);

SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons');
Returned rows:
Radioactive - Imagine Dragons
Imagine - John Lennon

SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics)
AGAINST ('+imagine +dragons' IN BOOLEAN MODE);

-- Phrase search: the double-quote operator belongs to boolean mode
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"' IN BOOLEAN MODE);

SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative');
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music');
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('alternative' WITH QUERY EXPANSION);
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('music' WITH QUERY EXPANSION);
Imagine Dragons
John Lennon
Imagine Dragons
John Lennon
Imagine Dragons - Alternative Rock
John Lennon - Rock music, Pop music

Why use an external search engine?
Spell checking! Or did you mean…
search like Google? ♡

Why use an external search engine?
You can use spell checking!
You can also:
- Add multivalued fields (document oriented database)
- Add new algorithms to the databases
- Customize stop words, stemming analyzers
- Use fuzziness functions
- Boost some documents/fields according to the search

Apache Solr and Elasticsearch
Based on Apache Lucene
Document-oriented databases (welcome to polyglot persistence!)
They are not relational databases, ok? No ACID, sorry!
Developed to be scalable
Apache Solr has better documentation +50
ES has native support for a structured query DSL +1
ES is better for analytic queries

Elasticsearch DSL
// artist = John Lennon AND (genres = rock OR genres = pop)
// AND NOT(nome = imagine)
GET /songs/v1/_search
{
  "query" : {
    "bool": {
      "must": {"match": {"artist": "John Lennon" }},
      "should": [
        {"match": {"genres": "rock" }},
        {"match": {"genres": "pop" }}
      ],
      "minimum_should_match": 1,
      "must_not": {"match": {"nome": "imagine"}}
    }
  }
}

Our choice: Apache Solr
Apache Solr is Open Source and Open Development +1000
Latest release: 7.1.0 (October 17th, 2017)

Installing for a development environment...
docker run --name my_solr -p 8983:8983 -d solr
https://hub.docker.com/r/risdenk/docker-solr/

Creating a core
docker exec -it my_solr solr create_core -c development
("development" is the core name)
core ~> database or table
document ~> a row from a table
schemaless!!

Creating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update/json/docs' \
  --data-binary '
{
  "id": "1",
  "title": "Song 1"
}'

Commit!!
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
{
  "commit": {}
}'

Creating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
[{
  "id": "1",
  "title": "Song 1"
},{
  "title": "Song 2"
}]'
The "id" field is optional on insert

Our new documents!
title and title_str?
dynamic fields
*_str, *_i, ...

Updating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
[{
  "id": "1",
  "title": "Song 3"
},
{
  "title": "Song 3"
}]'

Deleting a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update' \
  --data-binary '
{
  "delete": { "id":"1" },
  "commit": {}
}'

Deleting ALL documents
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
{
  "delete": {
    "query": "*:*"
  }
}'

QUERY
q (query): main query parameter
fq (filter query): filter query (to reduce the dataset)
fl (field list): list of fields to return
sort: list of fields to sort the dataset by
Results are paginated
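
A small Ruby sketch of how these parameters combine into a request URL. The solr_select_url helper is hypothetical; start and rows are the standard Solr pagination parameters, and the core URL is the local development core used in this deck.

```ruby
require "uri"

# Build a /select URL from the common query parameters.
# Parameters left as nil are omitted from the query string.
def solr_select_url(base, q:, fq: nil, fl: nil, sort: nil, start: 0, rows: 10)
  params = { q: q, fq: fq, fl: fl, sort: sort, start: start, rows: rows }.compact
  "#{base}/select?#{URI.encode_www_form(params)}"
end

puts solr_select_url(
  "http://localhost:8983/solr/development",
  q: "title:song",
  fq: "genre:Rock",
  fl: "title,year",
  sort: "year desc"
)
```

Letting URI.encode_www_form escape the values avoids hand-written query strings breaking on spaces, colons or quotes.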

Basic queries
List all documents (with pagination)
curl 'http://localhost:8983/solr/development/select?q=*:*'

Basic queries
List all documents (with pagination)
curl http://localhost:8983/solr/development/select -d '
{
query:"*:*"
}'

My documents
{
"docs": [
{
"title": ["Song 1"],
"genre": "Rock",
"year": 2010
},
{
"title": ["Song 2"],
"genre": "MPB",
"year": 1990
}, {
"title": ["Other music Rock"],
"genre": "Pop",
"year": 1970
},
{
"title": ["My favorite songs"],
"genre": "Rock Music",
"year": 2011
}
]
}

Fuzzy matching
title:Song* 3 documents
title:Song? 1 document
title:Sonjs 0 documents
title:Sonjs~1 1 document
title:Sonjs~2 3 documents
title:(my songs) 1 document
title:"my songs" 0 documents
title:"my songs"~2 1 document
title:(-favorite +song*) 2 documents
*:* 4 documents
Wildcards:
? exactly one character
* any number of characters
~N after a term: fuzzy match (up to N edits)
~N after a phrase: query slop
( ) keyword search
" " phrase query
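
The fuzzy results above can be reasoned about with edit distance. Below is a plain Levenshtein distance in Ruby as a minimal sketch; the edit_distance name is made up, and Lucene's fuzzy matching may differ in details (for example, it may count a transposition as a single edit), so this is an approximation of the idea rather than what Solr executes.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions or substitutions needed to turn string a into string b.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

puts edit_distance("sonjs", "songs") # one substitution away
puts edit_distance("sonjs", "song")  # substitution plus deletion
```

This is consistent with the counts in the table: Sonjs~1 only reaches the token "songs", while Sonjs~2 also reaches "song".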

Fuzzy matching
title:"song" AND genre:"rock" 1 document
(title:"song" AND genre:"rock") OR title:"track" 2 documents
year: [1980 TO *] 3 documents
genre:[Pop TO *] 3 documents
Boosting:
(title:music OR title:Rock)^1.5 (genre:music OR genre: Rock) 3 documents
1st: "Other music rock"
(title:music OR title:Rock) (genre:music OR genre: Rock)^1.5 3 documents
1st: "My favorite songs"

Searching in all fields
In your schema.xml, add:
<copyField source="*_txt" dest="_text_" />
<copyField source="*_text" dest="_text_" />
You can also add the following, but it is not recommended:
<copyField source="*" dest="_text_" />
Then, you can search without defining the default field

Analysis: lists all indexing and querying transformations
[Screenshot: indexing transformations and querying transformations]

Customizing fields and their analyzers (schema.xml)
<fieldtype name="phonetic" stored="false" indexed="true"
class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
<filter class="solr.StopFilterFactory" format="snowball"
words="lang/stopwords_pt.txt" />
</analyzer>
</fieldtype>

Building the spell checking index
curl --request GET --url \
  'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=true&spellcheck=true'

Searching: IMAGINA
Suggestion:
IMAGINE

Searching: DRAGOONS
Suggestion:
DRAGONS

Searching:
IMAGINA DRAGOONS
Suggestion:
IMAGINE DRAGONS

Integrating with
Ruby on Rails

Connecting through a REST Client …
require 'rest-client'
require 'json'

params = { q: 'title:song' }
# Hash#to_param comes from ActiveSupport (available in Rails)
url = "http://localhost:8983/solr/development/select?#{params.to_param}"
response = RestClient.get(url)
response_json = JSON.parse(response.body)
items = response_json["response"]["docs"]
# => [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d",
#      "_version_"=>1584234836063682560, "title_str"=>["Song 1"]},
#     {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44",
#      "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}]
ヽ(•́o•̀)ノ

Installing...
# Gemfile
gem 'sunspot_rails'
# Generates config/sunspot.yml
rails generate sunspot_rails:install
# config/sunspot.yml
development:
  solr:
    hostname: solr
    port: 8983
    path: /solr/playax
    log_level: INFO
  auto_index_callback: after_commit
  auto_remove_callback: after_commit

Sunspot needs its own schema.xml.
Follow this example in:
elainenaomi/search_engine

Sunspot DSL - Defining the indexed fields
class Song < ActiveRecord::Base
  searchable do
    text :title, stored: true
    text :lyrics, stored: false
    text :artist, stored: true
    string :genre, multiple: true, stored: true do
      genre.split(',')
    end
  end
end
Sunspot.index! Song.all

Bag of words:
search = Song.search do
  fulltext 'imagine dragons'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results:
Imagine (John Lennon)
Radioactive (Imagine Dragons)

Phrase queries:
search = Song.search do
  fulltext '"imagine dragons"'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results:
Radioactive (Imagine Dragons)

Query Phrase Slop
# Two words can appear between the words in the phrase, so
# "imagine all the people" also matches, in addition to "imagine people"
Song.search do
  fulltext '"imagine people"' do
    fields :lyrics
    query_phrase_slop 2
  end
end

Minimum Match
# 3 terms x 70% = 2.1, rounded down: at least 2 terms must match
Song.search do
  fulltext "dragons imagine test" do
    fields :artist, :title
    minimum_match '70%'
  end
end
1 document:
Radioactive (Imagine Dragons)
# 3 terms x 60% = 1.8, rounded down: at least 1 term must match
# The title boost ranks the title match first
Song.search do
  fulltext 'dragons imagine test' do
    fields :artist, :title
    boost_fields title: 2.0
    minimum_match '60%'
  end
end
2 documents:
1st: Imagine (John Lennon)
2nd: Radioactive (Imagine Dragons)

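
The "rounded down" behaviour of a percentage minimum match can be reproduced with integer arithmetic. The required_matches helper below is a hypothetical name for illustration; it only models positive percentages, not the other minimum-match syntaxes.

```ruby
# Number of query terms that must match for a percentage minimum match.
# Integer division rounds the result down.
def required_matches(term_count, percent)
  (term_count * percent) / 100
end

terms = %w[dragons imagine test]
puts required_matches(terms.size, 70) # 2.1 rounded down
puts required_matches(terms.size, 60) # 1.8 rounded down
```

With three terms, moving from 70% to 60% drops the requirement from two matching terms to one, which is why the second search returns an extra document.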
Spell checking
search = Sunspot.search(Song) do
  keywords 'Imagina Dragoons'
  spellcheck :count => 3
end
search.spellcheck_suggestion_for('imagina') # => 'imagine'
search.spellcheck_suggestions
# => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]

To test or not to test?
Unit tests? No.
Integration tests? Maybe…
Search engines depend on term frequency to rank documents
You would need your whole dataset to compute precision, recall...
You can test only filter queries, indexing callbacks…

Summary
The searching problem
● User: a bug search tool
Adding a search engine to my app
● Full text search in MariaDB
● Apache Solr x ElasticSearch
Apache Solr
● How to create cores
● CRUD operations
Integrating with Rails
● Sunspot gem
● How to index, search and test

Keep in mind
Always verify what information your users need from your app
E.g.: check whether stop word removal and synonyms should be applied
"No" - Meghan Trainor
"I am" - P.O.D
E.g.: decide which transformations your search engine should apply
- Phonetic transformations? Custom language analyzers?

Keep in mind
Information is found not only in text files but also in
audio, video, images, etc.

Suggested topics for studying
- Evaluation of available analyzers for FTS
- Performance optimization (such as soft commits, lazily built indexes)
- Distribution and replication through SolrCloud
- Use of Machine Learning algorithms
- Creation of custom function queries
- Authentication
- Integrating with Logstash and Kibana
- Geospatial searches

Thank you! <3
github.com/elainenaomi
slideshare.net/elainenaomi
@elaine_nw

References
Introduction to Information Retrieval
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
Solr in action
Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
Sunspot gem
http://sunspot.github.io/
Uma introdução ao tema recuperação de informações textuais.
Barth, F. J. (2013)
10 Reasons to Choose Apache Solr Over Elasticsearch (2016)
https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc

References
Apache Solr vs Elasticsearch
http://solr-vs-elasticsearch.com/
When to consider Solr
https://stackoverflow.com/questions/4960952/when-to-consider-solr
Indexing for full text search in PostgreSQL
https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
PolyglotPersistence
https://martinfowler.com/bliki/PolyglotPersistence.html
Yahoo! Answers: Qual o nome desta Música?
https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa

References
Full-Text Index in MariaDB
https://mariadb.com/kb/en/library/full-text-index-overview/
Natural Language Full-Text Searches (MySQL)
https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
Postgres full-text search is Good Enough! (2015)
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
Text Indexes in MongoDB
https://docs.mongodb.com/manual/core/index-text/
Full-Text Index Stopwords for MariaDB
https://mariadb.com/kb/en/library/full-text-index-stopwords/
