7. or did you mean...
HOW TO SEARCH LIKE GOOGLE?
8. A LIST OF SONGS…
TITLES, ARTISTS, LYRICS...
IMAGINE IN OUR CONTEXT…
9. "Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
A very small sample of songs...
10. Is a SQL LIKE statement enough?
SEARCHING TERM: "IMAGINE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%' OR
artist LIKE '%IMAGINE%' OR
lyrics LIKE '%IMAGINE%';
11. "Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm a radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
Searching for "Imagine"
14. Is a SQL LIKE statement really enough??
SEARCHING TERM: "IMAGINE PEOPLE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%PEOPLE%' OR
artist LIKE '%IMAGINE%PEOPLE%' OR
lyrics LIKE '%IMAGINE%PEOPLE%';
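A LIKE pattern matches characters strictly in the order they were typed, which is why it breaks down as soon as users reorder or misspell words. The following is a minimal sketch that emulates LIKE with a regular expression. The helper names `like_to_regexp` and `like?` are made up for illustration, and the case-insensitive matching is an assumption about the column collation.

```ruby
# Translate a SQL LIKE pattern into a Ruby regexp:
# % matches any run of characters, _ matches exactly one character.
# Case-insensitive matching is an assumption about the collation in use.
def like_to_regexp(pattern)
  body = pattern.chars.map do |ch|
    case ch
    when '%' then '.*'
    when '_' then '.'
    else Regexp.escape(ch)
    end
  end.join
  Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE | Regexp::MULTILINE)
end

def like?(value, pattern)
  like_to_regexp(pattern).match?(value)
end

ARTIST = 'Imagine Dragons'
LYRICS = 'Imagine all the people living life in peace'

puts like?(LYRICS, '%IMAGINE%PEOPLE%')   # true: the words appear in this order
puts like?(ARTIST, '%IMAGINE%DRAGONS%')  # true
puts like?(ARTIST, '%DRAGONS%IMAGINE%')  # false: same words, different order
puts like?('John Lennon', '%JONH%')      # false: a typo never matches
```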
16. When a LIKE statement is not enough...
SEARCH TERMS:
"Dragons Imagine"
"Imagine John"
"Imagine JONH" <- TYPO!
17. When only Yahoo! Answers is the solution...
Ueca tudi diango...
tanananananananaann nisss
♬ welcome to the jungle,
watch it bring you to your knees ♬
(╯°▽°)╯ ︵ ┻━┻
24. "DON'T PANIC"
Is it enough to remove punctuation and spaces?
"DON'T PANIC" becomes: DON T PANIC
"do not panic" becomes: do not panic
How to tokenize contractions?
Are all of them semantic units?
Are they the same tokens?
don't = do not
27. "Fullstack developer"
"Full-stack dev"
"Full stack developer"
Is it enough to remove punctuation and spaces?
Fullstack developer
Full stack developer
Full-stack developer
For a user, these terms should return the same documents, shouldn't they?
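To see why stripping punctuation and splitting on spaces is not enough, here is a minimal sketch of such a naive tokenizer. The `tokenize` helper is hypothetical and is not the analyzer of any particular search engine.

```ruby
# Naive tokenizer: lowercase the text and keep runs of letters and digits.
# Punctuation and whitespace both act as token separators.
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

p tokenize('Fullstack developer')   # ["fullstack", "developer"]
p tokenize('Full-stack dev')        # ["full", "stack", "dev"]
p tokenize('Full stack developer')  # ["full", "stack", "developer"]
p tokenize("DON'T PANIC")           # ["don", "t", "panic"]
```

The three job titles end up as three different token lists, and the contraction is split into fragments that are not meaningful on their own.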
28. 30 seconds to Mars vs. Thirty seconds to Mars
November 18th, 2017 vs. 2017-11-18
SP vs. São Paulo
How to deal with numbers and abbreviations?
29. Kaminari
Is it a gem or "thunder" in Japanese?
Windows
Is it the plural of "window" or is it about the company?
This is about the semantics of the original term and of its normalized token...
31. STOP WORDS: extremely common words
In English:
a, an, the, and, or, are, as, at, by,
for, from, of ...
In Portuguese:
um, uma, a, o, as, os, é, são, por, de,
da, do, se …
33. Stop words, diacritics, case folding...
Case folding normalization:
HELLO WORLD, Hello World, hello world -> hello, world
Diacritics removal:
naive, naïve -> naive
Stop word removal:
roses are red, red roses -> roses, red
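The three normalization steps can be chained into one small pipeline. This is a sketch under stated assumptions: the stop word list is the short English sample from the previous slide, and `strip_diacritics` and `normalize` are illustrative names, not functions from any search engine.

```ruby
# Sample English stop words taken from the list above (not exhaustive).
STOP_WORDS = %w[a an the and or are as at by for from of].freeze

# Decompose accented characters (NFD) and drop the combining marks.
def strip_diacritics(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Case folding -> diacritics removal -> tokenization -> stop word removal.
def normalize(text)
  tokens = strip_diacritics(text.downcase).scan(/[[:alnum:]]+/)
  tokens.reject { |token| STOP_WORDS.include?(token) }
end

p normalize('HELLO WORLD')    # ["hello", "world"]
p normalize('naïve')          # ["naive"]
p normalize('roses are red')  # ["roses", "red"]
```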
34. When not to normalize tokens…
Strings solely composed of stop words:
The Who (a band)
Se (Brazilian song, by Djavan)
Different meanings for words with and without diacritics.
In Spanish:
peña means "cliff"
pena means "sorrow"
When not to set all characters to lowercase:
General Motors
Windows
Apple
36. LEMMATIZATION: based on a vocabulary
English:
am, are, is -> be
car, cars, car's, cars' -> car
Portuguese:
sou, somos, foi, é -> ser
carros, carro -> carro
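Because lemmatization relies on a vocabulary, the simplest possible sketch is a dictionary lookup. The `LEMMAS` table below only holds the examples from this slide. A real lemmatizer would use a full morphological dictionary, which is assumed away here.

```ruby
# Toy vocabulary: inflected form => lemma (only the examples from the slide).
LEMMAS = {
  'am' => 'be', 'are' => 'be', 'is' => 'be',
  'cars' => 'car', "car's" => 'car', "cars'" => 'car',
  'sou' => 'ser', 'somos' => 'ser', 'foi' => 'ser', 'é' => 'ser',
  'carros' => 'carro'
}.freeze

# Unknown words are returned unchanged (lowercased).
def lemmatize(token)
  key = token.downcase
  LEMMAS.fetch(key, key)
end

p %w[am are is].map { |token| lemmatize(token) }  # ["be", "be", "be"]
p lemmatize('Carros')                             # "carro"
p lemmatize('people')                             # "people"
```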
37. STEMMING
Heuristic process that chops off the ends of words
cats -> cat
ponies -> poni
It increases the number of returned documents (recall).
However, it can harm precision...
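As a rough illustration of suffix chopping, the sketch below applies only the plural rules known as step 1a of the Porter stemmer (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is not a complete stemmer, and `stem_step_1a` is a name chosen for this example.

```ruby
# Porter step 1a rules, tried in order; the first matching suffix wins.
STEP_1A_RULES = [
  [/sses\z/, 'ss'],
  [/ies\z/,  'i'],
  [/ss\z/,   'ss'],
  [/s\z/,    '']
].freeze

def stem_step_1a(word)
  lowered = word.downcase
  STEP_1A_RULES.each do |suffix, replacement|
    return lowered.sub(suffix, replacement) if lowered.match?(suffix)
  end
  lowered
end

p stem_step_1a('cats')      # "cat"
p stem_step_1a('ponies')    # "poni"
p stem_step_1a('caresses')  # "caress"
p stem_step_1a('news')      # "new"
```

The last call hints at the precision problem: "news" and "new" collapse into the same token even though they mean different things.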
38. STEMMING
Heuristic process that chops off the ends of words
Portuguese:
amor (it means love), amores, amora (it's a Brazilian berry) -> amor
English:
operating system -> operat system (not so meaningful tokens)
40. Bag of words:
List of keywords
Ordering of words is ignored!
e.g.
Imagine Dragons
Dragons Imagine
Phrase queries:
Order matters!
Restrict searches
e.g.
"Imagine Dragons"
41. RELEVANCE
term frequency (tf)
number of occurrences of a term in a document
inverse document frequency (idf)
how rare a term is across all indexed documents
42. RELEVANCE
tf-idf = tf x idf
a function that balances how frequent a term is in a
document against how rare the term is in the collection
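A minimal sketch of tf-idf over the three sample songs follows. It assumes one common variant, idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. Search engines use refined formulas, so the numbers are only illustrative.

```ruby
SONGS = {
  'Imagine' => 'Imagine all the people living life in peace',
  'Radioactive' => "I'm a radioactive, radioactive",
  'Welcome to the jungle' => 'Welcome to the jungle, watch it bring you to your knees'
}.freeze

def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# tf: raw count of the term in one document.
def tf(term, doc_tokens)
  doc_tokens.count(term)
end

# idf: ln(N / df); 0.0 when the term appears in no document.
def idf(term, corpus)
  df = corpus.count { |doc_tokens| doc_tokens.include?(term) }
  df.zero? ? 0.0 : Math.log(corpus.size.to_f / df)
end

def tf_idf(term, doc_tokens, corpus)
  tf(term, doc_tokens) * idf(term, corpus)
end

CORPUS = SONGS.transform_values { |lyrics| tokenize(lyrics) }.freeze

# Frequent in one document and rare in the collection: high score.
puts tf_idf('radioactive', CORPUS['Radioactive'], CORPUS.values).round(3)  # 2.197
# Present in two of the three documents: low score.
puts tf_idf('the', CORPUS['Imagine'], CORPUS.values).round(3)              # 0.405
```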
45. Evaluation method
We need a test dataset with:
1. A document collection
2. A collection of queries
3. A set of relevance judgments: for each query, a list of relevant and
non-relevant documents
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
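Given the relevance judgments for one query, the usual metrics follow directly from these counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. Below is a small sketch with made-up document ids; `evaluate` is an illustrative helper, not part of any library.

```ruby
# retrieved: ids returned by the search engine for a query
# relevant:  ids judged relevant for that query in the test dataset
def evaluate(retrieved, relevant)
  tp = (retrieved & relevant).size  # returned and relevant
  fp = (retrieved - relevant).size  # returned but not relevant
  fn = (relevant - retrieved).size  # relevant but not returned

  precision = (tp + fp).zero? ? 0.0 : tp.to_f / (tp + fp)
  recall    = (tp + fn).zero? ? 0.0 : tp.to_f / (tp + fn)
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)

  { precision: precision, recall: recall, f1: f1 }
end

# Hypothetical query: 4 documents returned, 3 documents are actually relevant.
scores = evaluate([1, 2, 3, 4], [2, 4, 5])
puts scores[:precision].round(3)  # 0.5
puts scores[:recall].round(3)     # 0.667
puts scores[:f1].round(3)         # 0.571
```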
50. When is a model good enough for an app?
You can choose the model with the best F1 score, for
example.
However, there is no universal solution
It is an incremental process
You should tune it based on users' information needs
Usability tests are also a good way to evaluate a model
52. FULL TEXT SEARCH IN MARIADB...
CREATE TABLE `songs` (
`id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
`title` varchar(300),
`artist` varchar(255),
`genre` varchar(255),
`lyrics` text
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);
53. FULL TEXT SEARCH IN MARIADB...
-- MATCH needs one FULLTEXT index covering exactly the listed columns
CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
-- natural language mode is the default mode
SELECT *
FROM songs
WHERE MATCH (title,artist,lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE);
SELECT *
FROM songs
WHERE MATCH (title,artist,lyrics) AGAINST ('imagine' IN BOOLEAN MODE);
54. FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons');
Returned rows:
Radioactive - Imagine Dragons
Imagine - John Lennon
55. FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics)
AGAINST ('+imagine +dragons' IN BOOLEAN MODE);
Radioactive - Imagine Dragons
56. FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"' IN BOOLEAN MODE);
Radioactive - Imagine Dragons
57. FULL TEXT SEARCH IN MARIADB...
SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative');
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music');
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('alternative' WITH QUERY EXPANSION);
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('music' WITH QUERY EXPANSION);
Imagine Dragons
John Lennon
Imagine Dragons
John Lennon
Imagine Dragons - Alternative Rock
John Lennon - Rock music, Pop music
58. Why use an external search engine?
Spell checking! Or did you mean…
search like Google? ♡
59. Why use an external search engine?
You can use spell checking!
You can also:
- Add multivalued fields (document oriented database)
- Add new algorithms to the databases
- Customize stop words, stemming analyzers
- Use fuzziness functions
- Boost some documents/fields according to the search
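Spell checking and fuzziness are commonly built on edit distance. The sketch below implements the classic Levenshtein distance and a naive "did you mean" lookup over a tiny vocabulary. It illustrates the idea only and is not how Solr or Elasticsearch implement it. The `did_you_mean` helper and its `max_distance` parameter are assumptions made for this example.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn a into b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Suggest the closest vocabulary word, if it is close enough.
def did_you_mean(term, vocabulary, max_distance: 2)
  best = vocabulary.min_by { |word| levenshtein(term.downcase, word.downcase) }
  return nil if best.nil?

  levenshtein(term.downcase, best.downcase) <= max_distance ? best : nil
end

VOCABULARY = %w[John Lennon Imagine Dragons].freeze

p levenshtein('jonh', 'john')           # 2 (two substitutions)
p did_you_mean('JONH', VOCABULARY)      # "John"
p did_you_mean('xylophone', VOCABULARY) # nil
```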
60. Apache Solr and ElasticSearch
Based on Apache Lucene
Document-oriented databases (welcome to polyglot persistence!)
They are not relational databases, OK? No ACID, sorry!
Developed to be scalable
Apache Solr has better documentation +50
ES has native support for a structured query DSL +1
ES is better for analytic queries
61. ElasticSearch DSL
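// artist = John Lennon AND (genres = rock OR genres = pop)
// AND NOT(nome = imagine)
// minimum_should_match makes at least one "should" clause mandatory;
// next to a "must" clause, "should" would otherwise only affect scoring
GET /songs/v1/_search
{
  "query": {
    "bool": {
      "must": {"match": {"artist": "John Lennon"}},
      "should": [
        {"match": {"genres": "rock"}},
        {"match": {"genres": "pop"}}
      ],
      "minimum_should_match": 1,
      "must_not": {"match": {"nome": "imagine"}}
    }
  }
}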
62. Our choice: Apache Solr
Apache Solr is Open Source and Open Development +1000
Latest release: 7.1.0 (October 17th, 2017)
66. Creating a core
docker exec -it my_solr solr create_core -c development
("development" is the core name)
core ~> database or table
document ~> a row from a table
schemaless!!
82. QUERY
q (query): main query parameter
fq (filter query): filter query (to reduce the dataset)
fl (field list): list of fields to return
sort: list of fields to sort the dataset by
Results are paginated
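These parameters are passed as ordinary URL query parameters to the core's select handler. The sketch below only builds the URL and sends no request. The helper name `solr_select_url`, the field names, and the default page size are assumptions, while `start` and `rows` are Solr's standard pagination parameters.

```ruby
require 'uri'

# Build a Solr select URL from the common query parameters.
# fq may be given several times; fl is a comma-separated field list.
def solr_select_url(core:, q: '*:*', fq: [], fl: nil, sort: nil,
                    start: 0, rows: 10, base: 'http://localhost:8983/solr')
  params = [['q', q]]
  fq.each { |filter| params << ['fq', filter] }
  params << ['fl', fl.join(',')] if fl
  params << ['sort', sort] if sort
  params << ['start', start]
  params << ['rows', rows]
  "#{base}/#{core}/select?#{URI.encode_www_form(params)}"
end

# Second page (10 results per page) of rock songs matching "imagine".
puts solr_select_url(
  core: 'development',
  q: 'title:imagine',
  fq: ['genre:Rock'],
  fl: %w[title artist],
  sort: 'title asc',
  start: 10
)
```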
83. Basic queries
List all documents (with pagination)
curl 'http://localhost:8983/solr/development/select?q=*:*'
84. Basic queries
List all documents (with pagination)
curl http://localhost:8983/solr/development/select -d '
{
query:"*:*"
}'
87. Fuzzy matching
title:"song" AND genre:"rock" -> 1 document
(title:"song" AND genre:"rock") OR title:"track" -> 2 documents
year:[1980 TO *] -> 3 documents
genre:[Pop TO *] -> 3 documents
Boosting:
(title:music OR title:Rock)^1.5 (genre:music OR genre:Rock) -> 3 documents, 1st: "Other music rock"
(title:music OR title:Rock) (genre:music OR genre:Rock)^1.5 -> 3 documents, 1st: "My favorite songs"
88. Searching in all fields
In your schema.xml, add:
<copyField source="*_txt" dest="_text_" />
<copyField source="*_text" dest="_text_" />
You can also add the following, but it is not recommended:
<copyField source="*" dest="_text_" />
Then, you can search without defining the default field
89. Analysis: list all indexing and querying transformations
Indexing transformations
Querying transformations
92. Building the spell-checking index
curl --request GET --url 'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=true&spellcheck=true'
100. Sunspot needs its own schema.xml.
Follow this example in:
elainenaomi/search_engine
101. Sunspot DSL - Defining the indexed fields
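class Song < ActiveRecord::Base
  searchable do
    text :title, stored: true
    text :lyrics, stored: false
    text :artist, stored: true
    # genres are kept in one comma-separated string; index each value
    string :genre, multiple: true, stored: true do
      genre.split(',')
    end
    # assumes a `year` column; needed by the with(:year) filters used in the searches
    integer :year
  end
end
# index every song and commit
Sunspot.index! Song.all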
102. Bag of words:
search = Song.search do
  fulltext 'imagine dragons'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results:
Imagine (John Lennon)
Radioactive (Imagine Dragons)
103. Phrase queries:
search = Song.search do
  fulltext '"imagine dragons"'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
Results:
Radioactive (Imagine Dragons)
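The difference between a bag-of-words search and a phrase query can be sketched in plain Ruby, outside Solr. This is only an illustration of the two matching rules. It assumes every query term is required, whereas Solr's real behaviour depends on the query parser and its settings. The helper names are hypothetical.

```ruby
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Bag of words: every query term occurs somewhere; order is ignored.
def bag_match?(document, query)
  (tokenize(query) - tokenize(document)).empty?
end

# Phrase query: the query terms occur next to each other, in order.
def phrase_match?(document, phrase)
  wanted = tokenize(phrase)
  return false if wanted.empty?

  tokenize(document).each_cons(wanted.size).any? { |window| window == wanted }
end

DOCUMENT = 'Radioactive - Imagine Dragons'

p bag_match?(DOCUMENT, 'Dragons Imagine')     # true
p phrase_match?(DOCUMENT, 'Imagine Dragons')  # true
p phrase_match?(DOCUMENT, 'Dragons Imagine')  # false
```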
104. Query Phrase Slop
# Two words can appear between the words in the phrase, so
# "imagine all the people" also matches, in addition to "imagine people"
Song.search do
  fulltext '"imagine people"' do
    fields :lyrics
    query_phrase_slop 2
  end
end
105. Minimum Match
# 70% of 3 terms = 2.1, rounded down: at least 2 terms must match
Song.search do
  fulltext 'dragons imagine test' do
    fields :artist, :title
    minimum_match '70%'
  end
end
1 document:
Radioactive (Imagine Dragons)
# 60% of 3 terms = 1.8, rounded down: at least 1 term must match
Song.search do
  fulltext 'dragons imagine test' do
    fields :artist, :title
    boost_fields title: 2.0 # boost: matches in the title score higher
    minimum_match '60%'
  end
end
2 documents:
1st: Imagine (John Lennon)
2nd: Radioactive (Imagine Dragons)
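The "rounded down" note refers to how a percentage is converted into a number of required terms. Here is a minimal sketch of that arithmetic and of the resulting filter, using the artist and title tokens of the two songs. Both helper names are illustrative and not part of Sunspot or Solr.

```ruby
# Number of query terms that must match, rounding the percentage down.
def required_matches(term_count, percent)
  (term_count * percent / 100.0).floor
end

# A document passes when enough query terms occur in its tokens.
def minimum_match?(doc_tokens, query_tokens, percent)
  hits = query_tokens.count { |term| doc_tokens.include?(term) }
  hits >= required_matches(query_tokens.size, percent)
end

QUERY_TERMS = %w[dragons imagine test].freeze
RADIOACTIVE = %w[radioactive imagine dragons].freeze  # title + artist
IMAGINE     = %w[imagine john lennon].freeze          # title + artist

p required_matches(3, 70)                        # 2
p required_matches(3, 60)                        # 1
p minimum_match?(RADIOACTIVE, QUERY_TERMS, 70)   # true
p minimum_match?(IMAGINE, QUERY_TERMS, 70)       # false
p minimum_match?(IMAGINE, QUERY_TERMS, 60)       # true
```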
108. To test or not to test?
Unit tests? No.
Integration tests? Maybe…
Search engines depend on term frequencies to rank documents
You will need your whole dataset to compute precision, recall...
You can test only filter queries, indexing callbacks…
110. Summary
The searching problem
● User: a bug search tool
Adding a search engine to my app
● Full text search in MariaDB
● Apache Solr vs. ElasticSearch
Apache Solr
● How to create cores
● CRUD operations
Integrating with Rails
● Sunspot gem
● How to index, search and test
111. Keep in mind
Always verify your app users' information needs
E.g.: check whether stop word removal and synonyms should be applied
"No" - Meghan Trainor
"I am" - P.O.D
E.g.: decide which transformations your search engine should apply
- Phonetic transformations? Custom language analyzers?
112. Keep in mind
Information is not only in text files but also in
audio, video, images, etc.
113. Suggested topics for studying
- Evaluation of available analyzers for FTS
- Performance optimization (such as soft commits, lazy index building)
- Distribution and replication through SolrCloud
- Use of machine learning algorithms
- Creation of custom function queries
- Authentication
- Integrating with Logstash and Kibana
- Geospatial searches
115. References
Introduction to Information Retrieval
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
Solr in action
Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
Sunspot gem
http://sunspot.github.io/
Uma introdução ao tema recuperação de informações textuais.
Barth, F. J. (2013)
10 Reasons to Choose Apache Solr Over Elasticsearch (2016)
https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
116. References
Apache Solr vs Elasticsearch
http://solr-vs-elasticsearch.com/
When to consider Solr
https://stackoverflow.com/questions/4960952/when-to-consider-solr
Indexing for full text search in PostgreSQL
https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
PolyglotPersistence
https://martinfowler.com/bliki/PolyglotPersistence.html
Yahoo! Answers: Qual o nome desta Música?
https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
117. References
Full-Text Index in MariaDB
https://mariadb.com/kb/en/library/full-text-index-overview/
Natural Language Full-Text Searches (MySQL)
https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
Postgres full-text search is Good Enough! (2015)
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
Text Indexes in MongoDB
https://docs.mongodb.com/manual/core/index-text/
Full-Text Index Stopwords for MariaDB
https://mariadb.com/kb/en/library/full-text-index-stopwords/