Dealing with a search engine
in your application
a Solr approach for beginners
Elaine Naomi Watanabe
Elaine Naomi Watanabe
Full-stack developer
(Playax)
Master's degree in Computer Science
(IME-USP)
Passionate about:
Web Development, Agile,
Cloud Computing, DevOps,
NoSQL and RDBMS
ANALYZING BILLIONS OF DATA POINTS TO HELP
ARTISTS AND MUSIC PROFESSIONALS
DEVELOP THEIR AUDIENCE
BIG DATA + MUSIC + TECH = <3
AGENDA
- The searching problem: introduction
- Information retrieval: basic concepts
- Apache Solr: how to configure
- Sunspot gem: integrating with Ruby on Rails
- Next steps: including references
SPOILER ALERT
THE SEARCHING
PROBLEM
or did you mean...
HOW TO SEARCH LIKE
GOOGLE?
A LIST OF SONGS…
TITLES, ARTISTS, LYRICS...
IMAGINE IN OUR CONTEXT…
"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
A tiny sample of songs...
Is a SQL LIKE statement enough?
SEARCHING TERM: "IMAGINE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%' OR
artist LIKE '%IMAGINE%' OR
lyrics LIKE '%IMAGINE%';
"Imagine all the people living life in peace"
(Imagine - John Lennon)
"I'm radioactive, radioactive"
(Radioactive - Imagine Dragons)
"Welcome to the jungle,
watch it bring you to your knees"
(Welcome to the jungle - Guns N' Roses)
Searching for "Imagine"
Is a SQL LIKE statement really enough?
SEARCHING TERM: "IMAGINE PEOPLE"
SELECT *
FROM songs
WHERE
title LIKE '%IMAGINE%PEOPLE%' OR
artist LIKE '%IMAGINE%PEOPLE%' OR
lyrics LIKE '%IMAGINE%PEOPLE%';
USER vs. YOUR APP
A BUGGY SEARCH TOOL
When a LIKE statement is not enough...
SEARCH TERMS:
"Dragons Imagine"
"Imagine John"
"Imagine JONH" <- TYPO!
When Yahoo! Answers is the only solution...
Ueca tudi diango...
tanananananananaann nisss
♬ welcome to the jungle,
watch it bring you to your knees ♬
(╯°▽°)╯ ︵ ┻━┻
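To see why the LIKE patterns from the searching problem break down, here is a minimal Ruby sketch that mimics `LIKE '%A%B%'` with a regular expression. The `like_match?` and `song_matches?` helpers and the `SONG` hash are assumptions made for illustration; they are not part of the original slides, and a real database applies its own collation rules.

```ruby
# Hypothetical helper: emulates a case-insensitive SQL LIKE '%term1%term2%' check.
def like_match?(text, search)
  # Each space in the search becomes "%", i.e. "any run of characters" (.* in a regex).
  pattern = search.split.map { |term| Regexp.escape(term) }.join('.*')
  !!(text =~ /#{pattern}/im)
end

# Illustrative record, mirroring the columns of the songs table.
SONG = {
  title:  'Imagine',
  artist: 'John Lennon',
  lyrics: 'Imagine all the people living life in peace'
}.freeze

# A song matches when any single column matches the whole pattern.
def song_matches?(song, search)
  song.values.any? { |field| like_match?(field, search) }
end

puts song_matches?(SONG, 'imagine people')  # true: terms in the stored order
puts song_matches?(SONG, 'people imagine')  # false: same terms, reversed
puts song_matches?(SONG, 'Imagine John')    # false: terms live in different columns
puts song_matches?(SONG, 'Imagine JONH')    # false: typo
```

Only the first search succeeds, which is the behavior the slides complain about: word order, column boundaries and typos all defeat a plain LIKE.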
INFORMATION
RETRIEVAL
Unstructured data
Large number of documents
IN THE PAST...
Listing all documents that match a search query
was enough…
However, in the Big Data era…
NOWADAYS…
Ranking documents by their relevance
to a search query is the most important goal.
Basic concepts
TOKENIZATION: Tokens ~> Words
"A list of words!" ~> A, list, of, words
Tokens: semantic units
"DON'T PANIC" ~> DON, T, PANIC
Is it enough to remove punctuation and spaces?
"do not panic" ~> do, not, panic
How to tokenize contractions?
Are all of them semantic units?
Are they the same tokens? (don't = do not)
"DON'T PANIC" ~> DON'T, PANIC
"Imagine Dragons" ~> Imagine, Dragons
or
"Imagine Dragons" ~> Imagine Dragons
Is it enough to remove punctuation and spaces?
"you-know-who" ~> you, know, who
or
"you-know-who" ~> you-know-who
Is it enough to remove punctuation and spaces?
"Fullstack developer" ~> Fullstack, developer
"Full-stack dev" ~> Full-stack, developer
"Full stack developer" ~> Full, stack, developer
For a user, these terms should return the same documents, shouldn't they?
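The tokenization questions above can be explored with a minimal Ruby sketch. The two helpers, `naive_tokens` and `lenient_tokens`, are hypothetical names introduced only for illustration; real analyzers (such as the ones shipped with Lucene) apply many more rules.

```ruby
# Naive tokenizer: every run of letters or digits becomes one token,
# so punctuation and spaces are simply dropped.
def naive_tokens(text)
  text.scan(/[[:alnum:]]+/)
end

# Variant that keeps apostrophes and hyphens when they sit inside a word.
def lenient_tokens(text)
  text.scan(/[[:alnum:]]+(?:['-][[:alnum:]]+)*/)
end

p naive_tokens("DON'T PANIC")       # => ["DON", "T", "PANIC"]
p lenient_tokens("DON'T PANIC")     # => ["DON'T", "PANIC"]
p naive_tokens("you-know-who")      # => ["you", "know", "who"]
p lenient_tokens("you-know-who")    # => ["you-know-who"]
p lenient_tokens("Full-stack dev")  # => ["Full-stack", "dev"]
```

Neither variant is "right": the choice decides whether "Full-stack" and "Full stack" end up as the same tokens, so it has to follow what users actually type.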
30 seconds to Mars = Thirty seconds to Mars
November 18th, 2017 = 2017-11-18
SP = São Paulo
How to deal with numbers and abbreviations?
About the semantics of the original term and its normalized token...
Kaminari: is it a gem or "thunder" in Japanese?
Windows: is it the plural of window or is it about the company?
音楽 (kanji for "music")
ONGAKU (romaji)
おんがく (hiragana)
SAME LANGUAGE, SAME PRONUNCIATION, DIFFERENT ALPHABETS
STOP WORDS: extremely common words
In English:
a, an, the, and, or, are, as, at, by,
for, from, of ...
In Portuguese:
um, uma, a, o, as, os, é, são, por, de,
da, do, se …
"A list of words" ~> list, words (meaningful tokens)
Stop words, diacritics, case folding...
Case folding normalization: "HELLO WORLD", "Hello World", "hello world" ~> hello, world
Diacritics removal: "naive", "naïve" ~> naive
Stop word removal: "roses are red", "red roses" ~> roses, red
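The three normalization steps above can be chained in a few lines of Ruby. This is a minimal sketch, assuming the short English stop word list shown earlier in the slides; the `normalize` helper and the `STOP_WORDS` constant are illustrative names, not a Solr API.

```ruby
# Stop words taken from the English list in the slides (illustrative subset).
STOP_WORDS = %w[a an the and or are as at by for from of].freeze

def normalize(text)
  text
    .unicode_normalize(:nfd)   # split accented letters into letter + combining mark
    .gsub(/\p{Mn}/, '')        # diacritics removal: drop the combining marks
    .downcase                  # case folding
    .scan(/[[:alnum:]]+/)      # tokenization
    .reject { |token| STOP_WORDS.include?(token) }  # stop word removal
end

p normalize('HELLO WORLD')    # => ["hello", "world"]
p normalize("na\u00EFve")     # "naïve" => ["naive"]
p normalize('roses are red')  # => ["roses", "red"]
```

The same function must run at indexing time and at query time; otherwise "naïve" in a document and "naive" in a query would never meet.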
When not to normalize tokens…
Strings composed solely of stop words:
- The Who (a band)
- Se (a Brazilian song by Djavan)
Words with different meanings with and without diacritics:
- In Spanish, peña means "cliff" and pena means "sorrow"
When not to set all characters to lowercase:
- General Motors
- Windows
- Apple
LEMMATIZATION / STEMMING
Reducing a token to its base form
LEMMATIZATION: based on a vocabulary
English:
am, are, is ~> be
car, cars, car's, cars' ~> car
Portuguese:
sou, somos, foi, é ~> ser
carros, carro ~> carro
STEMMING
A heuristic process that chops off the ends of words
cats ~> cat
ponies ~> poni
It increases the number of returned documents (higher recall).
However, it harms precision...
English:
operating system ~> operat, system
Portuguese:
amor, amores, amora ~> amor
(amor means "love"; amora is a Brazilian berry)
Not-so-meaningful tokens
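A toy suffix-stripping stemmer reproduces the English examples above. This is only a sketch with three ad hoc rules chosen for these examples; it is not the Porter algorithm nor any stemmer used by Solr, and `toy_stem` is a hypothetical name.

```ruby
# Toy stemmer: chops a few common English endings, with no vocabulary lookup.
def toy_stem(word)
  word = word.downcase
  return word.sub(/ies\z/, 'i') if word.end_with?('ies')
  return word.sub(/ing\z/, '')  if word.end_with?('ing') && word.length > 5
  return word.chomp('s')        if word.end_with?('s') && !word.end_with?('ss')
  word
end

p toy_stem('cats')       # => "cat"
p toy_stem('ponies')     # => "poni"
p toy_stem('operating')  # => "operat"
p toy_stem('system')     # => "system"
p toy_stem('news')       # => "new"
```

The last call shows the precision cost: "news" and "new" collapse into the same token, just as amor and amora do in the Portuguese example.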
SYNONYMS
bike = bicycle
indivíduo = pessoa (Portuguese: "individual" = "person")
Bag of words:
List of keywords
Ordering of words is ignored!
e.g.
Imagine Dragons
Dragons Imagine
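The difference between a bag-of-words match, where word order is ignored, and a phrase match, where order matters, can be sketched in Ruby. The helpers below are illustrative assumptions: the bag-of-words check requires every term to be present, which is only one possible interpretation (search engines usually also rank partial matches).

```ruby
def tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Bag of words: every query term must appear, in any position.
def bag_of_words_match?(document, query)
  (tokens(query) - tokens(document)).empty?
end

# Phrase query: the query terms must appear next to each other, in order.
def phrase_match?(document, query)
  terms = tokens(query)
  tokens(document).each_cons(terms.size).any? { |window| window == terms }
end

doc = 'Radioactive - Imagine Dragons'

puts bag_of_words_match?(doc, 'Dragons Imagine')  # true
puts phrase_match?(doc, 'Dragons Imagine')        # false
puts phrase_match?(doc, 'Imagine Dragons')        # true
```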
Phrase queries:
Order matters!
Restrict searches
e.g.
"Imagine Dragons"
RELEVANCE
term frequency (tf): the number of occurrences of a term in a document
inverse document frequency (idf): how rare a term is across all indexed documents
tf-idf = tf x idf
A function that balances how frequent a term is in a document
against how rare the term is in the collection
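A small worked example makes the tf x idf balance concrete. This sketch assumes the common idf = log(N / df) variant (N documents in the collection, df documents containing the term); the slides do not state which variant is meant, and Lucene-based engines use their own formulas. The `DOCS` collection is made up for illustration.

```ruby
# Tiny collection: document name => tokens (already normalized).
DOCS = {
  'Imagine'     => %w[imagine all the people living life in peace],
  'Radioactive' => %w[i am radioactive radioactive],
  'Jungle'      => %w[welcome to the jungle]
}.freeze

# Term frequency: occurrences of the term in one document.
def tf(term, doc_tokens)
  doc_tokens.count(term)
end

# Inverse document frequency, assumed here as log(N / df).
def idf(term, docs)
  df = docs.values.count { |doc_tokens| doc_tokens.include?(term) }
  return 0.0 if df.zero?
  Math.log(docs.size.to_f / df)
end

def tf_idf(term, doc_name, docs)
  tf(term, docs[doc_name]) * idf(term, docs)
end

puts tf_idf('radioactive', 'Radioactive', DOCS).round(3)  # frequent here, rare elsewhere
puts tf_idf('the', 'Imagine', DOCS).round(3)              # common across the collection
```

"radioactive" scores higher than "the" because it is repeated in one document and absent from the others, while "the" appears in two of the three documents.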
Boolean Model
Probabilistic Model
PageRank
...
Evaluating a
search model
Evaluation method
We need a test dataset with:
1. A document collection
2. A collection of queries
3. A set of relevance judgments: for each query, a list of relevant and
non-relevant documents
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative
ACCURACY = (TP + TN) / (TP + FP + TN + FN)
PRECISION = TP / (TP + FP)
# Correct Matches / # Total Results Returned
RECALL = TP / (TP + FN)
# Correct Matches / (# Correct Matches + # Missed Matches)
F1 SCORE = 2 * (RECALL * PRECISION) / (RECALL + PRECISION)
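As a worked example of these metrics, the sketch below compares the documents returned for one query against that query's relevance judgments. The `evaluate` helper and the document ids are hypothetical; F1 is computed as the harmonic mean of precision and recall.

```ruby
# returned: ids the search engine returned; relevant: ids judged relevant.
def evaluate(returned, relevant)
  tp = (returned & relevant).size  # correct matches
  fp = (returned - relevant).size  # returned but not relevant
  fn = (relevant - returned).size  # relevant but missed
  precision = (tp + fp).zero? ? 0.0 : tp.to_f / (tp + fp)
  recall    = (tp + fn).zero? ? 0.0 : tp.to_f / (tp + fn)
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
  { precision: precision, recall: recall, f1: f1 }
end

result = evaluate(%w[doc1 doc2 doc3 doc4], %w[doc1 doc2 doc5])
puts result[:precision].round(3)  # 2 of the 4 returned documents are relevant
puts result[:recall].round(3)     # 2 of the 3 relevant documents were returned
puts result[:f1].round(3)
```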
When is a model good enough for an app?
You can choose the model with the best F1 score, for
example.
However, there is no universal solution
It is an incremental process
You should tune it based on users' information needs
Usability tests are also a good way to evaluate a model
Adding a search
engine to my app
FULL TEXT SEARCH IN MARIADB...
CREATE TABLE `songs` (
`id` int NOT NULL AUTO_INCREMENT PRIMARY KEY,
`title` varchar(300),
`artist` varchar(255),
`genre` varchar(255),
`lyrics` text
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX songs_title_idx ON songs (title);
CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist);
CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics);
CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre);
FULL TEXT SEARCH IN MARIADB...
-- MATCH (title, artist, lyrics) needs a FULLTEXT index on exactly these columns
CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics);
-- Natural language mode (the default mode)
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE);
-- Boolean mode
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE);
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons');
Returned rows:
Radioactive - Imagine Dragons
Imagine - John Lennon
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics)
AGAINST ('+imagine +dragons' IN BOOLEAN MODE);
Returned rows:
Radioactive - Imagine Dragons
FULL TEXT SEARCH IN MARIADB...
SELECT *
FROM songs
WHERE MATCH (title,artist, lyrics)
AGAINST ('"imagine dragons"' IN BOOLEAN MODE);
Returned rows:
Radioactive - Imagine Dragons
FULL TEXT SEARCH IN MARIADB...
Sample data (artist - genre):
Imagine Dragons - Alternative Rock
John Lennon - Rock music, Pop music
SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative');
Returned rows: Imagine Dragons
SELECT * FROM songs WHERE MATCH(genre) AGAINST('music');
Returned rows: John Lennon
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('alternative' WITH QUERY EXPANSION);
SELECT * FROM songs WHERE MATCH(genre)
AGAINST('music' WITH QUERY EXPANSION);
Returned rows (both queries): Imagine Dragons, John Lennon
Why use an external search engine?
Spell checking! Or did you mean…
search like Google? ♡
You can also:
- Add multivalued fields (document-oriented database)
- Add new algorithms to the database
- Customize stop words and stemming analyzers
- Use fuzziness functions
- Boost some documents/fields according to the search
Apache Solr and Elasticsearch
Based on Apache Lucene
Document-oriented databases (welcome to polyglot persistence!)
They are not relational databases, ok? No ACID, sorry!
Developed to be scalable
Apache Solr has better documentation +50
ES has native support for a structured query DSL +1
ES is better for analytic queries
Elasticsearch DSL
// artist = John Lennon AND (genres = rock OR genres = pop)
// AND NOT(nome = imagine)
GET /songs/v1/_search
{
  "query": {
    "bool": {
      "must": {"match": {"artist": "John Lennon"}},
      "should": [
        {"match": {"genres": "rock"}},
        {"match": {"genres": "pop"}}
      ],
      "minimum_should_match": 1,
      "must_not": {"match": {"nome": "imagine"}}
    }
  }
}
Our choice: Apache Solr
Apache Solr is Open Source and Open Development +1000
Latest release at the time of this talk: 7.1.0 (October 17th, 2017)
Apache Solr
Installing for a development environment...
docker run --name my_solr -p 8983:8983 -d solr
https://hub.docker.com/r/risdenk/docker-solr/
localhost:8983
Creating a core
docker exec -it my_solr solr create_core -c development
core ~> database or table
document ~> a row from a table
schemaless!!
[Screenshot: Solr Admin UI showing the core name, the list of all cores, and the menu options for each core]
Creating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update/json/docs' \
  --data-binary '
{
  "id": "1",
  "title": "Song 1"
}'
Zero documents??
Check!
Commit!!
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
{
  "commit": {}
}'
Creating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
[{
  "id": "1",
  "title": "Song 1"
},{
  "title": "Song 2"
}]'
"id" is optional on insert
Our new documents!
title and title_str?
Dynamic fields: *_str, *_i, ...
Updating a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
[{
  "id": "1",
  "title": "Song 3"
},
{
  "title": "Song 3"
}]'
The document with "id": "1" is updated; the document without an id is added as a new doc
Documents menu - JSON
Deleting a document...
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update' \
  --data-binary '
{
  "delete": { "id": "1" },
  "commit": {}
}'
Deleting ALL documents
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/development/update?commit=true' \
  --data-binary '
{
  "delete": {
    "query": "*:*"
  }
}'
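The same create, update and delete payloads can be built from Ruby with only the standard library. This is a minimal sketch, assuming the `development` core and the URL used in the curl commands; `update_request` is a hypothetical helper that builds the request without sending it, so it runs even when Solr is not up.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed endpoint: the update handler of the "development" core, committing right away.
SOLR_UPDATE = URI('http://localhost:8983/solr/development/update?commit=true')

# Builds (but does not send) a JSON update request for Solr.
def update_request(payload)
  request = Net::HTTP::Post.new(SOLR_UPDATE, 'Content-Type' => 'application/json')
  request.body = JSON.generate(payload)
  request
end

add_docs   = update_request([{ id: '1', title: 'Song 1' }, { title: 'Song 2' }])
delete_one = update_request({ delete: { id: '1' } })
delete_all = update_request({ delete: { query: '*:*' } })

puts add_docs.body
puts delete_one.body
puts delete_all.body

# To actually send a request (requires a running Solr):
# Net::HTTP.start(SOLR_UPDATE.host, SOLR_UPDATE.port) { |http| http.request(add_docs) }
```

Generating the body with `JSON.generate` avoids hand-written JSON mistakes such as a trailing comma, which Solr would reject.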
Documents menu - Solr command
Searching
q     (query)         main query parameter
fq    (filter query)  filter query (to reduce the dataset)
fl    (field list)    list of fields to return
sort                  list of fields to sort the results by
Results are paginated
Basic queries
List all documents (with pagination)
curl 'http://localhost:8983/solr/development/select?q=*:*'
Basic queries
List all documents (with pagination), using the JSON Request API
curl http://localhost:8983/solr/development/select -d '
{
  "query": "*:*"
}'
My documents
{
"docs": [
{
"title": ["Song 1"],
"genre": "Rock",
"year": 2010
},
{
"title": ["Song 2"],
"genre": "MPB",
"year": 1990
}, {
"title": ["Other music Rock"],
"genre": "Pop",
"year": 1970
},
{
"title": ["My favorite songs"],
"genre": "Rock Music",
"year": 2011
}
]
}
Fuzzy matching
title:Song*                 3 documents
title:Song?                 1 document
title:Sonjs                 0 documents
title:Sonjs~1               1 document
title:Sonjs~2               3 documents
title:(my songs)            1 document
title:"my songs"            0 documents
title:"my songs"~2          1 document
title:(-favorite +song*)    2 documents
*:*                         4 documents
Wildcards:
?     exactly one character
*     any number of characters
~     maximum edit distance (after a term) or slop (after a phrase)
( )   keyword search
" "   phrase query
Boolean operators, ranges and boosting
title:"song" AND genre:"rock"                        1 document
(title:"song" AND genre:"rock") OR title:"track"     2 documents
year:[1980 TO *]                                     3 documents
genre:[Pop TO *]                                     3 documents
Boosting:
(title:music OR title:Rock)^1.5 (genre:music OR genre:Rock)    3 documents
1st: "Other music Rock"
(title:music OR title:Rock) (genre:music OR genre:Rock)^1.5    3 documents
1st: "My favorite songs"
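The `~1` and `~2` fuzzy queries on `title:Sonjs` can be understood through edit distance. The sketch below implements plain Levenshtein distance as an approximation (Lucene's fuzzy queries are based on a Damerau-Levenshtein variant, which also counts transpositions). `TITLE_TOKENS` is an assumed list of lowercased word tokens from the sample titles, and `fuzzy_matches` is a hypothetical helper.

```ruby
# Plain Levenshtein distance: insertions, deletions and substitutions cost 1 each.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Distinct word tokens of the four sample titles, lowercased (assumption).
TITLE_TOKENS = %w[song other music rock my favorite songs].freeze

def fuzzy_matches(term, max_edits)
  TITLE_TOKENS.select { |token| edit_distance(term, token) <= max_edits }
end

p fuzzy_matches('sonjs', 1)  # => ["songs"]
p fuzzy_matches('sonjs', 2)  # => ["song", "songs"]
```

With one edit only "songs" is reachable (one document); with two edits "song" matches too, which brings in the two "Song N" titles (three documents).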
Searching in all fields
In your schema.xml, add:
<copyField source="*_txt" dest="_text_" />
<copyField source="*_text" dest="_text_" />
You can also add the following, but it is not recommended:
<copyField source="*" dest="_text_" />
Then, you can search without specifying a field
Analysis: lists all indexing and querying transformations
[Screenshot: Analysis screen showing the indexing transformations and the querying transformations]
Customizing fields and their analyzers (schema.xml)
<fieldType name="phonetic" stored="false" indexed="true"
           class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Remove stop words before the phonetic filter replaces tokens with their codes -->
    <filter class="solr.StopFilterFactory" format="snowball"
            words="lang/stopwords_pt.txt"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
  </analyzer>
</fieldType>
Spell Checking
Building the spell checking index
curl --request GET --url \
  'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=true&spellcheck=true'
Searching: IMAGINA ~> Suggestion: IMAGINE
Searching: DRAGOONS ~> Suggestion: DRAGONS
Searching: IMAGINA DRAGOONS ~> Suggestion: IMAGINE DRAGONS
Integrating with
Ruby on Rails
Connecting through a REST client…
require 'rest-client'
require 'json'

params = { q: 'title:song' }
# Hash#to_param is provided by ActiveSupport (available in Rails)
response = RestClient.get(
  "http://localhost:8983/solr/development/select?#{params.to_param}"
)
response_json = JSON.parse(response.body)
items = response_json["response"]["docs"]
# => [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d",
#     "_version_"=>1584234836063682560, "title_str"=>["Song 1"]},
#     {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44",
#     "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}]
ヽ(•́o•̀)ノ
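Outside Rails, the query string for the `/select` handler can be built with the standard library alone. The sketch below is an assumption-laden illustration: `select_url` is a hypothetical helper, and the sample response is a hand-written stand-in for what Solr returns, so nothing here needs a running server.

```ruby
require 'json'
require 'uri'

# Builds the /select URL for a core; q, fl and sort are Solr's own parameter names.
def select_url(core, params, host: 'localhost', port: 8983)
  URI::HTTP.build(
    host: host,
    port: port,
    path: "/solr/#{core}/select",
    query: URI.encode_www_form(params)  # escapes ":", "," and spaces
  )
end

url = select_url('development', { q: 'title:song', fl: 'id,title', sort: 'id asc' })
puts url

# Hand-written sample of the JSON structure returned by /select (illustrative only).
sample = '{"response":{"numFound":1,"docs":[{"id":"1","title":["Song 1"]}]}}'
docs = JSON.parse(sample)['response']['docs']
puts docs.first['title'].first
```

Hand-rolling requests like this works, but every filter, boost and pagination option has to be encoded manually, which is the motivation for a dedicated gem.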
Sunspot Gem
V. 2.2.7
Installing...
# Gemfile
gem 'sunspot_rails'
# Terminal
rails generate sunspot_rails:install
config/sunspot.yml
development:
  solr:
    hostname: solr
    port: 8983
    path: /solr/playax
    log_level: INFO
  auto_index_callback: after_commit
  auto_remove_callback: after_commit
Sunspot needs its own schema.xml.
See the example in:
elainenaomi/search_engine
Sunspot DSL - Defining the indexed fields
class Song < ActiveRecord::Base
  searchable do
    text :title, stored: true
    text :lyrics, stored: false
    text :artist, stored: true
    # Needed by the with(:year) filters used in the searches
    integer :year
    string :genre, multiple: true, stored: true do
      genre.split(',')
    end
  end
end
# Index all songs and commit
Sunspot.index! Song.all
Bag of words:
search = Song.search do
  fulltext 'imagine dragons'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
# Results:
#   Imagine (John Lennon)
#   Radioactive (Imagine Dragons)
Phrase queries:
search = Song.search do
  fulltext '"imagine dragons"'
  with :genre, 'Rock'
  without :genre, 'Pop'
  with(:year).less_than 2014
  field_list :title, :artist
  order_by :title, :asc
end
songs = search.results
# Results:
#   Radioactive (Imagine Dragons)
Query Phrase Slop
# Two words can appear between the words in the phrase, so
# "imagine all the people" also matches, in addition to "imagine people"
Song.search do
  fulltext '"imagine people"' do
    fields :lyrics
    query_phrase_slop 2
  end
end
Minimum Match
Song.search do
  fulltext "dragons imagine test" do
    fields :artist, :title
    minimum_match '70%'
  end
end
# 70% of 3 terms = 2.1, rounded down: 2 terms must match
# 1 document:
#   Radioactive (Imagine Dragons)
Song.search do
  fulltext 'dragons imagine test' do
    fields :artist, :title
    boost_fields title: 2.0
    minimum_match '60%'
  end
end
# 60% of 3 terms = 1.8, rounded down: 1 term must match
# 2 documents (the title boost ranks the title match first):
#   1st: Imagine (John Lennon)
#   2nd: Radioactive (Imagine Dragons)
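The "rounded down" arithmetic behind the two minimum match results can be checked with a few lines of Ruby. This is a sketch of the percentage rule only, not Solr's implementation; the helpers and the token lists for the artist and title fields are assumptions made for the example.

```ruby
# Number of query terms that must match for a percentage-based minimum match.
def required_terms(term_count, percent)
  (term_count * percent) / 100  # integer division rounds down
end

def matches?(document_tokens, query_terms, percent)
  matched = query_terms.count { |term| document_tokens.include?(term) }
  matched >= required_terms(query_terms.size, percent)
end

QUERY   = %w[dragons imagine test].freeze
LENNON  = %w[imagine john lennon].freeze          # title + artist tokens (assumed)
DRAGONS = %w[radioactive imagine dragons].freeze  # title + artist tokens (assumed)

puts required_terms(3, 70)          # 2
puts required_terms(3, 60)          # 1
puts matches?(LENNON, QUERY, 70)    # false: only "imagine" matches
puts matches?(DRAGONS, QUERY, 70)   # true: "dragons" and "imagine" match
puts matches?(LENNON, QUERY, 60)    # true: one matching term is enough
```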
Spell checking
search = Sunspot.search(Song) do
  keywords 'Imagina Dragoons'
  spellcheck :count => 3
end
search.spellcheck_suggestion_for('imagina') # => 'imagine'
search.spellcheck_suggestions
# => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]
To test or not to test?
Unit tests? No.
Integration tests? Maybe…
Search engines depend on term frequency to rank documents
You would need your whole dataset to compute precision, recall...
You can test only filter queries, indexing callbacks…
Summary
The searching problem
● User vs. a buggy search tool
Adding a search engine to my app
● Full text search in MariaDB
● Apache Solr vs. Elasticsearch
Apache Solr
● How to create cores
● CRUD operations
Integrating with Rails
● Sunspot gem
● How to index, search and test
Keep in mind
Always verify your users' information needs
E.g.: check whether stop word removal and synonyms should be applied
"No" - Meghan Trainor
"I am" - P.O.D
E.g.: decide which transformations your search engine should apply
- Phonetic transformations? Custom language analyzers?
The information is not only in text files but also in
audio, video, images, etc.
Suggested topics for studying
- Evaluation of the available analyzers for FTS
- Performance optimization (such as soft commits and lazily built indexes)
- Distribution and replication through SolrCloud
- Use of machine learning algorithms
- Creation of custom function queries
- Authentication
- Integration with Logstash and Kibana
- Geospatial searches
Thank you! <3
github.com/elainenaomi
slideshare.net/elainenaomi
@elaine_nw
References
Introduction to Information Retrieval
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008)
Solr in action
Grainger, Trey, Timothy Potter, and Yonik Seeley (2014)
Sunspot gem
http://sunspot.github.io/
Uma introdução ao tema recuperação de informações textuais.
Barth, F. J. (2013)
10 Reasons to Choose Apache Solr Over Elasticsearch (2016)
https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
Apache Solr vs Elasticsearch
http://solr-vs-elasticsearch.com/
When to consider Solr
https://stackoverflow.com/questions/4960952/when-to-consider-solr
Indexing for full text search in PostgreSQL
https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/
PolyglotPersistence
https://martinfowler.com/bliki/PolyglotPersistence.html
Yahoo! Answers: Qual o nome desta Música?
https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
Full-Text Index in MariaDB
https://mariadb.com/kb/en/library/full-text-index-overview/
Natural Language Full-Text Searches (MySQL)
https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
Postgres full-text search is Good Enough! (2015)
http://rachbelaid.com/postgres-full-text-search-is-good-enough/
Text Indexes in MongoDB
https://docs.mongodb.com/manual/core/index-text/
Full-Text Index Stopwords for MariaDB
https://mariadb.com/kb/en/library/full-text-index-stopwords/

Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Dealing with a search engine in your application - a Solr approach for beginners

  • 1. Dealing with a search engine in your application: a Solr approach for beginners. Elaine Naomi Watanabe
  • 2. Elaine Naomi Watanabe Full-stack developer (Playax) Master's degree in Computer Science (IME-USP) Passionate about: Web Development, Agile, Cloud Computing, DevOps, NoSQL and RDBMS
  • 3. ANALYZING BILLIONS OF DATA TO HELP ARTISTS AND MUSIC PROFESSIONALS TO DEVELOP THEIR AUDIENCE BIG DATA + MUSIC + TECH = <3
  • 5. SPOILER ALERT (agenda): Searching Problem: Introduction; Information Retrieval: Basic concepts; Apache Solr: How to configure; Sunspot Gem: Integrating with Ruby on Rails; Next Steps: Including references
  • 7. or did you mean... HOW TO SEARCH LIKE GOOGLE?
  • 8. A LIST OF SONGS… TITLES, ARTISTS, LYRICS... IMAGINE IN OUR CONTEXT…
  • 9. "Imagine all the people living life in peace" (Imagine - John Lennon) "I'm radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) A very small sample of songs...
  • 10. Is a SQL LIKE statement enough? SEARCHING TERM: "IMAGINE" SELECT * FROM songs WHERE title LIKE '%IMAGINE%' OR artist LIKE '%IMAGINE%' OR lyrics LIKE '%IMAGINE%';
  • 11. "Imagine all the people living life in peace" (Imagine - John Lennon) "I'm radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) Searching for "Imagine"
  • 12. Is a SQL LIKE statement really enough?? SEARCHING TERM: "IMAGINE PEOPLE"
  • 13. "Imagine all the people living life in peace" (Imagine - John Lennon) "I'm radioactive, radioactive" (Radioactive - Imagine Dragons) "Welcome to the jungle, watch it bring you to your knees" (Welcome to the jungle - Guns N' Roses) A very small sample of songs...
  • 14. Is a SQL LIKE statement really enough?? SEARCHING TERM: "IMAGINE PEOPLE" SELECT * FROM songs WHERE title LIKE '%IMAGINE%PEOPLE%' OR artist LIKE '%IMAGINE%PEOPLE%' OR lyrics LIKE '%IMAGINE%PEOPLE%';
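To see why the pattern above is fragile, here is a minimal Ruby sketch that emulates LIKE with a regular expression. The helper names (`like_to_regexp`, `like?`) are hypothetical, and case-insensitive matching is an assumption (it is the behaviour of common case-insensitive collations, not of every database).

```ruby
# Hypothetical helper: translate a SQL LIKE pattern into a Ruby Regexp.
# '%' matches any run of characters, '_' matches exactly one character.
def like_to_regexp(pattern)
  source = pattern.chars.map do |ch|
    case ch
    when '%' then '.*'
    when '_' then '.'
    else Regexp.escape(ch)
    end
  end.join
  # Assumption: case-insensitive comparison, as with a *_ci collation.
  Regexp.new("\\A#{source}\\z", Regexp::IGNORECASE | Regexp::MULTILINE)
end

def like?(text, pattern)
  like_to_regexp(pattern).match?(text)
end

LYRICS = 'Imagine all the people living life in peace'

puts like?(LYRICS, '%IMAGINE%PEOPLE%') # the words appear in this order
puts like?(LYRICS, '%PEOPLE%IMAGINE%') # same words, different order
puts like?(LYRICS, '%IMAGINE%POEPLE%') # a single typo
```

Only the first call matches: LIKE has no notion of words, so a different word order or a typo is enough to miss the song.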
  • 15. USER x YOUR APP: A BUG SEARCH TOOL
  • 16. When a LIKE statement is not enough... SEARCH TERMS: "Dragons Imagine", "Imagine John", "Imagine JONH" <- TYPO!
  • 17. When only Yahoo! Answers is the solution... Ueca tudi diango... tanananananananaann nisss ♬ welcome to the jungle, watch it bring you to your knees ♬ (╯°▽°)╯ ︵ ┻━┻
  • 20. IN THE PAST... Listing all documents that match a search query was enough… However, in the Big Data era…
  • 21. NOWADAYS… Ranking documents by their relevance to a search query is the most important goal.
  • 23. TOKENIZATION: Tokens ~> Words. "A list of words!" → A | list | of | words. Tokens: semantic units
  • 24. "DON'T PANIC" → DON | T | PANIC or DON'T | PANIC? "do not panic" → do | not | panic. Is it enough to remove punctuation and spaces? How to tokenize contractions? Are all of them semantic units? Are they the same tokens (don't = do not)?
  • 25. "Imagine Dragons" → Imagine | Dragons or Imagine Dragons (a single token)? Is it enough to remove punctuation and spaces?
  • 26. "you-know-who" → you | know | who or you-know-who (a single token)? Is it enough to remove punctuation and spaces?
  • 27. "Fullstack developer" → Fullstack | developer. "Full stack developer" → Full | stack | developer. "Full-stack dev" → Full-stack | dev. Is it enough to remove punctuation and spaces? For a user, these terms should return the same documents, shouldn't they?
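The questions on the tokenization slides can be reproduced with a deliberately naive tokenizer. This is only a sketch: `naive_tokenize` is a hypothetical helper that lowercases the text and splits on anything that is not a letter or a digit, which is not how Solr's tokenizers work in detail.

```ruby
# Naive tokenizer: lowercase, then split on every run of characters
# that is neither a letter nor a digit.
def naive_tokenize(text)
  text.downcase.split(/[^\p{L}\p{N}]+/).reject(&:empty?)
end

p naive_tokenize('A list of words!')     # ["a", "list", "of", "words"]
p naive_tokenize("DON'T PANIC")          # ["don", "t", "panic"]
p naive_tokenize('Fullstack developer')  # ["fullstack", "developer"]
p naive_tokenize('Full-stack developer') # ["full", "stack", "developer"]
```

The last two calls show the problem: "Fullstack" and "Full-stack" end up as different tokens, so a query for one would not find documents containing the other.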
  • 28. 30 seconds to Mars Thirty seconds to Mars November, 18th, 2017 2017-11-18 SP São Paulo How to deal with numbers and abbreviations?
  • 29. Kaminari: is it a gem or "thunder" in Japanese? Windows: is it the plural of window or the brand name? About the semantics of the original term and its normalized token...
  • 30. 音楽 / ONGAKU / おんがく: SAME LANGUAGE, SAME PRONUNCIATION, DIFFERENT ALPHABETS
  • 31. STOP WORDS: extremely common words In English: a, an, the, and, or, are, as, at, by, for, from, of ... In Portuguese: um, uma, a, o, as, os, é, são, por, de, da, do, se …
  • 32. STOP WORDS: extremely common words. A | list | of | words → list | words (meaningful tokens)
  • 33. Stop words, diacritics, case folding... Case folding normalization: HELLO WORLD, Hello World, hello world → hello world. Diacritics removal: naive, naïve → naive. Stop word removal: roses are red, red roses → roses, red
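The three normalizations above can be chained in a few lines of Ruby. This is a minimal sketch: the stop word list is the short English sample from the stop words slide, and `strip_diacritics` and `normalize` are hypothetical helper names.

```ruby
# Short English stop word sample (taken from the stop words slide).
STOP_WORDS = %w[a an the and or are as at by for from of].freeze

# Decompose accented characters (NFD) and drop the combining marks.
def strip_diacritics(token)
  token.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Case folding, tokenization, diacritics removal, stop word removal.
def normalize(text)
  text.downcase.split(/[^\p{L}\p{N}]+/).reject(&:empty?)
      .map { |token| strip_diacritics(token) }
      .reject { |token| STOP_WORDS.include?(token) }
end

p normalize('HELLO World')   # ["hello", "world"]
p normalize("na\u00EFve")    # ["naive"]
p normalize('Roses are red') # ["roses", "red"]
```

A real analyzer chain makes each of these steps configurable, which matters for the cases where normalization should not be applied.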
  • 34. When not to normalize tokens… Strings composed solely of stop words: The Who (a band), Se (Brazilian song, by Djavan). Different meanings for words with and without diacritics, e.g. in Spanish: peña means "cliff", pena means "sorrow". When not to set all characters to lowercase: General Motors, Windows, Apple
  • 35. LEMMATIZATION / STEMMING To reduce a token to its base form
  • 36. LEMMATIZATION: based on a vocabulary. English: am, are, is → be; car, cars, car's, cars' → car. Portuguese: sou, somos, foi, é → ser; carros, carro → carro
  • 37. STEMMING Heuristic process that chops off the ends of words: cats → cat, ponies → poni. It increases the number of returned documents, but it harms precision...
  • 38. STEMMING Heuristic process that chops off the ends of words. Portuguese: amor (it means love), amores, amora (it's a Brazilian berry) → amor. English: operating system → operat system. Not so meaningful tokens
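A tiny rule-based stemmer shows where stems such as "poni" come from. The sketch below implements only the four plural-suffix rules that form step 1a of the Porter stemmer, as presented in the Manning et al. book listed in the references; `stem_plural` is a hypothetical name and this is not the stemmer Solr ships.

```ruby
# Plural-suffix rules (Porter stemmer, step 1a). The first matching rule wins.
RULES = [
  [/sses\z/, 'ss'],
  [/ies\z/,  'i'],
  [/ss\z/,   'ss'],
  [/s\z/,    '']
].freeze

def stem_plural(token)
  RULES.each do |suffix, replacement|
    return token.sub(suffix, replacement) if token.match?(suffix)
  end
  token
end

p stem_plural('cats')   # "cat"
p stem_plural('ponies') # "poni"
p stem_plural('caress') # "caress"
```

The rules know nothing about vocabulary, which is why the result may not be a real word and why unrelated words can collapse into the same stem.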
  • 40. Bag of words: a list of keywords. Ordering of words is ignored! e.g. Imagine Dragons = Dragons Imagine. Phrase queries: order matters! They restrict searches, e.g. "Imagine Dragons"
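The difference between the two query styles can be sketched in plain Ruby. The helpers `tokens`, `bag_match?` and `phrase_match?` are hypothetical, and the bag-of-words version assumes that every query term is required, which is only one possible configuration.

```ruby
def tokens(text)
  text.downcase.split(/[^\p{L}\p{N}]+/).reject(&:empty?)
end

# Bag of words: every query term must occur somewhere, in any order.
def bag_match?(doc, query)
  (tokens(query) - tokens(doc)).empty?
end

# Phrase query: the query terms must occur consecutively and in order.
def phrase_match?(doc, query)
  wanted = tokens(query)
  return false if wanted.empty?
  tokens(doc).each_cons(wanted.size).any? { |window| window == wanted }
end

DOC = 'Radioactive - Imagine Dragons'

puts bag_match?(DOC, 'Dragons Imagine')    # true
puts phrase_match?(DOC, 'Dragons Imagine') # false
puts phrase_match?(DOC, 'Imagine Dragons') # true
```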
  • 41. RELEVANCE term frequency (tf): number of occurrences of a term in a document. inverse document frequency (idf): how rare a term is across all indexed documents
  • 42. RELEVANCE tf-idf = tf x idf: a function that balances how frequent a term is in a document against how rare the term is in the collection
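A worked example makes the balance concrete. This sketch assumes one common variant, tf as a raw count and idf as the natural logarithm of (number of documents / documents containing the term); the scoring formula actually used by Lucene and Solr differs. The three tiny "documents" are made up from the sample songs.

```ruby
# Toy collection: song name => tokens of a lyrics snippet.
DOCS = {
  'Imagine'     => %w[imagine all the people living life in peace],
  'Radioactive' => %w[i am radioactive radioactive],
  'Jungle'      => %w[welcome to the jungle]
}.freeze

# Term frequency: raw count of the term in one document.
def tf(term, doc_tokens)
  doc_tokens.count(term)
end

# Inverse document frequency: log(N / number of documents containing the term).
def idf(term, docs)
  containing = docs.values.count { |doc_tokens| doc_tokens.include?(term) }
  return 0.0 if containing.zero?
  Math.log(docs.size.to_f / containing)
end

def tf_idf(term, name, docs)
  tf(term, docs.fetch(name)) * idf(term, docs)
end

puts tf_idf('radioactive', 'Radioactive', DOCS) # rare and repeated: high score
puts tf_idf('the', 'Imagine', DOCS)             # appears in 2 of 3 docs: low score
```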
  • 45. Evaluation method We need a test dataset with: 1. A document collection 2. A collection of queries 3. A set of relevance judgments, for each query, a list of relevant and non-relevant documents TP: True Positive TN: True Negative FP: False Positive FN: False Negative
  • 46. ACCURACY = (TP + TN) / (TP + FP + TN + FN)
  • 47. PRECISION = TP / (TP + FP) = # Correct Matches / # Total Results Returned
  • 48. RECALL = TP / (TP + FN) = # Correct Matches / (# Correct Matches + # Missed Matches)
  • 49. F1 SCORE = 2 * (RECALL * PRECISION) / (RECALL + PRECISION)
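The four metrics can be computed directly from the relevance judgments. In this sketch `evaluate` is a hypothetical helper, documents are identified by made-up integer ids, and F1 is taken as the harmonic mean of precision and recall.

```ruby
# returned: ids the search engine returned for a query
# relevant: ids judged relevant for that query
# collection_size: total number of documents in the test collection
def evaluate(returned, relevant, collection_size)
  tp = (returned & relevant).size          # returned and relevant
  fp = (returned - relevant).size          # returned but not relevant
  fn = (relevant - returned).size          # relevant but missed
  tn = collection_size - tp - fp - fn      # correctly left out

  precision = tp.zero? ? 0.0 : tp.to_f / (tp + fp)
  recall    = tp.zero? ? 0.0 : tp.to_f / (tp + fn)
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)

  { accuracy: (tp + tn).to_f / collection_size,
    precision: precision, recall: recall, f1: f1 }
end

# 10 documents in total; the engine returned 4, of which 2 are relevant,
# and it missed 1 relevant document.
p evaluate([1, 2, 3, 4], [1, 2, 5], 10)
```

Accuracy is usually the least informative of the four for search, because most documents in a collection are true negatives for any given query.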
  • 50. When is a model good enough for an app? You can choose the model with the best F1 score, for example. However, there is no universal solution. It is an incremental process. You should tune it based on users' information needs. Usability tests are also a good way to evaluate a model
  • 52. FULL TEXT SEARCH IN MARIADB... CREATE TABLE `songs` ( `id` int NOT NULL AUTO_INCREMENT PRIMARY KEY, `title` varchar(300), `artist` varchar(255), `genre` varchar(255), `lyrics` text ) ENGINE=InnoDB; CREATE FULLTEXT INDEX songs_title_idx ON songs (title); CREATE FULLTEXT INDEX songs_artist_idx ON songs (artist); CREATE FULLTEXT INDEX songs_lyrics_idx ON songs (lyrics); CREATE FULLTEXT INDEX songs_genre_idx ON songs (genre); FTS
  • 53. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN NATURAL LANGUAGE MODE); SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine' IN BOOLEAN MODE); CREATE FULLTEXT INDEX songs_all_idx ON songs (title,artist,lyrics); default mode
  • 54. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('imagine dragons'); Returned rows: Radioactive - Imagine Dragons Imagine - John Lennon
  • 55. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('+imagine +dragons' IN BOOLEAN MODE); Radioactive - Imagine Dragons
  • 56. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH (title,artist, lyrics) AGAINST ('"imagine dragons"' IN BOOLEAN MODE); Radioactive - Imagine Dragons
  • 57. FULL TEXT SEARCH IN MARIADB... SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music'); SELECT * FROM songs WHERE MATCH(genre) AGAINST('alternative' WITH QUERY EXPANSION); SELECT * FROM songs WHERE MATCH(genre) AGAINST('music' WITH QUERY EXPANSION); Imagine Dragons John Lennon Imagine Dragons John Lennon Imagine Dragons - Alternative Rock John Lennon - Rock music, Pop music
  • 58. Why use an external search engine? Spell checking! Spell checking! Or did you mean… search like Google? ♡
  • 59. Why use an external search engine? You can use spell checking! You can also: - Add multivalued fields (document-oriented database) - Add new algorithms to the databases - Customize stop words, stemming analyzers - Use fuzziness functions - Boost some documents/fields according to the search
  • 60. Apache Solr and ElasticSearch: Based on Apache Lucene. Document-oriented databases (welcome to polyglot persistence!). They are not relational databases, ok? No ACID, sorry! Developed to be scalable. Apache Solr has better documentation +50. ES has native support for a structured query DSL +1. ES is better for analytic queries
  • 61. ElasticSearch DSL // artist = John Lennon AND (genres = rock OR genres = pop) // AND NOT(nome = imagine) GET /songs/v1/_search { "query" : { "bool": { "must": {"match": {"artist": "John Lennon" }}, "should": [ {"match": {"genres": "rock" }}, {"match": {"genres": "pop" }} ], "minimum_should_match": 1, "must_not": {"match": {"nome": "imagine"}} } } }
  • 62. Our choice: Apache Solr Apache Solr is Open Source and Open Development +1000 Latest release: 7.1.0 (October 17th, 2017)
  • 64. Installing for development environment... docker run --name my_solr -p 8983:8983 -d solr https://hub.docker.com/r/risdenk/docker-solr/
  • 66. Creating a core: docker exec -it my_solr solr create_core -c development (development is the core name). core ~> database or table; document ~> a row from a table; schemaless!!
  • 67. List of all cores
  • 68. Menu options for each core
  • 69. Creating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update/json/docs' --data-binary ' { "id": "1", "title": "Song 1" }'
  • 71. Commit!! curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' { "commit": {} }'
  • 72. Creating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' [{ "id": "1", "title": "Song 1" },{ "title": "Song 2" }]' (the "id" field is optional on insert)
  • 74. Our new documents! title and title_str? dynamic fields *_str, *_i, ...
  • 75. Updating a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' [{ "id": "1", "title": "Song 3" }, { "title": "Song 3" }]'
  • 78. Deleting a document... curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update' --data-binary ' { "delete": { "id":"1" }, "commit": {} }'
  • 79. Deleting ALL documents curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/development/update?commit=true' --data-binary ' { "delete": { "query": "*:*" } }'
  • 80. Documents menu - Solr command
  • 82. QUERY q (query): main query parameter. fq (filter query): filter query (to reduce the dataset). fl (field list): list of fields to return. sort: list of fields to sort the dataset by. Results are paginated
  • 83. Basic queries List all documents (with pagination) curl 'http://localhost:8983/solr/development/select?q=*:*'
  • 84. Basic queries List all documents (with pagination) curl http://localhost:8983/solr/development/select -d ' { query:"*:*" }'
  • 85. My documents { "docs": [ { "title": ["Song 1"], "genre": "Rock", "year": 2010 }, { "title": ["Song 2"], "genre": "MPB", "year": 1990 }, { "title": ["Other music Rock"], "genre": "Pop", "year": 1970 }, { "title": ["My favorite songs"], "genre": "Rock Music", "year": 2011 } ] }
  • 86. Fuzzy matching title:Song* → 3 documents; title:Song? → 1 document; title:Sonjs → 0 documents; title:Sonjs~1 → 1 document; title:Sonjs~2 → 3 documents; title:(my songs) → 1 document; title:"my songs" → 0 documents; title:"my songs"~2 → 1 document; title:(-favorite +song*) → 2 documents; *:* → 4 documents. Syntax: ? matches one character; * matches any number of characters; ~ sets the maximum edit distance after a term and the slop after a phrase; ( ) keyword search; " " phrase query
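The ~1 and ~2 results above follow from the edit distance between the query term and the indexed terms. Below is a minimal sketch using plain Levenshtein distance (insertions, deletions, substitutions); Lucene's fuzzy queries may also count a transposition as a single edit, so treat this as an approximation. `edit_distance` and the `TERMS` list (lowercased words from the sample titles) are illustrative.

```ruby
# Levenshtein distance computed row by row (dynamic programming).
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,        # deletion
                  current[j - 1] + 1,     # insertion
                  previous[j - 1] + cost  # substitution
                 ].min
    end
    previous = current
  end
  previous.last
end

# Lowercased terms from the titles in the "My documents" slide.
TERMS = %w[song other music rock my favorite songs].freeze

p TERMS.select { |term| edit_distance('sonjs', term) <= 1 } # ["songs"]
p TERMS.select { |term| edit_distance('sonjs', term) <= 2 } # ["song", "songs"]
```

With a maximum distance of 1 only "songs" matches (one document); with 2, "song" matches as well, which adds the two "Song N" titles.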
  • 87. Fuzzy matching title:"song" AND genre:"rock" → 1 document; (title:"song" AND genre:"rock") OR title:"track" → 2 documents; year:[1980 TO *] → 3 documents; genre:[Pop TO *] → 3 documents. Boosting: (title:music OR title:Rock)^1.5 (genre:music OR genre:Rock) → 3 documents, 1st: "Other music Rock"; (title:music OR title:Rock) (genre:music OR genre:Rock)^1.5 → 3 documents, 1st: "My favorite songs"
  • 88. Searching in all fields In your schema.xml, add: <copyField source="*_txt" dest="_text_" /> <copyField source="*_text" dest="_text_" /> You can add but it is not recommended: <copyField source="*" dest="_text_" /> Then, you can search without defining the default field
  • 89. Analysis: list all indexing and querying transformations Indexing Transform. Querying Transform.
  • 90. Customizing fields and their analyzers (schema.xml) <fieldtype name="phonetic" stored="false" indexed="true" class="solr.TextField" > <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" /> <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/> </analyzer> </fieldtype>
  • 92. Building the spell checking index: curl --request GET --url 'http://localhost:8983/solr/development/select?q=*:*&spellcheck.build=true&spellcheck=true'
  • 97. Connecting through a REST Client … params = {q: 'title:song' } response = RestClient.get "http://localhost:8983/solr/development/select?#{params.to_param}" response_json = JSON.parse(response.body) items = response_json["response"]["docs"] # => [{"title"=>["Song 1"], "id"=>"eeb507c6-461f-4219-9f5a-50528340d84d", "_version_"=>1584234836063682560, "title_str"=>["Song 1"]}, {"title"=>["Song 2"], "id"=>"1b8bacc1-9ed9-4c85-922d-71b3472f9d44", "_version_"=>1584234836065779712, "title_str"=>["Song 2"]}] ヽ(•́o•̀)ノ
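Outside Rails, where `to_param` is not available, the same select URL can be assembled with Ruby's standard library. This is a sketch: `solr_select_uri` is a hypothetical helper, and the host, port and core name are the ones used in the slides.

```ruby
require 'uri'

# Build the URL of a Solr select request for the given core and parameters.
# URI.encode_www_form takes care of escaping characters such as ':' and ','.
def solr_select_uri(core, params, host: 'localhost', port: 8983)
  URI::HTTP.build(
    host: host,
    port: port,
    path: "/solr/#{core}/select",
    query: URI.encode_www_form(params)
  )
end

uri = solr_select_uri('development', { q: 'title:song', fl: 'id,title', rows: 10 })
puts uri
```

With a Solr instance running, the resulting `uri` could be passed to `Net::HTTP.get(uri)` (after `require 'net/http'`) and the body parsed with `JSON.parse`, in the same way as the RestClient example.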
  • 99. Installing... gem 'sunspot_rails' rails generate sunspot_rails:install development: solr: hostname: solr port: 8983 path: /solr/playax log_level: INFO auto_index_callback: after_commit auto_remove_callback: after_commit config/sunspot.yml
  • 100. Sunspot needs its own schema.xml. Follow this example in: elainenaomi/search_engine
  • 101. Sunspot DSL - Defining the indexed fields class Song < ActiveRecord::Base searchable do text :title, stored: true text :lyrics, stored: false text :artist, stored: true string :genre, multiple: true, stored: true do genre.split(',') end end end Sunspot.index! Song.all
  • 102. Bag of words: search = Song.search do fulltext 'imagine dragons' with :genre, 'Rock' without :genre, 'Pop' with(:year).less_than 2014 field_list :title, :artist order_by :title, :asc end songs = search.results Imagine (John Lennon) Radioactive ( Imagine Dragons)
  • 103. Phrase queries: search = Song.search do fulltext '"imagine dragons"' with :genre, 'Rock' without :genre, 'Pop' with(:year).less_than 2014 field_list :title, :artist order_by :title, :asc end songs = search.results Radioactive (Imagine Dragons)
  • 104. Query Phrase Slop # Two words can appear between the words in the phrase, so # "imagine all the people" also matches, in addition to "imagine people" Song.search do fulltext '"imagine people"' do fields :lyrics query_phrase_slop 2 end end
  • 105. Minimum Match Song.search do fulltext "dragons imagine test" do fields :artist, :title minimum_match '70%' end end # => 1 document: Radioactive (Imagine Dragons). Song.search do fulltext 'dragons imagine test' do fields :artist, :title boost_fields title: 2.0 minimum_match '60%' end end # => 2 documents: 1st: Imagine (John Lennon), 2nd: Radioactive (Imagine Dragons). Notes: boost_fields sets the boost; the minimum match is rounded down
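The "rounded down" note explains the two results above: with three query terms, 70% requires 2 matching terms and 60% requires only 1. A minimal sketch of that arithmetic follows; `required_terms` and `minimum_match?` are hypothetical helpers, and Solr's minimum-match parameter supports more forms than a plain percentage.

```ruby
# Number of query terms that must match, rounded down (integer division).
def required_terms(term_count, percentage)
  (term_count * percentage) / 100
end

def minimum_match?(doc_tokens, query_tokens, percentage)
  matched = query_tokens.count { |term| doc_tokens.include?(term) }
  matched >= required_terms(query_tokens.size, percentage)
end

QUERY       = %w[dragons imagine test].freeze
RADIOACTIVE = %w[radioactive imagine dragons].freeze # title + artist tokens
IMAGINE     = %w[imagine john lennon].freeze         # title + artist tokens

puts required_terms(QUERY.size, 70) # 2 (3 * 0.7 = 2.1, rounded down)
puts required_terms(QUERY.size, 60) # 1 (3 * 0.6 = 1.8, rounded down)
puts minimum_match?(RADIOACTIVE, QUERY, 70) # true: 2 terms match
puts minimum_match?(IMAGINE, QUERY, 70)     # false: only 1 term matches
puts minimum_match?(IMAGINE, QUERY, 60)     # true: 1 term is enough
```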
  • 106. Spell checking search = Sunspot.search(Song) do keywords 'Imagina Dragoons' spellcheck :count => 3 end search.spellcheck_suggestion_for('imagina') # => 'imagine' search.spellcheck_suggestions # => [{"word"=>"imagine", "freq"=>3}, {"word"=>"dragons", "freq"=>1}]
  • 107. To test or not to test?
  • 108. To test or not to test? Unit tests? No. Integration tests? Maybe… Search engines depend on term frequencies to rank documents. You will need your whole dataset to compute precision, recall... You can test only filter queries, indexing callbacks…
  • 110. Summary The searching problem ● User: a bug search tool Adding a search engine to my app ● Full text search in MariaDB ● Apache Solr x ElasticSearch Apache Solr ● How to create cores ● CRUD operations Integrating with Rails ● Sunspot gem ● How to index, search and test
  • 111. Keep in mind: always verify the information needs of your app's users. E.g.: check whether stop word removal or synonyms should be applied ("No" - Meghan Trainor, "I am" - P.O.D). E.g.: which transformations should your search engine apply? Phonetic transformations? Custom language analyzers?
  • 112. Keep in mind: information is found not only in text files but also in audio, video, images, etc.
  • 113. Suggested topics for studying - Evaluation of available analyzers for FTS - Performance optimization (such as soft commits, lazily built indexes) - Distribution and replication through SolrCloud - Use of Machine Learning algorithms - Creation of custom function queries - Authentication - Integration with Logstash and Kibana - Geospatial searches
  • 115. References Introduction to Information Retrieval Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008) Solr in action Grainger, Trey, Timothy Potter, and Yonik Seeley (2014) Sunspot gem http://sunspot.github.io/ Uma introdução ao tema recuperação de informações textuais. Barth, F. J. (2013) 10 Reasons to Choose Apache Solr Over Elasticsearch (2016) https://dzone.com/articles/10-reasons-to-choose-apache-solr-over-elasticsearc
  • 116. References Apache Solr vs Elasticsearch http://solr-vs-elasticsearch.com/ When to consider Solr https://stackoverflow.com/questions/4960952/when-to-consider-solr Indexing for full text search in PostgreSQL https://www.compose.com/articles/indexing-for-full-text-search-in-postgresql/ PolyglotPersistence https://martinfowler.com/bliki/PolyglotPersistence.html Yahoo! Answers: Qual o nome desta Música? https://br.answers.yahoo.com/question/index?qid=20080627085726AAJM9Wa
  • 117. References Full-Text Index in MariaDB https://mariadb.com/kb/en/library/full-text-index-overview/ Natural Language Full-Text Searches (MySQL) https://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html Postgres full-text search is Good Enough! (2015) http://rachbelaid.com/postgres-full-text-search-is-good-enough/ Text Indexes in MongoDB https://docs.mongodb.com/manual/core/index-text/ Full-Text Index Stopwords for MariaDB https://mariadb.com/kb/en/library/full-text-index-stopwords/