Text search with Elasticsearch on AWS

Łukasz Przybyłek
Tidio

What’s Elasticsearch?
● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full text search capabilities
● (near) Real time indexing
● Document oriented
● Schema free

When do I need it?
● You need a faster search mechanism
● You need to search a large amount of data
● You need powerful full-text queries

How does it work?
Input document → Analyzer → Terms → Index

Inverted Index
Documents:
Id  Content
1   The quick brown fox jumped over the lazy dog
2   Quick brown foxes leap over lazy dogs in summer

Terms after analysis:
Term    Doc_1  Doc_2
brown   X      X
dog     X      X
fox     X      X
in             X
jump    X      X
lazy    X      X
over    X      X
quick   X      X
summer         X
the     X
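
The analysis step above can be sketched in a few lines of Python. This is a toy illustration, not how Lucene works internally: the NORMALIZE table is a hypothetical stand-in for a real analyzer's stemming and synonym filters, chosen only so that the two example documents produce the terms shown on the slide.

```python
from collections import defaultdict

# Hypothetical normalization table standing in for a real analyzer's
# stemming and synonym steps (assumption, not Elasticsearch's behavior).
NORMALIZE = {"jumped": "jump", "leap": "jump", "foxes": "fox", "dogs": "dog"}


def analyze(text):
    """Lowercase, split on whitespace, and normalize each token."""
    return [NORMALIZE.get(tok, tok) for tok in text.lower().split()]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
```

Because the query goes through the same analyzer as the documents, search(index, "Quick foxes") finds both documents even though neither contains that exact phrase.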

Logical data structures
● Elasticsearch (cluster) contains indices
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indices and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)

Logical data structures (diagram)
Cluster → Index → Type → Document; a Mapping is attached to each Type

Physical data structures
● Cluster contains nodes
● Index is stored in one or more shards (each shard is a Lucene index instance)
● A single node holds shards of different indices
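
A rough sketch of how a document ends up on a shard. Elasticsearch documents the rule as shard = hash(routing) % number_of_primary_shards, with the document id as the default routing value. The hash below (zlib.crc32) is only a stand-in for illustration, and the function names are hypothetical.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Return the shard number that a routing value maps to."""
    # zlib.crc32 is a stand-in hash (assumption); Elasticsearch uses its own.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


def moved_fraction(ids, old_count, new_count):
    """Fraction of ids that would land on a different shard after a resize."""
    moved = sum(1 for i in ids if pick_shard(i, old_count) != pick_shard(i, new_count))
    return moved / len(ids)


ids = ["doc-%d" % n for n in range(1000)]
shards = [pick_shard(i, 5) for i in ids]
```

Since the shard number depends on the primary shard count, changing that count would send many existing documents to a different shard, which is why the count is fixed when the index is created.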

How to deal with lack of joins?
● Denormalization
● Client-side joins
● Parent-child relationships
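
A minimal sketch of a client-side join, assuming search hits are plain dicts with a visitor_id field. The fetch_visitors callable is hypothetical: it stands for one batched lookup in the primary database (for example a single SQL query with IN (...)).

```python
def join_visitors(messages, fetch_visitors):
    """Attach a visitor record to each message using one batched lookup."""
    visitor_ids = {m["visitor_id"] for m in messages}
    visitors = fetch_visitors(visitor_ids)  # hypothetical: one query, not one per hit
    return [dict(m, visitor=visitors.get(m["visitor_id"])) for m in messages]


# In-memory stand-in for the relational store (illustrative data only).
VISITORS = {7: {"id": 7, "name": "Ann", "email": "ann@example.com"}}


def fetch_from_memory(ids):
    return {i: VISITORS[i] for i in ids if i in VISITORS}


hits = [
    {"id": 1, "visitor_id": 7, "content": "Hello"},
    {"id": 2, "visitor_id": 8, "content": "Hi"},
    {"id": 3, "visitor_id": 7, "content": "Thanks"},
]
joined = join_visitors(hits, fetch_from_memory)
```

Collecting the ids first keeps the join to one extra round trip per result page instead of one per hit.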

Elasticsearch in Tidio
● Tidio Chat - business communication tool where business owners (operators) communicate with their customers (visitors)
● www.tidiochat.com
● ES used instead of MariaDB for:
○ Fetching the last conversations in a project
○ Searching by message content and visitor email in a project’s conversation history

Relations in Tidio Chat
Message: id, visitor_id, operator_id, content, time
Project: public_key
Visitor: id, project_public_key, name, email
Operator: id, project_public_key

Message document schema
● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor
Message: id, visitor_id, operator_id, project_public_key, content, time
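
With this document shape, the two use cases could be expressed as query bodies along these lines. This is a sketch, not the queries used at Tidio: it assumes an Elasticsearch version where the bool query accepts a filter clause (2.0 or later) and that project_public_key is indexed as an exact, not analyzed value. The function names are hypothetical.

```python
import json


def search_by_text_body(project_public_key, text, size=20):
    """Full-text match on content, restricted to one project."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "filter": [{"term": {"project_public_key": project_public_key}}],
            }
        },
    }


def newest_messages_body(project_public_key, size=20):
    """Newest messages of one project, a building block for a last-conversations view."""
    return {
        "size": size,
        "query": {
            "bool": {"filter": [{"term": {"project_public_key": project_public_key}}]}
        },
        "sort": [{"time": {"order": "desc"}}],
    }


body = search_by_text_body("abc123", "refund")
payload = json.dumps(body)  # what a client would send to the _search endpoint
```

The project restriction sits in the filter clause so it does not affect relevance scoring; only the match on content does.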

Design decisions
● Questions
a. What indices should be created?
b. What types should be created?
c. How should shards be distributed among nodes and indices?
● Things to consider
a. Searching a smaller dataset usually means faster search results
b. An index with a small number of shards does not scale efficiently to new nodes
c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
d. An index doesn’t need to represent a domain entity

Ideas?
An index for each project, one type inside each index
● 250k projects = 250k indices
● Adding a new index is slow
● Large overhead from the number of shards and indices

Ideas?
One index and a separate type for each project
● Large index
● Nodes can scale out only up to the number of shards in the index (default 5, no automatic index splitting)
● Every query would go through all shards and filter by project_public_key (a large amount of data to search)

Ideas?
Group projects and create an index for each group
● Limited amount of data to search
● Reasonable number of shards, which can still scale out to many nodes
● Possibility to add an alias for each project and search it as if it were a separate index
● Projects may be grouped by language and use language-specific analyzers
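
A per-project alias of this kind could be created with a request body like the one built below, sent to the _aliases endpoint. The index name, the alias naming scheme and the function name are hypothetical; only the actions/add/filter structure comes from the Elasticsearch aliases API.

```python
import json


def project_alias_actions(group_index, project_public_key):
    """Build an _aliases request body for a filtered per-project alias."""
    alias = "project_" + project_public_key  # hypothetical naming scheme
    return {
        "actions": [
            {
                "add": {
                    "index": group_index,
                    "alias": alias,
                    # Every search through the alias gets this filter applied.
                    "filter": {"term": {"project_public_key": project_public_key}},
                }
            }
        ]
    }


actions = project_alias_actions("messages_group_en_03", "abc123")
payload = json.dumps(actions)
```

Application code can then search project_abc123 as if it were a dedicated index, while the data lives in the shared group index.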

Amazon Web Services Elasticsearch cluster
● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
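
Manual signing here means AWS Signature Version 4. The sketch below covers only the key derivation and the final signature, using the Python standard library. Building the canonical request (method, path, sorted headers, payload hash) is left to the caller, and the service name "es" and all example values are assumptions for illustration.

```python
import hashlib
import hmac


def _hmac_sha256(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service="es"):
    """Derive the SigV4 signing key: date -> region -> service -> aws4_request."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def build_string_to_sign(amz_date, date_stamp, region, service, canonical_request):
    """Combine algorithm, timestamp, credential scope and request hash."""
    scope = "/".join([date_stamp, region, service, "aws4_request"])
    request_hash = hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()
    return "\n".join(["AWS4-HMAC-SHA256", amz_date, scope, request_hash])


def sign(signing_key, string_to_sign):
    """Return the hex signature that goes into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder values only; never hard-code real credentials.
key = derive_signing_key("EXAMPLE-SECRET", "20160101", "eu-west-1")
to_sign = build_string_to_sign(
    "20160101T120000Z", "20160101", "eu-west-1", "es", "<canonical request>"
)
signature = sign(key, to_sign)
```

A custom request handler in the HTTP client would compute this signature for each outgoing request and add it to the headers before sending.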

PHP clients for ES
● elasticsearch/elasticsearch
○ https://github.com/elastic/elasticsearch-php
○ Low level ES client
○ One-to-one mapping with REST API
○ Pluggable architecture (can use custom request handler and send AWS signed requests)
○ Does all the things that you don’t want to know about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
○ Accepts queries in JSON
● ruflin/elastica
○ https://github.com/ruflin/Elastica
○ High level client
○ Classes representing indices/queries/terms - you do not have to write JSON by hand

Elasticsearch limitations
● Less capable than SQL
● There is no paging support for aggregations

AWS Elasticsearch limitations
● threadpool.bulk.queue_size=50
● No script support
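
With a small bulk queue, one way to avoid rejected requests is to send documents in a few moderately sized bulk requests, one after another, rather than many concurrent ones. The sketch below only builds the newline-delimited bulk bodies. The batch size, index name and type name are illustrative assumptions, and the _type field reflects the type-based Elasticsearch versions this talk describes.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_body(index, doc_type, docs):
    """Build a Bulk API body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


messages = [{"id": n, "content": "message %d" % n} for n in range(5)]
batches = list(chunked(messages, 2))
bodies = [bulk_body("messages_group_en_03", "message", batch) for batch in batches]
```

Each body would be posted to the _bulk endpoint in turn, waiting for the response before sending the next one.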

Indexing performance
● Check your mappings!
● Set fields that do not need full-text search as not analyzed
● Disable _all field
● Tune your analyzer and index_options (advanced)
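
Applied to the message document from this talk, these tips might translate into a mapping like the one below. It uses the type-based syntax of Elasticsearch 1.x/2.x (string fields with index set to not_analyzed, and the _all switch); later versions changed this syntax. The field types chosen for the id fields are assumptions.

```python
import json


def exact_string():
    """A string field stored as a single exact term (no analysis)."""
    return {"type": "string", "index": "not_analyzed"}


def message_mapping():
    """Index-creation body with an explicit mapping for the message type."""
    return {
        "mappings": {
            "message": {
                "_all": {"enabled": False},  # skip the catch-all field
                "properties": {
                    "id": exact_string(),  # assumption: ids kept as exact strings
                    "visitor_id": exact_string(),
                    "operator_id": exact_string(),
                    "project_public_key": exact_string(),
                    "content": {"type": "string"},  # analyzed for full-text search
                    "time": {"type": "date"},  # mapped as date explicitly
                },
            }
        }
    }


mapping = message_mapping()
payload = json.dumps(mapping)
```

Only content goes through the analyzer; the remaining fields are used for exact filtering and sorting, so analyzing them would cost indexing time without adding search value.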

Search performance
● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds
Query \ Service      MariaDB (8 CPU)   Elasticsearch (4 CPU)
Search by text       14.16 (σ=0.51)    0.80 (σ=0.20)
Last conversations   4.77 (σ=0.45)     0.87 (σ=0.23)

Thank you!
lucas@tidio.net
lprzybylek@gmail.com
