Text search with Elasticsearch on AWS
Łukasz Przybyłek
Tidio
What’s Elasticsearch?
● Search & analytics engine
● Fast
● Scalable
● Distributed
● Full text search capabilities
● (near) Real time indexing
● Document oriented
● Schema free
When do I need it?
● When you need a faster search mechanism
● When you need to search a large amount of data
● When you need powerful full-text queries
How does it work?
Input Document -> Analyzer -> Terms -> Index
Inverted Index
Id  Content
1   The quick brown fox jumped over the lazy dog
2   Quick brown foxes leap over lazy dogs in summer
After analysis:
Term     Doc_1  Doc_2
brown    X      X
dog      X      X
fox      X      X
in              X
jump     X      X
lazy     X      X
over     X      X
quick    X      X
summer          X
the      X      X
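The analysis step and the term lookup can be sketched in a few lines of Python. This is a toy illustration only: the `NORMALIZE` table is a hypothetical stand-in for a real analyzer's stemming and synonym filters, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical normalization table standing in for a real analyzer's
# stemming and synonym filters.
NORMALIZE = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def analyze(text):
    """Lowercase, split on whitespace, and normalize each token."""
    return [NORMALIZE.get(token, token) for token in text.lower().split()]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
print(sorted(search(index, "quick foxes")))  # both documents match
print(sorted(search(index, "summer dog")))   # only document 2 matches
```

The query goes through the same analyzer as the documents, which is why "foxes" finds a document that only contains "fox".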
Logical data structures
● Elasticsearch (cluster) contains indexes
● Index contains types
● Type contains documents
● Mappings are assigned to types
● Index aliases (optional) can point to indices and modify queries (e.g. add a filter)
● There are no classic SQL-like relationships (!)
Logical data structures
[Diagram: a cluster contains indexes; each index contains types; each type has a mapping and contains documents]
Physical data structures
● Cluster contains nodes
● An index is stored in one or more shards (a single shard is a Lucene index instance)
● A single node can hold shards of different indexes
How to deal with lack of joins?
● Denormalization
● Client-side joins
● Parent-child relationships
Elasticsearch in Tidio
● Tidio Chat: a business communication tool where business owners (operators) communicate with their customers (visitors)
● www.tidiochat.com
● ES used instead of MariaDB to:
○ Fetch the last conversations in a project
○ Search by message content and visitor email in a project’s conversation history
Relations in Tidio Chat
Project: public_key
Visitor: id, project_public_key, name, email
Operator: id, project_public_key
Message: id, visitor_id, operator_id, content, time
Message document schema
● Project’s public key added to document
● Search by email performed in MariaDB
● Time mapped as date explicitly
● Client-side join with Visitor
Message: id, visitor_id, operator_id, project_public_key, content, time
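A client-side join with Visitor might look roughly like the sketch below: collect the visitor ids from the message documents returned by the search, fetch those visitors in one lookup, and attach them in application code. The function `join_visitors` and the `fetch_visitors` callback are hypothetical; in practice the callback would query MariaDB, while here it reads from an in-memory dict.

```python
def join_visitors(messages, fetch_visitors):
    """Attach the matching visitor record to each message document."""
    # One batched lookup for all distinct visitors instead of one per message.
    visitor_ids = {message["visitor_id"] for message in messages}
    visitors = fetch_visitors(visitor_ids)
    return [
        dict(message, visitor=visitors.get(message["visitor_id"]))
        for message in messages
    ]


# Stand-in for the relational store (illustrative data only).
VISITOR_TABLE = {
    10: {"id": 10, "name": "Alice", "email": "alice@example.com"},
    11: {"id": 11, "name": "Bob", "email": "bob@example.com"},
}


def fetch_visitors(ids):
    return {i: VISITOR_TABLE[i] for i in ids if i in VISITOR_TABLE}


hits = [
    {"id": 1, "visitor_id": 10, "content": "Hello"},
    {"id": 2, "visitor_id": 11, "content": "Hi there"},
    {"id": 3, "visitor_id": 10, "content": "Thanks"},
]
joined = join_visitors(hits, fetch_visitors)
print(joined[0]["visitor"]["email"])
```

Batching the ids keeps the join at one extra round trip per search, regardless of how many messages come back.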
Design decisions
● Questions
a. What indexes should be created?
b. What types should be created?
c. How should shards be distributed among nodes and indexes?
● Things to consider
a. Searching a smaller dataset usually means faster search results
b. An index with a small number of shards does not scale efficiently to new nodes
c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types
d. An index doesn’t need to represent a domain entity
Ideas?
An index for each project, one type inside each index
● 250k projects = 250k indexes
● Adding a new index is slow
● Large overhead associated with the number of shards and indices
Ideas?
One index and a separate type for each project
● Large index
● Nodes scale out only up to the number of shards in the index (default 5, no automatic index splitting)
● Every query would go through all shards and filter by project_public_key (a large amount of data to search)
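The scaling limit can be illustrated with a deliberately simplified model. The sketch below assigns primary shards to nodes round-robin and ignores replicas, which is an assumption made for illustration and not Elasticsearch's actual allocation logic; `allocate_primaries` is a hypothetical helper.

```python
def allocate_primaries(num_shards, nodes):
    """Spread primary shards over nodes round-robin (simplified model)."""
    allocation = {node: [] for node in nodes}
    for shard in range(num_shards):
        allocation[nodes[shard % len(nodes)]].append(shard)
    return allocation


nodes = [f"node-{n}" for n in range(8)]
allocation = allocate_primaries(5, nodes)
idle = [node for node, shards in allocation.items() if not shards]
print(len(idle), "of", len(nodes), "nodes hold no shard of this index")
```

With 5 shards and 8 nodes, the extra nodes have nothing from this index to serve, which is the reason a single fixed-size index stops benefiting from new nodes.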
Ideas?
Group projects and create an index for each group
● Limited amount of data to search in
● Reasonable number of shards, which still can scale up to many nodes
● Possibility to add an alias for each project and search it as if it were a separate index
● Projects may be grouped by language and use specific analyzers
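A per-project alias over a group index could be expressed as a filtered alias. The sketch below only builds the request body for the index aliases endpoint (`POST /_aliases`); the index name `messages_group_07`, the alias naming scheme, and the key `abc123` are hypothetical, and the term filter assumes `project_public_key` is indexed as an exact (not analyzed) value.

```python
import json


def project_alias_action(group_index, project_public_key):
    """Build an 'add alias' action that filters a group index to one project."""
    return {
        "add": {
            "index": group_index,
            "alias": f"project_{project_public_key}",
            "filter": {"term": {"project_public_key": project_public_key}},
        }
    }


body = {"actions": [project_alias_action("messages_group_07", "abc123")]}
print(json.dumps(body, indent=2))
```

Searches sent to the alias then see only that project's documents, while the data still lives in the shared group index.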
Amazon Web Services Elasticsearch cluster
● Quick and easy to install
● Extremely limited configuration options
● Limited query options (scripts disabled)
● Can be used with standard AWS authentication
● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
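Manual signing follows the AWS Signature Version 4 scheme. The following is a minimal Python sketch of the steps (canonical request, string to sign, derived signing key) using only the standard library. It makes simplifying assumptions: no query string, no session token, and an already URI-encoded path. The host name is hypothetical and the credentials are the placeholder values used in AWS documentation.

```python
import datetime
import hashlib
import hmac


def _hmac_sha256(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, host, path, body, access_key, secret_key,
                  region, now, service="es"):
    """Return headers carrying a Signature Version 4 signature.

    `now` is expected to be a UTC datetime.
    """
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # 1. Canonical request (the empty string is the absent query string).
    payload_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash]
    )

    # 2. String to sign.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # 3. Derive the signing key and sign.
    key = ("AWS4" + secret_key).encode("utf-8")
    for part in (date_stamp, region, service, "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(
        key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    return {
        "Host": host,
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }


headers = sigv4_headers(
    "POST",
    "search-example.us-east-1.es.amazonaws.com",  # hypothetical domain
    "/messages/_search",
    '{"query": {"match_all": {}}}',
    "AKIDEXAMPLE",
    "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY",
    "us-east-1",
    datetime.datetime(2015, 8, 30, 12, 36, 0),
)
print(headers["Authorization"])
```

The returned headers would be attached to the HTTP request before it is sent; the body that is sent must be byte-for-byte the one that was hashed.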
PHP clients for ES
● elasticsearch/elasticsearch
○ https://github.com/elastic/elasticsearch-php
○ Low level ES client
○ One-to-one mapping with REST API
○ Pluggable architecture (can use custom request handler and send AWS signed requests)
○ Does all the things you don’t want to know about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections
○ Accepts queries in JSON
● ruflin/elastica
○ https://github.com/ruflin/Elastica
○ High level client
○ Classes representing indices/queries/terms, so you do not have to write JSON by hand
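Whichever client is used, what reaches Elasticsearch is a JSON query body. As a rough sketch (in Python rather than PHP), a search by message content restricted to one project might be built as below. The field names follow the message document described in the talk; the function name, the default page size, and the sort order are assumptions.

```python
import json


def message_search_body(project_public_key, text, size=20):
    """Full-text match on content, filtered to one project, newest first."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": text}},
                "filter": {"term": {"project_public_key": project_public_key}},
            }
        },
        "sort": [{"time": {"order": "desc"}}],
        "size": size,
    }


body = message_search_body("abc123", "refund order")
print(json.dumps(body, indent=2))
```

The filter clause narrows the candidate set without affecting relevance scoring, so only the `match` on `content` contributes to ranking.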
Elasticsearch limitations
● Less capable than SQL
● There is no paging support for aggregations
AWS Elasticsearch limitations
● threadpool.bulk.queue_size=50
● No script support
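With a small bulk queue, one way to avoid rejected requests is to send a bounded number of moderately sized bulk requests instead of many small concurrent ones. The sketch below only builds newline-delimited bodies for the bulk endpoint; the chunk size, the index and type names, and the helper `bulk_bodies` are illustrative assumptions, and sending the requests (with retries on rejection) is left to the client.

```python
import json


def bulk_bodies(docs, index, doc_type, chunk_size=500):
    """Yield newline-delimited bulk bodies, chunk_size documents each."""
    for start in range(0, len(docs), chunk_size):
        lines = []
        for doc in docs[start:start + chunk_size]:
            # Action line first, then the document source on the next line.
            action = {"index": {"_index": index, "_type": doc_type,
                                "_id": doc["id"]}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(doc))
        # The bulk format requires a trailing newline.
        yield "\n".join(lines) + "\n"


docs = [{"id": n, "content": f"message {n}"} for n in range(5)]
bodies = list(bulk_bodies(docs, "messages_group_07", "message", chunk_size=2))
print(len(bodies), "bulk requests")
```

Sending these bodies one after another keeps at most one bulk request from this client in the queue at a time.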
Indexing performance
● Check your mappings!
● Set fields as not analyzed
● Disable _all field
● Tune your analyzer and index_options (advanced)
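As an illustration of these tips, a mapping for the message type might look like the sketch below. It uses the mapping syntax of the Elasticsearch versions that still had types, `not_analyzed` string fields, and the `_all` field; later versions replaced these with `text`/`keyword` fields and removed `_all`. The choice of field types for the id fields is an assumption.

```python
import json

message_mapping = {
    "mappings": {
        "message": {
            # Skip the catch-all field that copies every value at index time.
            "_all": {"enabled": False},
            "properties": {
                # Exact-match fields: no analysis needed (types assumed).
                "project_public_key": {"type": "string", "index": "not_analyzed"},
                "visitor_id": {"type": "string", "index": "not_analyzed"},
                "operator_id": {"type": "string", "index": "not_analyzed"},
                # Full-text field: analyzed with the default analyzer.
                "content": {"type": "string"},
                # Mapped explicitly as a date.
                "time": {"type": "date"},
            },
        }
    }
}

print(json.dumps(message_mapping, indent=2))
```

Only `content` pays the cost of analysis here; every other field is stored as a single exact term.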
Search performance
● Unfair comparison
● Over 26 million documents
● Time of PHP requests in seconds
QueryService        MariaDB (8 CPU)   Elasticsearch (4 CPU)
Search by text      14.16 (σ=0.51)    0.80 (σ=0.20)
Last conversations  4.77 (σ=0.45)     0.87 (σ=0.23)
Any questions?
Thank you!
lucas@tidio.net
lprzybylek@gmail.com

Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Text search with Elasticsearch on AWS

  • 1. Text search with Elasticsearch on AWS Łukasz Przybyłek Tidio
  • 2. What’s Elasticsearch? ● Search & analytics engine ● Fast ● Scalable ● Distributed ● Full text search capabilities ● (near) Real time indexing ● Document oriented ● Schema free
  • 3. When do I need it? ● When you need a faster search mechanism ● When you need to search a large amount of data ● When you need powerful full-text queries
  • 4. How does it work? Input Document Analyzer Terms Index
  • 5. Inverted Index Id Content 1 The quick brown fox jumped over the lazy dog 2 Quick brown foxes leap over lazy dogs in summer analysis Term Doc_1 Doc_2 brown X X dog X X fox X X in X jump X X lazy X X over X X quick X X summer X the X X
  • 6. Logical data structures ● Elasticsearch (cluster) contains indexes ● Index contains types ● Type contains documents ● Mappings are assigned to types ● Index aliases (optional) can point to indices and modify queries (e.g. add filter) ● There are no classic SQL-like relationships (!)
  • 7. Logical data structures Cluster Index Index Index Type Type Document Mapping Document
  • 8. Physical data structures ● Cluster contains nodes ● Index is stored in one or more shards (single shard is a Lucene index instance) ● Single node contains shards of different indexes
  • 9. How to deal with lack of joins? ● Denormalization ● Client-side joins ● Parent-child relationships
  • 10. Elasticsearch in Tidio ● Tidio Chat - business communication tool where business owners (operators) communicate with their customers (visitors) ● www.tidiochat.com ● ES used instead of MariaDB to perform: ○ Fetching last conversations in project ○ Searching by message content and visitor email in project’s conversation history
  • 11. Relations in Tidio Chat Message id visitor_id operator_id content time Project public_key Visitor id project_public_key name email Operator id project_public_key
  • 12. Message document schema ● Project’s public key added to document ● Search by email performed in MariaDB ● Time mapped as date explicitly ● Client-side join with Visitor Message id visitor_id operator_id project_public_key content time
  • 13. Design decisions ● Questions a. What indexes should be created? b. What types should be created? c. How should shards be distributed among nodes and indexes? ● Things to consider a. Searching a smaller dataset usually means faster search results b. An index with a small number of shards does not scale efficiently to new nodes c. Types are used mainly to assign mappings; they are not separate “search entities”, so there is no direct performance boost from using many types d. An index doesn’t need to represent a domain entity
  • 14. Ideas? Index for each project, one type inside index ● 250k projects = 250k indexes ● Adding new index is slow ● Large overhead associated with shards and indices count
  • 15. Ideas? One index and separate type for each project ● Large index ● Nodes scaling up only to number of shards in particular index (default 5, no auto index splitting) ● Every query would go through all shards and filter by project_public_key (large amount of data to search in)
  • 16. Ideas? Group projects and create an index for each group ● Limited amount of data to search in ● Reasonable number of shards, which can still scale up to many nodes ● Possibility to add an alias for each project and search as if it were a separate index ● Projects may be grouped by language and use specific analyzers
  • 17. Amazon Web Services Elasticsearch cluster ● Quick and easy to install ● Extremely limited configuration options ● Limited query options (scripts disabled) ● Can be used with standard AWS authentication ● There is no AWS SDK that supports ES, so users have to write code that signs requests manually
  • 18. PHP clients for ES ● elasticsearch/elasticsearch ○ https://github.com/elastic/elasticsearch-php ○ Low level ES client ○ One-to-one mapping with REST API ○ Pluggable architecture (can use custom request handler and send AWS signed requests) ○ Does all things that you don’t want to know about, e.g. discovery of cluster nodes, load balancing, Keep-Alive connections ○ Accepts queries in JSON ● ruflin/elastica ○ https://github.com/ruflin/Elastica ○ High level client ○ Classes representing indices/queries/terms - you do not have to write JSONs
  • 19. Elasticsearch limitations ● Less capable than SQL ● There is no paging support for aggregations
  • 20. AWS Elasticsearch limitations ● threadpool.bulk.queue_size=50 ● No script support
  • 21. Indexing performance ● Check your mappings! ● Set fields as not analyzed ● Disable _all field ● Tune your analyzer and index_options (advanced)
  • 22. Search performance ● Unfair comparison ● Over 26 million documents ● Time of PHP requests in seconds
        Query              | MariaDB (8 CPU) | Elasticsearch (4 CPU)
        Search by text     | 14.16 (σ=0.51)  | 0.80 (σ=0.20)
        Last conversations | 4.77 (σ=0.45)   | 0.87 (σ=0.23)
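
The analysis step and inverted index from slide 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PLURALS` and `SYNONYMS` tables and the function names `analyze`, `build_inverted_index` and `search` are hypothetical stand-ins for a real analyzer chain, not Elasticsearch or Lucene APIs, and no stopword removal is applied.

```python
import re
from collections import defaultdict

# Hypothetical toy rules standing in for a real stemmer and synonym filter.
PLURALS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}


def analyze(text):
    """Lowercase, split on non-letters, then apply the toy plural and synonym tables."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [PLURALS.get(token, token) for token in tokens]
    return [SYNONYMS.get(token, token) for token in tokens]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return dict(index)


def search(index, query):
    """Return the ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
```

Because the query goes through the same `analyze` function as the documents, `search(index, "Leap dogs")` matches both documents even though neither word appears literally in document 1.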

Editor's Notes

  1. What is Elasticsearch? A search engine. Fast, scalable, distributed, with full-text search capabilities; results are available almost immediately after indexing; document oriented and schema free.
  2. When should you think about Elasticsearch? When you need search results faster, need to search large amounts of data (the documentation even mentions petabytes), or need advanced full-text search.
  3. We put data into Elasticsearch as documents; they then pass through the Analyzer, are converted into tokens, and are written to the index. The words term, token, and word are used interchangeably.
  4. Data in ES is stored as an inverted index. The text is converted into tokens/terms/words, and the index records whether each token occurs in a document. - Lowercasing - Foxes and dogs reduced to the singular (stemming, i.e. finding the stem) - Jumped and leap are synonyms stored as jump - The could also be dropped using a stopword list. Fun fact: the example comes from the documentation, but Elasticsearch does not behave this way by default.
  5. What an index, a type, and a document are. Analogy to a database, a table, and a row in SQL. What a mapping is: often overlooked, but important. If we forget about it, we have to reindex the whole index. In reality a document's type is a property (documents are not physically grouped by type) and serves mainly to select the mapping. Aliases make queries easier to construct. No relationships!
  6. Graphical visualization of the data structures. Important, key to understanding the following slides.
  7. How it is all stored physically. A cluster contains nodes, an index is stored in shards, and a given node holds shards of different indexes. Lucene index: Lucene is really the search engine, and Elasticsearch is a layer on top of it that provides scalability, a pleasant API, etc.
  8. Denormalization: put all data into one document. Types: classic flat structure, nested objects. After an update of a related entity, all associated objects have to be updated manually. Client-side joins: a few requests; increased client load. Parent-child relationships: documents have to be on the same shard; generally slow.
  9. Use in our case. Tidio Chat is a Tidio product that business owners use to communicate with their customers. You can see how it works at tidiochat.com. One chat widget installed on a customer's site is one project. We use Elasticsearch in place of MariaDB for two requests: fetching the last conversations in a project, and searching message content and visitor email.
  10. This is what the relations in our database look like. A Project is identified by a public key. A Visitor is assigned to a project and has a name, an email, and other data. An Operator is also assigned to a project. A Message is sent between an operator and a visitor and has content and a send time. This is of course a simplified model. We need to search by visitor email and by message content, and also to group by visitor_id and display the last message. In the view we also need visitor data, e.g. the name.
  11. The project key is added to the document. We had to map the message time explicitly as a date, because the default mapping did not detect that the field contained dates. Search by visitor email still runs on MariaDB. Fetching visitor data also runs on MariaDB.
  12. With the document ready, we need to decide where to put it and how to fit it into the structure of indexes, types, and documents. Another question is how shards should be distributed across nodes and indexes. Worth remembering: Searching a small dataset gives results faster. Indexes with a small number of shards scale poorly. On the other hand, too many shards also cause a large overhead. Types serve mainly for mappings; within an index, documents of a given type are not grouped together. Indexes do not have to represent a business entity. The documentation has plenty of examples where a different layout is recommended, so an index is not like a database table.
  13. Ideas for a solution. A separate index for each project, one type in each index. Definitely too many shards. Adding indexes takes a long time because the index structure has to be written; we tested this and the import would have taken months. Every shard eats resources (processes, RAM).
  14. One large index and a separate type for each project. A very large index; we would have to start with a large number of shards to avoid reindexing in the future. That causes a large overhead until a large number of nodes is reached. A lot of data to search every time, although this can be reduced with document routing, which lets us choose the shard a given document goes to and search only that shard. This is fast, but the downside is that if that shard goes down, we get no results at all.
  15. Our solution. We divided the projects into groups and added a separate index for each group. This gives us a sensible number of nodes and shards, and the failure of one shard does not cause a complete outage for the customer. You can go further and add an alias for each project that adds a filter. We did not manage to do this, because you cannot create a filtered alias when no message matching the filter exists yet. Handling that in our architecture would take a lot of gymnastics, and it is really only syntactic sugar. I also don't know how 250 thousand aliases would behave. I don't know whether this is a limitation of Elasticsearch itself or of the instance on AWS. Projects grouped by language can use different analyzer settings, which helps adapt search to the language.
  16. Two mature libraries; we use the first one, but the second may come in handy.
  17. In SQL you can do almost anything. It will be slow, it will create a temporary table, but it will return the correct result. Here, some things often simply cannot be computed, so it is a good idea to load part of the data and test the queries before deciding to use it in production. We had a problem fetching the last message in aggregations. It can be done in a query without aggregations, but not in aggregations. There is no support for paging aggregation results. It is the most requested feature on GitHub, but the team does not plan to add it. People work around this in various ways; the most popular is simply to fetch the whole result, store it on the client side, and do the paging there. In our case we took advantage of the fact that we do not need a fixed number of results per page and can simply “pull in” further results (infinite scroll). We use this when displaying the last conversations, where messages are grouped by visitor id. In each request we fetch all the latest messages with a limit that, after grouping, yields on average one page of visitors. On the PHP side we filter to return only the last message for each visitor. When fetching the next page we send ES a list of visitors whose messages should be skipped. Tests showed that this is the fastest approach, and subsequent requests hardly differ in response time despite the growing number of visitors to skip.
  18. Limitations of ES on AWS. Threadpool bulk queue_size: this is very troublesome. This parameter controls the number of documents that Elasticsearch can hold in its queue for indexing during bulk operations. It cannot be changed and it is very low. The queue size should be around 20-50 MB; for us, 50 documents is a few hundred kilobytes. Because of this, indexing the documents in our case took a week. Literally 7 days. Granted, it was 20 million records, but still. CPU load was negligible during that time. The second thing is the lack of script support. This significantly limits what queries you can write. We plan to move to a self-hosted Elasticsearch.
  19. How to optimize indexing performance. Mappings matter most. Fields such as email, postal code, any ids, etc. do not need analysis and conversion into tokens. Skipping analysis also allows filtering by exact values. You have to set this for string fields, because they are analyzed by default. Disable the _all field. This removes the ability to search the whole index (without specifying a field), but that is usually not needed, and it saves time and space on the server. It is also worth adjusting the analyzer, including the tokenizer, and tailoring to it the data that is stored in the index. That data is not only terms and their presence in a document, but also term frequencies, positions, etc., which are used in advanced searches. For us this did not speed things up much; the bottleneck is the queue limit.
  20. And finally, a comparison of times. It is an unfair comparison, because in MySQL the search used LIKE and the database is under heavier load, but then Maria also has more powerful hardware. The times cover the whole request, including the framework, client-side joins, etc.
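
The infinite-scroll workaround from note 17 (fetch the newest messages, keep one per visitor on the client, and send the already shown visitors as an exclusion list on the next request) can be sketched as below. This is a minimal sketch under stated assumptions, not the code used at Tidio: the function names `fetch_latest_messages` and `next_page` are hypothetical, an in-memory list stands in for the Elasticsearch query, and `time` is a plain integer. Python is used here only for illustration; the talk's own client code was PHP.

```python
def fetch_latest_messages(messages, project_key, excluded_visitors, limit):
    """Stand-in for the Elasticsearch query: one project, newest first,
    skipping messages from visitors that were already shown."""
    hits = [
        m for m in messages
        if m["project_public_key"] == project_key
        and m["visitor_id"] not in excluded_visitors
    ]
    hits.sort(key=lambda m: m["time"], reverse=True)
    return hits[:limit]


def next_page(messages, project_key, seen_visitors, limit):
    """Return the last message of each not-yet-seen visitor, plus the updated seen set."""
    hits = fetch_latest_messages(messages, project_key, seen_visitors, limit)
    page = {}
    for message in hits:
        # Hits are newest first, so the first hit per visitor is that visitor's last message.
        page.setdefault(message["visitor_id"], message)
    return list(page.values()), seen_visitors | set(page)


messages = [
    {"id": 1, "visitor_id": "v1", "project_public_key": "p1", "content": "hi", "time": 1},
    {"id": 2, "visitor_id": "v2", "project_public_key": "p1", "content": "hello", "time": 2},
    {"id": 3, "visitor_id": "v1", "project_public_key": "p1", "content": "anyone?", "time": 3},
    {"id": 4, "visitor_id": "v3", "project_public_key": "p1", "content": "price?", "time": 4},
    {"id": 5, "visitor_id": "v3", "project_public_key": "p1", "content": "thanks", "time": 5},
    {"id": 6, "visitor_id": "v9", "project_public_key": "p2", "content": "other project", "time": 6},
]

page1, seen = next_page(messages, "p1", set(), limit=3)
page2, seen = next_page(messages, "p1", seen, limit=3)
```

Pages produced this way do not have a fixed size: with `limit=3` the first page holds two visitors and the second holds one, which is why the approach fits infinite scroll rather than numbered pages.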