- Signal is a text analytics startup that uses Elasticsearch to analyze large volumes of news articles and provide search and analytics services to customers.
- Signal faced the challenge of providing low-latency search to thousands of heterogeneous users generating large, spiky query loads, while continuing to improve its AI models.
- Joachim Draeger led experiments with Elasticsearch configurations and monitoring to optimize performance and scaling. He found that fewer, larger shards and fewer search terms improved query latency, and that proper monitoring was essential for identifying bottlenecks and expensive searches.
2. ● Signal’s Use Case & Challenges
● Performance & Scaling Journey
● Live Experiments
Agenda
3. Signal: signalmedia.co @SignalHQ
Text Analytics Start-Up, founded in 2013
Media Monitoring & more
100 people, about 20 in tech/data science/product
We’re hiring!
Joachim Draeger: linkedin.com/in/joachimdraeger/ @joachimdraeger
Lead Software Engineer, joined two years ago
Terraformed Infrastructure, Tamed Elasticsearch, Built up Monitoring
Currently doing full-stack development on Signal's user management and login security
Before: 10 years of Java
Signal & Me
8. ● Latest 15 months of the world’s news
● AI powered annotations
○ Entities (Apple vs apples)
○ Topics
○ Quotes
○ Sentiment
● Full text for keyword searches
● Source
● … and more
Data in Elasticsearch
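An annotated article might look roughly like this. This is a hypothetical sketch: the field names and values are illustrative assumptions, not Signal's actual mapping, but they cover the annotations listed above (entities, topics, quotes, sentiment, full text, source).

```python
import json

# Hypothetical shape of an AI-annotated news article document.
# Field names are illustrative assumptions, not Signal's real mapping.
article = {
    "title": "VW emissions scandal widens",
    "content": "Volkswagen said on Tuesday ...",  # full text for keyword searches
    "source": "Example Newswire",
    "published": "2015-09-22T09:15:00Z",
    "entities": [
        # disambiguated: "Apple" the company vs "apples" the fruit
        {"name": "Volkswagen", "type": "organisation"}
    ],
    "topics": ["automotive", "regulation"],
    "quotes": ["We have totally screwed up."],
    "sentiment": "negative",
}

print(json.dumps(article, indent=2))
```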
9. ● Thousands of Users with heterogeneous demands
○ Some only interested in their coverage (1 Entity)
○ Some are interested in a lot of different and specific things
○ => spiky load, sometimes caused by a single user
● AI cat & mouse
○ Information needs not (yet) covered by AI annotations get modelled with keywords
○ E.g. “according to”, “said”, “declared” => Quote detection
○ E.g. positive/negative words => Sentiment
○ More and better Entities & Topics
● Queries with lots of terms are expensive!
Challenges & Usage Characteristics
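The "AI cat & mouse" point can be made concrete with a sketch of a keyword-modelled quote-detection query (a hypothetical query, not Signal's production DSL): each keyword becomes its own clause, so the query's cost grows with the number of terms.

```python
# Sketch: modelling an information need (quote detection) with keywords
# until an AI annotation covers it. Each marker becomes a query clause,
# which is why many-term queries are expensive. Hypothetical query body.
quote_markers = ["according to", "said", "declared"]

query = {
    "query": {
        "bool": {
            "should": [
                {"match_phrase": {"content": marker}} for marker in quote_markers
            ],
            "minimum_should_match": 1,
        }
    }
}

# One clause per keyword - cost scales with the term count.
print(len(query["query"]["bool"]["should"]))
```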
11. ● Be pragmatic
● Add more nodes!
● Monitoring, identify resource bottlenecks *
● Upgrade to latest ES version
● Identify and improve expensive searches *
● Find the right machine type
● Find the right number of indices and shards *
● Build a (mental) model for query cost
Signal’s Performance & Scaling Journey
12. ● End-user latency
● Search queue & rejected searches
● CPU
● Memory
● Garbage collection: Old Gen (new JDKs are coming!)
● IO: Ops & Bytes/s
● Field Data
Monitoring
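Most of the metrics above can be pulled from the Elasticsearch nodes stats API. A minimal sketch of extracting search-queue pressure, assuming a hand-made sample payload (the field paths follow the `_nodes/stats` response structure):

```python
# Hand-made sample of an Elasticsearch _nodes/stats response; only the
# fields relevant to the monitored signals above are included.
sample_stats = {
    "nodes": {
        "node1": {
            "thread_pool": {"search": {"queue": 12, "rejected": 3}},
            "jvm": {"mem": {"heap_used_percent": 71}},
            "indices": {"fielddata": {"memory_size_in_bytes": 104857600}},
        }
    }
}

def search_pressure(stats):
    """Sum search queue length and rejected searches across all nodes."""
    queue = rejected = 0
    for node in stats["nodes"].values():
        pool = node["thread_pool"]["search"]
        queue += pool["queue"]
        rejected += pool["rejected"]
    return queue, rejected

print(search_pressure(sample_stats))  # -> (12, 3)
```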
13. ● Log all queries at source
● Miniature production
○ Proportionally smaller: fewer/smaller servers, less data
● Consider warming up caches
● Goal A: Experiment with optimisations
○ Replay in real-time
○ Watch impact with monitoring
○ Tune one thing and repeat
● Goal B: Identify expensive searches
○ Replay one search at a time
○ Filter by latency or metrics for single searches - how?
Replay Live Traffic
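Replaying in real time means preserving the original inter-arrival gaps (and thus the spikes). A minimal sketch of the pacing logic, assuming the query log carries ISO-format timestamps (the log format is an assumption):

```python
from datetime import datetime

def replay_delays(timestamps):
    """Seconds to wait before firing each replayed query, so the
    original traffic shape - including spikes - is preserved."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    return [0.0] + [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Two queries in the same second (a spike), then a 2 s gap.
log = ["2015-09-01T10:00:00", "2015-09-01T10:00:00", "2015-09-01T10:00:02"]
print(replay_delays(log))  # -> [0.0, 0.0, 2.0]
```

A replay script would `time.sleep()` each delay before sending the corresponding query; for Goal B it would instead send one query at a time and record per-search metrics.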
15. ● Docker Compose Stack + Python/Shell Scripts
https://github.com/joachimdraeger/elasticsearch-performance-experiments
● The Signal Media One-Million News Articles Dataset
https://research.signalmedia.co/newsir16/signal-dataset.html
One month of articles, September 2015
● Indexed in 3 different ways:
○ Daily indices with 5 shards each, e.g. articles-daily-20150901
○ One index with 5 shards (articles-5)
○ One index with 1 shard (articles-1)
● One search with 4, one search with 16 terms
● Repeat each search 1000x
Live Experiment
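The three index layouts can be sketched as follows; the daily indices follow the `articles-daily-YYYYMMDD` naming shown above (one month of data, September 2015):

```python
from datetime import date, timedelta

def daily_indices(year, month, days):
    """Generate articles-daily-YYYYMMDD index names for the experiment."""
    start = date(year, month, 1)
    return [
        f"articles-daily-{(start + timedelta(days=i)):%Y%m%d}"
        for i in range(days)
    ]

layouts = {
    "daily, 5 shards each": daily_indices(2015, 9, 30),  # 30 indices = 150 shards
    "one index, 5 shards": ["articles-5"],
    "one index, 1 shard": ["articles-1"],
}

print(layouts["daily, 5 shards each"][0])   # articles-daily-20150901
print(layouts["daily, 5 shards each"][-1])  # articles-daily-20150930
```

Same documents, three layouts: the only variable is how many shards each search has to fan out to.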
18. ● Docker Compose Stack
● Signal’s 1M articles data set
● Scripts for indexing
● 2 searches around VW diesel
● Script to run 1000 searches
● metrics.py to collect stats
● On GitHub:
tinyurl.com/esperf-2018
Live Experiment
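The real `metrics.py` lives in the linked repo; as an illustrative stand-in, here is the kind of summary a stats collector might compute over 1000 repeated searches (percentile method is a simple nearest-rank sketch):

```python
def latency_stats(latencies_ms):
    """Summarise repeated-search latencies: mean, p50, p95, max."""
    xs = sorted(latencies_ms)
    n = len(xs)

    def pct(p):
        # Simple nearest-rank percentile over the sorted sample.
        return xs[min(n - 1, int(p / 100 * n))]

    return {"mean": sum(xs) / n, "p50": pct(50), "p95": pct(95), "max": xs[-1]}

# One slow outlier (120 ms) barely moves p50 but dominates p95/max.
print(latency_stats([12, 15, 11, 14, 120, 13, 12, 16, 13, 14]))
```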
21. ● “the default number of shards will change from [5] to [1] in 7.0.0”
● Huge shards are more efficient to search (50GB!)
● One shard per server!?
● Huge shards can be difficult to move/recover
● Multiple shards => parallel indexing/searching
● Replicas for failover and balancing load
● Consider bi-weekly/monthly/quarterly/yearly indices
Last words on shards...
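A back-of-envelope sizing check under the "huge shards are efficient, but cap them around 50 GB" guideline from the slide (the data-volume numbers below are made up for illustration):

```python
import math

def shards_needed(total_gb, max_shard_gb=50):
    """Minimum primary shard count to keep each shard under max_shard_gb."""
    return max(1, math.ceil(total_gb / max_shard_gb))

# Hypothetical: 15 months of news at ~40 GB/month is ~600 GB.
print(shards_needed(600))  # -> 12 primary shards
print(shards_needed(30))   # -> 1 (matches the 7.0 default)
```

More shards would parallelise indexing and searching further, at the cost of per-shard overhead; replicas then add failover and load balancing on top of the primaries.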
22. ● Metric counters are great to measure experiments
● Shards are expensive
● Terms too!
● Elasticsearch use cases are diverse - it depends!
Summary