SlideShare uma empresa Scribd logo
1 de 38
Baixar para ler offline
Log Analytics with ELK Stack
(Architecture for aggressive cost optimization and infinite data scale)
Denis D’Souza | 27th July 2019
About me...
● Currently a DevOps engineer at Moonfrog Labs
● 6 + years working as DevOps Engineer, SRE and Linux administrator
Worked on a variety of technologies in both service-based and
product-based organisations
● How do I spend my free time ?
Learning new technologies and Playing PC Games
www.linkedin.com/in/denis-dsouza
• A Mobile Gaming Company making mass market social games
• More than 5M+ Daily Active, 15M+ Weekly Active Users
• Real time, Cross platform games optimized for Primary
Market(s) - India and subcontinent
• Profitable!
Current Scale
Who we are ?
1. Our business requirements
2. Choosing the right option
3. ELK Stack overview
4. Our ELK architecture
5. Optimizations we did
6. Cost savings
7. Key takeaways
Our problem statement
● Log analytics platform (Web-Server, Application, Database logs)
● Data Ingestion rate: ~300GB/day
● Frequently accessed data: last 8 days
● Infrequently accessed
● Uptime: 99.90
● Hot Retention period: 90 days
● Cold Retention period: 90 days (with potential to increase)
● Simple and Cost effective solution
● Fairly predictable concurrent user-base
● Not to be used for storing user/business data
Our business requirements
ELK stack Splunk Sumo logic
Product Self managed Cloud Professional
Pricing ~ $30 per GB / month ~ $100 per GB / month * ~ $108 per GB / month *
Data Ingestion ~ 300 GB / day
~ 100 GB / day *
(post ingestion custom pricing)
~ 20 GB / day *
(post ingestion custom pricing)
Retention ~ 90 days ~ 90 days * ~ 30 days *
Cost/GB/day ~$ 0.98 per GB / day ~$ 3.33 per GB / day * ~$ 3.60 per GB /day *
* values are estimations taken from the ‘product pricing web-page’ of the respective products, they may not represent the actual values and are meant for the purpose of comparison only.
References:
https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2
https://www.sumologic.com/pricing/apac/
Choosing the right option
ELK Stack overview
● Index
● Shard
○ Primary
○ Replica
● Segment
● Node
References:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html
ELK Stack overview: Terminologies
Our ELK architecture
Our ELK architecture: Hot-Warm-Cold data storage
(infinite scale)
Service
Number of
Nodes
Total CPU
Cores
Total RAM
Storage
EBS
1 Elasticsearch 7 28 141 GB
2 Logstash 3 6 12 GB
3 Kibana 1 1 4 GB
Total 11 35 157 GB ~ 20 TB
Data-ingestion per day ~ 300 GB
Hot Retention period 90 days
Docs/sec (at peak load) ~ 7K
Our ELK architecture: Size and scale
Application Side
● Logstash
● Elasticsearch
Infrastructure Side
● EC2
● EBS
● Data transfer
Optimizations we did
Optimizations we did: Application side
Logstash
Pipeline Workers:
● Adjusted "pipeline.workers" to x4 the number of
Cores to improve CPU utilisation on Logstash
server (as threads may spend significant time in
an I/O wait state)
### Core-count: 2 ###
...
pipeline.workers: 8
...
logstash.yml
References:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
Optimizations we did: Logstash
'info' logs:
● Separated application 'info' log to be store in a
different index with retention policy of fewer days
if [sourcetype] == "app_logs" and [level] == "info"
{
elasticsearch {
index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}"
...
Filter config
if [sourcetype] == "nginx" and [status] == "200"
{
elasticsearch {
index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}"
...
References:
https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html
'200' response-code logs:
● Separated Access log with '200' response-code
be store in a different index with retention policy
of fewer days
Optimizations we did: Logstash
Log ‘message’ field:
● Removed "message" field if there were no
'grok-failures' in logstash while applying grok
patterns
(reduced storage footprint by ~30% per doc)
if "_grokparsefailure" not in [tags] {
mutate {
remove_field => ["message"]
}
}
Filter config
Eg:
Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0
Gecko" "-"
Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident}
[%{HTTPDATE:timestamp}] "(?:%{WORD:verb} %{NOTSPACE:request}(?:
HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
%{QS:referrer} %{QS:agent} %{QS:forwarder}
Optimizations we did: Logstash
Elasticsearch
Optimizations we did: Application side
JVM heap vs non-heap memory:
● Optimised JVM heap-size by monitoring the GC
interval, this helped in efficient utilization of system
Memory (33% for JVM, 66% for non-heap) *
jvm.options
### Total system Memory 15GB ###
-Xms5g
-Xmx5g
Heap too small
Heap too large
Optimised Heap
* Recommended heap-size settings by Elastic:
https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
Optimizations we did: Elasticsearch
Shards:
● Created templates with number of shards which
are multiples of the number of Elasticsearch
nodes
(helps fix issues with shards distribution
imbalance which resulted in uneven disk,
compute resource usage)
### Number of ES nodes: 5 ###
{
"template": "appserver-*",
"settings": {
"number_of_shards": "5",
"number_of_replicas": "0",
...
}
}'
Trade-offs:
● Removing replicas will result in search queries
running slower as replicas are used while
performing search operations
● It is not recommended to run production clusters
without replicas
Replicas:
● Removed replicas for the required indexes
(50% savings on storage cost, ~30% reduction in
compute resource utilization)
Optimizations we did: Elasticsearch
Template config
AWS
● EC2
● EBS
● Data transfer (Inter AZ)
Spotinst platform allows users to reliably
leverage excess capacity, simplify cloud
operations and save 80% on compute costs.
Optimizations we did: Infrastructure side
Optimizations we did: Infrastructure side
EC2
Stateful EC2 Spot instances:
● Moved all ELK nodes to run on spot instances
(Instances maintain IP address, EBS volumes)
Recovery time: < 10 mins
Trade-offs:
● Prefer using previous generation instance
types to reduce frequent spot take-backs
Optimizations we did: EC2 and spot
Auto-Scaling:
● Performance/time based auto-scaling for
Logstash Instances
Optimizations we did: EC2 and spot
Optimizations we did: Infrastructure side
EBS
"Hot-Warm" Architecture:
● "Hot" nodes: store active indexes, use GP2
EBS-disks (General purpose SSD)
● "Warm" nodes: store passive indexes, use SC1
EBS-disks (Cold storage)
(~69% savings on storage cost)
node.attr.box_type: hot
...
elasticsearch.yml
"template": "appserver-*",
"settings": {
"index": {
"routing": {
"allocation": {
"require": {
"box_type": "hot"}
}
}
},
...
Template config
Trade-offs:
● Since "Warm" nodes are using SC1 EBS-disks,
they have lower IOPS, throughput this will result
in search operations being comparatively slower
References:
https://cinhtau.net/2017/06/14/hot-warm-architecture/
Optimizations we did: EBS
Moving indexes to "Warm" nodes:
● Reallocated indexes older than 8 days to "Warm"
nodes
● Recommended to perform this operation during
off-peak hours as it is I/O intensive
actions:
1:
action: allocation
description: "Move index to Warm-nodes after 8
days"
options:
key: box_type
value: warm
allocation_type: require
timeout_override:
continue_if_exception: false
filters:
- filtertype: age
source: name
direction: older
timestring: '%Y.%m.%d'
unit: days
unit_count: 8
...
Curator config
References:
https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x
Optimizations we did: EBS
Single Availability Zone:
● Migrated all ELK node to a single availability zone
(reduce inter AZ data transfer cost for ELK nodes
by 100%)
● Data transfer/day: ~700GB
(Logstash to Elasticsearch: ~300GB,
Elasticsearch inter-communication: ~400GB)
Trade-offs:
● It is not recommended to run production clusters in
a single AZ as it will result in downtime and
potential data loss in case of AZ failures
Optimizations we did: Inter-AZ data transfer
Using S3 for index Snapshots:
● Take snapshots of indexes and store them in S3
curl -XPUT
"http://<domain>:9200/_snapshot/s3_repository/
snap1?pretty?wait_for_completion=true" -d'
{
"indices": "index_1,index_2",
"ignore_unavailable": true,
"include_global_state": false
}
Backup:
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f
Data backup and restore
curl -s -XPOST --url
"http://<domain>:9200/_snapshot/s3_repository/s
nap1/_restore" -d'
{
"indices": "index_1,index_2",
"ignore_unavailable": true,
"include_global_state": false,
}'
On-demand Elasticsearch cluster:
● Launching a on demand ES cluster and importing
the snapshots from S3
Existing Cluster:
● Restore the required snapshots to existing cluster
Restore:
Data backup and restore
Data corruption:
● List out indexes with status as ‘Red’
● Deleted the corrupted indexes
● Restore indexes from S3 snapshots
● Recovery time: depends of size of data
Node failure due to AZ going down:
● Launch a new ELK cluster using AWS cloud
formation templates
● Do the necessary config changes in Filebeat,
Logstash etc.
● Restore the required indexes from S3 snapshots
● Recovery time: depends on provisioning time and
size of data
Node failures due to underlying hardware issue:
● Recycle node in Spotinst console
(will take AMI of root volume, launch new instance,
attach EBS volumes, maintain private IP)
● Recovery time: < 10 mins/node
Snapshot restore time (estimates):
● < 4mins for a 20GB snapshot (test-cluster: 3
nodes, multiple indexes with 3 primary shards
each, no replicas)
Disaster recovery
EC2
Instance type Service Daily cost
5 x r5.xlarge (20C, 160GB) Elasticsearch 40.80
3 x c5.large (6C, 12GB) Logstash 7.17
1 x t3.medium (2C, 4GB) Kibana 1.29
Total ~ 49.26$
EC2 (optimized)
Instance type Service
Daily cost
65% savings + Spotinst charges (20% of savings) Total Savings
5 x m4.xlarge (20C, 80GB) Elasticsearch Hot 14.64
2 x r4.xlarge (8C, 61GB) Elasticsearch Warm 7.50
3 x c4.large (6C, 12GB) Logstash 3.50
1 x t2.medium (2C, 4GB) Kibana 0.69
Total ~ 26.33$ ~ 47%
Cost savings: EC2
Ingesting: 300GB/day
Retention: 90 days
Replica count: 1
Storage
Storage type Retention Daily cost
~54TB (GP2) 90 days ~ 237.60$
Storage (optimized)
Storage type Retention Daily cost Total Savings
~ 3TB (GP2) Hot 8 days 12.00
~ 24TB (SC1) Warm 82 days 24.00
~ 27TB (S3) Backup 90 days 22.50
Total ~ 58.50$ ~ 75%
Ingesting: 300GB/day
Retention: 90 days
Replica count: 0
Backups: Daily S3 snapshots
Cost savings: Storage
ELK stack
ELK stack
(optimized) Savings
EC2 49.40 26.33 47%
Storage 237.60 58.50 75%
Data-transfer 7 0 100%
Total (daily cost) ~ 294.00$ ~ 84.83$ ~ 71% *
Cost/GB (daily) ~ 0.98$ ~ 0.28$
* Total savings are exclusive of some of the application-level optimizations done
Total savings
ELK Stack
(optimized) ELK Stack Splunk Sumo logic
Product Self managed Self managed Cloud Professional
Data Ingestion ~ 300GB/day ~ 300GB/day
~ 100 GB / day *
(post ingestion custom pricing)
~ 20 GB / day *
(post ingestion custom pricing)
Retention ~ 90 days ~ 90 days ~ 90 days * ~ 30 days *
Cost/GB/day ~ $ 0.28 per GB /day ~ $ 0.98 per GB /day ~ $ 3.33 per GB /day * ~ $ 3.60 per GB /day *
Savings over traditional ELK stack: 71% *
* Total savings are exclusive of some of the application-level optimizations done
Our Costs vs other Platforms
ELK Stack Scalability:
● Logstash: auto-scaling
● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling
Handling potential data-loss while AZ is down:
● DR mechanisms in place, daily/hourly backups stored in S3, Potential chances of data loss of about 1 hour
● We do not store user-data or business metrics in ELK, users/business will not be impacted
Handling potential data-corruptions in Elasticsearch:
● DR mechanisms in place, recover index from S3 index-snapshots
Managing downtime during spot take-backs:
● Logstash: multiple nodes, minimal impact
● Elasticsearch/Kibana: < 10min downtime per node
● Use previous generation instance types as spot take-back chances are comparatively low
Key Takeaways
Handling back-pressure when a node is down:
● Filebeat: will auto-retry to send old logs
● Logstash: use ‘date’ filter for document timestamp, auto-scaling
● Elasticsearch: overprovisioning
Other log analytics alternatives:
● We have only evaluated ELK, Splunk and Sumo Logic
ELK stack upgrade path:
● Blue Green deployment for major version upgrade
Key Takeaways
● We built a platform tailored to our requirements, yours might be different...
● Building a log analytics platform is not rocket science, but it can be painfully iterative if you
are not aware of the options
● Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for
your specific requirements
Reflection
Thank you!
Happy to take your questions..
Copyright Disclaimer: All rights to the materials used for this presentation belongs to their respective owners..

Mais conteúdo relacionado

Mais procurados

What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...Edureka!
 
ELK in Security Analytics
ELK in Security Analytics ELK in Security Analytics
ELK in Security Analytics nullowaspmumbai
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elkRushika Shah
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaSpringPeople
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibanadknx01
 
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...NETWAYS
 
Serverless and Design Patterns In GCP
Serverless and Design Patterns In GCPServerless and Design Patterns In GCP
Serverless and Design Patterns In GCPOliver Fierro
 
Elastic Cloud keynote
Elastic Cloud keynoteElastic Cloud keynote
Elastic Cloud keynoteElasticsearch
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashAmazon Web Services
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchNETWAYS
 
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...Edureka!
 
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...NETWAYS
 
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...ForgeRock
 
Empower Your Security Practitioners with Elastic SIEM
Empower Your Security Practitioners with Elastic SIEMEmpower Your Security Practitioners with Elastic SIEM
Empower Your Security Practitioners with Elastic SIEMElasticsearch
 
Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaAmazee Labs
 

Mais procurados (20)

Introduction to ELK
Introduction to ELKIntroduction to ELK
Introduction to ELK
 
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
 
Elk - An introduction
Elk - An introductionElk - An introduction
Elk - An introduction
 
ELK in Security Analytics
ELK in Security Analytics ELK in Security Analytics
ELK in Security Analytics
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elk
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Logstash-Elasticsearch-Kibana
Logstash-Elasticsearch-KibanaLogstash-Elasticsearch-Kibana
Logstash-Elasticsearch-Kibana
 
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
 
ELK introduction
ELK introductionELK introduction
ELK introduction
 
Serverless and Design Patterns In GCP
Serverless and Design Patterns In GCPServerless and Design Patterns In GCP
Serverless and Design Patterns In GCP
 
Elastic Cloud keynote
Elastic Cloud keynoteElastic Cloud keynote
Elastic Cloud keynote
 
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and LogstashKeeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash
 
OSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearchOSMC 2021 | Introduction into OpenSearch
OSMC 2021 | Introduction into OpenSearch
 
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...
Kibana Tutorial | Kibana Dashboard Tutorial | Kibana Elasticsearch | ELK Stac...
 
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...
OSMC 2022 | Ignite: Observability with Grafana & Prometheus for Kafka on Kube...
 
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
Customer Intelligence: Using the ELK Stack to Analyze ForgeRock OpenAM Audit ...
 
Empower Your Security Practitioners with Elastic SIEM
Empower Your Security Practitioners with Elastic SIEMEmpower Your Security Practitioners with Elastic SIEM
Empower Your Security Practitioners with Elastic SIEM
 
Fleet and elastic agent
Fleet and elastic agentFleet and elastic agent
Fleet and elastic agent
 
Kibana overview
Kibana overviewKibana overview
Kibana overview
 
Logging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & KibanaLogging with Elasticsearch, Logstash & Kibana
Logging with Elasticsearch, Logstash & Kibana
 

Semelhante a Log analytics with ELK stack

Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Etti Gur
 
Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on KubernetesJoerg Henning
 
Benchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsBenchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsdata://disrupted®
 
Mongodb in-anger-boston-rb-2011
Mongodb in-anger-boston-rb-2011Mongodb in-anger-boston-rb-2011
Mongodb in-anger-boston-rb-2011bostonrb
 
Elasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep diveElasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep diveSematext Group, Inc.
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performanceEngine Yard
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...ronwarshawsky
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Durable Azure Functions
Durable Azure FunctionsDurable Azure Functions
Durable Azure FunctionsPushkar Saraf
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deploymentsOdoo
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisationgrooverdan
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedDainius Jocas
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheusBob Cotton
 
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
 

Semelhante a Log analytics with ELK stack (20)

Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?
 
Elasticsearch on Kubernetes
Elasticsearch on KubernetesElasticsearch on Kubernetes
Elasticsearch on Kubernetes
 
Benchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsBenchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public clouds
 
Mongodb in-anger-boston-rb-2011
Mongodb in-anger-boston-rb-2011Mongodb in-anger-boston-rb-2011
Mongodb in-anger-boston-rb-2011
 
Elasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep diveElasticsearch for Logs & Metrics - a deep dive
Elasticsearch for Logs & Metrics - a deep dive
 
6 tips for improving ruby performance
6 tips for improving ruby performance6 tips for improving ruby performance
6 tips for improving ruby performance
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
MongoDB performance tuning and load testing, NOSQL Now! 2013 Conference prese...
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Durable Azure Functions
Durable Azure FunctionsDurable Azure Functions
Durable Azure Functions
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deployments
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisation
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
Varnish - PLNOG 4
Varnish - PLNOG 4Varnish - PLNOG 4
Varnish - PLNOG 4
 
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Bengaluru 2019: MongoDB Atlas Data Lake Technical Deep Dive
 

Mais de AWS User Group Bengaluru

Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3AWS User Group Bengaluru
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSAWS User Group Bengaluru
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerAWS User Group Bengaluru
 
Slack's transition away from a single AWS account
Slack's transition away from a single AWS accountSlack's transition away from a single AWS account
Slack's transition away from a single AWS accountAWS User Group Bengaluru
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSAWS User Group Bengaluru
 
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markMedlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markAWS User Group Bengaluru
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerAWS User Group Bengaluru
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3AWS User Group Bengaluru
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedAWS User Group Bengaluru
 

Mais de AWS User Group Bengaluru (20)

Demystifying identity on AWS
Demystifying identity on AWSDemystifying identity on AWS
Demystifying identity on AWS
 
AWS Secrets for Best Practices
AWS Secrets for Best PracticesAWS Secrets for Best Practices
AWS Secrets for Best Practices
 
Cloud Security
Cloud SecurityCloud Security
Cloud Security
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3
 
Medlife journey with AWS
Medlife journey with AWSMedlife journey with AWS
Medlife journey with AWS
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWS
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful career
 
Slack's transition away from a single AWS account
Slack's transition away from a single AWS accountSlack's transition away from a single AWS account
Slack's transition away from a single AWS account
 
Serverless Culture
Serverless CultureServerless Culture
Serverless Culture
 
Refactoring to serverless
Refactoring to serverlessRefactoring to serverless
Refactoring to serverless
 
Amazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances WorkshopAmazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances Workshop
 
Building Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWSBuilding Efficient, Scalable and Resilient Front-end logging service with AWS
Building Efficient, Scalable and Resilient Front-end logging service with AWS
 
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit markMedlife's journey with AWS from 0(zero) orders to 6 digit mark
Medlife's journey with AWS from 0(zero) orders to 6 digit mark
 
AWS Secrets for Best Practices
AWS Secrets for Best PracticesAWS Secrets for Best Practices
AWS Secrets for Best Practices
 
Exploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful careerExploring opportunities with communities for a successful career
Exploring opportunities with communities for a successful career
 
Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3Lessons learnt building a Distributed Linked List on S3
Lessons learnt building a Distributed Linked List on S3
 
Cloud Security
Cloud SecurityCloud Security
Cloud Security
 
Amazon EC2 Spot Instances
Amazon EC2 Spot InstancesAmazon EC2 Spot Instances
Amazon EC2 Spot Instances
 
Cost Optimization in AWS
Cost Optimization in AWSCost Optimization in AWS
Cost Optimization in AWS
 
Keynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practicedKeynote - Chaos Engineering: Why breaking things should be practiced
Keynote - Chaos Engineering: Why breaking things should be practiced
 

Último

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 

Log analytics with ELK stack

  • 1. Log Analytics with ELK Stack (Architecture for aggressive cost optimization and infinite data scale) Denis D’Souza | 27th July 2019
  • 2. About me... ● Currently a DevOps engineer at Moonfrog Labs ● 6 + years working as DevOps Engineer, SRE and Linux administrator Worked on a variety of technologies in both service-based and product-based organisations ● How do I spend my free time ? Learning new technologies and Playing PC Games www.linkedin.com/in/denis-dsouza
  • 3. • A Mobile Gaming Company making mass market social games • More than 5M+ Daily Active, 15M+ Weekly Active Users • Real time, Cross platform games optimized for Primary Market(s) - India and subcontinent • Profitable! Current Scale Who we are ?
  • 4. 1. Our business requirements 2. Choosing the right option 3. ELK Stack overview 4. Our ELK architecture 5. Optimizations we did 6. Cost savings 7. Key takeaways Our problem statement
  • 5. ● Log analytics platform (Web-Server, Application, Database logs) ● Data Ingestion rate: ~300GB/day ● Frequently accessed data: last 8 days ● Infrequently accessed ● Uptime: 99.90 ● Hot Retention period: 90 days ● Cold Retention period: 90 days (with potential to increase) ● Simple and Cost effective solution ● Fairly predictable concurrent user-base ● Not to be used for storing user/business data Our business requirements
  • 6. ELK stack Splunk Sumo logic Product Self managed Cloud Professional Pricing ~ $30 per GB / month ~ $100 per GB / month * ~ $108 per GB / month * Data Ingestion ~ 300 GB / day ~ 100 GB / day * (post ingestion custom pricing) ~ 20 GB / day * (post ingestion custom pricing) Retention ~ 90 days ~ 90 days * ~ 30 days * Cost/GB/day ~$ 0.98 per GB / day ~$ 3.33 per GB / day * ~$ 3.60 per GB /day * * values are estimations taken from the ‘product pricing web-page’ of the respective products, they may not represent the actual values and are meant for the purpose of comparison only. References: https://www.splunk.com/en_us/products/pricing/calculator.html#tabs/tab2 https://www.sumologic.com/pricing/apac/ Choosing the right option
  • 8. ● Index ● Shard ○ Primary ○ Replica ● Segment ● Node References: https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_basic_concepts.html ELK Stack overview: Terminologies
  • 10. Our ELK architecture: Hot-Warm-Cold data storage (infinite scale)
  • 11. Service Number of Nodes Total CPU Cores Total RAM Storage EBS 1 Elasticsearch 7 28 141 GB 2 Logstash 3 6 12 GB 3 Kibana 1 1 4 GB Total 11 35 157 GB ~ 20 TB Data-ingestion per day ~ 300 GB Hot Retention period 90 days Docs/sec (at peak load) ~ 7K Our ELK architecture: Size and scale
  • 12. Application Side ● Logstash ● Elasticsearch Infrastructure Side ● EC2 ● EBS ● Data transfer Optimizations we did
  • 13. Optimizations we did: Application side Logstash
  • 14. Pipeline Workers: ● Adjusted "pipeline.workers" to x4 the number of Cores to improve CPU utilisation on Logstash server (as threads may spend significant time in an I/O wait state) ### Core-count: 2 ### ... pipeline.workers: 8 ... logstash.yml References: https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html Optimizations we did: Logstash
  • 15. 'info' logs: ● Separated application 'info' log to be store in a different index with retention policy of fewer days if [sourcetype] == "app_logs" and [level] == "info" { elasticsearch { index => "%{sourcetype}-%{level}-%{+YYYY.MM.dd}" ... Filter config if [sourcetype] == "nginx" and [status] == "200" { elasticsearch { index => "%{sourcetype}-%{status}-%{+YYYY.MM.dd}" ... References: https://www.elastic.co/guide/en/logstash/current/event-dependent-configuration.html '200' response-code logs: ● Separated Access log with '200' response-code be store in a different index with retention policy of fewer days Optimizations we did: Logstash
  • 16. Log ‘message’ field: ● Removed "message" field if there were no 'grok-failures' in logstash while applying grok patterns (reduced storage footprint by ~30% per doc) if "_grokparsefailure" not in [tags] { mutate { remove_field => ["message"] } } Filter config Eg: Nginx Log-message: 127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-" Grok Pattern: %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} [%{HTTPDATE:timestamp}] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder} Optimizations we did: Logstash
  • 18. JVM heap vs non-heap memory: ● Optimised JVM heap-size by monitoring the GC interval, this helped in efficient utilization of system Memory (33% for JVM, 66% for non-heap) * jvm.options ### Total system Memory 15GB ### -Xms5g -Xmx5g Heap too small Heap too large Optimised Heap * Recommended heap-size settings by Elastic: https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html Optimizations we did: Elasticsearch
  • 19. Shards: ● Created templates with number of shards which are multiples of the number of Elasticsearch nodes (helps fix issues with shards distribution imbalance which resulted in uneven disk, compute resource usage) ### Number of ES nodes: 5 ### { "template": "appserver-*", "settings": { "number_of_shards": "5", "number_of_replicas": "0", ... } }' Trade-offs: ● Removing replicas will result in search queries running slower as replicas are used while performing search operations ● It is not recommended to run production clusters without replicas Replicas: ● Removed replicas for the required indexes (50% savings on storage cost, ~30% reduction in compute resource utilization) Optimizations we did: Elasticsearch Template config
  • 20. AWS ● EC2 ● EBS ● Data transfer (Inter AZ) Spotinst platform allows users to reliably leverage excess capacity, simplify cloud operations and save 80% on compute costs. Optimizations we did: Infrastructure side
  • 21. Optimizations we did: Infrastructure side EC2
  • 22. Stateful EC2 Spot instances: ● Moved all ELK nodes to run on spot instances (Instances maintain IP address, EBS volumes) Recovery time: < 10 mins Trade-offs: ● Prefer using previous generation instance types to reduce frequent spot take-backs Optimizations we did: EC2 and spot
  • 23. Auto-Scaling: ● Performance/time based auto-scaling for Logstash Instances Optimizations we did: EC2 and spot
  • 24. Optimizations we did: Infrastructure side EBS
  • 25. "Hot-Warm" Architecture: ● "Hot" nodes: store active indexes, use GP2 EBS-disks (General purpose SSD) ● "Warm" nodes: store passive indexes, use SC1 EBS-disks (Cold storage) (~69% savings on storage cost) node.attr.box_type: hot ... elasticsearch.yml "template": "appserver-*", "settings": { "index": { "routing": { "allocation": { "require": { "box_type": "hot"} } } }, ... Template config Trade-offs: ● Since "Warm" nodes are using SC1 EBS-disks, they have lower IOPS, throughput this will result in search operations being comparatively slower References: https://cinhtau.net/2017/06/14/hot-warm-architecture/ Optimizations we did: EBS
  • 26. Moving indexes to "Warm" nodes: ● Reallocated indexes older than 8 days to "Warm" nodes ● Recommended to perform this operation during off-peak hours as it is I/O intensive actions: 1: action: allocation description: "Move index to Warm-nodes after 8 days" options: key: box_type value: warm allocation_type: require timeout_override: continue_if_exception: false filters: - filtertype: age source: name direction: older timestring: '%Y.%m.%d' unit: days unit_count: 8 ... Curator config References: https://www.elastic.co/blog/hot-warm-architecture-in-elasticsearch-5-x Optimizations we did: EBS
  • 27. Single Availability Zone: ● Migrated all ELK node to a single availability zone (reduce inter AZ data transfer cost for ELK nodes by 100%) ● Data transfer/day: ~700GB (Logstash to Elasticsearch: ~300GB, Elasticsearch inter-communication: ~400GB) Trade-offs: ● It is not recommended to run production clusters in a single AZ as it will result in downtime and potential data loss in case of AZ failures Optimizations we did: Inter-AZ data transfer
  • 28. Using S3 for index Snapshots: ● Take snapshots of indexes and store them in S3 curl -XPUT "http://<domain>:9200/_snapshot/s3_repository/ snap1?pretty?wait_for_completion=true" -d' { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": false } Backup: References: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html https://medium.com/@federicopanini/elasticsearch-backup-snapshot-and-restore-on-aws-s3-f1fc32fbca7f Data backup and restore
  • 29. curl -s -XPOST --url "http://<domain>:9200/_snapshot/s3_repository/s nap1/_restore" -d' { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": false, }' On-demand Elasticsearch cluster: ● Launching a on demand ES cluster and importing the snapshots from S3 Existing Cluster: ● Restore the required snapshots to existing cluster Restore: Data backup and restore
  • 30. Data corruption: ● List out indexes with status as ‘Red’ ● Deleted the corrupted indexes ● Restore indexes from S3 snapshots ● Recovery time: depends of size of data Node failure due to AZ going down: ● Launch a new ELK cluster using AWS cloud formation templates ● Do the necessary config changes in Filebeat, Logstash etc. ● Restore the required indexes from S3 snapshots ● Recovery time: depends on provisioning time and size of data Node failures due to underlying hardware issue: ● Recycle node in Spotinst console (will take AMI of root volume, launch new instance, attach EBS volumes, maintain private IP) ● Recovery time: < 10 mins/node Snapshot restore time (estimates): ● < 4mins for a 20GB snapshot (test-cluster: 3 nodes, multiple indexes with 3 primary shards each, no replicas) Disaster recovery
  • 31. EC2 Instance type Service Daily cost 5 x r5.xlarge (20C, 160GB) Elasticsearch 40.80 3 x c5.large (6C, 12GB) Logstash 7.17 1 x t3.medium (2C, 4GB) Kibana 1.29 Total ~ 49.26$ EC2 (optimized) Instance type Service Daily cost 65% savings + Spotinst charges (20% of savings) Total Savings 5 x m4.xlarge (20C, 80GB) Elasticsearch Hot 14.64 2 x r4.xlarge (8C, 61GB) Elasticsearch Warm 7.50 3 x c4.large (6C, 12GB) Logstash 3.50 1 x t2.medium (2C, 4GB) Kibana 0.69 Total ~ 26.33$ ~ 47% Cost savings: EC2
  • 32. Ingesting: 300GB/day Retention: 90 days Replica count: 1 Storage Storage type Retention Daily cost ~54TB (GP2) 90 days ~ 237.60$ Storage (optimized) Storage type Retention Daily cost Total Savings ~ 3TB (GP2) Hot 8 days 12.00 ~ 24TB (SC1) Warm 82 days 24.00 ~ 27TB (S3) Backup 90 days 22.50 Total ~ 58.50$ ~ 75% Ingesting: 300GB/day Retention: 90 days Replica count: 0 Backups: Daily S3 snapshots Cost savings: Storage
  • 33. ELK stack ELK stack (optimized) Savings EC2 49.40 26.33 47% Storage 237.60 58.50 75% Data-transfer 7 0 100% Total (daily cost) ~ 294.00$ ~ 84.83$ ~ 71% * Cost/GB (daily) ~ 0.98$ ~ 0.28$ * Total savings are exclusive of some of the application-level optimizations done Total savings
  • 34. ELK Stack (optimized) ELK Stack Splunk Sumo logic Product Self managed Self managed Cloud Professional Data Ingestion ~ 300GB/day ~ 300GB/day ~ 100 GB / day * (post ingestion custom pricing) ~ 20 GB / day * (post ingestion custom pricing) Retention ~ 90 days ~ 90 days ~ 90 days * ~ 30 days * Cost/GB/day ~ $ 0.28 per GB /day ~ $ 0.98 per GB /day ~ $ 3.33 per GB /day * ~ $ 3.60 per GB /day * Savings over traditional ELK stack: 71% * * Total savings are exclusive of some of the application-level optimizations done Our Costs vs other Platforms
  • 35. ELK Stack Scalability: ● Logstash: auto-scaling ● Elasticsearch: overprovisioning (nodes run at 60% capacity during peak load), predictive vertical/horizontal scaling Handling potential data-loss while AZ is down: ● DR mechanisms in place, daily/hourly backups stored in S3, Potential chances of data loss of about 1 hour ● We do not store user-data or business metrics in ELK, users/business will not be impacted Handling potential data-corruptions in Elasticsearch: ● DR mechanisms in place, recover index from S3 index-snapshots Managing downtime during spot take-backs: ● Logstash: multiple nodes, minimal impact ● Elasticsearch/Kibana: < 10min downtime per node ● Use previous generation instance types as spot take-back chances are comparatively low Key Takeaways
  • 36. Handling back-pressure when a node is down: ● Filebeat: will auto-retry to send old logs ● Logstash: use ‘date’ filter for document timestamp, auto-scaling ● Elasticsearch: overprovisioning Other log analytics alternatives: ● We have only evaluated ELK, Splunk and Sumo Logic ELK stack upgrade path: ● Blue Green deployment for major version upgrade Key Takeaways
  • 37. ● We built a platform tailored to our requirements, yours might be different... ● Building a log analytics platform is not rocket science, but it can be painfully iterative if you are not aware of the options ● Be aware of the trade-offs you are ‘OK with’ and you can roll out a solution optimised for your specific requirements Reflection
  • 38. Thank you! Happy to take your questions.. Copyright Disclaimer: All rights to the materials used for this presentation belongs to their respective owners..