2021/11/14
Hojin Shim / Site Reliability Engineer
ELK Stack - Improving Log Processing Speed
With requests averaging about 1 million per minute, logs started to back up.
Various Logging Pipeline


Architecture Patterns
Logging Patterns
Well-known patterns
• Remote logging 

• File Logging & Cron backup

• Logging pipeline without stream

• Logging pipeline with stream
Logging Patterns
Remote Logging
App Somewhere
Logging over network
Ex) Logback / Log4j (Java)
DB, Storage, etc.
• Low risk of losing records

• High risk of lag / throughput
Logging Patterns
File Logging & Cron Backup
App
PutObject
S3
• High risk of losing records

• It depends on deployment patterns

• Difficult to analyse

• It’s simple
Cron
Disk volume
Logging Patterns
Logging Pipeline Patterns (w/o stream)
App
• Risk of high throughput

• Risk of losing records
Forwarder (pre-processor)
Disk volume
Forwarder (post-processor)
Search Engine
Logging Patterns
Logging Pipeline Patterns (w/ stream)
App
• Low risk of high throughput

• Low risk of losing records 

• High cost
Forwarder (pre-processor)
Disk volume
Forwarder (post-processor)
Search Engine
Stream
Logging Patterns
ELK Stack (Elastic Stack)
App
• Low risk of high throughput & losing records

• High cost

• Requires deep & wide technical knowledge
Disk volume
Elasticsearch
MSK (Kafka)
Filebeat
Logstash
Kibana
&
$$$ $$$
Logging Lag
Increase logging
Elasticsearch
MSK (Kafka)
Logstash
App / Disk volume / Filebeat (three instances)
Requests
Lag!!
Now
Lag
What is the problem?
So many things could be a reason
• Filebeat I/O problem

• Kafka performance problem

• Logstash slow ingestion / processing problem

• Elasticsearch performance problem

• etc
Measurement
Measurement
What to measure?
Filebeat
• Basic system metrics
• Etc

MSK (Kafka)
• Basic system metrics
• Burst balance
• Bandwidth throttling
• Lag per topic
• Etc

Logstash
• Basic system metrics
• Number of events processed
• Etc

Elasticsearch
• Basic system metrics
• Indexing rate / latency
• Etc
Measurement
How to measure? (Based on my experience)
Filebeat
• Telegraf
• InfluxDB
• Grafana

MSK (Kafka)
• CloudWatch
• Burrow / Prometheus
• Elasticsearch
• Grafana

Logstash
• Telegraf
• Elasticsearch
• Grafana

Elasticsearch
• CloudWatch
• Grafana
Focus of this session: consumer lag monitoring and Logstash processing rate monitoring
Measurement
Consumer Lag
Measurement
Consumer-lag
https://www.lightbend.com/blog/monitor-kafka-consumer-group-latency-with-kafka-lag-exporter
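As a rough illustration of what consumer lag means, the sketch below computes it per partition as the newest offset in the partition minus the consumer group's last committed offset. The offsets are made-up numbers, and treating a partition with no committed offset as "committed at 0" is a simplifying assumption of this sketch.

```python
from typing import Dict


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = newest offset in the partition - last committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no commit yet is treated as committed at 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a topic with three partitions.
log_end = {0: 1_200, 1: 950, 2: 4_000}
committed = {0: 1_200, 1: 900, 2: 1_500}

per_partition = consumer_lag(log_end, committed)
total_lag = sum(per_partition.values())
print(per_partition, total_lag)
```

A lag that keeps growing on one partition while the others stay near zero usually points at uneven load rather than at a slow consumer overall.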
Measurement
Consumer-lag measurement
• Kubernetes-friendly way
  • Open Monitoring with Prometheus
• Always-available way (demo in this session)
  • Burrow / Telegraf
Measurement
Burrow / Telegraf
• Burrow
  • Open source, developed by LinkedIn
  • Apache Kafka monitoring tool
  • HTTP endpoint for information
• Telegraf
  • Open source, developed by InfluxData
  • General-purpose metrics collection
  • Plugin system
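Because Burrow exposes its results over HTTP as JSON, a collector only has to parse a response and aggregate it. The sketch below does that on a hard-coded sample instead of a live endpoint. The payload shape is a simplified assumption loosely modelled on Burrow's consumer lag response, and the cluster, group and topic names are placeholders, so check the field names against your Burrow version.

```python
import json

# Hypothetical, simplified response body (not captured from a real cluster).
SAMPLE_RESPONSE = """
{
  "error": false,
  "status": {
    "cluster": "product-elk",
    "group": "logstash",
    "status": "WARN",
    "totallag": 2550,
    "partitions": [
      {"topic": "order", "partition": 0, "current_lag": 0},
      {"topic": "order", "partition": 1, "current_lag": 50},
      {"topic": "inventory", "partition": 0, "current_lag": 2500}
    ]
  }
}
"""


def lag_by_topic(payload: str) -> dict:
    """Sum the per-partition lag of each topic in one response."""
    status = json.loads(payload)["status"]
    totals = {}
    for part in status["partitions"]:
        totals[part["topic"]] = totals.get(part["topic"], 0) + part["current_lag"]
    return totals


topic_lag = lag_by_topic(SAMPLE_RESPONSE)
worst_topic = max(topic_lag, key=topic_lag.get)
print(topic_lag, worst_topic)
```

In the setup described here Telegraf's burrow input plays this role, so the code is only meant to show what kind of data ends up in the index.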
Measurement
Consumer-lag measurement with Burrow
MSK
(Kafka)
Burrow / Telegraf
Elasticsearch Grafana
Burrow Telegraf
Measurement
Burrow config code snippet

...

# Your zookeeper endpoint
[zookeeper]
servers=[ "z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181","z-2.elk.kafka.ap-northeast-2.amazonaws.com:2181","z-1.product-elk-msk-abc.kafka.ap-northeast-2.amazonaws.com:2181" ]
timeout=6
root-path="/burrow"

[consumer.product-elk]
class-name="kafka"
cluster="product-elk"
# Your bootstrap server endpoint
servers=[ "b-2.elk.kafka.ap-northeast-2.amazonaws.com:9094","b-1.elk.kafka.ap-northeast-2.amazonaws.com:9094" ]
client-profile="your_profile"
group-denylist="^(some-group-|python-kafka-consumer-|quick-).*$"
group-allowlist=""

[cluster.product-elk]
class-name="kafka"
# Your bootstrap server endpoint
servers=[ "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094","b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094" ]
client-profile="test"
topic-refresh=60
offset-refresh=30

# If you use client / broker encryption
[tls.msk-mTLS]
cafile="/etc/burrow/truststore.pem"
noverify=true

...

Burrow configuration - /etc/burrow/burrow.toml
Measurement
Telegraf config code snippet

[[inputs.burrow]]
  servers = ["https://your.burrow-endpoint.com"]
  topics_exclude = [ "__consumer_offsets" ]
  groups_exclude = ["console-*"]
  # Use a tag if you collect other metrics too
  [inputs.burrow.tags]
    burrow = "burrow"

[[outputs.elasticsearch]]
  urls = [ "http://your-elasticsearch-endpoint:9200" ]
  timeout = "5s"
  enable_sniffer = false
  health_check_interval = "10s"
  index_name = "burrow-%Y.%m.%d"
  manage_template = true
  template_name = "telegraf-burrow"
  # Filter metrics by tag
  [outputs.elasticsearch.tagpass]
    burrow = ["burrow"]

Telegraf configuration - /etc/telegraf/telegraf.d/burrow.conf
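Two details of this configuration are easy to misread: `tagpass` only lets a metric through when one of the listed tag keys is present with a matching (glob) value, and `index_name` expands date specifiers into a daily index. The sketch below mimics both behaviours in plain Python. It is an approximation for illustration and not Telegraf's implementation, and the sample metrics and function names are hypothetical.

```python
from datetime import date
from fnmatch import fnmatchcase


def passes_tagpass(tags: dict, tagpass: dict) -> bool:
    """True if any tagpass key is present in tags with a value matching a glob."""
    return any(
        key in tags and any(fnmatchcase(tags[key], pattern) for pattern in patterns)
        for key, patterns in tagpass.items()
    )


def index_for(day: date, index_name: str = "burrow-%Y.%m.%d") -> str:
    """Expand the date specifiers of the index name for a given day."""
    return day.strftime(index_name)


TAGPASS = {"burrow": ["burrow"]}
burrow_metric = {"burrow": "burrow", "group": "logstash"}  # tagged by inputs.burrow.tags
cpu_metric = {"host": "some-host"}                         # some other input, untagged

print(passes_tagpass(burrow_metric, TAGPASS))
print(passes_tagpass(cpu_metric, TAGPASS))
print(index_for(date(2021, 11, 14)))
```

The tag is what keeps unrelated Telegraf metrics out of the burrow index when the same agent collects other things.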
Measurement
Data from burrow index
Some Topic Name
Lag Information
Partition
Measurement
Visualization with Grafana
[Figure: Grafana panels showing lag per topic]
Measurement
Logstash Processing Rate
Measurement
Visualization with Timelion

input {
  kafka {
    bootstrap_servers => "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094,b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094"
    # topics_pattern is a regular expression; a bare "*" is not a valid pattern
    topics_pattern => ".*"
    consumer_threads => 1
    codec => "json"
    decorate_events => true
    group_id => "logstash"
    security_protocol => "SSL"
    ssl_truststore_location => "/logstash/kafka.client.truststore.jks"
    enable_auto_commit => "true"
  }
}

...

filter {
  ...
  # Add logstash metric
  metrics {
    meter => "events"
    add_tag => "metric"
    add_field => {
      "lsname" => "some-logstash"
    }
  }
}

...

output {
  # (preceding if-branch elided on the slide)
  else if "metric" in [tags] {
    elasticsearch {
      hosts => ["eskibana.prd.in.musinsa.com:9200"]
      index => "logstash-metric-%{+yyyy.MM.dd}"
    }
  }
  ...
}

logstash pipeline configuration - ./logstash/pipeline/logstash.conf
Measurement
Data from burrow index
Some Logstash Name
Event processing rate 1m
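Once the processing rate and the consumer lag are both measured, a back-of-the-envelope calculation shows whether the pipeline can ever catch up. The sketch below assumes one log event per request (so about 1 million events per minute, as stated at the start of the deck). The backlog size and the two consume rates are hypothetical numbers.

```python
def events_per_second(events_per_minute: float) -> float:
    """Convert a per-minute rate to a per-second rate."""
    return events_per_minute / 60.0


def seconds_to_drain(lag: int, produce_rate: float, consume_rate: float) -> float:
    """Time until lag reaches zero; infinite if the consumer is not faster."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return float("inf")
    return lag / surplus


produce = events_per_second(1_000_000)   # about 16,667 events/s
backlog = 3_000_000                      # hypothetical current lag, in events

falling_behind = seconds_to_drain(backlog, produce, consume_rate=15_000)
catching_up = seconds_to_drain(backlog, produce, consume_rate=20_000)
print(falling_behind, round(catching_up))
```

With these assumed numbers, a consumer at 15,000 events/s never catches up, while 20,000 events/s clears the backlog in roughly 15 minutes.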
Measurement
Visualization with Timelion
Problems & Solutions
Logstash grok performance
Logstash filter performance
grok grok grok!
• Some log message might cause parsing problem

• Some special characters

• Long log messages

• Etc
http://some-domain/app/product/goodsview_stats/1474978/0?
utm_source=naver_jisicshopping&utm_medium=sh&source=NVSH&NaPm=ct%3Dkvyxfobc%7Cci%3Dd4151183d55ce2828c56f84eb392eab7338b2026%7Ctr%3Dslct%7Csn%3D204973%7Chk
ab6de6182e50b01b182e15ae740bcb84ce&menu=view&3Dcee524ab6de6182e50b01b182e15ae740bcb84ce&q=b3Dcee524ab6de6182e50b01b182e15ae740bcb84ce.....................
Logstash filter performance
grok grok grok!
[2021-09-03T17:26:25,923][WARN ][logstash.filters.grok ][main][8c1ed634e6ffe7026b0a684399b6a4893634d376554d997095836bd11d71a1c7] Timeout executing grok '%{IPORHOST:[nginx][access][remote_ip]} ......................'
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-timeout_millis
Logstash filter performance
grok grok grok!
...

filter {
  if [event][dataset] == "nginx.access" {
    grok {
      match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - ................"] }
      remove_field => "message"
      # Add short grok parsing timeout
      timeout_millis => 300
    }
    ...

logstash pipeline configuration - ./logstash/pipeline/logstash.conf
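To see why a shorter timeout matters, the sketch below estimates pipeline throughput when a small fraction of lines is pathological and, in the worst case, occupies a worker until the timeout fires. The 30,000 ms figure is grok's documented default for `timeout_millis`. The worker count matches the deck, but the per-event cost and the fraction of slow lines are assumptions chosen only for illustration.

```python
def throughput_per_second(workers: int, normal_ms: float,
                          slow_fraction: float, timeout_ms: float) -> float:
    """Events/s if slow events each block a worker for the full timeout."""
    mean_ms = (1 - slow_fraction) * normal_ms + slow_fraction * timeout_ms
    return workers * 1000.0 / mean_ms


# Assumed workload: 0.2 ms per normal event, 1 in 1000 events is pathological.
default_timeout = throughput_per_second(4, 0.2, 0.001, 30_000)
short_timeout = throughput_per_second(4, 0.2, 0.001, 300)
print(round(default_timeout), round(short_timeout))
```

Under these assumptions the same four workers go from roughly 130 events/s to roughly 8,000 events/s. The trade-off is that events hitting the timeout are not parsed, so they need to be handled separately downstream.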
Problems & Solutions
Logstash pipeline & batch
Logstash pipeline & batch
Too many topics to ingest
• The number of workers and CPU cores
  • Same as the number of CPU cores, or a little more
• How many messages to fetch each time
  • Default value is 125, new value is 1000
• How long to wait for an undersized batch
https://www.elastic.co/guide/en/logstash/6.8/logstash-settings-file.html#logstash-settings-file
- pipeline.id: main
  path.config: "/usr/share/logstash/pipeline"
  pipeline.workers: 4
  pipeline.batch.size: 1000
  pipeline.batch.delay: 50

logstash configuration - logstash.yaml
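A quick way to reason about these settings is the number of events that can be in flight at once, which is the worker count multiplied by the batch size. The sketch below compares the default batch size with the new one. The "before" worker count is assumed to be the same 4 as in the configuration above, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipelineSettings:
    workers: int          # pipeline.workers
    batch_size: int       # pipeline.batch.size
    batch_delay_ms: int   # pipeline.batch.delay, in milliseconds

    def max_in_flight(self) -> int:
        """Upper bound on events held in worker batches at one time."""
        return self.workers * self.batch_size


before = PipelineSettings(workers=4, batch_size=125, batch_delay_ms=50)
after = PipelineSettings(workers=4, batch_size=1000, batch_delay_ms=50)
print(before.max_in_flight(), after.max_in_flight())
```

Larger batches usually mean fewer, bigger bulk requests to Elasticsearch, but they also hold more events in memory, so the JVM heap may need to grow with the batch size.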
Problems & Solutions
Kafka Partitions
Kafka Partitions
Unbalanced input messages. It’s natural.
Order Service
Auth Service
Inventory Service
Order Topic
Inventory Topic
Auth Topic
Fewer log messages
Heavy log messages
Same amount of log ingestion per topic
High consumer-lag possibility
Increase the number of partitions
Kafka Partitions
Wait. What are partitions?
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
Topic with one partition
Writes
Ingest
Partition 0
Kafka Partitions
Wait. What are partitions?
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
Topic with multiple partitions
Writes
Partition 0
Partition 1
Partition 2
Ingest
Kafka Partitions
Wait. What are partitions?
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
#!/bin/bash

## get topics
ZOOKEEPER=z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181
bin/kafka-topics.sh --list --zookeeper "$ZOOKEEPER" > topiclist.txt

## increase partitions
while read -r line; do
  echo "$line"
  bin/kafka-topics.sh --zookeeper "$ZOOKEEPER" --alter --topic "$line" --partitions 3
  sleep 1
done < topiclist.txt

• Increase partitions of all existing topics
...
default.replication.factor=2
num.partitions=3
log.retention.hours=48
delete.topic.enable=true
...
• Increase partitions in the Kafka default settings (this has no effect on existing topics)
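Before altering every topic, it can help to build the list of commands first and review it as a dry run. The sketch below produces the same `kafka-topics.sh` invocations as the shell script, but only for topics that are below the target partition count. Skipping topics whose names start with `__` is an extra precaution assumed here and is not part of the original script. The current partition counts, the ZooKeeper address and the function name are hypothetical.

```python
from typing import Dict, List


def plan_partition_increase(current: Dict[str, int], target: int,
                            zookeeper: str) -> List[List[str]]:
    """Return one kafka-topics.sh command per topic that needs more partitions."""
    commands = []
    for topic, partitions in sorted(current.items()):
        if topic.startswith("__"):
            continue  # assumption: leave internal topics untouched
        if partitions >= target:
            continue  # a partition count can only be increased, never reduced
        commands.append([
            "bin/kafka-topics.sh", "--zookeeper", zookeeper,
            "--alter", "--topic", topic, "--partitions", str(target),
        ])
    return commands


# Hypothetical current partition counts per topic.
current = {"order": 1, "auth": 3, "inventory": 1, "__consumer_offsets": 50}
plan = plan_partition_increase(current, 3, "your-zookeeper:2181")
for command in plan:
    print(" ".join(command))
```

Printing the plan before running it makes it easier to spot topics that should not be touched.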
Kafka Partitions
Partitions / Consumers
Topic with multiple partitions
Writes
Partition 0
Partition 1
Partition 2

input {
  kafka {
    ...
    bootstrap_servers => "..."
    # regular expression; a bare "*" is not a valid pattern
    topics_pattern => ".*"
    consumer_threads => 1
    ...
  }
}

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
Sequential ingest
Ingest
Kafka Partitions
Partitions / Consumers
Topic with multiple partitions
Writes
Partition 0
Partition 1
Partition 2

input {
  kafka {
    ...
    bootstrap_servers => "..."
    # regular expression; a bare "*" is not a valid pattern
    topics_pattern => ".*"
    consumer_threads => 3
    ...
  }
}

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
Parallel ingest
Ingest
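The relationship between partitions and consumer threads can be sketched with a simple round-robin assignment. This is a deliberately simplified model and not Kafka's actual assignor: each partition is read by exactly one thread of the consumer group, so one thread reads all three partitions in turn, three threads read one partition each, and any thread beyond the partition count stays idle. The function name is hypothetical.

```python
from typing import Dict, List


def assign_round_robin(partitions: List[int],
                       consumer_threads: int) -> Dict[int, List[int]]:
    """Distribute partitions over consumer threads, one owner per partition."""
    assignment: Dict[int, List[int]] = {t: [] for t in range(consumer_threads)}
    for index, partition in enumerate(partitions):
        assignment[index % consumer_threads].append(partition)
    return assignment


one_thread = assign_round_robin([0, 1, 2], 1)
three_threads = assign_round_robin([0, 1, 2], 3)
five_threads = assign_round_robin([0, 1, 2], 5)
idle_threads = [t for t, owned in five_threads.items() if not owned]
print(one_thread, three_threads, idle_threads)
```

This is why increasing partitions and increasing consumers go together: extra threads only help when there are enough partitions to give them.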
Kafka Partitions
Partitions / Consumers
Topic with multiple partitions
Writes
Partition 0
Partition 1
Partition 2

input {
  kafka {
    ...
    bootstrap_servers => "..."
    # regular expression; a bare "*" is not a valid pattern
    topics_pattern => ".*"
    consumer_threads => 1
    ...
  }
}

https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
Ingest
Live demo
My architecture
ELK Stack (Elastic Stack)
Elasticsearch
MSK (Kafka)
App / Disk volume / Filebeat (five instances)
Logstash
Improve partition settings
S3
Improve grok parser


Increase consumers
Wrap-up
Wrap-up
• First of all, measure it!

• Log Forwarder (in my case Logstash)

• Improve parsing performance (grok)

• Increase number of forwarders

• Message Stream (in my case Kafka)

• Partitioning

POWER SYSTEMS-1 Complete notes examples
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teams
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIES
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 

How to improve ELK log pipeline performance

  • 1. 2021/11/14 Hojin Shim / Site Reliability Engineer ELK Stack - Improving Log Processing Speed. With requests averaging about 1 million per minute, logs started to fall behind.
  • 3. Logging Patterns Well-known patterns • Remote logging • File Logging & Cron backup • Logging pipeline without stream • Logging pipeline with stream
  • 4. Logging Patterns Remote Logging: the App logs over the network to somewhere else (DB, storage, etc.). Ex) Logback / log4j in Java • Low risk of losing records • High risk of lag / throughput
  • 5. Logging Patterns File Logging & Cron Backup App PutObject S3 • High risk of losing records • It depends on deployment patterns • Difficult to analyse • It’s simple Cron Disk volume
  • 6. Logging Patterns Logging Pipeline Patterns (w/o stream) App • Risk of high throughput • Risk of losing records Forwarder (pre-processor) Disk volume Forwarder (post-processor) Search Engine
  • 7. Logging Patterns Logging Pipeline Patterns (w/ stream) App • Low risk of high throughput • Low risk of losing records • High cost Forwarder (pre-processor) Disk volume Forwarder (post-processor) Search Engine Stream
  • 8. Logging Patterns ELK Stack (Elastic Stack) App • Low risk of high throughput & losing records • High cost • Requires deep & wide technical knowledge Disk volume Elasticsearch MSK (Kafka) Filebeat Logstash Kibana & $$$ $$$
  • 12. What is the problem? So many things could be a reason • Filebeat I/O problem • Kafka performance problem • Logstash slow ingestion / processing problem • Elasticsearch performance problem • etc
  • 14. Measurement What to measure?
 • Filebeat: basic system metrics, etc.
 • MSK (Kafka): basic system metrics, burst balance, bandwidth throttling, lag per topic, etc.
 • Logstash: basic system metrics, number of events processed, etc.
 • Elasticsearch: basic system metrics, indexing rate / latency, etc.
  • 15. Measurement How to measure? (Based on my experience)
 • Filebeat: Telegraf, InfluxDB, Grafana
 • MSK (Kafka): CloudWatch, Burrow / Prometheus, Elasticsearch, Grafana
 • Logstash: Telegraf, Elasticsearch, Grafana
 • Elasticsearch: CloudWatch, Grafana
  • 16. Measurement How to measure? (Based on my experience)
 • Filebeat: Telegraf, InfluxDB, Grafana
 • MSK (Kafka): CloudWatch, Burrow / Prometheus, Elasticsearch, Grafana
 • Logstash: Telegraf, Elasticsearch, Grafana
 • Elasticsearch: CloudWatch, Grafana
 • Highlighted: Consumer lag monitoring, Logstash processing rate monitoring
  • 19. Measurement Consumer-lag measurement • Kubernetes-friendly way • Open Monitoring with Prometheus
 • Way that is always available (demo in this session) • Burrow / Telegraf
  • 20. Measurement Burrow / Telegraf • Burrow • Open source, developed by LinkedIn • Apache Kafka monitoring tool • HTTP endpoint for information
 • Telegraf • Open source, developed by InfluxData • General-purpose metrics collection • Plugin system
  • 21. Measurement Consumer-lag measurement with Burrow MSK (Kafka) Burrow / Telegraf Elasticsearch Grafana Burrow Telegraf
  • 22. Measurement Burrow config code snippet. Burrow configuration - /etc/burrow/burrow.toml
     ...
     # Your zookeeper endpoint
     [zookeeper]
     servers=[ "z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181", "z-2.elk.kafka.ap-northeast-2.amazonaws.com:2181", "z-1.product-elk-msk-abc.kafka.ap-northeast-2.amazonaws.com:2181" ]
     timeout=6
     root-path="/burrow"

     [consumer.product-elk]
     class-name="kafka"
     cluster="product-elk"
     # Your bootstrap server endpoint
     servers=[ "b-2.elk.kafka.ap-northeast-2.amazonaws.com:9094", "b-1.elk.kafka.ap-northeast-2.amazonaws.com:9094" ]
     client-profile="your_profile"
     group-denylist="^(some-group-|python-kafka-consumer-|quick-).*$"
     group-allowlist=""

     [cluster.product-elk]
     class-name="kafka"
     servers=[ "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094", "b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094" ]
     client-profile="test"
     topic-refresh=60
     offset-refresh=30

     # If you use clients / brokers encryption
     [tls.msk-mTLS]
     cafile="/etc/burrow/truststore.pem"
     noverify=true
     ...
  • 23. Measurement Telegraf config code snippet. Telegraf configuration - /etc/telegraf/telegraf.d/burrow.conf
     [[inputs.burrow]]
       servers = ["https://your.burrow-endpoint.com"]
       topics_exclude = [ "__consumer_offsets" ]
       groups_exclude = ["console-*"]
       # Add a tag if you also collect other metrics
       [inputs.burrow.tags]
         burrow = "burrow"

     [[outputs.elasticsearch]]
       urls = [ "http://your-elasticsearch-endpoint:9200" ]
       timeout = "5s"
       enable_sniffer = false
       health_check_interval = "10s"
       index_name = "burrow-%Y.%m.%d"
       manage_template = true
       template_name = "telegraf-burrow"
       # Filter metrics by tag
       [outputs.elasticsearch.tagpass]
         burrow = ["burrow"]
  • 24. Measurement Data from burrow index [screenshot: topic name, partition, lag information]
  • 25. Measurement Visualization with Grafana [chart: lag per topic]
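
The per-topic lag panel described above boils down to summing partition lag by topic. Below is a minimal Python sketch of that aggregation, assuming records shaped like the topic / partition / lag fields mentioned on the slides. The field names and sample values are hypothetical, not taken from the actual burrow index.

```python
from collections import defaultdict

# Hypothetical per-partition samples; the field names are assumptions that
# mirror the topic / partition / lag information shown on the slides.
samples = [
    {"topic": "order", "partition": 0, "lag": 1200},
    {"topic": "order", "partition": 1, "lag": 800},
    {"topic": "auth", "partition": 0, "lag": 15},
]


def lag_per_topic(records):
    """Sum the lag of all partitions belonging to each topic."""
    totals = defaultdict(int)
    for record in records:
        totals[record["topic"]] += record["lag"]
    return dict(totals)


def worst_topics(records, limit=3):
    """Return (topic, total lag) pairs, largest lag first."""
    totals = lag_per_topic(records)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


print(worst_topics(samples))
```

Ranking topics by total lag is one simple way to decide which topics need more partitions or consumers first.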
  • 27. Measurement Visualization with Timelion. Add logstash metric. logstash pipeline configuration - ./logstash/pipeline/logstash.conf
     input {
       kafka {
         bootstrap_servers => "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094,b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094"
         topics_pattern => "*"
         consumer_threads => 1
         codec => "json"
         decorate_events => true
         group_id => "logstash"
         security_protocol => "SSL"
         ssl_truststore_location => "/logstash/kafka.client.truststore.jks"
         enable_auto_commit => "true"
       }
     }
     ...
     filter {
       ...
       # Emit a periodic metric event that counts processed events
       metrics {
         meter => "events"
         add_tag => "metric"
         add_field => { "lsname" => "some-logstash" }
       }
     }
     ...
     output {
       ...
       # Send metric events to their own index (the preceding branch is elided on the slide)
       else if "metric" in [tags] {
         elasticsearch {
           hosts => ["eskibana.prd.in.musinsa.com:9200"]
           index => "logstash-metric-%{+yyyy.MM.dd}"
         }
       }
       ...
     }
  • 28. Measurement Data from burrow index Some Logstash Name Event processing rate 1m
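
To read an event processing rate against the incoming load, it helps to put both in the same unit. The Python sketch below is an illustration only: it converts the roughly 1 million requests per minute from the title slide into events per second and compares it with a processing rate. The function names and the 12,000 events/s figure are hypothetical.

```python
def events_per_second(count_start, count_end, seconds):
    """Average rate between two samples of an ever-increasing event counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds


def is_falling_behind(produced_per_sec, processed_per_sec):
    """Lag grows whenever events are produced faster than they are processed."""
    return processed_per_sec < produced_per_sec


# About 1 million requests per minute, as stated on the title slide.
incoming = events_per_second(0, 1_000_000, 60)

# Hypothetical Logstash processing rate, for illustration only.
processed = 12_000.0

print(round(incoming), is_falling_behind(incoming, processed))
```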
  • 30. Problems & Solutions Logstash grok performance
  • 31. Logstash filter performance grok grok grok! • Some log messages can cause parsing problems • Special characters • Long log messages • Etc. Example: http://some-domain/app/product/goodsview_stats/1474978/0? utm_source=naver_jisicshopping&utm_medium=sh&source=NVSH&NaPm=ct%3Dkvyxfobc%7Cci%3Dd4151183d55ce2828c56f84eb392eab7338b2026%7Ctr%3Dslct%7Csn%3D204973%7Chk ab6de6182e50b01b182e15ae740bcb84ce&menu=view&3Dcee524ab6de6182e50b01b182e15ae740bcb84ce&q=b3Dcee524ab6de6182e50b01b182e15ae740bcb84ce.....................
  • 32. Logstash filter performance grok grok grok!
     [2021-09-03T17:26:25,923][WARN ][logstash.filters.grok ][main][8c1ed634e6ffe7026b0a684399b6a4893634d376554d997095836bd11d71a1c7] Timeout executing grok '%{IPORHOST:[nginx][access][remote_ip]} ......................'
     https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-timeout_millis
  • 33. Logstash filter performance grok grok grok! Add a short grok parsing timeout. logstash pipeline configuration - ./logstash/pipeline/logstash.conf
     ...
     filter {
       if [event][dataset] == "nginx.access" {
         grok {
           match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - ................"] }
           remove_field => "message"
           timeout_millis => 300
         }
         ...
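
The point of a short `timeout_millis` is to bound how long one problematic message can occupy a worker. Python's `re` module has no per-match timeout, so the sketch below illustrates the same goal with a different, simpler guard: an anchored pattern plus a length cap applied before matching. The pattern, the `MAX_MESSAGE_LENGTH` value, and the function name are all assumptions for illustration. This is not the grok pattern from the slides.

```python
import re

# Simplified, anchored stand-in for an access-log pattern (hypothetical; the
# real grok pattern is elided on the slides).
ACCESS_LINE = re.compile(
    r'^(?P<remote_ip>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)"'
)

# Hypothetical cap; messages longer than this are not parsed at all.
MAX_MESSAGE_LENGTH = 8192


def parse_access_line(message):
    """Return (fields, error). Oversized or non-matching lines fail fast."""
    if len(message) > MAX_MESSAGE_LENGTH:
        return None, "too_long"
    match = ACCESS_LINE.match(message)
    if match is None:
        return None, "no_match"
    return match.groupdict(), None


line = '10.0.0.1 - - [03/Sep/2021:17:26:25 +0900] "GET /app HTTP/1.1" 200 12'
print(parse_access_line(line))
print(parse_access_line("x" * 9000))
```

Anchoring the pattern with `^` lets a non-matching line fail at the first position instead of being retried at every offset, which matters most for the long messages the slides call out.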
  • 34. Problems & Solutions Logstash pipeline & batch
  • 35. Logstash pipeline & batch Too many topics to ingest • The number of workers and CPU cores • How many messages to fetch each time • How long to wait for an undersized batch https://www.elastic.co/guide/en/logstash/6.8/logstash-settings-file.html#logstash-settings-file
  • 36. Logstash pipeline & batch Too many topics to ingest • The number of workers and CPU cores • Same as the number of CPU cores, or slightly more • How many messages to fetch each time • Default value is 125, new value is 1000 • How long to wait for an undersized batch https://www.elastic.co/guide/en/logstash/6.8/logstash-settings-file.html#logstash-settings-file
     logstash configuration - logstash.yaml
     - pipeline.id: main
       path.config: "/usr/share/logstash/pipeline"
       pipeline.workers: 4
       pipeline.batch.size: 1000
       pipeline.batch.delay: 50
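
Raising the batch size also raises the number of events held in memory at once, which is roughly workers times batch size. The Python sketch below works through that arithmetic for the values on this slide. The average event size is a hypothetical figure used only to show the memory trade-off.

```python
def in_flight_events(workers, batch_size):
    """Upper bound on events held by pipeline workers at any moment."""
    return workers * batch_size


def approx_batch_memory_mb(workers, batch_size, avg_event_bytes):
    """Rough memory needed for in-flight events, in MiB."""
    return in_flight_events(workers, batch_size) * avg_event_bytes / (1024 * 1024)


# Hypothetical average event size; measure your own events before relying on it.
AVG_EVENT_BYTES = 2048

before = in_flight_events(workers=4, batch_size=125)   # default batch size
after = in_flight_events(workers=4, batch_size=1000)   # value from the slide

print(before, after)
print(approx_batch_memory_mb(4, 1000, AVG_EVENT_BYTES))
```

Larger batches generally mean fewer, bigger bulk requests to Elasticsearch, at the cost of more heap per worker, so the JVM heap size is worth checking after a change like this.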
  • 38. Kafka Partitions Unbalanced input messages. It’s natural. Services: Order Service, Auth Service, Inventory Service. Topics: Order Topic, Inventory Topic, Auth Topic. Some topics receive few log messages, others receive heavy log traffic. With the same amount of log ingestion per topic, heavy topics have a high possibility of consumer lag. Solution: increase the number of partitions.
  • 39. Kafka Partitions Wait. What are partitions? https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8 Topic with one partition Writes Ingest Partition 0
  • 40. Kafka Partitions Wait. What are partitions? https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8 Topic with multiple partitions Writes Partition 0 Partition 1 Partition 2 Ingest
  • 41. Kafka Partitions Wait. What are partitions? https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
 • Increase partitions of all existing topics
     #!/bin/bash
     ## get topics
     ZOOKEEPER=z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181
     bin/kafka-topics.sh --list --zookeeper $ZOOKEEPER > topiclist.txt
     ## increase partitions
     while read line; do
       echo "$line"
       bin/kafka-topics.sh --zookeeper $ZOOKEEPER --alter --topic $line --partitions 3
       sleep 1;
     done < topiclist.txt
 • Increase partitions in the Kafka default settings (this has no effect on existing topics)
     ...
     default.replication.factor=2
     num.partitions=3
     log.retention.hours=48
     delete.topic.enable=true
     ...
  • 42. Kafka Partitions Partitions / Consumers Topic with multiple partitions Writes Partition 0 Partition 1 Partition 2 Sequential ingest https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
     input {
       kafka {
         ...
         bootstrap_servers => "..."
         topics_pattern => "*"
         consumer_threads => 1
         ...
       }
     }
  • 43. Kafka Partitions Partitions / Consumers Topic with multiple partitions Writes Partition 0 Partition 1 Partition 2 Parallel ingest https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
     input {
       kafka {
         ...
         bootstrap_servers => "..."
         topics_pattern => "*"
         consumer_threads => 3
         ...
       }
     }
  • 44. Kafka Partitions Partitions / Consumers Topic with multiple partitions Writes Partition 0 Partition 1 Partition 2 Ingest https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
     input {
       kafka {
         ...
         bootstrap_servers => "..."
         topics_pattern => "*"
         consumer_threads => 1
         ...
       }
     }
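
Within one consumer group, each partition is read by at most one consumer, which is why the partition count caps useful parallelism. The Python sketch below illustrates this with a simple round-robin assignment. It is a simplification for illustration. Kafka's real assignors are more involved, and the function name is hypothetical.

```python
def assign_partitions(partitions, consumer_threads):
    """Spread partitions over consumer threads, round-robin (simplified)."""
    if consumer_threads < 1:
        raise ValueError("need at least one consumer thread")
    assignment = {thread: [] for thread in range(consumer_threads)}
    for index, partition in enumerate(partitions):
        assignment[index % consumer_threads].append(partition)
    return assignment


partitions = [0, 1, 2]

# One thread reads all three partitions in turn (sequential ingest).
print(assign_partitions(partitions, 1))

# Three threads read one partition each (parallel ingest).
print(assign_partitions(partitions, 3))

# A fourth thread has nothing to read: threads beyond the partition count idle.
print(assign_partitions(partitions, 4))
```

In this simplified model, raising `consumer_threads` only helps up to the number of partitions, so the two settings are best tuned together.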
  • 46. My architecture ELK Stack (Elastic Stack) Elasticsearch MSK (Kafka) A Di F Logstash A Di F A Di F A Di F A Di F Improve partition settings S3 Improve grok parser Increase consumers
  • 48. Wrap-up • First of all, measure it!
 • Log Forwarder (in my case Logstash) • Improve parsing performance (grok) • Increase number of forwarders
 • Message Stream (in my case Kafka) • Partitioning