How to improve ELK log pipeline performance

2021/11/14
Hojin Shim / Site Reliability Engineer
ELK Stack - Improving log processing speed
With requests averaging about 1 million per minute, logs started to back up.

Various Logging Pipeline Architecture Patterns

Logging Patterns
Well-known patterns
• Remote logging

• File Logging & Cron backup

• Logging pipeline without stream

• Logging pipeline with stream

Logging Patterns
Remote Logging
Diagram: App → (logging over network) → Somewhere (DB, storage, etc.)
Ex) Logback / Log4j in Java
• Low risk of losing records

• High risk of lag / throughput

Logging Patterns
File Logging & Cron Backup
Diagram: App → Disk volume → Cron (PutObject) → S3
• High risk of losing records

• It depends on deployment patterns

• Difficult to analyse

• It's simple

Logging Patterns
Logging Pipeline Patterns (w/o stream)
Diagram: App → Disk volume → Forwarder (pre-processor) → Forwarder (post-processor) → Search Engine
• Risk under high throughput

• Risk of losing records

Logging Patterns
Logging Pipeline Patterns (w/ stream)
Diagram: App → Disk volume → Forwarder (pre-processor) → Stream → Forwarder (post-processor) → Search Engine
• Low risk under high throughput

• Low risk of losing records

• High cost

Logging Patterns
ELK Stack (Elastic Stack)
Diagram: App → Disk volume → Filebeat → MSK (Kafka) → Logstash → Elasticsearch & Kibana
• Low risk under high throughput & low risk of losing records

• High cost ($$$)

• Requires deep & wide technical knowledge

Increase logging
Diagram: Requests → App / Disk volume / Filebeat (×3) → MSK (Kafka) → Logstash → Elasticsearch

What is the problem?
Many things could be the cause
• Filebeat I/O problem

• Kafka performance problem

• Logstash slow ingestion / processing problem

• Elasticsearch performance problem

• etc

Measurement
What to measure?
Filebeat
• Basic system metrics
• Etc.

MSK (Kafka)
• Basic system metrics
• Burst balance
• Bandwidth throttling
• Lag per topic
• Etc.

Logstash
• Basic system metrics
• Number of events processed
• Etc.

Elasticsearch
• Basic system metrics
• Indexing rate / latency
• Etc.

Measurement
How to measure? (Based on my experience)
Filebeat
• Telegraf
• InfluxDB
• Grafana

MSK (Kafka)
• CloudWatch
• Burrow / Prometheus
• Elasticsearch
• Grafana

Logstash
• Telegraf
• Elasticsearch
• Grafana

Elasticsearch
• CloudWatch
• Grafana

Measurement
How to measure? (Based on my experience)
Same tools as on the previous slide, with two items highlighted:
• MSK (Kafka): consumer lag monitoring
• Logstash: processing rate monitoring

Measurement
Consumer-lag
https://www.lightbend.com/blog/monitor-kafka-consumer-group-latency-with-kafka-lag-exporter

Measurement
Consumer-lag measurement
• Kubernetes-friendly way

  • Open Monitoring with Prometheus

• A way that is always available (demo in this session)

  • Burrow / Telegraf
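Whichever tool collects it, consumer lag is the per-partition gap between the newest offset written to the log and the offset the consumer group has committed. A minimal sketch of that arithmetic follows; the topic name and every offset are made-up values, and the helper `consumer_lag` is hypothetical rather than part of any Kafka client.

```python
from typing import Dict, Tuple

# (topic, partition) -> offset
Offsets = Dict[Tuple[str, int], int]


def consumer_lag(log_end: Offsets, committed: Offsets) -> Offsets:
    """Per-partition lag: newest offset in the log minus the committed offset."""
    lag = {}
    for tp, end in log_end.items():
        # Simplification: a partition without a committed offset counts as unread from 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


# Hypothetical offsets for one topic with three partitions.
log_end = {("nginx-access", 0): 1200, ("nginx-access", 1): 1500, ("nginx-access", 2): 900}
committed = {("nginx-access", 0): 1200, ("nginx-access", 1): 1100, ("nginx-access", 2): 850}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag that keeps growing over time means the consumers (here, Logstash) are reading more slowly than the producers are writing.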

Measurement
Burrow / Telegraf
• Burrow

  • Open source, developed by LinkedIn

  • Apache Kafka monitoring tool

  • HTTP endpoint for information

• Telegraf

  • Open source, developed by InfluxData

  • General-purpose metrics collection agent

  • Plugin system
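To make the "HTTP endpoint" point concrete, here is a rough sketch of reading a consumer group's lag from Burrow's JSON output. The URL layout and the field names (`status`, `totallag`, `partitions`, `current_lag`) are assumptions based on Burrow's v3 API and should be checked against the version you run; the sample response, group name and topic name are invented, and no network call is made.

```python
import json
from typing import Any, Dict


def lag_url(base_url: str, cluster: str, group: str) -> str:
    """Build the consumer-group lag URL (path layout assumed from Burrow's v3 API)."""
    return f"{base_url.rstrip('/')}/v3/kafka/{cluster}/consumer/{group}/lag"


def summarize(body: str) -> Dict[str, Any]:
    """Pull the group state, total lag and per-partition lag out of a response body."""
    status = json.loads(body)["status"]
    return {
        "group": status["group"],
        "state": status["status"],
        "total_lag": status["totallag"],
        "per_partition": {
            (p["topic"], p["partition"]): p["current_lag"] for p in status["partitions"]
        },
    }


# Invented response, shaped the way this sketch assumes Burrow answers.
SAMPLE_RESPONSE = """
{
  "error": false,
  "message": "consumer status returned",
  "status": {
    "cluster": "product-elk",
    "group": "logstash",
    "status": "WARN",
    "totallag": 450,
    "partitions": [
      {"topic": "nginx-access", "partition": 1, "status": "WARN", "current_lag": 400},
      {"topic": "nginx-access", "partition": 2, "status": "OK", "current_lag": 50}
    ]
  }
}
"""

url = lag_url("https://your.burrow-endpoint.com/", "product-elk", "logstash")
summary = summarize(SAMPLE_RESPONSE)
worst = max(summary["per_partition"], key=summary["per_partition"].get)
print(url, summary["state"], summary["total_lag"], worst)
```

In the setup described in this talk the same data is collected by Telegraf's Burrow input rather than by hand-written code.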

Measurement
Consumer-lag measurement with Burrow
Diagram: MSK (Kafka) → Burrow → Telegraf → Elasticsearch → Grafana

Measurement
Burrow config code snippet
Burrow configuration - /etc/burrow/burrow.toml
...

# Your ZooKeeper endpoint
[zookeeper]
servers=[ "z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181","z-2.elk.kafka.ap-northeast-2.amazonaws.com:2181","z-1.product-elk-msk-abc.kafka.ap-northeast-2.amazonaws.com:2181" ]
timeout=6
root-path="/burrow"

[consumer.product-elk]
class-name="kafka"
cluster="product-elk"
# Your bootstrap server endpoint
servers=[ "b-2.elk.kafka.ap-northeast-2.amazonaws.com:9094","b-1.elk.kafka.ap-northeast-2.amazonaws.com:9094" ]
client-profile="your_profile"
group-denylist="^(some-group-|python-kafka-consumer-|quick-).*$"
group-allowlist=""

[cluster.product-elk]
class-name="kafka"
# Your bootstrap server endpoint
servers=[ "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094","b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094" ]
client-profile="test"
topic-refresh=60
offset-refresh=30

# If you use client / broker encryption
[tls.msk-mTLS]
cafile="/etc/burrow/truststore.pem"
noverify=true

...

Measurement
Telegraf config code snippet
Telegraf configuration - /etc/telegraf/telegraf.d/burrow.conf
[[inputs.burrow]]
  servers = ["https://your.burrow-endpoint.com"]
  topics_exclude = [ "__consumer_offsets" ]
  groups_exclude = ["console-*"]
  # Use a tag if you collect other metrics as well
  [inputs.burrow.tags]
    burrow = "burrow"

[[outputs.elasticsearch]]
  urls = [ "http://your-elasticsearch-endpoint:9200" ]
  timeout = "5s"
  enable_sniffer = false
  health_check_interval = "10s"
  index_name = "burrow-%Y.%m.%d"
  manage_template = true
  template_name = "telegraf-burrow"
  # Filter metrics by tag
  [outputs.elasticsearch.tagpass]
    burrow = ["burrow"]
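The `index_name` value above uses date placeholders, so the metrics land in one Elasticsearch index per day. Python's `strftime` understands the same `%Y`, `%m` and `%d` codes, which makes it a convenient way to preview the resulting names. This is only an illustration; the helper `daily_index` is hypothetical and is not how Telegraf is implemented.

```python
from datetime import datetime, timedelta, timezone


def daily_index(pattern: str, when: datetime) -> str:
    """Expand %Y / %m / %d in an index name pattern for the given timestamp."""
    return when.strftime(pattern)


talk_day = datetime(2021, 11, 14, tzinfo=timezone.utc)
today_index = daily_index("burrow-%Y.%m.%d", talk_day)
next_index = daily_index("burrow-%Y.%m.%d", talk_day + timedelta(days=1))
print(today_index, next_index)
```

Daily indices keep each index small and make it simple to drop old lag metrics by deleting whole indices.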

Measurement
Data from burrow index
Screenshot: documents in the burrow index, showing the topic name, the partition, and the lag information

Measurement
Visualization with Grafana
Screenshot: Grafana panels showing the lag of some topics

Measurement
Logstash Processing Rate

Measurement
Visualization with Timelion
Logstash pipeline configuration - ./logstash/pipeline/logstash.conf
input {
  kafka {
    bootstrap_servers => "b-2.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094,b-1.elk.abc.kafka.ap-northeast-2.amazonaws.com:9094"
    # topics_pattern is a regular expression; ".*" matches every topic
    topics_pattern => ".*"
    consumer_threads => 1
    codec => "json"
    decorate_events => true
    group_id => "logstash"
    security_protocol => "SSL"
    ssl_truststore_location => "/logstash/kafka.client.truststore.jks"
    enable_auto_commit => "true"
  }
}
...
filter {
  ...
  # Add logstash metric
  metrics {
    meter => "events"
    add_tag => "metric"
    add_field => {
      "lsname" => "some-logstash"
    }
  }
}
...
output {
  ...
  else if "metric" in [tags] {
    elasticsearch {
      hosts => ["eskibana.prd.in.musinsa.com:9200"]
      index => "logstash-metric-%{+yyyy.MM.dd}"
    }
  }
  ...
}
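The `metrics` filter with a meter counts the events passing through the pipeline and periodically emits their count and rates. The sketch below shows the idea with a plain trailing-window average. It is a simplification under stated assumptions: Logstash reports moving averages rather than this exact calculation, and the class `WindowMeter` and the traffic numbers are invented for illustration.

```python
from collections import deque
from typing import Deque


class WindowMeter:
    """Counts events and reports the average rate over a trailing time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.count = 0                         # total events seen so far
        self._stamps: Deque[float] = deque()   # timestamps still inside the window

    def mark(self, now: float) -> None:
        """Record one event observed at time `now` (seconds)."""
        self.count += 1
        self._stamps.append(now)
        self._evict(now)

    def rate(self, now: float) -> float:
        """Events per second, averaged over the trailing window."""
        self._evict(now)
        return len(self._stamps) / self.window

    def _evict(self, now: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while self._stamps and self._stamps[0] <= now - self.window:
            self._stamps.popleft()


meter = WindowMeter(window_seconds=60.0)
for second in range(120):      # two minutes of simulated traffic
    for _ in range(100):       # 100 events in each second
        meter.mark(now=float(second))

print(meter.count, meter.rate(now=119.0))
```

Plotting such a rate per Logstash instance shows at a glance whether processing keeps up with the incoming volume.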

Measurement
Data from logstash-metric index
Screenshot: documents showing the Logstash name and the 1-minute event processing rate

Measurement
Visualization with Timelion

Problems & Solutions
Logstash grok performance

Logstash filter performance
grok grok grok!
• Some log messages can cause parsing problems

  • Special characters

  • Long log messages

  • Etc.
http://some-domain/app/product/goodsview_stats/1474978/0?
utm_source=naver_jisicshopping&utm_medium=sh&source=NVSH&NaPm=ct%3Dkvyxfobc%7Cci%3Dd4151183d55ce2828c56f84eb392eab7338b2026%7Ctr%3Dslct%7Csn%3D204973%7Chk
ab6de6182e50b01b182e15ae740bcb84ce&menu=view&3Dcee524ab6de6182e50b01b182e15ae740bcb84ce&q=b3Dcee524ab6de6182e50b01b182e15ae740bcb84ce.....................

grok grok grok!
[2021-09-03T17:26:25,923][WARN ][logstash.filters.grok ][main][8c1ed634e6ffe7026b0a684399b6a4893634d376554d997095836bd11d71a1c7] Timeout executing grok '%{IPORHOST:[nginx][access][remote_ip]} ......................'
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#plugins-filters-grok-timeout_millis

grok grok grok!
Logstash pipeline configuration - ./logstash/pipeline/logstash.conf
...
filter {
  if [event][dataset] == "nginx.access" {
    grok {
      match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - ................"] }
      remove_field => "message"
      # Add short grok parsing timeout
      timeout_millis => 300
    }
    ...

Problems & Solutions
Logstash pipeline & batch

Too many topics to ingest
• The number of workers and CPU cores

  • Same as the number of CPU cores, or slightly more

• How many messages to fetch each time

  • Default value is 125; new value is 1000

• How long to wait for an undersized batch
https://www.elastic.co/guide/en/logstash/6.8/logstash-settings-file.html#logstash-settings-file
Logstash configuration - logstash.yaml
- pipeline.id: main
  path.config: "/usr/share/logstash/pipeline"
  pipeline.workers: 4
  pipeline.batch.size: 1000
  pipeline.batch.delay: 50
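One consequence of these settings is worth working out: the number of events Logstash may hold in memory at once is roughly the number of workers multiplied by the batch size. A small calculation with the values from this slide (the function name is hypothetical):

```python
def inflight_events(workers: int, batch_size: int) -> int:
    """Upper bound on events being processed at once: workers x batch size."""
    return workers * batch_size


default_inflight = inflight_events(workers=4, batch_size=125)    # default batch size
tuned_inflight = inflight_events(workers=4, batch_size=1000)     # tuned batch size
print(default_inflight, tuned_inflight, tuned_inflight // default_inflight)
```

Raising the batch size from 125 to 1000 therefore means eight times as many events in flight, so the JVM heap may need to grow accordingly.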

Problems & Solutions
Kafka Partitions

Kafka Partitions
Unbalanced input messages. It's natural.
Diagram: Order Service → Order Topic, Auth Service → Auth Topic, Inventory Service → Inventory Topic. Some services write few log messages, others write heavy volumes.
• Same amount of log ingestion for each topic
• High possibility of consumer lag
• Increase the number of partitions

Kafka Partitions
Wait. What are partitions?
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
Topic with one partition
Diagram: Writes → Partition 0 → Ingest

Kafka Partitions
Topic with multiple partitions
Diagram: Writes → Partition 0, Partition 1, Partition 2 → Ingest
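How writes end up spread over several partitions can be sketched in a few lines. For a keyed message the producer hashes the key and takes the result modulo the partition count, so the same key always lands in the same partition. The sketch uses CRC32 purely as a stand-in hash (Kafka's Java client uses a different hash function), and the keys are invented.

```python
import zlib
from collections import Counter


def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition: hash(key) mod partition count."""
    return zlib.crc32(key) % num_partitions


# Hypothetical message keys, e.g. one per application host.
keys = [f"host-{i}".encode() for i in range(1000)]

one_partition = Counter(pick_partition(k, 1) for k in keys)
three_partitions = Counter(pick_partition(k, 3) for k in keys)

print(one_partition)      # everything goes to partition 0
print(three_partitions)   # the load is split across partitions 0, 1 and 2
```

With one partition there is a single ordered log to read from; with three, the same traffic is divided into three logs that can be read independently.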

Kafka Partitions
• Increase partitions of all existing topics
#!/bin/bash

## get topics
# Note: the --zookeeper option was removed in Kafka 3.0; newer versions use --bootstrap-server
ZOOKEEPER=z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181
bin/kafka-topics.sh --list --zookeeper "$ZOOKEEPER" > topiclist.txt

## increase partitions
while read -r line; do
  echo "$line"
  bin/kafka-topics.sh --zookeeper "$ZOOKEEPER" --alter --topic "$line" --partitions 3
  sleep 1
done < topiclist.txt

• Raise the default partition count in the Kafka broker settings (this has no effect on existing topics)
...
default.replication.factor=2
num.partitions=3
log.retention.hours=48
delete.topic.enable=true
...
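Before altering every topic it can help to work out which ones actually need the change, because Kafka only allows the partition count of a topic to be increased, never decreased. The sketch below is one possible way to plan that; the function, the topic names and the partition counts are hypothetical, and skipping topics that start with `__` is an assumption made here to leave Kafka's internal topics alone.

```python
from typing import Dict, List


def plan_partition_increase(current: Dict[str, int], target: int) -> List[str]:
    """Return the topics whose partition count should be raised to `target`."""
    plan = []
    for topic, partitions in sorted(current.items()):
        if topic.startswith("__"):
            continue  # assumption: leave internal topics such as __consumer_offsets alone
        if partitions < target:
            plan.append(topic)  # partition counts can only go up
    return plan


# Hypothetical current state: topic name -> number of partitions.
current = {"order": 1, "auth": 1, "inventory": 3, "__consumer_offsets": 50}
to_alter = plan_partition_increase(current, target=3)
print(to_alter)
```

Only the topics in the resulting list would then be passed to `kafka-topics.sh --alter`.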

Kafka Partitions
Partitions / Consumers
Diagram: Writes → Partition 0, Partition 1, Partition 2 → Ingest (sequential ingest)
input {
  kafka {
    ...
    bootstrap_servers => "..."
    consumer_threads => 1
    ...
  }
}
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html

Kafka Partitions
Diagram: Writes → Partition 0, Partition 1, Partition 2 → Ingest (parallel ingest)
input {
  kafka {
    ...
    consumer_threads => 3
    ...
  }
}

Kafka Partitions
Diagram: Writes → Partition 0, Partition 1, Partition 2 → Ingest
input {
  kafka {
    ...
    consumer_threads => 1
    ...
  }
}
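The relationship between partitions and consumer threads shown on these slides can be sketched as a simple assignment. Within one consumer group each partition is read by at most one consumer thread, so the partition count caps the useful parallelism. The round-robin rule below is a simplified stand-in for Kafka's real assignment strategies, and the function name is hypothetical.

```python
from typing import Dict, List


def assign(num_partitions: int, num_consumers: int) -> Dict[int, List[int]]:
    """Spread partitions over consumer threads, one owner per partition."""
    assignment: Dict[int, List[int]] = {c: [] for c in range(num_consumers)}
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


sequential = assign(num_partitions=3, num_consumers=1)   # one thread reads everything
parallel = assign(num_partitions=3, num_consumers=3)     # one partition per thread
oversized = assign(num_partitions=3, num_consumers=4)    # the fourth thread stays idle
print(sequential, parallel, oversized)
```

This is why increasing `consumer_threads` (or adding Logstash instances in the same group) only helps up to the number of partitions of the topic.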

My architecture
ELK Stack (Elastic Stack)
Diagram: App / Disk volume / Filebeat (×5) → MSK (Kafka) → Logstash → Elasticsearch; S3 also appears in the diagram
• Improve partition settings
• Improve grok parser
• Increase consumers

Wrap-up
• First of all, measure it!
• Log Forwarder (in my case Logstash)
  • Improve parsing performance (grok)
  • Increase the number of forwarders
• Message Stream (in my case Kafka)
  • Partitioning
