The document discusses improving log-processing speed in an ELK stack. Logs had begun to back up under an average load of around 1 million requests per minute. It compares several logging architectures and pipeline patterns, then recommends measuring each stage of the pipeline to find bottlenecks, improving Logstash grok parsing performance, increasing Kafka partitions to spread load more evenly, and scaling out Logstash instances to parallelize ingestion. These changes aim to reduce the risk of throughput bottlenecks, lost records, and ingestion lag in the logging pipeline.
4. Logging Patterns
Remote Logging
App → logging over the network → somewhere (DB, storage, etc.)
e.g. Logback / log4j in Java
• Low risk of losing records
• High risk of lag / throughput
5. Logging Patterns
File Logging & Cron Backup
App → Disk volume → Cron → PutObject → S3
• High risk of losing records
• Depends on the deployment pattern
• Difficult to analyse
• Simple
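A minimal sketch of the cron side of this pattern, assuming the AWS CLI is installed with credentials, logs are rotated to .gz files under a placeholder LOG_DIR, and the bucket name is a placeholder. It uses aws s3 sync so each run only copies files that are new or changed; s3_prefix, backup_logs and the logs/<host>/ key layout are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the "File Logging & Cron Backup" pattern.
# Assumptions: AWS CLI installed with credentials; LOG_DIR and BUCKET are
# placeholders; rotated log files end in .gz.
LOG_DIR="${LOG_DIR:-/var/log/app}"
BUCKET="${BUCKET:-my-log-bucket}"

# Destination prefix for one host: s3://<bucket>/logs/<host>/
s3_prefix() {
  local bucket="$1" host="$2"
  printf 's3://%s/logs/%s/\n' "$bucket" "$host"
}

# Upload rotated logs that are new or changed since the last run.
# "aws s3 sync" skips files that already match the copy in S3;
# --exclude '*' followed by --include '*.gz' limits it to .gz files.
backup_logs() {
  aws s3 sync "$LOG_DIR" "$(s3_prefix "$BUCKET" "$(hostname)")" \
    --exclude '*' --include '*.gz'
}
```

A crontab entry such as */10 * * * * running a script that calls backup_logs completes the picture. Anything written after the last successful run is lost if the host or container disappears before the next one, which is where the high risk of losing records comes from.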
6. Logging Patterns
Logging Pipeline Patterns (w/o stream)
App → Disk volume → Forwarder (pre-processor) → Forwarder (post-processor) → Search Engine
• Risk of throughput bottlenecks
• Risk of losing records
7. Logging Patterns
Logging Pipeline Patterns (w/ stream)
App → Disk volume → Forwarder (pre-processor) → Stream → Forwarder (post-processor) → Search Engine
• Low risk of throughput bottlenecks
• Low risk of losing records
• High cost
8. Logging Patterns
ELK Stack (Elastic Stack)
App → Disk volume → Filebeat → MSK (Kafka) → Logstash → Elasticsearch & Kibana
• Low risk of throughput bottlenecks and lost records
• High cost
• Requires deep and broad technical knowledge
12. What is the problem?
Many things could be the cause
• Filebeat I/O problem
• Kafka performance problem
• Logstash slow ingestion / processing problem
• Elasticsearch performance problem
• etc.
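One way to work through this list is to query each component directly. The sketch below assumes default local ports (Elasticsearch 9200, Logstash 9600, Filebeat 5066 with http.enabled: true set in filebeat.yml) and the Kafka CLI tools under bin/. The helper names are hypothetical; only the endpoints and CLI flags come from the respective tools.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers, one per suspect. Hosts and ports are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"   # Elasticsearch HTTP API
LS_URL="${LS_URL:-http://localhost:9600}"   # Logstash monitoring API
FB_URL="${FB_URL:-http://localhost:5066}"   # Filebeat HTTP endpoint (needs http.enabled: true)

# Filebeat: internal metrics such as harvester and output counters.
fb_stats() {
  curl -s "${FB_URL}/stats?pretty"
}

# Kafka: per-partition current offset, log-end offset and LAG of a consumer group.
# "logstash" is the default group_id of the Logstash kafka input plugin.
kafka_group_lag() {
  local bootstrap="$1" group="${2:-logstash}"
  bin/kafka-consumer-groups.sh --bootstrap-server "$bootstrap" --describe --group "$group"
}

# Logstash: per-pipeline event counts (in / filtered / out) and per-plugin durations,
# which show whether a filter such as grok dominates processing time.
ls_pipeline_stats() {
  curl -s "${LS_URL}/_node/stats/pipelines?pretty"
}

# Elasticsearch: cluster status, pending tasks and unassigned shards.
es_health() {
  curl -s "${ES_URL}/_cluster/health?pretty"
}
```

For example, kafka_group_lag b-1.example:9092 (a placeholder broker address) shows whether lag is building up in Kafka before the Logstash and Elasticsearch numbers are examined.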
20. Measurement
Burrow / Telegraf
• Burrow
  • Open source, developed by LinkedIn
  • Apache Kafka consumer-lag monitoring tool
  • Exposes lag information through an HTTP endpoint
• Telegraf
  • Open source, developed by InfluxData
  • General-purpose metrics collection agent
  • Plugin system (inputs and outputs)
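Because Burrow exposes consumer lag over HTTP, it can be polled with curl (Telegraf's burrow input plugin polls the same API). A small sketch, assuming Burrow listens on localhost:8000 and that the cluster and consumer-group names match whatever is configured in Burrow; burrow_lag_url, burrow_lag and burrow_clusters are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers around Burrow's v3 HTTP API.
# BURROW_URL, cluster and group names are placeholders.
BURROW_URL="${BURROW_URL:-http://localhost:8000}"

# List the Kafka clusters Burrow knows about.
burrow_clusters() {
  curl -s "${BURROW_URL}/v3/kafka"
}

# Endpoint returning lag for every partition a consumer group reads.
burrow_lag_url() {
  local cluster="$1" group="$2"
  printf '%s/v3/kafka/%s/consumer/%s/lag\n' "$BURROW_URL" "$cluster" "$group"
}

# Fetch the lag evaluation (JSON) for one consumer group.
burrow_lag() {
  curl -s "$(burrow_lag_url "$1" "$2")"
}
```

For example, burrow_lag msk logstash, where msk stands for the cluster name configured in Burrow and logstash for the consumer group.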
31. Logstash filter performance
grok grok grok!
• Some log messages can cause parsing problems
  • Special characters
  • Long log messages
  • etc.
Example of a long, heavily URL-encoded request:
http://some-domain/app/product/goodsview_stats/1474978/0?
utm_source=naver_jisicshopping&utm_medium=sh&source=NVSH&NaPm=ct%3Dkvyxfobc%7Cci%3Dd4151183d55ce2828c56f84eb392eab7338b2026%7Ctr%3Dslct%7Csn%3D204973%7Chk
ab6de6182e50b01b182e15ae740bcb84ce&menu=view&3Dcee524ab6de6182e50b01b182e15ae740bcb84ce&q=b3Dcee524ab6de6182e50b01b182e15ae740bcb84ce.....................
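Before tuning grok patterns, it can help to find which messages are unusually long, like the URL above. A minimal sketch, assuming a plain-text sample of raw log lines saved to a file; the long_lines name and the 2048-character default threshold are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "length line_number" for every line longer than
# LIMIT characters (default 2048), longest first.
# Usage: long_lines FILE [LIMIT]
long_lines() {
  local file="$1" limit="${2:-2048}"
  awk -v max="$limit" 'length($0) > max { print length($0), NR }' "$file" | sort -rn
}
```

For example, long_lines sample.log 4096 lists candidate messages to test against the grok pattern in isolation.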
36. Logstash pipeline & batch
Too many topics to ingest
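• The number of workers (pipeline.workers) and CPU cores
  • Default is the number of CPU cores; use the same or a little more
• How many messages to fetch each time (pipeline.batch.size)
  • Default is 125; new value is 1000
• How long to wait before dispatching an undersized batch (pipeline.batch.delay)
  • Default is 50 ms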
https://www.elastic.co/guide/en/logstash/6.8/logstash-settings-file.html#logstash-settings-file
- pipeline.id: main
  path.config: "/usr/share/logstash/pipeline"
  pipeline.workers: 4
  pipeline.batch.size: 1000
  pipeline.batch.delay: 50
Logstash configuration - pipelines.yml
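A sketch that renders a pipelines.yml entry like the one above, assuming nproc (GNU coreutils or BusyBox) is available for the core count. pipeline.workers defaults to the number of cores, matching the guidance above; pass a larger value explicitly to go a little beyond it. render_pipeline is a hypothetical helper, not part of Logstash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render one pipelines.yml entry.
# Usage: render_pipeline ID CONFIG_PATH [WORKERS] [BATCH_SIZE] [BATCH_DELAY_MS]
# WORKERS defaults to the CPU core count; batch size and delay default to the
# values used on the slide (1000 events, 50 ms).
render_pipeline() {
  local id="$1" path="$2"
  local workers="${3:-$(nproc)}" size="${4:-1000}" delay="${5:-50}"
  cat <<EOF
- pipeline.id: ${id}
  path.config: "${path}"
  pipeline.workers: ${workers}
  pipeline.batch.size: ${size}
  pipeline.batch.delay: ${delay}
EOF
}
```

For example, render_pipeline main /usr/share/logstash/pipeline 4 reproduces the configuration shown above.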
38. Kafka Partitions
Input messages are unbalanced across topics. It’s natural.
Order Service → Order Topic
Auth Service → Auth Topic
Inventory Service → Inventory Topic
• Some services produce few log messages, others produce heavy log volume
• Yet every topic gets the same ingestion capacity
• High possibility of consumer lag on the heavy topics
→ Increase the number of partitions
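A back-of-the-envelope sketch of why a heavy topic lags: the load on each partition (and on the consumer reading it) grows with topic volume and shrinks with the partition count. It assumes messages are spread evenly across partitions and uses the ~1 million requests per minute figure purely as an illustrative input; per_partition_rate is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical sizing helper: average messages per second each partition of one
# topic must absorb, assuming an even spread across partitions.
# Usage: per_partition_rate MESSAGES_PER_MINUTE PARTITIONS
per_partition_rate() {
  local per_minute="$1" partitions="$2"
  if [ "$partitions" -lt 1 ]; then
    echo "partitions must be >= 1" >&2
    return 1
  fi
  # Integer arithmetic: the result is rounded down.
  echo $(( per_minute / 60 / partitions ))
}
```

per_partition_rate 1000000 1 prints 16666 and per_partition_rate 1000000 3 prints 5555, so tripling the partitions cuts each consumer's share to roughly a third.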
39. Kafka Partitions
Wait. What are partitions?
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
Topic with one partition
Writes → Partition 0 → Ingest
40. Kafka Partitions
Topic with multiple partitions
Writes → Partition 0 / Partition 1 / Partition 2 → Ingest
41. Kafka Partitions
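#!/bin/bash
## get topics
ZOOKEEPER=z-3.elk.abc.kafka.ap-northeast-2.amazonaws.com:2181
bin/kafka-topics.sh --list --zookeeper "$ZOOKEEPER" > topiclist.txt

## increase partitions
## --alter can only increase the count; topics that already have 3 or more are rejected and left unchanged
while read -r line; do
  ## skip internal topics such as __consumer_offsets
  case "$line" in __*) continue ;; esac
  echo "$line"
  bin/kafka-topics.sh --zookeeper "$ZOOKEEPER" --alter --topic "$line" --partitions 3
  sleep 1
done < topiclist.txt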
• Increase partitions of all existing topics
...
default.replication.factor=2
num.partitions=3
log.retention.hours=48
delete.topic.enable=true
...
• Raise the default partition count in the Kafka broker settings (this has no effect on existing topics)
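Because --alter can only raise the partition count, it helps to check afterwards which topics are still below the target. A sketch that filters kafka-topics.sh --describe output; it assumes the per-topic summary line looks like Topic: <name> PartitionCount: <n> ... (with or without a space after each colon, which varies between Kafka versions), and under_partitioned is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "kafka-topics.sh --describe" output on stdin and
# print "topic partition_count" for topics with fewer than MIN partitions.
# Usage: ... --describe ... | under_partitioned MIN
under_partitioned() {
  local min="$1"
  # Normalise "Key: value" to "Key:value" so both output styles parse the same way.
  sed 's/: /:/g' | awk -v min="$min" '
    /PartitionCount:/ {
      topic = ""; count = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^Topic:/)          topic = substr($i, 7)
        if ($i ~ /^PartitionCount:/) count = substr($i, 16) + 0
      }
      if (count < min) print topic, count
    }'
}
```

For example, bin/kafka-topics.sh --describe --zookeeper "$ZOOKEEPER" | under_partitioned 3 prints nothing once every topic has at least three partitions.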
46. My architecture
ELK Stack (Elastic Stack)
Diagram: app nodes (A: App, Di: Disk volume, F: Filebeat) → MSK (Kafka) → Logstash → Elasticsearch; S3 also shown
• MSK (Kafka): improve partition settings
• Logstash: improve grok parser, increase consumers
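Adding partitions only helps if there are enough consumers to read them, and adding consumers only helps up to the partition count, because within one consumer group each partition is read by at most one consumer. A sketch of that arithmetic for a single topic, assuming every Logstash instance uses the same group_id and the same consumer_threads value in its kafka input; active_consumers and idle_consumers are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for one topic read by one consumer group.
# Total consumers = Logstash instances x consumer_threads per instance.
# Usage: active_consumers PARTITIONS INSTANCES THREADS
active_consumers() {
  local partitions="$1" instances="$2" threads="$3"
  local consumers=$(( instances * threads ))
  # A partition is assigned to at most one consumer in the group.
  if [ "$consumers" -lt "$partitions" ]; then
    echo "$consumers"
  else
    echo "$partitions"
  fi
}

# Consumers left without any partition (0 when partitions >= consumers).
idle_consumers() {
  local partitions="$1" instances="$2" threads="$3"
  local consumers=$(( instances * threads ))
  echo $(( consumers > partitions ? consumers - partitions : 0 ))
}
```

With 1 partition, 3 Logstash instances and 2 threads each, active_consumers 1 3 2 prints 1 and idle_consumers 1 3 2 prints 5; with 6 partitions, active_consumers 6 3 2 prints 6 and all of them get work.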
48. Wrap-up
• First of all, measure it!
• Log forwarder (in my case Logstash)
  • Improve parsing performance (grok)
  • Increase the number of forwarders
• Message stream (in my case Kafka)
  • Partitioning