Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019

IBM Event Streams / Apache Kafka
© 2019 IBM Corporation
Help, My Kafka’s Broken
Emma Humber
Kafka Summit SF 2019

Metrics alert
Applications stop processing
SLAs are missed
End users complain

Help, my Kafka’s broken
Prepare
Include resource names such as topics and hostnames as well as
routes between systems in a topology diagram.
Collect logs and store JMX metrics published by Kafka brokers,
clients, the JVM and the OS to avoid outages.
Make logs useful. Include time stamps, method and class names as
well as thread ids and connection handles.
Change one thing at a time.

Help, my Kafka’s broken
Review
Use logs to create a timeline of events. Consult your metrics.
Compare with a working system.
Collect the information you think you may need to get to root
cause before restarting.
If you can’t get to root cause, work out what extra data you need
for the next occurrence to bring you a step closer to understanding
the failure next time.
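
One way to build such a timeline is to merge log lines from several components in timestamp order. The sketch below assumes every line of interest starts with a bracketed timestamp in the form [yyyy-MM-dd HH:mm:ss,SSS]; the source labels and function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "[2019-09-12 13:44:02,633] INFO message ..."
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f") if m else None

def build_timeline(sources):
    """Merge {label: [lines]} into a list of (timestamp, label, line) sorted by time.

    Lines without a timestamp (for example stack trace continuations) are skipped.
    """
    events = []
    for label, lines in sources.items():
        for line in lines:
            ts = parse_ts(line)
            if ts is not None:
                events.append((ts, label, line.rstrip("\n")))
    return sorted(events, key=lambda e: e[0])
```

Clocks on different hosts may not agree, so ordering across machines is only as good as their time synchronization.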

Narrow down the problem
Messages on a topic
kafka-install/bin/kafka-console-consumer.sh --topic NAME --bootstrap-server HOSTNAME:PORT
Consumers connected
kafka-install/bin/kafka-consumer-groups.sh --bootstrap-server HOSTNAME:PORT --list
Producers connected
JMX : kafka.server -> BrokerTopicMetrics -> BytesInPerSec -> TOPICNAME

No, really, what changed?

Logs

Logs
Find a log4j.properties
kafka-install/config/log4j.properties
Edit output location
log4j.appender.kafkaAppender.File= mylog123.log
Change log level
log4j.rootLogger=INFO, stdout, kafkaAppender
log4j.logger.kafka=DEBUG
log4j.logger.org.apache.kafka=TRACE

Logs
[2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 with initial high watermark 0 (kafka.cluster.Replica)

Java

Hangs
Symptoms
Clients were connected and running but now processing has
stopped.
Logs may have stopped completely or a specific thread within
these may have stopped.
Suspect a deadlock.

Hangs
Actions
Collect javacores at intervals. Content and output location
depend on your JRE vendor and settings.
kill -3 <pid_of_process>
Look for threads that don’t change and deadlock alerts.
Find which threads are waiting for which resources, draw a
diagram and use analysis tools.
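
Spotting threads that do not change can be partly automated by comparing successive dumps. This is a minimal sketch that assumes HotSpot-style text dumps (a header line starting with the quoted thread name, then "at ..." frames, then a blank line); IBM javacores use a different layout and would need a different parser. The function names are hypothetical.

```python
def parse_threads(dump_text):
    """Map thread name -> tuple of stack frame lines for one dump."""
    threads, name = {}, None
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # Header line: the thread name is the first quoted string.
            name = line.split('"')[1]
            threads[name] = []
        elif name is not None and line.strip().startswith("at "):
            threads[name].append(line.strip())
        elif not line.strip():
            # A blank line ends the current thread block.
            name = None
    return {n: tuple(frames) for n, frames in threads.items()}

def unchanged_threads(dumps):
    """Names of threads whose non-empty stack is identical in every dump."""
    parsed = [parse_threads(d) for d in dumps]
    common = set(parsed[0]).intersection(*parsed[1:])
    return sorted(
        n for n in common
        if parsed[0][n] and all(p[n] == parsed[0][n] for p in parsed)
    )
```

An unchanged stack is only a hint: idle pool threads also sit in the same place, so the result is a shortlist to inspect rather than a verdict.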

Javacore

Memory
Memory problems are caused by excessive load or a memory leak.
Use analysis tools such as Health Center to help with
diagnosis based on a heap dump.
-XX:+HeapDumpOnOutOfMemoryError
In Kubernetes, containers that are taking up too much
memory will often be terminated.
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
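
The exit code follows the usual shell convention: a value above 128 means the process was killed by signal number (code - 128), so 137 corresponds to signal 9. A small sketch of that arithmetic, with a hypothetical function name and assuming a POSIX platform:

```python
import signal

def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {code - 128}"
    return f"exited with error {code}"
```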

Garbage collection
Kafka can be sensitive to garbage collection.
Unexplained delays in processing increase message latency, and gaps
appear between log timestamps.
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
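
Gaps between log timestamps can be found mechanically and then compared against the GC log. The sketch below assumes log lines start with a bracketed timestamp in the form [yyyy-MM-dd HH:mm:ss,SSS]; the function name and the 5 second default threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "[2019-09-12 13:44:02,633] INFO message ..."
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

def find_gaps(lines, threshold=timedelta(seconds=5)):
    """Return (previous_ts, next_ts, gap) for consecutive timestamps further apart than threshold."""
    gaps, prev = [], None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # continuation lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

A gap can also mean the process was simply idle, so each one should be checked against the GC log before blaming a pause.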

Monitoring

Monitoring
Metrics
Kafka and the JVM it runs in emit JMX metrics that show
point-in-time data for various statistics. OS-level resources can also be
monitored.
Use your tools of choice to scrape the data, graph it and alert on it.
Look for trends, changes, and other anomalies.
Collect everything but alert only on carefully selected, key data.

Monitoring
System
CPU usage, available storage, I/O latency, network traffic, memory
usage, garbage collection…
Insufficient memory leads to unexpected behavior. Full disks
prevent data flowing, and latency causes delays. Slow networks and
insufficient bandwidth throttle throughput.
A system running at the edge of its limits is more likely to fail.

Partitions
Replication
Under-replicated partitions have an in-sync replica count that is
less than the number of replicas for the partition.
When a partition has fewer in-sync replicas than min.insync.replicas,
producers with acks=all can no longer produce and there is a higher
risk of data loss.
Offline partitions have no leader.
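
The three conditions above can be expressed as a small classifier. This is an illustrative sketch only: the function name and state labels are hypothetical, and treating a leader of None or -1 as "no leader" is an assumption about how the caller encodes it.

```python
def partition_state(replicas, isr, leader, min_isr):
    """Classify a partition from its replica list, ISR list, leader and min.insync.replicas."""
    if leader is None or leader == -1:
        return "offline"            # no leader, so no reads or writes
    if len(isr) < min_isr:
        return "below-min-isr"      # producers using acks=all are rejected
    if len(isr) < len(replicas):
        return "under-replicated"   # still usable, but with less redundancy
    return "healthy"
```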

Partitions
Warnings
Partitions can regularly go in and out of fully replicated state:
brokers restart, garbage collection runs, and new partitions may
be catching up after a rebalance.
The ISR count drops before quickly returning to normal.
kafka-topics.sh --describe --under-replicated-partitions --zookeeper <zk location>
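
The describe output can be parsed to track ISR counts over time. The sketch below assumes the per-partition lines look like "Topic: t1  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2"; the exact layout can vary between Kafka versions, and the function names are hypothetical.

```python
import re

# Assumed per-partition line layout from kafka-topics.sh --describe.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)

def parse_describe(output):
    """Return one dict per partition line; topic summary lines are ignored."""
    rows = []
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append({
                "topic": m["topic"],
                "partition": int(m["partition"]),
                "leader": m["leader"],
                "replicas": [int(b) for b in m["replicas"].split(",")],
                "isr": [int(b) for b in m["isr"].split(",") if b],
            })
    return rows

def under_replicated(rows):
    """(topic, partition) pairs whose ISR is smaller than the replica list."""
    return [(r["topic"], r["partition"]) for r in rows
            if len(r["isr"]) < len(r["replicas"])]
```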

Partitions
Alerts
Look for partitions where in-sync replica issues don’t resolve
themselves, that show a decreasing number of in-sync replicas, or
that are already under-replicated.
kafka.common.NotEnoughReplicasException: Number of insync replicas for
partition [partname-1,0] is [1], below required minimum [2] at
These indicate failed brokers, performance issues, or an unbalanced
cluster.
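
To separate a short dip from a problem that does not resolve itself, an alert can require several consecutive bad samples. A minimal sketch, assuming ISR counts are sampled at a regular interval and stored oldest to newest; the function name and the default of three samples are hypothetical.

```python
def persistent_under_replication(samples, replication_factor, min_consecutive=3):
    """samples: {partition: [isr_count, ...]} ordered oldest to newest.

    Returns partitions whose most recent min_consecutive samples are all
    below replication_factor, meaning the dip has not recovered.
    """
    flagged = []
    for partition, counts in samples.items():
        recent = counts[-min_consecutive:]
        if len(recent) == min_consecutive and all(c < replication_factor for c in recent):
            flagged.append(partition)
    return sorted(flagged)
```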

Partitions
Balance
Ensure leaders are preferred.
kafka-preferred-replica-election.sh
Not all partitions are equal. Look for imbalanced partitions and act
early, ensuring partitions are evenly distributed.
kafka-reassign-partitions.sh
Consider adding partitions and consumers.
kafka-topics.sh --alter --zookeeper <ZK> --topic t1 --partitions 50
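
Before reassigning anything, it helps to see how leadership is spread and which partitions are not led by their preferred replica (the first broker in the replica list). A sketch under the assumption that partition data is already available as (topic, partition, leader, replicas) tuples; the function names are hypothetical.

```python
from collections import Counter

def leader_counts(partitions):
    """Count how many partitions each broker currently leads."""
    return Counter(leader for _, _, leader, _ in partitions)

def non_preferred_leaders(partitions):
    """Partitions whose leader is not the first replica in their replica list."""
    return [(t, p) for t, p, leader, replicas in partitions
            if replicas and leader != replicas[0]]
```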

More metrics
Brokers
Track throughput over time.
Ensure there is a single controller. If the controller needs to be
restarted, consider deleting the znode that represents the controller
in ZooKeeper to trigger a re-election.
Monitor for exhaustion of request handlers.
Watch for no metrics!
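
The single-controller and missing-metrics checks can be combined. Each broker exposes the JMX metric kafka.controller:type=KafkaController,name=ActiveControllerCount, and the values should sum to 1 across the cluster. In this sketch, how the values are scraped is out of scope, None stands for a broker that reported nothing, and the function name is hypothetical.

```python
def controller_status(active_controller_counts):
    """active_controller_counts: {broker_id: ActiveControllerCount value, or None if missing}."""
    missing = sorted(b for b, v in active_controller_counts.items() if v is None)
    if missing:
        # Silence is a finding in itself: report it before anything else.
        return f"no metrics from brokers {missing}"
    total = sum(active_controller_counts.values())
    if total == 1:
        return "ok"
    return "no active controller" if total == 0 else f"{total} active controllers"
```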

More metrics
Clients
Monitor metrics showing the trend of flow rates, latency and
consumer lag.
Producers can saturate a cluster. Consider imposing quotas to
prevent outages.
Look at data transfer rates to identify problem connections.
kafka-consumer-groups.sh
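
Consumer lag for a partition is the log end offset minus the group's committed offset. A minimal sketch of that calculation, assuming the offsets have already been collected into dictionaries keyed by partition; the function name is hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset, floored at 0.

    Partitions with no committed offset are reported as None.
    """
    lag = {}
    for partition, end in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        lag[partition] = None if committed is None else max(end - committed, 0)
    return lag
```

A single lag value says little; it is the trend over time that shows whether consumers are keeping up.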

ZooKeeper

ZooKeeper
Send four-letter words to the ZooKeeper cluster to query state
echo "srvr" | nc <zookeeper_ip> 2181
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/9/487
Received: 121953
Sent: 122133
Connections: 3
Outstanding: 0
Zxid: 0x100012f2e
Mode: follower
Node count: 147
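
The srvr response is plain "Key: value" text, so it is easy to turn into something a monitoring script can check. A sketch that parses text already captured from the command; the function names are hypothetical.

```python
def parse_srvr(text):
    """Turn 'Key: value' lines from srvr output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

def latency(stats):
    """Split the 'Latency min/avg/max' value into integers."""
    lo, avg, hi = (int(x) for x in stats["Latency min/avg/max"].split("/"))
    return {"min": lo, "avg": avg, "max": hi}
```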

ZooKeeper
Navigate the ZooKeeper tree
bin/zkCli.sh
ls /
get /brokers/ids/1
{"listener_security_protocol_map":{"INTERNAL":"PLAINTEXT","INTERNAL_SECURE":"SASL_PLAINTEXT","EXTERNAL":"SASL_PLAINTEXT"},"endpoints":["INTERNAL://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9092","INTERNAL_SECURE://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9094","EXTERNAL://9.20.193.141:30683"],"jmx_port":9999,"host":"elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local","timestamp":"1537201852269","port":9092,"version":4}
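
The broker registration is JSON, so the advertised listeners can be pulled out and compared with what clients are actually configured to use. A sketch assuming the payload has the "endpoints" and "listener_security_protocol_map" fields seen in this registration; the function name is hypothetical.

```python
import json

def broker_endpoints(znode_json):
    """Return {listener_name: (host, port, security_protocol)} from a broker registration."""
    info = json.loads(znode_json)
    protocols = info.get("listener_security_protocol_map", {})
    result = {}
    for endpoint in info["endpoints"]:
        # Endpoints look like "LISTENER://host:port".
        name, _, hostport = endpoint.partition("://")
        host, _, port = hostport.rpartition(":")
        result[name] = (host, int(port), protocols.get(name))
    return result
```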

Help, I’ve found a bug

Suspect a problem
Raise a JIRA
cwiki.apache.org/confluence/display/KAFKA/Reporting+Issues+in+Apache+Kafka
issues.apache.org/jira/projects/KAFKA/issues/KAFKA-8425?filter=allopenissues
If it’s a big change or has external implications, raise a KIP
cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Submit a fix
kafka.apache.org/contributing

Fun links
blog.newrelic.com/engineering/new-relic-kafkapocalypse/
engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
ibm.com/developerworks/community/blogs/aimsupport/entry/solve_the_easy_outofmemory_problems_with_a_javacore
kafka.apache.org/documentation.html#basic_ops_cluster_expansion
sematext.com/blog/kafka-metrics-to-monitor/
www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Thank you
Emma Humber
Support Lead - IBM Event Streams
—
emma.humber@uk.ibm.com
