Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
3. Help my Kafka’s broken: Prepare
© 2019 IBM Corporation 3
Include resource names such as topics and hostnames as well as
routes between systems in a topology diagram.
Collect logs and store JMX metrics published by Kafka brokers,
clients, the JVM and the OS to avoid outages.
Make logs useful. Include time stamps, method and class names as
well as thread ids and connection handles.
Change one thing at a time.
4. Help my Kafka’s broken: Review
Use logs to create a timeline of events. Consult your metrics.
Compare with a working system.
Collect the information you think you may need to get to root
cause before restarting.
If you can’t get to root cause, work out what extra data to capture
so that the next occurrence brings you a step closer to understanding
the failure.
5. Narrow down the problem
Messages on a topic
kafka-install/bin/kafka-console-consumer.sh --topic NAME --bootstrap-server HOSTNAME:PORT
Consumers connected
kafka-install/bin/kafka-consumer-groups.sh --bootstrap-server HOSTNAME:PORT --list
Producers connected
JMX : kafka.server -> BrokerTopicMetrics -> BytesInPerSec -> TOPICNAME
8. Logs
Find a log4j.properties
kafka-install/config/log4j.properties
Edit output location
log4j.appender.kafkaAppender.File=mylog123.log
Change log level
log4j.rootLogger=INFO, stdout, kafkaAppender
log4j.logger.kafka=DEBUG
log4j.logger.org.apache.kafka=TRACE
9. Logs
[2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 with initial high watermark 0 (kafka.cluster.Replica)
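A log line like the one above splits naturally into a time stamp, a level, a message and a logger name, which is the raw material for a timeline. The following is a minimal sketch in Python, assuming the layout `[timestamp] LEVEL message (logger)` seen in the sample; the function name `parse_log_line` is hypothetical, and a different log4j pattern would need a different expression.

```python
import re
from datetime import datetime

# Assumed layout, taken from the sample line: [timestamp] LEVEL message (logger)
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<message>.*?)"
    r"(?:\s+\((?P<logger>[\w.$]+)\))?$"
)

def parse_log_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # log4j prints milliseconds after a comma, e.g. 13:44:02,633
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields

sample = ("[2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 "
          "with initial high watermark 0 (kafka.cluster.Replica)")
entry = parse_log_line(sample)
print(entry["ts"].isoformat(), entry["level"], entry["logger"])
```

Parsed entries from several brokers and clients can then be sorted by `ts` to line events up against each other.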
11. Hangs: Symptoms
Clients were connected and running but now processing has
stopped.
Logging may have stopped completely, or a specific thread may
have stopped writing to the logs.
Suspect a deadlock.
12. Hangs: Actions
Collect javacores (thread dumps) at intervals. Content and output location
depend on your JRE vendor and settings.
kill -3 <pid_of_process>
Look for threads that don’t change and deadlock alerts.
Find which threads are waiting for which resources, draw a
diagram and use analysis tools.
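Spotting threads that don’t change between dumps can be partly automated. The sketch below, in Python, assumes HotSpot-style text dumps in which each thread starts with a quoted name and each stack frame starts with `at`; IBM javacores use a different layout and would need a different parser. The function names and the sample dumps are hypothetical.

```python
def parse_thread_dump(text):
    """Map thread name -> tuple of stack frames for one dump (assumed HotSpot-style text)."""
    stacks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            # Header line, e.g. "thread-name" #12 daemon prio=5 ...
            current = line.split('"')[1]
            stacks[current] = []
        elif line.startswith("at ") and current is not None:
            stacks[current].append(line)
        elif not line:
            current = None  # a blank line ends the current thread's section
    return {name: tuple(frames) for name, frames in stacks.items()}

def unchanged_threads(dumps):
    """Names of threads whose non-empty stack is identical in every dump."""
    parsed = [parse_thread_dump(d) for d in dumps]
    common = set(parsed[0]).intersection(*parsed[1:])
    return sorted(
        name for name in common
        if parsed[0][name] and all(p[name] == parsed[0][name] for p in parsed)
    )

# Two made-up dumps taken some time apart: only "main" has moved on.
dump_a = '''"kafka-request-handler-0" #31 daemon prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
    at example.Worker.step(Worker.java:10)

"main" #1 prio=5
   java.lang.Thread.State: RUNNABLE
    at example.Main.loop(Main.java:5)
'''
dump_b = dump_a.replace("Main.java:5", "Main.java:9")
print(unchanged_threads([dump_a, dump_b]))
```

An unchanged stack is only a candidate: idle threads waiting for work look the same, so check the thread state and the lock it is waiting on before concluding there is a deadlock.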
14. Memory
Problems due to excessive load or a memory leak.
Use analysis tools such as Health Center to help with
diagnosis based on a heap dump.
-XX:+HeapDumpOnOutOfMemoryError
In Kubernetes, containers that are taking up too much
memory will often be terminated.
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
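The exit code in that output follows the usual convention that a process killed by a signal exits with 128 plus the signal number, so 137 is 128 + 9, that is SIGKILL. A small Python sketch of that arithmetic follows; `describe_exit_code` is a hypothetical helper and the signal names come from the platform it runs on.

```python
import signal

def describe_exit_code(code):
    """Explain an exit code using the convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"

print(describe_exit_code(137))
```

The exit code alone only says the process was killed; it is the `Reason: OOMKilled` field that attributes the kill to the container’s memory limit.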
15. Garbage collection
Kafka can be sensitive to garbage collection.
Symptoms are unexplained delays in processing, increased message
latency and gaps between log time stamps.
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
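Gaps between log time stamps can be found mechanically and then compared with the pauses recorded in the GC log. The Python sketch below assumes every log entry starts with a bracketed time stamp of the form `[2019-09-12 13:44:02,633]`; the 5 second threshold, the function name `find_gaps` and the sample lines are illustrative assumptions.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(seconds=5)):
    """Yield (previous_ts, next_ts, gap) where consecutive entries are further apart than threshold."""
    previous = None
    for line in lines:
        if not line.startswith("["):
            continue  # skip continuation lines such as stack traces
        try:
            # Characters 1..23 hold "YYYY-MM-DD HH:MM:SS,mmm"
            current = datetime.strptime(line[1:24], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current

# Made-up lines with placeholder messages; only the time stamps matter here.
lines = [
    "[2019-09-12 13:44:02,633] INFO placeholder message one (example.Logger)",
    "[2019-09-12 13:44:03,100] INFO placeholder message two (example.Logger)",
    "[2019-09-12 13:44:15,900] INFO placeholder message three (example.Logger)",
]
gaps = list(find_gaps(lines))
for before, after, gap in gaps:
    print(f"no log output for {gap.total_seconds():.1f}s between {before} and {after}")
```

A quiet log is not necessarily a paused JVM, so treat each gap as a time window to look up in the GC log rather than as proof of a pause.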
17. Monitoring: Metrics
Kafka and the JVM it runs in emit JMX metrics that show point-in-
time data for various statistics. OS level resources can also be
monitored.
Use your tools of choice to scrape the data, graph it and alert on it.
Look for trends, changes, and other anomalies.
Collect everything but alert only on carefully selected, key data.
18. Monitoring: System
CPU usage, available storage, I/O latency, network traffic, memory
usage, garbage collection…
Insufficient memory leads to unexpected behavior. Full disks
stop data from flowing, and I/O latency causes delays. Slow networks and
insufficient bandwidth throttle throughput.
A system running at the edge of its limits is more likely to fail.
19. Partitions: Replication
Under-replicated partitions have an in-sync replica count that is
less than the number of replicas for the partition.
When a partition has fewer in-sync replicas than the configured minimum
(min.insync.replicas), producers with acks=all can no longer produce and
there is a higher risk of data loss.
Offline partitions have no leader.
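The three definitions above reduce to simple comparisons on a partition’s replica list, in-sync replica list and leader. A minimal Python sketch, assuming that data has already been collected (for example from `kafka-topics.sh --describe`); the function name and state labels are hypothetical, and a missing leader is assumed to be given as `None` or `-1`.

```python
def partition_state(replicas, isr, leader, min_insync_replicas):
    """Return the set of conditions that apply to one partition."""
    states = set()
    if leader is None or leader < 0:
        states.add("offline")           # no leader
    if len(isr) < len(replicas):
        states.add("under-replicated")  # fewer in-sync replicas than replicas
    if len(isr) < min_insync_replicas:
        states.add("below-min-isr")     # acks=all producers are rejected
    return states or {"healthy"}

print(partition_state(replicas=[1, 2, 3], isr=[1, 2, 3], leader=1, min_insync_replicas=2))
print(partition_state(replicas=[1, 2, 3], isr=[1], leader=1, min_insync_replicas=2))
```

The second call reports both conditions, since a partition that has dropped below the minimum is necessarily under-replicated as well.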
20. Partitions: Warnings
Partitions can regularly go in and out of the fully replicated state.
Brokers restart and garbage collection runs. New partitions may
be catching up after a rebalance.
In these cases the ISR count drops briefly, then quickly returns to normal.
kafka-topics.sh --describe --under-replicated-partitions --zookeeper <zk location>
21. Partitions: Alerts
Look for partitions whose in-sync replica issues don’t resolve
themselves, that show a decreasing number of in-sync replicas, or
that are already under-replicated.
kafka.common.NotEnoughReplicasException: Number of insync replicas for
partition [partname-1,0] is [1], below required minimum [2] at
These are indicators of failed brokers, performance issues, or an
unbalanced cluster.
22. Partitions: Balance
Ensure leaders are preferred.
kafka-preferred-replica-election.sh
Not all partitions are equal. Look for imbalanced partitions and act
early, ensuring partitions are evenly distributed.
kafka-reassign-partitions.sh
Consider adding partitions and consumers.
kafka-topics.sh --alter --zookeeper <ZK> --topic t1 --partitions 50
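Both checks on this slide, whether leaders are preferred and whether leadership is evenly spread, can be computed from the leader and replica list of each partition. The Python sketch below relies on the preferred leader being the first broker in a partition’s replica list; the function name `leader_report` and the sample assignment are hypothetical.

```python
from collections import Counter

def leader_report(assignments):
    """assignments maps partition -> (leader, replicas); replicas[0] is the preferred leader.

    Returns (leaders per broker, partitions not led by their preferred leader).
    """
    leaders = Counter(leader for leader, _ in assignments.values())
    not_preferred = sorted(
        partition
        for partition, (leader, replicas) in assignments.items()
        if replicas and leader != replicas[0]
    )
    return dict(leaders), not_preferred

# Made-up example: broker 1 leads everything, e.g. after brokers 2 and 3 restarted.
assignments = {
    "t1-0": (1, [1, 2, 3]),
    "t1-1": (1, [2, 3, 1]),
    "t1-2": (1, [3, 1, 2]),
}
per_broker, not_preferred = leader_report(assignments)
print(per_broker, not_preferred)
```

A lopsided count together with a non-empty list suggests a preferred leader election; a lopsided count with an empty list points at the replica assignment itself, which is what a reassignment changes.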
23. More metrics: Brokers
Track throughput over time.
Ensure there is a single controller. If a restart is required, consider
deleting the znode representing the controller in ZooKeeper to
trigger re-election.
Monitor for exhaustion of request handlers.
Watch for an absence of metrics!
24. More metrics: Clients
Monitor metrics showing the trend of flow rates, latency and
consumer lag.
Producers can saturate a cluster. Consider imposing quotas to
prevent outages.
Look at data transfer rates to identify problem connections.
kafka-consumer-groups.sh
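Consumer lag for a partition is the log end offset minus the offset the group has committed, which is what the LAG column of `kafka-consumer-groups.sh --describe` reports. A minimal Python sketch of that calculation, assuming the two sets of offsets have already been collected; the function name and the sample numbers are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    lag = {}
    for partition, end in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        # A partition with no committed offset is reported as None rather than guessed.
        lag[partition] = None if committed is None else max(end - committed, 0)
    return lag

# Made-up offsets for a three-partition topic.
lag = consumer_lag(
    log_end_offsets={"t1-0": 1500, "t1-1": 900, "t1-2": 40},
    committed_offsets={"t1-0": 1500, "t1-1": 650},
)
print(lag)
```

A single reading says little; it is lag that keeps growing across samples that shows consumers are falling behind.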
26. ZooKeeper
Send four-letter words to the ZooKeeper cluster to query its state.
echo "srvr" | nc <zookeeper_ip> 2181
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/9/487
Received: 121953
Sent: 122133
Connections: 3
Outstanding: 0
Zxid: 0x100012f2e
Mode: follower
Node count: 147
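The `srvr` response is a list of `Key: value` lines, so it is easy to turn into data for graphing or alerting. A minimal Python sketch, using the response shown on this slide as input; the function name `parse_srvr` is hypothetical.

```python
def parse_srvr(text):
    """Turn the 'Key: value' lines of a srvr response into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons (e.g. 16:18 GMT)
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

response = """Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/9/487
Received: 121953
Sent: 122133
Connections: 3
Outstanding: 0
Zxid: 0x100012f2e
Mode: follower
Node count: 147
"""
stats = parse_srvr(response)
latency_min, latency_avg, latency_max = (int(v) for v in stats["Latency min/avg/max"].split("/"))
print(stats["Mode"], latency_avg, int(stats["Outstanding"]))
```

Collecting `Mode` from every server in the ensemble is a quick way to confirm that exactly one of them reports itself as leader.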
27. ZooKeeper
Navigate the ZooKeeper tree
bin/zkCli.sh
ls /
get /brokers/ids/1
{"listener_security_protocol_map":{"INTERNAL":"PLAINTEXT","INTERNAL_SECURE":"SASL_PLAINTEXT","EXTERNAL":"SASL_PLAINTEXT"},"endpoints":["INTERNAL://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9092","INTERNAL_SECURE://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9094","EXTERNAL://9.20.193.141:30683"],"jmx_port":9999,"host":"elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local","timestamp":"1537201852269","port":9092,"version":4}
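The value stored under `/brokers/ids/<id>` is JSON, so the listeners a broker has registered can be pulled out and checked against what clients are trying to connect to. A minimal Python sketch, assuming the `endpoints` and `listener_security_protocol_map` fields shown on this slide; the function name and the shortened sample registration (with made-up host names) are hypothetical.

```python
import json

def broker_endpoints(registration_json):
    """Map listener name -> (host, port, security protocol) from a broker znode value."""
    data = json.loads(registration_json)
    protocols = data.get("listener_security_protocol_map", {})
    result = {}
    for endpoint in data.get("endpoints", []):
        # Endpoints look like LISTENER://host:port
        listener, _, address = endpoint.partition("://")
        host, _, port = address.rpartition(":")
        result[listener] = (host, int(port), protocols.get(listener))
    return result

# Shortened, made-up registration in the same shape as the real one.
sample = json.dumps({
    "listener_security_protocol_map": {"INTERNAL": "PLAINTEXT", "EXTERNAL": "SASL_PLAINTEXT"},
    "endpoints": ["INTERNAL://broker-1.example.internal:9092", "EXTERNAL://192.0.2.10:30683"],
    "port": 9092,
    "version": 4,
})
endpoints = broker_endpoints(sample)
print(endpoints["EXTERNAL"])
```

If a client cannot connect, comparing its bootstrap address and security settings with the matching entry here is a quick first check.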
29. Suspect a problem
Raise a JIRA
cwiki.apache.org/confluence/display/KAFKA/Reporting+Issues+in+Apache+Kafka
issues.apache.org/jira/projects/KAFKA/issues/KAFKA-8425?filter=allopenissues
If it’s a big change or has external implications, raise a KIP
cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Submit a fix
kafka.apache.org/contributing
31. Thank you
Emma Humber
Support Lead - IBM Event Streams
—
emma.humber@uk.ibm.com
© Copyright IBM Corporation 2019. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express
or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represent only goals and objectives. IBM, the IBM logo, and ibm.com are trademarks of IBM
Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available at Copyright and trademark
information.