IBM Event Streams
Apache Kafka
© 2019 IBM Corporation
Help, My Kafka’s Broken
Emma Humber
Kafka Summit SF 2019
Metrics alert
Applications stop processing
SLAs are missed
End users complain
Help, my Kafka’s broken: Prepare
Include resource names such as topics and hostnames as well as
routes between systems in a topology diagram.
Collect logs and store JMX metrics published by Kafka brokers,
clients, the JVM and the OS to avoid outages.
Make logs useful. Include time stamps, method and class names as
well as thread ids and connection handles.
Change one thing at a time.
Help, my Kafka’s broken: Review
Use logs to create a timeline of events. Consult your metrics.
Compare with a working system.
Collect the information you think you may need to get to root
cause before restarting.
If you can’t get to root cause, work out what extra data to collect so
that the next occurrence brings you a step closer to understanding
the failure.
Narrow down the problem
Messages on a topic
kafka-install/bin/kafka-console-consumer.sh --topic NAME --bootstrap-server HOSTNAME:PORT
Consumers connected
kafka-install/bin/kafka-consumer-groups.sh --bootstrap-server HOSTNAME:PORT --list
Producers connected
JMX : kafka.server -> BrokerTopicMetrics -> BytesInPerSec -> TOPICNAME
No, really, what changed?
Logs
Find a log4j.properties
kafka-install/config/log4j.properties
Edit output location
log4j.appender.kafkaAppender.File=mylog123.log
Change log level
log4j.rootLogger=INFO, stdout, kafkaAppender
log4j.logger.kafka=DEBUG
log4j.logger.org.apache.kafka=TRACE
[2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 with initial high watermark 0 (kafka.cluster.Replica)
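A minimal sketch of turning log lines like the one above into a timeline. It assumes the layout shown on this slide ([timestamp] LEVEL message (logger)); the function names are made up for illustration, and a custom log4j pattern would need a different expression.

```python
import re
from datetime import datetime

# Assumed layout: [timestamp] LEVEL message (logger)
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+) (?P<message>.*) \((?P<logger>[^()]+)\)$"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields

def build_timeline(lines):
    """Parse lines gathered from any number of log files and sort by time."""
    events = [e for e in map(parse_log_line, lines) if e is not None]
    return sorted(events, key=lambda e: e["ts"])

sample = ("[2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 "
          "with initial high watermark 0 (kafka.cluster.Replica)")
event = parse_log_line(sample)
```

Lines that do not match (stack traces, continuation lines) are skipped here; a fuller tool would attach them to the preceding event.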
Java
Hangs: Symptoms
Clients were connected and running but now processing has
stopped.
Logging may have stopped completely, or a specific thread may
have stopped logging.
Suspect a deadlock.
Hangs: Actions
Collect javacores at intervals. Content and output location
depend on your JRE vendor and settings.
kill -3 <pid_of_process>
Look for threads that don’t change and deadlock alerts.
Find which threads are waiting for which resources, draw a
diagram and use analysis tools.
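The "which thread waits for which resource" diagram is a wait-for graph, and a deadlock is a cycle in it. A rough sketch, assuming you have already reduced the dump to a mapping from each blocked thread to the thread that owns the lock it wants (the mapping and thread names below are hypothetical, not javacore syntax):

```python
def find_deadlock(waits_for):
    """Return the threads forming a cycle in the wait-for graph, or None.

    waits_for maps a blocked thread to the thread holding the resource
    it is waiting on (hypothetical input, built by hand or by a parser).
    """
    for start in waits_for:
        path = []
        seen = set()
        current = start
        # Follow the chain of owners until it ends or revisits a thread.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            return path[path.index(current):]
    return None

cycle = find_deadlock({"handler-1": "handler-2",
                       "handler-2": "handler-1",
                       "handler-3": "handler-1"})
```

Here handler-3 is blocked but not part of the cycle, so only the two threads that wait on each other are reported.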
Javacore
Memory
Problems due to excessive load or a memory leak.
Use analysis tools such as Health Center to help with
diagnosis based on a heap dump.
-XX:+HeapDumpOnOutOfMemoryError
In Kubernetes, containers that are taking up too much
memory will often be terminated.
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
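Exit code 137 follows the usual convention that a process killed by a signal reports 128 plus the signal number, so 137 is signal 9 (SIGKILL), which is what the kernel OOM killer sends. A small sketch of that arithmetic; the helper name is made up, and signal names come from the platform the script runs on (Linux assumed):

```python
import signal

def describe_exit_code(code):
    """Explain an exit code; values above 128 mean 128 + signal number."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return f"exited with status {code}"

summary = describe_exit_code(137)
```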
Garbage collection
Kafka can be sensitive to garbage collection.
Symptoms include unexplained processing delays, increased message
latency, and gaps between log time stamps.
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
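Gaps between log time stamps can be found mechanically once the time stamps are parsed. A minimal sketch, assuming a list of datetime values already extracted from a log; the 5-second threshold is an arbitrary illustration, not a recommendation from the talk:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(seconds=5)):
    """Return (previous, current) pairs spaced further apart than threshold."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]

stamps = [
    datetime(2019, 9, 12, 13, 44, 2),
    datetime(2019, 9, 12, 13, 44, 3),
    datetime(2019, 9, 12, 13, 44, 20),  # 17 seconds of silence before this entry
]
gaps = find_gaps(stamps)
```

Any gap reported here is a candidate to compare against pause times in the GC log for the same period.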
Monitoring: Metrics
Kafka and the JVM it runs in emit JMX metrics that show point-in-time
data for various statistics. OS-level resources can also be
monitored.
Use your tools of choice to scrape the data, graph it and alert on it.
Look for trends, changes, and other anomalies.
Collect everything but alert only on carefully selected, key data.
Monitoring: System
CPU usage, available storage, I/O latency, network traffic, memory
usage, garbage collection…
Insufficient memory leads to unexpected behavior. Full disks
stop data flowing, and I/O latency causes delays. Slow networks and
insufficient bandwidth throttle throughput.
A system running at the edge of its limits is more likely to fail.
Partitions: Replication
Under-replicated partitions have an in-sync replica count that is
less than the number of replicas for the partition.
When a partition has fewer than min in-sync replicas, producers
with acks=all can no longer produce and there is a higher risk of
data loss.
Offline partitions have no leader.
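The three conditions above can be written as one check. A sketch under assumptions: the function name and state labels are invented for illustration, and a missing leader is represented as None or -1.

```python
def partition_state(replicas, isr, min_isr, leader):
    """Classify a partition from its replica list, ISR list, and leader id."""
    if leader is None or leader < 0:
        return "offline"            # no leader
    if len(isr) < min_isr:
        return "below-min-isr"      # acks=all producers are rejected
    if len(isr) < len(replicas):
        return "under-replicated"   # ISR count below replica count
    return "healthy"

state = partition_state(replicas=[1, 2, 3], isr=[1, 2], min_isr=2, leader=1)
```

The checks run from most to least severe, so a partition that is both offline and under-replicated is reported as offline.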
Partitions: Warnings
Partitions can regularly go in and out of fully replicated state.
Brokers restart and garbage collection runs. New partitions may
be catching up after a rebalance.
The ISR count drops before quickly returning to normal.
kafka-topics.sh --describe --under-replicated-partitions --zookeeper <zk location>
Partitions: Alerts
Look for partitions where in-sync replica issues don’t resolve
themselves, that show a decreasing number of in-sync replicas, or
that are already under-replicated.
kafka.common.NotEnoughReplicasException: Number of insync replicas for
partition [partname-1,0] is [1], below required minimum [2] at
This indicates failed brokers, performance issues, or an unbalanced
cluster.
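One way to separate transient dips from issues that don’t resolve themselves is to alert only on partitions that stay under-replicated across several consecutive checks. A sketch, assuming each sample is the set of under-replicated partition names from one polling run; the sample count of 3 is an arbitrary choice:

```python
def persistent_under_replicated(samples, min_consecutive=3):
    """Return partitions present in each of the last min_consecutive samples.

    samples is a list of sets of partition names, oldest first.
    """
    if len(samples) < min_consecutive:
        return set()
    recent = samples[-min_consecutive:]
    return set.intersection(*map(set, recent))

history = [{"t-0"}, {"t-0", "t-1"}, {"t-0"}, {"t-0", "t-2"}]
alerts = persistent_under_replicated(history)
```

In this history t-1 and t-2 each appear once and recover, so only t-0 is flagged.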
Partitions: Balance
Ensure partition leaders are the preferred replicas.
kafka-preferred-replica-election.sh
Not all partitions are equal. Look for imbalanced partitions and act
early, ensuring partitions are evenly distributed.
kafka-reassign-partitions.sh
Consider adding partitions and consumers.
kafka-topics.sh --alter --zookeeper <ZK> --topic t1 --partitions 50
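A quick balance check is to count leaders per broker and to count partitions whose leader is not the preferred replica (the first broker in the replica list). A sketch, assuming the leader and replica list for each partition have already been read from the describe output; the input shape is hypothetical:

```python
from collections import Counter

def leader_report(partitions):
    """partitions: list of (leader, replicas) tuples.

    Returns leaders-per-broker counts and the number of partitions
    whose current leader is not replicas[0], the preferred replica.
    """
    leaders = Counter(leader for leader, _ in partitions)
    non_preferred = sum(1 for leader, replicas in partitions
                        if leader != replicas[0])
    return leaders, non_preferred

leaders, non_preferred = leader_report([
    (1, [1, 2, 3]),
    (1, [2, 3, 1]),  # broker 2 is preferred but broker 1 leads
    (3, [3, 1, 2]),
])
```

Broker 2 leads nothing in this example, which is the kind of skew a preferred replica election is meant to correct.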
More metrics: Brokers
Track throughput over time.
Ensure there is a single controller. If the controller needs to be
restarted, consider deleting the znode representing the controller in
ZooKeeper to trigger re-election.
Monitor for exhaustion of request handlers.
Watch for no metrics!
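Both the single-controller check and the "no metrics" check fit in a few lines. This sketch assumes you have scraped the ActiveControllerCount gauge (kafka.controller:type=KafkaController,name=ActiveControllerCount) from each broker into a dict; the function name and result keys are made up for illustration.

```python
def controller_status(active_controller_counts, expected_brokers):
    """Check controller uniqueness and spot brokers reporting nothing.

    active_controller_counts maps broker id to its latest
    ActiveControllerCount sample (0 or 1 on a healthy broker).
    """
    missing = sorted(set(expected_brokers) - set(active_controller_counts))
    total = sum(active_controller_counts.values())
    return {
        "controllers": total,
        "single_controller": total == 1,
        "missing_metrics": missing,
    }

status = controller_status({1: 0, 2: 1}, expected_brokers=[1, 2, 3])
```

A broker that reports nothing is treated as its own finding rather than as a zero, because silence may mean the broker or the scraper is down.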
More metrics: Clients
Monitor metrics showing the trend of flow rates, latency and
consumer lag.
Producers can saturate a cluster. Consider imposing quotas to
prevent outages.
Look at data transfer rates to identify problem connections.
kafka-consumer-groups.sh
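Consumer lag for a partition is the log end offset minus the group’s committed offset, and the trend matters more than any single reading. A sketch, assuming the offsets have been copied from the CURRENT-OFFSET and LOG-END-OFFSET columns that kafka-consumer-groups.sh prints with --describe; the dict keys and helper names are hypothetical:

```python
def total_lag(rows):
    """Sum lag across partitions; each row holds the two offsets."""
    return sum(row["log_end_offset"] - row["current_offset"] for row in rows)

def lag_is_growing(lag_samples):
    """True if every sample is larger than the one before it."""
    return len(lag_samples) > 1 and all(
        later > earlier for earlier, later in zip(lag_samples, lag_samples[1:])
    )

rows = [
    {"partition": 0, "current_offset": 100, "log_end_offset": 150},
    {"partition": 1, "current_offset": 200, "log_end_offset": 205},
]
lag = total_lag(rows)
```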
ZooKeeper
Send four-letter words to the ZooKeeper cluster to query its state.
echo "srvr" | nc <zookeeper_ip> 2181
Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
Latency min/avg/max: 0/9/487
Received: 121953
Sent: 122133
Connections: 3
Outstanding: 0
Zxid: 0x100012f2e
Mode: follower
Node count: 147
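The srvr response is plain "key: value" text, so it is easy to feed into monitoring. A minimal parsing sketch; the function name is made up, and the sample is a subset of the output shown on this slide:

```python
def parse_srvr(text):
    """Split each 'key: value' line of srvr output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

sample = """Latency min/avg/max: 0/9/487
Outstanding: 0
Mode: follower
Node count: 147"""

stats = parse_srvr(sample)
latency_min, latency_avg, latency_max = (
    int(v) for v in stats["Latency min/avg/max"].split("/")
)
```

Values stay as strings until needed, since fields such as Zxid are hexadecimal and the version line contains further colons.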
Navigate the ZooKeeper tree
bin/zkCli.sh
ls /
get /brokers/ids/1
{"listener_security_protocol_map":{"INTERNAL":"PLAINTEXT","INTERNAL_SECURE":"SASL_PLAINTEXT","EXTERNAL":"SASL_PLAINTEXT"},"endpoints":["INTERNAL://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9092","INTERNAL_SECURE://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9094","EXTERNAL://9.20.193.141:30683"],"jmx_port":9999,"host":"elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local","timestamp":"1537201852269","port":9092,"version":4}
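The broker registration is JSON, so the advertised listeners can be pulled out to check what clients are being told to connect to. A sketch, assuming the endpoints field has the LISTENER://host:port form seen on this slide; the function name and the internal hostname in the sample are hypothetical:

```python
import json

def listener_addresses(registration_json):
    """Map each listener name to the (host, port) the broker advertises."""
    data = json.loads(registration_json)
    addresses = {}
    for endpoint in data["endpoints"]:
        name, _, address = endpoint.partition("://")
        host, _, port = address.rpartition(":")
        addresses[name] = (host, int(port))
    return addresses

# Shortened sample; broker-1.example.internal is a placeholder hostname.
sample = ('{"endpoints":["INTERNAL://broker-1.example.internal:9092",'
          '"EXTERNAL://9.20.193.141:30683"],"port":9092,"version":4}')
addresses = listener_addresses(sample)
```

Listener names are split by hand rather than with a URL parser because names such as INTERNAL_SECURE are not valid URL schemes.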
Help, I’ve found a bug
Raise a JIRA
cwiki.apache.org/confluence/display/KAFKA/Reporting+Issues+in+Apache+Kafka
issues.apache.org/jira/projects/KAFKA/issues/KAFKA-8425?filter=allopenissues
If it’s a big change or has external implications, raise a KIP
cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Submit a fix
kafka.apache.org/contributing
Suspect a problem
Fun links
blog.newrelic.com/engineering/new-relic-kafkapocalypse/
engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications
ibm.com/developerworks/community/blogs/aimsupport/entry/solve_the_easy_outofmemory_problems_with_a_javacore
kafka.apache.org/documentation.html#basic_ops_cluster_expansion
sematext.com/blog/kafka-metrics-to-monitor/
www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Thank you
Emma Humber
Support Lead - IBM Event Streams
—
emma.humber@uk.ibm.com
© Copyright IBM Corporation 2019. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express
or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represent only goals and objectives. IBM, the IBM logo, and ibm.com are trademarks of IBM
Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available at Copyright and trademark
information.
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019

  • 1. IBM Event StreamsApache Kafka © 2019 IBM Corporation Help, My Kafka’s Broken Emma Humber Kafka Summit SF 2019
  • 2. Metrics alert Applications stop processing SLAs are missed End users complain © 2019 IBM Corporation 2
  • 3. Help my Kafka’s broken Prepare Review © 2019 IBM Corporation 3 Include resource names such as topics and hostnames as well as routes between systems in a topology diagram. Collect logs and store JMX metrics published by Kafka brokers, clients, the JVM and the OS to avoid outages. Make logs useful. Include time stamps, method and class names as well as thread ids and connection handles. Change one thing at a time.
  • 4. Help my Kafka’s broken Prepare Review © 2019 IBM Corporation 4 Use logs to create a timeline of events. Consult your metrics. Compare with a working system. Collect the information you think you may need to get to root cause before restarting. If you can’t get to root cause, work out what extra data you need for the next occurrence to bring you a step closer to understanding the failure next time.
  • 5. Narrow down the problem Messages on a topic kafka-install/bin/kafka-console-consumer.sh --topic NAME --bootstrap-server HOSTNAME:PORT Consumers connected kafka-install/bin/kafka-consumer-groups.sh --bootstrap-server HOSTNAME:PORT --list Producers connected JMX : kafka.server -> BrokerTopicMetrics -> BytesInPerSec -> TOPICNAME © 2019 IBM Corporation 5
  • 6. No, really, what changed? © 2019 IBM Corporation 6
  • 7. Logs © 2019 IBM Corporation 7
  • 8. Logs © 2019 IBM Corporation 8 Find a log4j.properties kafka-install/config/log4j.properties Edit output location log4j.appender.kafkaAppender.File= mylog123.log Change log level log4j.rootLogger=INFO, stdout, kafkaAppender log4j.logger.kafka=DEBUG log4j.logger.org.apache.kafka=TRACE
  • 9. © 2019 IBM Corporation 9 Logs [2019-09-12 13:44:02,633] INFO Replica loaded for partition asdf-0 with initial high watermark 0 (kafka.cluster.Replica)
  • 10. Java © 2019 IBM Corporation 10
  • 11. Hangs Symptoms Actions © 2019 IBM Corporation 11 Clients were connected and running but now processing has stopped. Logs may have stopped completely or a specific thread within these may have stopped. Suspect a deadlock.
  • 12. Hangs Symptoms Actions © 2019 IBM Corporation 12 Collect javacores at intervals. Content and output location depends on your JRE vendor and settings. kill -3 <pid_of_process> Look for threads that don’t change and deadlock alerts. Find which threads are waiting for which resources, draw a diagram and use analysis tools.
  • 13. Javacore
  • 14. Memory
    Problems are due to excessive load or a memory leak. Use analysis tools such as Health Center to help with diagnosis based on a heap dump: -XX:+HeapDumpOnOutOfMemoryError
    In Kubernetes, containers that take up too much memory will often be terminated:
    Last State: Terminated
    Reason: OOMKilled
    Exit Code: 137
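The exit code 137 follows the common convention that a process killed by a signal reports 128 plus the signal number; 137 is 128 + 9, and signal 9 is SIGKILL. A small sketch of that arithmetic, assuming a POSIX system; `describe_exit_code` is an illustrative helper, not part of any Kafka or Kubernetes tooling.

```python
import signal


def describe_exit_code(code):
    """Explain an exit code; values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {number}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error {code}"
```

For example, `describe_exit_code(137)` reports a kill by SIGKILL, which matches an OOMKilled container.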
  • 15. Garbage collection
    Kafka can be sensitive to garbage collection. Unexplained delays in processing increase message latency, and gaps appear between log timestamps.
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
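Gaps between log timestamps can be found mechanically once the timestamps are available as `datetime` values. This is a rough sketch; the function name `find_gaps` and the five-second default threshold are assumptions for illustration and should be tuned to how chatty the logs normally are.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=5)):
    """Return (previous, current, gap) for consecutive timestamps further apart than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current, current - previous))
    return gaps
```

A gap in the application log that lines up with a long pause in the GC log is a strong hint that garbage collection is the cause.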
  • 16. Monitoring
  • 17. Monitoring: Metrics. Kafka and the JVM it runs in emit JMX metrics that show point-in-time data for various statistics. OS-level resources can also be monitored. Use your tools of choice to scrape the data, graph it and alert on it. Look for trends, changes and other anomalies. Collect everything, but alert only on carefully selected key data.
  • 18. Monitoring: System. CPU usage, available storage, I/O latency, network traffic, memory usage, garbage collection… Insufficient memory leads to unexpected behavior. Full disks prevent data from flowing, and high latency causes delays. Slow networks and insufficient bandwidth throttle throughput. A system running at the edge of its limits is more likely to fail.
  • 19. Partitions: Replication. Under-replicated partitions have an in-sync replica (ISR) count that is less than the number of replicas for the partition. When a partition has fewer in-sync replicas than min.insync.replicas, producers using acks=all can no longer produce to it, and there is a higher risk of data loss. Offline partitions have no leader.
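The three conditions on this slide can be expressed as a simple classification. The sketch below is illustrative only: `partition_state` and its return labels are made-up names, and it assumes a missing leader is represented as `None` or `-1`.

```python
def partition_state(replicas, isr, leader, min_isr):
    """Classify one partition.

    replicas: broker ids assigned to the partition
    isr:      broker ids currently in sync
    leader:   leader broker id, or None / -1 when there is no leader (assumed encoding)
    min_isr:  the min.insync.replicas value that applies to the topic
    """
    if leader is None or leader == -1:
        return "offline"
    if len(isr) < min_isr:
        # Producers using acks=all are rejected in this state.
        return "below-min-isr"
    if len(isr) < len(replicas):
        return "under-replicated"
    return "healthy"
```

The checks are ordered from most to least severe, so a partition that is both offline and under-replicated is reported as offline.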
  • 20. Partitions: Warnings
    Partitions can regularly go in and out of the fully replicated state: brokers restart, garbage collection runs, and new partitions may be catching up after a rebalance. The ISR count drops before quickly returning to normal.
    kafka-topics.sh --describe --under-replicated-partitions --zookeeper <zk location>
  • 21. Partitions: Alerts
    Look for partitions whose in-sync replica issues don’t resolve themselves, that show a decreasing number of in-sync replicas, or that are already under-replicated.
    kafka.common.NotEnoughReplicasException: Number of insync replicas for partition [partname-1,0] is [1], below required minimum [2] at
    These are indicators of failed brokers, performance issues or an unbalanced cluster.
  • 22. Partitions: Balance
    Ensure leaders are preferred: kafka-preferred-replica-election.sh
    Not all partitions are equal. Look for imbalanced partitions and act early, ensuring partitions are evenly distributed: kafka-reassign-partitions.sh
    Consider adding partitions and consumers: kafka-topics.sh --alter --zookeeper <ZK> --topic t1 --partitions 50
  • 23. More metrics: Brokers. Track throughput over time. Ensure there is a single controller. If a restart is required, consider deleting the znode representing the controller in ZooKeeper to trigger re-election. Monitor for exhaustion of request handlers. Watch for an absence of metrics!
  • 24. More metrics: Clients
    Monitor metrics showing the trend of flow rates, latency and consumer lag. Producers can saturate a cluster. Consider imposing quotas to prevent outages. Look at data transfer rates to identify problem connections.
    kafka-consumer-groups.sh
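Consumer lag for a partition is the log end offset minus the group's committed offset, and it is the trend rather than a single value that matters. A minimal sketch of both ideas, assuming the offsets have already been collected into dictionaries keyed by partition; `consumer_lag` and `lag_is_growing` are illustrative names.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        # With no committed offset for the partition, the lag is unknown.
        lag[partition] = None if committed is None else max(end - committed, 0)
    return lag


def lag_is_growing(samples):
    """True if the lag strictly increased between every pair of consecutive samples."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))
```

Alerting on `lag_is_growing` over several samples, rather than on a fixed lag value, avoids false alarms from short bursts of traffic.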
  • 25. ZooKeeper
  • 26. ZooKeeper
    Send four-letter words to the ZooKeeper cluster to query state: echo "srvr" | nc <zookeeper_ip> 2181
    Zookeeper version: 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on 03/06/2019 16:18 GMT
    Latency min/avg/max: 0/9/487
    Received: 121953
    Sent: 122133
    Connections: 3
    Outstanding: 0
    Zxid: 0x100012f2e
    Mode: follower
    Node count: 147
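The same query can be scripted when several ZooKeeper servers need checking. This is a sketch under the assumption that the server answers the four-letter word and then closes the connection, as the `nc` example suggests; `four_letter_word` and `parse_srvr` are illustrative names.

```python
import socket


def four_letter_word(host, port=2181, word="srvr", timeout=5.0):
    """Send a four-letter word to one ZooKeeper server and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(reply):
    """Turn 'Key: value' lines into a dict; lines without ': ' are skipped."""
    stats = {}
    for line in reply.splitlines():
        key, separator, value = line.partition(": ")
        if separator:
            stats[key.strip()] = value.strip()
    return stats
```

Running `parse_srvr(four_letter_word(host))` against each server and comparing the `Mode` values is a quick way to confirm that the ensemble has exactly one leader.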
  • 27. ZooKeeper
    Navigate the ZooKeeper tree:
    bin/zkCli.sh
    ls /
    get /brokers/ids/1
    {"listener_security_protocol_map":{"INTERNAL":"PLAINTEXT","INTERNAL_SECURE":"SASL_PLAINTEXT","EXTERNAL":"SASL_PLAINTEXT"},"endpoints":["INTERNAL://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9092","INTERNAL_SECURE://elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local:9094","EXTERNAL://9.20.193.141:30683"],"jmx_port":9999,"host":"elh1nonp-ibm-es-kafka-sts-1.elh1nonp-ibm-es-kafka-headless-svc.elh.svc.cluster.local","timestamp":"1537201852269","port":9092,"version":4}
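The broker registration returned by `get /brokers/ids/1` is JSON, so the advertised listeners can be pulled out programmatically. A minimal sketch, assuming the value is passed in as a string with plain double quotes; `broker_endpoints` is an illustrative name.

```python
import json


def broker_endpoints(registration):
    """Map listener name to (host, port) from a /brokers/ids/<id> JSON value."""
    info = json.loads(registration)
    endpoints = {}
    for endpoint in info.get("endpoints", []):
        listener, _, address = endpoint.partition("://")
        # Split on the last colon so the host part is kept whole.
        host, _, port = address.rpartition(":")
        endpoints[listener] = (host, int(port))
    return endpoints
```

Comparing these endpoints with what clients are configured to use helps spot a listener that advertises an address the client cannot reach.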
  • 28. Help, I’ve found a bug
  • 29. Suspect a problem
    Raise a JIRA: cwiki.apache.org/confluence/display/KAFKA/Reporting+Issues+in+Apache+Kafka and issues.apache.org/jira/projects/KAFKA/issues/KAFKA-8425?filter=allopenissues
    If it’s a big change or has external implications, raise a KIP: cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
    Submit a fix: kafka.apache.org/contributing
  • 31. Thank you. Emma Humber, Support Lead, IBM Event Streams (emma.humber@uk.ibm.com). © Copyright IBM Corporation 2019. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. Any statement of direction represents IBM’s current intent, is subject to change or withdrawal, and represents only goals and objectives. IBM, the IBM logo, and ibm.com are trademarks of IBM Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available at Copyright and trademark information.