Strategies and techniques to optimize Kafka brokers and producers to minimize data loss under huge traffic volume, limited configuration options, and a less than ideal, constantly changing environment, balanced against cost.
2. At 10,000 Feet
Minimize your data loss under these conditions
● Huge volume of data
● Limited configuration options
● Less ideal and constantly changing environment
● Balanced against cost
3. The State of Kafka at Netflix
● Daily average
○ 1 trillion events
○ 3 petabytes of data processed
● At peak
○ 1.26 trillion events / day
○ 20 million events / sec
○ 55 GB / sec
4. The State of Kafka at Netflix
● Managing 3,000+ brokers and ~50 clusters
● Currently on 0.9
● In AWS VPC
7. Deployment Configuration
● Fronting Kafka clusters: 24 clusters, 1700+ instances, d2.2xl instance type, replication factor 2, retention 8 to 24 hours
● Consumer Kafka clusters: 15 clusters, 1100+ instances, i2.2xl instance type, replication factor 2, retention 2 to 4 hours
8. A Peek into the Data
● Business related
○ Session information
○ Device logs
○ Feedback to recommendation and streaming algorithms
● System and infrastructure related
○ Application logs and distributed tracing
9. The Data Loss Philosophy
● Not all data are created equal
● The spectrum of data loss
● Lossless data delivery is not a necessity and should always be balanced against cost
(Figure: the spectrum of data loss, ranging from 0.1% to 5% loss)
10. Data Loss Measurement
● Use producer send callback API
● Related counters
○ Send attempt
○ Send success
○ Send fail → Lost record
● Data loss rate = lost record / send attempt
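A minimal sketch of this measurement with the standard Java producer client; the class and counter names are illustrative, not the actual Netflix implementation:

import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LossCountingSender {
    private final KafkaProducer<String, String> producer;
    private final AtomicLong sendAttempts = new AtomicLong();
    private final AtomicLong sendFailures = new AtomicLong();

    public LossCountingSender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String value) {
        sendAttempts.incrementAndGet();   // send attempt
        producer.send(new ProducerRecord<>(topic, value), new Callback() {
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception != null) {
                    sendFailures.incrementAndGet();   // send fail -> lost record
                }
            }
        });
    }

    // Data loss rate = lost records / send attempts
    public double lossRate() {
        long attempts = sendAttempts.get();
        return attempts == 0 ? 0.0 : (double) sendFailures.get() / attempts;
    }
}

The attempt is counted before the asynchronous send, and a non-null exception in the callback is treated as a lost record.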
11. Design Principles
● Priority is application availability and user
experience
○ Non-blocking event producing
● Minimize data loss into fronting Kafka at reasonable
cost
12. Key Configurations
● acks = 1 for producing
○ Reduce the chance that the producer buffer gets full
● max.block.ms = 0
● 2 replicas → 20% cost saving compared to 3
replicas
● Allow unclean leader election
○ Maximize availability for producers
○ Potential duplicates/loss for consumers
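A minimal sketch of how these producer settings might be expressed with the standard Java client; the bootstrap address and serializers are placeholders, and replication factor 2 plus unclean leader election are broker/topic settings rather than producer settings:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class FrontingProducerFactory {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "fronting-kafka:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks = 1: leader-only acknowledgement keeps latency low so the buffer drains faster
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        // max.block.ms = 0: never block the application thread when the buffer is full
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "0");
        return new KafkaProducer<>(props);
    }
}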
13. The Cloud Reality
● Unpredictable instance lifecycle
● Unstable networking
○ Noisy neighbours
○ Cold start
● Little control over clients
14. ZooKeeper And Controller
● Inconsistent controller state upon session timeout
● Broker’s inability to recover from temporary
ZooKeeper outage
● Can cause major incidents whose root cause is hard to identify
15. Our Producer Data Delivery SLA
● Started from 99.9%
○ Loss was a little higher than the original Chukwa pipeline
○ “At three nines, we lose more data than you generate”
● Some big incidents …
19. Why Messages Are Dropped
● Producer buffer full
● Root causes
○ Slow response from broker
○ Metadata stale / unavailable
○ Client side problems (hardware, traffic)
20. What Has Been Done
● Improve broker availability
○ Optimize broker deployment strategy
○ Get rid of the “bad guys” - elimination of broker outliers
○ Move to AWS VPC - Better networking
● Automated producer configuration optimization
● When in trouble - failover!
21. Change in Deployment Strategy
● Kafka clusters
○ Big clusters with 500 brokers → Small to medium clusters
with 20 to 100 brokers
● ZooKeeper
○ Shared ZooKeeper cluster for all Kafka clusters →
Dedicated ZooKeeper cluster for each fronting Kafka cluster
● Data balancing
○ Uneven distribution of partitions → even distribution of
partitions among brokers
22. Rack Aware Partition Assignment
● Our contribution to Kafka 0.10
● Replicas of each partition are guaranteed to be placed on different “racks”
○ A rack is logical and represents your failure protection domain
● Improved availability
○ OK to lose multiple brokers in the same rack
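For illustration, with Kafka 0.10+ each broker declares its rack in its server.properties; the rack identifier below is a placeholder (e.g. an AWS availability zone, mapped to whatever failure domain you choose):

# server.properties on each broker
broker.rack=us-east-1a

Topic creation and partition reassignment then spread each partition's replicas across the declared racks.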
23. Partition Assignment Without Considering Rack
(Diagram: Brokers 0 and 1 in Rack 0, Brokers 2 and 3 in Rack 1; with 2 replicas per partition and no rack awareness, both replicas of partition 0 end up in Rack 0, so losing Rack 0 takes partition 0 offline.)
24. Rack Aware Partition Assignment
(Diagram: with rack-aware assignment, the two replicas of each partition are split between Rack 0 and Rack 1, so losing all brokers in one rack leaves no partition offline.)
25. Overcome the “Co-location” Problem
● Multiple brokers “killed” at the same time by AWS.
Why?
● Definition
○ Multiple brokers in the same cluster are located on the same physical host in the cloud
● Impact reduced by Rack Aware Partition
Assignment
● Manually apply the trick of “detaching” affected instances from the ASG
26. Outliers
● Origins of outliers
○ Bad hardware
○ Noisy neighbours
○ Uneven workload
● Symptoms of outliers
○ Significantly higher response time
○ Frequent TCP timeouts/retransmissions
27. Cascading Effect of Outliers
(Diagram: event producers → Kafka. A broker with a networking problem causes slow replication; disk reads on lagging replicas then cause slow responses, producer buffers are exhausted, and messages are dropped.)
31. To Kill or Not To Kill, That Is the Question
● The dilemma of terminating brokers
● Automated termination with time-based suppression
○ Use 99th percentile of produce and fetch response time
○ Static threshold
○ Limit one per 24 hours per cluster
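A hypothetical sketch of that suppression logic; the threshold value and the way broker metrics are obtained are assumptions, not the actual tooling:

import java.time.Duration;
import java.time.Instant;

public class OutlierTerminationPolicy {
    // Static threshold on the 99th percentile response time (value is illustrative).
    private static final double P99_THRESHOLD_MS = 1000.0;
    // Suppression window: at most one automated termination per cluster per 24 hours.
    private static final Duration SUPPRESSION = Duration.ofHours(24);

    private Instant lastTermination = Instant.EPOCH;

    // Returns true if the broker with the given p99 response times should be terminated now.
    public synchronized boolean shouldTerminate(double produceP99Ms, double fetchP99Ms, Instant now) {
        boolean isOutlier = produceP99Ms > P99_THRESHOLD_MS || fetchP99Ms > P99_THRESHOLD_MS;
        boolean suppressed = now.isBefore(lastTermination.plus(SUPPRESSION));
        if (isOutlier && !suppressed) {
            lastTermination = now;
            return true;
        }
        return false;
    }
}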
32. Move To AWS VPC
● Huge improvement in networking vs. EC2 Classic
○ Fewer transient networking errors
○ Lower latency
○ Tolerates a higher packet-per-second rate
33. Producer Tuning
● Buffer size tuning
○ Handle transient traffic spike
○ The goal: buffer size large enough to hold 10 seconds of
send data
● “Eager” vs. “lazy” initialization of producers
● Re-instantiate the producer
● Termination of bad clients
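A minimal sketch of the buffer sizing rule, assuming the standard Java producer's buffer.memory setting; the per-producer throughput figure is an illustrative placeholder:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerBufferSizing {
    public static Properties bufferConfig(long sendBytesPerSecond) {
        // Goal: buffer large enough to hold ~10 seconds of send data during a transient spike.
        long bufferBytes = sendBytesPerSecond * 10;
        Properties props = new Properties();
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, Long.toString(bufferBytes));
        return props;
    }

    public static void main(String[] args) {
        // Illustrative: a producer sending ~5 MB/s would get a ~50 MB buffer.
        System.out.println(bufferConfig(5L * 1024 * 1024));
    }
}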
35. When Things Go Wrong - Failover
● Taking advantage of cloud elasticity
● Cold standby Kafka cluster with 0 instances and
ready to scale up
● Different ZooKeeper cluster with no state
● Replication factor = 1