Streaming in Practice - Putting Apache Kafka in Production

1
Streaming in Practice
Putting Apache Kafka in Production
Roger Hoover, Engineer, Confluent

2
Apache Kafka: Online Talk Series
Part 1: September 27 - Introduction To Streaming Data and Stream Processing with Apache Kafka
Part 2: October 6 - Deep Dive into Apache Kafka
Part 3: October 27 - Demystifying Stream Processing with Apache Kafka
Part 4: November 17 - Data Integration with Apache Kafka
Part 5: December 1 - A Practical Guide to Selecting a Stream Processing Technology
Part 6: December 15
https://www.confluent.io/apache-kafka-talk-series/

3
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

6
Architecture
[Diagram: producers and consumers connect to a Kafka cluster (broker 1, broker 2, … broker n, each holding topic partitions), alongside a ZooKeeper cluster (server 1, server 2, server 3).]

7
Operations
• Simple Deployment
• Rolling Upgrades
• Good metrics for component monitoring

9
Two Example Apps
• User activity tracking
  • Collect page view events while users are browsing our web and mobile storefronts
  • Persist the data to HDFS for subsequent use in recommendation engine
• Inventory adjustments
  • Track sales, maintain inventory, and re-order on-demand

10
Application Priorities
• User activity tracking
  • High throughput (100x the sales stream)
  • Availability is most important
  • Low retention required - 3 days
• Inventory adjustments
  • Relatively low throughput
  • Durability is most important
  • Long retention required - 6 months

11
Knobs
- Partition count
- Replication factor
- Retention
- Batching + compression
- Producer send acknowledgements
- Minimum ISRs
- Unclean Leader Election

12
Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics)
- Easy to add consumers but hard to add partitions for keyed topics
- Kafka can support on the order of tens of thousands of partitions
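
The point about keyed topics can be illustrated with a small sketch. It assumes a keyed record goes to `hash(key) % partition_count`; `zlib.crc32` is only a stand-in here (it is not the hash Kafka's default partitioner uses), and `partition_for` and the sample keys are hypothetical names.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing (stand-in hash, for illustration only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical keys; compare their placement before and after adding partitions.
keys = [f"user-{i}" for i in range(1000)]
before = {k: partition_for(k, 50) for k in keys}
after = {k: partition_for(k, 60) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys map to a different partition after resizing")
```

Because most keys change partition when the count changes, per-key ordering across old and new data is lost, which is why starting with more partitions than currently needed tends to be the easier path.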

13
Partition Count
- High Throughput (User activity tracking)
  - Large number of partitions (~100)
- Fewer Resources (Inventory adjustments)
  - Smaller number of partitions (< 50)

14
Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures
[Diagram: four brokers (broker 1 to broker 4), each storing logs for three partition replicas; topic1-part1, topic1-part2, topic2-part1, and topic2-part2 each have three replicas spread across the brokers.]

15
Replication Factor
- Lower cost (User activity tracking)
  - replication.factor = 2
- High Fault Tolerance (Inventory adjustments)
- Defaults to 1
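
As a rough sketch of the trade-off, the arithmetic below assumes every replica stores a full copy of the partition data and that data survives as long as one replica remains. `replication_cost` and the 100 GB figure are hypothetical.

```python
def replication_cost(logical_gb: float, replication_factor: int) -> dict:
    """Estimate storage used and broker failures survivable for a replication factor."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return {
        # Each replica holds a full copy of the data.
        "stored_gb": logical_gb * replication_factor,
        # At least one copy remains after losing this many brokers.
        "broker_failures_tolerated": replication_factor - 1,
    }

for rf in (1, 2, 3):
    print(rf, replication_cost(100, rf))
```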

16
Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
- Part of the consumer’s SLA!

17
Retention
- Less Storage (User activity tracking)
  - log.retention.hours=72 (3 days)
- Longer Time Travel (Inventory adjustments)
  - log.retention.hours=4380 (6 months)
- Default is 7 days
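
Since retention is set per topic, it can help to convert the hour values above to milliseconds (the unit used by the topic-level `retention.ms` setting) and to estimate the disk they imply. This is a back-of-the-envelope sketch; the helper names and the ingest rates are hypothetical.

```python
MS_PER_HOUR = 60 * 60 * 1000

def retention_ms(hours: int) -> int:
    """Convert a retention period in hours to milliseconds."""
    return hours * MS_PER_HOUR

def retained_bytes(bytes_per_sec: float, hours: int, replication_factor: int) -> float:
    """Rough disk footprint: ingest rate x retention window x number of replicas."""
    return bytes_per_sec * hours * 3600 * replication_factor

print("user activity retention.ms:", retention_ms(72))
print("inventory retention.ms:", retention_ms(4380))
# Hypothetical ingest rates: 10 MB/s for activity events, 100 KB/s for inventory.
print("activity GB:", retained_bytes(10e6, 72, 2) / 1e9)
print("inventory GB:", retained_bytes(100e3, 4380, 3) / 1e9)
```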

18
Side-note: Time Travel
- Kafka 0.10.1 supports rewinding by time
- E.g. “Rewind to 10 minutes ago”
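
Rewinding by time amounts to finding the earliest offset whose timestamp is at or after the target. The Java consumer exposes this lookup as `offsetsForTimes`; the sketch below only models the idea on an in-memory list, assuming timestamps are non-decreasing and offsets start at 0. `offset_for_time` is a hypothetical helper.

```python
import bisect

def offset_for_time(timestamps_ms: list, target_ms: int):
    """Return the earliest offset with timestamp >= target_ms, or None if there is none."""
    idx = bisect.bisect_left(timestamps_ms, target_ms)
    return idx if idx < len(timestamps_ms) else None

# Hypothetical partition log: one message per minute.
log = [minute * 60_000 for minute in range(30)]
now_ms = log[-1]
ten_minutes_ago = now_ms - 10 * 60_000
print("seek to offset", offset_for_time(log, ten_minutes_ago))
```

A consumer would then seek to the returned offset and resume reading from there.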

19
Batching & Compression
- Producer: batch.size, linger.ms, compression.type
- Consumer: fetch.min.bytes, fetch.max.wait.ms
[Diagram: the producer's send() calls accumulate into compressed batches (batch 1, batch 2, batch 3) that are flushed asynchronously to the broker; the broker stores the compressed batches, and the consumer's poll() fetches them.]

20
Batching & Compression
- High throughput (User activity tracking)
  - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms) or flush manually
  - Consumer: fetch.min.bytes (256KB), fetch.max.wait.ms (~10ms)
- Low latency (Inventory adjustments)
  - Producer: linger.ms=0
  - Consumer: fetch.min.bytes=1
- Defaults
  - compression.type = none
  - linger.ms = 0 (i.e. send immediately)
  - fetch.min.bytes = 1 (i.e. receive immediately)
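
To see why batching and compression work well together, the sketch below groups records into size-bounded batches and compares compressing each batch against compressing every record on its own. It uses `zlib` from the standard library as a stand-in for lz4, ignores the time-based `linger.ms` trigger, and uses made-up page view records; `make_batches` is a hypothetical helper, not producer code.

```python
import json
import zlib

def make_batches(records: list, batch_size_bytes: int) -> list:
    """Group records into batches whose total size stays within batch_size_bytes."""
    batches, current, current_size = [], [], 0
    for rec in records:
        if current and current_size + len(rec) > batch_size_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(rec)
        current_size += len(rec)
    if current:
        batches.append(current)
    return batches

# Hypothetical page view events with repetitive structure.
records = [
    json.dumps({"user": i % 50, "page": f"/product/{i % 20}", "event": "page_view"}).encode()
    for i in range(2000)
]
individually = sum(len(zlib.compress(r)) for r in records)
batches = make_batches(records, 256 * 1024)
batched = sum(len(zlib.compress(b"".join(b))) for b in batches)
print(f"compressed one by one: {individually} bytes, compressed per batch: {batched} bytes")
```

Larger batches give the compressor more repeated structure to exploit, at the cost of the extra latency spent waiting for a batch to fill.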

21
Producer Acknowledgements on Send
[Diagram: a producer sends to the leader on broker 1, followers on broker 2 and broker 3 replicate from it, and a consumer reads; numbered steps 1 to 4 cover the send, replication to the followers, the commit, and the ack.]

When producer receives ack       Latency                Durability on failures
acks=0 (no ack)                  no network delay       some data loss
acks=1 (wait for leader)         1 network roundtrip    a little data loss
acks=all (wait for committed)    2 network roundtrips   no data loss

22
Producer Acknowledgements on Send
- Throughput++ (User activity tracking)
  - acks = 1
- Durability++ (Inventory adjustments)
  - acks = all
- Default
  - acks = 1
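
The difference between the settings shows up when the leader fails right after a send. The following is a deliberately simplified model, not broker logic: it assumes a single message, a single leader crash, and that `acks=all` acknowledges only after the followers have a copy. `outcome_after_leader_crash` and its return strings are hypothetical.

```python
def outcome_after_leader_crash(acks: str, replicated_before_crash: bool) -> str:
    """Describe what happens to one message if the leader crashes after the send."""
    if acks not in ("0", "1", "all"):
        raise ValueError("acks must be '0', '1' or 'all'")
    if acks == "all":
        # The ack is sent only once the message is committed, so an
        # unreplicated message is never acknowledged and can be retried.
        return "acked and kept" if replicated_before_crash else "not acked, producer can retry"
    # With acks=0 or acks=1 the producer moves on before followers have a copy.
    return "kept" if replicated_before_crash else "lost without the producer knowing"

for acks in ("0", "1", "all"):
    print(acks, "->", outcome_after_leader_crash(acks, replicated_before_crash=False))
```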

23
In-Sync Replicas (ISRs)
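[Diagram: the producer writes to the leader on broker 1; followers on broker 2 and broker 3 fetch from it. All three replicas hold m1 and m2 and belong to the ISR, with a "last committed" marker on the log.]
In-sync: a replica is in sync if it has read up to the leader's log end within replica.lag.time.max.ms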
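
The in-sync rule can be sketched as a filter over the time each replica last caught up with the leader's log end. This is a simplified model under that reading of the rule; `in_sync_replicas`, the broker ids, and the timestamps are hypothetical.

```python
def in_sync_replicas(last_caught_up_ms: dict, now_ms: int,
                     replica_lag_time_max_ms: int, leader: int) -> set:
    """Return brokers that caught up to the leader's log end recently enough."""
    return {
        broker
        for broker, caught_up_at in last_caught_up_ms.items()
        # The leader is always in sync with itself.
        if broker == leader or now_ms - caught_up_at <= replica_lag_time_max_ms
    }

# Broker 3 last caught up 9 seconds ago, beyond the 5 second limit used here.
state = {1: 10_000, 2: 9_500, 3: 1_000}
print(in_sync_replicas(state, now_ms=10_000, replica_lag_time_max_ms=5_000, leader=1))
```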

24
Minimum In-Sync Replicas
[Diagram: topic1 with min.insync.replicas=2. All three replicas hold m1 and m2; only the leader on broker 1 also holds m3, m4, and m5, so the followers on broker 2 and broker 3 lag behind the leader.]
- Topic config to tell Kafka how to handle writes during severe outages (rare)
- Leader will reject writes if the ISR count is too small

25
Minimum In-Sync Replicas
- Availability++ (User activity tracking)
  - min.insync.replicas = 1
- Defaults to 1
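
The write check can be sketched in a few lines. It relies on the fact that the minimum ISR setting is enforced for producers using `acks=all`; `accept_write` is a hypothetical helper, not a broker API.

```python
def accept_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Return True if the leader would accept the write in this simplified model."""
    if acks == "all" and isr_size < min_insync_replicas:
        # Too few in-sync replicas: reject rather than accept an under-replicated write.
        return False
    return True

# With min.insync.replicas=2, a leader-only ISR rejects acks=all writes.
print(accept_write("all", isr_size=1, min_insync_replicas=2))  # False
# With min.insync.replicas=1, the same partition stays writable.
print(accept_write("all", isr_size=1, min_insync_replicas=1))  # True
```

A minimum of 1 favours availability, since writes continue with only the leader in sync; a higher minimum favours durability at the price of rejected writes during an outage.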

26
Unclean Leader Election
- Topic config to tell Kafka how to handle topic leadership during severe outages (rare)
- Allows automatic recovery in exchange for losing data
[Diagram: broker 1, the previous leader, is marked "leader ???"; broker 2 is shown as leader and broker 3 as follower. All replicas hold m1 and m2, but they differ in how many of m3, m4, and m5 they hold, so promoting a replica that is behind loses messages.]

27
Unclean Leader Election
- Availability++ (User activity tracking)
  - unclean.leader.election.enable = true
- Durability++ (Inventory adjustments)
  - unclean.leader.election.enable = false
- Defaults to true
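
The choice can be modelled as a leader election with a fallback. This is a simplified sketch, not the controller's actual algorithm: it prefers a live in-sync replica and only falls back to an out-of-sync one when unclean election is enabled. `elect_leader` and the broker ids are hypothetical. The default of true reflects the Kafka releases current at the time of the talk; the default changed to false in Kafka 0.11.

```python
def elect_leader(isr: list, replicas: list, live_brokers: set, unclean_enabled: bool):
    """Return (new_leader, data_loss_possible); new_leader is None if the partition stays offline."""
    for broker in isr:
        if broker in live_brokers:
            return broker, False  # clean election: no committed data is lost
    if unclean_enabled:
        for broker in replicas:
            if broker in live_brokers:
                return broker, True  # out-of-sync replica takes over, messages may be lost
    return None, False  # wait for an in-sync replica to come back

# Only broker 1 was in sync, and it is down.
print(elect_leader(isr=[1], replicas=[1, 2, 3], live_brokers={2, 3}, unclean_enabled=True))
print(elect_leader(isr=[1], replicas=[1, 2, 3], live_brokers={2, 3}, unclean_enabled=False))
```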

28
Mission Critical Data
- Producer acknowledgments
  - acks=all
- Replication factor
- Minimum ISRs
- Unclean Leader Election
  - unclean.leader.election.enable = false
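
One way to apply this checklist is a small validation helper. The sketch below takes `acks=all` and `unclean.leader.election.enable=false` from the slide; the thresholds for replication factor and minimum ISRs (at least 2 each) are assumptions chosen for illustration, since the slide lists those knobs without values. The fallback defaults mirror the ones quoted earlier in the deck, and `mission_critical_warnings` is a hypothetical name.

```python
def mission_critical_warnings(cfg: dict) -> list:
    """List settings that look risky for data that must not be lost."""
    warnings = []
    if cfg.get("acks", "1") != "all":
        warnings.append("acks should be 'all'")
    if cfg.get("unclean.leader.election.enable", True):
        warnings.append("unclean.leader.election.enable should be false")
    # Assumed thresholds: the slide names these knobs but gives no values.
    if cfg.get("replication.factor", 1) < 2:
        warnings.append("replication.factor of 1 keeps a single copy")
    if cfg.get("min.insync.replicas", 1) < 2:
        warnings.append("min.insync.replicas of 1 lets a write commit on the leader alone")
    return warnings

print(mission_critical_warnings({"acks": "all", "replication.factor": 3}))
```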

30
Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
  • Only when the topic is created or modified
[Diagram: replicas of topic1-part1, topic1-part2, topic2-part1, and topic2-part2 spread across the logs of broker 1 to broker 4.]
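
An even spread at creation time can be approximated with a round-robin assignment. This is a simplification for illustration (Kafka's own assignment also randomizes the starting broker and shifts follower positions); `assign_replicas` is a hypothetical helper.

```python
def assign_replicas(num_partitions: int, replication_factor: int, brokers: list) -> dict:
    """Place each partition's replicas on consecutive brokers, wrapping around."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }

assignment = assign_replicas(num_partitions=4, replication_factor=3, brokers=[1, 2, 3, 4])
for partition, replicas in assignment.items():
    print(partition, replicas)
```

Note that this only balances replica counts. It knows nothing about throughput or retention per topic, which is one reason load drifts out of balance over time.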

31
Replica Placement
• Over time broker load and storage become unbalanced
  • Initial replica placement does not account for topic throughput or retention
  • Adding or removing brokers
[Diagram: broker 1 to broker 4 hold the replicas of topic1 and topic2; a newly added broker 5 holds none.]

32
Replica Reassignment
• Create plan to rebalance replicas
• Upload new assignment to the cluster
• Kafka migrates replicas without disruption
[Diagram: before and after views of broker 1 to broker 5. Before, broker 5 holds no replicas; after, replicas of topic2-part1 and topic2-part2 have moved to broker 5.]

33
Data Balancing: Tricky Parts
• Creating a good plan
  • Balance broker disk space
  • Balance broker load
  • Minimize data movement
  • Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  • Use replication quota feature in 0.10.1
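
Two of these concerns lend themselves to quick arithmetic: how many replicas a candidate plan moves, and how long the move takes under a replication throttle. The sketch below is illustrative only; the helper names, the assignments, and the 100 GiB and 50 MiB/s figures are hypothetical.

```python
def replicas_moved(before: dict, after: dict) -> int:
    """Count replicas placed on a broker that did not hold that partition before."""
    return sum(len(set(after[p]) - set(before.get(p, []))) for p in after)

def transfer_seconds(bytes_to_move: float, throttle_bytes_per_sec: float) -> float:
    """Lower bound on migration time when replication traffic is throttled."""
    return bytes_to_move / throttle_bytes_per_sec

before = {"topic1-0": [1, 2, 3], "topic1-1": [2, 3, 4], "topic2-0": [3, 4, 1], "topic2-1": [4, 1, 2]}
# Two candidate plans that both start using a new broker 5.
plan_a = {"topic1-0": [1, 2, 3], "topic1-1": [2, 3, 4], "topic2-0": [3, 4, 5], "topic2-1": [5, 1, 2]}
plan_b = {"topic1-0": [5, 1, 2], "topic1-1": [3, 4, 5], "topic2-0": [4, 5, 1], "topic2-1": [2, 3, 4]}

print("plan A moves", replicas_moved(before, plan_a), "replicas")
print("plan B moves", replicas_moved(before, plan_b), "replicas")
print("seconds to move 100 GiB at 50 MiB/s:", transfer_seconds(100 * 1024**3, 50 * 1024**2))
```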

34
Data Balancing: Solutions
• DIY
  • kafka-reassign-partitions.sh script in Apache Kafka
• Confluent Enterprise Auto Data Balancing
  • Optimizes storage utilization
  • Rack awareness and minimal data movement
  • Leverages replication quotas during rebalance
36
Use cases
• Disaster Recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of hybrid cloud or cloud migration strategy

37
Multi-DC: Two Approaches
• Stretched cluster
• Mirroring across clusters

38
Stretched Cluster
• Low-latency links between 3 DCs. Typically AZs in a single AWS region.
• Applications in all 3 DCs share the same cluster and handle failures automatically.
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
• Use rack awareness in Kafka 0.10; manual partition placement otherwise
[Diagram: a single Kafka cluster stretched across AZ 1, AZ 2, and AZ 3 within one AWS region, with producers and consumers in each AZ.]

39
Mirroring Across Clusters
• Separate Kafka clusters in each DC. Mirroring process copies data between them.
• Several variations of this pattern. Some require manual intervention on failover and recovery.

40
How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
  • Manual topic creation
  • Manual sync of topic configuration
• Confluent Enterprise Multi-DC
  • Dynamic topic creation at the destination
  • Automatic sync for topic configurations (including access controls)
  • Can be configured and managed from the Control Center UI
  • Leverages Connect API

41
More Information: Tuning Tradeoffs
• Apache Kafka and Confluent Documentation
• When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka
  • Gwen Shapira and Jeff Holoman - https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Chapter 6: Reliability Guarantees
  • Neha Narkhede, Gwen Shapira, Todd Palino - Kafka: The Definitive Guide
• Confluent Operations Training

42
More Information: Multi-DC
• Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka - Jun Rao
  • Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC - https://www.confluent.io/product/multi-datacenter/

43
More Information: Metadata Management
• Yes, Virginia, You Really Do Need a Schema Registry
  • Gwen Shapira - https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/

44
Thank you!
www.kafka-summit.org
May 8, 2017: New York City, Hilton Midtown
August 28, 2017: San Francisco, Hilton Union Square
