Enterprise Kafka: Kafka as a Service

Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.

NOTE: I highly recommend viewing the original PPT. It has copious speaker notes for each slide, and the animations will actually work properly.

Published in: Data & Analytics

  1. Enterprise Kafka
     SITE RELIABILITY ENGINEERING ©2014 LinkedIn Corporation. All Rights Reserved.
  2. Why Am I Here?
     - You want to find out what this “Kafka” thing is
     - You’re running Kafka, but you want to go big
     - You’re looking for some neat whizbangs
  3. Clark Haskins, Todd Palino
  4. Who Are We?
     - Kafka SRE at LinkedIn
     - Site Reliability Engineering
       - Administrators
       - Architects
       - Developers
     - Keep the site running, always
  5. Kafka Overview
  6. What Is Kafka?
  7. What Is Kafka?
     [Diagram labels: Producer, Consumer, Zookeeper, Broker; partition labels A P0, A P1]
  8. Attributes of a Kafka Cluster
     - Disk Based
     - Durable
     - Scalable
     - Low Latency
     - Finite Retention
     - NOT Idempotent (yet)
  9. Kafka At LinkedIn
     - Multiple Datacenters, Multiple Clusters
     - Mirroring between clusters
     - Message Types
       - Metrics
       - Tracking
       - Queuing
     - Data transport from applications to Hadoop, and back
  10. Kafka At LinkedIn
  11. Kafka At LinkedIn
      - 300+ Kafka brokers
      - Over 18,000 topics
      - 140,000+ Partitions
      - 220 Billion messages per day
      - 40 Terabytes In
      - 160 Terabytes Out
      - Peak Load
        - 3.25 Million messages per second
        - 5.5 Gigabits/sec Inbound
        - 18 Gigabits/sec Outbound
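The daily totals and the peak figures on the Kafka At LinkedIn slide can be related with a little arithmetic. The sketch below assumes the terabyte figures are per day and use decimal terabytes; the slide states neither, so treat the derived rates as rough. Under those assumptions the cluster averages about 2.5 million messages per second, the quoted peak is roughly 1.3 times that average, and each byte written is read about four times.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
MESSAGES_PER_DAY = 220e9
PEAK_MESSAGES_PER_SEC = 3.25e6
TB_IN_PER_DAY = 40    # assumption: per day, decimal terabytes
TB_OUT_PER_DAY = 160  # assumption: per day, decimal terabytes


def tb_per_day_to_gbit_per_sec(terabytes):
    """Convert a daily volume in decimal terabytes to an average rate in Gbit/s."""
    bits_per_day = terabytes * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


avg_messages_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
peak_to_average = PEAK_MESSAGES_PER_SEC / avg_messages_per_sec
fan_out = TB_OUT_PER_DAY / TB_IN_PER_DAY  # bytes read per byte written

print(f"average messages/sec: {avg_messages_per_sec:,.0f}")
print(f"peak / average:       {peak_to_average:.2f}")
print(f"read fan-out:         {fan_out:.1f}x")
print(f"average inbound:      {tb_per_day_to_gbit_per_sec(TB_IN_PER_DAY):.2f} Gbit/s")
print(f"average outbound:     {tb_per_day_to_gbit_per_sec(TB_OUT_PER_DAY):.2f} Gbit/s")
```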
  12. Challenges We Have Overcome
  13. Solutions
      - Kafka is young, so we influenced development
      - Operations wizardry…
  14. Hyper Growth
      - Need to expand clusters to keep up with site traffic, and then balance them.
  15. Adding brokers
      [Diagram labels: Brokers, Consumers, Producers; partition labels A P0-P7, B P0-P7, C P0-P3]
  16. Adding a broker (with broker leveling)
      [Diagram labels: Brokers, Consumers, Producers; partition labels A P0-P7, B P0-P7, C P0-P3]
  17. Logs vs. Metrics
      - Logging data killed the metrics cluster
  18. Quality of Service with Kafka
      [Diagram labels: Brokers, Consumers, Producers; partition labels A P0-P7, B P0-P7, C P0-P3]
  19. Deployment Nightmares
      - Parallel deployment wasn’t possible, so…
      - Babysitting sequential deployments
  20. Easy deployments
      - Kafka 0.8.1 makes sure the cluster is in a good state before shutting down
        - If any brokers in the cluster have under-replicated partitions, Kafka will not shut down
        - Kafka ensures that only one broker is in its shutdown sequence at a time.
  21. Killing Zookeeper
      - Consumer offset management is done within Zookeeper
      - Every consumer committing offsets every minute for every partition makes ZK very unhappy.
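The load described on the Killing Zookeeper slide can be estimated with a back-of-the-envelope formula: every committed offset is one Zookeeper write, so the write rate is the number of (consumer group, partition) pairs divided by the commit interval. The function name and the example numbers below are hypothetical and only illustrate the scaling; they are not LinkedIn's figures.

```python
def zk_offset_writes_per_second(consumer_groups, partitions_per_group,
                                commit_interval_s=60.0):
    """Estimate Zookeeper writes/sec caused by periodic offset commits.

    Assumes each consumer group commits one offset per partition it reads
    once per commit interval (the slide says "every minute").
    """
    if commit_interval_s <= 0:
        raise ValueError("commit_interval_s must be positive")
    return consumer_groups * partitions_per_group / commit_interval_s


# Hypothetical example: 50 groups, each reading 12,000 partitions.
rate = zk_offset_writes_per_second(consumer_groups=50, partitions_per_group=12_000)
print(f"{rate:,.0f} Zookeeper writes per second")
```

Because the rate grows with the product of groups and partitions, adding either consumers or partitions raises the pressure on Zookeeper even when message volume stays flat.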
  22. Zookeeper on SSD
  23. Monitoring
  24. Kafka Is Broken!
  25. Kafka Is Broken!
      - Everything is Kafka’s fault first
      - What is lag?
      - Consumer Problems
        - Application problems
        - Kafka client problems
  26. How Do We Sleep At Night?
      - Educating Users
        - Why lag is their fault
      - Monitoring the Ecosystem
        - Kafka Brokers
        - Zookeeper
        - Mirror Makers
        - Audit
        - REST Interfaces
      - Week Over Week
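The slides ask "What is lag?" without spelling out an answer. A common working definition, used here as an assumption rather than as the speakers' exact wording, is the distance between the newest offset in a partition and the offset the consumer has committed. A minimal sketch with hypothetical function names:

```python
def partition_lag(log_end_offset, committed_offset):
    """Messages the consumer still has to read in one partition."""
    return max(0, log_end_offset - committed_offset)


def total_lag(log_end_offsets, committed_offsets):
    """Sum lag over all partitions.

    Both arguments map partition id -> offset. A partition with no committed
    offset is treated as committed at 0, which is a simplifying assumption.
    """
    return sum(
        partition_lag(end, committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )


log_end = {0: 100, 1: 250}
committed = {0: 90, 1: 250}
print(total_lag(log_end, committed))
```

Defined this way, lag grows whenever the consumer reads more slowly than producers write, which is why the slides file it under consumer problems rather than broker problems.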
  27. Cluster Health and Utilization
      - Under-replicated partitions
      - Offline partitions
      - Broker partition count
      - Data size on disk
      - Leader partition count
      - Network utilization
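Several of the health signals listed on the Cluster Health and Utilization slide can be derived from partition metadata alone. The sketch below assumes a hypothetical `PartitionState` record holding the leader, the assigned replicas, and the in-sync replica set (ISR); it is not a Kafka API. It treats a partition as under-replicated when its ISR is smaller than its replica list, and as offline when it has no leader.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition's metadata."""
    topic: str
    partition: int
    leader: Optional[int]   # broker id, or None when no leader is available
    replicas: List[int]     # brokers assigned to hold a copy
    isr: List[int]          # replicas currently in sync with the leader


def under_replicated(partitions):
    return [p for p in partitions if len(p.isr) < len(p.replicas)]


def offline(partitions):
    return [p for p in partitions if p.leader is None]


def partitions_per_broker(partitions):
    return Counter(broker for p in partitions for broker in p.replicas)


def leaders_per_broker(partitions):
    return Counter(p.leader for p in partitions if p.leader is not None)


cluster = [
    PartitionState("A", 0, leader=1, replicas=[1, 2], isr=[1, 2]),
    PartitionState("A", 1, leader=2, replicas=[2, 3], isr=[2]),
    PartitionState("B", 0, leader=None, replicas=[3, 1], isr=[]),
]
print(len(under_replicated(cluster)), len(offline(cluster)))
print(partitions_per_broker(cluster), leaders_per_broker(cluster))
```

Comparing the two counters across brokers gives a quick view of whether replicas and leadership are spread evenly.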
  28. Zookeeper
      - Ensemble availability
      - Latency
      - Outstanding requests
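One way to collect the latency and outstanding-request numbers from the Zookeeper slide is the `mntr` four-letter command, which returns tab-separated key/value lines. The sketch below is an illustration, not the monitoring the speakers used: the alert thresholds are hypothetical, and newer Zookeeper releases only answer `mntr` if it has been allowed in the server configuration.

```python
import socket


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send the `mntr` four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn tab-separated `key<TAB>value` lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def zk_alerts(stats, max_avg_latency_ms=10.0, max_outstanding=10):
    """Return the metric names that exceed the (hypothetical) thresholds."""
    alerts = []
    if float(stats.get("zk_avg_latency", 0)) > max_avg_latency_ms:
        alerts.append("zk_avg_latency")
    if int(stats.get("zk_outstanding_requests", 0)) > max_outstanding:
        alerts.append("zk_outstanding_requests")
    return alerts
```

Polling every ensemble member with `fetch_mntr` and treating a connection failure as "member down" gives a rough availability check as well.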
  29. Mirror Maker and Audit
      - Mirror Maker
        - Lag
        - Dropped Messages
      - Audit Consumer
        - Lag
        - Completeness check
      - Audit UI
      [Diagram labels: Producer, Cluster, MM, Cluster, Messages, Message Counts, Audit Consumer, All Messages, Audit State, Audit UI]
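The completeness check on the Mirror Maker and Audit slide can be pictured as comparing the message counts producers report with the counts an audit consumer actually observes. The sketch below is a simplified guess at that idea, not LinkedIn's audit implementation; the time-bucket keys and function names are hypothetical.

```python
def completeness(produced, consumed):
    """Return the consumed/produced ratio for each time bucket.

    Both arguments map a time-bucket label to a message count. Buckets with
    nothing produced are skipped to avoid dividing by zero.
    """
    ratios = {}
    for bucket, sent in produced.items():
        if sent > 0:
            ratios[bucket] = consumed.get(bucket, 0) / sent
    return ratios


def incomplete_buckets(produced, consumed, threshold=1.0):
    """List the buckets where fewer messages arrived than were produced."""
    ratios = completeness(produced, consumed)
    return sorted(bucket for bucket, ratio in ratios.items() if ratio < threshold)


produced_counts = {"12:00": 1000, "12:10": 500}
consumed_counts = {"12:00": 1000, "12:10": 450}
print(completeness(produced_counts, consumed_counts))
print(incomplete_buckets(produced_counts, consumed_counts))
```

Running the same comparison on each side of a mirror would show whether messages were lost in the source cluster or in transit.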
  30. Audit UI
  31. Audit UI
  32. Tuning
  33. Hardware and OS
      - Kernel Tuning
        - Swapping is Death
        - Allow more dirty pages
        - Allow less dirty cache
      - Disk throughput
        - More spindles
        - Longer commit interval
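On Linux, the kernel tuning bullets plausibly correspond to the `vm.swappiness`, `vm.dirty_ratio`, and `vm.dirty_background_ratio` sysctls; that mapping is an interpretation, since the slide names directions rather than settings. The sketch below reads the current values from `/proc/sys/vm` and compares them with illustrative limits. The numbers in `LIMITS` are placeholders chosen for the example, not values from the talk.

```python
from pathlib import Path

# Illustrative limits only: the slide gives directions, not numbers.
LIMITS = {
    "swappiness": ("max", 1),               # "Swapping is Death"
    "dirty_background_ratio": ("max", 5),   # "Allow less dirty cache" (assumed mapping)
    "dirty_ratio": ("min", 60),             # "Allow more dirty pages" (assumed mapping)
}


def read_vm_settings(names, root="/proc/sys/vm"):
    """Read integer sysctl values from the Linux procfs tree."""
    return {name: int(Path(root, name).read_text().strip()) for name in names}


def check_vm_settings(values, limits=LIMITS):
    """Return the names of settings that fall outside their limit."""
    problems = []
    for name, (kind, bound) in limits.items():
        value = values[name]
        if kind == "max" and value > bound:
            problems.append(name)
        elif kind == "min" and value < bound:
            problems.append(name)
    return problems


if __name__ == "__main__":
    print(check_vm_settings(read_vm_settings(LIMITS)))
```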
  34. Java Virtual Machine
  35. Garbage Collection
  36. Garbage Collection
      - Java 7, update 51
      - Garbage First (G1) Collector
        - Set the heap size
        - Specify a target GC pause time
        - Don’t set the New size
      - GC Times
        - Less than 15 ms per second in GC
        - Steady 20-22 ms GC intervals
        - Almost no full GC cycles (and only 200-400 ms when they occur)
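The three G1 bullets translate into a small set of standard HotSpot flags: a fixed heap, `-XX:+UseG1GC`, and `-XX:MaxGCPauseMillis`, with no explicit young-generation size. The sketch below assembles such a flag list. The heap size and pause target passed in are hypothetical, and pinning `-Xms` to `-Xmx` is one common reading of "set the heap size" rather than something the slide specifies.

```python
def g1_jvm_flags(heap_size, pause_target_ms):
    """Build JVM flags for G1 with a fixed heap and a pause-time target.

    Deliberately omits -Xmn / -XX:NewSize so that G1 can size the young
    generation itself, as the slide advises.
    """
    return [
        f"-Xms{heap_size}",
        f"-Xmx{heap_size}",
        "-XX:+UseG1GC",
        f"-XX:MaxGCPauseMillis={pause_target_ms}",
    ]


# Hypothetical values for illustration only.
flags = g1_jvm_flags("4g", 20)
print(" ".join(flags))
```

With the stock Kafka start scripts, the heap flags would typically go in `KAFKA_HEAP_OPTS` and the collector flags in `KAFKA_JVM_PERFORMANCE_OPTS`.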
  37. Closing
  38. What’s Coming in 0.8.2
      - Consumer offsets in the broker
      - Delete topic
      - Further down the road
        - New producer
        - Improved producer API
  39. Upcoming Operational Work
      - Learning to share
      - Shrinking a cluster
      - Cluster comparison
      - Advanced monitoring
  40. How Can You Get Involved?
      - http://kafka.apache.org
      - Join the mailing lists
        - users@kafka.apache.org
      - irc.freenode.net - #apache-kafka
      - Contribute tools
  41. Talk To Us
      - Kafka SREs at LinkedIn
        - Clark Haskins
          - https://www.linkedin.com/in/clarkhaskins
          - chaskins@linkedin.com
        - Todd Palino
          - https://www.linkedin.com/in/toddpalino
          - tpalino@linkedin.com
  42. Questions
