
How to Fail at Kafka

View recording here: https://www.confluent.io/online-talks/how-to-fail-at-apache-kafka-on-demand

Apache Kafka® is used by thousands of companies across the world, but how difficult is it to operate? Which parameters do you need to set? What can go wrong? This online talk is based on real-world experience of Kafka deployments and explores a collection of common mistakes made when running Kafka in production, along with best practices for avoiding them.

Watch now to learn:
- How to ensure your Kafka data is never lost
- How to write code to cope when things go wrong
- How to ensure data governance between producers and consumers
- How to monitor your cluster

Join Apache Kafka expert Pete Godfrey for this engaging talk and delve into best-practice ideas and insights.


How to Fail at Kafka

  1. How to Fail at Kafka. Pete Godfrey, Systems Engineer - Confluent
  2. Problem?
  3. With
  4. In short: Store & ETL Process, Publish & Subscribe
  5. From a simple idea
  6. From a simple idea
  7. ...with great properties! • Scalability • Retention • Durability • Replication • Security • Resiliency • Throughput • Ordering • Exactly-Once Semantics • Transactions • Idempotency • Immutability • …
  8. What could possibly go wrong?
  9. 1. Not thinking about Durability
  12. Data durability: by default, Kafka does not wait for a disk flush. Durability is achieved through replication.
  16. Data durability: is my data safe?
  17. Data durability: is my data safe? It depends on your configuration...
  21. Data durability: acks=1 (the default) is good for latency; acks=all is good for durability.
  23. Parameter: acks=all. The leader waits for the full set of in-sync replicas to acknowledge the record.
  24. Parameter: min.insync.replicas. The minimum number of replicas that must acknowledge a write. Default: 1.
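
Together, these two parameters define the durability contract. Below is a minimal sketch of creating a topic that enforces it, assuming a local broker at localhost:9092 and an illustrative topic name "orders" (note that min.insync.replicas is a topic-level or broker-level setting, not a producer setting):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateDurableTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            try (Admin admin = Admin.create(props)) {
                // 6 partitions, replication factor 3; a write with acks=all must be
                // acknowledged by at least 2 in-sync replicas before it succeeds
                NewTopic topic = new NewTopic("orders", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }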
  28. Defaults: the default values are optimized for availability and latency. If durability is more important, tune them!
  29. Data durability while producing? Tune it with the parameters acks and min.insync.replicas.
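
On the producer side, a minimal sketch under the same assumptions (local broker, illustrative "orders" topic): with acks=all, an under-replicated write surfaces as an error in the send callback instead of being silently accepted.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DurableProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // favor durability over latency
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key-1", "value-1"),
                        (metadata, exception) -> {
                            // with acks=all, too few in-sync replicas shows up as an
                            // exception here rather than as silent data loss
                            if (exception != null) exception.printStackTrace();
                        });
            }
        }
    }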
  30. 2. Assuming everything will always work
  33. Parameter: retries. Causes the client to resend any record whose send fails with a potentially transient error. Default value: 0.
  35. Parameter: retries. Use the built-in retries! Bump it from 0 to infinity!
  36. Parameter: retries. But now you are exposed to a different kind of issue: duplicate records…
  38. Parameter: enable.idempotence. When set to 'true', the producer ensures that exactly one copy of each message is written. Default value: false.
  41. Use the built-in idempotency!
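
As a configuration sketch (constant names are from the standard Java producer; the combination below reflects the slides' advice, not code from the deck): retry forever, and let idempotence de-duplicate the retried batches.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class SafeProducerSettings {
        // Retry forever, but let the broker discard duplicate retried batches.
        static Properties safeProps(String bootstrapServers) {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);        // "from 0 to infinity"
            p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);          // exactly one copy per message
            p.put(ProducerConfig.ACKS_CONFIG, "all");                       // required by idempotence
            p.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // keeps ordering with retries
            return p;
        }
    }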
  42. 3. No exception handling
  43. Error handling: we don't expect the unexpected until the unexpected is expected.
  45. Infinite retry
  47. Write to a dead letter queue and continue
  49. Ignore and continue
  51. No silver bullet
  52. Handle the exceptions! https://eng.uber.com/reliable-reprocessing/
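
A minimal sketch of the dead-letter-queue option from slide 47, with hypothetical topic names ("orders", "orders.dlq") and a placeholder process() method; the linked Uber post describes a richer variant with retry topics:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DlqConsumer {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");         // hypothetical group
            cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092");
            pProps.put("key.serializer", StringSerializer.class.getName());
            pProps.put("value.serializer", StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
                 KafkaProducer<String, String> dlq = new KafkaProducer<>(pProps)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        try {
                            process(record); // hypothetical business logic
                        } catch (Exception e) {
                            // park the poison record and keep the partition moving
                            dlq.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
                        }
                    }
                    consumer.commitSync();
                }
            }
        }

        static void process(ConsumerRecord<String, String> record) { /* placeholder */ }
    }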
  53. 4. No data governance
  56. Governance: changes in producers might impact consumers
  57. Governance: Schema Registry
  59. Share schemas
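
A sketch of what sharing schemas looks like with Confluent's Avro serializer, assuming a Schema Registry at a hypothetical http://localhost:8081: the serializer registers the record's schema, and the registry's compatibility checks prevent a producer change from breaking existing consumers. The "Order" record and its field are illustrative.

    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AvroOrderProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");              // assumed address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "io.confluent.kafka.serializers.KafkaAvroSerializer"); // registers schemas
            props.put("schema.registry.url", "http://localhost:8081");     // assumed address

            // The contract both producers and consumers agree on
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Order\"," +
                    "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
            GenericRecord order = new GenericData.Record(schema);
            order.put("id", "order-42");

            try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", order));
            }
        }
    }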
  60. 5. Inadequate Network Bandwidth
  61-69. Partition Leadership and Replication (diagram sequence spanning nine slides): Topic1 has four partitions spread across four brokers, each partition with one leader and two follower replicas. When Broker 4 fails, leadership moves to surviving replicas and Broker 4's partitions are re-created on Brokers 1-3, so every byte it hosted has to be copied across the network again.
  70. Plan for moving data
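
A back-of-envelope sketch of the planning involved; every number below is an assumption, not a figure from the talk:

    public class RereplicationEstimate {
        public static void main(String[] args) {
            double brokerDataGB = 2_000;    // assumed data volume on the failed broker
            double nicGbps = 10;            // assumed NIC capacity per broker
            double replicationShare = 0.5;  // bandwidth left over after client traffic
            double usableGBps = nicGbps / 8.0 * replicationShare;  // 0.625 GB/s
            double hours = brokerDataGB / usableGBps / 3600.0;     // ~0.9 hours
            System.out.printf("Full replication restored in ~%.1f hours%n", hours);
        }
    }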
  71. 6. No monitoring
  72. Gather JMX metrics! For every broker and every client! Create dashboards for Apache Kafka! Use Confluent Control Center! https://docs.confluent.io/current/kafka/monitoring.html
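
Besides JMX scraping, each Java client exposes the same metrics programmatically; a minimal sketch that reads the standard producer metric record-error-rate (broker address assumed):

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ClientMetricsPeek {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // the same values a JMX exporter would scrape
                Map<MetricName, ? extends Metric> metrics = producer.metrics();
                metrics.forEach((name, metric) -> {
                    if ("record-error-rate".equals(name.name())) {
                        System.out.println(name.name() + " = " + metric.metricValue());
                    }
                });
            }
        }
    }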
  73. Questions You Should Be Able to Answer: Are all your services behaving properly and meeting SLAs? ● Are applications receiving all data? ● Are my business applications showing the latest data? ● Why are the applications running slowly? ● Do we need to scale up? ● Can any data get lost? ● Will there be service interruptions? ● Are there assurances in case of a disaster event?
  74. Control Center: The Simplest Way to Build, Control and Understand Apache Kafka
      Look inside Kafka: ● Inspect messages in topics ● View / edit Schema Registry
      Meet event streaming SLAs: ● Track KPI for event streams ● View consumer lag ● Receive automatic alerts
      Build pipelines and process streams: ● Manage multiple Connect / KSQL clusters ● Add and remove connectors ● Write KSQL queries
      View Kafka clusters at a glance: ● Check Kafka broker health ● View and dynamically change broker configs
  75. Topic Inspection
  76. Schema Registry Integration
  77. KSQL User Interface
  78. Event Stream KPI
  79. Consumer Lag
  80. Check Kafka Cluster Health
