DevOps Fest 2020. Serhii Kalinets. Building Data Streaming Platform with Apache Kafka

  1. Building Data Streaming Platform with Apache Kafka Serhii Kalinets System Architect
  2. History of Kafka: Created at LinkedIn. Its creators then founded Confluent. Why the name Kafka? Jay Kreps (Confluent CEO): "I thought that since Kafka was a system optimized for writing, using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project."
  3. Kafka use cases: message broker; logs; commit log; streaming
  4. What is Kafka: A publish/subscribe messaging system that has an interface typical of messaging systems but a storage layer more like a log-aggregation system
  5. Messaging system: messages; topics; partitions; producers; consumers
  6. Messages: Key / value pair; both can be null. Kafka treats both just as bytes. Serialization / deserialization happens on the clients. The Confluent broker can validate messages against a schema.
  8. How many partitions? What throughput do you expect to achieve for the topic? What is the maximum throughput you expect to achieve when consuming from a single partition? Producer throughput can usually be ignored, since producers are typically much faster than consumers.
  9. How many partitions? Adding partitions later can be very challenging. Consider the number of partitions you will place on each broker, and the available disk space and network bandwidth per broker. Avoid overestimating: each partition uses memory and other resources on the broker and increases the time needed for leader elections.
  10. Producers: Can specify the partition explicitly or implicitly (via partitioners). The decision is made on the producer side. Different SDKs might have different default partitioners. Adding new partitions can change partition assignments.
  11. Producer guarantees: Kafka guarantees ordering within a partition for producers. Ordering can be broken by retries when more than one request is in flight (> 1). Idempotent producers (retries will not cause duplicates). Transactions (messages sent within a transaction become available to consumers only after the transaction completes).
  12. Consumer groups: Common. One consumer is the group leader (the group coordinator itself is a broker). Poll loop is simple for the developer: while (true) { consumer.poll(); processMessages(); } Complicated implementation: coordination, rebalancing, heartbeats, etc.
  13. Commits and offsets: Consumers commit their last offsets to Kafka. Automatic / manual commits. Sync / async commits. auto.offset.reset: where to start reading when there is no committed offset (start or end of the partition, i.e. earliest or latest).
  14. Datastore: partitions; replicas; segments; compaction
  15. Replication
  16. Default topic configuration: replication factor = 3; min.insync.replicas = 2; in producers: acks = all
  17. Segments: Physical files with raw data. Kafka keeps open handles to all segments, including inactive ones. Writes go to the active segment. Retention and compaction are applied only to inactive segments.
  18. Retention: Kafka does not wait until all consumers have read the data. Retention by time. log.retention.bytes: retention by size (per partition). log.segment.bytes: size at which the active segment is closed. Time after which the active segment is closed.
  19. Compaction: removes old data, keeping the latest value for each key
  20. Compaction: When to compact messages. To delete an event, send a new message with the same key and a null value (tombstone). When a tombstone can be deleted (the default is 24 hours). The compaction process is configurable (number of threads, resource consumption, frequency, etc.).
  21. Brokers: The cluster uses ZooKeeper to handle membership. One of the brokers is the controller (leader); it is responsible for partition leader election. There are plans to get rid of ZooKeeper.
  22. Kafka guarantees: durability and high availability; message ordering within a partition; at least once / exactly once; transactions
  23. Kafka Streams: High-level DSL for working with Kafka topics as streams. Currently JVM only (Java / Scala). The DSL is rather simple (kind of map / join / reduce). Supports joins, filters, aggregations. Streams and tables. Handles all the low-level stuff.
  24. Kafka Streams
  25. Kafka Connect: A framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Built on Kafka's producer and consumer APIs. Deploys as a cluster via operators / Helm charts. Configurable via a REST endpoint.
  26. Add connector to mysql echo '{"name":"mysql-login-connector", "config":{"connector.class": "JdbcSourceConnector", "connection.url":"jdbc:mysql:// user=root", "mode":"timestamp","table.whitelist":"login", "validate.non.null":false, "":"login_time","topic.prefix":"mysql."}}' | curl -X POST -d @- http://localhost:8083/connectors --header "content-Type:application/json"
  28. ksqlDB is an event streaming database: SQL on top of Kafka Streams, plus materialized views
  29. ksqlDB components: Streams: immutable sequences of events. Tables: mutable sequences of events. Stream processing: transform, filter, aggregate and join. Push queries let you subscribe to a query's result as it changes in real time. Pull queries let you fetch the current state of a materialized view.
  30. Creating tables CREATE TABLE currentCarLocations ( vehicleId VARCHAR, latitude DOUBLE(10, 2), longitude DOUBLE(10, 2) ) WITH ( kafka_topic = 'locations', partitions = 3, key = 'vehicleId', value_format = 'json' );
  31. Queries SELECT vehicleId, latitude, longitude FROM currentCarLocations WHERE ROWKEY = '6fd0fcdb' EMIT CHANGES;
  32. Advantages: Non-developers can write their own queries. Read from and write to many data sources. Much less code, so fewer bugs. Data exploration.
  33. Our Roadmap Consumer / producer API Kafka Streams / Connect ← we are here ksqlDB
  34. Thanks! @skalinets
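
Slides 8 and 9 size a topic from two numbers: the throughput expected for the topic and the throughput one consumer can sustain on a single partition. The arithmetic can be sketched as below. The function name and the MB/s figures are hypothetical, and the sketch follows the slide in ignoring producer throughput. It gives a lower bound only; the per-broker disk, network and memory limits from slide 9 still apply.

```python
import math


def estimate_partitions(target_mb_per_s: float,
                        consumer_mb_per_s_per_partition: float) -> int:
    """Smallest partition count that lets consumers keep up with the topic.

    Hypothetical helper: it only divides the target topic throughput by the
    throughput a single consumer achieves on one partition and rounds up.
    """
    if target_mb_per_s <= 0 or consumer_mb_per_s_per_partition <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / consumer_mb_per_s_per_partition)


# Assumed numbers for illustration: the topic should carry 1000 MB/s and
# one consumer handles about 50 MB/s from a single partition.
print(estimate_partitions(1000, 50))
```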
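
Slide 10 notes that adding partitions can change partition assignments. A small sketch of why, assuming a partitioner that hashes the key and takes the remainder by the partition count. Here `zlib.crc32` is only a stand-in hash chosen for illustration (the Java client, for example, uses murmur2), and `pick_partition` and the key names are hypothetical.

```python
import zlib


def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing it (stand-in hash, not Kafka's)."""
    return zlib.crc32(key) % num_partitions


# Hypothetical keys; compare their partitions before and after growing
# the topic from 6 to 8 partitions.
keys = [f"user-{i}".encode() for i in range(100)]
before = {k: pick_partition(k, 6) for k in keys}
after = {k: pick_partition(k, 8) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed partition")
```

Because ordering is only guaranteed within a partition, a key that moves will have its newer messages in a different partition than its older ones.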
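
The configuration on slide 16 (replication factor 3, min.insync.replicas 2, acks = all) trades availability against durability. A toy model of that trade-off, under the simplifying assumptions that every live replica is in sync and each failed broker held exactly one replica of the partition; `can_accept_writes` is a hypothetical helper, not a Kafka API.

```python
def can_accept_writes(replication_factor: int,
                      min_insync_replicas: int,
                      failed_brokers: int) -> bool:
    """With acks=all, a write succeeds only while enough replicas are in sync."""
    # Assumption: every surviving replica is still in sync.
    in_sync = replication_factor - failed_brokers
    return in_sync >= min_insync_replicas


# Settings from slide 16: replication factor 3, min.insync.replicas 2.
for failed in range(3):
    print(failed, can_accept_writes(3, 2, failed))
```

Under these assumptions the partition keeps accepting writes with one broker down and rejects them with two down, while acknowledged messages exist on at least two replicas.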
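
The compaction behaviour from slides 19 and 20 can be illustrated with an in-memory model: keep only the latest record per key, and treat a null value as a tombstone that is eventually dropped. This is a sketch of the idea, not of the broker's implementation; `compact`, the `drop_tombstones` flag and the sample records are hypothetical.

```python
from typing import Optional

Record = tuple[str, Optional[str]]  # (key, value); value None is a tombstone


def compact(log: list[Record], drop_tombstones: bool = False) -> list[Record]:
    """Keep the last record for every key, preserving log order."""
    latest: dict[str, int] = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset  # later offsets overwrite earlier ones
    kept = [rec for offset, rec in enumerate(log) if latest[rec[0]] == offset]
    if drop_tombstones:
        # Models the point at which tombstones are allowed to be deleted.
        kept = [rec for rec in kept if rec[1] is not None]
    return kept


log = [("a", "1"), ("b", "1"), ("a", "2"), ("b", None), ("c", "1")]
print(compact(log))
print(compact(log, drop_tombstones=True))
```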