Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Message Delivery Semantics
● At most once
– Messages may be lost but are never redelivered.
● At least once
– Messages are never lost but may be redelivered.
● Exactly once
– This is what people actually want.
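The difference between the first two guarantees comes down to when the consumer commits its offset relative to processing. A minimal sketch (a toy simulation, not the real Kafka consumer API) of a consumer that crashes once mid-message and resumes from its last committed offset:

```python
def consume(messages, crash_at, commit_first):
    """Simulate a consumer that crashes once while handling messages[crash_at],
    then resumes from the last committed offset. Returns what got processed."""
    processed, committed, offset, crashed = [], 0, 0, False
    while offset < len(messages):
        if commit_first:
            committed = offset + 1                 # at most once: commit, then process
            if offset == crash_at and not crashed:
                crashed, offset = True, committed  # crash before processing: message lost
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])     # at least once: process, then commit
            if offset == crash_at and not crashed:
                crashed, offset = True, committed  # crash before commit: message redelivered
                continue
            committed = offset + 1
        offset += 1
    return processed

msgs = ["m0", "m1", "m2"]
# At most once: the crashed message m1 is lost.
assert consume(msgs, crash_at=1, commit_first=True) == ["m0", "m2"]
# At least once: m1 is processed twice after the replay.
assert consume(msgs, crash_at=1, commit_first=False) == ["m0", "m1", "m1", "m2"]
```

Exactly-once requires making the processing and the offset commit atomic, which is why it is the hard case.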
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Apache Kafka
● Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
– Kafka is super fast.
– Kafka is scalable.
– Kafka is durable.
– Kafka is distributed by design.
Apache Kafka
● A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Apache Kafka
● Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
Apache Kafka
● Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Apache Kafka
● Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Topic
● Kafka's core concepts: Topic, Producer, Consumer, Broker.
● Kafka maintains feeds of messages in categories called topics.
● Topics are the highest level of abstraction that Kafka provides.
Consumer
● We'll call processes that subscribe to topics and process the feed of published messages, consumers.
– Hadoop Consumer
Topics
● A topic is a category or feed name to which messages are published.
● The Kafka cluster maintains a partitioned log for each topic.
Partition
● A partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.
● The messages in a partition are each assigned a sequential id number called the offset.
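The two bullets above can be sketched in a few lines. This is a toy model (plain Python, no Kafka involved) showing how an append-only list gives every message a stable, sequential offset for free:

```python
class Partition:
    """Toy model of a single Kafka partition: an append-only list whose
    list index doubles as the message offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        self._log.append(message)  # messages are only ever appended, never changed
        return len(self._log) - 1  # sequential offset assigned to this message

    def read(self, offset):
        return self._log[offset]   # consumers address messages by offset

p = Partition()
assert p.append("first") == 0
assert p.append("second") == 1
assert p.read(0) == "first"
```

Because the log is immutable and addressed by offset, many consumers can read the same partition independently, each tracking only its own position.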
Producer
● The producer is responsible for choosing which message to assign to which partition within the topic:
– Round-Robin
– Load-Balanced
– Key-Based (Semantic-Oriented)
Create Topic
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> Created topic "test".
List all Topics
● bin/kafka-topics.sh --list --zookeeper localhost:2181
Send some Messages with the Producer
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello DatisPars Guys!
How is it going with you?
Start a Consumer
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Use Cases
● Messaging
– Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.
● Kafka provides customizable latency
● Kafka has better throughput
● Kafka is highly fault-tolerant
● Log Aggregation
– Many people use Kafka as a replacement for a log aggregation solution.
– Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
– In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
● Lower latency
● Easier support
● Stream Processing
– Storm and Samza are popular frameworks for stream processing; both use Kafka.
● Event Sourcing
– Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
● Commit Log
– Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
Message Format
/**
 * A message. The format of an N byte message is the following:
 * If magic byte is 0:
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 4 byte CRC32 of the payload
 *   3. N - 5 byte payload
 * If magic byte is 1:
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 1 byte "attributes" identifier to allow annotations on the message
 *      independent of the version (e.g. compression enabled, type of codec used)
 *   3. 4 byte CRC32 of the payload
 *   4. N - 6 byte payload
 */
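The magic-0 layout above is simple enough to exercise directly. A minimal sketch in Python (an illustration of the framing, not code from the Kafka codebase) that packs and unpacks a magic-0 message and verifies the checksum:

```python
import struct
import zlib

def encode_v0(payload: bytes) -> bytes:
    # 1 byte magic (0) + 4 byte big-endian CRC32 of the payload + payload
    return struct.pack(">BI", 0, zlib.crc32(payload) & 0xFFFFFFFF) + payload

def decode_v0(message: bytes) -> bytes:
    magic, crc = struct.unpack(">BI", message[:5])
    payload = message[5:]
    if magic != 0:
        raise ValueError("not a magic-0 message")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("payload failed CRC32 check")
    return payload

msg = encode_v0(b"hello")
assert len(msg) - 5 == len(b"hello")  # payload is N - 5 bytes, matching the comment
assert decode_v0(msg) == b"hello"
```

The CRC lets a consumer detect a corrupted payload before handing it to the application; the magic byte is what allows the format to evolve (as the magic-1 variant with its extra attributes byte shows).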