Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Message Delivery Semantics
● At most once
– Messages may be lost but are never redelivered.
● At least once
– Messages are never lost but may be redelivered.
● Exactly once
– This is what people actually want.
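The difference between the first two guarantees comes down to when the consumer commits its offset relative to processing. A minimal sketch (a toy simulation, not the real Kafka consumer API) of a consumer that crashes once mid-message and resumes from its last committed offset:

```python
def consume(messages, crash_at, commit_first):
    """Simulate a consumer that crashes once while handling messages[crash_at],
    then resumes from the last committed offset. Returns what got processed."""
    processed, committed, offset, crashed = [], 0, 0, False
    while offset < len(messages):
        if commit_first:
            committed = offset + 1                 # at most once: commit, then process
            if offset == crash_at and not crashed:
                crashed, offset = True, committed  # crash before processing: message lost
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])     # at least once: process, then commit
            if offset == crash_at and not crashed:
                crashed, offset = True, committed  # crash before commit: message redelivered
                continue
            committed = offset + 1
        offset += 1
    return processed

msgs = ["m0", "m1", "m2"]
# At most once: the crashed message m1 is lost.
assert consume(msgs, crash_at=1, commit_first=True) == ["m0", "m2"]
# At least once: m1 is processed twice after the replay.
assert consume(msgs, crash_at=1, commit_first=False) == ["m0", "m1", "m1", "m2"]
```

Exactly-once requires making the processing and the offset commit atomic, which is why it is the hard case.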
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Apache Kafka
● Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
– Kafka is super fast.
– Kafka is scalable.
– Kafka is durable.
– Kafka is distributed by design.
Apache Kafka
● A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Apache Kafka
● Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
Apache Kafka
● Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Apache Kafka
● Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Topic
● Kafka's core concepts: Topic, Producer, Consumer, Broker.
● Kafka maintains feeds of messages in categories called topics.
● Topics are the highest level of abstraction that Kafka provides.
Consumer
● We'll call processes that subscribe to topics and process the feed of published messages, consumers.
– Hadoop Consumer
Topics
● A topic is a category or feed name to which messages are published.
● The Kafka cluster maintains a partitioned log for each topic.
Partition
● A partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.
● The messages in a partition are each assigned a sequential id number called the offset.
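The two bullets above can be sketched in a few lines. This is a toy model (plain Python, no Kafka involved) showing how an append-only list gives every message a stable, sequential offset for free:

```python
class Partition:
    """Toy model of a single Kafka partition: an append-only list whose
    list index doubles as the message offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        self._log.append(message)  # messages are only ever appended, never changed
        return len(self._log) - 1  # sequential offset assigned to this message

    def read(self, offset):
        return self._log[offset]   # consumers address messages by offset

p = Partition()
assert p.append("first") == 0
assert p.append("second") == 1
assert p.read(0) == "first"
```

Because the log is immutable and addressed by offset, many consumers can read the same partition independently, each tracking only its own position.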
Producer
● The producer is responsible for choosing which message to assign to which partition within the topic:
– Round-Robin
– Load-Balanced
– Key-Based (Semantic-Oriented)
Create Topic
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> Created topic "test".
List all Topics
● bin/kafka-topics.sh --list --zookeeper localhost:2181
Send some Messages with the Producer
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello DatisPars Guys!
How is it going with you?
Start a Consumer
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Use Cases
● Messaging
– Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.
● Kafka provides customizable latency
● Kafka has better throughput
● Kafka is highly fault-tolerant
● Log Aggregation
– Many people use Kafka as a replacement for a log aggregation solution.
– Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
– In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
● Lower latency
● Easier support
● Stream Processing
– Storm and Samza are popular frameworks for stream processing; both use Kafka.
● Event Sourcing
– Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
● Commit Log
– Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
Message Format
/**
 * A message. The format of an N byte message is the following:
 * If magic byte is 0:
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 4 byte CRC32 of the payload
 *   3. N - 5 byte payload
 * If magic byte is 1:
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 1 byte "attributes" identifier to allow annotations on the message
 *      independent of the version (e.g. compression enabled, type of codec used)
 *   3. 4 byte CRC32 of the payload
 *   4. N - 6 byte payload
 */
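The magic-0 layout above is simple enough to exercise directly. A minimal sketch in Python (an illustration of the framing, not code from the Kafka codebase) that packs and unpacks a magic-0 message and verifies the checksum:

```python
import struct
import zlib

def encode_v0(payload: bytes) -> bytes:
    # 1 byte magic (0) + 4 byte big-endian CRC32 of the payload + payload
    return struct.pack(">BI", 0, zlib.crc32(payload) & 0xFFFFFFFF) + payload

def decode_v0(message: bytes) -> bytes:
    magic, crc = struct.unpack(">BI", message[:5])
    payload = message[5:]
    if magic != 0:
        raise ValueError("not a magic-0 message")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("payload failed CRC32 check")
    return payload

msg = encode_v0(b"hello")
assert len(msg) - 5 == len(b"hello")  # payload is N - 5 bytes, matching the comment
assert decode_v0(msg) == b"hello"
```

The CRC lets a consumer detect a corrupted payload before handing it to the application; the magic byte is what allows the format to evolve (as the magic-1 variant with its extra attributes byte shows).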