Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka accepts streams of records from one or more producers and organizes them into topics, which it both stores and forwards to consumers. Producers write data to topics, which are partitioned and replicated across the servers of a cluster for fault tolerance. Consumers can then read the data from each partition of a topic in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications such as metrics, logging, and stream processing.
7. Why’s for Apache Kafka
• Clean and simple architecture
• Easy to use
• Easy to deploy
• High throughput
• Scalability
• High availability
• Persistence (messages are retained for a configurable period)
8. Apache Kafka 101
• Distributed, partitioned, replicated commit log service.
• Provides the functionality of a messaging system.
10. Topic
• Category or feed name to which messages are published.
• Partitioned log
• Each partition is
– Ordered
– An immutable sequence of messages
– Continually appended to
• Offset => sequential id number that identifies each message within its partition
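The partition and offset ideas above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class name PartitionLog and its methods are hypothetical and are not part of any Kafka API. It shows a log that is only ever appended to, where a message's offset is simply its position in the sequence.

```python
class PartitionLog:
    """Toy in-memory model of one partition (hypothetical, not a Kafka API)."""

    def __init__(self):
        # Messages are kept in arrival order and are never modified.
        self._records = []

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records (offset, message) pairs starting at offset."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = PartitionLog()
offsets = [log.append(m) for m in ("a", "b", "c")]
print(offsets)       # [0, 1, 2]
print(log.read(1))   # [(1, 'b'), (2, 'c')]
```

Because reading does not remove anything, two readers can start from different offsets and each sees the same stored order.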
11. Partition Distribution
• Distributed over servers in the cluster
• Replicated for fault tolerance (configurable)
• Each partition has a leader server (handles reads & writes)
• The other replicas act as followers (they replicate the leader)
• If the leader of a partition fails, one of the followers becomes the new leader
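The leader and follower roles above can be sketched as a small simulation. It rests on simplifying assumptions: the class PartitionReplicas and the broker names are hypothetical, and promoting the first remaining follower is chosen only for illustration. It is not a description of how Kafka actually selects a new leader.

```python
class PartitionReplicas:
    """Toy model of the replicas of one partition (hypothetical names)."""

    def __init__(self, brokers):
        # Assumption for this sketch: the first broker listed starts as leader.
        self.leader = brokers[0]
        self.followers = list(brokers[1:])

    def fail(self, broker):
        """Mark a broker as failed, promoting a follower if it was the leader."""
        if broker == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Simplification: promote the first remaining follower.
            self.leader = self.followers.pop(0)
        elif broker in self.followers:
            self.followers.remove(broker)


replicas = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
replicas.fail("broker-1")
print(replicas.leader, replicas.followers)   # broker-2 ['broker-3']
```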
13. Consumer
• Queue vs. Publish/Subscribe
• Traditional queue ordering vs. per-partition ordering
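Kafka covers both the queue and the publish/subscribe models through consumer groups: within one group each partition is read by a single consumer (queue-like load sharing), while every group receives all partitions (publish/subscribe). The sketch below illustrates that idea only. The function assign_partitions, the group names, and the round-robin strategy are assumptions made for the example and are not Kafka's own assignment logic.

```python
def assign_partitions(partitions, groups):
    """Assign every partition to exactly one consumer in each group.

    groups maps a group name to a list of consumer names.
    Returns {group: {consumer: [partitions]}}.
    """
    assignment = {}
    for group, consumers in groups.items():
        per_consumer = {consumer: [] for consumer in consumers}
        # Illustrative round-robin spread of partitions inside the group.
        for i, partition in enumerate(partitions):
            per_consumer[consumers[i % len(consumers)]].append(partition)
        assignment[group] = per_consumer
    return assignment


result = assign_partitions(
    [0, 1, 2, 3],
    {"queue-like": ["c1", "c2"], "subscriber": ["c3"]},
)
print(result["queue-like"])   # {'c1': [0, 2], 'c2': [1, 3]}
print(result["subscriber"])   # {'c3': [0, 1, 2, 3]}
```

Since a partition has only one reader per group, ordering holds per partition rather than across the whole topic.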
14. Guarantees
• Messages sent by a producer to a partition are appended in the order they are sent.
• Consumers see messages in the order they are stored in the log.
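The ordering guarantee applies within a partition, which a short simulation can make concrete. This is a sketch under assumptions: the helper names are hypothetical, and zlib.crc32 is used only as a deterministic stand-in for choosing a partition from a key, not as the hash Kafka itself uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from a key (crc32 is a stand-in hash for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions):
    """Append each (key, value) message to the partition chosen for its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]
partitions = produce(messages, num_partitions=2)

# Reading one partition from start to end returns each key's values in send order.
seen = {}
for key in ("user-a", "user-b"):
    stored = partitions[partition_for(key, 2)]
    seen[key] = [value for k, value in stored if k == key]
    print(key, seen[key])
```

All messages with the same key land in the same partition, so their relative order is preserved even though the two keys may or may not share a partition.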
15. Demo
• Basic Command Line Tools
– Start a server
– Create a topic
– Send a message
– Start a consumer
– Multi-broker cluster
• Running a tool with no arguments displays usage information
17. Administrative Tools
• Kafka Manager (powered by Yahoo)
• Kafkat: Command-line administration for Kafka brokers.
• Kafka Web Console: Displays information about your Kafka cluster, including which nodes are up and what topics they host data for.
• Kafka Offset Monitor: Displays the state of all consumers and how far behind the head of the stream they are.