2. History
• Apache Kafka is an open-source stream-processing software platform originally developed at LinkedIn and donated to the Apache Software Foundation. It is written in Scala and Java.
• Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library.
• Kafka uses a binary TCP-based protocol.
3. Use cases
• Messaging system
• Activity tracking
• Gathering metrics from many different locations
• Gathering application logs
• Stream processing (with the Kafka Streams API or Spark, for example)
• De-coupling of system dependencies
• Integration with Spark, Flink, Storm, Hadoop, and many other big data technologies
7. Company use cases
• Netflix - uses Kafka to apply real-time recommendations while users watch TV shows.
• Uber - uses Kafka to gather user, taxi, and trip data in real time, to compute and forecast demand, and to compute surge pricing in real time.
• LinkedIn - uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time.
• Spotify - Kafka is used at Spotify as part of their log delivery system.
• Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for real-time learning analytics/dashboards.
• Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product, OSB (Oracle Service Bus), which allows developers to leverage OSB's built-in mediation capabilities to implement staged data pipelines.
• Trivago - Trivago uses Kafka for stream processing in Storm as well as for processing application logs.
• Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps them transition from a monolithic to a microservices architecture. Using Kafka for processing event streams enables their technical team to do near-real-time business intelligence.
11. Partition
• Topics are split into partitions.
• A partition contains messages in an immutable, ordered sequence.
• A partition is implemented as a set of segment files of equal size.
• Data, once written to a partition, is immutable.
12. Offset
Each message stored in a partition gets an incremental ID (a unique sequential ID) called the "offset".
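The partition and offset behavior above can be sketched as a toy model in Python (a simplification for illustration, not the real broker implementation, which stores segment files on disk):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where each
    message gets an incrementing sequential ID called the offset."""

    def __init__(self):
        self._log = []  # records are never mutated, only appended

    def append(self, message):
        offset = len(self._log)  # next sequential offset
        self._log.append(message)
        return offset

    def read(self, offset):
        return self._log[offset]


p = Partition()
first = p.append("order-placed")    # assigned offset 0
second = p.append("order-shipped")  # assigned offset 1
```

The key property this illustrates: offsets are assigned in write order and never change, so a consumer can use an offset as a stable bookmark into the partition.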
13. Replicas
• A replica is a backup of a partition.
• Replication factor - the number of copies of the data across multiple brokers.
14. Replicas
• Topic X's partition 0 is available in broker 0, and similarly partition 1 in broker 1.
• Problem:
• In broker 2, we are keeping both actual data (Topic X, partition 1) and replicated data (Topic X, partition 0).
• Solution:
• Choose one broker's copy of each partition as the leader and the rest as followers.
15. Brokers (containers)
• The system responsible for maintaining the published data.
• Holds multiple topics with multiple partitions.
• Brokers are stateless.
• One Kafka broker can handle roughly one million reads/writes per second.
• Handles TBs of messages without a performance hit.
• Each broker in the cluster is identified by an ID.
• A Kafka broker is also known as a bootstrap broker, because connecting to any one broker means connecting to the entire cluster.
16. Kafka clusters
• A Kafka deployment with more than one broker is called a Kafka cluster.
• A Kafka cluster can be expanded without downtime.
• These clusters are used to manage the persistence and replication of message data.
• A cluster typically consists of multiple brokers to maintain load balance.
Kafka Ecosystem
18. Consumer
• Reads data from brokers.
• Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
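The pull model can be sketched in a few lines: the consumer, not the broker, tracks its position (offset) in the partition and asks for the next batch when it is ready (a toy model; a real client would use a Kafka library such as confluent-kafka):

```python
class Consumer:
    """Toy pull-based consumer: it keeps its own read position and
    advances it after each poll; the broker never pushes messages."""

    def __init__(self, log):
        self.log = log     # the partition's message list
        self.position = 0  # next offset to read

    def poll(self, max_records=10):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)  # advance the tracked offset
        return batch


partition_log = ["m0", "m1", "m2"]
c = Consumer(partition_log)
first_batch = c.poll(max_records=2)   # pulls offsets 0 and 1
second_batch = c.poll(max_records=2)  # pulls offset 2
```

Because the consumer owns the offset, it controls its own pace and can even rewind to re-read old messages.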
20. Follower
• A node which follows the leader's instructions is called a follower.
• If the leader fails, one of the followers will automatically become the new leader.
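The failover rule can be modeled in a small sketch (a simplification of Kafka's leader election, which in reality also considers which followers are in sync):

```python
def elect_leader(leader, followers, alive):
    """Toy failover: keep the current leader if it is alive; otherwise
    promote the first live follower to be the new leader."""
    if leader in alive:
        return leader, followers
    new_leader = next(f for f in followers if f in alive)
    remaining = [f for f in followers if f != new_leader]
    return new_leader, remaining


# Leader (broker 0) fails; brokers 1 and 2 are still alive:
new_leader, new_followers = elect_leader(leader=0, followers=[1, 2],
                                         alive={1, 2})
```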
21. ZooKeeper
• Manages and coordinates Kafka brokers.
• Used to notify producers and consumers about the presence or failure of any broker in the Kafka system.
• On a failure, producers and consumers can then make a decision and start coordinating their work with some other broker.
22. Kafka Producers
• How does the producer write data to the cluster?
• Message keys
• Acknowledgment
• A key can be attached to a message to send messages in a specific order. The key gives the producer two choices:
• Send the data to each partition
• If the key is NULL, the data is sent without a key and is distributed in a round-robin manner (i.e., spread across all partitions).
• Send the data to a specific partition
• If the key is not NULL, the key is attached to the data, and all messages with that key will always be delivered to the same partition.
24. With key
• A scenario where a producer specifies a key such as Prod_id (e.g., Prod_id_1, Prod_id_2): every message carrying the same key lands on the same partition, so messages for one product stay in order.
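The two routing rules can be sketched as a partitioner function. Note the hedge: Kafka's default partitioner hashes keys with murmur2; plain `hashlib.md5` is used here only to keep the sketch deterministic and dependency-free, and `NUM_PARTITIONS` is an assumed example value:

```python
import hashlib
import itertools

NUM_PARTITIONS = 3  # assumed topic size for this sketch
_round_robin = itertools.cycle(range(NUM_PARTITIONS))


def choose_partition(key):
    """Toy partitioner: NULL key -> round-robin across partitions;
    non-NULL key -> hash the key so it always maps to one partition."""
    if key is None:
        return next(_round_robin)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


# The same key is always mapped to the same partition:
p1 = choose_partition("Prod_id_1")
p2 = choose_partition("Prod_id_1")
```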
25. Acknowledgment
• When writing data to the Kafka cluster, the producer has a further choice: how much acknowledgment it requires that a sent message was actually received.
26. Case 1
• The producer sends data to the broker but does not receive any acknowledgment.
• acks = 0: the producer sends the data to the broker but does not wait for an acknowledgment.
27. Case 2 (half-duplex)
• The producer sends data to the broker and receives an acknowledgment from the leader only.
• acks = 1: the producer waits for the leader's acknowledgment. The leader confirms that it has successfully written the data, without waiting for the followers.
• Example: the producer sends data to the brokers, and broker 1 holds the leader for the target partition. Once broker 1 has successfully written the data, it sends the acknowledgment back to the producer (ack = 1).
28. Case 3 (full-duplex)
• The producer sends data to the broker and receives acknowledgment from both ends.
• acks = all: the acknowledgment is done by both the leader and its followers, so the write is confirmed only after every replica has the data.
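The three cases above boil down to one rule about when the producer considers a write successful, which can be modeled in a few lines (a toy model of the semantics, not a Kafka client):

```python
def write_succeeds(acks, leader_ok, followers_ok):
    """Toy model of the producer's acks setting:
    acks=0 never waits, acks=1 waits only for the leader,
    acks='all' waits for the leader and every follower."""
    if acks == 0:
        return True  # fire-and-forget: success is assumed, never confirmed
    if acks == 1:
        return leader_ok  # only the leader must confirm the write
    if acks == "all":
        return leader_ok and all(followers_ok)
    raise ValueError(f"unknown acks setting: {acks}")
```

The trade-off: acks = 0 is fastest but can silently lose data, while acks = all is the most durable but waits on every replica.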
30. Comparison - Apache Kafka (Kafka Streams) vs. Apache Spark
• Developers: Kafka was originally developed by LinkedIn; Spark was originally developed at the University of California. Both were later donated to the Apache Software Foundation.
• Infrastructure: Kafka Streams is a Java client library, so it can execute wherever Java is supported. Spark executes on top of the Spark stack, which can be Spark standalone, YARN, or container-based.
• Data sources: Kafka Streams processes data from Kafka itself via topics and streams. Spark ingests data from various sources - files, Kafka, socket sources, etc.
• Processing model: Kafka Streams processes each event as it arrives (an event-at-a-time, continuous processing model). Spark has a micro-batch processing model: it splits the incoming streams into small batches for further processing.
• Latency: Kafka Streams has lower latency than Apache Spark; Spark has higher latency.
• ETL transformation: not supported in Apache Kafka; supported in Spark.
• Fault tolerance: complex in Kafka; easy in Spark.
• Language support: Kafka Streams mainly supports Java. Spark supports multiple languages such as Java, Scala, R, and Python.
• Use cases: The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data. Booking.com and Yelp (ad platform) use Spark Streaming to handle millions of ad requests per day.
33. How can we use Spark, Kafka, and Cassandra
to build a robust analytical platform?
• Concerns?
1. High data flow
Concern 1: A lot of orders get placed on the Walmart website every second, and item availability also changes frequently. Updating this data (which can be 100 MB per second) means streaming information to the analytics platform in real time.
Solution: Kafka is a distributed, scalable, fault-tolerant messaging system which provides streaming support by default.
34. How can we use Spark, Kafka, and Cassandra
to build a robust analytical platform?
• Concerns?
2. Storing terabytes of data with frequent updates
Concern 2: To store item availability data, we needed a datastore that can process a huge number of upserts without compromising performance. Since reports also had to be generated every few hours, reads had to be fast too.
Solution: Though an RDBMS can store a large amount of data, it cannot provide reliable upsert and read performance at this scale. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra has excellent write and read performance, and like Kafka it is distributed, highly scalable, and fault-tolerant.
35. How can we use Spark, Kafka, and Cassandra
to build a robust analytical platform?
• Concerns?
3. Processing a huge amount of data
Concern 3: Data processing had to be carried out at two places in the pipeline:
1. During write, where we have to stream data from Kafka, process it, and save it to Cassandra.
2. While generating business reports, where we have to read the complete Cassandra table, join it with other data sources, and aggregate it on multiple columns.
Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
36. How can we use Spark, Kafka, and Cassandra
to build a robust analytical platform?
(diagram: Spark batch job in the pipeline)
37. Security
• Data encryption among brokers and between client and broker
• Using SSL
• Authentication modes between clients and brokers
• Using SSL (mutual authentication)
• Using SASL (i.e., Kerberos or SCRAM-SHA)
• Authorization of read/write operations by clients
• Using ACLs on topics
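As an illustration, the SSL and SASL options above map to client configuration properties like the following (librdkafka-style names, as used by confluent-kafka-python; the hostnames, file paths, and credentials are placeholders):

```python
# Encrypted client<->broker traffic with mutual TLS authentication:
ssl_config = {
    "bootstrap.servers": "broker1:9093",
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/kafka/ca.pem",               # CA that signed the broker cert
    "ssl.certificate.location": "/etc/kafka/client.pem",  # client cert (mutual auth)
    "ssl.key.location": "/etc/kafka/client.key",
}

# SASL authentication (SCRAM-SHA) over an encrypted channel:
sasl_config = {
    "bootstrap.servers": "broker1:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-256",  # GSSAPI would be used for Kerberos
    "sasl.username": "app-user",
    "sasl.password": "app-secret",
}
```

Authorization (ACLs on topics) is configured on the broker side, not in the client properties.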
38. Thank you!
Keep in touch.
https://www.linkedin.com/in/kumar-shivam-3a07807b/
Kshivam@firstam.com
https://github.com/ThirstyBrain