Apache Kafka
Introduction
Kumar Shivam
A distributed streaming platform
History
• Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache
Software Foundation, written in Scala and Java.
• Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java
stream processing library.
• Kafka uses a binary TCP-based protocol.
Use cases
• Messaging system
• Activity tracking
• Gathering metrics from many different locations
• Application log gathering
• Stream processing (with the Kafka Streams API or Spark, for example)
• Decoupling of system dependencies
• Integration with Spark, Flink, Storm, Hadoop and many other big data technologies
Application data flow (without using Kafka)
Application data flow (using Kafka)
Company use cases
• Netflix - uses Kafka to apply recommendations in real time while users watch TV shows.
• Uber - uses Kafka to gather user, taxi and trip data in real time, to compute and forecast demand, and to compute surge pricing in real time.
• LinkedIn - uses Kafka to prevent spam and to collect user interactions in order to make better connection recommendations in real time.
• Spotify - Kafka is used at Spotify as part of their log delivery system.
• Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for real-time learning analytics and dashboards.
• Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product, OSB (Oracle Service Bus), which allows developers to leverage OSB's built-in mediation capabilities to implement staged data pipelines.
• Trivago - Trivago uses Kafka for stream processing in Storm as well as for processing application logs.
• Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps it transition from a monolithic to a microservices architecture. Using Kafka to process event streams enables its technical team to do near-real-time business intelligence.
Kafka in ERP
Jargon
• Topics (category)
• Partition
• Offset
• Replicas
• Broker
• Cluster
• Producers
• Consumer
• Leader
• Follower
Topic (Category)
A stream of messages belonging to a particular category is called a topic. Data is stored in topics.
Partition
• Topics are split into partitions.
• A partition contains messages in an immutable, ordered sequence.
• A partition is implemented as a set of segment files of equal size.
• Data, once written to a partition, is immutable.
Offset
Each message is stored in a partition with an incrementing ID (i.e. a unique sequence ID) called the "offset".
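The relationship between a partition and its offsets can be sketched with a toy append-only log. This is a minimal illustration, not Kafka's storage format: the class name `PartitionLog` and the sample messages are hypothetical, and the log lives in memory rather than in segment files.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition (hypothetical, not Kafka's format)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        offset = len(self._records)  # offsets start at 0 and grow by 1
        self._records.append(message)
        return offset

    def read(self, offset):
        """Return the message stored at the given offset."""
        return self._records[offset]


log = PartitionLog()
first = log.append("order-created")
second = log.append("order-paid")
```

Because records are only ever appended, an offset assigned once keeps pointing at the same message.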
Replicas
• A replica is a backup of a partition.
• Replication factor: the number of copies of the data kept across multiple brokers.
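How a replication factor spreads copies of each partition over brokers can be sketched as follows. The round-robin placement and the function name `assign_replicas` are assumptions made for illustration; Kafka's own assignment logic takes more factors into account.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place `replication_factor` copies of each partition on distinct brokers.

    Round-robin placement is a simplifying assumption, not Kafka's algorithm.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


# Two partitions, three brokers (IDs 0, 1, 2), two copies of each partition.
placement = assign_replicas(2, [0, 1, 2], 2)
```

With these inputs, partition 0 lands on brokers 0 and 1, and partition 1 on brokers 1 and 2, so a broker can hold copies of more than one partition.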
Replicas
• Topic X, partition 0 is available on broker 0, and similarly for partition 1.
• Problem:
• Broker 2 keeps both actual data (i.e. Topic X, partition 1) and replicated data (i.e. Topic X, partition 0).
• Solution:
• For each partition, choose one broker's copy as the leader and the rest as followers.
Brokers (containers)
• The system responsible for maintaining the published data.
• Holds multiple topics with multiple partitions.
• Brokers are described as stateless: cluster state is kept in ZooKeeper rather than in the brokers themselves.
• 1 Kafka broker = roughly 1 million reads/writes per second (an approximate figure; actual throughput depends on hardware and message size).
• Handles terabytes of messages without a performance hit.
• Each broker in the cluster is identified by an ID.
• A Kafka broker is also known as a bootstrap broker, because connecting to any one broker means connecting to the entire cluster.
Kafka Clusters
• A Kafka deployment with more than one broker is called a Kafka cluster.
• A Kafka cluster can be expanded without downtime.
• Clusters are used to manage the persistence and replication of message data.
• A cluster typically consists of multiple brokers to balance the load.
Kafka Ecosystem
Producer
• The publisher of messages to one or more Kafka topics.
Consumer
• Reads data from brokers.
• Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
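The pull model can be sketched with a consumer that asks for records from its current position and advances that position itself. This is a simplified illustration: the `poll` function, the list standing in for a partition, and the `max_records` parameter are all hypothetical and do not mirror a real client API.

```python
def poll(partition, position, max_records=2):
    """Pull up to `max_records` messages starting at `position`.

    Returns the batch and the next position to read from.
    """
    batch = partition[position:position + max_records]
    return batch, position + len(batch)


partition = ["m0", "m1", "m2"]  # stand-in for one partition's messages
position = 0                     # the consumer tracks where it is

batch1, position = poll(partition, position)
batch2, position = poll(partition, position)
batch3, position = poll(partition, position)  # nothing new to read
```

The consumer decides when to ask for more data; the broker never pushes messages to it.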
Leaders
• The node responsible for all reads and writes for a given partition.
Follower
• A node that follows the leader's instructions is called a follower.
• If the leader fails, one of the followers automatically becomes the new leader.
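Leader failover can be sketched as below. Promoting the next surviving replica in list order is an assumption made to keep the example short, and `elect_leader` is a hypothetical helper rather than part of Kafka.

```python
def elect_leader(replicas, failed_broker):
    """Return (leader, followers) after removing the failed broker.

    `replicas` is an ordered list of broker IDs whose first entry is the
    current leader. Promoting the next survivor is a simplification.
    """
    survivors = [broker for broker in replicas if broker != failed_broker]
    if not survivors:
        raise RuntimeError("no replica left to lead the partition")
    return survivors[0], survivors[1:]


# Broker 0 leads the partition; brokers 1 and 2 follow. Broker 0 then fails.
leader, followers = elect_leader([0, 1, 2], failed_broker=0)
```

If a follower fails instead, the leader is unchanged and only the follower list shrinks.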
ZooKeeper
• It manages and coordinates Kafka brokers.
• It is used to notify producers and consumers about the presence or failure of any broker in the Kafka system.
• On a failure, producers and consumers can then decide to start coordinating their work with another broker.
Kafka Producers
• How does the producer write data to the cluster?
  • Message keys
  • Acknowledgment
• A key lets the producer send messages in a specific order. The key gives the producer two choices:
  • Send the data to each partition.
    • If the key is NULL, the data is sent without a key and is distributed in a round-robin manner (i.e. spread across the partitions).
  • Send the data to a specific partition.
    • If the key is not NULL, the key is attached to the data, and all messages with that key are always delivered to the same partition.
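The two choices above can be sketched as a toy partitioner. The `ToyPartitioner` class is hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it is stable across runs; it is not the hash function Kafka's own client uses.

```python
import itertools
import zlib


class ToyPartitioner:
    """Illustrative partitioner: round-robin without a key, hashing with one."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key):
        if key is None:
            # No key: rotate through the partitions in turn.
            return next(self._round_robin)
        # Key present: the same key always maps to the same partition.
        # crc32 is a stand-in hash, not Kafka's actual choice.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
keyless = [partitioner.partition_for(None) for _ in range(4)]
keyed_a = partitioner.partition_for("Prod_id_1")
keyed_b = partitioner.partition_for("Prod_id_1")
```

Since one key always lands in one partition, and a partition is an ordered sequence, messages that share a key keep their relative order.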
Without key
• Scenario where a producer writes data to the Kafka cluster without a key.
With key
• Scenario where a producer specifies a key, Prod_id (for example Prod_id_1, Prod_id_2).
Acknowledgment
• When writing data to the Kafka cluster, the producer can also choose which acknowledgment to wait for.
(Diagram labels: Message Sent, Message Received)
Case 1
• The producer sends data to the broker but does not receive any acknowledgment.
• acks = 0: the producer sends the data to the broker but does not wait for an acknowledgment.
Case 2 (half-duplex)
• The producer sends data to the broker and receives an acknowledgment from the leader only.
• acks = 1: the producer waits for the leader's acknowledgment. The leader acknowledges once it has successfully received the data, without waiting for its followers.
• Example: the producers send data to the brokers, and Broker 1 holds the leader. Once Broker 1 has successfully received the data, the leader sends the acknowledgment back to the producer (acks = 1).
Case 3 (full-duplex)
• The producer sends data to the broker and the write is acknowledged by both the leader and its followers.
• acks = all: the acknowledgment is given by both the leader and its followers.
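The three cases differ in how many replicas must have the record before the producer treats the send as successful. A minimal sketch, assuming a hypothetical helper `replicas_that_must_confirm` and treating `replica_count` as the leader plus the followers that take part in the acknowledgment:

```python
def replicas_that_must_confirm(acks, replica_count):
    """Number of replicas that must hold a record before the send succeeds.

    `replica_count` is the leader plus its followers (an assumption of this
    sketch). `acks` takes the string values "0", "1" and "all".
    """
    if acks == "0":
        return 0              # fire and forget: no acknowledgment awaited
    if acks == "1":
        return 1              # the leader only
    if acks == "all":
        return replica_count  # the leader and its followers
    raise ValueError(f"unsupported acks value: {acks!r}")


case_1 = replicas_that_must_confirm("0", replica_count=3)
case_2 = replicas_that_must_confirm("1", replica_count=3)
case_3 = replicas_that_must_confirm("all", replica_count=3)
```

Moving from case 1 to case 3 trades lower latency for a stronger guarantee that the data survives a broker failure.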
Kafka Core APIs
Producer, Consumer
Comparison

| Parameters | Apache Kafka (Kafka Streams) | Apache Spark |
|---|---|---|
| Developers | Originally developed by LinkedIn; later donated to the Apache Software Foundation. | Originally developed at the University of California, Berkeley; later donated to the Apache Software Foundation. |
| Infrastructure | A Java client library, so it can run wherever Java is supported. | Runs on top of the Spark stack: Spark standalone, YARN, or container-based. |
| Data sources | Processes data from Kafka itself via topics and streams. | Ingests data from various sources: files, Kafka, sockets, etc. |
| Processing model | Processes events as they arrive, i.e. an event-at-a-time (continuous) processing model. | Micro-batch processing model: incoming streams are split into small batches for further processing. |
| Latency | Lower latency than Apache Spark. | Higher latency. |
| ETL transformation | Not supported in Apache Kafka. | Supported in Spark. |
| Fault tolerance | Fault tolerance is complex in Kafka. | Fault tolerance is easy in Spark. |
| Language support | Mainly Java. | Multiple languages, such as Java, Scala, R and Python. |
| Use cases | The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data. | Booking.com and Yelp (ad platform) use Spark streaming to handle millions of ad requests per day. |
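The difference between the two processing models in the table can be sketched in a few lines. This is an illustration only: grouping by a fixed number of events is a simplification chosen for brevity, and both function names are hypothetical.

```python
def event_at_a_time(events, handle):
    """Apply the handler to each event as soon as it arrives."""
    return [handle(event) for event in events]


def micro_batches(events, batch_size):
    """Group incoming events into small batches before processing.

    Fixed-size batches are a simplification used for this sketch.
    """
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]


events = [1, 2, 3, 4, 5]
per_event = event_at_a_time(events, lambda e: e * 10)
batches = micro_batches(events, batch_size=2)
```

In the first model each result is available immediately; in the second, an event waits until its batch is complete, which is one source of the higher latency noted above.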
Interact with Apache Kafka clusters in Azure
HDInsight using a REST proxy
How can we use Spark, Kafka and Cassandra to build a robust analytical platform?
• Concerns?
1. High data flow
Concern 1: A lot of orders are placed on the Walmart website every second, and item availability also changes frequently. Keeping the data up to date (which can mean 100 MB per second) requires streaming information to the analytics platform in real time.
Solution: Kafka is a distributed, scalable, fault-tolerant messaging system that provides streaming support by default.
2. Storing terabytes of data with frequent updates
Concern 2: To store item availability data, we needed a datastore that can process a huge number of upserts without compromising on performance. To generate reports, the data also had to be processed every few hours, so reads had to be fast too.
Solution: Although an RDBMS can store a large amount of data, it cannot provide reliable upsert and read performance at this scale. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra has very good write and read performance. Like Kafka, it is distributed, highly scalable and fault-tolerant.
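The upsert behaviour the datastore needs (insert a row if the key is new, overwrite it otherwise) can be sketched with a plain dictionary standing in for the availability table. The field names `item_id` and `available` and the helper `apply_upserts` are hypothetical; no real Cassandra client is involved.

```python
def apply_upserts(table, events):
    """Insert or overwrite one row per item; the latest event for an item wins."""
    for event in events:
        table[event["item_id"]] = event["available"]
    return table


availability_events = [
    {"item_id": "A", "available": 5},
    {"item_id": "B", "available": 0},
    {"item_id": "A", "available": 3},  # later update for the same item
]
table = apply_upserts({}, availability_events)
```

Each update touches a single key, so the write does not need to read the previous value first.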
3. Processing a huge amount of data
Concern 3: Data processing had to be carried out at two places in the pipeline.
1. During writes, where we have to stream data from Kafka, process it and save it to Cassandra.
2. While generating business reports, where we have to read the complete Cassandra table, join it with other data sources and aggregate it on multiple columns.
Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
[Architecture diagram for the Spark, Kafka and Cassandra platform; surviving label: Spark batch job]
Security
• Data encryption among brokers and between clients and brokers
  • Using SSL
• Authentication modes between clients and brokers
  • Using SSL (mutual authentication)
  • Using SASL (i.e. Kerberos or SCRAM-SHA)
• Authorization of read/write operations by clients
  • ACLs on topics
Thank you!
Keep in touch.
https://www.linkedin.com/in/kumar-shivam-3a07807b/
Kshivam@firstam.com
https://github.com/ThirstyBrain

Mais conteúdo relacionado

Mais procurados

An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache KafkaAmir Sedighi
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafkaemreakis
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin PodvalMartin Podval
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Kafka Overview
Kafka OverviewKafka Overview
Kafka Overviewiamtodor
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformJean-Paul Azar
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 

Mais procurados (20)

An Introduction to Apache Kafka
An Introduction to Apache KafkaAn Introduction to Apache Kafka
An Introduction to Apache Kafka
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka Overview
Kafka OverviewKafka Overview
Kafka Overview
 
kafka
kafkakafka
kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)
 
Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platform
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 

Semelhante a Apache kafka

Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxKnoldus Inc.
 
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...StreamNative
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaAngelo Cesaro
 
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Data Con LA
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdfTarekHamdi8
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQShameera Rathnayaka
 
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Denodo
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationKnoldus Inc.
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformJean-Paul Azar
 
Timothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLTimothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLEdunomica
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperAnandMHadoop
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache KafkaJoe Stein
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuideInexture Solutions
 
kafka_session_updated.pptx
kafka_session_updated.pptxkafka_session_updated.pptx
kafka_session_updated.pptxKoiuyt1
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 

Semelhante a Apache kafka (20)

Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
 
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
Ten reasons to choose Apache Pulsar over Apache Kafka for Event Sourcing_Robe...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
 
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...Building streaming data applications using Kafka*[Connect + Core + Streams] b...
Building streaming data applications using Kafka*[Connect + Core + Streams] b...
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
 
Kafka tutorial
Kafka tutorialKafka tutorial
Kafka tutorial
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
apachekafka-160907180205.pdf
apachekafka-160907180205.pdfapachekafka-160907180205.pdf
apachekafka-160907180205.pdf
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
 
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams Presentation
 
Kafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platformKafka Tutorial - introduction to the Kafka streaming platform
Kafka Tutorial - introduction to the Kafka streaming platform
 
Timothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for MLTimothy Spann: Apache Pulsar for ML
Timothy Spann: Apache Pulsar for ML
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
 
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
Trivadis TechEvent 2016 Apache Kafka - Scalable Massage Processing and more! ...
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Python Kafka Integration: Developers Guide
Python Kafka Integration: Developers GuidePython Kafka Integration: Developers Guide
Python Kafka Integration: Developers Guide
 
kafka_session_updated.pptx
kafka_session_updated.pptxkafka_session_updated.pptx
kafka_session_updated.pptx
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Último (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apache Kafka

  • 1. Apache Kafka Introduction Kumar Shivam A distributed streaming platform
  • 2. History • Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. • Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. • Kafka uses a binary TCP-based protocol
  • 3. Use cases • Messaging system • Activity tracking • Gathering metrics from many different locations • Gathering application logs • Stream processing (with the Kafka Streams API or Spark, for example) • Decoupling of system dependencies • Integration with Spark, Flink, Storm, Hadoop, and many other big data technologies.
  • 6.
  • 7. Company use cases • Netflix - uses Kafka to apply recommendations in real time while users watch TV shows. • Uber - uses Kafka to gather user, taxi, and trip data in real time to compute and forecast demand and to compute surge pricing. • LinkedIn - uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time. • Spotify - Kafka is used at Spotify as part of their log delivery system. • Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for real-time learning analytics and dashboards. • Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product, OSB (Oracle Service Bus), which lets developers use OSB's built-in mediation capabilities to implement staged data pipelines. • Trivago - Trivago uses Kafka for stream processing in Storm as well as for processing application logs. • Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps it transition from a monolithic to a microservices architecture. Using Kafka to process event streams enables its technical team to do near-real-time business intelligence.
  • 9. Jargon • Topic (category) • Partition • Offset • Replica • Broker • Cluster • Producer • Consumer • Leader • Follower
  • 10. Topic (category) A stream of messages belonging to a particular category is called a topic. Data is stored in topics.
  • 11. Partition • Topics are split into partitions. • A partition contains messages in an immutable, ordered sequence. • A partition is implemented as a set of segment files of equal size. • Data written to a partition cannot be changed afterwards.
  • 12. Offset Each message stored in a partition is assigned an incremental ID (a unique sequence ID within that partition) called the offset.
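The relationship between a partition and its offsets can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's storage format: the class name `PartitionLog` and its methods are hypothetical, and segment files are not modelled at all.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLog:
    """Hypothetical in-memory stand-in for one partition: an append-only log."""

    _messages: List[bytes] = field(default_factory=list)

    def append(self, message: bytes) -> int:
        """Append a message and return the offset it was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> List[Tuple[int, bytes]]:
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


log = PartitionLog()
offsets = [log.append(m) for m in (b"order-created", b"order-paid", b"order-shipped")]
print(offsets)       # [0, 1, 2]
print(log.read(1))   # [(1, b'order-paid'), (2, b'order-shipped')]
```

There is deliberately no update or delete method: existing entries are never modified, only new ones appended, which is what makes the offset a stable position in the log.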
  • 13. Replicas • A replica is a backup of a partition. • Replication factor: the number of copies of the data kept across multiple brokers.
  • 14. Replicas • Partition 0 of Topic X is available on Broker 0, and similarly for Partition 1. • Problem: Broker 2 holds both actual data (Topic X, Partition 1) and replicated data (Topic X, Partition 0). • Solution: choose one broker's copy of each partition as the leader and the rest as followers.
  • 15. Brokers (containers) • A broker is the system responsible for maintaining the published data. • Each broker holds multiple topics with multiple partitions. • Brokers are described as stateless: cluster state is coordinated through ZooKeeper rather than kept by each broker. • One Kafka broker can handle roughly 1 million reads/writes per second. • Brokers handle terabytes of messages without a performance hit. • Each broker in the cluster is identified by an ID. • Every Kafka broker is also called a bootstrap broker, because connecting to any one broker means connecting to the entire cluster.
  • 16. Kafka clusters • A group of more than one Kafka broker is called a Kafka cluster. • A Kafka cluster can be expanded without downtime. • Clusters manage the persistence and replication of message data. • A cluster typically consists of multiple brokers to balance load. (Figure: Kafka ecosystem)
  • 17. Producer • The publisher of messages to one or more Kafka topics.
  • 18. Consumer • Reads data from brokers. • A consumer subscribes to one or more topics and consumes published messages by pulling data from the brokers.
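The pull model can be illustrated with a small sketch in which the consumer, not the broker, remembers how far it has read. Everything here is an assumption for illustration: the `broker` dictionary stands in for a real cluster, each topic has a single partition, and `PullConsumer` is a hypothetical class, not a Kafka client API.

```python
from typing import Dict, List

# Hypothetical in-memory "broker": topic name -> messages of a single partition.
broker: Dict[str, List[str]] = {"page-views": ["home", "cart", "checkout", "home"]}


class PullConsumer:
    """Toy consumer that pulls records and tracks its own position per topic."""

    def __init__(self, topics: List[str]) -> None:
        # Start reading every subscribed topic from offset 0.
        self.positions: Dict[str, int] = {topic: 0 for topic in topics}

    def poll(self, max_records: int = 2) -> Dict[str, List[str]]:
        """Fetch up to max_records new records per topic and advance the position."""
        batch: Dict[str, List[str]] = {}
        for topic, position in self.positions.items():
            records = broker[topic][position:position + max_records]
            if records:
                batch[topic] = records
                self.positions[topic] = position + len(records)
        return batch


consumer = PullConsumer(["page-views"])
print(consumer.poll())  # {'page-views': ['home', 'cart']}
print(consumer.poll())  # {'page-views': ['checkout', 'home']}
print(consumer.poll())  # {}
```

Because the consumer asks for data at its own pace, a slow consumer simply falls behind in the log instead of being overwhelmed by pushed messages.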
  • 19. Leader • The node responsible for all reads and writes for a given partition.
  • 20. Follower • A node that follows the leader's instructions is called a follower. • If the leader fails, one of the followers automatically becomes the new leader.
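The failover behaviour described above can be sketched as follows. This is a simplification under stated assumptions: `PartitionReplicas` is a hypothetical class, and the rule "promote the first surviving replica" is chosen only for illustration; a real cluster selects the new leader from the replicas that are in sync with the old one.

```python
from typing import List, Optional


class PartitionReplicas:
    """Toy model of one partition's replicas spread over several brokers."""

    def __init__(self, replica_brokers: List[int]) -> None:
        if not replica_brokers:
            raise ValueError("a partition needs at least one replica")
        self.replicas: List[int] = list(replica_brokers)
        # Illustrative assumption: the first replica starts out as leader.
        self.leader: Optional[int] = self.replicas[0]

    @property
    def followers(self) -> List[int]:
        return [b for b in self.replicas if b != self.leader]

    def handle_broker_failure(self, broker_id: int) -> None:
        """Drop a failed broker and promote a follower if it was the leader."""
        if broker_id not in self.replicas:
            return
        self.replicas.remove(broker_id)
        if broker_id == self.leader:
            self.leader = self.replicas[0] if self.replicas else None


partition = PartitionReplicas([0, 1, 2])   # replication factor 3
print(partition.leader, partition.followers)   # 0 [1, 2]
partition.handle_broker_failure(0)
print(partition.leader, partition.followers)   # 1 [2]
```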
  • 21. ZooKeeper • It manages and coordinates Kafka brokers. • It notifies producers and consumers about the presence or failure of any broker in the Kafka system, so that on a failure they can start coordinating their work with another broker.
  • 22. Kafka producers • How does the producer write data to the cluster? Two concepts matter: message keys and acknowledgments. • A message key lets the producer send messages in a specific order. The key gives the producer two choices. • Send the data to every partition: if the key is null, the data is sent without a key and is distributed across the partitions in a round-robin manner. • Send the data to a specific partition: if the key is not null, the key is attached to the data, and all messages with the same key are always delivered to the same partition.
  • 23. Without key • Scenario in which a producer writes data to the Kafka cluster without specifying a key.
  • 24. With key • Scenario in which a producer specifies a key as Prod_id (for example, Prod_id_1 and Prod_id_2).
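The two keying scenarios can be sketched as a small partitioner. This is an illustration only, with assumptions worth stating: `SimplePartitioner` is a hypothetical class, and `zlib.crc32` stands in for the hash function. Kafka's Java client hashes keys with murmur2, and newer client versions group keyless records into batches per partition rather than rotating strictly one record at a time.

```python
import itertools
import zlib
from typing import Optional


class SimplePartitioner:
    """Toy partitioner: round-robin without a key, hash-based with a key."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread records evenly over all partitions.
            return next(self._round_robin)
        # Key present: the same key always hashes to the same partition.
        # crc32 is a stand-in hash for this sketch, not what Kafka uses.
        return zlib.crc32(key) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=3)
print([partitioner.partition_for(None) for _ in range(4)])  # [0, 1, 2, 0]
same = {partitioner.partition_for(b"Prod_id_1") for _ in range(5)}
print(len(same))  # 1: every record keyed Prod_id_1 lands in one partition
```

Since ordering is only guaranteed inside a partition, routing all records of one key to one partition is what preserves their relative order.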
  • 25. Acknowledgment • When writing data to the Kafka cluster, the producer can also choose the level of acknowledgment it waits for. (Figure: message sent, message received)
  • 26. Case 1 • The producer sends data to the broker but does not receive any acknowledgment. • acks = 0: the producer sends the data to the broker and does not wait for an acknowledgment.
  • 27. Case 2 (half-duplex) • The producer sends data to the broker and receives an acknowledgment from the leader only. • acks = 1: the producer waits for the leader's acknowledgment. The leader checks with its broker that the data was received successfully and then sends the acknowledgment. • For example, the producer sends data to the brokers, and Broker 1 holds the leader. The leader asks Broker 1 whether it has successfully received the data; after the broker confirms, the leader sends the acknowledgment back to the producer (acks = 1).
  • 28. Case 3 (full-duplex) • The producer sends data to the broker and receives acknowledgment from both the leader and its followers. • acks = all: the write is acknowledged only after both the leader and its followers have received the data.
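The three acknowledgment cases can be summarised in a short sketch that answers one question: which replicas must hold a record before the producer treats the write as successful? The function `required_acks` and the broker IDs are hypothetical. In a real cluster the producer only hears back from the leader, which for `acks=all` waits for its in-sync followers before replying.

```python
from typing import List


def required_acks(acks: str, leader: int, followers: List[int]) -> List[int]:
    """Return the broker IDs that must have the record for the given acks level."""
    if acks == "0":
        # Fire and forget: the producer does not wait for anyone.
        return []
    if acks == "1":
        # Only the partition leader has to confirm the write.
        return [leader]
    if acks == "all":
        # The leader and all of its followers have to hold the record.
        return [leader] + list(followers)
    raise ValueError(f"unsupported acks value: {acks!r}")


for level in ("0", "1", "all"):
    print(level, required_acks(level, leader=1, followers=[2, 3]))
# 0 []
# 1 [1]
# all [1, 2, 3]
```

The trade-off runs in one direction: each step from 0 to 1 to all lowers the chance of losing a record and raises the time the producer waits per write.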
  • 30. Comparison
      Parameter | Apache Kafka | Apache Spark
      Developers | Originally developed by LinkedIn; later donated to the Apache Software Foundation. | Originally developed at the University of California, Berkeley; later donated to the Apache Software Foundation.
      Infrastructure | Kafka Streams is a Java client library, so it can run wherever Java is supported. | Runs on top of the Spark stack: Spark standalone, YARN, or container-based.
      Data sources | Processes data from Kafka itself, via topics and streams. | Ingests data from various sources: files, Kafka, sockets, etc.
      Processing model | Processes each event as it arrives: an event-at-a-time (continuous) processing model. | Micro-batch processing model: splits the incoming stream into small batches for further processing.
      Latency | Lower latency than Apache Spark. | Higher latency.
      ETL transformation | Not supported in Apache Kafka. | Supported in Spark.
      Fault tolerance | Complex in Kafka. | Easy in Spark.
      Language support | Mainly Java. | Multiple languages, such as Java, Scala, R, and Python.
      Use cases | The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data. | Booking.com and Yelp (ad platform) use Spark streaming to handle millions of ad requests per day.
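The difference between the two processing models in the comparison can be sketched without either framework. Both functions below are hypothetical helpers. Note one simplification: the batches here are cut by record count to keep the example deterministic, whereas a micro-batch engine typically cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def event_at_a_time(events: Iterable[int], handle: Callable[[int], int]) -> List[int]:
    """Continuous model: each event is handled as soon as it arrives."""
    return [handle(event) for event in events]


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch model: events are buffered and released in small groups."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the last, possibly smaller, batch.
        yield batch


stream = [1, 2, 3, 4, 5]
print(event_at_a_time(stream, lambda e: e * 10))     # [10, 20, 30, 40, 50]
print([sum(b) for b in micro_batches(stream, 2)])    # [3, 7, 5]
```

The latency row of the comparison follows from this: in the micro-batch model the first event of a batch cannot produce a result until the batch is closed.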
  • 31.
  • 32. Interact with Apache Kafka clusters in Azure HDInsight using a REST proxy
  • 33. How can we use Spark, Kafka and Cassandra to build a robust analytical platform? • Concern 1: high data flow. A lot of orders are placed on the Walmart website every second, and item availability also changes frequently. Keeping this data up to date (which can amount to 100 MB per second) means streaming information to the analytics platform in real time. • Solution: Kafka is a distributed, scalable, fault-tolerant messaging system that provides streaming support by default.
  • 34. How can we use Spark, Kafka and Cassandra to build a robust analytical platform? • Concern 2: storing terabytes of data with frequent updates. To store item availability data, we needed a datastore that can process a huge number of upserts without compromising performance. To generate reports, the data also had to be processed every few hours, so reads had to be fast too. • Solution: although an RDBMS can store large amounts of data, it cannot provide reliable upsert and read performance at this volume. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra offers strong write and read performance and, like Kafka, is distributed, highly scalable, and fault-tolerant.
  • 35. How can we use Spark, Kafka and Cassandra to build a robust analytical platform? • Concern 3: processing huge amounts of data. Data processing had to be carried out at two places in the pipeline: (1) during writes, where we have to stream data from Kafka, process it, and save it to Cassandra; (2) while generating business reports, where we have to read the complete Cassandra table, join it with other data sources, and aggregate it on multiple columns. • Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • 36. How can we use Spark, Kafka and Cassandra to build a robust analytical platform? (Figure: Spark batch job)
  • 37. Security • Data encryption among brokers and between clients and brokers, using SSL. • Authentication between clients and brokers, using SSL (mutual authentication) or SASL (e.g. Kerberos or SCRAM-SHA). • Authorization of read/write operations by clients, using ACLs on topics.
  • 38. Thank you! Keep in touch. https://www.linkedin.com/in/kumar-shivam-3a07807b/ Kshivam@firstam.com https://github.com/ThirstyBrain