7. Introduction - Data Ingestion
[Diagram: an ad-hoc ingestion setup — data-read scripts, aggregation scripts, a tweet-fetch script, and a REST API feeding the data in]
11. LinkedIn data pipeline problem
They had a lot of data:
● User activity tracking
● Server logs and metrics
● Messaging
● Analytics
They built products on that data:
● Newsfeed
● Recommendations
● Search
● Metrics and monitoring
Problem: how to integrate this variety of data and make it available to all their products?
13. Many publishers using direct connections
[Diagram: frontend, backend, database, and chat servers each wired directly to consumers such as metrics analysis, metrics UI, database monitor, and active monitoring]
14. Publish/subscribe system
[Diagram: the same servers now publish to a metrics pub/sub system, which the monitoring consumers subscribe to]
16. Multiple publish/subscribe systems
[Diagram: separate metrics, logging, and tracking pub/sub systems feeding active monitoring, log search, and offline processing]
17. Custom infrastructure for the data pipeline
[Diagram: the same metrics, logging, and tracking pub/sub systems — each one a custom-built piece of pipeline infrastructure]
18. LinkedIn data pipeline problem - Kafka Goals
[Diagram: a single unified pipeline connecting all servers to metrics analysis, metrics UI, database monitor, log search, and offline processing]
● Decouple data pipelines
● Provide persistence for message data to allow multiple consumers
● Optimize for high throughput of messages
● Allow for horizontal scaling of the system to grow as the data streams grow
20. Kafka Architecture - Elements
Kafka → distributed, replicated commit log
[Diagram: producers (frontend servers) send to a Kafka cluster; consumers (metrics analysis, log search) read from it; within the cluster, each broker hosts partitions of topics (Partition X / Topic Y)]
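The "distributed, replicated commit log" above can be sketched as a toy in-memory model (illustrative only — real Kafka persists logs on disk across brokers): each topic-partition is an append-only sequence, and appending a record returns its offset within that partition.

```python
# Toy model of Kafka's storage abstraction: every (topic, partition) is an
# append-only log, and each appended record gets a monotonically
# increasing offset within its partition.
class CommitLog:
    def __init__(self):
        self.partitions = {}  # (topic, partition) -> list of records

    def append(self, topic, partition, record):
        log = self.partitions.setdefault((topic, partition), [])
        log.append(record)
        return len(log) - 1  # the new record's offset

    def read(self, topic, partition, offset):
        # Consumers read forward from an offset they track themselves.
        return self.partitions.get((topic, partition), [])[offset:]

log = CommitLog()
log.append("topicA", 0, "m1")  # offset 0
log.append("topicA", 0, "m2")  # offset 1
log.append("topicA", 1, "m3")  # offset 0 — offsets are per partition
```

Offsets restart at 0 in each partition, which is why (as the consumer slide notes) ordering is only guaranteed within a single partition.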
22. Kafka Architecture - Broker
[Diagram: a Kafka cluster of three brokers; partitions of topics A, B, and C are spread across the brokers (distributed), and the same partition appears on more than one broker (replicated)]
23. Kafka Architecture - Producer/Consumer
[Diagram: a producer (frontend server) writes to the Kafka cluster; a consumer (log search) reads from it]
Basic Concepts
● Latency
● Throughput
● Quality of service: at most once, at least once, exactly once
Use Case Requirements
o Quality of service / Latency
o Throughput / Latency
Producer/Consumer Technology
o Ingestion technologies
o Kafka Client API
o Kafka Connect
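The quality-of-service options above come down to when the consumer commits its offset. A minimal sketch (hypothetical helper functions, not the Kafka client API): committing before processing gives at-most-once, committing after gives at-least-once.

```python
# `committed` is a one-element list standing in for the stored offset.
def consume_at_most_once(messages, process, committed):
    # Commit BEFORE processing: a crash during processing loses the message.
    for offset in range(committed[0], len(messages)):
        committed[0] = offset + 1
        process(messages[offset])

def consume_at_least_once(messages, process, committed):
    # Commit AFTER processing: a crash during processing re-delivers it.
    for offset in range(committed[0], len(messages)):
        process(messages[offset])
        committed[0] = offset + 1

seen, committed = [], [0]
consume_at_least_once(["m0", "m1"], seen.append, committed)
# seen == ["m0", "m1"]; committed[0] == 2
```

Exactly-once is harder: it needs the processing and the offset commit to happen atomically (in Kafka, via idempotent producers and transactions).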
27. Kafka Protocol - Producer
Producer record: Topic, [Partition], [Key], Value (bracketed fields are optional)
[Diagram: Producer.send(record) → Broker (Partition 0 / Topic A) → returns metadata or an exception]
28. Kafka Protocol - Producer
[Diagram: the send path inside the producer — the producer record (Topic, [Partition], [Key], Value) passes through the Serializer and the Partitioner, is appended to a per-topic/partition batch in the buffer (e.g. Batch 0 / Topic A / Partition 0), and a background sender thread ships the batches to the brokers. A failed send is retried while retries remain, otherwise an exception is raised; a committed send returns metadata (topic, partition, offset)]
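The Partitioner step above can be sketched as follows. This is a simplification: Kafka's default partitioner hashes the key with murmur2 (crc32 here is only for illustration) and uses round-robin/sticky assignment for key-less records, but the essential property is the same — a keyed record always lands on the same partition, preserving per-key ordering.

```python
import zlib

def choose_partition(key, num_partitions, explicit_partition=None):
    # An explicit [Partition] on the record wins outright.
    if explicit_partition is not None:
        return explicit_partition
    # Placeholder for Kafka's sticky/round-robin handling of key-less records.
    if key is None:
        return 0
    # Keyed records: hash of the key modulo the partition count.
    return zlib.crc32(key.encode()) % num_partitions

p = choose_partition("user-42", 3)
assert p == choose_partition("user-42", 3)  # same key -> same partition
```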
29. Kafka Protocol - Consumer
[Diagram: a consumer reads Partition 0 of Topic A from a broker; the partition is a sequence of records at offsets 1–9]
● Subscribe(topic) & poll
● Reads topic-partition-offset
● Order is guaranteed only within a partition
● Data is kept only for a limited time (configurable)
● Numbers represent offsets, not messages
● Deserializes data
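The bullets above can be sketched as a toy poll loop (illustrative only; the real client is `KafkaConsumer.poll()`): the consumer tracks its own next offset per (topic, partition), reads forward from it, and each returned record is tagged topic-partition-offset.

```python
def poll(log, positions):
    """log: dict (topic, partition) -> list of records;
    positions: dict (topic, partition) -> next offset to read."""
    records = []
    for tp, msgs in log.items():
        pos = positions.get(tp, 0)
        for offset in range(pos, len(msgs)):
            # Each record is identified by topic, partition, and offset.
            records.append((tp, offset, msgs[offset]))
        positions[tp] = len(msgs)  # advance the consumer's position
    return records

log = {("topicA", 0): ["m0", "m1"], ("topicA", 1): ["x0"]}
positions = {}
first = poll(log, positions)   # everything from offset 0
log[("topicA", 0)].append("m2")
second = poll(log, positions)  # only the new record at offset 2
```

Note that ordering is only meaningful within one partition's list; records from different partitions interleave arbitrarily.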
34. Kafka Connect
Kafka Connect is a framework included in Apache Kafka that integrates Kafka with other systems, making it easy to add new systems to your scalable and secure stream data pipelines.
[Diagram: Source system → Kafka Connect → Kafka (source connector); Kafka → Kafka Connect → Sink system (sink connector)]
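As a concrete example, a source connector is configured with a handful of properties rather than code. The fragment below is modeled on the FileStreamSource sample that ships with Apache Kafka; the file path and topic name are placeholders.

```properties
# Stream lines appended to a local file into a Kafka topic.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-test
```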
36. Schema Registry
[Diagram: source path — a Connector hands streams of source records to Tasks running in Workers; sendRecords() passes them to a Converter, which registers the schema with the Schema Registry (each subject stores topic, schema, and version) and embeds the schema id in the producer records sent to Kafka. Sink path — pollConsumer() fetches consumer records; the Converter resolves the schema from the registry by id and emits sink records for the sink Connector's Task]
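The schema id embedded in each producer record follows a small framing convention (Confluent's wire format): a magic byte, then the 4-byte big-endian schema id, then the serialized payload. A minimal sketch of that framing:

```python
import struct

def frame(schema_id, payload):
    # magic byte 0 + 4-byte big-endian schema id + serialized payload
    return struct.pack(">bI", 0, schema_id) + payload

def unframe(message):
    magic, schema_id = struct.unpack(">bI", message[:5])
    assert magic == 0, "unknown framing"
    return schema_id, message[5:]

msg = frame(42, b"serialized-bytes")  # payload bytes are arbitrary here
sid, payload = unframe(msg)           # sid == 42
```

This is how the consumer-side converter knows which schema to fetch from the registry before deserializing the payload.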