2. Agenda
● Data collection vs Data ingestion
● Why are they key?
● Streaming data sources
● Kafka overview
● Integration of Kafka and Spark
● Checkpointing
● Kafka as Sink
● Delivery semantics
● What next?
3. Data collection and Data ingestion
Data Collection
● Happens where data is created
● Varies for different types of workloads: batch vs streaming
● Different modes of data collection: pull vs push
Data Ingestion
● Receives and stores data
● Coupled with input sources
● Helps in routing data
4. Data collection vs Data ingestion
[Diagram: multiple data sources → input data store → data processing engine → analytical engine, spanning the stages Data Collection → Data Ingestion → Data Processing]
5. Why is Data collection/ingestion key?
[Same pipeline diagram as slide 4]
6. Data collection tools
● rsyslog
○ Ancient data collector
○ Streaming mode
○ Installed by default on most Linux systems and widely known
● Flume
○ Distributed data collection service
○ Solution for data collection of all formats
○ Initially designed to transfer log data into HDFS frequently and reliably
○ Written and maintained by Cloudera
○ Still popular for data collection in the Hadoop ecosystem
7. Data collection tools cont..
● LogStash
○ Pluggable architecture
○ Popular choice in ELK stack
○ Written in JRuby
○ Multiple input/ Multiple output
○ Centralize logs - collect, parse and store/forward
● Fluentd
○ Plugin architecture
○ Built-in HA architecture
○ Lightweight multi-source, multi-destination log routing
○ Offered as a service inside Google Cloud (its logging agent is based on Fluentd)
8. Data Ingestion tools
● RabbitMQ
○ Written in Erlang
○ Implements AMQP (Advanced Message Queuing Protocol)
○ Has a pluggable architecture and provides an extension for HTTP
○ Provides strong guarantees for messages
9. Kafka Overview
● High-throughput publish-subscribe messaging system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system as topics
● Uses ZooKeeper for cluster management
● Written in Scala, but supports many client APIs - Java, Ruby, Python, etc.
● Developed by LinkedIn, now backed by Confluent
11. Terminology
● Brokers: every server that is part of the Kafka cluster
● Producers: processes that produce messages to a topic
● Consumers: processes that subscribe to a topic and read messages
● Consumer group: a set of consumers sharing a common group id to consume topic data
● Topics: where messages are maintained and partitioned
○ Partition: an ordered, immutable sequence of messages, i.e. a commit log
○ Offset: a sequence id given to each message to track its position in a topic partition
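As a rough illustration of how these terms fit together, below is a minimal consumer sketch using the Kafka Java client API (2.0+) from Scala; the broker address, group id and topic name are placeholder assumptions:

import java.util.Properties
import java.time.Duration
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // any broker in the cluster (placeholder)
props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-group")       // consumer group id (placeholder)
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("events")) // topic (placeholder)

// Each record carries the partition it came from and its offset within that partition
consumer.poll(Duration.ofSeconds(1)).forEach { record =>
  println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
}
consumer.close()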
13. Spark vs Kafka compatibility
Kafka Version    Spark Streaming    Spark Structured Streaming    Spark Kafka Sink
Below 0.10       Yes                No                            No
0.10 and later   Yes                Yes                           Yes
● Consumer semantics changed from Kafka 0.10
● A timestamp was introduced in the message format
● Reduced client dependency on ZooKeeper (offsets are stored in a Kafka topic)
● Transport encryption (SSL/TLS) and ACLs were introduced
14. Kafka with Spark Structured Streaming
● Kafka is becoming the de facto streaming source
● Direct integration support from 2.1.0, configured via (see the sketch below):
○ Broker
○ Topic
○ Partitions
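A minimal sketch of wiring up the Kafka source in Structured Streaming, assuming a placeholder broker address ("broker1:9092") and topic name ("events"):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-source-example")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // broker list (placeholder)
  .option("subscribe", "events")                     // topic to subscribe to (placeholder)
  .load()

// Kafka records arrive with binary key/value columns; cast them before processing
val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")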
17. Starting offsets in a Streaming Query
● Ways to choose where to start reading Kafka data, with respect to offsets (see the sketch below):
○ Earliest - start from the beginning of the topic, except already-deleted data
○ Latest - process only new data that arrives after the query has started
○ Assign - specify the precise offset to start from for every partition
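A sketch of the three modes via the startingOffsets option; the topic name, partition numbers and offset values are illustrative:

// Earliest or latest for all partitions
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest") // or "latest" (the default for streaming queries)
  .load()

// Assign: precise per-partition starting offsets as a JSON string
// (-2 means earliest, -1 means latest for that partition)
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"events":[0,1]}""")
  .option("startingOffsets", """{"events":{"0":42,"1":-2}}""")
  .load()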
19. Checkpointing and write-ahead logs
● We still have both of these in Structured Streaming
● Used to track the progress of a query, continually writing intermediate state to the filesystem
● For Kafka, the OffsetRange and the data processed in each trigger are tracked
● The checkpoint location has to be an HDFS-compatible path and should be specified as an option on the DataStreamWriter (see the sketch below)
○ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
● You can modify the application code and simply restart the query; it will resume from the same offsets where it stopped earlier
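A minimal sketch of setting the checkpoint location on the DataStreamWriter, reusing the messages DataFrame from the earlier source sketch; both paths are placeholders:

val query = messages.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output")                        // output directory (placeholder)
  .option("checkpointLocation", "hdfs:///checkpoints/my-query") // HDFS-compatible path (placeholder)
  .start()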
21. Kafka Sink
● Kafka sink introduced from 2.2.0 (Topic, Broker)
● Currently at-least-once semantics are supported
● To achieve exactly-once semantics, you can include a unique <key> in the output data
● While reading the data, run deduplication logic to get each record exactly once
val streamingDf = spark.readStream. ... // columns: guid, eventTime, ...

// Without watermark, using the guid column
streamingDf.dropDuplicates("guid")

// With watermark, using the guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")
25. Delivery semantics
● Types of delivery semantics
○ At-least once
■ Results will be delivered at least once; there is a chance of duplicates in the end
○ At-most once
■ Results will be delivered at most once; there is a chance of missing some results
○ Exactly once
■ Each record is processed once and the corresponding results are produced
26. Spark delivery semantics
● Depends on the type of sources/sinks
● Streaming sinks are designed to be idempotent for handling reprocessing
● Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure
● Currently Spark supports exactly-once semantics for the file output sink
[Diagram: replayable input source → Spark → idempotent sink / output store]
29. What Kafka has in v0.11
● Idempotent producer
○ Exactly-once semantics on the input side
○ https://issues.apache.org/jira/browse/KAFKA-4815
● Transactional producer
○ Atomic writes across multiple partitions
● Exactly-once stream processing
○ Transactional read-process-write-commit operations
○ https://issues.apache.org/jira/browse/KAFKA-4923
30. What Kafka has in v0.8
● At-least once guarantees
[Diagram: the producer sends a message (K,V) to the Kafka broker; the broker appends the data to the topic and returns an ack]
31. What Kafka has in v0.11
● Exactly once guarantees
● Idempotent producer: enable.idempotence = true (see the configuration sketch below)
[Diagram: the producer sends a message (K,V) along with a sequence number and producer id (Seq, Pid); the broker appends (K,V, Seq, Pid) to the topic and returns an ack]
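A hedged sketch of turning on the idempotent producer with the Kafka Java client (callable from Scala); the broker address and topic are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
// The broker de-duplicates retried batches using the (Pid, Seq) pair attached to each one;
// enabling idempotence implies acks=all and retries enabled
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("events", "key", "value")) // placeholder topic
producer.close()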
34. Exactly once stream processing
● Based on the transactional read-process-write-commit pattern (a sketch of the write-commit step follows)
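A hedged sketch of the write-commit portion using the transactional producer API (Kafka 0.11+), extending the props from the previous sketch; the transactional id and output topic are placeholder assumptions:

props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-app") // placeholder id
val txProducer = new KafkaProducer[String, String](props)
txProducer.initTransactions()
try {
  txProducer.beginTransaction()
  // ... process input records and produce the results ...
  txProducer.send(new ProducerRecord("output-topic", "key", "value")) // placeholder topic
  txProducer.commitTransaction() // all writes in the transaction become visible atomically
} catch {
  case e: Exception =>
    txProducer.abortTransaction() // none of the writes become visible
    throw e
}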
35. What's coming in the future
● Spark will essentially support the new semantics from Kafka
● JIRAs to follow
○ SPARK - https://issues.apache.org/jira/browse/SPARK-18057
○ Blocking JIRA from KAFKA - https://issues.apache.org/jira/browse/KAFKA-4879
● Kafka to make the idempotent producer behaviour the default in later versions
○ https://issues.apache.org/jira/browse/KAFKA-5795
● Structured Streaming continuous processing mode
○ https://issues.apache.org/jira/browse/SPARK-20928