Structured Streaming with Kafka
A deeper look into the integration of Kafka and Spark
https://github.com/Shasidhar/kafka-streaming
Agenda
● Data collection vs Data ingestion
● Why are they key?
● Streaming data sources
● Kafka overview
● Integration of Kafka and Spark
● Checkpointing
● Kafka as Sink
● Delivery semantics
● What next?
Data collection and Data ingestion
Data Collection
● Happens where data is created
● Varies by type of workload: batch vs. streaming
● Different modes of data collection: pull vs. push
Data ingestion
● Receive and store data
● Coupled with input sources
● Help in routing data
Data collection vs Data ingestion
[Diagram: three Data Sources -> Input data store -> Data processing engine -> Analytical engine; the stages are labelled Data Collection, Data Ingestion and Data Processing]
Why Data collection/ingestion is key?
[Diagram: three Data Sources -> Input data store -> Data processing engine -> Analytical engine; the stages are labelled Data Collection, Data Ingestion and Data Processing]
Data collection tools
● rsyslog
○ Ancient data collector
○ Streaming mode
○ Available by default and widely known
● Flume
○ Distributed data collection service
○ Solution for data collection of all formats
○ Initially designed to transfer log data into HDFS frequently and reliably
○ Originally developed at Cloudera, now an Apache project
○ Still popular for data collection in the Hadoop ecosystem
Data collection tools (cont.)
● LogStash
○ Pluggable architecture
○ Popular choice in ELK stack
○ Written in JRuby
○ Multiple input/ Multiple output
○ Centralize logs - collect, parse and store/forward
● Fluentd
○ Plugin architecture
○ Built in HA architecture
○ Lightweight multi-source, multi-destination log routing
○ It's offered as a service inside Google Cloud
Data Ingestion tools
● RabbitMQ
○ Written in Erlang
○ Implements AMQP (Advanced Message Queuing Protocol) architecture
○ Has pluggable architecture and provides extension for HTTP
○ Provides strong guarantees for messages
Kafka Overview
● High-throughput, publish-subscribe messaging system
● Distributed, partitioned and replicated commit log
● Messages are persisted in the system, organized into topics
● Uses ZooKeeper for cluster management
● Written in Scala, but supports many client APIs: Java, Ruby, Python, etc.
● Developed at LinkedIn, now backed by Confluent
High Level Architecture
Terminology
● Brokers: the servers that make up a Kafka cluster
● Producers: processes that publish messages to a topic
● Consumers: processes that subscribe to a topic and read its messages
● Consumer Group: a set of consumers sharing a common group to consume topic data
● Topics: where messages are maintained and partitioned
○ Partitions: an ordered, immutable sequence of messages, i.e. a commit log
○ Offset: sequence id given to each message to track its position in a topic partition
Anatomy of Kafka Topic
Spark vs Kafka compatibility
Kafka version   | Spark Streaming | Spark Structured Streaming | Spark Kafka Sink
Below 0.10      | Yes             | No                         | No
0.10 and later  | Yes             | Yes                        | Yes
● Consumer semantics changed as of Kafka 0.10
● Timestamps were introduced in the message format
● Reduced client dependency on ZooKeeper (offsets are stored in a Kafka topic)
● Transport encryption (SSL/TLS) and ACLs were introduced
Kafka with Spark Structured Streaming
● Kafka is becoming the de facto streaming source
● Direct integration support from Spark 2.1.0
○ Broker
○ Topic
○ Partitions
Kafka Wordcount
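The code shown on this slide is not part of the extracted text. A minimal sketch of a streaming word count over a Kafka topic could look like the following. The broker address localhost:9092, the topic name wordcount and the object name are assumptions for illustration, and the spark-sql-kafka-0-10 package is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaWordCount")
      .master("local[2]") // assumption: local run for experimentation
      .getOrCreate()

    // The Kafka source exposes key, value, topic, partition, offset, timestamp, timestampType.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "wordcount")                    // assumed topic
      .load()
      .select(col("value").cast("string").as("line"))

    // Split each message into words and count occurrences of each word.
    val counts = lines
      .select(explode(split(col("line"), "\\s+")).as("word"))
      .filter(col("word") =!= "")
      .groupBy("word")
      .count()

    // Complete mode re-emits the full result table on every trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```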
Kafka ingestion time Wordcount
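The code for this slide is likewise missing from the text. One way to sketch an ingestion-time word count is to group by a window over the timestamp column that the Kafka source attaches to each record. The 15-second window, broker address and topic name below are assumptions, not values from the deck.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, window}

object KafkaIngestionTimeWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaIngestionTimeWordCount")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    // Keep the Kafka record timestamp next to every word.
    val words = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "wordcount")                    // assumed topic
      .load()
      .select(
        explode(split(col("value").cast("string"), "\\s+")).as("word"),
        col("timestamp"))
      .filter(col("word") =!= "")

    // Count words per 15-second window of the Kafka timestamp (assumed window size).
    val counts = words
      .groupBy(window(col("timestamp"), "15 seconds"), col("word"))
      .count()

    // Update mode prints only the windows whose counts changed in a trigger.
    counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```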
Starting offsets in Streaming Query
● Ways to choose the offset from which a query starts reading Kafka data
○ Earliest - start from the beginning of the topic, excluding data that has already been deleted.
○ Latest - process only new data that arrives after the query has started.
○ Assign - specify the precise offset to start from for every partition
Kafka read from offset
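The slide's code is not in the text. As a hedged sketch, the three choices above map onto the startingOffsets option of the Kafka source, where per-partition offsets are passed as a JSON string (-2 stands for earliest and -1 for latest inside that JSON). The topic name events, its two partitions and the offset 23 are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaStartingOffsets {
  // Builds a Kafka stream that starts at the given position.
  def readFrom(spark: SparkSession, startingOffsets: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .option("startingOffsets", startingOffsets)
      .load()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaStartingOffsets")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val fromEarliest = readFrom(spark, "earliest")
    val fromLatest   = readFrom(spark, "latest")
    // Assumes "events" has exactly partitions 0 and 1:
    // partition 0 starts at offset 23, partition 1 at its earliest offset (-2).
    val fromSpecific = readFrom(spark, """{"events":{"0":23,"1":-2}}""")

    fromSpecific
      .selectExpr("partition", "offset", "CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Note that startingOffsets only applies when a query starts for the first time; a restarted query resumes from its checkpoint.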
Checkpointing and write ahead logs
● Structured Streaming still has both of these
● Used to track the progress of a query and to regularly write intermediate state to the filesystem
● For Kafka, the offset range and the data processed in each trigger are tracked
● The checkpoint location has to be an HDFS-compatible path and should be specified as an option on the DataStreamWriter
○ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries
● You can modify the application code and start the query again; it resumes from the offsets where it stopped earlier
Kafka Checkpointing and recovering
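The slide's code is missing from the text. A minimal sketch of a checkpointed query sets checkpointLocation on the DataStreamWriter; the path /tmp/checkpoints/kafka-events, the broker address and the topic name are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object KafkaCheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaCheckpointedQuery")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()

    // Offsets processed per trigger are recorded under the checkpoint directory,
    // so stopping and restarting this program continues where it left off.
    val query = events
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-events") // assumed path
      .start()

    query.awaitTermination()
  }
}
```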
Kafka Sink
● Kafka sink introduced in Spark 2.2.0 (Topic, Broker)
● Currently supports at-least-once semantics
● To achieve exactly-once semantics, include a unique <key> in the output data
● When reading the data back, run deduplication logic on that key so each record is seen exactly once
val streamingDf = spark.readStream. ... // columns: guid, eventTime, ...
// Without watermark using guid column
streamingDf.dropDuplicates("guid")
// With watermark using guid and eventTime columns
streamingDf
.withWatermark("eventTime", "10 seconds")
.dropDuplicates("guid", "eventTime")
Kafka Sink example
Kafka Sink update mode example
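The sink examples on these slides are not in the text. The sketch below writes running word counts back to Kafka in update mode. The Kafka sink expects a value column (and optionally a key), a target topic and a checkpoint location; the topic names, broker address and checkpoint path are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

object KafkaSinkUpdateMode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaSinkUpdateMode")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "wordcount")                    // assumed input topic
      .load()
      .select(explode(split(col("value").cast("string"), "\\s+")).as("word"))
      .filter(col("word") =!= "")
      .groupBy("word")
      .count()

    // Word becomes the Kafka key, the current count becomes the value.
    val query = counts
      .select(col("word").as("key"), col("count").cast("string").as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "wordcount-output")                         // assumed output topic
      .option("checkpointLocation", "/tmp/checkpoints/kafka-sink") // assumed path
      .outputMode("update") // only changed counts are written per trigger
      .start()

    query.awaitTermination()
  }
}
```

Because the sink is at-least-once, a downstream reader may see the same key more than once and should keep the latest value or deduplicate.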
Kafka Source
Delivery semantics
● Types of delivery semantics
○ At-least once
■ Results are delivered at least once, so duplicates may appear in the output
○ At-most once
■ Results are delivered at most once, so some results may be missed
○ Exactly once
■ Each record is processed once and the corresponding result is produced once
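The three semantics can be illustrated with a small in-memory simulation. This is a toy model, not Kafka or Spark code: the fault types, the single successful retry and the deduplicating sink are all assumptions made for the example.

```scala
sealed trait Fault
case object NoFault extends Fault
case object LostBeforeWrite extends Fault // message never reaches the sink
case object LostAck extends Fault         // message is written, but the ack is lost

object DeliverySemanticsDemo {
  // One send attempt: returns the sink contents afterwards and whether the sender saw an ack.
  private def attempt(sink: Vector[Int], id: Int, fault: Fault): (Vector[Int], Boolean) =
    fault match {
      case NoFault         => (sink :+ id, true)
      case LostBeforeWrite => (sink, false)
      case LostAck         => (sink :+ id, false)
    }

  // At-most once: send each message a single time and never retry.
  def atMostOnce(msgs: Seq[(Int, Fault)]): Vector[Int] =
    msgs.foldLeft(Vector.empty[Int]) { case (sink, (id, fault)) =>
      attempt(sink, id, fault)._1
    }

  // At-least once: retry when no ack arrives (the retry is assumed to succeed).
  def atLeastOnce(msgs: Seq[(Int, Fault)]): Vector[Int] =
    msgs.foldLeft(Vector.empty[Int]) { case (sink, (id, fault)) =>
      val (afterFirst, acked) = attempt(sink, id, fault)
      if (acked) afterFirst else attempt(afterFirst, id, NoFault)._1
    }

  // Exactly once, modelled here as at-least-once delivery plus deduplication by id.
  def exactlyOnce(msgs: Seq[(Int, Fault)]): Vector[Int] =
    atLeastOnce(msgs).distinct
}
```

With messages 1 (no fault), 2 (lost before write) and 3 (lost ack), at-most once loses message 2, at-least once duplicates message 3, and the deduplicated variant yields each message once.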
Spark delivery semantics
● Depends on the type of source and sink
● Streaming sinks are designed to be idempotent to handle reprocessing
● Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
● Currently Spark supports exactly-once semantics for the File output sink.
[Diagram: Input source (replayable source) -> Spark -> Output store (idempotent sink)]
Structured Streaming write semantics
File Sink Example
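The example for this slide is not in the text. A minimal sketch of a Kafka-to-file pipeline is given below; Parquet as the format, the output path, the checkpoint path, the broker address and the topic name are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToFileSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaToFileSink")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // The file sink works in append mode and needs both an output path and a checkpoint.
    val query = events.writeStream
      .format("parquet")
      .option("path", "/tmp/output/kafka-events")                  // assumed output directory
      .option("checkpointLocation", "/tmp/checkpoints/file-sink")  // assumed path
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```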
What kafka has in v0.11
● Idempotent producer
○ Exactly Once semantics in input
○ https://issues.apache.org/jira/browse/KAFKA-4815
● Transactional producer
○ Atomic writes across multiple partitions
● Exactly once stream processing
○ Transactional read-process-write-commit operations
○ https://issues.apache.org/jira/browse/KAFKA-4923
What kafka has in v0.8
● At-least once guarantees
[Diagram: Producer sends message (K,V) to the Kafka Broker; the broker appends (K,V) to the topic and returns an ack]
What kafka has in v0.11
[Diagram: Producer sends message (K,V) together with Seq and Pid to the Kafka Broker; the broker appends (K,V,Seq,Pid) to the topic and returns an ack]
Idempotent producer: enable.idempotence = true
● Exactly once guarantees
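The role of Seq and Pid can be sketched with a toy broker log that remembers the last sequence number seen per producer id and drops retried sends. This is a simplified in-memory model for illustration only; the class and method names are hypothetical, and the real broker tracks sequences per partition and also rejects gaps.

```scala
import scala.collection.mutable

final case class Record(key: String, value: String, pid: Long, seq: Int)

final class BrokerLog {
  private val log = mutable.ArrayBuffer.empty[Record]
  private val lastSeq = mutable.Map.empty[Long, Int]

  // Appends the record unless its sequence number was already seen for this producer.
  // Returns true if the record was appended, false if it was dropped as a duplicate.
  def append(r: Record): Boolean = {
    val isDuplicate = lastSeq.get(r.pid).exists(_ >= r.seq)
    if (!isDuplicate) {
      log += r
      lastSeq(r.pid) = r.seq
    }
    !isDuplicate
  }

  def records: List[Record] = log.toList
}

object IdempotentProducerDemo {
  // Simulates a producer that resends its first message after a lost ack.
  def run(): List[Record] = {
    val broker = new BrokerLog
    val first = Record("k1", "v1", pid = 7L, seq = 0)
    broker.append(first)
    broker.append(first) // retry of the same (pid, seq): dropped by the broker
    broker.append(Record("k2", "v2", pid = 7L, seq = 1))
    broker.records
  }
}
```

The retried send carries the same (Pid, Seq) pair, so the log ends up with one copy of each message even though the producer sent three times.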
Atomic multi-partition writes
● Transactional producer: transactional.id = "unique-id"
● Transactional consumer: isolation.level = "read_committed"
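A hedged sketch of an atomic write to two topics with the Kafka Java producer API, called from Scala, is shown below. It assumes the kafka-clients library (0.11 or later) is on the classpath; the broker address and the topic names topic-a and topic-b are made up for illustration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}
import org.apache.kafka.common.serialization.StringSerializer

object TransactionalWrite {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    props.setProperty(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "unique-id")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    try {
      producer.beginTransaction()
      // Both sends become visible together, or not at all.
      producer.send(new ProducerRecord[String, String]("topic-a", "k", "v1")) // assumed topic
      producer.send(new ProducerRecord[String, String]("topic-b", "k", "v2")) // assumed topic
      producer.commitTransaction()
    } catch {
      case _: ProducerFencedException | _: OutOfOrderSequenceException | _: AuthorizationException =>
        () // fatal errors: the producer cannot continue and is closed below
      case _: KafkaException =>
        producer.abortTransaction() // other errors: abort so no partial write is exposed
    } finally {
      producer.close()
    }
  }
}
```

Consumers only get the all-or-nothing view if they set isolation.level to read_committed; with the default setting they also see records from aborted transactions.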
Exactly once stream processing
● Based on transactional read-process-write-commit pattern
What’s coming in Future
● Spark will essentially support the new semantics from Kafka
● JIRAs to follow
○ SPARK - https://issues.apache.org/jira/browse/SPARK-18057
○ Blocking JIRA from KAFKA - https://issues.apache.org/jira/browse/KAFKA-4879
● Kafka to make idempotent producer behaviour the default in later versions
○ https://issues.apache.org/jira/browse/KAFKA-5795
● Structured Streaming continuous processing mode
○ https://issues.apache.org/jira/browse/SPARK-20928
References
● https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
● https://databricks.com/session/introducing-exactly-once-semantics-in-apache-kafka
● https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
● http://shashidhare.com/spark,/kafka/2017/03/23/spark-structured-streaming-with-kafka-advanced.html
● http://shashidhare.com/spark,/kafka/2017/01/14/spark-structured-streaming-with-kafka-basic.html
● Shashidhar E S
● Lead Solution Engineer at Databricks
● www.shashidhare.com

Computer science Sql cheat sheet.pdf.pdf
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 

Structured Streaming with Kafka

  • 1. Structured Streaming with Kafka Deeper look into the integration of kafka and spark https://github.com/Shasidhar/kafka-streaming
  • 2. Agenda ● Data collection vs Data ingestion ● Why they are key? ● Streaming data sources ● Kafka overview ● Integration of kafka and spark ● Checkpointing ● Kafka as Sink ● Delivery semantics ● What next?
  • 3. Data collection and Data ingestion Data Collection ● Happens where data is created ● Varies by workload type: batch vs streaming ● Different modes of data collection: pull vs push Data ingestion ● Receives and stores data ● Coupled with input sources ● Helps in routing data
  • 4. Data collection vs Data ingestion Data Source Data Source Data Source Input data store Data processing engine Analytical engine Data Collection Data Ingestion Data Processing
  • 5. Why Data collection/ingestion is key? Data Source Data Source Data Source Input data store Data processing engine Analytical engine Data Collection Data Ingestion Data Processing
  • 6. Data collection tools ● rsyslog ○ Long-established data collector ○ Streaming mode ○ Ships by default and is widely known ● Flume ○ Distributed data collection service ○ Solution for data collection of all formats ○ Initially designed to transfer log data into HDFS frequently and reliably ○ Originally developed at Cloudera, now an Apache project ○ Still popular for data collection in the Hadoop ecosystem
  • 7. Data collection tools cont.. ● LogStash ○ Pluggable architecture ○ Popular choice in ELK stack ○ Written in JRuby ○ Multiple inputs / multiple outputs ○ Centralizes logs: collect, parse and store/forward ● Fluentd ○ Plugin architecture ○ Built-in HA architecture ○ Lightweight multi-source, multi-destination log routing ○ It is offered as a service inside Google Cloud
  • 8. Data Ingestion tools ● RabbitMQ ○ Written in Erlang ○ Implements AMQP (Advanced Message Queuing Protocol) architecture ○ Has pluggable architecture and provides extension for HTTP ○ Provides strong guarantees for messages
  • 9. Kafka Overview ● High-throughput publish-subscribe messaging system ● Distributed, partitioned and replicated commit log ● Messages are persisted in the system, organized into topics ● Uses Zookeeper for cluster management ● Written in Scala and Java, with client APIs for many languages: Java, Ruby, Python etc ● Developed at LinkedIn, now backed by Confluent
  • 11. Terminology ● Brokers: Every server that is part of the Kafka cluster ● Producers: Processes that produce messages to a topic ● Consumers: Processes that subscribe to a topic and read messages ● Consumer Group: Set of consumers sharing a common group id to consume topic data ● Topics: Where messages are maintained and partitioned. ○ Partitions: An ordered, immutable sequence of messages, i.e. a commit log. ○ Offset: Sequence id given to each message to track its position in a topic partition
  • 13. Spark vs Kafka compatibility
      Kafka version | Spark Streaming | Spark Structured Streaming | Spark Kafka Sink
      Below 0.10    | Yes             | No                         | No
      After 0.10    | Yes             | Yes                        | Yes
    ● Consumer semantics changed in Kafka 0.10 ● Timestamps were introduced in the message format ● Reduced client dependency on ZK (offsets are stored in a Kafka topic) ● Transport encryption (SSL/TLS) and ACLs were introduced
  • 14. Kafka with Spark Structured Streaming ● Kafka is becoming the de facto streaming source ● Direct integration support from Spark 2.1.0 ○ Broker, ○ Topic, ○ Partitions
  • 16. Kafka ingestion time Wordcount
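The code for this slide is not part of the transcript. A minimal sketch of what an ingestion-time word count could look like is shown below, assuming the `spark-sql` and `spark-sql-kafka-0-10` dependencies are on the classpath. The broker address `localhost:9092`, the topic name `wordcount-topic` and the 15-second window are illustrative assumptions, not values from the talk. The `timestamp` column exposed by the Kafka source is used as the ingestion time.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, window}

object KafkaIngestionTimeWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaIngestionTimeWordCount")
      .master("local[2]") // assumption: local run for experimentation
      .getOrCreate()

    // The Kafka source yields rows with key, value, topic, partition,
    // offset, timestamp and timestampType columns.
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "wordcount-topic")               // hypothetical topic name
      .load()
      .selectExpr("CAST(value AS STRING) AS line", "timestamp")

    // One row per word, carrying the Kafka record timestamp along.
    val words = lines.select(
      explode(split(col("line"), "\\s+")).as("word"),
      col("timestamp"))

    // Count words per 15-second window of the Kafka timestamp.
    val counts = words
      .groupBy(window(col("timestamp"), "15 seconds"), col("word"))
      .count()

    val query = counts.writeStream
      .format("console")
      .outputMode("complete")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

Running it requires a reachable Kafka broker; the author's own version is in the repository linked on the first slide.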
  • 17. Starting offsets in Streaming Query ● Ways to start accessing kafka data with respect to offset ○ Earliest - start from beginning of the topic, except the deleted data. ○ Latest - start processing only new data that arrives after the query has started. ○ Assign - specify the precise offset to start from for every partition
  • 18. Kafka read from offset
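The slide's code is not in the transcript. The three starting modes from the previous slide map onto the `startingOffsets` option of the Kafka source, roughly as sketched here. The topic name `events`, its two partitions and the broker address are assumptions made for illustration. In the JSON form, `-2` stands for earliest and `-1` for latest.

```scala
import org.apache.spark.sql.SparkSession

object KafkaReadFromOffset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaReadFromOffset")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    // Alternatives: "earliest" or "latest" as plain strings.
    // Here: partition 0 starts at offset 23, partition 1 at earliest (-2).
    // Assumes the hypothetical topic "events" has exactly these two partitions.
    val startingOffsets = """{"events":{"0":23,"1":-2}}"""

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")
      .option("startingOffsets", startingOffsets)
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "partition", "offset")

    val query = events.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

`startingOffsets` only applies when a query starts for the first time; a query restarted from a checkpoint continues from the offsets recorded there.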
  • 19. Checkpointing and write ahead logs ● Structured Streaming still has both of these ● They are used to track the progress of a query and to regularly write intermediate state to the filesystem ● For Kafka, the OffsetRange and the data processed in each trigger are tracked ● The checkpoint location has to be an HDFS-compatible path and should be specified as an option on DataStreamWriter ○ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#starting-streaming-queries ● You can modify the application code and start the query again; it resumes from the offsets where it stopped earlier
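A minimal sketch of setting the checkpoint location on `DataStreamWriter` follows. The path `/tmp/checkpoints/kafka-checkpoint-sketch`, the topic `events` and the broker address are placeholders chosen for illustration; in a cluster the path would normally point at HDFS or another compatible store.

```scala
import org.apache.spark.sql.SparkSession

object KafkaCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaCheckpointSketch")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                        // hypothetical topic name
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "partition", "offset")

    // Offsets processed in each trigger are recorded under checkpointLocation,
    // so a restarted query picks up where the previous run stopped.
    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-checkpoint-sketch") // placeholder path
      .start()

    query.awaitTermination()
  }
}
```

Stopping the job and starting it again with the same checkpoint path should show offsets continuing from the last committed batch rather than from `startingOffsets`.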
  • 21. Kafka Sink ● Kafka sink introduced in Spark 2.2.0 (Topic, Broker) ● Currently only at-least-once semantics are supported ● To achieve exactly-once semantics, include a unique <key> in the output data ● When reading the data back, run deduplication logic so that each record is seen exactly once
      val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...
      // Without watermark using guid column
      streamingDf.dropDuplicates("guid")
      // With watermark using guid and eventTime columns
      streamingDf
        .withWatermark("eventTime", "10 seconds")
        .dropDuplicates("guid", "eventTime")
  • 23. Kafka Sink update mode example
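The example code for this slide is missing from the transcript. As a hedged sketch, writing a running word count back to Kafka in update mode might look like this. The topic names `wordcount-topic` and `wordcount-output`, the broker address and the checkpoint path are all assumptions. The Kafka sink expects a `value` column (and optionally `key`) of string or binary type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, explode, lit, split}

object KafkaSinkUpdateMode {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaSinkUpdateMode")
      .master("local[2]") // assumption: local run
      .getOrCreate()

    val words = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "wordcount-topic")               // hypothetical input topic
      .load()
      .select(explode(split(col("value").cast("string"), "\\s+")).as("word"))

    val counts = words
      .groupBy(col("word"))
      .agg(count(lit(1)).as("occurrences"))

    // Update mode emits only the words whose count changed in this trigger.
    val query = counts
      .select(col("word").as("key"), col("occurrences").cast("string").as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "wordcount-output")                              // hypothetical output topic
      .option("checkpointLocation", "/tmp/checkpoints/kafka-sink-demo") // placeholder path
      .outputMode("update")
      .start()

    query.awaitTermination()
  }
}
```

Because the sink is at-least-once, a consumer of `wordcount-output` may see the same update more than once after a failure and should treat the word as the deduplication key.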
  • 25. Delivery semantics ● Types of delivery semantics ○ At-least once ■ Results are delivered at least once, so duplicates may appear in the output ○ At-most once ■ Results are delivered at most once, so some results may be missed ○ Exactly once ■ Each record is processed once and the corresponding results are produced once
  • 26. Spark delivery semantics ● Depends on the type of sources/sinks ● Streaming sinks are designed to be idempotent for handling reprocessing ● Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure. ● Currently Spark supports exactly-once semantics for the File output sink. ● Diagram: Input source (replayable source) → Spark → Output store (idempotent sink)
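One way to see where the first two semantics come from is the order in which a consumer commits its offset and processes a record. The toy model below is not Kafka or Spark code. It is a self-contained simulation, with hypothetical names, in which the consumer crashes once between those two steps and then restarts from the last committed offset.

```scala
object DeliverySemanticsDemo {
  // Consumes `log`, crashing once between the two steps of handling the
  // record at offset `crashAt`, then restarting from the committed offset.
  def run(log: Vector[String], crashAt: Int, commitBeforeProcess: Boolean): Vector[String] = {
    var committed = 0                 // next offset to read after a restart
    var output = Vector.empty[String] // results delivered downstream
    var crashed = false               // the simulated crash happens only once
    while (committed < log.length) {
      var offset = committed          // (re)start from the committed offset
      var alive = true
      while (alive && offset < log.length) {
        val failNow = !crashed && offset == crashAt
        if (commitBeforeProcess) {
          committed = offset + 1      // commit first ...
          if (failNow) { crashed = true; alive = false } // ... crash before processing
          else output = output :+ log(offset)
        } else {
          output = output :+ log(offset) // process first ...
          if (failNow) { crashed = true; alive = false } // ... crash before committing
          else committed = offset + 1
        }
        offset += 1
      }
    }
    output
  }

  def main(args: Array[String]): Unit = {
    val log = Vector("a", "b", "c")
    // Commit before processing: "b" is lost (at-most once).
    println(run(log, crashAt = 1, commitBeforeProcess = true))  // Vector(a, c)
    // Process before committing: "b" is delivered twice (at-least once).
    println(run(log, crashAt = 1, commitBeforeProcess = false)) // Vector(a, b, b, c)
  }
}
```

Exactly-once behaviour needs something more than reordering these two steps, for example an idempotent sink or a transaction that covers both the output and the offset commit.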
  • 29. What kafka has in v0.11 ● Idempotent producer ○ Exactly Once semantics in input ○ https://issues.apache.org/jira/browse/KAFKA-4815 ● Transactional producer ○ Atomic writes across multiple partitions ● Exactly once stream processing ○ Transactional read-process-write-commit operations ○ https://issues.apache.org/jira/browse/KAFKA-4923
  • 30. What kafka has in v0.8 ● At-least once guarantees ● Diagram: the producer sends a message (K,V) to the Kafka broker; the broker appends the data to the topic and sends an ack
  • 31. What kafka has in v0.11 ● Idempotent producer (enable.idempotence = true) ● Exactly once guarantees ● Diagram: the producer sends each message (K,V) tagged with a sequence number (Seq) and producer id (Pid); the Kafka broker appends (K,V, Seq, Pid) to the topic and sends an ack
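The effect of the producer id and sequence number can be illustrated with a toy broker that drops a retried send it has already appended. This is a deliberately simplified model with hypothetical names, not Kafka's implementation: a real broker tracks sequences per producer and per partition and also rejects out-of-order sequences.

```scala
object IdempotentProducerDemo {
  final case class Record(pid: Long, seq: Int, key: String, value: String)

  // Toy single-partition broker that remembers the highest sequence
  // number appended for each producer id.
  final class Broker {
    private var log = Vector.empty[Record]
    private var lastSeq = Map.empty[Long, Int]

    // Appends the record unless its sequence number was already seen.
    // Returns true if the record was appended, false if it was a duplicate.
    def append(r: Record): Boolean = {
      val isDuplicate = lastSeq.get(r.pid).exists(_ >= r.seq)
      if (!isDuplicate) {
        log = log :+ r
        lastSeq = lastSeq.updated(r.pid, r.seq)
      }
      !isDuplicate
    }

    def records: Vector[Record] = log
  }

  def main(args: Array[String]): Unit = {
    val broker = new Broker
    val msg = Record(pid = 1L, seq = 0, key = "k", value = "v")
    broker.append(msg) // first send: appended, but assume the ack is lost
    broker.append(msg) // producer retries the same (pid, seq): ignored
    broker.append(Record(pid = 1L, seq = 1, key = "k", value = "v2"))
    println(broker.records.size) // 2
  }
}
```

Without the `(pid, seq)` check, the retry would be appended a second time, which is the at-least-once behaviour of the v0.8 slide.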
  • 32. Atomic Multi partition Writes Transactional Producer transactional.id = “unique-id”
  • 33. Atomic Multi partition Writes Transactional Consumer isolation.level = “read_committed”
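The two settings on these slides can be tied together in a short sketch using the Kafka Java client from Scala. It assumes the `kafka-clients` library (0.11 or later) is on the classpath. The broker address and the topic names `topic-a` and `topic-b` are illustrative; only `transactional.id`, `enable.idempotence` and `isolation.level` come from the slides.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}

object TransactionalWriteSketch {
  // Consumer side: only messages from committed transactions are returned.
  val consumerProps: Properties = {
    val p = new Properties()
    p.put("bootstrap.servers", "localhost:9092") // assumed broker address
    p.put("group.id", "demo-group")              // hypothetical group id
    p.put("isolation.level", "read_committed")
    p
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("enable.idempotence", "true")
    props.put("transactional.id", "unique-id")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    try {
      producer.beginTransaction()
      // Both writes become visible together, or not at all.
      producer.send(new ProducerRecord[String, String]("topic-a", "k1", "v1"))
      producer.send(new ProducerRecord[String, String]("topic-b", "k1", "v1"))
      producer.commitTransaction()
    } catch {
      case _: ProducerFencedException | _: OutOfOrderSequenceException | _: AuthorizationException =>
        () // unrecoverable: the producer can only be closed
      case _: KafkaException =>
        producer.abortTransaction() // recoverable: discard the partial writes
    } finally {
      producer.close()
    }
  }
}
```

A consumer created with `consumerProps` (plus deserializers) would skip records from aborted transactions, while one left at the default `read_uncommitted` level would see them.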
  • 34. Exactly once stream processing ● Based on transactional read-process-write-commit pattern
  • 35. What’s coming in Future ● Spark is expected to support the new semantics from Kafka ● JIRA to follow ○ SPARK - https://issues.apache.org/jira/browse/SPARK-18057 ○ Blocking JIRA from KAFKA - https://issues.apache.org/jira/browse/KAFKA-4879 ● Kafka plans to make idempotent producer behaviour the default in later versions ○ https://issues.apache.org/jira/browse/KAFKA-5795 ● Structured Streaming continuous processing mode https://issues.apache.org/jira/browse/SPARK-20928
  • 36. References ● https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ ● https://databricks.com/session/introducing-exactly-once-semantics-in-apache-kafka ● https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html ● http://shashidhare.com/spark,/kafka/2017/03/23/spark-structured-streaming-with-kafka-advanced.html ● http://shashidhare.com/spark,/kafka/2017/01/14/spark-structured-streaming-with-kafka-basic.html
  • 37. ● Shashidhar E S ● Lead Solution Engineer at Databricks ● www.shashidhare.com