Spark Streaming + Kafka Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Or “A Case Study in Operationalizing Spark Streaming”
Context/Disclaimer
• Our use case: Build a resilient, scalable data pipeline with streaming ref data lookups, a 24hr stream self-join and some aggregation. Values accuracy over speed.
• Spark Streaming 1.5-1.6, Kafka 0.9
• Standalone Cluster (not YARN or Mesos)
• No Hadoop
• Message velocity: k/s. Batch window: 10s
• Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
Demo: Spark in Action
Game & Scoreboard Architecture
Outline
• Spark Streaming & Standalone Cluster Overview
• Design Patterns for Performance
• Guaranteed Message Processing & Direct Kafka Integration
• Operational Monitoring & Alerting
• Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
• RDD: Partitioned, resilient collection of data objects (lost partitions are recomputed from lineage; replication is optional)
• Driver: JVM that creates the Spark program and negotiates for resources. Handles scheduling of tasks but does not do the heavy lifting. Can become a bottleneck.
• Executor: Slave to the driver, executes tasks on RDD partitions. Functions are serialized and shipped to the executors.
• Lazy Execution: Transformations & Actions
• Cluster Types: Standalone, YARN, Mesos
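
A rough plain-Python analogy for lazy execution (not Spark code): generators separate transformations, which only describe work, from actions, which trigger it. The names `lazy_map`, `lazy_filter` and `calls` are illustrative and do not come from the talk.

```python
calls = []  # records each time a transformation actually runs


def lazy_map(fn, items):
    """Transformation: describes work; nothing runs until it is consumed."""
    for item in items:
        calls.append(("map", item))
        yield fn(item)


def lazy_filter(pred, items):
    """Transformation: also lazy, pulls from the stage before it."""
    for item in items:
        if pred(item):
            yield item


# Building the pipeline does no work yet.
pipeline = lazy_filter(lambda x: x > 10, lazy_map(lambda x: x * 10, range(4)))
work_before_action = len(calls)

# The "action" (here: list) pulls data through every stage.
result = list(pipeline)
work_after_action = len(calls)
```

In Spark the same split is why a chain of `map`/`filter` calls costs nothing until an action such as `count` or `foreachRDD` output runs.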
Spark Streaming & Standalone Cluster Overview
• Standalone Cluster
  • Each node
    • Master
    • Worker
    • Executor
    • Driver
  • Zookeeper cluster
Design Patterns for Performance
• Delegate all IO/CPU to the Executors
• Avoid unnecessary shuffles (join, groupBy, repartition)
• Externalize streaming joins & reference data lookups. Large/volatile ref data set.
  • JVM static hashmap
  • External cache (e.g. Redis)
  • Static LRU cache (amortize lookups)
  • RocksDB
• Hygienic function closures
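
A minimal sketch of the "static LRU cache" idea, in plain Python rather than Spark code. `REFERENCE_STORE` is a hypothetical stand-in for an external store such as Redis, and `lookup_reference` / `enrich_partition` are illustrative names, not the talk's implementation.

```python
from functools import lru_cache

# Hypothetical stand-in for an external reference-data store (e.g. Redis).
REFERENCE_STORE = {"hotel:1": "Seattle", "hotel:2": "Austin"}
stats = {"remote_lookups": 0}  # counts simulated network round trips


@lru_cache(maxsize=10_000)  # one cache per process, shared by every task in it
def lookup_reference(key):
    """Fetch one reference value; the store is only hit on a cache miss."""
    stats["remote_lookups"] += 1
    return REFERENCE_STORE.get(key)


def enrich_partition(records):
    """Meant to run on an executor, once per partition."""
    for record in records:
        yield {**record, "city": lookup_reference(record["hotel_key"])}


batch = [{"id": i, "hotel_key": "hotel:%d" % (1 + i % 2)} for i in range(1000)]
enriched = list(enrich_partition(batch))
```

Here 1000 records cost two store lookups. `lru_cache` has no expiry, so volatile reference data would need a TTL or explicit invalidation on top of this.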
We’re done, right?
Just need to QA the data…
70% missing data
Guaranteed Message Processing & Direct Kafka Integration
• Guaranteed Message Processing = At-least-once processing + idempotence
• Kafka Receiver
  • Consumes messages faster than Spark can process
  • Checkpoints before processing has finished
  • Inefficient CPU utilization
• Direct Kafka Integration
  • Control over checkpointing & transactionality
  • Better distribution of resource consumption
  • 1:1 Kafka Topic-partition to Spark RDD-partition
  • Use Kafka as WAL
  • Statelessness, Fail-fast
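
A sketch of "at-least-once processing + idempotence" in plain Python, with no Spark or Kafka client involved. The list `log` stands in for one Kafka partition, and `IdempotentSink` and `process_batch` are hypothetical names used only for illustration.

```python
class IdempotentSink:
    """In-memory stand-in for a store that supports keyed upserts."""

    def __init__(self):
        self.rows = {}

    def upsert(self, message_id, value):
        self.rows[message_id] = value  # writing the same key twice is harmless


def process_batch(log, from_offset, until_offset, sink, fail_at=None):
    """Process log[from_offset:until_offset] and return the offset to commit.

    Fail-fast: an error propagates before the return, so the committed
    offset never moves past unprocessed messages.
    """
    for offset in range(from_offset, until_offset):
        if offset == fail_at:
            raise RuntimeError("simulated crash at offset %d" % offset)
        message_id, value = log[offset]
        sink.upsert(message_id, value.upper())
    return until_offset


# The partition acts as the write-ahead log: it can be re-read by offset.
log = [("m0", "a"), ("m1", "b"), ("m2", "c"), ("m3", "d")]
sink = IdempotentSink()
committed = 0

try:
    committed = process_batch(log, committed, 4, sink, fail_at=2)
except RuntimeError:
    pass  # committed stays at 0, although m0 and m1 were already written

# Restart: replay from the last committed offset; m0 and m1 run twice.
committed = process_batch(log, committed, 4, sink)
```

The replay processes two messages twice, but the keyed upsert makes the end state identical to a single clean run.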
Operational Monitoring & Alerting
• Driver “Heartbeat”
• Batch processing time
• Message count
• Kafka lag (latest offsets vs last processed)
• Driver start events
• StatsD + Graphite + Seyren
• http://localhost:4040/metrics/json/
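
A sketch of two of the checks above: Kafka lag per partition, and batch processing time against the batch interval. The function names, the alert strings and the `max_total_lag` threshold are assumptions for illustration. A partition with no recorded progress is treated as processed up to offset 0.

```python
def kafka_lag(latest_offsets, processed_offsets):
    """Per-partition lag: latest broker offset minus last processed offset."""
    return {
        partition: latest - processed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


def check_batch(processing_seconds, batch_interval_seconds, lag_by_partition, max_total_lag):
    """Return a list of alert strings; an empty list means healthy."""
    alerts = []
    if processing_seconds >= batch_interval_seconds:
        alerts.append("batch processing time exceeds batch interval")
    if sum(lag_by_partition.values()) > max_total_lag:
        alerts.append("kafka lag above threshold")
    return alerts


lag = kafka_lag({0: 1500, 1: 900}, {0: 1450, 1: 900})
healthy = check_batch(4.2, 10, lag, max_total_lag=1000)
falling_behind = check_batch(12.5, 10, {0: 80000}, max_total_lag=1000)
```

In practice these values would be pushed to a metrics backend (the talk names StatsD + Graphite + Seyren) rather than returned as strings.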
Data loss fixed
So we’re done, right?
Cluster & app continuously crashing
Spark Cluster & App Stability
Spark slave memory utilization
Spark Cluster & App Stability
• Slave memory overhead
• OOM killer
• Crashes + Kafka Receiver = missing data
• Supervised driver: “--supervise” for spark-submit. Driver restart logging
• Cluster resource overprovisioning
• Standby Masters for failover
• Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
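
A back-of-the-envelope sketch of the memory headroom point: each JVM uses more than its configured heap, so the heaps plus overhead must fit under node RAM with room left for the OS, or the OOM killer steps in. The function name, the 25% overhead and the 2 GB OS reserve are illustrative assumptions, not Spark defaults or figures from the talk.

```python
def worker_memory_plan(node_ram_mb, os_reserve_mb, executor_heap_mb, executors, overhead_fraction):
    """Estimate whether executor heaps plus off-heap overhead fit on one node."""
    per_executor_mb = executor_heap_mb * (1 + overhead_fraction)
    needed_mb = per_executor_mb * executors
    available_mb = node_ram_mb - os_reserve_mb
    return {
        "needed_mb": needed_mb,
        "available_mb": available_mb,
        "headroom_mb": available_mb - needed_mb,
        "fits": needed_mb <= available_mb,
    }


# 16 GB node, 2 GB kept for the OS, two executors, assumed 25% JVM overhead.
safe = worker_memory_plan(16384, 2048, 4096, 2, 0.25)
risky = worker_memory_plan(16384, 2048, 7168, 2, 0.25)
```

With these assumed numbers the first plan leaves headroom and the second overcommits the node, which is the situation in which the OOM killer starts terminating processes.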
We’re done, right?
Finally, yes
Party Time
TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters
Thanks!
Links
• Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
• Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• App metrics: http://localhost:4040/metrics/json/
• MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
• sparkConf.set("spark.worker.cleanup.enabled", "true")


Editor's Notes

  1. Tell our story, to share learnings
  2. This was our use case, yours may be different
  3. This is our use case, yours may be different
  4. Live system to reason about
  5. Not necessarily the only way to set it up. Save IP space
  6. Ok, we built the app in the spark framework for scalability, made it fast,
  7. Pause, check on game player
  8. Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages. Config to ameliorate; an artifact of the absence of WAL/HDFS. Multiple data loss scenarios. Direct Kafka Integration = statelessness.
  9. Simple, at a glance: batch process time < batch interval. Strong checkpointing strategy (direct) + fail-fast / idempotent code, then driver heartbeat + Kafka lag = confidence.
  10. After a few days, we notice…
  11. After a few days, we notice…
  12. I thought resiliency was the promise of Spark. Resilient distributed datasets
  13. The app was crashing, but why
  14. Crashes while using Kafka Receiver = missing data. No WAL. Is Spark so flaky? Spark was being attacked by the operating system… and doing surprisingly well given the circumstances, especially with the Direct Kafka Integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence and checkpointing to survive multiple failures without causing problems.