Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Presented at the Seattle Spark meetup on March 23rd, 2017 hosted at Expedia. (https://www.meetup.com/Seattle-Spark-Meetup/events/230310598/) This presentation focuses on a case study of taking Spark Streaming to production using Kafka as a data source, and highlights best practices for different concerns of stream processing: 1. Spark Streaming & Standalone Cluster Overview 2. Design Patterns for Performance 3. Guaranteed Message Processing & Direct Kafka Integration 4. Operational Monitoring & Alerting 5. Spark Cluster & App Resilience

Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc

Or
“A Case Study in Operationalizing
Spark Streaming”

Context/Disclaimer
• Our use case: build a resilient, scalable data pipeline with streaming reference-data lookups, a 24-hour stream self-join, and some aggregation. The pipeline values accuracy over speed.
• Spark Streaming 1.5-1.6, Kafka 0.9
• Standalone cluster (not YARN or Mesos)
• No Hadoop
• Message velocity: thousands per second (k/s). Batch window: 10 s
• Data sources: Kafka (primary), Redis (joins + ref data) and S3 (ref data)

Outline
• Spark Streaming & Standalone Cluster Overview
• Design Patterns for Performance
• Guaranteed Message Processing & Direct Kafka Integration
• Operational Monitoring & Alerting
• Spark Cluster & App Resilience

Spark Streaming & Standalone Cluster Overview
• RDD: Partitioned, replicated collection of data objects
• Driver: JVM that creates Spark program, negotiates for resources. Handles scheduling of tasks but does not do heavy lifting. Bottlenecks.
• Executor: Slave to the driver, executes tasks on RDD partitions. Function serialization.
• Lazy Execution: Transformations & Actions
• Cluster Types: Standalone, YARN, Mesos
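
The "Lazy Execution: Transformations & Actions" point can be illustrated without a cluster. The sketch below is a plain-Python analogy, not the Spark API: the hypothetical `LazyDataset` class only records transformations (`map`, `filter`), and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any iterable of records
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # Action: only now is the recorded pipeline executed.
        records = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                records = map(fn, records)
            else:
                records = filter(fn, records)
        return list(records)


calls = []


def parse(line):
    calls.append(line)   # side effect lets us observe when work happens
    return int(line)


pipeline = LazyDataset(["1", "2", "3", "4"]).map(parse).filter(lambda n: n % 2 == 0)
work_before_action = len(calls)   # still zero: nothing has run yet
evens = pipeline.collect()        # triggers parsing and filtering
```

In Spark the same split applies: transformations build up a plan on the driver, and the executors do the work only when an action is invoked.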

Spark Streaming & Standalone Cluster Overview
• Standalone Cluster
  • Each node
    • Master
    • Worker
    • Executor
    • Driver
  • Zookeeper cluster

Design Patterns for Performance
• Delegate all IO/CPU to the Executors
• Avoid unnecessary shuffles (join, groupBy, repartition)
• Externalize streaming joins & reference data lookups. Large/volatile ref data set.
  • JVM static hashmap
  • External cache (e.g. Redis)
  • Static LRU cache (amortize lookups)
  • RocksDB
• Hygienic function closures
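
The "static LRU cache (amortize lookups)" pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the talk: `LRUCache`, `_REF_CACHE`, `load_from_store` and `enrich_partition` are hypothetical names, and an in-memory dict stands in for Redis. The cache lives at module level, so each executor process would create it once and reuse it across batches, and the per-partition function is the kind of callable one would hand to `mapPartitions` so that the lookups run on the executors.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache; one instance per executor process."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_load(self, key, loader):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = loader(key)                   # cache miss: go to the external store
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry
        return value


# Module-level ("static") cache: created once per process, reused across batches.
_REF_CACHE = LRUCache(capacity=2)

# Hypothetical stand-in for Redis; a real job would call a Redis client here.
_FAKE_STORE = {"SEA": "Seattle", "PDX": "Portland", "SFO": "San Francisco"}
store_lookups = []


def load_from_store(key):
    store_lookups.append(key)                 # count round trips to the store
    return _FAKE_STORE.get(key, "unknown")


def enrich_partition(records):
    """Enrich one partition's records, consulting the cache before the store."""
    for code in records:
        yield code, _REF_CACHE.get_or_load(code, load_from_store)


enriched = list(enrich_partition(["SEA", "SEA", "PDX", "SEA", "SFO", "PDX"]))
```

With a capacity of 2, the six records above cost four store lookups instead of six; a larger cache or a more skewed key distribution would amortize further.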

We’re done, right?
Just need to QA the data…

Guaranteed Message Processing & Direct Kafka Integration
• Guaranteed Message Processing = At-least-once processing + idempotence
• Kafka Receiver
  • Consumes messages faster than Spark can process
  • Checkpoints before processing finished
  • Inefficient CPU utilization
• Direct Kafka Integration
  • Control over checkpointing & transactionality
  • Better distribution of resource consumption
  • 1:1 Kafka Topic-partition to Spark RDD-partition
  • Use Kafka as WAL
  • Statelessness, Fail-fast
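
The equation "at-least-once processing + idempotence" can be made concrete with a small simulation. This is a sketch under stated assumptions rather than Spark or Kafka client code: `IdempotentSink` and `process_batch` are hypothetical, and the sink is a dict keyed by (topic, partition, offset). Offsets are recorded only after all output has succeeded, so a crash causes the same offset range to be replayed (Kafka acting as the write-ahead log), and the keyed upsert makes the replay harmless.

```python
class IdempotentSink:
    """Hypothetical output store keyed by (topic, partition, offset)."""

    def __init__(self):
        self.rows = {}

    def upsert(self, topic, partition, offset, value):
        # Writing the same key twice leaves one row, so replays are harmless.
        self.rows[(topic, partition, offset)] = value


def process_batch(messages, sink, committed, fail_after=None):
    """Write every message, then record offsets; a crash skips the commit."""
    for count, (topic, partition, offset, value) in enumerate(messages):
        if fail_after is not None and count == fail_after:
            raise RuntimeError("simulated executor crash")   # fail fast
        sink.upsert(topic, partition, offset, value.upper())
    # Offsets are stored only after all output has succeeded.
    for topic, partition, offset, _ in messages:
        key = (topic, partition)
        committed[key] = max(committed.get(key, -1), offset)


batch = [("events", 0, 10, "a"), ("events", 0, 11, "b"), ("events", 1, 7, "c")]
sink, committed = IdempotentSink(), {}

try:
    process_batch(batch, sink, committed, fail_after=2)   # crashes mid-batch
except RuntimeError:
    pass
rows_after_crash = len(sink.rows)        # partial output exists
offsets_after_crash = dict(committed)    # but no offsets were recorded

process_batch(batch, sink, committed)    # replay the same offset range
```

After the replay the sink holds exactly one row per message even though two of them were written twice, which is the property the receiver-based approach could not guarantee when it checkpointed before processing had finished.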

Operational Monitoring & Alerting
• Driver “Heartbeat”
• Batch processing time
• Message count
• Kafka lag (latest offsets vs last processed)
• Driver start events
• StatsD + Graphite + Seyren
• http://localhost:4040/metrics/json/
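
Two of the checks above reduce to simple arithmetic: Kafka lag is the latest offset minus the last processed offset per partition, and a healthy job keeps batch processing time below the batch interval (10 s in this deck). The sketch below shows both, plus metric lines in the StatsD text format (`name:value|g` for a gauge, `name:value|c` for a counter). The function names, the `pipeline` metric prefix, the sample offsets and the lag threshold are all illustrative assumptions.

```python
def kafka_lag(latest_offsets, processed_offsets):
    """Per-partition lag: latest offset in Kafka minus last processed offset."""
    return {
        tp: latest - processed_offsets.get(tp, 0)   # assumes 0 if never processed
        for tp, latest in latest_offsets.items()
    }


def statsd_gauges(prefix, values):
    """Format gauges in the StatsD line protocol: <name>:<value>|g"""
    return [f"{prefix}.{name}:{value}|g" for name, value in sorted(values.items())]


def alerts(total_lag, batch_seconds, batch_interval_seconds, max_lag):
    """Return human-readable alert messages for the two key health checks."""
    found = []
    if batch_seconds >= batch_interval_seconds:
        found.append("batch processing time exceeds batch interval")
    if total_lag > max_lag:
        found.append("kafka lag above threshold")
    return found


# Sample offsets keyed by (topic, partition); values are made up for illustration.
latest = {("events", 0): 1500, ("events", 1): 900}
processed = {("events", 0): 1480, ("events", 1): 900}

lag = kafka_lag(latest, processed)
total_lag = sum(lag.values())

lines = statsd_gauges("pipeline", {"kafka_lag": total_lag, "batch_seconds": 12.5})
lines.append("pipeline.heartbeat:1|c")   # driver heartbeat, emitted once per batch

problems = alerts(total_lag, batch_seconds=12.5, batch_interval_seconds=10, max_lag=1000)
```

Each string in `lines` could be sent to a StatsD daemon as a UDP datagram; an alerting tool such as Seyren would then watch the resulting Graphite series, including the absence of heartbeats.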

Spark Cluster & App Stability
Spark slave memory utilization

Spark Cluster & App Stability
• Slave memory overhead
• OOM killer
• Crashes + Kafka Receiver = missing data
• Supervised driver: “--supervise” for spark-submit. Driver restart logging
• Cluster resource overprovisioning
• Standby Masters for failover
• Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
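
The memory-overhead and OOM-killer points come down to a headroom budget per node: the operating system, the Spark daemons, and each executor's heap plus its non-heap JVM overhead all have to fit in physical memory. The sketch below does that arithmetic. Every number in it is hypothetical, including the 25% overhead fraction, which is an assumption for illustration and not a Spark default.

```python
def node_memory_headroom_mb(node_total_mb, os_reserved_mb, daemons_mb,
                            executors, executor_heap_mb, overhead_fraction):
    """Memory left on a node after the OS, Spark daemons and executors are counted."""
    per_executor_mb = executor_heap_mb * (1 + overhead_fraction)   # heap + JVM overhead
    used_mb = os_reserved_mb + daemons_mb + executors * per_executor_mb
    return node_total_mb - used_mb


# Hypothetical 16 GB node running master, worker and driver daemons plus 2 executors.
safe = node_memory_headroom_mb(
    node_total_mb=16384, os_reserved_mb=2048, daemons_mb=2048,
    executors=2, executor_heap_mb=4096, overhead_fraction=0.25,
)

# Same node with larger executor heaps: the budget goes negative.
risky = node_memory_headroom_mb(
    node_total_mb=16384, os_reserved_mb=2048, daemons_mb=2048,
    executors=2, executor_heap_mb=6144, overhead_fraction=0.25,
)

oom_risk = risky < 0   # negative headroom invites the Linux OOM killer
```

Sizing executor heaps so that the result stays comfortably positive is one way to read the "Spark slave memory headroom" advice in this deck.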

TL;DR
1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters

Spark Streaming
+ Kafka
Best Practices
Brandon O’Brien
@hakczar
Expedia, Inc
Thanks!

Links
• Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
• Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
• App metrics: http://localhost:4040/metrics/json/
• MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
• sparkConf.set("spark.worker.cleanup.enabled", "true")

4. Demo: Spark in Action

5. Game & Scoreboard Architecture

14. 70% missing data

20. Data loss fixed. So we’re done, right?

21. Cluster & app continuously crashing

26. We’re done, right? Finally, yes.

27. Party Time

Editor's Notes

Tell our story, to share learnings
This was our use case, yours may be different
Live system to reason about
Not necessarily the only way to set it up. Save IP space
Ok, we built the app in the spark framework for scalability, made it fast,
Pause, check on game player
Spark is hiding the fact that it can’t keep up with the stream. Crash + restart + bad checkpoint = missing messages. Config to ameliorate; an artifact of the absence of WAL/HDFS. Multiple data loss scenarios. Direct Kafka Integration = statelessness.
Simple, at a glance: batch processing time < batch interval. Strong checkpointing strategy (direct) + fail-fast, idempotent code, then driver heartbeat + Kafka lag = confidence.
After a few days, we notice…
I thought resiliency was the promise of Spark. Resilient distributed datasets
The app was crashing, but why
Crashes while using the Kafka Receiver = missing data. No WAL. Is Spark so flaky? Spark was being attacked by the operating system… and doing surprisingly well given the circumstances, especially with the Direct Kafka Integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence, and checkpointing to survive multiple failures without causing problems.
