Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

We present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.

Niche 1: Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
• Why? Custom queries on top of a table; we load the data once and query N times
• Why not Structured Streaming
• Working Solution using Redis

Niche 2: Distributed Counters
• Problems with Spark Accumulators
• Utilize Redis Hashes as distributed counters
• Precautions for retries and speculative execution
• Pipelining to improve performance

Redis + Apache Spark = Swiss Army Knife meets Kitchen Sink
Yeshwanth Vijayakumar
Sr. Engineering Manager/Architect @ Adobe

Agenda
§ Niche 1:
▪ Long Running Spark Batch Job - Dispatch New Jobs by polling a Redis Queue
§ Niche 2:
▪ Distributed Counters

Niche 1: Long Running Spark Batch Job

Problem Context
Run as many queries as possible in parallel on top of a denormalized dataframe
• Query 1: foo = 1
• Query 2: bar.baz > 120
• Query 3: state in [CA, NY]
• Query 1000

ProfileIds   field1   field1000   eventsArray
a@a.com      a        x           [e1,2,3]
b@g.com      b        x           [e1]
d@d.com      d        y           [e1,2,3]
z@z.com      z        y           [e1,2,3,5,7]

What do we need?
• Long Running Spark Batch Job
• Dispatch New Jobs by polling a Redis Queue
• We want to parametrize a Spark Action repeatedly for interactive results
• E.g. submit custom queries on top of a table
• We load the data once and query N times
• Bringing up a Spark Cluster per job has a latency cost
• Wasted time doing the same initialization actions multiple times
• Possible multi-tenancy

Why not Structured Streaming?
• Lack of access to the Spark Context within the executor context
• Can’t do a Spark action on top of a dataframe that is already loaded in the driver unless you do a join
• Doing a join is extremely limited

Working Solution Summary
• Blocking POP on Redis inside the driver; use the Command Pattern to send queries to the Redis queue
• Consume the commands and trigger Spark actions using a FAIR scheduler
• Communicate the status of a job through a microservice/database or Redis itself!
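
The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the talk: the queue name `query_queue`, the `status:<id>` key pattern, the JSON command shape and the `run_query` callable are all hypothetical, and `FakeRedis` is an in-memory stand-in so the block runs without a server. A redis-py client exposes the same `rpush`, `blpop` and `hset` calls. In a real driver, `run_query` would trigger a Spark action, with `spark.scheduler.mode` set to `FAIR` so that jobs submitted from different threads share the executors.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeRedis:
    """In-memory stand-in for the few redis-py calls used below (assumption)."""

    def __init__(self):
        self.lists = {}
        self.hashes = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, queue.Queue()).put(value)

    def blpop(self, key, timeout=0):
        # Like redis-py: returns (key, value), or None when the timeout expires.
        try:
            q = self.lists.setdefault(key, queue.Queue())
            return key, q.get(timeout=timeout or None)
        except queue.Empty:
            return None

    def hset(self, key, field, value):
        self.hashes.setdefault(key, {})[field] = value


def drive(client, run_query, queue_key="query_queue", idle_timeout=1, workers=4):
    """Pop commands until the queue stays empty for idle_timeout seconds."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            item = client.blpop(queue_key, timeout=idle_timeout)
            if item is None:
                break  # a real long-running driver would keep looping here
            command = json.loads(item[1])
            pool.submit(execute, client, run_query, command)


def execute(client, run_query, command):
    """Run one command and report its status back through Redis."""
    status_key = "status:" + command["id"]
    client.hset(status_key, "state", "RUNNING")
    try:
        # In Spark, this is where the action on the cached dataframe would run.
        result = run_query(command["query"])
        client.hset(status_key, "result", result)
        client.hset(status_key, "state", "DONE")
    except Exception as exc:
        client.hset(status_key, "error", str(exc))
        client.hset(status_key, "state", "FAILED")


# Hypothetical data and queries standing in for the dataframe and Spark actions.
rows = [{"foo": 1, "state": "CA"}, {"foo": 2, "state": "NY"}, {"foo": 1, "state": "WA"}]
queries = {"q1": lambda r: r["foo"] == 1, "q2": lambda r: r["state"] in ("CA", "NY")}

client = FakeRedis()
for name in queries:
    client.rpush("query_queue", json.dumps({"id": name, "query": name}))
drive(client, lambda name: sum(1 for r in rows if queries[name](r)), idle_timeout=0.1)
print(client.hashes)
```

The thread pool is what lets several queries run at once against data that was loaded a single time; the blocking pop keeps the driver idle, rather than spinning, while the queue is empty.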

Session Workflow – Spark Continuous Session
Submit Query API:
1. POST /preview
2. Check if result in Cache
3. Push <query> into queue
Spark Driver:
4. Pop queries till queue is empty [q1, q2, q3, q100]
Executor 1 ... Executor N:
• Executor Logic runs on the partitions of the Sample Dataframe
Fetch Results API:
1. GET /preview/<previewID>
2. Fetch Counters from Redis
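
The API side of this workflow might look like the sketch below. It is illustrative only: deriving the preview ID from a hash of the query text, the `preview:<id>` key pattern and the `query_queue` name are assumptions, and `FakeRedis` is an in-memory stand-in for the redis-py calls `exists`, `rpush` and `hgetall`.

```python
import hashlib
import json


class FakeRedis:
    """In-memory stand-in for the redis-py calls used below (assumption)."""

    def __init__(self):
        self.lists, self.hashes = {}, {}

    def exists(self, key):
        return int(key in self.hashes)

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)

    def hgetall(self, key):
        return dict(self.hashes.get(key, {}))


def submit_preview(client, query):
    """POST /preview: return a preview ID, queueing the query if not cached."""
    preview_id = hashlib.sha1(query.encode("utf-8")).hexdigest()[:12]
    if not client.exists("preview:" + preview_id):        # 2. check the cache
        command = {"id": preview_id, "query": query}
        client.rpush("query_queue", json.dumps(command))   # 3. push into queue
    return preview_id


def get_preview(client, preview_id):
    """GET /preview/<previewID>: fetch the counters written by the executors."""
    return client.hgetall("preview:" + preview_id)


client = FakeRedis()
pid = submit_preview(client, "state in [CA, NY]")
# Simulate executors having written per-partition counters for this preview.
client.hashes["preview:" + pid] = {"partition:1": 2, "partition:2": 1}
again = submit_preview(client, "state in [CA, NY]")  # cached, so not re-queued
print(get_preview(client, pid), len(client.lists["query_queue"]))
```

Note that this sketch only skips the queue once results exist; two identical submissions arriving before the first finishes would both be queued.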

Niche 2: Distributed Counters

What is wrong with Accumulators?
• Repeated Task Execution - Non-idempotency
• Task Failures and Retries
• Re-using a stage in repeated operations
• Speculative Execution
• Memory pressure on the driver on collect()
• Can’t access per-partition stats programmatically AFAIK
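
The non-idempotency problem can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark code: `Counter` is a hypothetical stand-in for an accumulator updated as a side effect inside a transformation, where Spark's documentation warns that a task's update may be applied more than once if the task or stage is re-executed.

```python
class Counter:
    """Stand-in for an accumulator updated as a side effect of a task."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def run_task(partition, counter):
    """Count the rows of one partition; the add is a non-idempotent side effect."""
    counter.add(len(partition))


partitions = [[1, 2, 3], [4, 5], [6]]
counter = Counter()

# Partition 1 is executed twice, as after a retry or a speculative copy.
executions = [0, 1, 2, 1]
for index in executions:
    run_task(partitions[index], counter)

true_total = sum(len(p) for p in partitions)
print("counted:", counter.value, "actual:", true_total)
```

The counter ends up larger than the real row count because the second run of partition 1 adds its rows again.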

What is wrong with Accumulators? - Example

Utilize Redis Hashes as distributed counters
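
One way a Redis hash can act as a distributed counter is sketched below. The design is an assumption, not necessarily the one used in the talk: each task writes its partition's count to its own hash field with `HSET`, so a retried or speculative run overwrites the same value instead of adding to it, and the per-partition fields stay readable individually. The `counters:<query>` key and `partition:<n>` field names are hypothetical, and `FakeRedis` stands in for the redis-py calls `hset` and `hvals`. In PySpark, `count_partition` would typically be invoked from something like `mapPartitionsWithIndex`.

```python
class FakeRedis:
    """In-memory stand-in for the redis-py calls used below (assumption)."""

    def __init__(self):
        self.hashes = {}

    def hset(self, key, field, value):
        self.hashes.setdefault(key, {})[field] = value

    def hvals(self, key):
        return list(self.hashes.get(key, {}).values())


def count_partition(client, query_id, partition_id, rows, predicate):
    """Executor side: count matches locally, then publish one field per partition."""
    matches = sum(1 for row in rows if predicate(row))
    # HSET overwrites, so re-running the same partition leaves the same value.
    client.hset("counters:" + query_id, "partition:%d" % partition_id, matches)


def total(client, query_id):
    """Reader side: sum the per-partition fields of the hash."""
    return sum(int(v) for v in client.hvals("counters:" + query_id))


partitions = [[{"foo": 1}, {"foo": 2}], [{"foo": 1}], [{"foo": 1}, {"foo": 1}]]
client = FakeRedis()
for pid in [0, 1, 2, 1]:  # partition 1 is executed twice
    count_partition(client, "q1", pid, partitions[pid], lambda r: r["foo"] == 1)
print(total(client, "q1"), client.hashes["counters:q1"])
```

Using `HINCRBY` on a single shared field would be simpler but would count the repeated partition twice, which is the accumulator problem over again.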

Digging into Redis Pipelining + Spark
Figure (from https://redis.io/topics/pipelining): Without Pipelining vs. With Pipelining
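
The effect of pipelining on round trips can be sketched as follows. `FakeRedis` and `FakePipeline` are in-memory stand-ins that simply count simulated round trips; they mimic the redis-py pattern of `client.pipeline(transaction=False)`, queued `hincrby` calls, and a final `execute()`. The `flush_counts` helper and its `batch_size` parameter are hypothetical names chosen for this illustration.

```python
class FakeRedis:
    """In-memory stand-in that counts simulated network round trips (assumption)."""

    def __init__(self):
        self.hashes = {}
        self.round_trips = 0

    def _apply(self, key, field, amount):
        h = self.hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

    def hincrby(self, key, field, amount=1):
        self.round_trips += 1  # one request/response per command
        return self._apply(key, field, amount)

    def pipeline(self, transaction=False):
        return FakePipeline(self)


class FakePipeline:
    """Buffers commands and sends them together on execute()."""

    def __init__(self, client):
        self.client = client
        self.commands = []

    def hincrby(self, key, field, amount=1):
        self.commands.append((key, field, amount))
        return self

    def execute(self):
        self.client.round_trips += 1  # the whole batch shares one round trip
        results = [self.client._apply(*c) for c in self.commands]
        self.commands = []
        return results


def flush_counts(client, key, counts, batch_size=100):
    """Send field increments in pipelined batches of batch_size commands."""
    pipe = client.pipeline(transaction=False)
    pending = 0
    for field, amount in counts.items():
        pipe.hincrby(key, field, amount)
        pending += 1
        if pending == batch_size:
            pipe.execute()
            pending = 0
    if pending:
        pipe.execute()


counts = {"field%d" % i: i for i in range(250)}
plain, piped = FakeRedis(), FakeRedis()
for field, amount in counts.items():
    plain.hincrby("counters", field, amount)
flush_counts(piped, "counters", counts, batch_size=100)
print("round trips without pipelining:", plain.round_trips)
print("round trips with pipelining:", piped.round_trips)
```

Batching bounds the size of each request, so a partition with many counters does not build one enormous pipeline in memory before sending it.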

Important Config Optimizations
Off-Heap Allocation
