SlideShare a Scribd company logo
1 of 23
Download to read offline
1
“Don’t make the assumption that things in
Spark just work, there is a good chance
that Spark underneath the hood is going to
do something unexpected”
2
Holden Karau
Apache Spark PMC
2017, BeeScala
@adipolak
What can go wrong with Data flows?
EVERYTHING
3
● new component logic
● new data source
● introducing incompatible schema
change
● Kafka broker failed to validate record
● Spark job runs twice, in parallel
● changing tables’ relationship keys
● accidentally delete yesterday’s
`events/` partition
● data duplication
@adipolak
Testing is hard!
Distributed data systems with many moving parts
4
Unit testing
Integration testing
System testing
E2E …
@adipolak
Chaos Engineering in the
World of Large-Scale Complex
Data Flow
Adi Polak
Treeverse
4 Principles:
6
• Defining a steady-state
• Acknowledging a variety of real-world events
• Running manual experiments in production
• Automating production experiments
@adipolak
Chaos Engineering
#1 - Steady state
• system’s throughput
• error rates
• latency percentiles
• etc
DevOps / SRE
7
• Data products requirements
Data Engineer
@adipolak
Data Product Requirements
8
• Data Quality
• Accuracy
• No duplications
• SLA
@adipolak
In 3 lines of code..
9
# Read data from file
df = spark.read.parquet('s3a://bank_transactions/ts=123123123/’)
# Some transformation over the data related to finance
updated_df = df… do some magic!
updated_df.write.mode('append').parquet('s3a://bank_transactions/ts=123123123/’)
1
2
3
4
5
6
7
8
9
10
@adipolak
Data Duplication
#2 - Vary Real-world Events
hardware failures like :
• servers dying
• spike in traffic
DevOps / SRE
10
• schema change
• corrupted data record
• data variance
• accidentally delete yesterday’s `events/`
partition
Data Engineer
@adipolak
11
@adipolak
#3 – Run Experiments in Production
• Production machines and network
DevOps / SRE
12
• Production data
Data Engineer
@adipolak
🙀
Experimental environment
On production data in production ..
13
s3://data-repo/collections/foo
s3://data-repo/exp1/collections/foo
@adipolak
★
★
#4 – Automate Experiments to Run
Continuously in Prod
• Discover weaknesses
DevOps / SRE
14
• Manage Data lifecycle in stages
Data Engineer
@adipolak
& Minimize Blast Radius
Roll Back
Troubleshoot
Production
Version control
Best Practices & Data
Quality
Deployment
Data product stages & requirements
Experimentation
Debug
Collaborate
.
Development
15
@adipolak
16
What's the best
way to automate
data stages
propagation in
production?
@adipolak
Branching Strategy
Like source control -
17
@adipolak
18
@adipolak
Git for Data
Challenge Solution
$ spark.read.parquet
(‘s3://my-repo/<commit_id>’)
$ lakectl revert main^1
$ lakectl branch create my-branch
$ lakectl merge my-branch
main
$ lakectl branch create new-logic
$ lakectl merge new-logic main
Quickly recover from an error
Develop in Isolation
Troubleshoot
Atomic Update
Reprocess
Git for Data
21
How Does lakeFS Work?
@adipolak
22
What Does A Typical Environment Look
Like?
@adipolak
23
Recap
• Adopting Principles of Chaos Engineering to data systems
• Data Lifecycle Management stages
• Git for Data
• lakeFS

More Related Content

Similar to Chaos Engineering and How to Manage Data Stages With Adi Polak | Current 2022

SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
Databricks
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Kristofferson A
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
Databricks
 

Similar to Chaos Engineering and How to Manage Data Stages With Adi Polak | Current 2022 (20)

Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Common SQL Server Mistakes and How to Avoid Them with Tim Radney
Common SQL Server Mistakes and How to Avoid Them with Tim RadneyCommon SQL Server Mistakes and How to Avoid Them with Tim Radney
Common SQL Server Mistakes and How to Avoid Them with Tim Radney
 
Data as a Service
Data as a Service Data as a Service
Data as a Service
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Python performance profiling
Python performance profilingPython performance profiling
Python performance profiling
 
Migrating to Database 12c Multitenant - New Opportunities To Get It Right!
Migrating to Database 12c Multitenant - New Opportunities To Get It Right!Migrating to Database 12c Multitenant - New Opportunities To Get It Right!
Migrating to Database 12c Multitenant - New Opportunities To Get It Right!
 
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
DataStax | DataStax Tools for Developers (Alex Popescu) | Cassandra Summit 2016
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamFrom Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
Application Quality Gates in Continuous Delivery: Deliver Better Software Fas...
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 

More from HostedbyConfluent

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Chaos Engineering and How to Manage Data Stages With Adi Polak | Current 2022

  • 1. 1
  • 2. “Don’t make the assumption that things in Spark just work, there is a good chance that Spark underneath the hood is going to do something unexpected” 2 Holden Karau Apache Spark PMC 2017, BeeScala @adipolak
  • 3. What can go wrong with Data flows? EVERYTHING 3 ● new component logic ● new data source ● introducing incompatible schema change ● Kafka broker failed to validate record ● Spark job runs twice, in parallel ● changing tables’ relationship keys ● accidentally delete yesterday’s `events/` partition ● data duplication @adipolak
  • 4. Testing is hard! Distributed data systems with many moving parts 4 Unit testing Integration testing System testing E2E … @adipolak
  • 5. Chaos Engineering in the World of Large-Scale Complex Data Flow Adi Polak Treeverse
  • 6. 4 Principles: 6 • Defining a steady-state • Acknowledging a variety of real-world events • Running manual experiments in production • Automating production experiments @adipolak Chaos Engineering
  • 7. #1 - Steady state • system’s throughput • error rates • latency percentiles • etc DevOps / SRE 7 • Data products requirements Data Engineer @adipolak
  • 8. Data Product Requirements 8 • Data Quality • Accuracy • No duplications • SLA @adipolak
  • 9. In 3 lines of code.. 9 # Read data from file df = spark.read.parquet('s3a://bank_transactions/ts=123123123/’) # Some transformation over the data related to finance updated_df = df… do some magic! updated_df.write.mode('append').parquet('s3a://bank_transactions/ts=123123123/’) 1 2 3 4 5 6 7 8 9 10 @adipolak Data Duplication
  • 10. #2 - Vary Real-world Events hardware failures like : • servers dying • spike in traffic DevOps / SRE 10 • schema change • corrupted data record • data variance • accidentally delete yesterday’s `events/` partition Data Engineer @adipolak
  • 12. #3 – Run Experiments in Production • Production machines and network DevOps / SRE 12 • Production data Data Engineer @adipolak 🙀
  • 13. Experimental environment On production data in production .. 13 s3://data-repo/collections/foo s3://data-repo/exp1/collections/foo @adipolak ★ ★
  • 14. #4 – Automate Experiments to Run Continuously in Prod • Discover weaknesses DevOps / SRE 14 • Manage Data lifecycle in stages Data Engineer @adipolak & Minimize Blast Radius
  • 15. Roll Back Troubleshoot Production Version control Best Practices & Data Quality Deployment Data product stages & requirements Experimentation Debug Collaborate . Development 15 @adipolak
  • 16. 16 What's the best way to automate data stages propagation in production? @adipolak
  • 17. Branching Strategy Like source control - 17 @adipolak
  • 19. Challenge Solution $ spark.read.parquet (‘s3://my-repo/<commit_id>’) $ lakectl revert main^1 $ lakectl branch create my-branch $ lakectl merge my-branch main $ lakectl branch create new-logic $ lakectl merge new-logic main Quickly recover from an error Develop in Isolation Troubleshoot Atomic Update Reprocess Git for Data
  • 20.
  • 21. 21 How Does lakeFS Work? @adipolak
  • 22. 22 What Does A Typical Environment Look Like? @adipolak
  • 23. 23 Recap • Adopting Principles of Chaos Engineering to data systems • Data Lifecycle Management stages • Git for Data • lakeFS