This document discusses applying chaos engineering principles to complex data systems. It recommends defining a steady state, acknowledging real-world events, running manual experiments in production, and automating production experiments. This helps discover weaknesses and manage the data lifecycle through stages like development, experimentation, deployment, and production. It also presents the idea of using a "Git for Data" system like lakeFS to version data and enable features like branching, rolling back, and atomic updates to more easily develop, test, and recover from errors in data pipelines.
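The branch-and-merge pattern described above can be sketched in plain Python. This is a toy simulation of the semantics (isolated branches, atomic promotion on merge), not the actual lakeFS API; all names here are hypothetical.

```python
import copy

class VersionedStore:
    """Toy illustration of 'Git for Data' semantics: branches are
    isolated snapshots; a merge publishes the branch atomically."""

    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {path: data}

    def branch(self, name, source="main"):
        # A new branch starts as a snapshot of the source branch
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, source, target="main"):
        # Atomic promotion: readers of `target` see all changes at once
        self.branches[target] = copy.deepcopy(self.branches[source])

store = VersionedStore()
store.write("main", "events/2021-01-01", ["a", "b"])
store.branch("experiment")                       # isolate the pipeline run
store.write("experiment", "events/2021-01-02", ["c"])
# main is untouched until the merge, so a bad run is easy to discard
assert "events/2021-01-02" not in store.branches["main"]
store.merge("experiment")                        # publish atomically
assert store.branches["main"]["events/2021-01-02"] == ["c"]
```

Rolling back is the same idea in reverse: throw the branch away instead of merging it.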
2. “Don’t make the assumption that things in Spark just work, there is a good chance that Spark underneath the hood is going to do something unexpected”
Holden Karau
Apache Spark PMC
2017, BeeScala
@adipolak
3. What can go wrong with Data flows?
EVERYTHING
● new component logic
● new data source
● introducing an incompatible schema change
● Kafka broker failed to validate a record
● Spark job runs twice, in parallel
● changing tables’ relationship keys
● accidentally deleting yesterday’s `events/` partition
● data duplication
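Several of these failures, such as an incompatible schema change, can be caught with a cheap guard before records enter the pipeline. A minimal sketch, assuming a hypothetical field-to-type contract:

```python
# Hypothetical schema contract for incoming records
EXPECTED_SCHEMA = {"id": int, "amount": float, "ts": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of violations instead of failing mid-pipeline."""
    problems = []
    for field, ftype in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

ok = validate_record({"id": 1, "amount": 9.99, "ts": "2021-06-01"})
# Upstream change: `id` became a string and `amount` was dropped
bad = validate_record({"id": "1", "ts": "2021-06-01"})
assert ok == []
assert "missing field: amount" in bad
```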
4. Testing is hard!
Distributed data systems with many moving parts
Unit testing
Integration testing
System testing
E2E …
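Even when end-to-end testing of a distributed system is hard, the pure transformation logic can still be unit tested in isolation. A sketch with a hypothetical dedupe step:

```python
def dedupe_by_key(rows, key):
    """Keep the last record seen per key -- a typical small, pure
    transformation that is cheap to unit test without a cluster."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())

# Unit test: no Spark cluster needed for the core logic
rows = [{"id": 1, "v": "old"}, {"id": 2, "v": "x"}, {"id": 1, "v": "new"}]
result = dedupe_by_key(rows, "id")
assert len(result) == 2
assert {"id": 1, "v": "new"} in result
```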
5. Chaos Engineering in the
World of Large-Scale Complex
Data Flow
Adi Polak
Treeverse
6. 4 Principles:
• Defining a steady-state
• Acknowledging a variety of real-world events
• Running manual experiments in production
• Automating production experiments
Chaos Engineering
7. #1 - Steady state
• system’s throughput
• error rates
• latency percentiles
• etc
DevOps / SRE
• Data product requirements
Data Engineer
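A steady state like the one above can be encoded as a simple predicate over observed metrics. The thresholds below are illustrative assumptions, not recommendations:

```python
import math

def steady_state(latencies_ms, errors, requests,
                 p99_budget_ms=250.0, max_error_rate=0.01):
    """Check observed metrics against a steady-state definition:
    p99 latency within budget AND error rate within tolerance."""
    ordered = sorted(latencies_ms)
    idx = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    p99 = ordered[idx]
    error_rate = errors / requests
    return p99 <= p99_budget_ms and error_rate <= max_error_rate

healthy = steady_state([10, 12, 15, 200], errors=1, requests=1000)
degraded = steady_state([10, 12, 15, 900], errors=1, requests=1000)
assert healthy is True
assert degraded is False
```

An experiment that pushes this predicate to `False` has found a weakness worth investigating.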
9. In 3 lines of code..
# Read data from file
df = spark.read.parquet('s3a://bank_transactions/ts=123123123/')
# Some transformation over the data related to finance
updated_df = df…  # do some magic!
# Append the result back to the SAME path
updated_df.write.mode('append').parquet('s3a://bank_transactions/ts=123123123/')
Data Duplication
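Why does the snippet duplicate data? The append write is not idempotent: re-running the job (a retry, or the same job scheduled twice in parallel) adds its output on top of the previous run. A toy simulation with a Python list standing in for the partition:

```python
source = [100.0, 250.0, 40.0]    # toy transaction amounts

partition_append = []            # stands in for the append-mode target path

def job_append(rows):
    # mode('append') semantics: every run adds its output again
    partition_append.extend(r * 1.1 for r in rows)

job_append(source)
job_append(source)               # retry / duplicate scheduled run
assert len(partition_append) == 6    # rows duplicated!

# Idempotent alternative, akin to overwriting the partition
partition_overwrite = []

def job_overwrite(rows):
    partition_overwrite[:] = [r * 1.1 for r in rows]

job_overwrite(source)
job_overwrite(source)
assert len(partition_overwrite) == 3  # re-running is safe
```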
10. #2 - Vary Real-world Events
hardware failures like:
• servers dying
• spike in traffic
DevOps / SRE
• schema change
• corrupted data record
• data variance
• accidentally delete yesterday’s `events/` partition
Data Engineer
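One way to vary such events in an experiment is deterministic fault injection: corrupt a seeded random fraction of records and check that the pipeline survives. A minimal sketch (all names hypothetical):

```python
import random

def inject_faults(records, seed=42, corrupt_rate=0.2):
    """Chaos-style fault injection for a data pipeline test:
    deterministically corrupt a fraction of records."""
    rng = random.Random(seed)        # seeded, so the experiment is repeatable
    out = []
    for rec in records:
        if rng.random() < corrupt_rate:
            out.append({**rec, "amount": None})  # simulate a corrupted field
        else:
            out.append(rec)
    return out

def robust_sum(records):
    # The pipeline under test must tolerate the injected corruption
    return sum(r["amount"] for r in records if r["amount"] is not None)

records = [{"amount": 10}] * 100
faulty = inject_faults(records)
assert robust_sum(faulty) < 1000     # some records were corrupted
assert robust_sum(faulty) > 0        # but the pipeline still produces output
```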
14. #4 – Automate Experiments to Run Continuously in Prod & Minimize Blast Radius
• Discover weaknesses
DevOps / SRE
• Manage data lifecycle in stages
Data Engineer
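An automated experiment loop that also minimizes the blast radius might first apply the fault to a small sample, and only touch the full dataset once the pipeline survives the sample. A sketch under those assumptions (all names hypothetical):

```python
def run_experiment(dataset, fault, pipeline, sample_size=10):
    """Automate a chaos experiment while limiting the blast radius:
    try the fault on a small sample before the full dataset."""
    sample = dataset[:sample_size]
    try:
        pipeline(fault(sample))
    except Exception:
        # Weakness found cheaply; the full dataset was never touched
        return "aborted: weakness found on sample"
    pipeline(fault(dataset))
    return "completed on full dataset"

def drop_every_other(rows):          # the injected real-world event
    return rows[::2]

def pipeline(rows):                  # toy pipeline: must see non-empty input
    if not rows:
        raise ValueError("empty input")
    return sum(rows)

result = run_experiment(list(range(100)), drop_every_other, pipeline)
assert result == "completed on full dataset"
```

In a continuously running setup, this loop would be scheduled against production with the sample-first gate as the safety valve.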