SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
www.mapflat.com
Test strategies for data
processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
1
www.mapflat.com
Who’s talking?
Swedish Institute of Computer. Science. (test & debug tools)
Sun Microsystems (large machine verification)
Google (Hangouts, productivity)
Recorded Future (NLP startup) (data integrations)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling, productivity)
Schibsted Products & Tech (data processing & modelling)
Mapflat (independent data engineering consultant)
2
www.mapflat.com
Agenda
● Data applications from a test perspective
● Testing stream processing product
● Testing batch processing products
● Data quality testing
Main focus is functional, regression testing
Prerequisites: Backend dev testing, basic data experience
3
www.mapflat.com
Test value
For data-centric applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% bad
● Data quality
○ Challenging, more important than
● Technical quality
○ Technical failure => ops hassle, stale data
4
www.mapflat.com
Streaming data product anatomy
5
Pub / sub
Unified log
Ingress Stream processing Egress
DB
Service
TopicJob
Pipeline
Service
Export
Business
intelligence
DB
www.mapflat.com
Test harness
Test
fixture
Test concepts
6
System under test
(SUT)
3rd party
component
(e.g. DB)
3rd party
component
3rd party
component
Test
input
Test
oracle
Test framework (e.g. JUnit, Scalatest)
Seam
IDEs
Build
tools
www.mapflat.com
Test scopes
7
Unit
Class
Component
Integration
System / acceptance
● Pick stable seam
● Small scope
○ Fast?
○ Easy, simple?
● Large scope
○ Real app value?
○ Slow, unstable?
● Maintenance, cost
○ Pick few SUTs
www.mapflat.com
● Output = function(input, code)
○ No external factors => deterministic
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
Data-centric application properties
8
q
www.mapflat.com
● Output = function(input, code)
○ Perfect for test!
○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
○ Focus on large tests
● Internal seams => high maintenance, low value
○ Omit unit tests, mocks, dependency injection!
● Long pipelines crosses teams
○ Need for end-to-end tests, but culture challenge
Data-centric app test properties
9
q
www.mapflat.com
Suitable seams, streaming
10
Pub / sub
DB
Service
TopicJob
Pipeline
Service
Export
Business
intelligence
DB
www.mapflat.com
2, 7. Scalatest. 4. Spark Streaming jobs
IDE, CI, debug integration
Streaming SUT, example harness
11
DB
Topic
Kafka
5. Test
input
6. Test
oracle
3. Docker
1. IDE / Gradle
Polling
www.mapflat.com
Test lifecycle
12
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence test, send dummy sync messages at end.
www.mapflat.com
Input generation
13
● Input & output is denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible =>
new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
www.mapflat.com
Test oracles
14
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
www.mapflat.com
Batch data product anatomy
15
Cluster storage
Unified log
Ingress ETL Egress
Data
lake
DB
Service
DatasetJob
Pipeline
Service
Export
Business
intelligence
DB
DB
Import
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ Finite set of homogeneous records
○ E.g. MobileAdImpressions(hour=”2016-02-06T13”)
16
www.mapflat.com
Directory datasets
17
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
part-00000.json
part-00001.json
● Some tools, e.g. Spark, understand Hive name
conventions
Dataset
class
Instance parameters,
Hive convention
Seal PartitionsPrivacy
level
Schema
version
Batch processing
● outDatasets =
code(inDatasets)
● Component that scale up
○ Spark, (Flink,
Scalding, Crunch)
● And scale down
○ Local mode
○ Most jobs fit in one
machine
18
Gradual refinement
1. Wash
- time shuffle, dedup, ...
2. Decorate
- geo, demographic, ...
3. Domain model
- similarity, clusters, ...
4. Application model
- Recommendations, ...
www.mapflat.com
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
Suggested: Luigi / Airflow
19
DB
`
www.mapflat.com
ClientSessions A/B tests
DSL DAG example (Luigi)
20
class ClientActions(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + 
[UserDB(date=self.hour.date)]
...
class ClientSessions(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
...
class SessionsABResults(SparkSubmitTask):
hour = DateHourParameter()
def requires(self):
return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]
def output(self):
return HdfsTarget(“hdfs://production/red/ab_sessions/v1/” +
“{:year=%Y/month=%m/day=%d/hour=%H}”.format(self.hour))
...
Actions, hourly
● Expressive, embedded DSL - a must for ingress, egress
○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
UserDB
Time shuffle,
user decorate
Form sessions
A/B compare
ClientActions, daily
A/B session
evaluationDataset instance
Job (aka Task) classes
www.mapflat.com
Batch processing testing
● Omit collection in SUT
○ Technically different
● Avoid clusters
○ Slow tests
● Seams
○ Between jobs
○ End of pipelines
21
Data lake
Artifact of business value
E.g. service index
Job
Pipeline
www.mapflat.com
Batch test frameworks - don’t
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in
When switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
22
Test input Test output
Job
www.mapflat.com
Testing single job
23
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run in local mode 3. Verify output
f() p()
Don’t commit -
expensive to maintain.
Generate / verify with
code.
Runs well in
CI / from IDE
Job
www.mapflat.com
Testing pipelines - two options
24
Standard Scalatest harness
file://test_input/ file://test_output/
1. Generate input 2. Run custom multi-job 3. Verify output
f() p()
A:
Customised workflow manager setup
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job
maintenance
p()
+ Tests workflow logic
+ More authentic
- Workflow mgr setup
for testability
- Difficult to debug
- Dataset handling
with Python
f()
B:
● Both can be extended with egress DBs
Test job with sequence of jobs
www.mapflat.com
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake
implementations
○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate PaaS service as fixture component
○ Distribute access tokens, etc
○ Pay $ or $$$
25
www.mapflat.com
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
26
www.mapflat.com
Hadoop / Spark counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
Obtaining quality metrics
27
DB
Quality assessment job
www.mapflat.com
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for
publishing dataset
● Push aggregates to DB
○ Standard ops:
monitor, alert
28
DB
∆?
Code ∆!
www.mapflat.com
RAM
Input
Computer program anatomy
29
Input data
Process Output
File
HID
VariableFunction
Execution path
Lookup
structure
Output
data
Window
File
File
Device
www.mapflat.com
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location,
duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document “golden path”
30
www.mapflat.com
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
31
www.mapflat.com
Further resources. Questions?
32
http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-
pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-
2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-
Practices-Anupama-Shetty-Neil-Marshall.pdf
www.mapflat.com
Bonus slides
33
www.mapflat.com
Unified log
● Replicated append-only log
● Pub / sub with history
● Decoupled producers and consumers
○ In source/deployment
○ In space
○ In time
● Recovers from link failures
● Replay on transformation bug fix
34
Credits: Confluent
www.mapflat.com
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
○ Serving index
○ SQL / cubes for Analytics
● Stream egress
○ Services subscribe to topic
○ E.g. REST post for export
35
Credits: Apache Samza
www.mapflat.com
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch
processing perspective
● Kept as long as permitted
● Technically homogeneous
○ Except for raw imports
36
Cluster storage
Data lake
www.mapflat.com
Batch pipelines
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
37
Cluster storage
Data lake
Pristine,
immutable
datasets
Intermediate
Derived,
regenerable
www.mapflat.com
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
Batch job
38
q
www.mapflat.com
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Data pipelines
www.mapflat.com
Data platform, pipeline chains
Common data infrastructure
Productivity, privacy, end-to-end agility, complexity
Beware: producer-consumer disconnect
www.mapflat.com
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write full DB table / CF, switch service to use it, never
change it.
41
www.mapflat.com
Egress datasets
● Serving
○ Precomputed user query answers
○ Denormalised
○ Cassandra, (many)
● Export & Analytics
○ SQL (single node / Hive, Presto, ..)
○ Workbenches (Zeppelin)
○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
○ Prepare to redirect pipelines
42
Deployment
43
Hg/git
repo Luigi DSL, jars, config
my-pipe-7.tar.gz
HDFS
Luigi
daemon
> pip install my-pipe-7.tar.gz
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Worker
Redundant cron schedule, higher
frequency + backfill (Luigi range tools)
* 10 * * * bin/my_pipe_daily 
--backfill 14
All that a pipeline needs, installed atomically

Mais conteúdo relacionado

Mais procurados

Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Databricks
 

Mais procurados (20)

Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsAutomated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Key Considerations While Rolling Out Denodo Platform
Key Considerations While Rolling Out Denodo PlatformKey Considerations While Rolling Out Denodo Platform
Key Considerations While Rolling Out Denodo Platform
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Google Cloud Composer
Google Cloud ComposerGoogle Cloud Composer
Google Cloud Composer
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelin...
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 

Destaque

[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림
NAVER D2
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
NAVER D2
 

Destaque (17)

Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago Large scale data processing pipelines at trivago
Large scale data processing pipelines at trivago
 
[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준[246] foursquare데이터라이프사이클 설현준
[246] foursquare데이터라이프사이클 설현준
 
[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림[211]대규모 시스템 시각화 현동석김광림
[211]대규모 시스템 시각화 현동석김광림
 
[115] clean fe development_윤지수
[115] clean fe development_윤지수[115] clean fe development_윤지수
[115] clean fe development_윤지수
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
 
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
[125]react로개발자2명이플랫폼4개를서비스하는이야기 심상민
 
[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우[112]rest에서 graph ql과 relay로 갈아타기 이정우
[112]rest에서 graph ql과 relay로 갈아타기 이정우
 
[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영[236] 카카오의데이터파이프라인 윤도영
[236] 카카오의데이터파이프라인 윤도영
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Real-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 

Semelhante a Test strategies for data processing pipelines

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
confluent
 

Semelhante a Test strategies for data processing pipelines (20)

Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to StreamingBravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Druid Optimizations for Scaling Customer Facing Analytics
Druid Optimizations for Scaling Customer Facing AnalyticsDruid Optimizations for Scaling Customer Facing Analytics
Druid Optimizations for Scaling Customer Facing Analytics
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
OOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with ParallelOOW13 Exadata and ODI with Parallel
OOW13 Exadata and ODI with Parallel
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Under the hood of the Altalis Platform
Under the hood of the Altalis PlatformUnder the hood of the Altalis Platform
Under the hood of the Altalis Platform
 
A primer on building real time data-driven products
A primer on building real time data-driven productsA primer on building real time data-driven products
A primer on building real time data-driven products
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita...
 

Mais de Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
Lars Albertsson
 

Mais de Lars Albertsson (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 

Último

Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
cnajjemba
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Último (20)

Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Test strategies for data processing pipelines

  • 1. www.mapflat.com Test strategies for data processing pipelines Lars Albertsson, independent consultant (Mapflat) Øyvind Løkling, Schibsted Products & Technology 1
  • 2. www.mapflat.com Who’s talking? Swedish Institute of Computer. Science. (test & debug tools) Sun Microsystems (large machine verification) Google (Hangouts, productivity) Recorded Future (NLP startup) (data integrations) Cinnober Financial Tech. (trading systems) Spotify (data processing & modelling, productivity) Schibsted Products & Tech (data processing & modelling) Mapflat (independent data engineering consultant) 2
  • 3. www.mapflat.com Agenda ● Data applications from a test perspective ● Testing stream processing product ● Testing batch processing products ● Data quality testing Main focus is functional, regression testing Prerequisites: Backend dev testing, basic data experience 3
  • 4. www.mapflat.com Test value For data-centric applications, in this order: ● Productivity ○ Move fast without breaking things ● Fast experimentation ○ 10% good ideas, 90% bad ● Data quality ○ Challenging, more important than ● Technical quality ○ Technical failure => ops hassle, stale data 4
  • 5. www.mapflat.com Streaming data product anatomy 5 Pub / sub Unified log Ingress Stream processing Egress DB Service TopicJob Pipeline Service Export Business intelligence DB
  • 6. www.mapflat.com Test harness Test fixture Test concepts 6 System under test (SUT) 3rd party component (e.g. DB) 3rd party component 3rd party component Test input Test oracle Test framework (e.g. JUnit, Scalatest) Seam IDEs Build tools
  • 7. www.mapflat.com Test scopes 7 Unit Class Component Integration System / acceptance ● Pick stable seam ● Small scope ○ Fast? ○ Easy, simple? ● Large scope ○ Real app value? ○ Slow, unstable? ● Maintenance, cost ○ Pick few SUTs
  • 8. www.mapflat.com ● Output = function(input, code) ○ No external factors => deterministic ● Pipeline and job endpoints are stable ○ Correspond to business value ● Internal abstractions are volatile ○ Reslicing in different dimensions is common Data-centric application properties 8 q
  • 9. www.mapflat.com ● Output = function(input, code) ○ Perfect for test! ○ Avoid: external service calls, wall clock ● Pipeline/job edges are suitable seams ○ Focus on large tests ● Internal seams => high maintenance, low value ○ Omit unit tests, mocks, dependency injection! ● Long pipelines crosses teams ○ Need for end-to-end tests, but culture challenge Data-centric app test properties 9 q
  • 10. www.mapflat.com Suitable seams, streaming 10 Pub / sub DB Service TopicJob Pipeline Service Export Business intelligence DB
  • 11. www.mapflat.com 2, 7. Scalatest. 4. Spark Streaming jobs IDE, CI, debug integration Streaming SUT, example harness 11 DB Topic Kafka 5. Test input 6. Test oracle 3. Docker 1. IDE / Gradle Polling
  • 12. www.mapflat.com Test lifecycle 12 1. Start fixture containers 2. Await fixture ready 3. Allocate test case resources 4. Start jobs 5. Push input data to Kafka 6. While (!done && !timeout) { pollDatabase(); sleep(1ms) } 7. While (moreTests) { Goto 3 } 8. Tear down fixture For absence test, send dummy sync messages at end.
  • 13. www.mapflat.com Input generation 13 ● Input & output is denormalised & wide ● Fields are frequently changed ○ Additions are compatible ○ Modifications are incompatible => new, similar data type ● Static test input, e.g. JSON files ○ Unmaintainable ● Input generation routines ○ Robust to changes, reusable
  • 14. www.mapflat.com Test oracles 14 ● Compare with expected output ● Check fields relevant for test ○ Robust to field changes ○ Reusable for new, similar types ● Tip: Use lenses ○ JSON: JsonPath (Java), Play JSON (Scala) ○ Case classes: Monocle ● Express invariants for each data type
  • 15. www.mapflat.com Batch data product anatomy 15 Cluster storage Unified log Ingress ETL Egress Data lake DB Service DatasetJob Pipeline Service Export Business intelligence DB DB Import
  • 16. Datasets ● Pipeline equivalent of objects ● Dataset class == homogeneous records, open-ended ○ Compatible schema ○ E.g. MobileAdImpressions ● Dataset instance = dataset class + parameters ○ Immutable ○ Finite set of homogeneous records ○ E.g. MobileAdImpressions(hour=”2016-02-06T13”) 16
  • 17. www.mapflat.com Directory datasets 17 hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS part-00000.json part-00001.json ● Some tools, e.g. Spark, understand Hive name conventions Dataset class Instance parameters, Hive convention Seal PartitionsPrivacy level Schema version
  • 18. Batch processing ● outDatasets = code(inDatasets) ● Component that scale up ○ Spark, (Flink, Scalding, Crunch) ● And scale down ○ Local mode ○ Most jobs fit in one machine 18 Gradual refinement 1. Wash - time shuffle, dedup, ... 2. Decorate - geo, demographic, ... 3. Domain model - similarity, clusters, ... 4. Application model - Recommendations, ...
  • 19. www.mapflat.com Workflow manager ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ● Backfills for previous failures ● DSL describes dependencies ● Includes ingress & egress Suggested: Luigi / Airflow 19 DB `
  • 20. www.mapflat.com ClientSessions A/B tests DSL DAG example (Luigi) 20 class ClientActions(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] + [UserDB(date=self.hour.date)] ... class ClientSessions(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)] ... class SessionsABResults(SparkSubmitTask): hour = DateHourParameter() def requires(self): return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)] def output(self): return HdfsTarget(“hdfs://production/red/ab_sessions/v1/” + “{:year=%Y/month=%m/day=%d/hour=%H}”.format(self.hour)) ... Actions, hourly ● Expressive, embedded DSL - a must for ingress, egress ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline UserDB Time shuffle, user decorate Form sessions A/B compare ClientActions, daily A/B session evaluationDataset instance Job (aka Task) classes
  • 21. www.mapflat.com Batch processing testing ● Omit collection in SUT ○ Technically different ● Avoid clusters ○ Slow tests ● Seams ○ Between jobs ○ End of pipelines 21 Data lake Artifact of business value E.g. service index Job Pipeline
  • 22. www.mapflat.com Batch test frameworks - don’t ● Spark, Scalding, Crunch variants ● Seam == internal data structure ○ Omits I/O - common bug source ● Vendor lock-in When switching batch framework: ○ Need tests for protection ○ Test rewrite is unnecessary burden 22 Test input Test output Job
  • 23. www.mapflat.com Testing single job 23 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run in local mode 3. Verify output f() p() Don’t commit - expensive to maintain. Generate / verify with code. Runs well in CI / from IDE Job
  • 24. www.mapflat.com Testing pipelines - two options 24 Standard Scalatest harness file://test_input/ file://test_output/ 1. Generate input 2. Run custom multi-job 3. Verify output f() p() A: Customised workflow manager setup + Runs in CI + Runs in IDE + Quick setup - Multi-job maintenance p() + Tests workflow logic + More authentic - Workflow mgr setup for testability - Difficult to debug - Dataset handling with Python f() B: ● Both can be extended with egress DBs Test job with sequence of jobs
  • 25. www.mapflat.com Testing with cloud services ● PaaS components do not work locally ○ Cloud providers should provide fake implementations ○ Exceptions: Kubernetes, Cloud SQL, (S3) ● Integrate PaaS service as fixture component ○ Distribute access tokens, etc ○ Pay $ or $$$ 25
  • 26. www.mapflat.com Quality testing variants ● Functional regression ○ Binary, key to productivity ● Golden set ○ Extreme inputs => obvious output ○ No regressions tolerated ● (Saved) production data input ○ Individual regressions ok ○ Weighted sum must not decline ○ Beware of privacy 26
  • 27. www.mapflat.com Hadoop / Spark counters ● Processing tool (Spark/Hadoop) counters ○ Odd code path => bump counter ● Dedicated quality assessment pipelines ○ Reuse test oracle invariants in production Obtaining quality metrics 27 DB Quality assessment job
  • 28. www.mapflat.com Quality testing in the process ● Binary self-contained ○ Validate in CI ● Relative vs history ○ E.g. large drops ○ Precondition for publishing dataset ● Push aggregates to DB ○ Standard ops: monitor, alert 28 DB ∆? Code ∆!
  • 29. www.mapflat.com RAM Input Computer program anatomy 29 Input data Process Output File HID VariableFunction Execution path Lookup structure Output data Window File File Device
  • 30. www.mapflat.com Data pipeline = yet another program Don’t veer from best practices ● Regression testing ● Design: Separation of concerns, modularity, etc ● Process: CI/CD, code review, static analysis tools ● Avoid anti-patterns: Global state, hard-coding location, duplication, ... In data engineering, slipping is in the culture... :-( ● Mix in solid backend engineers, document “golden path” 30
  • 31. www.mapflat.com Top anti-patterns 1. Test as afterthought or in production Data processing applications are suited for test! 2. Static test input in version control 3. Exact expected output test oracle 4. Unit testing volatile interfaces 5. Using mocks & dependency injection 6. Tool-specific test framework 7. Calling services / databases from (batch) jobs 8. Using wall clock time 9. Embedded fixture components 31
  • 34. www.mapflat.com Unified log ● Replicated append-only log ● Pub / sub with history ● Decoupled producers and consumers ○ In source/deployment ○ In space ○ In time ● Recovers from link failures ● Replay on transformation bug fix 34 Credits: Confluent
  • 35. www.mapflat.com Stream pipelines ● Parallelised jobs ● Read / write to Kafka ● View egress ○ Serving index ○ SQL / cubes for Analytics ● Stream egress ○ Services subscribe to topic ○ E.g. REST post for export 35 Credits: Apache Samza
  • 36. www.mapflat.com The data lake Unified log + snapshots ● Immutable datasets ● Raw, unprocessed ● Source of truth from batch processing perspective ● Kept as long as permitted ● Technically homogeneous ○ Except for raw imports 36 Cluster storage Data lake
  • 37. www.mapflat.com Batch pipelines ● Things will break ○ Input will be missing ○ Jobs will fail ○ Jobs will have bugs ● Datasets must be rebuilt ● Determinism, idempotency ● Backfill missing / failed ● Eventual correctness 37 Cluster storage Data lake Pristine, immutable datasets Intermediate Derived, regenerable
  • 38. www.mapflat.com Job == function([input datasets]): [output datasets] ● No orthogonal concerns ○ Invocation ○ Scheduling ○ Input / output location ● Testable ● No other input factors, no side-effects ● Ideally: atomic, deterministic, idempotent ● Necessary for audit Batch job 38 q
  • 39. www.mapflat.com Form teams that are driven by business cases & need Forward-oriented -> filters implicitly applied Beware of: duplication, tech chaos/autonomy, privacy loss Data pipelines
  • 40. www.mapflat.com Data platform, pipeline chains Common data infrastructure Productivity, privacy, end-to-end agility, complexity Beware: producer-consumer disconnect
  • 41. www.mapflat.com Ingress / egress representation Larger variation: ● Single file ● Relational database table ● Cassandra column family, other NoSQL ● BI tool storage ● BigQuery, Redshift, ... Egress datasets are also atomic and immutable. E.g. write full DB table / CF, switch service to use it, never change it. 41
  • 42. www.mapflat.com Egress datasets ● Serving ○ Precomputed user query answers ○ Denormalised ○ Cassandra, (many) ● Export & Analytics ○ SQL (single node / Hive, Presto, ..) ○ Workbenches (Zeppelin) ○ (Elasticsearch, proprietary OLAP) ● BI / analytics tool needs change frequently ○ Prepare to redirect pipelines 42
  • 43. Deployment 43 Hg/git repo Luigi DSL, jars, config my-pipe-7.tar.gz HDFS Luigi daemon > pip install my-pipe-7.tar.gz Worker Worker Worker Worker Worker Worker Worker Worker Redundant cron schedule, higher frequency + backfill (Luigi range tools) * 10 * * * bin/my_pipe_daily --backfill 14 All that a pipeline needs, installed atomically