This document discusses strategies for testing data processing pipelines. It opens with the speakers' backgrounds across companies working with data applications and pipelines, then covers the anatomy of streaming and batch data pipelines, suitable test seams, test scopes from unit to integration, and strategies for testing streaming jobs, batch pipelines, and data quality. Anti-patterns for data pipeline testing are also discussed.
Test strategies for data processing pipelines
1. www.mapflat.com
Test strategies for data
processing pipelines
Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology
2. www.mapflat.com
Who’s talking?
Swedish Institute of Computer Science (test & debug tools)
Sun Microsystems (large machine verification)
Google (Hangouts, productivity)
Recorded Future (NLP startup) (data integrations)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling, productivity)
Schibsted Products & Tech (data processing & modelling)
Mapflat (independent data engineering consultant)
3. www.mapflat.com
Agenda
● Data applications from a test perspective
● Testing stream processing products
● Testing batch processing products
● Data quality testing
Main focus is functional regression testing
Prerequisites: Backend dev testing, basic data experience
4. www.mapflat.com
Test value
For data-centric applications, in this order:
● Productivity
○ Move fast without breaking things
● Fast experimentation
○ 10% good ideas, 90% bad
● Data quality
○ Challenging, and more important than technical quality
● Technical quality
○ Technical failure => ops hassle, stale data
5. www.mapflat.com
Streaming data product anatomy
(Diagram: ingress from services and DBs into a pub/sub unified log with topics; stream processing jobs form pipelines; egress to services, export, business intelligence, and DBs.)
8. www.mapflat.com
Data-centric application properties
● Output = function(input, code)
○ No external factors => deterministic
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
9. www.mapflat.com
Data-centric app test properties
● Output = function(input, code)
○ Perfect for test!
○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
○ Focus on large tests
● Internal seams => high maintenance, low value
○ Omit unit tests, mocks, dependency injection!
● Long pipelines cross teams
○ Need for end-to-end tests, but a culture challenge
11. www.mapflat.com
Streaming SUT, example harness
(Diagram: numbered harness components: 1. IDE / Gradle driving the test run; 2, 7. Scalatest, with IDE, CI, and debug integration; 3. Docker fixture containers; 4. Spark Streaming jobs under test; 5. test input pushed to a Kafka topic; 6. test oracle polling a DB.)
12. www.mapflat.com
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence test, send dummy sync messages at end.
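A minimal Scalatest sketch of this lifecycle; Fixtures and its helpers (startContainers, pushToKafka, queryDb, etc.) are assumed harness code, not a real library:

import org.scalatest.BeforeAndAfterAll
import org.scalatest.flatspec.AnyFlatSpec

class StreamingJobSpec extends AnyFlatSpec with BeforeAndAfterAll {
  // 1-2. Start fixture containers (Kafka, DB) and await readiness.
  override def beforeAll(): Unit = { Fixtures.startContainers(); Fixtures.awaitReady() }
  // 8. Tear down the fixture after all test cases have run.
  override def afterAll(): Unit = Fixtures.tearDown()

  "The job" should "produce a session row per client" in {
    // 3-5. Allocate test case resources, start jobs, push input to Kafka.
    Fixtures.startJobs()
    Fixtures.pushToKafka("actions", TestInput.actions())
    // 6. Poll the egress database until done or timeout.
    val deadline = System.currentTimeMillis + 30000
    var rows = Fixtures.queryDb("sessions")
    while (rows.isEmpty && System.currentTimeMillis < deadline) {
      Thread.sleep(100)
      rows = Fixtures.queryDb("sessions")
    }
    assert(rows.nonEmpty)
  }
}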
13. www.mapflat.com
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible =>
new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
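A sketch of such a generation routine with Play JSON; the record type, field names, and defaults are assumptions:

import play.api.libs.json.{JsObject, Json}

// Hypothetical input generator: each test overrides only the fields it
// cares about; new fields get a default in one place, so tests stay
// robust when the schema grows.
object TestInput {
  def adImpression(
      userId: String = "user-1",
      campaign: String = "campaign-1",
      timestamp: String = "2016-02-06T13:00:00Z"): JsObject =
    Json.obj(
      "userId" -> userId,
      "campaign" -> campaign,
      "timestamp" -> timestamp)
}

// Usage: TestInput.adImpression(userId = "u42")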
14. www.mapflat.com
Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
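For example, with Play JSON paths a test can check only the relevant fields and express reusable invariants (field names assumed):

import play.api.libs.json.JsValue

// Check only the fields relevant to this test; unrelated field
// additions will not break the oracle.
def assertSession(record: JsValue): Unit = {
  assert((record \ "userId").as[String] == "user-1")
  assert((record \ "actionCount").as[Int] > 0)
}

// Invariant reusable across similar types: every record carries a user id.
def userIdInvariant(records: Seq[JsValue]): Boolean =
  records.forall(r => (r \ "userId").asOpt[String].isDefined)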
15. www.mapflat.com
Batch data product anatomy
(Diagram: ingress from services and DBs via import jobs into a data lake on cluster storage, alongside the unified log; ETL jobs and datasets form pipelines; egress via export to services, business intelligence, and DBs.)
16. Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ Finite set of homogeneous records
○ E.g. MobileAdImpressions(hour="2016-02-06T13")
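A sketch of the class/instance distinction in Scala; the path convention is an assumption for illustration:

// Dataset class: an open-ended family of homogeneous records with a
// compatible schema. Dataset instance: class + parameters, naming one
// immutable, finite slice of those records.
case class MobileAdImpressions(hour: String) {
  // Hypothetical path convention locating this instance's records.
  def path: String = s"hdfs://production/red/mobile_ad_impressions/hour=$hour"
}

// e.g. MobileAdImpressions(hour = "2016-02-06T13")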
18. Batch processing
● outDatasets = code(inDatasets)
● Components that scale up
○ Spark, (Flink, Scalding, Crunch)
● And scale down
○ Local mode
○ Most jobs fit in one machine
Gradual refinement
1. Wash
- time shuffle, dedup, ...
2. Decorate
- geo, demographic, ...
3. Domain model
- similarity, clusters, ...
4. Application model
- Recommendations, ...
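A sketch of a job that scales down for tests by running Spark in local mode; paths and the eventId field are assumptions:

import org.apache.spark.sql.SparkSession

object WashJob {
  def main(args: Array[String]): Unit = {
    val Array(inPath, outPath) = args
    // In tests, "local[*]" runs the same job on one machine;
    // in production the master comes from the cluster environment.
    val spark = SparkSession.builder()
      .master(sys.env.getOrElse("SPARK_MASTER", "local[*]"))
      .appName("wash")
      .getOrCreate()
    // Wash step: deduplicate raw events before decoration.
    spark.read.json(inPath)
      .dropDuplicates("eventId")
      .write.json(outPath)
    spark.stop()
  }
}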
19. www.mapflat.com
Workflow manager
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
Suggested: Luigi / Airflow
20. www.mapflat.com
DSL DAG example (Luigi)
from datetime import timedelta

from luigi import DateHourParameter
from luigi.contrib.hdfs import HdfsTarget
from luigi.contrib.spark import SparkSubmitTask

class ClientActions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return ([Actions(hour=self.hour - timedelta(hours=h)) for h in range(0, 24)] +
                [UserDB(date=self.hour.date())])
    ...

class ClientSessions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientActions(hour=self.hour - timedelta(hours=h)) for h in range(0, 3)]
    ...

class SessionsABResults(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientSessions(hour=self.hour), ABExperiments(hour=self.hour)]
    def output(self):
        return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                          "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
    ...
● Expressive, embedded DSL - a must for ingress, egress
○ Avoid weak DSL tools: Oozie, AWS Data Pipeline
(Diagram: job (aka Task) classes over dataset instances: hourly Actions and UserDB feed a time shuffle / user decorate job producing daily ClientActions; a form-sessions job produces ClientSessions; an A/B compare job joins ClientSessions with A/B experiments into an A/B session evaluation.)
21. www.mapflat.com
Batch processing testing
● Omit collection in SUT
○ Technically different
● Avoid clusters
○ Slow tests
● Seams
○ Between jobs
○ End of pipelines
(Diagram: test seams lie between jobs and at the end of the pipeline, from the data lake to the artifact of business value, e.g. a service index.)
22. www.mapflat.com
Batch test frameworks - don’t
● Spark, Scalding, Crunch variants
● Seam == internal data structure
○ Omits I/O - common bug source
● Vendor lock-in
When switching batch framework:
○ Need tests for protection
○ Test rewrite is unnecessary burden
23. www.mapflat.com
Testing single job
Standard Scalatest harness:
1. Generate input with code, f(), to file://test_input/
2. Run the job in local mode
3. Verify output with a predicate p() on file://test_output/
Don't commit static input/output files - expensive to maintain. Generate / verify with code.
Runs well in CI / from IDE.
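A sketch of this harness in Scalatest; WashJob and the generation/verification helpers are assumptions:

import java.nio.file.Files
import org.scalatest.flatspec.AnyFlatSpec

class WashJobSpec extends AnyFlatSpec {
  "WashJob" should "deduplicate events" in {
    val in = Files.createTempDirectory("test_input")
    val out = Files.createTempDirectory("test_output")
    // 1. f(): generate input with code, not committed static files.
    TestInput.writeEvents(in, Seq(event(id = "e1"), event(id = "e1"), event(id = "e2")))
    // 2. Run the job in local mode on local files.
    WashJob.main(Array(s"file://$in", s"file://$out"))
    // 3. p(): verify only the properties this test is about.
    assert(TestOutput.eventIds(out).sorted == Seq("e1", "e2"))
  }
}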
24. www.mapflat.com
Testing pipelines - two options
A: Standard Scalatest harness, testing a sequence of jobs
1. Generate input f() 2. Run custom multi-job 3. Verify output p()
+ Runs in CI
+ Runs in IDE
+ Quick setup
- Multi-job maintenance
B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow mgr setup for testability
- Difficult to debug
- Dataset handling with Python
● Both can be extended with egress DBs
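Option A as a sketch: the test chains the pipeline's jobs itself, reusing the hypothetical jobs from the earlier sketches:

import java.nio.file.Files

// Hypothetical multi-job runner: run the pipeline's jobs in dependency
// order over shared temp directories, then verify only the final output.
object PipelineRunner {
  def run(in: String, out: String): Unit = {
    val washed = Files.createTempDirectory("washed").toString
    WashJob.main(Array(in, s"file://$washed"))
    SessionJob.main(Array(s"file://$washed", out))
  }
}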
25. www.mapflat.com
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake
implementations
○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate PaaS service as fixture component
○ Distribute access tokens, etc
○ Pay $ or $$$
26. www.mapflat.com
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
27. www.mapflat.com
Obtaining quality metrics
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
(Diagram: a quality assessment job pushing metrics to a DB.)
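A sketch of the counter approach using Spark accumulators; the input path and geo field are assumptions:

import org.apache.spark.sql.SparkSession

object QualityCounters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("quality").getOrCreate()
    // Bump a counter on the odd code path instead of failing the job.
    val missingGeo = spark.sparkContext.longAccumulator("records.missing_geo")
    val actions = spark.read.json(args(0)).rdd.map { row =>
      // Assumes the schema contains a nullable "geo" field.
      if (row.isNullAt(row.fieldIndex("geo"))) missingGeo.add(1)
      row
    }
    actions.count()  // force evaluation so the accumulator is populated
    // Push the aggregate to a metrics DB for standard ops monitoring/alerting.
    println(s"missing_geo = ${missingGeo.value}")
    spark.stop()
  }
}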
28. www.mapflat.com
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative to history
○ E.g. large drops
○ Precondition for
publishing dataset
● Push aggregates to DB
○ Standard ops:
monitor, alert
(Diagram: a code change (∆) prompts checking the output delta against history in a DB before publishing.)
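A sketch of such a publishing precondition; the metric source, window, and threshold are assumptions:

// Hypothetical gate: refuse to publish a dataset whose quality metric
// dropped sharply relative to recent history fetched from the metrics DB.
def okToPublish(metric: Double, history: Seq[Double], maxDrop: Double = 0.1): Boolean =
  if (history.isEmpty) true
  else {
    val window = history.takeRight(24)
    val recentAvg = window.sum / window.size
    metric >= recentAvg * (1.0 - maxDrop)
  }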
30. www.mapflat.com
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location,
duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document “golden path”
31. www.mapflat.com
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components
34. www.mapflat.com
Unified log
● Replicated append-only log
● Pub / sub with history
● Decoupled producers and consumers
○ In source/deployment
○ In space
○ In time
● Recovers from link failures
● Replay on transformation bug fix
Credits: Confluent
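Replay after a transformation bug fix can be as simple as a fresh consumer group reading from the log's beginning (a sketch with the Java Kafka client; topic name and processing stub are assumptions):

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object Replay {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    // A new group id means no stored offsets: start from the beginning.
    props.put("group.id", s"replay-${System.currentTimeMillis}")
    props.put("auto.offset.reset", "earliest")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Seq("mobile_ad_impressions").asJava)
    while (true) {
      // Re-run the fixed transformation over the full retained history.
      consumer.poll(Duration.ofSeconds(1)).asScala.foreach(r => process(r.value))
    }
  }
  def process(value: String): Unit = () // fixed transformation goes here
}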
35. www.mapflat.com
Stream pipelines
● Parallelised jobs
● Read / write to Kafka
● View egress
○ Serving index
○ SQL / cubes for Analytics
● Stream egress
○ Services subscribe to topic
○ E.g. REST post for export
Credits: Apache Samza
36. www.mapflat.com
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch
processing perspective
● Kept as long as permitted
● Technically homogeneous
○ Except for raw imports
37. www.mapflat.com
Batch pipelines
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness
(Diagram: cluster storage holds the data lake of pristine, immutable datasets, plus intermediate and derived, regenerable datasets.)
38. www.mapflat.com
Batch job
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
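A sketch of that contract in Scala; DatasetInstance is a hypothetical descriptor, not a real API:

// Hypothetical dataset instance descriptor (dataset class + parameters).
case class DatasetInstance(datasetClass: String, params: Map[String, String])

// A batch job is a pure function from input dataset instances to output
// dataset instances: no wall clock, no service calls, no hidden state.
trait BatchJob {
  def run(inputs: Seq[DatasetInstance]): Seq[DatasetInstance]
}

// Invocation, scheduling, and I/O locations live in the workflow manager,
// not in the job itself.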
39. www.mapflat.com
Data pipelines
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
40. www.mapflat.com
Data platform, pipeline chains
Common data infrastructure
Productivity, privacy, end-to-end agility, complexity
Beware: producer-consumer disconnect
41. www.mapflat.com
Ingress / egress representation
Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write full DB table / CF, switch service to use it, never
change it.