3. What is a Data Pipeline?
• Discrete set of dependent operations
• Directional (Inputs -> [Operations] -> Outputs)
• One or more input sources and one or more
outputs
4. Pipelines are Used For
• Data aggregation / augmentation
• Data cleansing / de-duplication
• Data copying / synchronization
• Analytics processing
• AI Modeling
5. Sources and Targets
• Sources: Initial inputs into a pipeline
• REST API, Excel Sheet, Filesystem, HDFS,
RDBMS, etc.
• Targets: Terminal outputs of a pipeline
• REST API, Excel Sheet, [...], Email, Slack
6. Operations
• Operations are the fundamental units of work
within a pipeline.
• Operations can be domain specific.
• Operations can be composable.
7. Simple Linear Pipeline
[Diagram: Source -> Operation -> Target]
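A minimal sketch of the slide's linear pipeline in plain Python; the file names and the "amount" field are hypothetical, and each stage is just a composable function:

    import csv
    import json

    def extract(path):
        # Source: read rows from a CSV file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Operation: keep only rows with a positive amount.
        return [row for row in rows if float(row["amount"]) > 0]

    def load(rows, path):
        # Target: write the result out as JSON.
        with open(path, "w") as f:
            json.dump(rows, f)

    # Source -> Operation -> Target
    load(transform(extract("input.csv")), "output.json")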
11. Atomicity
• An entire operation fails or succeeds as a
whole.
• There is no partial state in the event of a
failure.
"the state or fact of being
composed of indivisible units."
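One common way to get atomicity for file output is the write-then-rename pattern, since os.replace swaps the file in a single step; a sketch (the JSON payload is illustrative):

    import json
    import os
    import tempfile

    def atomic_write_json(data, path):
        # Write to a temp file in the target directory, then rename.
        # Readers see either the old file or the new one -- never a
        # partially written file.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
            os.replace(tmp_path, path)  # success: commit in one step
        except Exception:
            os.remove(tmp_path)  # failure: leave no partial state
            raise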
13. Idempotency
• An operation can be run multiple times without failure.
• An operation can be run multiple times without
duplication of output.
Q: What is the correct way to pronounce 'idempotent'?
A: The same way every time.
"denoting an element of a set that is
unchanged in value when multiplied or
otherwise operated on by itself."
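A sketch of an idempotent load using SQLite's INSERT OR REPLACE, keyed on the primary key (the table and columns are hypothetical):

    import sqlite3

    def upsert_users(conn, users):
        # Re-running this overwrites rows by primary key instead of
        # duplicating them, so the output is the same every time.
        conn.executemany(
            "INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
            [(u["id"], u["email"]) for u in users],
        )
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    rows = [{"id": 1, "email": "a@example.com"}]
    upsert_users(conn, rows)
    upsert_users(conn, rows)  # safe to repeat: no duplicate output
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1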
15. Concurrency
• Execute a non-CPU-bound (e.g., I/O-bound) operation via
many threads on the same core.
• Performant pipelines find concurrency within
an operation.
"the decomposability property of a
program, algorithm, or problem into
order-independent or partially-ordered
components or units."
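For example, an I/O-bound fetch operation can keep many requests in flight on a single core with a thread pool; the URLs are placeholders:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    URLS = ["https://example.com", "https://example.org"]

    def fetch(url):
        # While one thread waits on the network, the GIL is released
        # and the other threads make progress on the same core.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    # Concurrency *within* the operation: the fetches are
    # order-independent, so they can overlap freely.
    with ThreadPoolExecutor(max_workers=10) as pool:
        pages = list(pool.map(fetch, URLS))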
16. Parallelism
• Execute operations on multiple cores /
machines simultaneously.
• Operations can run in parallel as soon as a
new input is available.
"a computation architecture in which
many calculations or the execution of
processes are carried out simultaneously"
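A sketch of CPU-bound work spread across cores with multiprocessing; tokenize stands in for any real operation:

    from multiprocessing import Pool

    def tokenize(doc):
        # CPU-bound work; each worker process runs on its own core.
        return doc.lower().split()

    if __name__ == "__main__":
        docs = ["First document", "Second document", "Third document"]
        with Pool(processes=4) as pool:
            # imap yields each result as soon as it is ready, so a
            # downstream operation can start on new input immediately.
            for tokens in pool.imap(tokenize, docs):
                print(tokens)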
18. Periodic Workflows
• Pipeline executes on a timed interval
• Great for exhaustive data processing
• Easy backfilling
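A periodic workflow can be as simple as a Celery beat schedule. This sketch assumes a Redis broker and a module named tasks.py; nightly_aggregate is a hypothetical task:

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def nightly_aggregate():
        ...  # exhaustive pass over the previous day's data

    # celery beat fires the task on a timed interval; re-invoking it
    # for past dates is a straightforward backfill.
    app.conf.beat_schedule = {
        "nightly-aggregate": {
            "task": "tasks.nightly_aggregate",
            "schedule": crontab(hour=2, minute=0),  # daily at 02:00
        },
    }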
19. Event-Driven Workflows
• Pipeline handles inputs (events) as they are
received
• Real time data
• Best suited for non-exhaustive data processing
• Backfills?
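In an event-driven workflow each input is dispatched as it arrives; here a hypothetical producer hands every event to a Celery task (process is a stand-in, and the broker URL is assumed):

    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task
    def handle_event(event):
        # One input, processed in real time -- no exhaustive scan
        # over historical data.
        process(event)  # hypothetical processing function

    def on_event_received(event):
        # Called by a webhook handler, queue consumer, etc.
        handle_event.delay(event)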
20. ETL
• Extract, Transform, Load are distinct steps with
no shared operations
• Each step can be performed one or more times
before the following step is performed.
Extract, Transform, Load
21. ETL
• Intermediate data is stored between steps, and
audit data is tracked for each step.
• Enables independent processing of Extract,
Transform, and Load.
• Don't transform during extraction.
• Don't transform during loading!
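A sketch of that separation, with intermediate data persisted between steps (the staged/ paths, fields, and users table are hypothetical):

    import json

    def extract(raw_path):
        # Extract only: copy source records verbatim, no transforms.
        with open(raw_path) as f:
            records = [json.loads(line) for line in f]
        with open("staged/extracted.json", "w") as f:
            json.dump(records, f)  # intermediate data persisted

    def transform():
        # Transform only: reads the staged extract, never the source.
        with open("staged/extracted.json") as f:
            records = json.load(f)
        cleaned = [r for r in records if r.get("email")]
        with open("staged/transformed.json", "w") as f:
            json.dump(cleaned, f)

    def load(conn):
        # Load only: insert the transformed records as-is.
        with open("staged/transformed.json") as f:
            records = json.load(f)
        conn.executemany(
            "INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)",
            [(r["id"], r["email"]) for r in records],
        )

Because each step reads from and writes to durable intermediate storage, any step can be re-run or audited on its own.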
27. Web Development Ecosystem
• Django, Flask, Pyramid
• Django REST Framework
• Scrapy
• Celery
In web development, we started solving distributed processing
problems a long time ago.
34. Luigi
• Open sourced by Spotify in 2012
• Lightweight configuration
• Does not support worker pooling
"Luigi is a Python package that helps you build complex pipelines
of batch jobs. It handles dependency resolution, workflow
management, visualization, handling failures, command line
integration, and much more."
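A minimal sketch of two dependent Luigi tasks; fetch_rows and clean are hypothetical helpers:

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"data/{self.date}/raw.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write(fetch_rows(self.date))  # hypothetical fetcher

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            # Luigi resolves this dependency and runs Extract first.
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"data/{self.date}/clean.csv")

        def run(self):
            with self.input().open() as src, self.output().open("w") as dst:
                dst.write(clean(src.read()))  # hypothetical cleaner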
36. Airflow
• Open sourced by Airbnb in 2015
• Apache Incubation since March 2016
• Implements workflows as strict DAGs
• Visualization / Audit / Backfill tools
• Scales with Celery
"Airflow is a platform to programmaticaly
author, schedule and monitor data pipelines."
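A minimal DAG sketch, assuming Airflow 2.x import paths; the callables are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull from the source

    def load():
        ...  # push to the target

    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",  # periodic runs; past dates backfill
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task  # strict DAG edge: extract before load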
39. How To Attack a Pipeline Problem
1. Pure Python functions
2. Convert to Celery (Parallel for free!)
3. Layer in Concurrency / Optimizations
4. Escalate to Airflow
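A sketch of steps 1 and 2: start with a pure function, then wrap it as a Celery task so a worker pool runs it in parallel (the broker URL and normalize are illustrative):

    from celery import Celery

    app = Celery("pipeline", broker="redis://localhost:6379/0")

    # Step 1: a pure Python function -- trivial to test in isolation.
    def normalize(record):
        # Assumes string keys and values.
        return {k.lower(): v.strip() for k, v in record.items()}

    # Step 2: the same function as a Celery task. Each call can now
    # run on any worker in the pool -- parallel for free.
    @app.task
    def normalize_task(record):
        return normalize(record)

    # Fan out one task per record; workers pick them up in parallel:
    #   for record in records:
    #       normalize_task.delay(record)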
40. Things are going to fail.
• Log early and often.
• Remember Atomicity.
• Leverage aggregation / visualization tools.
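A small sketch of a logging wrapper that leans on atomic operations, so a logged failure can simply be retried:

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def run_operation(op, payload):
        log.info("starting %s", op.__name__)
        try:
            result = op(payload)
        except Exception:
            # The operation is atomic, so no partial state exists;
            # log the failure and let a scheduler retry it.
            log.exception("operation %s failed", op.__name__)
            raise
        log.info("finished %s", op.__name__)
        return result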
43. Pipeline Takeaways
• Build operations with Atomicity and
Idempotency in mind.
• Optimize throughput with concurrency and
parallelism.
• Log and visualize (or just use Airflow).
44. Come talk to us about your data.
Casey Kinsey, Principal Consultant
hirelofty.com
@loftylabs
@quesokinsey