2. About Me
● Building Data Pipelines since 2004
● Founding employee at Crosswise (acquired by Oracle Data Cloud)
○ Luigi, Airflow, every possible Spark deployment; taking ML pipelining to the extreme.
● CTO at Databand.ai, father of two boys, enjoying both! :)
4. REAL DATA PIPELINES
● TOOLS
○ Spark, TensorFlow, Python, SQL…
● DATA
○ formats, schemas, versions
○ sources
● COMPLEXITY
○ pipelines of pipelines, wiring…
Real data pipelines often fail, take forever to
rerun, and give no insight into what went wrong!
5. Because of... Code Changes
● DAGs change a lot too!
● CI/CD?!
● A side effect surfaces somewhere downstream…
→ OBSERVABILITY NEEDED!
6. Because of... Data Changes
● Data Schema changes
● Data Quality Changes
● Data Corruptions
→ OBSERVABILITY NEEDED!
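The schema and quality changes above can be caught before a stage consumes its input. A minimal sketch of such a pre-run data check; all names, thresholds, and the schema are illustrative, not a specific framework's API:

```python
# Hypothetical pre-run check: verify schema types and basic quality
# (null rates) of an input batch before the pipeline stage runs.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}

def check_batch(rows, expected_schema=EXPECTED_SCHEMA, max_null_rate=0.05):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for col, typ in expected_schema.items():
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) > max_null_rate:
            problems.append(f"{col}: null rate {nulls / len(rows):.0%} exceeds threshold")
        bad_type = sum(1 for r in rows
                       if r.get(col) is not None and not isinstance(r[col], typ))
        if bad_type:
            problems.append(f"{col}: {bad_type} values of wrong type")
    return problems

rows = [{"user_id": 1, "event": "click", "ts": 1.0},
        {"user_id": "2", "event": "view", "ts": None}]
print(check_batch(rows))  # flags user_id type drift and ts nulls
```

Failing the run early on such problems is usually cheaper than debugging a corrupted downstream output.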
12. take 1 - Production Engineering Approach
● Production Metrics: Grafana, Kibana
● Production Logging: Loggly, Logz.io, others
● Production Alerting: Datadog, Zabbix, Nagios, Grafana alerts
13. take 2 - Data Science Approach
Experiment management via an external system (MLflow, Sacred,
SageMaker, and many others)
Encapsulating reporting into notebooks
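The pattern these tools share: record each run's parameters and metrics so runs can be compared later. A toy, purely illustrative sketch of that pattern; it is not the MLflow or Sacred API:

```python
# Toy experiment-tracking record: params and metrics per run,
# serialized at the end. Real tools POST this to a tracking server.
import json
import uuid

class Run:
    def __init__(self, experiment):
        self.record = {"experiment": experiment,
                       "run_id": uuid.uuid4().hex,
                       "params": {}, "metrics": {}}

    def log_param(self, key, value):
        self.record["params"][key] = value

    def log_metric(self, key, value):
        # metrics are appended, so a value can be logged per step
        self.record["metrics"].setdefault(key, []).append(value)

    def finish(self):
        return json.dumps(self.record)

run = Run("daily_training")
run.log_param("learning_rate", 0.01)
run.log_metric("rows_ingested", 120000)
run.log_metric("auc", 0.91)
print(run.finish())
```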
14. take 3 - Data Ops Engineering Approach
In-house development
Custom Operators
Data management on its own
Job submission on its own
Validation Operators (+ external frameworks)
15. take N: Mix and Match
Understand your customer
16. Measure Everything
● Metrics, metrics, metrics
○ Start from data inputs: 90% of your bugs are somewhere in data ingestion
○ A Spark or TensorFlow job is not a black box! (collect metrics)
● Build Comparison Methodology
○ Grafana dashboards
○ Compare runs 1-to-1!
● Develop
○ Data pipelines are a huge Engineering investment
○ Don’t be afraid of having multiple systems
○ Implement your own! (know how and when to Fork)
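"Not a black box" in practice means emitting metrics from inside the job at each stage, starting with the inputs. A minimal sketch; the `emit` sink and the metric names are hypothetical (in production they would go to StatsD/Grafana):

```python
# Hypothetical in-job instrumentation: emit counts and rates at each
# stage instead of treating the job as a black box.
metrics = {}

def emit(name, value):
    metrics[name] = value  # stand-in for a StatsD/Grafana client

def run_job(records):
    emit("input.row_count", len(records))
    emit("input.null_user_rate",
         sum(1 for r in records if r.get("user_id") is None)
         / max(len(records), 1))
    cleaned = [r for r in records if r.get("user_id") is not None]
    emit("cleaned.row_count", len(cleaned))
    return cleaned

run_job([{"user_id": 1}, {"user_id": None}, {"user_id": 3}])
print(metrics)
```

With input and output counts per stage, the comparison dashboards above become a matter of plotting, not archaeology.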
17. Airflow Operation vs Business Metrics
Connect Airflow to StatsD; great for cluster monitoring!
Use Airflow Trends!
Not sure about inlets/outlets in BaseOperator
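Wiring Airflow to StatsD is a config change in `airflow.cfg`; the section name depends on the Airflow version (it moved from `[scheduler]` in 1.10 to `[metrics]` in 2.x):

```ini
[metrics]            ; [scheduler] on Airflow 1.10
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```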
18. Your KEY and x-axes
● Treat your system as a BATCH system, not a 24/7 service
○ Restarts will happen
○ Scheduling and SLA
● Scheduled time, execution time, restart time
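One consequence of the batch view: key each metric point by the run's *scheduled* time, not the wall-clock time it actually ran, so restarts and backfills land on the same x-axis point. A sketch with hypothetical names:

```python
# Hypothetical metric point keyed by scheduled time; the actual emit
# time is kept separately for auditing restarts.
from datetime import datetime, timezone

def metric_point(name, value, scheduled_at, emitted_at=None):
    emitted_at = emitted_at or datetime.now(timezone.utc)
    return {"name": name, "value": value,
            "ts": scheduled_at.isoformat(),          # x-axis position
            "emitted_at": emitted_at.isoformat()}    # when it really ran

p = metric_point("rows_ingested", 120000,
                 scheduled_at=datetime(2020, 5, 1, tzinfo=timezone.utc))
print(p["ts"])  # 2020-05-01T00:00:00+00:00
```

A rerun of the May 1 batch on May 3 overwrites the May 1 point instead of creating a confusing second spike.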
19. Your KEY name
Is it just a NAME?
Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID ?
● Metrics2.0 to the Rescue! (use labels)
● You’ll have similar tasks in the pipeline!
● You’ll run the same tasks in development!
20. Alerting
● Data processes are no different from your customer-facing frontend
○ You MUST monitor and alert!
● Pagerduty! Slack channel!
● (read good practices on alerting)
● …
● Your jobs emit discrete metrics; your alerting system will not like that!
● Ask your team to develop stable KPIs.
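One hedged way to turn a discrete per-run metric into a stable KPI an alerting system can threshold: compare each run's value to a rolling baseline of prior runs. The window size and ratio form are illustrative choices:

```python
# Hypothetical KPI: ratio of the current run's metric to the mean of
# the last `window` runs. A value far from 1.0 is an alert candidate.
def deviation_from_baseline(history, current, window=7):
    recent = history[-window:]
    baseline = sum(recent) / len(recent)
    return current / baseline

history = [100, 102, 98, 101, 99, 100, 103]   # e.g. daily row counts
kpi = deviation_from_baseline(history, 55)
print(round(kpi, 2))  # well below 1.0 -> today's batch looks half-empty
```

Alerting on "KPI outside [0.8, 1.2]" is far more stable than alerting on raw row counts that legitimately vary per client and per day.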
21. What we didn’t discuss
Cost Monitoring
Advanced Alerting on Data Pipelines
How to reuse Production observability in Development and vice versa
22. What we did discuss
OBSERVABILITY!
Invest now!
Start from basics: instrument your code!