2. About Me
● Building Data Pipelines since 2004
● Founding employee at Crosswise (acquired by Oracle Data Cloud)
○ Luigi, Airflow, every possible Spark deployment; taking ML pipelining to the extreme.
● CTO at Databand.ai, father of two boys, enjoying both! :)
4. REAL DATA PIPELINES
● TOOLS
○ Spark, TensorFlow, Python, SQL…
● DATA
○ formats, schemas, versions
○ sources
● COMPLEXITY
○ pipelines of pipelines, wiring…
Real data pipelines often fail, take forever to
rerun, and give no insight into what went wrong!
5. Because of... Code Changes
● DAGs change a lot too!
● CI/CD?!
● A side effect surfaces somewhere downstream…
→ OBSERVABILITY NEEDED!
6. Because of... Data Changes
● Data Schema changes
● Data Quality Changes
● Data Corruptions
→ OBSERVABILITY NEEDED!
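The schema and quality changes above can be caught before a stage consumes its input. A minimal sketch of such a pre-run data check; all names, thresholds, and the schema are illustrative, not a specific framework's API:

```python
# Hypothetical pre-run check: verify schema types and basic quality
# (null rates) of an input batch before the pipeline stage runs.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "ts": float}

def check_batch(rows, expected_schema=EXPECTED_SCHEMA, max_null_rate=0.05):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for col, typ in expected_schema.items():
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) > max_null_rate:
            problems.append(f"{col}: null rate {nulls / len(rows):.0%} exceeds threshold")
        bad_type = sum(1 for r in rows
                       if r.get(col) is not None and not isinstance(r[col], typ))
        if bad_type:
            problems.append(f"{col}: {bad_type} values of wrong type")
    return problems

rows = [{"user_id": 1, "event": "click", "ts": 1.0},
        {"user_id": "2", "event": "view", "ts": None}]
print(check_batch(rows))  # flags user_id type drift and ts nulls
```

Failing the run early on such problems is usually cheaper than debugging a corrupted downstream output.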
12. take 1 - Production Engineering Approach
● Production Metrics: Grafana, Kibana
● Production Logging: Loggly, Logz.io, others
● Production Alerting: Datadog, Zabbix, Nagios, Grafana alerts
13. take 2 - Data Science Approach
Experiment management via an external system (MLflow, Sacred,
SageMaker, and many others)
Encapsulating reporting into notebooks
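The pattern these tools share: record each run's parameters and metrics so runs can be compared later. A toy, purely illustrative sketch of that pattern; it is not the MLflow or Sacred API:

```python
# Toy experiment-tracking record: params and metrics per run,
# serialized at the end. Real tools POST this to a tracking server.
import json
import uuid

class Run:
    def __init__(self, experiment):
        self.record = {"experiment": experiment,
                       "run_id": uuid.uuid4().hex,
                       "params": {}, "metrics": {}}

    def log_param(self, key, value):
        self.record["params"][key] = value

    def log_metric(self, key, value):
        # metrics are appended, so a value can be logged per step
        self.record["metrics"].setdefault(key, []).append(value)

    def finish(self):
        return json.dumps(self.record)

run = Run("daily_training")
run.log_param("learning_rate", 0.01)
run.log_metric("rows_ingested", 120000)
run.log_metric("auc", 0.91)
print(run.finish())
```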
14. take 3 - Data Ops Engineering Approach
In-house development
Custom Operators
Data management on its own
Job submission on its own
Validation Operators (+ external frameworks)
15. take N: Mix and Match
Understand your customer
16. Measure Everything
● Metrics, metrics, metrics
○ Start from data inputs: 90% of your bugs are somewhere in data ingestion
○ A Spark or TensorFlow job is not a black box! (collect metrics)
● Build Comparison Methodology
○ Grafana dashboards
○ Compare runs 1-to-1!
● Develop
○ Data pipelines are a huge Engineering investment
○ Don’t be afraid of having multiple systems
○ Implement your own! (know how and when to Fork)
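"Not a black box" in practice means emitting metrics from inside the job at each stage, starting with the inputs. A minimal sketch; the `emit` sink and the metric names are hypothetical (in production they would go to StatsD/Grafana):

```python
# Hypothetical in-job instrumentation: emit counts and rates at each
# stage instead of treating the job as a black box.
metrics = {}

def emit(name, value):
    metrics[name] = value  # stand-in for a StatsD/Grafana client

def run_job(records):
    emit("input.row_count", len(records))
    emit("input.null_user_rate",
         sum(1 for r in records if r.get("user_id") is None)
         / max(len(records), 1))
    cleaned = [r for r in records if r.get("user_id") is not None]
    emit("cleaned.row_count", len(cleaned))
    return cleaned

run_job([{"user_id": 1}, {"user_id": None}, {"user_id": 3}])
print(metrics)
```

With input and output counts per stage, the comparison dashboards above become a matter of plotting, not archaeology.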
17. Airflow Operation vs Business Metrics
Connect Airflow to StatsD; great for cluster monitoring!
Use Airflow Trends!
Not sure about inlets/outlets in BaseOperator
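Wiring Airflow to StatsD is a config change in `airflow.cfg`; the section name depends on the Airflow version (it moved from `[scheduler]` in 1.10 to `[metrics]` in 2.x):

```ini
[metrics]            ; [scheduler] on Airflow 1.10
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```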
18. Your KEY and x-axes
● Treat your system as a BATCH system, not a 24/7 service
○ Restarts will happen
○ Scheduling and SLA
● Scheduled time, execution time, restart time
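One consequence of the batch view: key each metric point by the run's *scheduled* time, not the wall-clock time it actually ran, so restarts and backfills land on the same x-axis point. A sketch with hypothetical names:

```python
# Hypothetical metric point keyed by scheduled time; the actual emit
# time is kept separately for auditing restarts.
from datetime import datetime, timezone

def metric_point(name, value, scheduled_at, emitted_at=None):
    emitted_at = emitted_at or datetime.now(timezone.utc)
    return {"name": name, "value": value,
            "ts": scheduled_at.isoformat(),          # x-axis position
            "emitted_at": emitted_at.isoformat()}    # when it really ran

p = metric_point("rows_ingested", 120000,
                 scheduled_at=datetime(2020, 5, 1, tzinfo=timezone.utc))
print(p["ts"])  # 2020-05-01T00:00:00+00:00
```

A rerun of the May 1 batch on May 3 overwrites the May 1 point instead of creating a confusing second spike.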
19. Your KEY name
Is it just a NAME?
Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID ?
● Metrics2.0 to the Rescue! (use labels)
● You’ll have similar tasks in the pipeline!
● You’ll run the same tasks in development!
20. Alerting
● Data processes are no different from your customer-facing frontend
○ You MUST monitor and alert!
● Pagerduty! Slack channel!
● (read good practices on alerting)
● …
● Your jobs emit discrete metrics; your alerting system will not like that!
● Ask your team to develop stable KPIs.
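One hedged way to turn a discrete per-run metric into a stable KPI an alerting system can threshold: compare each run's value to a rolling baseline of prior runs. The window size and ratio form are illustrative choices:

```python
# Hypothetical KPI: ratio of the current run's metric to the mean of
# the last `window` runs. A value far from 1.0 is an alert candidate.
def deviation_from_baseline(history, current, window=7):
    recent = history[-window:]
    baseline = sum(recent) / len(recent)
    return current / baseline

history = [100, 102, 98, 101, 99, 100, 103]   # e.g. daily row counts
kpi = deviation_from_baseline(history, 55)
print(round(kpi, 2))  # well below 1.0 -> today's batch looks half-empty
```

Alerting on "KPI outside [0.8, 1.2]" is far more stable than alerting on raw row counts that legitimately vary per client and per day.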
21. What we didn’t discuss
Cost Monitoring
Advanced Alerting on Data Pipelines
How to reuse Production observability in Development and vice versa
22. What we did discuss
OBSERVABILITY!
Invest now!
Start from basics: instrument your code!