Building (Better) Data Pipelines
using Apache Airflow
Sid Anand (@r39132)
QCon.AI 2018
About Me
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
Apache Airflow : What is it?
In a nutshell:
Airflow is a platform to
programmatically author, schedule
and monitor workflows (a.k.a. DAGs
or Directed Acyclic Graphs)
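To make the DAG idea concrete, here is a minimal plain-Python sketch (not Airflow's API) that models a workflow as a mapping from each task to the tasks it depends on, then derives a valid run order. The task names are hypothetical, and `graphlib` assumes Python 3.9 or newer.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "score": {"extract"},
    "stats": {"extract"},
    "load": {"score", "stats"},
}


def execution_order(dag):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)  # "extract" comes first and "load" comes last
```

Because the graph is acyclic, every task can run after all of its upstream tasks have finished; a cyclic graph has no such order, which is why schedulers insist on DAGs.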
Apache Airflow : UI Walk-through
Airflow - Authoring DAGs
• Visualizing a DAG
• Author DAGs in Python! No need to bundle many XML files!
• The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights
• Gantt charts reveal the slowest tasks for a run!
• …And we can easily see performance trends over time
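A Gantt view of a run boils down to each task's start and end time. As a rough plain-Python sketch (the task names and timestamps are made up, and this is not how Airflow computes its charts), the slowest task is simply the one with the longest duration:

```python
from datetime import datetime

# Hypothetical timings for one DAG run: (task_id, start, end).
task_runs = [
    ("extract", datetime(2018, 4, 10, 1, 0), datetime(2018, 4, 10, 1, 2)),
    ("score", datetime(2018, 4, 10, 1, 2), datetime(2018, 4, 10, 1, 17)),
    ("load", datetime(2018, 4, 10, 1, 17), datetime(2018, 4, 10, 1, 20)),
]


def durations(runs):
    """Map each task_id to its run time in seconds."""
    return {task_id: (end - start).total_seconds() for task_id, start, end in runs}


def slowest_task(runs):
    """Return the task_id with the longest duration."""
    by_task = durations(runs)
    return max(by_task, key=by_task.get)


print(slowest_task(task_runs))  # score
```

Collecting the same durations across many runs gives the performance trend over time.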
Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
    • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…
• General Job Scheduling (e.g. Cron)
    • DB Back-ups, Scheduled code/config deployment
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
    • where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
    • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
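The failure-handling, alerting, and SLA points can be sketched in a few lines of plain Python. This is an illustrative toy, not Airflow's implementation; `run_with_retries`, its parameters, and the `alert` callback are all hypothetical names.

```python
import time


def run_with_retries(task, retries=2, sla_seconds=1.0, alert=print):
    """Run task(); retry on failure, alert on final failure or on an SLA miss."""
    started = time.monotonic()
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            result = task()
            break
        except Exception as exc:
            if attempt == retries + 1:
                alert(f"task failed after {attempt} attempts: {exc}")
                raise
    elapsed = time.monotonic() - started
    if elapsed > sla_seconds:
        alert(f"SLA missed: took {elapsed:.2f}s, limit {sla_seconds}s")
    return result


# Usage: a hypothetical task that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"


alerts = []
print(run_with_retries(flaky, retries=2, alert=alerts.append))  # ok
```

Passing `alert` as a callback keeps the reporting channel (email, chat, pager) separate from the retry logic.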
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
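Resource pooling means capping how many tasks may use a shared resource at once. A minimal sketch of the idea in plain Python, assuming a hypothetical `Pool` class with two slots guarding a database (this models the concept only and is not Airflow's pool implementation):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pool:
    """Toy resource pool: at most `slots` tasks may hold a slot at once."""

    def __init__(self, slots):
        self._sem = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, task, *args):
        with self._sem:  # blocks while all slots are taken
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self.running -= 1


db_pool = Pool(slots=2)


def query(n):
    time.sleep(0.01)  # stand-in for work against the shared resource
    return n * n


# Eight threads compete, but only two may run `query` at any moment.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda n: db_pool.run(query, n), range(6)))

print(results, db_pool.peak)
```

The executor has more threads than the pool has slots on purpose: the pool, not the worker count, decides how hard the shared resource is hit.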
Use-Case : Message Scoring
Batch Pipeline Architecture
[Diagram, built up step by step: enterprise A, enterprise B, enterprise C, S3 (uploads), S3 (scored output), SNS, SQS, Importers ASG, DB]
1. Enterprises A, B, and C upload to S3 every 15 minutes
2. Airflow kicks off a Spark message scoring job every hour
3. The Spark job writes scored messages and stats to another S3 bucket
4. This triggers SNS/SQS message events
5. An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
6. The importers rapidly ingest scored messages and aggregate statistics into the DB
7. Users receive alerts of untrusted emails & can review them in the web app
8. Airflow manages the entire process
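The two cadences in this use case (uploads every 15 minutes, scoring every hour) imply that each hourly run covers a batch of uploads. The sketch below assumes, purely for illustration, that a run processes the uploads from the hour that just ended; the slides do not say how the job selects its input, and all names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_window(run_time):
    """Return (start, end) of the full hour ending at or before run_time."""
    end = run_time.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def uploads_in_window(upload_times, window):
    """Keep uploads with start <= t < end (end is exclusive)."""
    start, end = window
    return [t for t in upload_times if start <= t < end]


# Hypothetical upload timestamps: one every 15 minutes from 09:00 to 10:45.
uploads = [datetime(2018, 4, 10, 9, 0) + timedelta(minutes=15 * i) for i in range(8)]

window = hourly_window(datetime(2018, 4, 10, 10, 0))
batch = uploads_in_window(uploads, window)
print(len(batch))  # 4
```

Making the window end exclusive means an upload stamped exactly 10:00 lands in the next run rather than being scored twice.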
Airflow DAG
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow -> Apache Incubator
Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers <- We’re growing here
Thank You!
Apache Airflow : Behind the Scenes
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
[Diagram: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, three Workers]
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery (from RabbitMQ)
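The scheduler-to-worker handoff can be modeled with the standard library alone. In this toy sketch a `queue.Queue` stands in for Celery / RabbitMQ and a dict stands in for the metadata DB; the names and task states are assumptions for illustration and do not reflect Airflow's internals.

```python
import queue
import threading

meta_db = {}                # stand-in for the metadata DB: task_id -> state
task_queue = queue.Queue()  # stand-in for the Celery / RabbitMQ queue
db_lock = threading.Lock()


def schedule(task_ids):
    """Scheduler: record each task as queued, then publish it to the queue."""
    for task_id in task_ids:
        with db_lock:
            meta_db[task_id] = "queued"
        task_queue.put(task_id)


def worker():
    """Worker: pull tasks until a None sentinel arrives, then stop."""
    while True:
        task_id = task_queue.get()
        if task_id is None:
            break
        with db_lock:
            meta_db[task_id] = "success"  # real work would happen before this


workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

schedule(["extract", "score", "load"])
for _ in workers:
    task_queue.put(None)  # one sentinel per worker
for w in workers:
    w.join()

print(meta_db)
```

The scheduler never calls a worker directly: it only writes state and enqueues work, so workers can be added or removed without changing the scheduler.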
Thank You!