Airflow: Lessons Learned from 3 Years Using Airflow in Production
Juan Martín Pampliega
Juan Martín Pampliega
Information Engineering @ ITBA
Co-Founder @ Mutt Data
Professor @ ITBA
● Working on data projects since 2010.
● Globant (Google), Despegar, Socialmetrix, Jampp, Claro, Clarín and other companies.
● Co-Founder @ Mutt Data, a company specialized in developing Big Data and Data Science projects.
● Using Airflow in production since 2015 to manage data workflows at several companies.
Apache Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows.
Started in late 2014 @ Airbnb by Maxime Beauchemin. Open sourced in mid 2015. In Apache incubation since March 2016; first Apache release in March 2017.
Used by HBO, Twitter, ING, PayPal, Reddit, Yahoo, Jampp and more!
Author workflows as directed acyclic graphs (DAGs) of tasks.
The UI makes it easy to visualize workflow status, monitor progress, and troubleshoot issues when needed.
Workflows are defined as Python code, which makes them more maintainable, versionable, testable and collaborative, and fosters abstraction and code reuse.
Problems with CRON & similar options
● CRON does a poor job of handling task dependencies and offers no way to visualize them.
● Poor or no strategy for retrying tasks or running backfills.
● Limited data about task start times, execution durations and failures.
● Need to SSH into the server to check logs and interact with jobs.
● No easy way to scale beyond one machine.
● People mostly write jobs in Bash or XML.
● Some questions that are hard to answer:
○ Do you know when your CRON jobs fail?
○ Can you spot when your tasks become 3x slower?
○ Can you visualize what's currently running? What's queued?
○ Do you have reusable components you can use across workflows?
Terminology
● DAG: a Directed Acyclic Graph of tasks that you want to run (a workflow).
● Operator: defines what should be executed. Examples: run a Bash command, insert data into a table, etc.
● Task: an instance of an operator that defines a node in a DAG.
● DAG run: an instance of a DAG. When a DAG is triggered, Airflow orchestrates the execution of its operators while respecting dependencies and allocated resources.
● Task instance: a specific run of a task for a particular DAG run at a particular point in time. (A minimal sketch follows.)
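To make the terminology concrete, here is a minimal sketch of a two-task DAG, assuming Airflow 1.x-style import paths (the era of this talk); the DAG id, dates and commands are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# DAG: the workflow definition, a directed acyclic graph of tasks.
dag = DAG(
    dag_id="example_terminology",        # hypothetical DAG id
    start_date=datetime(2018, 6, 1),
    schedule_interval="@daily",
)

# Tasks: instances of operators; each one becomes a node in the DAG.
extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

# Dependency: 'load' only runs after 'extract' succeeds. Every scheduled execution of
# this DAG is a DAG run, and each run of 'extract' or 'load' within it is a task instance.
extract >> load
```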
DAGs and Executions
DAG Definition Files
DAGs are just configuration files that define the DAG structure as code using Python.
A DAG definition file doesn't do any data processing itself; only the actual execution of a DAG run does.
The tasks it defines will run in different contexts, on different workers, at different points in time, and mostly don't communicate with each other.
Definition files should execute quickly (hundreds of milliseconds) because they are evaluated frequently by the Airflow scheduler.
Example DAG
Run a DAG with:
airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02
Or open the Airflow UI, enable the DAG, and Airflow will trigger it when a run is due.
Local Installation
Distributed Architecture
Executors
● SequentialExecutor: the default; can only run one task at a time.
● LocalExecutor: can run multiple tasks locally; needs a metadata DB other than SQLite.
● CeleryExecutor: uses Celery to execute tasks remotely; needs Celery workers.
● DaskExecutor: similar to Celery but with Dask; lower latency.
● MesosExecutor: runs tasks as containers on a Mesos cluster.
● KubernetesExecutor: tasks are executed as Kubernetes pods.
The CeleryExecutor is the most widely adopted option when scaling to multiple machines.
Tip: use Redis as the broker and Celery Flower to monitor it. Results should be stored in Postgres.
Distributed Architecture
The Scheduler
The scheduler process iterates over all DAGs continually and triggers DAG runs when they are due. The key scheduling parameters are:
● execution_date: the logical date identifying the period of data a DAG run processes.
● start_date: the execution_date of the first DAG run.
● end_date: the last execution_date that will have a DAG run.
● execution_timeout: maximum time a task may take before it is failed.
● retries: number of times a task will be retried before it is marked as failed.
● retry_delay: minimum time between one task attempt and the next after a failure.
A sketch of how these are typically set follows.
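As a sketch of where these parameters usually live (values are illustrative, not from the slides): retries, retry_delay and execution_timeout are commonly passed through the operators' default_args, while start_date, end_date and the schedule are set on the DAG itself.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 2,                              # retry a failing task twice
    "retry_delay": timedelta(minutes=5),       # wait 5 minutes between attempts
    "execution_timeout": timedelta(hours=1),   # fail the task if it runs longer than this
}

dag = DAG(
    dag_id="scheduler_params_example",         # hypothetical DAG id
    start_date=datetime(2018, 1, 1),           # execution_date of the first DAG run
    end_date=datetime(2018, 12, 31),           # last execution_date that gets a DAG run
    schedule_interval="@daily",
    default_args=default_args,
)

task = BashOperator(task_id="daily_job", bash_command="echo run", dag=dag)
```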
Metrics
Connections and Variables
Additional Features
● Hooks: interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, etc.
● Pools: help limit the execution parallelism on arbitrary sets of tasks. Tasks can be assigned to pools and have a priority weight.
● Queues: when using the CeleryExecutor, tasks can be assigned to a queue, and a worker can listen to and execute tasks from one or many queues.
● XComs: enable tasks to exchange messages consisting of any object that can be pickled (see the sketch below).
● Sensors: operators that wait for a certain condition to be met before succeeding (e.g. wait for a certain file to appear in a directory).
● Authentication: there are plugins to enable authentication and authorization through LDAP, Kerberos and other methods.
● Ad Hoc Queries: enable charting and querying of configured data sources.
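A minimal sketch of XComs (the DAG id, task ids and values below are hypothetical): one PythonOperator task pushes a value, which is pickled and stored in the metadata database, and a downstream task pulls it.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("xcom_example", start_date=datetime(2018, 6, 1), schedule_interval="@daily")

def push_row_count(**context):
    # Push a value; any picklable object can be exchanged as an XCom.
    context["ti"].xcom_push(key="row_count", value=42)

def report_row_count(**context):
    # Pull the value pushed by the upstream task.
    rows = context["ti"].xcom_pull(task_ids="push_row_count", key="row_count")
    print("rows processed: %s" % rows)

push = PythonOperator(task_id="push_row_count", python_callable=push_row_count,
                      provide_context=True, dag=dag)   # Airflow 1.x: pass context into the callable
pull = PythonOperator(task_id="report_row_count", python_callable=report_row_count,
                      provide_context=True, dag=dag)
push >> pull
```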
Jinja Templating
Jinja templating makes multiple helpful variables and macros available to aid in date manipulation.
The {{ }} brackets tell Airflow that this is a Jinja template, and ds is a variable made available by Airflow that is replaced by the execution date in the format YYYY-MM-DD. Thus, in the DAG run stamped with 2018-06-04, {{ ds }} renders to 2018-06-04.
Another useful variable is ds_nodash: './run.sh {{ ds_nodash }}' renders to './run.sh 20180604' for the same run. The execution_date variable is also useful, as it is a Python datetime object and not a string like ds.
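A small sketch of how such templated commands are attached to tasks; run.sh is taken from the slide's example, while the DAG id and the pairing of run.sh with {{ ds }} are assumptions for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("templating_example", start_date=datetime(2018, 6, 1), schedule_interval="@daily")

# For the DAG run stamped 2018-06-04, these commands render to
# './run.sh 2018-06-04' and './run.sh 20180604' respectively.
with_dashes = BashOperator(task_id="run_ds", bash_command="./run.sh {{ ds }}", dag=dag)
no_dashes = BashOperator(task_id="run_ds_nodash", bash_command="./run.sh {{ ds_nodash }}", dag=dag)
```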
Plugins
Enable defining custom hooks, operators, sensors, macros, executors and web views.
Used at many companies to generate DAGs automatically for ETLs, ML, A/B testing, etc.
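A minimal sketch of the plugin mechanism in Airflow 1.x (the class, operator and plugin names are hypothetical): a file dropped into the plugins folder subclasses AirflowPlugin and lists what it exposes.

```python
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin

class GreetOperator(BaseOperator):
    """Toy operator; a real plugin would wrap company-specific logic."""
    def execute(self, context):
        self.log.info("Hello from a plugin-provided operator")

class CompanyPlugin(AirflowPlugin):
    name = "company_plugin"        # hypothetical plugin name
    operators = [GreetOperator]    # importable as airflow.operators.company_plugin.GreetOperator
```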
Testing
DAGs are code, so we have several options to test them:
● Test DAG imports: iterate through the DAG bag and check that each DAG can be imported, or run the .py file from the command line (see the sketch below).
● Test DAG parameters: make sure all DAGs have required parameters like alert e-mails, catchup, etc.
● Unit test Python logic: since the code executed by the PythonOperator is a Python function, you can write normal unit tests for it.
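As a sketch of the first two bullets, a pytest-style check over the DagBag might look like this; the dags folder path and the specific parameter policy (alert e-mail set, catchup disabled) are assumptions, not rules from the slides:

```python
from airflow.models import DagBag

def test_dags_import_without_errors():
    # Loading the DAG bag parses every definition file under the dags folder.
    dagbag = DagBag(dag_folder="dags/", include_examples=False)
    assert len(dagbag.import_errors) == 0, "DAG import failures: %s" % dagbag.import_errors

def test_dags_have_required_parameters():
    dagbag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dagbag.dags.items():
        # Example policy: every DAG must define an alert e-mail and disable catchup.
        assert dag.default_args.get("email"), "%s has no alert e-mail" % dag_id
        assert dag.catchup is False, "%s should set catchup=False" % dag_id
```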
Installation Best Practices
● Install the apache-airflow package.
● The LocalExecutor is fine to start with.
● Use the CeleryExecutor, or Dask/Kubernetes, to scale.
● Use https://github.com/puckel/docker-airflow if you want to use Docker.
● Use PostgreSQL or MySQL for the metadata database.
● Tune scheduler properties to reduce CPU consumption.
● Remember to copy all config and DAG files to the workers'/executors' location.
Best Practices
● Try to balance DAG readability against abstracting code.
● Use depends_on_past and wait_for_downstream for safety (see the sketch after this list).
● Change the name of the DAG when you change its start_date.
● Tasks are processes that run on workers; limit the size of the data they process locally.
● Remember to erase task logs after a certain time.
● Generate custom views for non-technical people.
● Abstract duplicated logic!
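A small sketch of the second bullet (DAG id and values are illustrative): depends_on_past and wait_for_downstream can be applied to every task through default_args.

```python
from datetime import datetime

from airflow import DAG

default_args = {
    # A task instance only runs if the same task succeeded in the previous DAG run.
    "depends_on_past": True,
    # Additionally wait for the previous run's immediately downstream tasks to finish,
    # which protects runs that would otherwise read data the previous run is still writing.
    "wait_for_downstream": True,
}

dag = DAG(
    dag_id="safe_backfill_example",   # hypothetical DAG id
    start_date=datetime(2018, 6, 1),
    schedule_interval="@daily",
    default_args=default_args,
)
```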
Must reads on Data Engineering & Airflow
Questions?
We are hiring Data Engineers!
References
Quizlet 4-part series:
https://medium.com/@dustinstansbury/beyond-cron-an-introduction-to-workflow-management-systems-19987afcdb5e
https://towardsdatascience.com/why-quizlet-chose-apache-airflow-for-executing-data-workflows-3f97d40e9571
https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
https://medium.com/@dustinstansbury/how-quizlet-uses-apache-airflow-in-practice-a903cbb5626d
http://michal.karzynski.pl/blog/2017/03/19/developing-workflows-with-apache-airflow/
https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb
https://github.com/jghoman/awesome-apache-airflow
https://gtoonstra.github.io/etl-with-airflow/principles.html
http://tech.marksblogg.com/install-and-configure-apache-airflow.html
Operator Trigger Rules
● Operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success.
● Other options:
○ all_failed: all parents are in a failed or upstream_failed state
○ all_done: all parents are done with their execution
○ one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
○ one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
A sketch of overriding the default follows.
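As a minimal sketch (DAG id, task ids and commands are hypothetical), a join task that should fire once any upstream branch succeeds overrides the default with trigger_rule="one_success":

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("trigger_rule_example", start_date=datetime(2018, 6, 1), schedule_interval="@daily")

branch_a = BashOperator(task_id="branch_a", bash_command="echo a", dag=dag)
branch_b = BashOperator(task_id="branch_b", bash_command="echo b", dag=dag)

# Fires as soon as at least one parent succeeds instead of waiting for all of them.
join = BashOperator(
    task_id="join",
    bash_command="echo join",
    trigger_rule="one_success",
    dag=dag,
)

branch_a >> join
branch_b >> join
```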
