This presentation gives an overview of the Apache Airflow project, covering its pipelines, tasks, integrations and UI.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
1. What Is Apache Airflow?
● A workflow management platform
● Uses Python-based workflows
● Schedules by time or by event
● Open source, Apache 2.0 license
● Written in Python
● Workflows can be monitored in a UI
● Has a wide range of integration options
● Originally developed at Airbnb
2. What Is Apache Airflow?
● Uses SQLite as the default back-end DB but can use
– MySQL, Postgres, JDBC-accessible databases, etc.
● Extra packages are installed using the pip command
– A wide variety is available, including
– Many databases and cloud services
– The Hadoop ecosystem
– Security, web services, queues
– Many more
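As a sketch of how the extras above are typically installed, pip "extras" pull in the optional dependencies for a given integration (the extras names shown, postgres and kubernetes, are examples; check the Airflow docs for the full list):

```shell
# Base install; the default SQLite backend works out of the box
pip install apache-airflow

# Extras add provider dependencies, e.g. Postgres and Kubernetes support
pip install "apache-airflow[postgres,kubernetes]"
```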
3. Airflow Pipelines
● These are Python-based workflows
● Each pipeline is a directed acyclic graph (DAG)
● Pipelines use Jinja templating
● Pipelines contain user-defined tasks
● Tasks can run on different workers at different times
● Jinja scripts can be embedded in tasks
● Comments can be added to tasks in varying formats
● Inter-task dependencies can be defined
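The points above can be sketched as a minimal DAG file. This assumes Airflow 2.x (parameter names vary slightly across versions); the DAG id, schedule and commands are illustrative:

```python
# Minimal Airflow DAG sketch: a time-scheduled pipeline of two tasks,
# with Jinja templating embedded in a task and an inter-task dependency.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # time-based schedule
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        # Jinja templating: {{ ds }} expands to the run's logical date
        bash_command="echo extracting data for {{ ds }}",
    )
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Inter-task dependency: extract must succeed before load runs
    extract >> load
```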
5. Airflow Tasks
● Tasks have a lifecycle
● Tasks execute via operators; the operator depends on the task type
– For instance, MySqlOperator
● Hooks are used to access external systems, e.g. databases
● Worker-specific queues can be used for tasks
● XCom allows tasks to exchange messages
● Pipelines (DAGs) allow
– Branching
– Sub-DAGs
– Service-level agreements (SLAs)
– Triggering rules
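XCom message passing can be sketched with the TaskFlow API (Airflow 2.x assumed; the task names and payload are illustrative). A task's return value is pushed to XCom, and passing it as an argument pulls it in the downstream task:

```python
# XCom sketch: produce() pushes its return value to XCom,
# consume() pulls it via its function argument.
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def produce():
        # Return value is pushed to XCom automatically
        return {"rows": 42}

    @task
    def consume(payload):
        # Argument is pulled from the upstream task's XCom
        print(f"received {payload['rows']} rows")

    # Wiring the call also defines the task dependency
    consume(produce())

xcom_example()
```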
8. Airflow UI
● The Airflow UI provides views
– DAG, Tree, Graph, Variables, Gantt chart
– Task duration, Code view
● Select a task instance in any view to manage it
● Monitor and troubleshoot pipelines in the views
● Monitor DAGs by owner, schedule, run time, etc.
● Use the views to find pipeline problem areas
● Use the views to find bottlenecks
10. Airflow Integration
● Airflow integrates with
– Azure: Microsoft Azure
– AWS: Amazon Web Services
– Databricks
– GCP: Google Cloud Platform
– Cloud Speech Translate Operators
– Qubole
● Kubernetes
– Run tasks as pods
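Running a task as a Kubernetes pod can be sketched with the KubernetesPodOperator from the cncf-kubernetes provider (the import path shown is for recent provider versions and may differ in older releases; image, namespace and names are illustrative):

```python
# Sketch: run a task's work inside a Kubernetes pod.
# Requires the apache-airflow-providers-cncf-kubernetes package.
from airflow.providers.cncf.kubernetes.operators.pod import (
    KubernetesPodOperator,
)

run_in_pod = KubernetesPodOperator(
    task_id="run_in_pod",
    name="airflow-demo-pod",        # illustrative pod name
    namespace="default",
    image="python:3.11-slim",       # any container image
    cmds=["python", "-c", "print('hello from a pod')"],
)
```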
11. Airflow Metrics
● Airflow can send metrics to StatsD
– A network daemon that runs on Node.js
– Listens for statistics such as counters, gauges and timers
– Statistics are sent over UDP or TCP
● Install metrics support using the pip command
● Specify which stats to record, e.g.
– scheduler, executor, dagrun
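As a sketch, StatsD support is installed via the `statsd` extra (`pip install "apache-airflow[statsd]"`) and enabled in airflow.cfg. The option names below are for Airflow 2.x; verify them against your version's configuration reference:

```ini
# airflow.cfg sketch: send metrics to a local StatsD daemon
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
# Limit which stat name prefixes are emitted
metrics_allow_list = scheduler,executor,dagrun
```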
12. Available Books
● See “Big Data Made Easy”
– Apress, Jan 2015
● See “Mastering Apache Spark”
– Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress, Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
13. Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration