3. What’s Apache Airflow
A platform to programmatically author, schedule, and monitor workflows, Airflow has gained popularity in the world of data engineering and machine learning. Its use cases include (a minimal example DAG is sketched after this list):
• Defining and monitoring cron jobs
• Automating certain DevOps operations
• Moving data periodically
• Machine learning pipelines
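To make "programmatically author" concrete, here is a minimal sketch of a DAG in Airflow 1.x style (the version the puckel/docker-airflow image used later targets); the DAG id, schedule, and shell commands are made up for illustration:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Default task settings: retry once, five minutes apart.
    default_args = {
        'owner': 'airflow',
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    # A hypothetical daily pipeline with two shell tasks.
    dag = DAG(
        dag_id='example_daily_move',
        default_args=default_args,
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
    load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

    extract >> load  # run extract first, then load

Dropped into the dags folder, a file like this is picked up by the scheduler, run on its schedule, and each run's status shows up in the web UI.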
4. Who is using it
• Airbnb
• Lyft
• Bloomberg
• Jet.com
• Sam’s Club
• ... and a long list of others
5. Why I think it is useful for us
• We need a way to manage data movement in and out
• A reliable, consistent way to schedule and monitor data-movement tasks
• Automated data ingestion for Hubble
• Automated machine learning pipelines
• Other business units or small groups could benefit if we develop and deploy it as part of our infrastructure
7. Install Airflow (cont’d)
I created a fork of puckel/docker-airflow and helm/stable/airflow with the following features:
1. Role-based authentication
2. Third-party Python packages to support the Azure, GCP, and Kubernetes operators
3. DAGs mounted from Azure Files for the App Service and AKS deployments
4. An ingress controller for TLS termination in the AKS deployment
5. Sample Airflow Connections for Azure Blob Storage, Cosmos DB, and Azure Databricks
6. Sample DAGs using Azure Cosmos DB, Blob Storage, etc. (one is sketched below)
See this link for details.
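As a rough sketch of what one of those sample DAGs might look like, the following uploads a string to Blob Storage through Airflow's contrib WASB hook; the connection id, container, and blob names are placeholders, not the ones in the fork:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.contrib.hooks.wasb_hook import WasbHook

    def upload_report(**context):
        # 'wasb_default' is a placeholder Airflow Connection that
        # points at an Azure Blob Storage account.
        hook = WasbHook(wasb_conn_id='wasb_default')
        hook.load_string('hello from airflow',
                         container_name='reports',      # hypothetical container
                         blob_name='daily/report.txt')  # hypothetical blob

    dag = DAG(
        dag_id='sample_blob_upload',
        start_date=datetime(2019, 1, 1),
        schedule_interval='@daily',
    )

    PythonOperator(
        task_id='upload_report',
        python_callable=upload_report,
        provide_context=True,
        dag=dag,
    )

Keeping credentials in an Airflow Connection (here 'wasb_default') means the DAG code itself stays free of secrets.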
8. Working Process
• Use your local Docker image for DAG development:
    docker-compose -f docker-compose-CeleryExecutor.yml up -d
• Open a pull request to promote a DAG to production
• The dags folder is managed in Git (and mounted into the containers, as sketched below)
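That mount is what makes the git-managed folder visible to Airflow: in puckel/docker-airflow the compose file binds the local ./dags directory into each container at the image's DAG path. A rough excerpt (exact paths and services may differ in the fork):

    webserver:
        image: puckel/docker-airflow
        volumes:
            # the git-managed dags/ folder appears inside the container,
            # so the scheduler and workers pick up DAG changes on reload
            - ./dags:/usr/local/airflow/dags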