Roi Teveth (Data Engineer) and Itai Yaffe (Tech Lead, Big Data group) @ Nielsen:
At Nielsen Identity Engine, we use Spark to process tens of TBs of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day.
In this talk, we’ll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
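As a rough illustration of what the Spark-on-K8s operator consumes, here is a minimal sketch that builds a SparkApplication manifest in Python. The application name, image, jar path, main class, service account, and resource sizes are all hypothetical placeholders and are not taken from the talk. In an Airflow DAG, a manifest like this would typically be handed to the Kubernetes provider's Spark operator instead of an EMR step, but that wiring is not shown here.

```python
import json


def build_spark_application(name, image, main_class, main_file,
                            executors=2, namespace="default"):
    """Return a SparkApplication manifest (as a dict) for the Spark-on-K8s operator.

    All default values below are illustrative assumptions, not the speakers' settings.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.0.0",
            "restartPolicy": {"type": "Never"},
            # Hypothetical sizing; tune per workload.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"cores": 2, "instances": executors, "memory": "4g"},
        },
    }


# Hypothetical ETL job used only for illustration.
manifest = build_spark_application(
    name="example-etl",
    image="example.registry/spark:3.0.0",
    main_class="com.example.Etl",
    main_file="local:///opt/spark/jars/example-etl.jar",
    executors=4,
)
print(json.dumps(manifest, indent=2))
```

Keeping the manifest as plain data means the DAG change can stay small: the task that used to describe an EMR step describes a SparkApplication instead.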
12. @ItaiYaffe, @RTeveth
~20 automatic DAG deployments/day
~1000 DAG runs/day
~2 years in production
Met all requirements & more
~40 users across 4 groups
6 contributions to open-source
19. @ItaiYaffe, @RTeveth
EMR is an AWS managed service to run Hadoop & Spark clusters
Allows you to reduce costs by using Spot instances
Charges management cost for each instance in a cluster