SlideShare uma empresa Scribd logo
1 de 30
Baixar para ler offline
@erinshellman
Wrangle Conf
July 20th, 2017
Building Robust
Pipelines with Airflow
Zymology: is the science of fermentation and it’s applied to
make materials and molecules
!
"
#
$
Beer
Insulin
Food
additives
Plastics
Zymergen provides
a platform for
rapid improvement
of microbial strains
through genetic
engineering.
Robotic automation
Our experimentation
is increasingly
orchestrated with
robotics and machine
learning.
Learning how to efficiently
navigate the genome is the mission
of data science at Zymergen
Blocker: process failure
Orchestrating complex experiments with
robots is hard, and there are process failures.
These failures often cause sporadic, extreme
measurement values.
Blocker: batch effects
We see temporal
effects based on
when experiments
were performed
Blocker:
different interpretations of results
We’re building a platform that can
support any microbe and any molecule.
Sometimes that results in a proliferation
of solutions with disagreement on which
is best.
Processing pipeline
1.Identify process failures
2.Quantify and remove process-
related bias
3.Identify strains that show
improvement using consistent
criteria
Clean model inputs
Outlier detection
Normalization
Hit detection
Rolling our own ETL pipeline
There are many
ways to measure
the concentration of
a molecule.
Any microbe, any
molecule… any
experiment, many
data formats.
Describing complex
processing
dependencies is
hard.
Rolling our own ETL pipeline
Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and
monitor workflows.”
Airflow gives us flexibility to apply a common
set of processing steps to variable data
inputs, schedule complex processing
workflows, and has become a delivery
mechanism for our products.
Structure
and Flexibility
e.g. Normalization
Airflow workflows are
described as directed
acyclic graphs (DAGs).
Each task node in the
DAG is an operator.
The anatomy of a
DAG
Custom operators
Ordering
Instantiate DAG
Modularity and flexibility
Airflow + PyStan
With Bayesian hierarchical models we estimate
(and monitor) the distribution of batch effects.
Experimental bias
DropBox
• Scientists at Zymergen work with data using
many different tools including JMP, SQL, and
Excel.
• We use a custom DropBox hook to make
quick data ingestion pipelines.
Alerting /
Communication
3rd-party hooks & operators
Operator
Pairs well with Superset!
“Apache Superset is a
modern, enterprise-ready
business intelligence web
application”
https://github.com/apache/incubator-superset
Constructing machine
learning workflows
Fairflow: Functional Airflow
• The core of Fairflow is an
abstract base class foperator
that takes care of
instantiating your Airflow
operators and setting their
dependencies.
• In Fairflow, DAGs are
constructed from foperators
that create the upstream
operators when the final
foperator is called.
Configuring complex ML
workflows… functionally
Defining ML workflows
In the DAG
definition, create
an instance of the
task.
Then, instantiate a
DAG like usual and
call the compare
task on the DAG.
Defining ML workflows
The design allows for simple creation of
complicated experimental workflows with arbitrary
sets of models, parameters, and evaluation metrics.
Is Airflow for you?
Do you have heterogeneous data sources?
Do you have complex dependencies between
processing tasks?
Do you have data with different velocities?
Do you have constraints on your time?
Probably!
Thanks team!
%%
& '()
*
+

Mais conteúdo relacionado

Mais procurados

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
Laura Lorenz
 

Mais procurados (19)

Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Building cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and DockerBuilding cloud-enabled genomics workflows with Luigi and Docker
Building cloud-enabled genomics workflows with Luigi and Docker
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
What is Spark
What is SparkWhat is Spark
What is Spark
 
Workflow Engines + Luigi
Workflow Engines + LuigiWorkflow Engines + Luigi
Workflow Engines + Luigi
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Powering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and PythonPowering machine learning workflows with Apache Airflow and Python
Powering machine learning workflows with Apache Airflow and Python
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
Flink Forward SF 2017: Bill Liu & Haohui Mai - AthenaX : Uber’s streaming pro...
 

Semelhante a Building Robust Pipelines with Airflow

Cerberus_Presentation1
Cerberus_Presentation1Cerberus_Presentation1
Cerberus_Presentation1
CIVEL Benoit
 
Coldbox developer training – session 4
Coldbox developer training – session 4Coldbox developer training – session 4
Coldbox developer training – session 4
Billie Berzinskas
 
Software Testing Tecniques
Software Testing TecniquesSoftware Testing Tecniques
Software Testing Tecniques
ersanbilik
 

Semelhante a Building Robust Pipelines with Airflow (20)

Software testing
Software testingSoftware testing
Software testing
 
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
Lessons Learned in Software Development: QA Infrastructure – Maintaining Rob...
 
Concurrency in Eclipse: Best Practices and Gotchas
Concurrency in Eclipse: Best Practices and GotchasConcurrency in Eclipse: Best Practices and Gotchas
Concurrency in Eclipse: Best Practices and Gotchas
 
Java Performance & Profiling
Java Performance & ProfilingJava Performance & Profiling
Java Performance & Profiling
 
Google, quality and you
Google, quality and youGoogle, quality and you
Google, quality and you
 
INTRODUCTION TO MACHINE LEARNING FOR MATERIALS SCIENCE
INTRODUCTION TO MACHINE LEARNING FOR MATERIALS SCIENCEINTRODUCTION TO MACHINE LEARNING FOR MATERIALS SCIENCE
INTRODUCTION TO MACHINE LEARNING FOR MATERIALS SCIENCE
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)
 
Cerberus_Presentation1
Cerberus_Presentation1Cerberus_Presentation1
Cerberus_Presentation1
 
Coldbox developer training – session 4
Coldbox developer training – session 4Coldbox developer training – session 4
Coldbox developer training – session 4
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
Software Testing Tecniques
Software Testing TecniquesSoftware Testing Tecniques
Software Testing Tecniques
 
Create an architecture for web test automation
Create an architecture for web test automationCreate an architecture for web test automation
Create an architecture for web test automation
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
Agility for Data
Agility for DataAgility for Data
Agility for Data
 
Using the Machine to predict Testability
Using the Machine to predict TestabilityUsing the Machine to predict Testability
Using the Machine to predict Testability
 
Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...
 
Java code coverage with JCov. Implementation details and use cases.
Java code coverage with JCov. Implementation details and use cases.Java code coverage with JCov. Implementation details and use cases.
Java code coverage with JCov. Implementation details and use cases.
 
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
Data cleaning with the Kurator toolkit: Bridging the gap between conventional...
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
 

Mais de Erin Shellman

Mais de Erin Shellman (9)

Case studies in data-driven merchandising
Case studies in data-driven merchandisingCase studies in data-driven merchandising
Case studies in data-driven merchandising
 
Catching the most with high-throughput screening
Catching the most with high-throughput screeningCatching the most with high-throughput screening
Catching the most with high-throughput screening
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientists
 
Bot or Not
Bot or NotBot or Not
Bot or Not
 
Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + Scrapy
 
Fun! with the Twitter API
Fun! with the Twitter APIFun! with the Twitter API
Fun! with the Twitter API
 
real time real talk
real time real talkreal time real talk
real time real talk
 
Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!Collaborative Filtering for fun ...and profit!
Collaborative Filtering for fun ...and profit!
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Building Robust Pipelines with Airflow