SlideShare uma empresa Scribd logo
1 de 52
Baixar para ler offline
Bighead
Airbnb’s End-to-End Machine
Learning Infrastructure
Krishna Puttaswamy & Nick Handel
On behalf of ML Infra @ Airbnb
Context
In 2016
● Only major models in production
● Models took on average 8 weeks to build (source: survey of ML producers)
● Everything built in Aerosolve, Spark and Scala
● No support for Tensorflow, PyTorch, SK-Learn or other popular ML packages
● Significant discrepancies between offline and online data
ML Infra was formed with the charter to:
● Enable more users to build ML products
● Reduce time and effort
● Enable easier model evaluation
Q4 2016: Formation of our ML Infra team
Before ML
Infrastructure
ML has had a massive impact on Airbnb’s
product
● Search Ranking
● Smart Pricing
● Trust
● Paid Growth
● …And a few other major models
After ML
Infrastructure
But there were many other areas that had
high-potential for ML, but were realized less of
that potential.
● Paid Growth - Hosts
● Classifying listing
● Experience Ranking + Personalization
● Host Availability
● Business Travel Classifier
● Room Type Categorizations
● Make Listing a Space Easier
● Customer Service Ticket Routing
● … And many more
Vision
Airbnb routinely ships ML-powered features throughout the
product.
Mission
Equip Airbnb with shared technology to build
production-ready ML applications with no incidental
complexity.
(Technology = tools, platforms, knowledge, shared feature data, etc.)
Value of ML
Infrastructure
Machine Learning Infrastructure can:
● Remove incidental complexities, by providing
generic, reusable solutions
● Simplify the workflow for intrinsic
complexities, by providing tooling, libraries,
and environments that make ML
development more efficient
And at the same time:
● Establish a standardized platform that
enables cross-company sharing of feature
data and model components
● “Make it easy to do the right thing” (ex:
consistent training/streaming/scoring logic)
Bighead: Motivations
Learnings:
● No consistency between ML Workflows
● New teams struggle to begin using ML
● Airbnb has a wide variety in ML applications
● Existing ML workflows are slow, fragmented, and brittle
● Incidental complexity vs. intrinsic complexity
● Build and forget - ML as a linear process
Q1 2017: Figuring out what to build
Architecture
● Consistent environment across the stack
○ Use Docker
● Common workflow across different ML frameworks
○ Supports Scikit-learn, TF, PyTorch, etc.
● Modular components
○ Easy to customize parts
○ Easy to share data/pipelines
Key Design Decisions
Architecture
Architecture
Architecture
Architecture
Architecture
Components
air/mlinfravision
● Data Management: Zipline
● Training: Redspot / BigQueue
● Core ML Library: Bighead libraries
● Productionization: Deep Thought (online) / ML Automator (offline)
● Model Management: Model Repo
● Monitoring: Model Repo UI
Zipline (ML Data Management Framework)
Zipline - Why
● Defining features (especially windowed) with hive was complicated and error
prone
● Backfilling training sets (on inefficient hive queries) was a major bottleneck
● No feature sharing
● Inconsistent offline and online datasets
● Warehouse is built as of end-of-day, lacked point-in-time features
● ML data pipelines lacked data quality checks or monitoring
● Ownership of pipelines was in disarray
A data management platform for ML
● Common (and simple) definition: Define the feature once and use it in batch
and streaming
● Training data backfills: Resource efficient and point-in-time correct with
scheduled updates
● Lambda updates: Features available both offline and online
● Data quality: Feature visualizations and automatic data quality monitoring
Zipline - Overview
Zipline - Feature definition language
Primary Key
Timestamp
Owner
Operation = Sum
Time windows
● Owner allows us
to trace
accountability
● Primary keys and
timestamp are
used to guarantee
point in time
correctness in
Training Set
● Operations and
time windows are
optional
● Spark efficiently
handles
aggregations
(windowed and
not)
Zipline - Data Quality and Collaboration
● Features can be
visualized and
browsed through
online editor
● Gives stats on
feature, and also
provides info on
ownership
Zipline - Training Data
PK1 = User ID PK2 = Listing ID Timestamp bookings_by_user bookings_by_listing
123 456 2018-01-01 23... 0 4
234 567 2018-01-04 01... 2 8
456 789 2018-01-02 08... 1 0
User provides: Primary keys, timestamps, list of features
Zipline computes feature values
point-in-time correct for those PKs and
those timestamps. And joins them
together.
FeatureSet 1 FeatureSet 2
Zipline - Training Data
Airflow integration for daily
update of training data
Label logic
● Labels are often
joined to features with
an offset for training
(60 days offset)
● But that offset does
not apply to scoring
data
Zipline - Training Data with Labels
ds=2017-08-16
ds=2017-10-15
???
Features Table Labels Table
Training
...
???
ds=2017-10-15
Scoring
Features served
from online KV
store
Zipline schedules
daily batch
correction
Zipline - Consistent online and offline features
User writes one conf
Zipline starts the
streaming job
● More efficient cluster usage: Hive and Spark jobs are optimized; Many weeks
to create training data backfills => a few hours
● Ease of use: Can define 100s of new features in a few hours (from many days)
● Online scoring with lambda: Features are automatically availability in online
scoring environment
● Collaboration: Many features are shared!
● Management: Clear data ownership and maintenance
Zipline - Impact
Redspot (Hosted Jupyter Notebook Service)
Architecture
● Started with Jupyterhub (open-source project), which manages multiple Jupyter
Notebook Servers (prototyping environment)
● But users were installing packages locally, and then creating virtualenv for
other parts of our infra
○ Environment was very fragile
● Users wanted to be able to use jupyterhub on larger instances or instances
with GPU
● Wanting to share notebooks with other teammates was common too
Redspot - Why
Containerized environments
● Every user’s environment is containerized via docker
○ Allows customizing the notebook environment without
affecting other users
■ e.g. install system/python packages
○ Easier to restore state therefore helps with reproducibility
● Support using custom docker images
○ Base images based on user’s needs
■ e.g. GPU access, pre-installed ML packages
○ Build your own image for a faster start time
Remote Instance Spawner
● For bigger jobs and total isolation,
Redspot allows launching a dedicated
instance
● Hardware resources not shared with
other users
● Automatically terminates idle instances
periodically
● A multi-tenant notebook environment
● Makes it easy to iterate and prototype ML models, share work
○ Integrated with the rest of our infra - so one can deploy a notebook to prod
● Improved upon open source Jupyterhub
○ Containerized; can bring custom Docker env
○ Remote notebook spawner for dedicated instances (P3 and X1 machines on
AWS)
○ Persist notebooks in EFS and share with teams
○ Reverting to prior checkpoint
Redspot Summary
Deep Thought (Online Inference Service)
Architecture
● Performant, scalable execution of model inference in production is hard
○ Engineers shouldn’t build one off solutions for every model.
○ Data scientists should be able to launch new models in production with minimal
eng involvement.
● Debugging differences between online inference and training are difficult
○ We should support the exact serialized version of the model the data scientist
built
○ We should be able to run the same python transformations data scientists write
for training.
○ We should be able to load data computed in the warehouse or streaming easily
into online scoring.
Deep Thought - Why
● Deep Thought is a shared service for online inference
○ Support for pickled sklearn models, TensorFlow models, and custom code in
python or Java
○ Add your model configuration to a file and deploy. Completely config driven so data
scientists don’t have to involve engineers to launch new models.
○ Engineers can then connect to a REST API from other services to get scores.
○ Support for loading data from K/V stores
○ Standardized logging, alerting and dashboarding for monitoring and offline
analysis of model performance
○ Process isolation to enable multi-tenancy without contention
○ Scalable and Reliable: 80+ models. Highest QPS service at Airbnb. Median response
time: 4ms. p95: 13ms.
Deep Thought - How
Model Repo
Model Repo
Overview
Model Repo is Bighead’s model management service
● Contains prototype and production models
● Can serve models “raw” or trained
● The source of truth on which trained models are
in production
● Stores model health data
Model Repo
Internals
We decompose Models into two components:
● Model Version - raw model code + docker image
● Model Artifact - parameters learned via training
Model
Version
Model Artifact
Code
Docker
Image
A trained model consists of:
Model Version
+
Model Artifact
Production
Our built-in UI provides:
● Deployment - review changes, deploy, and rollback trained models
● Model Health - metrics, visualizations, alerting, central dashboard
● Experimentation - Ability to setup model experiments - e.g. split traffic
between two or more models
Model Repo: UI
ML Automator
● Tools and libraries for common tasks
○ Periodic training, evaluation and scoring on a model is common: Building Airflow
DAGs, uploading scores to K/V stores, dashboards on scores, alert on score changes
○ Scoring on large tables is tricky to scale
ML Automator - Why
● Once a model file is checked in, we generate the DAGs automatically to train/score it
● 40+ models using this feature
● Score on Spark for large datasets (we generate virtualenv equivalent to the docker image,
as spark doesn’t run executors in docker image)
ML Automator
Core Libraries
ML Helpers - Why
● Transformations are re-written too often
○ There are many versions of transformations for NLP, data cleaning, imputing, etc.
○ Models used to “start from scratch” and rebuild the same things
○ Model observability -- understand what features are important
● Library of transformations; holds more than 50 different transformations including
automated preprocessing for common input formats
● Created example notebooks to show usage of our infra
○ Example usage of ML pipelines, contains diagnostics that help people debug and
improve models
○ Has been cloned and modified more than 20 times to build new models
● Improved Scikit-Learn Pipelines
○ Propagate feature metadata so we can plot feature importance at the end and
connect it to feature names
○ Pipelines for data processing are reusable in other pipelines
○ Added wrappers for model libraries (XGB, etc.) can be serialized (robust to minor
version changes)
ML Helpers and Pipelines
Open Source in H1 2018
If you want to collaborate we can provide
early access
nick.handel@airbnb.com
krishna.puttaswamy@airbnb.com
Appendix
ML models have diverse dependency sets (tensorflow,
xgboost, etc.). We allow users to provide a docker image
within which model code always runs.
ML models don’t run in isolation however, so we’ve built a
lightweight API to interact with the “dockerized model”
Docker Container
Model
(user code)
Other ML
Infra
Services
Model
API
Dockerized
Models

Mais conteúdo relacionado

Mais procurados

How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformDatabricks
 
Multi Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsMulti Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsVaibhav Singh
 
Artwork Personalization at Netflix
Artwork Personalization at NetflixArtwork Personalization at Netflix
Artwork Personalization at NetflixJustin Basilico
 
Is it sensible to use Data Vault at all? Conclusions from a project.
Is it sensible to use Data Vault at all? Conclusions from a project.Is it sensible to use Data Vault at all? Conclusions from a project.
Is it sensible to use Data Vault at all? Conclusions from a project.Capgemini
 
Sentiment analysis - Our approach and use cases
Sentiment analysis - Our approach and use casesSentiment analysis - Our approach and use cases
Sentiment analysis - Our approach and use casesKarol Chlasta
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Network and IT Operations
Network and IT OperationsNetwork and IT Operations
Network and IT OperationsNeo4j
 
How to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineHow to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineAlon Weiss
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorialGiacomo Lanciano
 
Productionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowProductionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowDatabricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scaleHenry Saputra
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018 Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018 Fernando Amat
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox Tsahi Glik
 
Generative AI Fundamentals - Databricks
Generative AI Fundamentals - DatabricksGenerative AI Fundamentals - Databricks
Generative AI Fundamentals - DatabricksVijayananda Mohire
 

Mais procurados (20)

How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
Multi Task Learning for Recommendation Systems
Multi Task Learning for Recommendation SystemsMulti Task Learning for Recommendation Systems
Multi Task Learning for Recommendation Systems
 
Tableau Presentation
Tableau PresentationTableau Presentation
Tableau Presentation
 
Tableau ppt
Tableau pptTableau ppt
Tableau ppt
 
Artwork Personalization at Netflix
Artwork Personalization at NetflixArtwork Personalization at Netflix
Artwork Personalization at Netflix
 
Is it sensible to use Data Vault at all? Conclusions from a project.
Is it sensible to use Data Vault at all? Conclusions from a project.Is it sensible to use Data Vault at all? Conclusions from a project.
Is it sensible to use Data Vault at all? Conclusions from a project.
 
Sentiment analysis - Our approach and use cases
Sentiment analysis - Our approach and use casesSentiment analysis - Our approach and use cases
Sentiment analysis - Our approach and use cases
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Network and IT Operations
Network and IT OperationsNetwork and IT Operations
Network and IT Operations
 
How to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipelineHow to apply machine learning into your CI/CD pipeline
How to apply machine learning into your CI/CD pipeline
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
 
Productionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowProductionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflow
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
MLOps for production-level machine learning
MLOps for production-level machine learningMLOps for production-level machine learning
MLOps for production-level machine learning
 
Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018 Artwork Personalization at Netflix Fernando Amat RecSys2018
Artwork Personalization at Netflix Fernando Amat RecSys2018
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
 
Generative AI Fundamentals - Databricks
Generative AI Fundamentals - DatabricksGenerative AI Fundamentals - Databricks
Generative AI Fundamentals - Databricks
 

Semelhante a ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure

AirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadAirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadKarthik Murugesan
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bigheadChester Chen
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018Adam Gibson
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfvitm11
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Akash Tandon
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stageNick Handel
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data PlatformDani Solà Lagares
 
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...Henry Saputra
 
Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Pôle Systematic Paris-Region
 
Real world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comReal world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comMathieu Dumoulin
 
From prototype to production - The journey of re-designing SmartUp.io
From prototype to production - The journey of re-designing SmartUp.ioFrom prototype to production - The journey of re-designing SmartUp.io
From prototype to production - The journey of re-designing SmartUp.ioMáté Lang
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Pôle Systematic Paris-Region
 
Serverless Functions and Machine Learning: Putting the AI in APIs
Serverless Functions and Machine Learning: Putting the AI in APIsServerless Functions and Machine Learning: Putting the AI in APIs
Serverless Functions and Machine Learning: Putting the AI in APIsNordic APIs
 
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...DataScienceConferenc1
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramFIWARE
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformGetInData
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprisedoppenhe
 

Semelhante a ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure (20)

AirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadAirBNB's ML platform - BigHead
AirBNB's ML platform - BigHead
 
Sf big analytics: bighead
Sf big analytics: bigheadSf big analytics: bighead
Sf big analytics: bighead
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stage
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
 
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and E...
 
Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...Designing and coding for cloud-native applications using Python, Harjinder Mi...
Designing and coding for cloud-native applications using Python, Harjinder Mi...
 
DevOps Days Rockies MLOps
DevOps Days Rockies MLOpsDevOps Days Rockies MLOps
DevOps Days Rockies MLOps
 
Real world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comReal world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.com
 
From prototype to production - The journey of re-designing SmartUp.io
From prototype to production - The journey of re-designing SmartUp.ioFrom prototype to production - The journey of re-designing SmartUp.io
From prototype to production - The journey of re-designing SmartUp.io
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
 
Serverless Functions and Machine Learning: Putting the AI in APIs
Serverless Functions and Machine Learning: Putting the AI in APIsServerless Functions and Machine Learning: Putting the AI in APIs
Serverless Functions and Machine Learning: Putting the AI in APIs
 
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 

Último

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 

Último (20)

Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 

ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure

  • 1. Bighead Airbnb’s End-to-End Machine Learning Infrastructure Krishna Puttaswamy & Nick Handel On behalf of ML Infra @ Airbnb
  • 3. In 2016 ● Only major models in production ● Models took on average 8 weeks to build (source: survey of ML producers) ● Everything built in Aerosolve, Spark and Scala ● No support for Tensorflow, PyTorch, SK-Learn or other popular ML packages ● Significant discrepancies between offline and online data ML Infra was formed with the charter to: ● Enable more users to build ML products ● Reduce time and effort ● Enable easier model evaluation Q4 2016: Formation of our ML Infra team
  • 4. Before ML Infrastructure ML has had a massive impact on Airbnb’s product ● Search Ranking ● Smart Pricing ● Trust ● Paid Growth ● …And a few other major models
  • 5. After ML Infrastructure But there were many other areas that had high-potential for ML, but were realized less of that potential. ● Paid Growth - Hosts ● Classifying listing ● Experience Ranking + Personalization ● Host Availability ● Business Travel Classifier ● Room Type Categorizations ● Make Listing a Space Easier ● Customer Service Ticket Routing ● … And many more
  • 6. Vision Airbnb routinely ships ML-powered features throughout the product. Mission Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity. (Technology = tools, platforms, knowledge, shared feature data, etc.)
  • 7. Value of ML Infrastructure Machine Learning Infrastructure can: ● Remove incidental complexities, by providing generic, reusable solutions ● Simplify the workflow for intrinsic complexities, by providing tooling, libraries, and environments that make ML development more efficient And at the same time: ● Establish a standardized platform that enables cross-company sharing of feature data and model components ● “Make it easy to do the right thing” (ex: consistent training/streaming/scoring logic)
  • 9. Learnings: ● No consistency between ML Workflows ● New teams struggle to begin using ML ● Airbnb has a wide variety in ML applications ● Existing ML workflows are slow, fragmented, and brittle ● Incidental complexity vs. intrinsic complexity ● Build and forget - ML as a linear process Q1 2017: Figuring out what to build
  • 10.
  • 12. ● Consistent environment across the stack ○ Use Docker ● Common workflow across different ML frameworks ○ Supports Scikit-learn, TF, PyTorch, etc. ● Modular components ○ Easy to customize parts ○ Easy to share data/pipelines Key Design Decisions
  • 18. Components air/mlinfravision ● Data Management: Zipline ● Training: Redspot / BigQueue ● Core ML Library: Bighead libraries ● Productionization: Deep Thought (online) / ML Automator (offline) ● Model Management: Model Repo ● Monitoring: Model Repo UI
  • 19. Zipline (ML Data Management Framework)
  • 20. Zipline - Why ● Defining features (especially windowed) with hive was complicated and error prone ● Backfilling training sets (on inefficient hive queries) was a major bottleneck ● No feature sharing ● Inconsistent offline and online datasets ● Warehouse is built as of end-of-day, lacked point-in-time features ● ML data pipelines lacked data quality checks or monitoring ● Ownership of pipelines was in disarray
  • 21. A data management platform for ML ● Common (and simple) definition: Define the feature once and use it in batch and streaming ● Training data backfills: Resource efficient and point-in-time correct with scheduled updates ● Lambda updates: Features available both offline and online ● Data quality: Feature visualizations and automatic data quality monitoring Zipline - Overview
  • 22. Zipline - Feature definition language Primary Key Timestamp Owner Operation = Sum Time windows ● Owner allows us to trace accountability ● Primary keys and timestamp are used to guarantee point in time correctness in Training Set ● Operations and time windows are optional ● Spark efficiently handles aggregations (windowed and not)
  • 23. Zipline - Data Quality and Collaboration ● Features can be visualized and browsed through online editor ● Gives stats on feature, and also provides info on ownership
  • 24. Zipline - Training Data PK1 = User ID PK2 = Listing ID Timestamp bookings_by_user bookings_by_listing 123 456 2018-01-01 23... 0 4 234 567 2018-01-04 01... 2 8 456 789 2018-01-02 08... 1 0 User provides: Primary keys, timestamps, list of features Zipline computes feature values point-in-time correct for those PKs and those timestamps. And joins them together. FeatureSet 1 FeatureSet 2
  • 25. Zipline - Training Data Airflow integration for daily update of training data
  • 26. Label logic ● Labels are often joined to features with an offset for training (60 days offset) ● But that offset does not apply to scoring data Zipline - Training Data with Labels ds=2017-08-16 ds=2017-10-15 ??? Features Table Labels Table Training ... ??? ds=2017-10-15 Scoring
  • 27. Features served from online KV store Zipline schedules daily batch correction Zipline - Consistent online and offline features User writes one conf Zipline starts the streaming job
  • 28. ● More efficient cluster usage: Hive and Spark jobs are optimized; Many weeks to create training data backfills => a few hours ● Ease of use: Can define 100s of new features in a few hours (from many days) ● Online scoring with lambda: Features are automatically availability in online scoring environment ● Collaboration: Many features are shared! ● Management: Clear data ownership and maintenance Zipline - Impact
  • 29. Redspot (Hosted Jupyter Notebook Service)
  • 31. ● Started with Jupyterhub (open-source project), which manages multiple Jupyter Notebook Servers (prototyping environment) ● But users were installing packages locally, and then creating virtualenv for other parts of our infra ○ Environment was very fragile ● Users wanted to be able to use jupyterhub on larger instances or instances with GPU ● Wanting to share notebooks with other teammates was common too Redspot - Why
  • 32. Containerized environments ● Every user’s environment is containerized via docker ○ Allows customizing the notebook environment without affecting other users ■ e.g. install system/python packages ○ Easier to restore state therefore helps with reproducibility ● Support using custom docker images ○ Base images based on user’s needs ■ e.g. GPU access, pre-installed ML packages ○ Build your own image for a faster start time
  • 33. Remote Instance Spawner ● For bigger jobs and total isolation, Redspot allows launching a dedicated instance ● Hardware resources not shared with other users ● Automatically terminates idle instances periodically
  • 34. ● A multi-tenant notebook environment ● Makes it easy to iterate and prototype ML models, share work ○ Integrated with the rest of our infra - so one can deploy a notebook to prod ● Improved upon open source Jupyterhub ○ Containerized; can bring custom Docker env ○ Remote notebook spawner for dedicated instances (P3 and X1 machines on AWS) ○ Persist notebooks in EFS and share with teams ○ Reverting to prior checkpoint Redspot Summary
  • 35. Deep Thought (Online Inference Service)
  • 37. ● Performant, scalable execution of model inference in production is hard ○ Engineers shouldn’t build one off solutions for every model. ○ Data scientists should be able to launch new models in production with minimal eng involvement. ● Debugging differences between online inference and training are difficult ○ We should support the exact serialized version of the model the data scientist built ○ We should be able to run the same python transformations data scientists write for training. ○ We should be able to load data computed in the warehouse or streaming easily into online scoring. Deep Thought - Why
  • 38. ● Deep Thought is a shared service for online inference ○ Support for pickled sklearn models, TensorFlow models, and custom code in python or Java ○ Add your model configuration to a file and deploy. Completely config driven so data scientists don’t have to involve engineers to launch new models. ○ Engineers can then connect to a REST API from other services to get scores. ○ Support for loading data from K/V stores ○ Standardized logging, alerting and dashboarding for monitoring and offline analysis of model performance ○ Process isolation to enable multi-tenancy without contention ○ Scalable and Reliable: 80+ models. Highest QPS service at Airbnb. Median response time: 4ms. p95: 13ms. Deep Thought - How
  • 39.
  • 41. Model Repo Overview Model Repo is Bighead’s model management service ● Contains prototype and production models ● Can serve models “raw” or trained ● The source of truth on which trained models are in production ● Stores model health data
  • 42. Model Repo Internals We decompose Models into two components: ● Model Version - raw model code + docker image ● Model Artifact - parameters learned via training Model Version Model Artifact Code Docker Image A trained model consists of: Model Version + Model Artifact Production
  • 43. Our built-in UI provides: ● Deployment - review changes, deploy, and rollback trained models ● Model Health - metrics, visualizations, alerting, central dashboard ● Experimentation - Ability to setup model experiments - e.g. split traffic between two or more models Model Repo: UI
  • 45. ● Tools and libraries for common tasks ○ Periodic training, evaluation and scoring on a model is common: Building Airflow DAGs, uploading scores to K/V stores, dashboards on scores, alert on score changes ○ Scoring on large tables is tricky to scale ML Automator - Why
  • 46. ● Once a model file is checked in, we generate the DAGs automatically to train/score it ● 40+ models using this feature ● Score on Spark for large datasets (we generate virtualenv equivalent to the docker image, as spark doesn’t run executors in docker image) ML Automator
  • 48. ML Helpers - Why ● Transformations are re-written too often ○ There are many versions of transformations for NLP, data cleaning, imputing, etc. ○ Models used to “start from scratch” and rebuild the same things ○ Model observability -- understand what features are important
  • 49. ● Library of transformations; holds more than 50 different transformations including automated preprocessing for common input formats ● Created example notebooks to show usage of our infra ○ Example usage of ML pipelines, contains diagnostics that help people debug and improve models ○ Has been cloned and modified more than 20 times to build new models ● Improved Scikit-Learn Pipelines ○ Propagate feature metadata so we can plot feature importance at the end and connect it to feature names ○ Pipelines for data processing are reusable in other pipelines ○ Added wrappers for model libraries (XGB, etc.) can be serialized (robust to minor version changes) ML Helpers and Pipelines
  • 50. Open Source in H1 2018 If you want to collaborate we can provide early access nick.handel@airbnb.com krishna.puttaswamy@airbnb.com
  • 52. ML models have diverse dependency sets (tensorflow, xgboost, etc.). We allow users to provide a docker image within which model code always runs. ML models don’t run in isolation however, so we’ve built a lightweight API to interact with the “dockerized model” Docker Container Model (user code) Other ML Infra Services Model API Dockerized Models