
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure



ML Platform meetups are quarterly events where we discuss and share advances in machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.




  1. Bighead: Airbnb's End-to-End Machine Learning Infrastructure. Krishna Puttaswamy & Nick Handel, on behalf of ML Infra @ Airbnb
  2. Context
  3. Q4 2016: Formation of our ML Infra team. In 2016: ● Only a few major models in production ● Models took on average 8 weeks to build (source: survey of ML producers) ● Everything built in Aerosolve, Spark, and Scala ● No support for TensorFlow, PyTorch, scikit-learn, or other popular ML packages ● Significant discrepancies between offline and online data. ML Infra was formed with the charter to: ● Enable more users to build ML products ● Reduce time and effort ● Enable easier model evaluation
  4. Before ML Infrastructure: ML has had a massive impact on Airbnb's product ● Search Ranking ● Smart Pricing ● Trust ● Paid Growth ● ...And a few other major models
  5. After ML Infrastructure: But many other areas had high potential for ML where that potential was largely unrealized ● Paid Growth - Hosts ● Classifying listings ● Experience Ranking + Personalization ● Host Availability ● Business Travel Classifier ● Room Type Categorizations ● Make Listing a Space Easier ● Customer Service Ticket Routing ● ...And many more
  6. Vision: Airbnb routinely ships ML-powered features throughout the product. Mission: Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity. (Technology = tools, platforms, knowledge, shared feature data, etc.)
  7. Value of ML Infrastructure. Machine learning infrastructure can: ● Remove incidental complexities, by providing generic, reusable solutions ● Simplify the workflow for intrinsic complexities, by providing tooling, libraries, and environments that make ML development more efficient. And at the same time: ● Establish a standardized platform that enables cross-company sharing of feature data and model components ● "Make it easy to do the right thing" (e.g. consistent training/streaming/scoring logic)
  8. Bighead: Motivations
  9. Q1 2017: Figuring out what to build. Learnings: ● No consistency between ML workflows ● New teams struggle to begin using ML ● Airbnb has a wide variety of ML applications ● Existing ML workflows are slow, fragmented, and brittle ● Incidental complexity vs. intrinsic complexity ● Build and forget - ML treated as a linear process
  10. Architecture
  11. Key Design Decisions ● Consistent environment across the stack ○ Use Docker ● Common workflow across different ML frameworks ○ Supports scikit-learn, TF, PyTorch, etc. ● Modular components ○ Easy to customize parts ○ Easy to share data/pipelines
  12. Architecture (diagram)
  13. Architecture (diagram)
  14. Architecture (diagram)
  15. Architecture (diagram)
  16. Architecture (diagram)
  17. Components ● Data Management: Zipline ● Training: Redspot / BigQueue ● Core ML Library: Bighead libraries ● Productionization: Deep Thought (online) / ML Automator (offline) ● Model Management: Model Repo ● Monitoring: Model Repo UI
  18. Zipline (ML Data Management Framework)
  19. Zipline - Why ● Defining features (especially windowed ones) with Hive was complicated and error-prone ● Backfilling training sets (via inefficient Hive queries) was a major bottleneck ● No feature sharing ● Inconsistent offline and online datasets ● The warehouse is built as of end-of-day and lacked point-in-time features ● ML data pipelines lacked data-quality checks and monitoring ● Ownership of pipelines was in disarray
  20. Zipline - Overview. A data management platform for ML ● Common (and simple) definitions: define a feature once and use it in both batch and streaming ● Training data backfills: resource-efficient and point-in-time correct, with scheduled updates ● Lambda updates: features available both offline and online ● Data quality: feature visualizations and automatic data-quality monitoring
  21. Zipline - Feature definition language. A definition specifies: primary key, timestamp, owner, an operation (e.g. Sum), and time windows. ● Owner allows us to trace accountability ● Primary keys and the timestamp are used to guarantee point-in-time correctness in the training set ● Operations and time windows are optional ● Spark efficiently handles aggregations (windowed and not)
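Zipline's definition language is not public, but the mechanics described on this slide can be sketched in plain Python. All names below (`Feature`, `compute_windowed_sum`, the field names) are illustrative, not Zipline's real API; the key property shown is that a windowed aggregation only ever looks backward from the as-of time, which is what makes it point-in-time correct.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Feature:
    """Illustrative stand-in for a Zipline-style feature definition."""
    name: str
    primary_key: str
    timestamp_col: str
    owner: str                       # traces accountability
    operation: str = "sum"           # aggregation; optional in Zipline
    windows: tuple = ("7d", "30d")   # time windows; optional in Zipline

def compute_windowed_sum(events, as_of, window_days):
    """Sum event values in [as_of - window, as_of].

    Point-in-time correct: events after `as_of` are never included,
    so a training row can't leak future information.
    """
    start = as_of - timedelta(days=window_days)
    return sum(value for ts, value in events if start <= ts <= as_of)

bookings = Feature(name="bookings_by_user", primary_key="user_id",
                   timestamp_col="ts", owner="ml-infra@airbnb.com")

events = [(datetime(2018, 1, 1), 1), (datetime(2018, 1, 5), 2),
          (datetime(2018, 1, 20), 4)]
# As of Jan 10, the Jan 20 event must not leak into the feature value.
print(compute_windowed_sum(events, datetime(2018, 1, 10), 30))  # -> 3
```

In the real system this aggregation runs on Spark over batch and streaming inputs; the sketch only shows the correctness contract.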
  22. Zipline - Data Quality and Collaboration ● Features can be visualized and browsed through an online editor ● Gives stats on each feature and also provides ownership info
  23. Zipline - Training Data. User provides: primary keys, timestamps, and a list of features. Zipline computes feature values that are point-in-time correct for those PKs and timestamps, and joins them together (FeatureSet 1 + FeatureSet 2). Example rows (PK1 = User ID, PK2 = Listing ID): (123, 456, 2018-01-01 23..., bookings_by_user=0, bookings_by_listing=4); (234, 567, 2018-01-04 01..., bookings_by_user=2, bookings_by_listing=8); (456, 789, 2018-01-02 08..., bookings_by_user=1, bookings_by_listing=0)
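The point-in-time join this slide describes maps closely onto pandas' `merge_asof`: for each (primary key, timestamp) row the user supplies, take the latest feature value at or before that timestamp. A minimal sketch, with invented column values (Zipline does this at scale on Spark, not pandas):

```python
import pandas as pd

# Rows the user supplies: primary keys + timestamps.
rows = pd.DataFrame({
    "user_id": [123, 123],
    "ts": pd.to_datetime(["2018-01-01 23:00", "2018-01-04 01:00"]),
}).sort_values("ts")

# A log of feature values over time (illustrative data).
feature_log = pd.DataFrame({
    "user_id": [123, 123, 123],
    "ts": pd.to_datetime(["2017-12-30", "2018-01-03", "2018-01-05"]),
    "bookings_by_user": [0, 2, 7],
}).sort_values("ts")

# For each row, pick the latest feature value at or before its timestamp;
# the 2018-01-05 value never leaks into earlier training rows.
training = pd.merge_asof(rows, feature_log, on="ts", by="user_id",
                         direction="backward")
print(training["bookings_by_user"].tolist())  # -> [0, 2]
```

Joining several feature sets is just repeating this per set and concatenating the resulting columns.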
  24. Zipline - Training Data. Airflow integration for daily updates of training data
  25. Zipline - Training Data with Labels. Label logic: ● Labels are often joined to features with an offset for training (e.g. a 60-day offset: features at ds=2017-08-16 joined to labels at ds=2017-10-15) ● But that offset does not apply to scoring data: at scoring time (ds=2017-10-15) the features are used as-is, with no label yet available
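The asymmetry on this slide, where the label offset applies at training time but not at scoring time, can be made concrete in a few lines. Dates and field names are illustrative; the 60-day offset matches the slide's ds=2017-08-16 / ds=2017-10-15 example.

```python
from datetime import date, timedelta

OFFSET = timedelta(days=60)  # labels mature 60 days after the feature date

# Illustrative stores keyed by partition date (ds).
features = {date(2017, 8, 16): {"bookings": 3}}
labels = {date(2017, 10, 15): 1}  # observed exactly OFFSET later

def training_example(feature_ds):
    """Training: join features at ds to the label observed OFFSET later."""
    return features[feature_ds], labels[feature_ds + OFFSET]

def scoring_example(feature_ds):
    """Scoring: same features, no offset -- the label doesn't exist yet."""
    return features[feature_ds]

x, y = training_example(date(2017, 8, 16))
print(x, y)  # -> {'bookings': 3} 1
```

Keeping both paths in one place is what prevents the training-time offset from accidentally leaking into the scoring pipeline.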
  26. Zipline - Consistent online and offline features ● User writes one conf ● Zipline starts the streaming job ● Features are served from the online K/V store ● Zipline schedules a daily batch correction
  27. Zipline - Impact ● More efficient cluster usage: Hive and Spark jobs are optimized; training data backfills went from many weeks to a few hours ● Ease of use: hundreds of new features can be defined in a few hours (down from many days) ● Online scoring with lambda: features are automatically available in the online scoring environment ● Collaboration: many features are shared! ● Management: clear data ownership and maintenance
  28. Redspot (Hosted Jupyter Notebook Service)
  29. Architecture
  30. Redspot - Why ● Started with JupyterHub (an open-source project that manages multiple Jupyter notebook servers) as the prototyping environment ● But users were installing packages locally and creating virtualenvs for other parts of our infra ○ The environment was very fragile ● Users wanted to run JupyterHub on larger instances or on instances with GPUs ● Sharing notebooks with teammates was a common need too
  31. Containerized environments ● Every user's environment is containerized via Docker ○ Allows customizing the notebook environment without affecting other users ■ e.g. installing system/Python packages ○ Easier to restore state, which helps with reproducibility ● Support for custom Docker images ○ Base images matched to the user's needs ■ e.g. GPU access, pre-installed ML packages ○ Build your own image for a faster start time
  32. Remote Instance Spawner ● For bigger jobs and total isolation, Redspot allows launching a dedicated instance ● Hardware resources are not shared with other users ● Idle instances are automatically terminated periodically
  33. Redspot Summary ● A multi-tenant notebook environment ● Makes it easy to iterate on and prototype ML models, and to share work ○ Integrated with the rest of our infra, so one can deploy a notebook to prod ● Improved upon open-source JupyterHub ○ Containerized; can bring a custom Docker env ○ Remote notebook spawner for dedicated instances (P3 and X1 machines on AWS) ○ Persists notebooks in EFS for sharing with teams ○ Reverting to a prior checkpoint
  34. Deep Thought (Online Inference Service)
  35. Architecture
  36. Deep Thought - Why ● Performant, scalable execution of model inference in production is hard ○ Engineers shouldn't build one-off solutions for every model ○ Data scientists should be able to launch new models in production with minimal eng involvement ● Debugging differences between online inference and training is difficult ○ We should support the exact serialized version of the model the data scientist built ○ We should be able to run the same Python transformations data scientists write for training ○ We should be able to easily load data computed in the warehouse or streaming into online scoring
  37. Deep Thought - How ● Deep Thought is a shared service for online inference ○ Support for pickled scikit-learn models, TensorFlow models, and custom code in Python or Java ○ Add your model configuration to a file and deploy. Completely config-driven, so data scientists don't have to involve engineers to launch new models ○ Engineers can then connect to a REST API from other services to get scores ○ Support for loading data from K/V stores ○ Standardized logging, alerting, and dashboarding for monitoring and offline analysis of model performance ○ Process isolation to enable multi-tenancy without contention ○ Scalable and reliable: 80+ models; highest-QPS service at Airbnb; median response time 4 ms, p95 13 ms
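From a calling service's perspective, a config-driven scoring API like Deep Thought reduces to "name a model, send features, get a score back". A hypothetical client sketch; the URL scheme, payload shape, and response field are assumptions, since the real API is internal to Airbnb:

```python
import json
import urllib.request

def build_request(model_name, features):
    """Request body for a hypothetical Deep Thought-style scoring call.

    Because the service is config-driven, the caller only needs the
    model's name and a feature dict -- no model-specific client code.
    """
    return {"model": model_name, "features": features}

def score(base_url, model_name, features):
    """POST features to the (assumed) REST endpoint and return the score."""
    body = json.dumps(build_request(model_name, features)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/models/{model_name}/score", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.load(resp)["score"]

payload = build_request("room_type_classifier", {"num_beds": 2})
print(json.dumps(payload, sort_keys=True))
```

The stable request shape is the point: launching a new model changes a config file on the service side, not the callers.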
  38. Model Repo
  39. Model Repo Overview. Model Repo is Bighead's model management service ● Contains prototype and production models ● Can serve models "raw" or trained ● The source of truth on which trained models are in production ● Stores model health data
  40. Model Repo Internals. We decompose models into two components: ● Model Version - raw model code + Docker image ● Model Artifact - parameters learned via training. A trained (production) model consists of a Model Version plus a Model Artifact.
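The version/artifact split above can be sketched as a pair of records; the field names and example values here are invented for illustration, not Model Repo's real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    """Pins the code and environment: raw model code + Docker image."""
    name: str
    code_ref: str       # e.g. a commit identifier for the raw model code
    docker_image: str   # the environment the code always runs in

@dataclass(frozen=True)
class ModelArtifact:
    """Parameters learned via training, tied to the version that made them."""
    version: ModelVersion
    params_uri: str     # where the serialized parameters live

@dataclass(frozen=True)
class TrainedModel:
    """A deployable model: Model Version + Model Artifact."""
    version: ModelVersion
    artifact: ModelArtifact
    in_production: bool = False  # Model Repo is the source of truth here

v = ModelVersion("pricing", "abc123", "bighead/pricing:1")
m = TrainedModel(v, ModelArtifact(v, "s3://models/pricing/1"), True)
print(m.in_production)  # -> True
```

Separating the two means retraining produces a new artifact without touching the version, while a code or dependency change produces a new version that must be retrained before it can ship.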
  41. Model Repo: UI. Our built-in UI provides: ● Deployment - review changes, deploy, and roll back trained models ● Model Health - metrics, visualizations, alerting, central dashboard ● Experimentation - ability to set up model experiments, e.g. split traffic between two or more models
  42. ML Automator
  43. ML Automator - Why ● Tools and libraries for common tasks ○ Periodic training, evaluation, and scoring of a model is common: building Airflow DAGs, uploading scores to K/V stores, dashboards on scores, alerts on score changes ○ Scoring on large tables is tricky to scale
  44. ML Automator ● Once a model file is checked in, we generate the DAGs to train/score it automatically ● 40+ models use this feature ● Score on Spark for large datasets (we generate a virtualenv equivalent to the Docker image, since Spark doesn't run executors inside a Docker image)
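The "check in a model file, get DAGs for free" idea can be sketched without Airflow itself: a generator derives a deterministic DAG id and task list from the model name. The task names and schedule below are invented to make the wiring visible, not ML Automator's real output.

```python
def automator_tasks(model_name, schedule="@daily"):
    """Derive the (hypothetical) generated DAG for one checked-in model.

    Real ML Automator emits Airflow DAGs; this sketch only shows that
    everything is derived from the model name, so no per-model
    pipeline code needs to be written by hand.
    """
    dag_id = f"ml_automator__{model_name}"
    tasks = [
        f"{dag_id}.train",          # periodic retraining
        f"{dag_id}.evaluate",       # evaluation after each training run
        f"{dag_id}.score_spark",    # batch scoring on Spark for large tables
        f"{dag_id}.upload_kv",      # push scores to the K/V store
        f"{dag_id}.alert_on_drift", # alert on score changes
    ]
    return dag_id, schedule, tasks

dag_id, sched, tasks = automator_tasks("smart_pricing")
print(tasks[0])  # -> 'ml_automator__smart_pricing.train'
```

Because the DAG is a pure function of the checked-in model, adding model number 41 is a code review away, with no new orchestration code.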
  45. Core Libraries
  46. ML Helpers - Why ● Transformations are rewritten too often ○ There are many versions of transformations for NLP, data cleaning, imputing, etc. ○ Models used to "start from scratch" and rebuild the same things ○ Model observability: understand which features are important
  47. ML Helpers and Pipelines ● Library of transformations; holds more than 50 different transformations, including automated preprocessing for common input formats ● Created example notebooks to show usage of our infra ○ Example usage of ML pipelines, with diagnostics that help people debug and improve models ○ Has been cloned and modified more than 20 times to build new models ● Improved scikit-learn Pipelines ○ Propagate feature metadata so we can plot feature importance at the end and connect it to feature names ○ Pipelines for data processing are reusable in other pipelines ○ Added wrappers so model libraries (XGB, etc.) can be serialized (robust to minor version changes)
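The "propagate feature metadata" improvement can be shown with a toy pipeline: each step transforms both the data and the feature-name list in lockstep, so importances at the end can be tied back to names. The step API here is invented, not Bighead's actual wrapper around scikit-learn:

```python
class NamedPipeline:
    """Toy pipeline whose steps carry feature names alongside the data.

    Plain scikit-learn transforms return only arrays, which is why
    importance plots lose their feature names; threading the name list
    through every step fixes that.
    """
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, names):
        for step in self.steps:
            X, names = step(X, names)
        return X, names

def add_ratio(X, names):
    """Derive a ratio feature AND a name for it, in one step."""
    X = [row + [row[0] / (row[1] or 1)] for row in X]
    return X, names + ["%s_per_%s" % (names[0], names[1])]

X, names = NamedPipeline([add_ratio]).fit_transform(
    [[10, 2], [9, 3]], ["bookings", "searches"])
print(names)  # -> ['bookings', 'searches', 'bookings_per_searches']
```

With names preserved end to end, a feature-importance chart after training can label its bars "bookings_per_searches" instead of "feature 2".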
  48. Open Source in H1 2018. If you want to collaborate, we can provide early access: nick.handel@airbnb.com, krishna.puttaswamy@airbnb.com
  49. Appendix
  50. Dockerized Models. ML models have diverse dependency sets (TensorFlow, XGBoost, etc.), so we allow users to provide a Docker image within which model code always runs. ML models don't run in isolation, however, so we've built a lightweight Model API through which other ML infra services interact with the "dockerized model" (diagram: Docker container wrapping the model's user code, exposed to other ML infra services via the Model API).
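A lightweight API like the one this slide describes is essentially a small, fixed contract that every containerized model implements, so infra services never need to know what is inside the image. The method names and example model below are invented for illustration:

```python
class ModelAPI:
    """Hypothetical contract every dockerized model implements.

    Infra services talk only to these three hooks; the user's code and
    dependencies stay sealed inside the model's own Docker image.
    """
    def load(self, artifact_path):
        """Deserialize learned parameters from the model artifact."""
        raise NotImplementedError

    def transform(self, raw):
        """Apply the same feature transforms used at training time."""
        raise NotImplementedError

    def score(self, features):
        """Return a prediction for one transformed feature vector."""
        raise NotImplementedError

class MeanModel(ModelAPI):
    """Trivial example model: a learned bias plus the feature mean."""
    def load(self, artifact_path):
        self.bias = 0.5  # would really come from the artifact file

    def transform(self, raw):
        return [float(x) for x in raw]

    def score(self, features):
        return self.bias + sum(features) / len(features)

m = MeanModel()
m.load("/models/mean/artifact")  # path is illustrative
print(m.score(m.transform(["1", "3"])))  # -> 2.5
```

Because training and serving call the same `transform`, the offline/online discrepancies mentioned earlier in the deck have one fewer place to creep in.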