SlideShare a Scribd company logo
1 of 44
Download to read offline
Data Agility - A Journey to
Advanced Analytics and
Machine Learning at Scale
Spark Summit, April 2019
Hari Subramanian, Engineering Manager
About me
Engineering Manager (also Engineer, Product Manager, Entrepreneur)
Previously: Amazon Web Services, VMware, Startups
Until Recently: Led Big Data Analytics & Data Science Workbench at Uber
Currently: Customer Obsession Engineering at Uber
00 The Uber Scale
01 Uber’s Data Platform
02 Data Science Workbench
03 DS & ML - Uber Toolset
04 Customer Obsession - a case study
05 Lessons learned
06 Wrap-up
Agenda
NYC
Uber’s mission is to
ignite opportunity by
setting the world in
motion.
Impacts
Millions of
Riders
Global
footprint
Livelihood
for Millions
of drivers
Data informs every decision in the company
Daily Uber trips
powered by ML
Millions
Messages
processed by Kafka
2T
Queries across
Hive, Vertica and
Presto
1M Data ingested
into HDFS
150TB
How Big is our Big Data?
Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS
INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE
LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Query Engines
Knowledge Bases
ETL Frameworks
Data Integrity
Storage
Infrastructure
What is DSW?
Rapid growth growing pains
Accessing data
& services was
complicated
Getting started
was hard
Collaboration was
difficult
Many stakeholders, many needs
Cost and compliance
requirements
Varied
users
Different
infrastructure needs
Single window
access
Democratize data science by enabling access
to reliable infrastructure and advanced tooling
in a community-driven learning environment
Data Science Workbench
Fully hosted 1-click Jupyter Notebook & RStudio IDEGetting Started
Data Access
Shared Standards
Collaboration
Scalability
Available Features
All internal data sources / Multi-DC / Secure / Log/Audit capabilities
Pre-baked Environments
Sharing options on notebooks; 1-click Shiny dashboard publication
Various session sizes, types (CPU, GPU)/access to compute
engines
Documentation Support
Our world today
Key features
● Data exploration
● Data preparation
● Ad-hoc analyses
● Model exploration
Interactive
workspaces
● Visualizing rich insights
derived from complex
analytics
● Displaying business metrics
Advanced
dashboards
● Automating complex
processes
● Small model training
● Scheduling data pulls
Business process
automations
What problem does
DSW solve?
Advanced data
science &
complex analytics
Data Scientists Ops Analysts Contractors
Business process
automation
S&P AnalystsOps Managers Contractors
Exploratory ML,
model-training, &
production
Data Scientists ML Researchers
Support
NLP model for support
tickets
Safety
Trip classification
Uber Eats
Restaurant
recommendations
Risk
Driver account check
Referral risk scoring
Operations
Lifetime value (LTV)
modelEngineers
HDFS (Hadoop Data Lake)
YARN
Mesos
Peloton
Piper | Metron | WatchTower |
Marmaray | Kirby | Databook
Security Summary | Query | Dash |
Map | Chart Builder
Hive as a Service
Spark as a Service
DSW
Query Gateway Services
Observability
AllActiveandHiveSync
EfficiencyandCapacity
Presto
XP | Mentana Michelangelo
Experimentation BI Tools DS Platform ML Platform Data processing Platform
Unique fit in a mature Data Platform
DS & ML - Uber Toolkit
Ingestion & Dispersal (Hoover, Marmaray - uses Spark, Hive)
Data preparation (Databook, QB/QR - uses Spark, Presto, Hive)
Data Analytics (BI tools, DSW - numPy, scikit-learn, pandas)
ML and DL (Spark MLLib, xgboost, TF, keras, pytorch, Horovod)
Model serving (PyML, Michelangelo, Peloton)
Workflows, Exploration (AirFlow/Piper, Data Science Workbench)
Case study
COTA - Customer Obsession Ticketing Assistant
A Deep Learning Model developed and deployed using
Uber’s Data Platform
What is the challenge?
As Uber grows, so does our volume of support tickets
Millions of tickets
from riders / drivers /
eaters per week
Thousands of
different types of
issues users may
encounter
This slide was adapted from a talk by Huaixiu Zheng, Uber
User
CSRContact
Ticket
Response
Select Flow Node
Write Message
Select
Contact Type
Lookup info &
Policies
Select Action
Write response using
a Reply Template
Bliss - Uber’s Customer Support Platform
This slide was adapted from a talk by Huaixiu Zheng, Uber
The Problem
Resolving a ticket is not easy (or cheap)
1000+ types
in a hierarchy
depth: 3~6
10+ actions (adjust fare, add appeasement, …)
1000+ reply templates
This slide was adapted from a talk by Huaixiu Zheng, Uber
Portuguese
Spanish
English
ML Layer
User Info
Trip Info
Ticket Text
COTA: The Solution
A collaborative effort from Uber Risk, CO Eng, and Data Platform teams
TYPE
REPLY
Ticket Metadata
COTA v2.1
(wordCNN)
ACTION
Recommend
+
Default
+
Auto-resolution
ROUTING
Risk Features
Fraud DS
embedment
CO Eng routing
engagement
2. prototype
3. productionize
1. define
4. measure
Launch and Iterate
Typical Machine Learning Workflow
2. prototype
GET DATA
DATA PREPARATION
TRAIN MODELS
EVALUATE MODELS
Validation
Computational cost
Interpretability
SQL, Spark
Data cleansing and
pre-processing,
R / Python
CPU or GPU
Exploration and prototyping
3. productionize
1. define
4. measure
● Iterate on model quickly with tweaks to parameters and configuration
● Flexible development - custom code + leverage existing modules for
data prep, ETLs, train, predict, and visualize
● Jupyter notebook running on a GPU or CPU session
● Pre-packaged Spark, tensorflow, keras, pandas, numpy, scipy etc.
● Interactive Spark exec through Uber’s Spark as a Service - Drogon
● API integrations to production ML platform - Michelangelo
● API integrations to data workflow management - Piper
● Develop and test locally, deploy in the cluster when ready
Vision: Build in DSW, run in prod platforms
Easy ML experimentation, quick production
Deep Learning
Spark Pipeline
Architecture
This slide was adapted from a talk by Huaixiu Zheng, Uber
Training: Deep Learning Spark Pipeline
Spark
+
Tensorflow
This slide was adapted from a talk by Huaixiu Zheng, Uber
Serving: Spark for batch & real-time predictions
Java Virtual Machine Hosted by a
Docker Container
Model Lifecycle
Managed in DSW
Model Lifecycle
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 1: Data ETL
● Ingredients
○ Query to do ETL
○ Scheduled notebook as a Piper job to retrain data ETL daily
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 2: Spark Transformations
● Ingredients
○ Setup spark job in Drogon via Michelangelo
○ Scheduled job to trigger the job at a particular retraining
frequency
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 3: Data Transfer
● Ingredients
○ Upstream dependency on Spark job
○ Scheduled job to trigger data copy to a GPU only cluster using
a cross datacenter replication service
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 4: Deep Learning Training
● Ingredients
○ Upstream dependency on data transfer
○ Prepare a docker image containing the training code
○ A scheduled job to trigger the DL training in a GPU cluster
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 5: Model Merging
● Ingredients
○ This happens within the DL training job.
○ Right after DL training is done, Spark and DL models are
merged and uploaded to a model store.
This slide was adapted from a talk by Huaixiu Zheng, Uber
Step 6: Model Deployment
● Ingredients
○ Upstream dependency on DL training and Model Merging
○ A scheduled job from notebook triggerring Model Deployment
This slide was adapted from a talk by Huaixiu Zheng, Uber
Leverage Monitoring and Ops tools built for
production scenarios
This slide was adapted from a talk by Huaixiu Zheng, Uber
Lessons learned
Build for the experts, design for the less
technical people
Create communities with both data
scientists and non data scientists
Don’t stop at building what’s known,
empower people to look for the unknown
Thank you!

More Related Content

What's hot

Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkDatabricks
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationDatabricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Databricks
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Databricks
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Databricks
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingDatabricks
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 

What's hot (20)

Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
Performance Optimization of Recommendation Training Pipeline at Netflix DB Ts...
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger KingContext-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
Context-aware Fast Food Recommendation with Ray on Apache Spark at Burger King
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 

Similar to Data Agility - A Journey to Advanced Analytics and Machine Learning at Scale

Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Databricks
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Karthik Murugesan
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...DataWorks Summit
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 editionDavid Talby
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningEdunomica
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7Paul Lo
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014Amazon Web Services
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingJan Wiegelmann
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windowsJosé António Silva
 

Similar to Data Agility - A Journey to Advanced Analytics and Machine Learning at Scale (20)

Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...
 
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
Uber - Building Intelligent Applications, Experimental ML with Uber’s Data Sc...
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014(ENT306) Application Portfolio Migration | AWS re:Invent 2014
(ENT306) Application Portfolio Migration | AWS re:Invent 2014
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windows
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 

Recently uploaded (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 

Data Agility - A Journey to Advanced Analytics and Machine Learning at Scale

  • 1. Data Agility - A Journey to Advanced Analytics and Machine Learning at Scale Spark Summit, April 2019 Hari Subramanian, Engineering Manager
  • 2. About me Engineering Manager (also Engineer, Product Manager, Entrepreneur) Previously: Amazon Web Services, VMware, Startups Until Recently: Led Big Data Analytics & Data Science Workbench at Uber Currently: Customer Obsession Engineering at Uber
  • 3. 00 The Uber Scale 01 Uber’s Data Platform 02 Data Science Workbench 03 DS & ML - Uber Toolset 04 Customer Obsession - a case study 05 Lessons learned 06 Wrap-up Agenda
  • 4. NYC Uber’s mission is to ignite opportunity by setting the world in motion. Impacts Millions of Riders Global footprint Livelihood for Millions of drivers
  • 5. Data informs every decision in the company
  • 6. Daily Uber trips powered by ML Millions Messages processed by Kafka 2T Queries across Hive, Vertica and Presto 1M Data ingested into HDFS 150TB How Big is our Big Data?
  • 7. Overview of Uber’s Data Platform DATA SOURCES RAW DATA MODELED TABLES MINING BUSINESS INSIGHTS CONSUMING BUSINESS INSIGHTS EXPERIMENTATION DATA SCIENCE MACHINE LEARNING CUSTOM DATA SETS Dashboarding Alerting Monitoring Data Exploration Query Engines Knowledge Bases ETL Frameworks Data Integrity Storage Infrastructure
  • 9. Rapid growth growing pains Accessing data & services was complicated Getting started was hard Collaboration was difficult
  • 10. Many stakeholders, many needs Cost and compliance requirements Varied users Different infrastructure needs Single window access
  • 11. Democratize data science by enabling access to reliable infrastructure and advanced tooling in a community-driven learning environment Data Science Workbench
  • 12. Fully hosted 1-click Jupyter Notebook & RStudio IDEGetting Started Data Access Shared Standards Collaboration Scalability Available Features All internal data sources / Multi-DC / Secure / Log/Audit capabilities Pre-baked Environments Sharing options on notebooks; 1-click Shiny dashboard publication Various session sizes, types (CPU, GPU)/access to compute engines Documentation Support Our world today
  • 13. Key features ● Data exploration ● Data preparation ● Ad-hoc analyses ● Model exploration Interactive workspaces ● Visualizing rich insights derived from complex analytics ● Displaying business metrics Advanced dashboards ● Automating complex processes ● Small model training ● Scheduling data pulls Business process automations
  • 14.
  • 16.
  • 17. Advanced data science & complex analytics Data Scientists Ops Analysts Contractors
  • 19. Exploratory ML, model-training, & production Data Scientists ML Researchers Support NLP model for support tickets Safety Trip classification Uber Eats Restaurant recommendations Risk Driver account check Referral risk scoring Operations Lifetime value (LTV) modelEngineers
  • 20. HDFS (Hadoop Data Lake) YARN Mesos Peloton Piper | Metron | WatchTower | Marmaray | Kirby | Databook Security Summary | Query | Dash | Map | Chart Builder Hive as a Service Spark as a Service DSW Query Gateway Services Observability AllActiveandHiveSync EfficiencyandCapacity Presto XP | Mentana Michelangelo Experimentation BI Tools DS Platform ML Platform Data processing Platform Unique fit in a mature Data Platform
  • 21. DS & ML - Uber Toolkit Ingestion & Dispersal (Hoover, Marmaray - uses Spark, Hive) Data preparation (Databook, QB/QR - uses Spark, Presto, Hive) Data Analytics (BI tools, DSW - numPy, scikit-learn, pandas) ML and DL (Spark MLLib, xgboost, TF, keras, pytorch, Horovod) Model serving (PyML, Michelangelo, Peloton) Workflows, Exploration (AirFlow/Piper, Data Science Workbench)
  • 22. Case study COTA - Customer Obsession Ticketing Assistant A Deep Learning Model developed and deployed using Uber’s Data Platform
  • 23. What is the challenge? As Uber grows, so does our volume of support tickets Millions of tickets from riders / drivers / eaters per week Thousands of different types of issues users may encounter This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 24. User CSRContact Ticket Response Select Flow Node Write Message Select Contact Type Lookup info & Policies Select Action Write response using a Reply Template Bliss - Uber’s Customer Support Platform This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 25. The Problem Resolving a ticket is not easy (or cheap) 1000+ types in a hierarchy depth: 3~6 10+ actions (adjust fare, add appeasement, …) 1000+ reply templates This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 26. Portuguese Spanish English ML Layer User Info Trip Info Ticket Text COTA: The Solution A collaborative effort from Uber Risk, CO Eng, and Data Platform teams TYPE REPLY Ticket Metadata COTA v2.1 (wordCNN) ACTION Recommend + Default + Auto-resolution ROUTING Risk Features Fraud DS embedment CO Eng routing engagement
  • 27. 2. prototype 3. productionize 1. define 4. measure Launch and Iterate Typical Machine Learning Workflow
  • 28. 2. prototype GET DATA DATA PREPARATION TRAIN MODELS EVALUATE MODELS Validation Computational cost Interpretability SQL, Spark Data cleansing and pre-processing, R / Python CPU or GPU Exploration and prototyping 3. productionize 1. define 4. measure
  • 29. ● Iterate on model quickly with tweaks to parameters and configuration ● Flexible development - custom code + leverage existing modules for data prep, ETLs, train, predict, and visualize ● Jupyter notebook running on a GPU or CPU session ● Pre-packaged Spark, tensorflow, keras, pandas, numpy, scipy etc. ● Interactive Spark exec through Uber’s Spark as a Service - Drogon ● API integrations to production ML platform - Michelangelo ● API integrations to data workflow management - Piper ● Develop and test locally, deploy in the cluster when ready Vision: Build in DSW, run in prod platforms Easy ML experimentation, quick production
  • 31. Architecture This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 32. Training: Deep Learning Spark Pipeline Spark + Tensorflow This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 33. Serving: Spark for batch & real-time predictions Java Virtual Machine Hosted by a Docker Container
  • 35. Model Lifecycle This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 36. Step 1: Data ETL ● Ingredients ○ Query to do ETL ○ Scheduled notebook as a Piper job to retrain data ETL daily This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 37. Step 2: Spark Transformations ● Ingredients ○ Setup spark job in Drogon via Michelangelo ○ Scheduled job to trigger the job at a particular retraining frequency This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 38. Step 3: Data Transfer ● Ingredients ○ Upstream dependency on Spark job ○ Scheduled job to trigger data copy to a GPU only cluster using a cross datacenter replication service This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 39. Step 4: Deep Learning Training ● Ingredients ○ Upstream dependency on data transfer ○ Prepare a docker image containing the training code ○ A scheduled job to trigger the DL training in a GPU cluster This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 40. Step 5: Model Merging ● Ingredients ○ This happens within the DL training job. ○ Right after DL training is done, Spark and DL models are merged and uploaded to a model store. This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 41. Step 6: Model Deployment ● Ingredients ○ Upstream dependency on DL training and Model Merging ○ A scheduled job from notebook triggerring Model Deployment This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 42. Leverage Monitoring and Ops tools built for production scenarios This slide was adapted from a talk by Huaixiu Zheng, Uber
  • 43. Lessons learned Build for the experts, design for the less technical people Create communities with both data scientists and non data scientists Don’t stop at building what’s known, empower people to look for the unknown