This document summarizes a talk on building an ML platform with Ray and MLflow. Ray is an open-source framework for distributed computing and machine learning. It provides libraries like Ray Tune for hyperparameter tuning and Ray Serve for model serving. MLflow is a tool for managing the machine learning lifecycle including tracking experiments, managing models, and deploying models. The talk demonstrates how to build an end-to-end ML platform by integrating Ray and MLflow for distributed training, hyperparameter tuning, model tracking, and low-latency serving.
6. Typical ML Process -- Simplified
Execution
- Feature engineering
- Training
  - Including tuning
- Serving
  - Offline scoring, inference
  - Online serving
Management
- Tracking
  - Data, code, configurations
- Reproducing results
- Deployment
  - Deploy in a variety of environments
7. Challenges with the ML Process
Data/Features
• Data Preparation
• Data Analysis
• Feature Engineering
• Data Pipeline
• Data Management/Feature Store
• Manages big data clusters
Model
• ML Expertise
• Implement SOTA ML Research
• Experimentation
• Manage GPU infrastructure
• Scalable training & hyperparameter tuning
Production
• A/B Testing
• Model Evaluation
• Analysis of Predictions
• Deploy in a variety of environments
• CI/CD
• Highly available prediction service
These responsibilities span a spectrum of roles, from Data/Research Scientists to Engineers.
8. Challenges with the ML Process
(Same Data, Model, and Production columns as the previous slide.)
The roles involved span Data/Research Scientists and Software/Data/ML Engineers,
with the ML Platform providing the abstraction layer between them.
9. ML Platforms -- Scale
- LinkedIn:
  - 500+ "AI engineers" building models; 50+ ML platform engineers
  - > 50% of offline compute demand (12K servers, each with 256 GB RAM)
  - Growing more than 2x a year
- Uber Michelangelo, Airbnb Bighead, Facebook FBLearner, etc.
- Globally, a few billion dollars now, growing 40%+ YoY
- Many companies are building ML platforms from the ground up
15. What is Ray?
• A simple, general library for distributed computing
  • Runs on a single machine or on 100s of nodes
  • Agnostic to the type of work
• An ecosystem of libraries (for scaling ML and more)
  • Native: Ray RLlib, Ray Tune, Ray Serve
  • Third party: Modin, Dask, Horovod, XGBoost, PyTorch Lightning
• Tools for launching clusters on any cloud provider
16. Three key ideas
Execute remote functions as tasks, and instantiate remote classes as actors
• Support both stateful and stateless computations
Asynchronous execution using futures
• Enable parallelism
Distributed (immutable) object store
• Efficient communication (send arguments by reference)
31. Ray Tune focuses on simplifying execution
- Easily launch distributed multi-GPU tuning jobs
- Automatic fault tolerance to save 3x on GPU costs
$ ray up {cluster config}
ray.init(address="auto")
tune.run(func, num_samples=100)
34. from ray import tune

def train_model(config={}):
    model = ConvNet(config)  # ConvNet is defined elsewhere in the talk
    for i in range(steps):
        current_loss = model.train()
        tune.report(loss=current_loss)
35. def train_model(config):
    model = ConvNet(config)
    for i in range(epochs):
        current_loss = model.train()
        tune.report(loss=current_loss)

tune.run(train_model,
         config={"lr": 0.1})
42. Ray Serve is high-performance and flexible
• Framework-agnostic
• Easily scales
• Supports batching
• Query your endpoints from HTTP and from Python
• Easily integrates with other tools
43. Ray Serve is built on top of Ray
As a user, you don't need to think about:
• Interprocess communication
• Failure management
• Scheduling
Just tell Ray Serve to scale up your model.
44. Ray Serve API
Serve functions and stateful classes. Ray Serve will use multiple replicas
to parallelize across cores and across nodes in your cluster.
45. Flexibility
Query your model from HTTP:
$ curl "http://127.0.0.1:8000/my/route"
Or query it from Python using a ServeHandle.
47. Challenges of ML in production
• It's difficult to keep track of experiments.
• It's difficult to reproduce code.
• There's no standard way to package and deploy models.
• There's no central store to manage models (their versions and stage transitions).
Source: mlflow.org
48. What is MLflow?
• Open-source ML lifecycle management tool
• Single solution for all of the above challenges
• Library-agnostic and language-agnostic
• (Works with your existing code)
58. Integrating with Ray Serve is easy.
• Ray Serve endpoints can be called from Python.
• Clean conceptual separation:
• Ray Serve handles data plane (processing)
• MLflow handles control plane (metadata, configuration)
60. Acknowledgements
Thanks to Jules Damji, Sid Murching, and Paul Ogilvie for
their help and guidance with MLflow.
Thanks to Dmitri Gekhtman, Kai Fricke, Simon Mo,
Edward Oakes, Richard Liaw, Kathryn Zhou and the rest
of the Ray team!