Machine Learning on Kubernetes
13 Dec 2017
Anirudh Ramanathan
Software Engineer on Kubernetes
Twitter: @anirudh4444
Disclaimer
I’m not a Machine Learning expert.
I work on infrastructure and distributed systems for a
living.
Kubernetes a year ago...
● Was used primarily for stateless workloads
● Needed an understanding of several core concepts to operate
● Applications had to be written to fit into core controller abstractions
Kubernetes today...
● Has abstractions to support stateful applications and, more recently,
data processing and machine learning.
● Has a wide range of extension points including ones that allow API
extensions and custom controllers.
● Has support for building higher level abstractions and APIs to hide
infrastructure & operational complexity.
What’s changed?
● Workload controller abstractions moving to GA/stable.
● Custom Resource Definitions & Aggregated API Servers
● Kubernetes Operators
● Community support for external frameworks
● Work on scheduling and resource management (ongoing)
Machine Learning
Solving problems without explicitly knowing
how to create solutions
Machine Learning Infrastructure
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)
Kubeflow
https://github.com/google/kubeflow/
Our goal is not to recreate other services, but to provide a straightforward
way for spinning up best of breed OSS solutions.
● A JupyterHub to create & manage interactive Jupyter notebooks
● A TensorFlow Training Controller that can be configured to use CPUs
or GPUs, and adjusted to the size of a cluster with a single setting
● A TF Serving container
JupyterHub
● A single hub & proxy for managing interactive sessions
● Can run entirely within Kubernetes - notebooks are backed by
Kubernetes pods
● Can request required resources - CPUs, GPUs, etc.
● Has pluggable authentication (OAuth, KDC, etc.)
Made possible by: https://github.com/jupyterhub/kubespawner
TensorFlow Training Controller
● A Kubernetes “operator” to help run distributed/non-distributed TF
training.
● Exposes an API through a CustomResourceDefinition
● The controller manages the complexity of distributed training with
TensorFlow.
Made possible by: https://github.com/tensorflow/k8s
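As a rough illustration of the bookkeeping such a controller takes off the user's hands, the sketch below derives the TF_CONFIG value that each replica of a distributed TensorFlow job reads at startup. It is not the controller's actual code. The apiVersion, kind, and spec layout of the custom resource, the pod naming scheme, and port 2222 are all hypothetical choices made for this example.

```python
import json


def cluster_spec(job_name, replicas, port=2222):
    """Map each replica role to the host:port addresses of its members.

    The "<job>-<role>-<index>" naming scheme and the port are assumptions
    made for this sketch, not the names the real controller generates.
    """
    return {
        role: [f"{job_name}-{role}-{i}:{port}" for i in range(count)]
        for role, count in replicas.items()
    }


def tf_config(job_name, replicas, role, index):
    """Build the TF_CONFIG JSON string for a single replica."""
    return json.dumps(
        {
            "cluster": cluster_spec(job_name, replicas),
            "task": {"type": role, "index": index},
        },
        sort_keys=True,
    )


# Hypothetical custom resource describing a training job.
job = {
    "apiVersion": "example.com/v1alpha1",  # hypothetical API group/version
    "kind": "TrainingJob",                 # hypothetical kind
    "metadata": {"name": "mnist"},
    "spec": {"replicas": {"ps": 1, "worker": 2}},
}

name = job["metadata"]["name"]
replicas = job["spec"]["replicas"]
# One TF_CONFIG per replica; a controller would inject it as an env variable.
for role, count in replicas.items():
    for index in range(count):
        print(role, index, tf_config(name, replicas, role, index))
```

The user declares only the replica counts; everything else is derived from them, which is the kind of complexity the operator pattern is meant to hide.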
TensorFlow Serving
● A Kubernetes Deployment that can serve saved models
● Deployment - replicas can be scaled.
Future work:
● Custom metrics & Autoscaling
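A minimal sketch of what "a Deployment whose replicas can be scaled" looks like as data, written as plain Python dictionaries rather than a client-library call. The image name, model path, and port are hypothetical, and the sketch assumes the image's entrypoint is the TensorFlow model server so that only arguments need to be supplied.

```python
import copy


def serving_deployment(name, image, model_path, replicas=1, port=8500):
    """Return a Deployment manifest (as a dict) for a model server.

    Assumes the container entrypoint is the TensorFlow model server;
    the image and model path are supplied by the caller.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "serving",
                            "image": image,
                            "args": [
                                f"--port={port}",
                                f"--model_name={name}",
                                f"--model_base_path={model_path}",
                            ],
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def scale(deployment, replicas):
    """Return a copy of the manifest with a new replica count."""
    scaled = copy.deepcopy(deployment)
    scaled["spec"]["replicas"] = replicas
    return scaled


deployment = serving_deployment(
    "inception",
    "registry.example.com/tf-serving:latest",  # hypothetical image
    "gs://example-bucket/models/inception",    # hypothetical model path
)
scaled = scale(deployment, 3)
print(deployment["spec"]["replicas"], "->", scaled["spec"]["replicas"])
```

Scaling changes a single field; the pod template is untouched, which is what would let an autoscaler adjust the replica count without knowing anything about the model.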
But there were so many stages!
● Clearly there are many other challenges faced by people building
Machine Learning infrastructure.
● How do I preprocess data?
● How do I describe my pipeline?
● How do I orchestrate my pipeline?
● We have some ideas.
Apache Spark
● Spark on Kubernetes has been an ongoing effort since Dec 2016.
● It is being upstreamed into Spark and is expected to land in Spark 2.3
(due sometime in January).
● The changes make Spark itself aware of a new Kubernetes Scheduler
that can directly run Spark applications for the user.
Apache Spark
[Diagram: Spark Core, the Kubernetes Scheduler Backend, and the Kubernetes Cluster, with arrows labeled "configuration", "add executors", and "rm executors".]
Apache Spark
Kubernetes Scheduler for Spark
● Spark 2.3 will support
○ Running Java/Scala jobs
○ Static allocation of executors
○ Some dependency management
● Our fork (github.com/apache-spark-on-k8s/spark) has several
additional features which we’re slowly upstreaming.
○ It’s being run by several organizations right now.
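For orientation, a submission against a Kubernetes master might be assembled as in the sketch below. The option names follow the documentation published with the Spark 2.3 release; the fork mentioned above may use different property names. The API server address, image, main class, and jar path are placeholders, and the helper function itself is only an illustration.

```python
import shlex


def spark_submit_args(api_server, image, main_class, app_jar,
                      executors=2, name="spark-app"):
    """Assemble a spark-submit command line for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        # Java/Scala entry point.
        "--class", main_class,
        # Static allocation: a fixed number of executors.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # A local:// URI refers to a jar already present inside the image.
        app_jar,
    ]


args = spark_submit_args(
    api_server="https://kubernetes.example.com:6443",  # placeholder
    image="registry.example.com/spark:2.3.0",          # placeholder
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder path
    executors=5,
    name="spark-pi",
)
print(" ".join(shlex.quote(arg) for arg in args))
```

The printed command is what a user would run from a machine that can reach the cluster's API server; the driver and executors then run as pods.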
Apache Airflow
● A DAG scheduler.
● Has a rich ecosystem of “operators” to allow interacting with different
applications.
● Community working on a Kubernetes native executor for Airflow.
● Currently in the process of being upstreamed.
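To make "DAG scheduler" concrete, the sketch below groups the steps of a small pipeline into waves that could run in parallel once their upstream steps have finished. It uses Python's standard graphlib module (Python 3.9+) and is not how Airflow itself is implemented. The step names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has all upstreams done.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical ML pipeline: each key lists the steps it waits for.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "export": {"train"},
    "serve": {"evaluate", "export"},
}

waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

With a Kubernetes-native executor, each task in a wave could be launched as its own pod, which is where per-task resource settings become useful.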
Apache Airflow
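The operators can specify various Kubernetes executor constraints within each DAG step.
For example:
    from airflow.operators.bash_operator import BashOperator

    # `dag` is assumed to be an existing DAG object defined elsewhere.
    BashOperator(
        task_id='account-test',
        bash_command='run-something.sh',
        dag=dag,
        # Per-task settings passed to the Kubernetes executor.
        executor_config={
            'request_memory': '128Mi',
            'limit_memory': '128Mi',
            'image': 'airflow/scipy:1.1.5',
        },
    )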
Putting it all together
[Diagram components: HDFS or GCS/S3; Spark; Airflow Pipeline; JupyterHub; TensorFlow; Other ML Frameworks.]
Get Involved
Kubeflow
● Slack Channel (See https://github.com/google/kubeflow for joining instructions)
● Twitter (http://twitter.com/kubeflow)
● Mailing List (https://groups.google.com/forum/#!forum/kubeflow-discuss)
SIG Big Data
● Slack Channel (https://kubernetes.slack.com/messages/sig-big-data)
● Mailing list (https://groups.google.com/forum/#!forum/kubernetes-sig-big-data)
● Weekly meeting (https://github.com/kubernetes/community/tree/master/sig-big-data)
Questions?
