Scaling Tensorflow models for
training using multi-GPUs &
Google Cloud ML
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
2
Topics
Cloud Machine Learning Engine -> a.k.a. Cloud ML
NVIDIA GPUs
Distributed computing
Tensorflow
3
Index
1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed Tensorflow
5. Scenario 2: Cloud ML
6. Overall Conclusions
7. Future lines
4
What is BEEVA?
WWW.BEEVA.COM 5
“ WE MAKE COMPLEX THINGS SIMPLE”
100 % +40%
Annual growth last
3 years
+550
Employees
in Spain
BIG DATA
CLOUD
COMPUTING
MACHINE
INTELLIGENCE
● HIGH VALUE FOR INNOVATION
● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)
6
Technological Partners
CLOUD
In cloud, we bet on the partners that we believe work best and cover the client's needs, which makes us experts at finding the best cloud solution for each project.
AWS, Azure & Google Cloud Platform
DATA
Data is the oil of the 21st century. At BEEVA we seek to ally with the best providers of data solutions.
Cloudera, Hortonworks, MongoDB & Neo4j
TECH
BEEVA's needs are constantly renewed, and we always seek to add new and powerful industry names to our portfolio of technological partners.
RedHat, Puppet & Docker
BEE DIFFERENT
WORK DIFFERENT
PROVIDE PASSION AND VALUE TO THE WORK
LEARN AND ENJOY WHAT YOU DO
CREATE A GOOD ENVIRONMENT EVERY DAY
‘OUT OF THE BOX’ THINKING
BEE DIFFERENT AND SPECIAL
www.beeva.com/empleo
rrhh@beeva.com
8
Who am I?
Ricardo Guerrero
A (very geeky) Telecommunications
Engineer.
1. Research: Computer Vision
2. Development: Embedded
systems (routers)
3. Innovation: Data scientist in
BEEVA.
Free time: Not too much
(Self-driving cars)
Plants Vs Zombies
9
Who is this?
Telecommunications Engineer
Data Scientist (Innovation
team)
Geek
Free time: compute PI decimals
(just kidding… I hope)
Enrique Otero
10
High Performance
Computing: objectives
11
HPC line
1. Scaling ML models over GPU clusters.
2. Ease ML deployments and their consumption by analysts.
3. Analyze GPU cloud providers.
4. Study vertical scaling vs. horizontal scaling.
5. Paradigms of parallelization: data parallelism (sync or
async) vs. model parallelism.
12
Experimental setup
13
MNIST problem
The Hello World in Machine
Learning:
easy to reach an accuracy over 97%
MNIST Dataset
14
MNIST problem
MNIST Dataset
Classify digits in bank checks (1998)
15
MNIST problem
ICLR 2017
This happy guy is
me.
16
Benchmark
Model employed
5-layer neural
network proposed by
Yann LeCun
17
Scenario 1: Distributed
Tensorflow
18
How can we parallelize learning?
CLUSTERS
Communication
issues:
Latency
19
Single-machine learning
Forward prop -> compute output
Ytrue =3
Yest = -201.2
Random initialization of
weights
20
Single-machine learning
Backprop -> weights update
Ytrue =3
Yest = -201.2
Err = 204.2
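The forward and backward passes on these two slides can be made concrete with a one-weight neuron. This is only a sketch: the input `x`, the initial weight `w`, the squared-error loss and the learning rate are hypothetical choices, picked so that the forward pass reproduces the slide's Yest = -201.2. The slides do not state them.

```python
# Minimal sketch of one forward/backward pass for a neuron y = w * x.
# x, w, the loss and learning_rate are hypothetical; only y_true and the
# resulting y_est / error values appear on the slides.

def forward(w, x):
    """Forward prop: compute the neuron output."""
    return w * x


def backward(w, x, y_true, learning_rate):
    """Backprop for the loss 0.5 * (y_est - y_true)**2: one weight update."""
    y_est = forward(w, x)
    grad_w = (y_est - y_true) * x  # dLoss/dw by the chain rule
    return w - learning_rate * grad_w


x = 2.0        # hypothetical input
w = -100.6     # hypothetical "random" initial weight
y_true = 3.0

y_est = forward(w, x)                 # about -201.2
err = y_true - y_est                  # about 204.2
w_new = backward(w, x, y_true, learning_rate=0.01)
err_new = y_true - forward(w_new, x)  # smaller than err after one update
```

A single update moves the weight in the direction that shrinks the error; training repeats this loop many times.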
21
How can we parallelize learning?
Machine Learning:
Andrew Ng
22
How can we parallelize learning?
Example:
● Optimizer: Mini-batch Gradient Descent.
● Training set: 10 samples.
● Iterations: 1000 (10x100) -> the network will see the whole
training set 100 times.
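The arithmetic behind "10x100" can be checked with a small helper. The slide does not state the batch size; the sketch below assumes one sample per update, which is the value consistent with 10 iterations per pass over a 10-sample training set. The function name `epochs_seen` is made up for this illustration.

```python
import math


def epochs_seen(num_samples, batch_size, iterations):
    """Number of full passes over the training set after `iterations` updates."""
    iterations_per_epoch = math.ceil(num_samples / batch_size)
    return iterations / iterations_per_epoch


# Assumption: batch_size=1, so 10 iterations make one pass over 10 samples.
epochs = epochs_seen(num_samples=10, batch_size=1, iterations=1000)
```

With a larger batch size the same 1000 iterations would cover the training set more times, since each pass needs fewer updates.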
23
How can we parallelize learning?
Equation warning
24
How can we parallelize learning?
25
How can we parallelize learning?
26
How can we parallelize learning?
Neuron
weights
The famous gradients
27
How can we parallelize learning?
28
How can we parallelize learning?
5 examples 5 examples
29
How can we parallelize learning?
5 examples 5 examples
Parameter server
● Distribute
data
● Aggregate
gradients
30
How can we parallelize learning?
M machines + parameter server: N examples per machine, batch_size = N each
Single machine: batch_size = M * N
Mathematically equivalent (synchronous training)
31
How can we parallelize learning?
Synchronous
training
32
How can we parallelize learning?
Synchronous
training
Asynchronous
training
33
How can we parallelize learning?
Fast-response
driver
Slow-response
driver
Synchronous
training
Asynchronous
training
Car
driver ->
machine
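The driver analogy can be turned into a rough throughput model. This is a simplified sketch with hypothetical per-step times: in synchronous training every round waits for the slowest worker, while in asynchronous training each worker pushes gradients at its own pace. The model ignores communication cost and the fact that asynchronous gradients may be computed from stale weights.

```python
# Sketch: gradients applied per second under each paradigm.
# step_times holds hypothetical seconds per training step for each worker.

def sync_gradients_per_second(step_times):
    """All workers finish a round together, paced by the slowest one."""
    return len(step_times) / max(step_times)


def async_gradients_per_second(step_times):
    """Each worker contributes independently at its own rate."""
    return sum(1.0 / t for t in step_times)


step_times = [1.0, 1.0, 3.0]  # two fast-response workers, one slow-response
sync_rate = sync_gradients_per_second(step_times)
async_rate = async_gradients_per_second(step_times)
```

With these numbers the slow worker cuts the synchronous rate to a fraction of the asynchronous one. When all workers are equally fast, the two rates coincide.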
Preliminary conclusions
● Tensorflow examples are hard to adapt to other scenarios.
○ High coupling between model, input, and parallel paradigm.
○ Tensorflow is not a Deep Learning library but a mathematical engine, so it is very verbose.
○ A high-level abstraction is recommended:
■ Keras, TF-slim, TF Learn (formerly skflow, now tf.contrib.learn), TFLearn,
Sonnet (DeepMind).
Distributed Tensorflow. Results
● We were not able to use a GPU cluster on GKE (Google Container Engine).
○ Not enough documentation on this issue.
● Parallel paradigm (on a single machine):
○ Asynchronous data parallelism is much faster than synchronous, but slightly less
accurate.
● We first tried TF-Slim, but we were not able to make it work with multiple workers :(
paradigm  workers  accuracy  steps  time
sync.     3        0.975     5000   62.8
async.    3        0.967     5000   21.6
Single machine multi-GPUs. Results (I)
● Keras was our final choice.
○ We patched an external project and made it work on an AWS p2.8xlarge :)
○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs it was even worse :(
GPUs  epochs  accuracy  time (s/epoch)
1     12      0.9884    6.8
2     12      0.9898    5.2
4     12      0.9891    4.9
8     12      0.9899    6.4
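The scaling behaviour in this table is easier to judge as speedup and parallel efficiency. The sketch below only re-derives numbers from the seconds-per-epoch column; the function name `scaling_report` is made up for this illustration.

```python
# Sketch: speedup and efficiency derived from the time (s/epoch) column.
seconds_per_epoch = {1: 6.8, 2: 5.2, 4: 4.9, 8: 6.4}


def scaling_report(times):
    """Speedup relative to 1 GPU, and efficiency = speedup / number of GPUs."""
    base = times[1]
    report = {}
    for gpus, seconds in sorted(times.items()):
        speedup = base / seconds
        report[gpus] = {"speedup": speedup, "efficiency": speedup / gpus}
    return report


report = scaling_report(seconds_per_epoch)
for gpus, row in report.items():
    print(f"{gpus} GPUs: speedup {row['speedup']:.2f}x, "
          f"efficiency {row['efficiency']:.0%}")
```

With 4 GPUs each epoch takes about 28% less time than with one, in line with the roughly 30% figure quoted on the slide. With 8 GPUs most of that gain is lost again.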
37
How can we parallelize learning?
CLUSTERS
Communication
issues:
Latency
Preliminary conclusions
● The Tensorflow ecosystem is a bit immature.
○ v1.0 is not backwards compatible with v0.12.
■ Google provides tf_upgrade.py, but manual changes are
sometimes necessary.
○ Many open issues awaiting tensorflower...
Preliminary conclusions
● Scaling to serve models seems a solved issue.
○ Seldon, Tensorflow Serving...
● Scaling to train models efficiently is not a solved issue.
○ Our first experiments and external benchmarks confirm this point.
○ Horizontal scaling is not efficient.
○ Data parallelism (sync or async) and GPU optimization are not solved issues.
40
Scenario 2: Cloud ML
41
Are you more familiar with Amazon?
AWS -> Google Cloud Platform (GCP)
EC2 -> Google Cloud Compute Engine
S3 -> Google Cloud Storage
?? -> Google Cloud Machine Learning Engine (Cloud ML)
It’s like Heroku, a PaaS,
but for Machine Learning
42
What is Cloud ML?
43
What is Google Cloud ML?
Google
Cloud
Storage
44
Cloud ML & Kaggle
The free trial
account includes
$300 in credits!
45
Pricing
It’s a bit complex. Let’s
read it:
“Pricing for training your models in the cloud is defined in terms
of ML training units, which are an abstract measurement of the
processing power involved. 1 ML training unit represents a
standard machine configuration used by the training service.”
46
Cluster configuration
47
Cluster configuration
“many workers”, “a few servers”, “a large number”
48
Cluster configuration
“The following table uses rough "t-shirt"
sizing to describe the machine types.”
49
Cluster configuration
50
Results
Tier        Duration       Price                      Accuracy
BASIC       1 h 2 min      0.01 ML units = $0.0049    0.9886
STANDARD_1  16 min 4 sec   1.67 ML units = $0.818     0.99
BASIC_GPU   23 min 56 sec  0.82 ML units = $0.4018    0.989
Infrastructure provisioning time not negligible (~8 minutes)
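The trade-off between time and price in these results can be worked out directly from the reported figures. The sketch below is only arithmetic on the numbers in the table: the dictionary layout and the function name `compare_to_baseline` are made up for this illustration, and it is not known from the slides whether the durations include the provisioning time.

```python
# Sketch: implied price per ML unit, speedup and cost ratio versus BASIC.
runs = {
    "BASIC": {"seconds": 62 * 60, "ml_units": 0.01, "usd": 0.0049},
    "STANDARD_1": {"seconds": 16 * 60 + 4, "ml_units": 1.67, "usd": 0.818},
    "BASIC_GPU": {"seconds": 23 * 60 + 56, "ml_units": 0.82, "usd": 0.4018},
}


def compare_to_baseline(runs, baseline="BASIC"):
    """Derive unit price, speedup and relative cost for every tier."""
    base = runs[baseline]
    summary = {}
    for tier, run in runs.items():
        summary[tier] = {
            "usd_per_ml_unit": run["usd"] / run["ml_units"],
            "speedup": base["seconds"] / run["seconds"],
            "cost_ratio": run["usd"] / base["usd"],
        }
    return summary


summary = compare_to_baseline(runs)
for tier, row in summary.items():
    print(f"{tier}: {row['speedup']:.1f}x faster, "
          f"{row['cost_ratio']:.0f}x the price of BASIC")
```

All three rows imply roughly the same price of about $0.49 per ML unit, so the price column is consistent with the unit counts. The larger tiers finish a few times faster than BASIC but are billed far more units.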
51
Conclusion
52
Overall Conclusions
● Distributed computing for ML is not a commodity: you need highly
qualified engineers.
● Don’t scale horizontally in ML. Most of the time it is not worth it unless
you have special conditions:
○ A huge dataset (really huge).
○ A medium size dataset + Infiniband connections + ML/DL framework with
RDMA support (reduce latency)
53
Overall Conclusions
● Google GPUs (beta) vs AWS GPUs: more cons than pros :(
● Tensorflow is growing fast but...
a. Not easy, but there is Keras.
b. We recommend (careful) adoption because of big community
54
Future lines
55
Future lines: Cloud ML changes very fast
CIFAR10
Recommender Systems
(Movielens)
56
ANY QUESTIONS?
Ricardo Guerrero Gómez-Olmedo
Email:
ricardo.guerrero@beeva.com
Twitter: @ricgu8086
Medium: medium.com/@ricardo.guerrero
IT Researcher | BEEVA LABS
hablemos@beeva.com | www.beeva.com
We are
hiring!!

More Related Content

What's hot

Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...Databricks
 
AI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using KubeflowAI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using KubeflowSteve Guhr
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016PAPIs.io
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Akash Tandon
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
 
TFX: A tensor flow-based production-scale machine learning platform
TFX: A tensor flow-based production-scale machine learning platformTFX: A tensor flow-based production-scale machine learning platform
TFX: A tensor flow-based production-scale machine learning platformShunya Ueta
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Databricks
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
CI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranCI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranDatabricks
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15MLconf
 
AICamp - Dr Ramine Tinati - Making Computer Vision Real
AICamp - Dr Ramine Tinati - Making Computer Vision RealAICamp - Dr Ramine Tinati - Making Computer Vision Real
AICamp - Dr Ramine Tinati - Making Computer Vision RealRamine Tinati
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Get Behind the Wheel with H2O Driverless AI Hands-On Training
Get Behind the Wheel with H2O Driverless AI Hands-On Training Get Behind the Wheel with H2O Driverless AI Hands-On Training
Get Behind the Wheel with H2O Driverless AI Hands-On Training Sri Ambati
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI AI Frontiers
 
TinyML as-a-Service
TinyML as-a-ServiceTinyML as-a-Service
TinyML as-a-ServiceHiroshi Doyu
 

What's hot (20)

Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...
Consolidate Your Technical Debt With Spark Data Sources -Tools and Techniques...
 
AI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using KubeflowAI Pipeline Optimization using Kubeflow
AI Pipeline Optimization using Kubeflow
 
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
Scaling machine learning as a service at Uber — Li Erran Li at #papis2016
 
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
Kubeflow: portable and scalable machine learning using Jupyterhub and Kuberne...
 
H20 - Thirst for Machine Learning
H20 - Thirst for Machine LearningH20 - Thirst for Machine Learning
H20 - Thirst for Machine Learning
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
TFX: A tensor flow-based production-scale machine learning platform
TFX: A tensor flow-based production-scale machine learning platformTFX: A tensor flow-based production-scale machine learning platform
TFX: A tensor flow-based production-scale machine learning platform
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
CI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranCI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel Kobran
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
AICamp - Dr Ramine Tinati - Making Computer Vision Real
AICamp - Dr Ramine Tinati - Making Computer Vision RealAICamp - Dr Ramine Tinati - Making Computer Vision Real
AICamp - Dr Ramine Tinati - Making Computer Vision Real
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Get Behind the Wheel with H2O Driverless AI Hands-On Training
Get Behind the Wheel with H2O Driverless AI Hands-On Training Get Behind the Wheel with H2O Driverless AI Hands-On Training
Get Behind the Wheel with H2O Driverless AI Hands-On Training
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
Training at AI Frontiers 2018 - LaiOffer Data Session: How Spark Speedup AI
 
TinyML as-a-Service
TinyML as-a-ServiceTinyML as-a-Service
TinyML as-a-Service
 

Similar to Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionEmanuele Bezzi
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
 
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...Bharath Sudharsan
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMatthias Feys
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland mictc
 
Faster computation with matlab
Faster computation with matlabFaster computation with matlab
Faster computation with matlabMuhammad Alli
 
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "SHow I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "SBrandon Liu
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitRafael Arana
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Jen Aman
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
DigitRecognition.pptx
DigitRecognition.pptxDigitRecognition.pptx
DigitRecognition.pptxruvex
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Ryo Takahashi
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTechgeetachauhan
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetEric Haibin Lin
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksJunKudo2
 

Similar to Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML (20)

Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...
 
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
ECML PKDD 2021 ML meets IoT Tutorial Part III: Deep Optimizations of CNNs and...
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland Cloud Roundtable at Microsoft Switzerland
Cloud Roundtable at Microsoft Switzerland
 
Faster computation with matlab
Faster computation with matlabFaster computation with matlab
Faster computation with matlab
 
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "SHow I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
How I Sped up Complex Matrix-Vector Multiplication: Finding Intel MKL's "S
 
Parallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks SummitParallel & Distributed Deep Learning - Dataworks Summit
Parallel & Distributed Deep Learning - Dataworks Summit
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
DigitRecognition.pptx
DigitRecognition.pptxDigitRecognition.pptx
DigitRecognition.pptx
 
ML in Android
ML in AndroidML in Android
ML in Android
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Deep learning for FinTech
Deep learning for FinTechDeep learning for FinTech
Deep learning for FinTech
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
 

More from Seldon

CD4ML and the challenges of testing and quality in ML systems
CD4ML and the challenges of testing and quality in ML systemsCD4ML and the challenges of testing and quality in ML systems
CD4ML and the challenges of testing and quality in ML systemsSeldon
 
TensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsTensorFlow London: Cutting edge generative models
TensorFlow London: Cutting edge generative modelsSeldon
 
Tensorflow London: Tensorflow and Graph Recommender Networks by Yaz Santissi
Tensorflow London: Tensorflow and Graph Recommender Networks by Yaz SantissiTensorflow London: Tensorflow and Graph Recommender Networks by Yaz Santissi
Tensorflow London: Tensorflow and Graph Recommender Networks by Yaz SantissiSeldon
 
TensorFlow London: Progressive Growing of GANs for increased stability, quali...
TensorFlow London: Progressive Growing of GANs for increased stability, quali...TensorFlow London: Progressive Growing of GANs for increased stability, quali...
TensorFlow London: Progressive Growing of GANs for increased stability, quali...Seldon
 
TensorFlow London 18: Dr Daniel Martinho-Corbishley, From science to startups...
TensorFlow London 18: Dr Daniel Martinho-Corbishley, From science to startups...TensorFlow London 18: Dr Daniel Martinho-Corbishley, From science to startups...
TensorFlow London 18: Dr Daniel Martinho-Corbishley, From science to startups...Seldon
 
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...Seldon
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon
 
TensorFlow London 17: How NASA Frontier Development Lab scientists use AI to ...
TensorFlow London 17: How NASA Frontier Development Lab scientists use AI to ...TensorFlow London 17: How NASA Frontier Development Lab scientists use AI to ...
TensorFlow London 17: How NASA Frontier Development Lab scientists use AI to ...Seldon
 
TensorFlow London 17: Practical Reinforcement Learning with OpenAI
TensorFlow London 17: Practical Reinforcement Learning with OpenAITensorFlow London 17: Practical Reinforcement Learning with OpenAI
TensorFlow London 17: Practical Reinforcement Learning with OpenAISeldon
 
TensorFlow 16: Multimodal Sentiment Analysis with TensorFlow
TensorFlow 16: Multimodal Sentiment Analysis with TensorFlow TensorFlow 16: Multimodal Sentiment Analysis with TensorFlow
TensorFlow 16: Multimodal Sentiment Analysis with TensorFlow Seldon
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
Ai in financial services
Ai in financial servicesAi in financial services
Ai in financial servicesSeldon
 
TensorFlow London 15: Find bugs in the herd with debuggable TensorFlow code
TensorFlow London 15: Find bugs in the herd with debuggable TensorFlow code TensorFlow London 15: Find bugs in the herd with debuggable TensorFlow code
TensorFlow London 15: Find bugs in the herd with debuggable TensorFlow code Seldon
 
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...Seldon
 
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...
Tensorflow London 13: Barbara Fusinska 'Hassle Free, Scalable, Machine Learni...Seldon
 
Tensorflow London 13: Zbigniew Wojna 'Deep Learning for Big Scale 2D Imagery'
Tensorflow London 13: Zbigniew Wojna 'Deep Learning for Big Scale 2D Imagery'Tensorflow London 13: Zbigniew Wojna 'Deep Learning for Big Scale 2D Imagery'
Tensorflow London 13: Zbigniew Wojna 'Deep Learning for Big Scale 2D Imagery'Seldon
 
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...
TensorFlow London 11: Pierre Harvey Richemond 'Trends and Developments in Rei...Seldon
 
TensorFlow London 11: Gema Parreno 'Use Cases of TensorFlow'
TensorFlow London 11: Gema Parreno 'Use Cases of TensorFlow'TensorFlow London 11: Gema Parreno 'Use Cases of TensorFlow'
TensorFlow London 11: Gema Parreno 'Use Cases of TensorFlow'Seldon
 
Tensorflow London 12: Marcel Horstmann and Laurent Decamp 'Using TensorFlow t...
Tensorflow London 12: Marcel Horstmann and Laurent Decamp 'Using TensorFlow t...Tensorflow London 12: Marcel Horstmann and Laurent Decamp 'Using TensorFlow t...
Tensorflow London 12: Marcel Horstmann and Laurent Decamp 'Using TensorFlow t...Seldon
 
TensorFlow London 12: Oliver Gindele 'Recommender systems in Tensorflow'
TensorFlow London 12: Oliver Gindele 'Recommender systems in Tensorflow'TensorFlow London 12: Oliver Gindele 'Recommender systems in Tensorflow'
TensorFlow London 12: Oliver Gindele 'Recommender systems in Tensorflow'Seldon
 

Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML

  • 1. Scaling Tensorflow models for training using multi-GPUs & Google Cloud ML BEE PART OF THE CHANGE Avenida de Burgos, 16 D, 28036 Madrid hablemos@beeva.com www.beeva.com
  • 2. Topics: Cloud Machine Learning Engine (a.k.a. Cloud ML), NVIDIA GPUs, distributed computing, Tensorflow
  • 3. Index: 1. What is BEEVA? Who are we? 2. High Performance Computing: objectives 3. Experimental setup 4. Scenario 1: Distributed Tensorflow 5. Scenario 2: Cloud ML 6. Overall Conclusions 7. Future lines
  • 5. “WE MAKE COMPLEX THINGS SIMPLE” 100 % +40% annual growth in the last 3 years. +550 employees in Spain. BIG DATA, CLOUD COMPUTING, MACHINE INTELLIGENCE ● HIGH VALUE FOR INNOVATION ● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)
  • 6. Technological Partners
      CLOUD: In cloud we bet on those partners that we believe work best and cover the needs of the client, making us experts and finding the best cloud solution for each project. AWS, Azure & Google Cloud Platform
      DATA: Data is the oil of the XXI century. In BEEVA we seek to ally with the best providers of solutions for the data. Cloudera, Hortonworks, MongoDB & Neo4j
      TECH: The needs of BEEVA are constantly renewed and we always seek to add new and powerful references of the sector to our portfolio of technological partners. RedHat, Puppet & Docker
  • 7. BEE DIFFERENT WORK DIFFERENT PROVIDE PASSION AND VALUE TO THE WORK LEARN AND ENJOY WHAT YOU DO CREATE A GOOD ENVIRONMENT EVERY DAY ‘OUT OF THE BOX’ THINKING BEE DIFFERENT AND SPECIAL www.beeva.com/empleo rrhh@beeva.com
  • 8. Who am I? Ricardo Guerrero. A (very geeky) Telecommunications Engineer. 1. Research: Computer Vision 2. Development: Embedded systems (routers) 3. Innovation: Data scientist in BEEVA. Free time: Not too much (Self-driving cars), Plants Vs Zombies
  • 9. Who is this? Enrique Otero. Telecommunications Engineer, Data Scientist (Innovation team), Geek. Free time: compute PI decimals (just kidding… I hope)
  • 11. HPC line 1. Scaling ML models over GPU clusters. 2. Easing ML deployments and their consumption by analysts. 3. Analyzing GPU cloud providers. 4. Studying vertical scaling vs horizontal scaling. 5. Paradigms of parallelization: data parallelism (sync or async) vs model parallelism.
  • 13. MNIST problem. The Hello World in Machine Learning: easy to reach an accuracy over 97%. MNIST Dataset
  • 14. MNIST problem. MNIST Dataset: classify digits in bank checks (1998)
  • 18. How can we parallelize learning? CLUSTERS. Communication issues: latency
  • 19. Single-machine learning. Random initialization of weights. Forward prop -> compute output: Ytrue = 3, Yest = -201.2
  • 20. Single-machine learning. Backprop -> weights update: Ytrue = 3, Yest = -201.2, Err = 204.2
  • 21. How can we parallelize learning? Machine Learning: Andrew Ng
  • 22. How can we parallelize learning? Example: ● Optimizer: Mini-batch Gradient Descent. ● Training set: 10 samples. ● Iterations: 1000 (10x100) -> the network will see the whole training set 100 times.
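The arithmetic behind the "10x100" note can be sketched in a few lines. This is only an illustration: the helper name `epochs_seen` is hypothetical, and a mini-batch of one sample per iteration is an assumption, chosen because it is the value that makes 1000 iterations over 10 samples equal 100 passes.

```python
def epochs_seen(iterations, batch_size, training_set_size):
    """Number of full passes over the training set (hypothetical helper)."""
    return iterations * batch_size / training_set_size

# Assumption: one sample per iteration, so 1000 iterations over 10 samples.
print(epochs_seen(iterations=1000, batch_size=1, training_set_size=10))  # 100.0
```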
  • 23. How can we parallelize learning? Equation warning
  • 24. How can we parallelize learning?
  • 25. How can we parallelize learning?
  • 26. How can we parallelize learning? Neuron weights. The famous gradients
  • 27. How can we parallelize learning?
  • 28. How can we parallelize learning? 5 examples | 5 examples
  • 29. How can we parallelize learning? 5 examples | 5 examples. Parameter server: ● Distribute data ● Aggregate gradients
  • 30. How can we parallelize learning? Synchronous training. Parameter server with M machines, each processing N examples (batch_size = N). Mathematically equivalent to a single machine with batch_size = M * N.
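The equivalence claimed on this slide can be checked numerically. The following is a minimal sketch, not the setup used in the talk: it assumes a toy one-parameter model `y_est = w * x` with mean squared error, equally sized shards, and a parameter server that simply averages the workers' gradients. All function names and data values are hypothetical.

```python
import math

def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y_est = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sync_parameter_server_grad(w, shards):
    """Each worker computes a gradient on its shard; the server averages them."""
    worker_grads = [grad_mse(w, shard) for shard in shards]
    return sum(worker_grads) / len(worker_grads)

# 10 made-up (x, y) samples, split across M = 2 workers with N = 5 examples each.
data = [(float(x), 2.0 * x + 1.0) for x in range(1, 11)]
shards = [data[:5], data[5:]]
w = 0.5

single_machine = grad_mse(w, data)                  # batch_size = M * N = 10
distributed = sync_parameter_server_grad(w, shards) # M batches of size N = 5
print(math.isclose(single_machine, distributed))
```

The averages only match because every shard has the same size; with unequal shards the server would need to weight each gradient by its shard size.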
  • 31. How can we parallelize learning? Synchronous training
  • 32. How can we parallelize learning? Synchronous training vs asynchronous training
  • 33. How can we parallelize learning? Synchronous training vs asynchronous training. Car driver -> machine: fast-response driver vs slow-response driver
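The driver analogy can be made concrete with a back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: the per-step times are invented, communication cost is ignored, and both function names are hypothetical. In synchronous training every step waits for the slowest worker; in asynchronous training each worker pushes updates at its own pace.

```python
def sync_updates(step_times, wall_time):
    """Synchronous: one aggregated update per step, paced by the slowest worker."""
    return int(wall_time // max(step_times))

def async_updates(step_times, wall_time):
    """Asynchronous: every worker contributes updates at its own speed."""
    return sum(int(wall_time // t) for t in step_times)

# Hypothetical seconds per step: two fast "drivers" and one slow one.
step_times = [1.0, 1.0, 4.0]
print(sync_updates(step_times, 60.0))   # 15
print(async_updates(step_times, 60.0))  # 135
```

Each synchronous update aggregates three gradients, so the fair comparison is 45 gradients against 135. The asynchronous updates are computed from possibly stale weights, which is one plausible reason for a small accuracy loss.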
  • 34. Preliminary conclusions
      ● Tensorflow examples are hard to adapt to other scenarios.
        ○ High coupling between model, input, and parallel paradigm.
        ○ Not a Deep Learning library, but a mathematical engine. Very high verbosity.
        ○ A high-level abstraction is recommended:
          ■ Keras, TF-slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn, Sonnet (Deep Mind).
  • 35. Distributed Tensorflow. Results
      ● We were not able to use a GPU cluster on GKE (Google Container Engine)
        ○ Not enough documentation on this issue
      ● Parallel paradigm (on a single machine):
        ○ Asynchronous data parallelism is much faster than synchronous, and a little less accurate
      ● We first tried TF-Slim, but we were not able to make it work with multiple workers :(

      paradigm   workers   accuracy   steps   time
      sync.      3         0.975      5000    62.8
      async.     3         0.967      5000    21.6
  • 36. Single machine multi-GPUs. Results (I)
      ● Keras was our final choice
        ○ We patched an external project and made it work on AWS p2.8x :)
        ○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs it was even worse :(

      GPUs   epochs   accuracy   time (s/epoch)
      1      12       0.9884     6.8
      2      12       0.9898     5.2
      4      12       0.9891     4.9
      8      12       0.9899     6.4
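The scaling behaviour in these results can be quantified directly from the reported seconds per epoch. The sketch below uses only those four numbers; the `speedup` and `efficiency` helpers are hypothetical names, and efficiency is defined here as speedup divided by GPU count.

```python
# Seconds per epoch as reported on the slide, keyed by number of GPUs.
seconds_per_epoch = {1: 6.8, 2: 5.2, 4: 4.9, 8: 6.4}

def speedup(gpus):
    """Speedup relative to the single-GPU run."""
    return seconds_per_epoch[1] / seconds_per_epoch[gpus]

def efficiency(gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(gpus) / gpus

for gpus in seconds_per_epoch:
    print(f"{gpus} GPUs: speedup {speedup(gpus):.2f}x, efficiency {efficiency(gpus):.0%}")
```

With 4 GPUs the epoch time drops by roughly 28% (a 1.39x speedup), and with 8 GPUs the run is barely faster than with one.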
  • 37. How can we parallelize learning? CLUSTERS. Communication issues: latency
  • 38. Preliminary conclusions
      ● The Tensorflow ecosystem is a bit immature
        ○ v1.0 is not backwards compatible with v0.12
          ■ Google provides tf_upgrade.py, but manual changes are sometimes necessary
        ○ Many open issues awaiting tensorflower...
  • 39. Preliminary conclusions
      ● Scaling to serve models seems to be a solved issue
        ○ Seldon, Tensorflow Serving...
      ● Scaling to train models efficiently is not a solved issue
        ○ Our first experiments and external benchmarks confirm this point
        ○ Horizontal scaling is not efficient
        ○ Data parallelism (sync or async) and GPU optimization are not solved issues.
  • 41. Are you more familiar with Amazon?
      AWS   ->  Google Cloud Platform (GCP)
      EC2   ->  Google Cloud Compute Engine
      S3    ->  Google Cloud Storage
      ??    ->  Google Cloud Machine Learning Engine (Cloud ML)
      It’s like Heroku, a PaaS, but for Machine Learning
  • 43. What is Google Cloud ML? Google Cloud Storage
  • 44. Cloud ML & Kaggle. The free trial account includes $300 in credits!
  • 45. Pricing. It’s a bit complex. Let’s read it: “Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service.”
  • 47. Cluster configuration: “many workers”, “a few servers”, “a large number”
  • 48. Cluster configuration: “The following table uses rough "t-shirt" sizing to describe the machine types.”
  • 50. Results

      Scale tier   Duration      ML units   Price     Accuracy
      BASIC        1 h 2 min     0.01       $0.0049   0.9886
      STANDARD_1   16 min 4 s    1.67       $0.818    0.99
      BASIC_GPU    23 min 56 s   0.82       $0.4018   0.989

      Infrastructure provisioning time is not negligible (~8 minutes).
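The prices in these results imply a flat rate per ML training unit, which can be recovered by dividing price by units. The sketch below uses only the figures reported on the slide; the dictionary layout and the name `usd_per_unit` are illustrative choices, and the derived rate should not be read as an official price.

```python
# (ML units consumed, price in USD) per configuration, as reported on the slide.
runs = {
    "BASIC": (0.01, 0.0049),
    "STANDARD_1": (1.67, 0.818),
    "BASIC_GPU": (0.82, 0.4018),
}

# Implied price of one ML training unit for each run, rounded to cents.
usd_per_unit = {tier: round(usd / units, 2) for tier, (units, usd) in runs.items()}

for tier, rate in usd_per_unit.items():
    print(f"{tier}: ${rate:.2f} per ML unit")
```

All three runs come out at the same rate of about $0.49 per unit, so the cost differences between them come entirely from how many units each job consumed.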
  • 52. Overall Conclusions
      ● Distributed computing for ML is not a commodity: you need highly qualified engineers.
      ● Don’t scale horizontally in ML. Most of the time it is not worth it unless you have special conditions:
        ○ A huge dataset (really huge).
        ○ A medium-size dataset + Infiniband connections + an ML/DL framework with RDMA support (to reduce latency)
  • 53. Overall Conclusions
      ● Google GPUs (beta) vs AWS GPUs: more cons than pros :(
      ● Tensorflow is growing fast but...
        a. It is not easy, but there is Keras.
        b. We recommend (careful) adoption because of its big community
  • 55. Future lines: Cloud ML changes very fast. CIFAR10. Recommender Systems (Movielens)
  • 57. Ricardo Guerrero Gómez-Olmedo Email: ricardo.guerrero@beeva.com Twitter: @ricgu8086 Medium: medium.com/@ricardo.guerrero IT Researcher | BEEVA LABS hablemos@beeva.com | www.beeva.com We are hiring!!