Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML
1. Scaling Tensorflow models for training using multi-GPUs & Google Cloud ML
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
3. Index
1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed Tensorflow
5. Scenario 2: Cloud ML
6. Overall Conclusions
7. Future lines
5. “WE MAKE COMPLEX THINGS SIMPLE”
100%
+40% annual growth over the last 3 years
+550 employees in Spain
BIG DATA · CLOUD COMPUTING · MACHINE INTELLIGENCE
● HIGH VALUE FOR INNOVATION
● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)
6. Technological Partners
CLOUD — AWS, Azure & Google Cloud Platform
In cloud, we partner with the providers we believe work best and cover the client's needs, becoming experts and finding the best cloud solution for each project.

DATA — Cloudera, Hortonworks, MongoDB & Neo4j
Data is the oil of the 21st century. At BEEVA we seek alliances with the best providers of data solutions.

TECH — RedHat, Puppet & Docker
BEEVA's needs are constantly evolving, and we always look to add new, powerful industry references to our portfolio of technological partners.
7. BEE DIFFERENT
WORK DIFFERENT
PROVIDE PASSION AND VALUE TO THE WORK
LEARN AND ENJOY WHAT YOU DO
CREATE A GOOD ENVIRONMENT EVERY DAY
‘OUT OF THE BOX’ THINKING
BEE DIFFERENT AND SPECIAL
www.beeva.com/empleo
rrhh@beeva.com
8. Who am I?
Ricardo Guerrero
A (very geeky) Telecommunications Engineer.
1. Research: Computer Vision
2. Development: Embedded systems (routers)
3. Innovation: Data Scientist at BEEVA
Free time: not too much (self-driving cars, Plants vs. Zombies)
9. Who is this?
Telecommunications Engineer
Data Scientist (Innovation
team)
Geek
Free time: computing decimals of π
(just kidding… I hope)
Enrique Otero
11. HPC line
1. Scaling ML models over GPU clusters.
2. Easing ML deployments and their consumption by analysts.
3. Analyzing GPU cloud providers.
4. Studying vertical vs. horizontal scaling.
5. Parallelization paradigms: data parallelism (sync or async) vs. model parallelism.
21. How can we parallelize learning?
(figure from Andrew Ng's Machine Learning course)
22. How can we parallelize learning?
Example:
● Optimizer: Mini-batch Gradient Descent.
● Training set: 10 samples.
● Iterations: 1000 (10 × 100) -> the network will see the whole training set 100 times.
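The setup above can be sketched in a few lines. This is an illustrative toy, not from the slides: a 10-sample linear-regression problem, mini-batches of 5, 100 passes over the data (the learning rate and the synthetic data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 10 samples of y = 2x + 1 with a little noise.
X = rng.uniform(-1, 1, size=(10, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=10)

w, b = 0.0, 0.0      # model parameters
lr = 0.1             # learning rate
batch_size = 5       # -> 2 mini-batch iterations per epoch

for epoch in range(100):                     # 100 epochs over the data
    perm = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error over this mini-batch only
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

Each parameter update uses only a mini-batch, which is what makes the data-parallel schemes on the following slides possible.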
23. How can we parallelize learning?
Equation warning
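The equation itself did not survive the transcript; given the context, it is presumably the standard mini-batch gradient-descent update:

```latex
\theta \leftarrow \theta - \eta \, \frac{1}{B} \sum_{i=1}^{B} \nabla_{\theta}\, L\bigl(x^{(i)}, y^{(i)}; \theta\bigr)
```

where $\theta$ are the model parameters, $\eta$ the learning rate, $B$ the mini-batch size, and $L$ the per-sample loss.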
28. How can we parallelize learning?
(diagram: the batch split into two groups of 5 examples)
29. How can we parallelize learning?
(diagram: two workers with 5 examples each)
Parameter server:
● Distributes the data
● Aggregates the gradients
30. How can we parallelize learning?
Synchronous training: M machines, each training on its own batch of N examples through a parameter server, is mathematically equivalent to a single machine training with batch_size = M * N.
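That equivalence is easy to check numerically. A minimal sketch, assuming a linear model with mean-squared-error loss and an even split of the batch across workers (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

M, N = 4, 8                      # M workers, N examples per worker
X = rng.normal(size=(M * N, 3))  # full batch of M*N examples
y = rng.normal(size=M * N)
w = rng.normal(size=3)           # shared model parameters

def gradient(xb, yb, w):
    """Gradient of the mean squared error for a linear model."""
    err = xb @ w - yb
    return 2 * xb.T @ err / len(yb)

# Single machine: one gradient over the whole batch of M*N examples.
full_grad = gradient(X, y, w)

# Parameter server: each worker computes a gradient on its N examples,
# and the server averages them (synchronous training).
worker_grads = [gradient(X[k * N:(k + 1) * N], y[k * N:(k + 1) * N], w)
                for k in range(M)]
ps_grad = np.mean(worker_grads, axis=0)

print(np.allclose(full_grad, ps_grad))  # prints True
```

Averaging the M per-worker gradients gives exactly the gradient of a single batch of M * N examples, which is why the two setups are mathematically equivalent.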
31. How can we parallelize learning?
(diagram: synchronous training)
32. How can we parallelize learning?
(diagram: synchronous vs. asynchronous training)
33. How can we parallelize learning?
(analogy: car driver -> machine)
Fast-response driver vs. slow-response driver
Synchronous training vs. asynchronous training
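The difference between the two paradigms can be simulated on a toy problem. This is a hypothetical setup of my own, not the slides' experiment: two workers minimize f(w) = (w - 3)^2 through a shared parameter, and in the asynchronous case each worker pushes gradients computed from a stale copy of the parameters instead of waiting:

```python
def grad(w):
    """Gradient of f(w) = (w - 3)**2."""
    return 2 * (w - 3)

lr, steps, n_workers = 0.05, 200, 2

# Synchronous: all workers compute gradients at the SAME parameter value;
# the parameter server averages them before updating.
w_sync = 0.0
for _ in range(steps):
    grads = [grad(w_sync) for _ in range(n_workers)]
    w_sync -= lr * sum(grads) / n_workers

# Asynchronous: each worker pushes its gradient as soon as it is ready,
# so the update may be based on an outdated parameter value.
w_async = 0.0
stale = [0.0] * n_workers     # each worker's (possibly stale) parameter copy
for step in range(steps):
    k = step % n_workers      # workers take turns pushing updates
    w_async -= lr * grad(stale[k])  # gradient computed from a stale read
    stale[k] = w_async              # worker refreshes its local copy

print(round(w_sync, 2), round(w_async, 2))
```

Both variants converge here; in general, asynchronous training trades a little accuracy (because of stale gradients) for much higher throughput, which matches the results two slides ahead.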
34. Preliminary conclusions
● Tensorflow examples are hard to adapt to other scenarios.
○ High coupling between model, input, and parallelization paradigm.
○ Not a Deep Learning library, but a mathematical engine: very high verbosity.
○ A high-level abstraction is recommended:
■ Keras, TF-Slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn, Sonnet (DeepMind).
35. Distributed Tensorflow. Results
● We were not able to use a GPU cluster on GKE (Google Container Engine).
○ Not enough documentation on this issue.
● Parallelization paradigm (single machine):
○ Asynchronous data parallelism is much faster than synchronous, and only a little less accurate.
● We tried TF-Slim first, but we were not able to make it work with multiple workers :(

paradigm   workers   accuracy   steps   time
sync.      3         0.975      5000    62.8
async.     3         0.967      5000    21.6
36. Single machine, multi-GPUs. Results (I)
● Keras was our final choice.
○ We patched an external project and made it work on AWS p2.8x :)
○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs it was even worse :(

GPUs   epochs   accuracy   time (s/epoch)
1      12       0.9884     6.8
2      12       0.9898     5.2
4      12       0.9891     4.9
8      12       0.9899     6.4
37. How can we parallelize learning?
CLUSTERS: communication issues -> latency
38. Preliminary conclusions
● The Tensorflow ecosystem is a bit immature.
○ v1.0 is not backwards compatible with v0.12.
■ Google provides tf_upgrade.py, but manual changes are sometimes necessary.
○ Many open issues awaiting a tensorflower...
39. Preliminary conclusions
● Scaling to serve models seems a solved issue.
○ Seldon, Tensorflow Serving...
● Scaling to train models efficiently is not a solved issue.
○ Our first experiments and external benchmarks confirm this point.
○ Horizontal scaling is not efficient.
○ Data parallelism (sync or async) and GPU optimization are not solved issues.
41. Are you more familiar with Amazon?
AWS -> Google Cloud Platform (GCP)
EC2 -> Google Cloud Compute Engine
S3 -> Google Cloud Storage
?? -> Google Cloud Machine Learning Engine (Cloud ML)
It's like Heroku, a PaaS, but for Machine Learning.
44. Cloud ML & Kaggle
The free trial account includes $300 in credits!
45. Pricing
It's a bit complex. Let's read it:
“Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service.”
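For context, a Cloud ML Engine training job was submitted from the gcloud CLI, selecting one of the scale tiers that determine how many ML units are billed. An illustrative invocation (the job name, module, paths, and bucket are placeholders, not from the slides):

```shell
# Submit a training job on the BASIC_GPU scale tier (illustrative).
gcloud ml-engine jobs submit training my_job_1 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    --job-dir gs://my-bucket/my_job_1
```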
50. Results
Tier         Duration       Price                       Accuracy
BASIC        1 h 2 min      0.01 ML units = $0.0049     0.9886
STANDARD_1   16 min 4 s     1.67 ML units = $0.818      0.99
BASIC_GPU    23 min 56 s    0.82 ML units = $0.4018     0.989

Infrastructure provisioning time is not negligible (~8 minutes).
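The prices in the table are consistent with a flat rate per consumed ML unit. The $0.49 rate below is inferred from the table itself, so treat it as an assumption rather than official pricing:

```python
# Rate inferred from the results table: each tier's dollar price equals
# its consumed ML units times a flat rate (assumption: $0.49 per unit).
RATE_USD_PER_ML_UNIT = 0.49

jobs = {
    "BASIC": 0.01,        # ML units consumed
    "STANDARD_1": 1.67,
    "BASIC_GPU": 0.82,
}

for tier, units in jobs.items():
    price = units * RATE_USD_PER_ML_UNIT
    print(f"{tier}: ${price:.4f}")
```

Note how the GPU tier was both faster and cheaper than STANDARD_1 here: fewer ML units were consumed overall.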
52. Overall Conclusions
● Distributed computing for ML is not a commodity: you need highly qualified engineers.
● Don't scale horizontally in ML. Most of the time it is not worth it, unless you have special conditions:
○ A huge dataset (really huge).
○ A medium-sized dataset + Infiniband connections + an ML/DL framework with RDMA support (to reduce latency).
53. Overall Conclusions
● Google GPUs (beta) vs. AWS GPUs: more cons than pros :(
● Tensorflow is growing fast but...
a. Not easy, but there is Keras.
b. We recommend (careful) adoption because of its big community.