Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML
1. Scaling Tensorflow models for training using multi-GPUs & Google Cloud ML
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
3. Index
1. What is BEEVA? Who are we?
2. High Performance Computing: objectives
3. Experimental setup
4. Scenario 1: Distributed Tensorflow
5. Scenario 2: Cloud ML
6. Overall Conclusions
7. Future lines
5. “WE MAKE COMPLEX THINGS SIMPLE”
100%
+40% annual growth over the last 3 years
+550 employees in Spain
BIG DATA · CLOUD COMPUTING · MACHINE INTELLIGENCE
● HIGH VALUE FOR INNOVATION
● PRODUCT DEVELOPMENT (APIVERSITY, lince.io, Clever)
6. Technological Partners
CLOUD — AWS, Azure & Google Cloud Platform
In cloud, we partner with the providers we believe work best and cover the client's needs, becoming experts and finding the best cloud solution for each project.

DATA — Cloudera, Hortonworks, MongoDB & Neo4j
Data is the oil of the 21st century. At BEEVA we seek alliances with the best providers of data solutions.

TECH — RedHat, Puppet & Docker
BEEVA's needs are constantly evolving, and we always look to add new, powerful industry references to our portfolio of technological partners.
7. BEE DIFFERENT
WORK DIFFERENT
PROVIDE PASSION AND VALUE TO THE WORK
LEARN AND ENJOY WHAT YOU DO
CREATE A GOOD ENVIRONMENT EVERY DAY
‘OUT OF THE BOX’ THINKING
BEE DIFFERENT AND SPECIAL
www.beeva.com/empleo
rrhh@beeva.com
8. Who am I?
Ricardo Guerrero
A (very geeky) Telecommunications Engineer.
1. Research: Computer Vision
2. Development: Embedded systems (routers)
3. Innovation: Data Scientist at BEEVA
Free time: not too much (self-driving cars, Plants vs. Zombies)
9. Who is this?
Telecommunications Engineer
Data Scientist (Innovation
team)
Geek
Free time: computing decimals of π
(just kidding… I hope)
Enrique Otero
11. HPC line
1. Scaling ML models over GPU clusters.
2. Easing ML deployments and their consumption by analysts.
3. Analyzing GPU cloud providers.
4. Studying vertical vs. horizontal scaling.
5. Parallelization paradigms: data parallelism (sync or async) vs. model parallelism.
21. How can we parallelize learning?
(figure from Andrew Ng's Machine Learning course)
22. How can we parallelize learning?
Example:
● Optimizer: Mini-batch Gradient Descent.
● Training set: 10 samples.
● Iterations: 1000 (10 × 100) -> the network will see the whole training set 100 times.
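The setup above can be sketched in a few lines. This is an illustrative toy, not from the slides: a 10-sample linear-regression problem, mini-batches of 5, 100 passes over the data (the learning rate and the synthetic data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 10 samples of y = 2x + 1 with a little noise.
X = rng.uniform(-1, 1, size=(10, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=10)

w, b = 0.0, 0.0      # model parameters
lr = 0.1             # learning rate
batch_size = 5       # -> 2 mini-batch iterations per epoch

for epoch in range(100):                     # 100 epochs over the data
    perm = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error over this mini-batch only
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(round(w, 2), round(b, 2))
```

Each parameter update uses only a mini-batch, which is what makes the data-parallel schemes on the following slides possible.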
23. How can we parallelize learning?
Equation warning
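The equation itself did not survive the transcript; given the context, it is presumably the standard mini-batch gradient-descent update:

```latex
\theta \leftarrow \theta - \eta \, \frac{1}{B} \sum_{i=1}^{B} \nabla_{\theta}\, L\bigl(x^{(i)}, y^{(i)}; \theta\bigr)
```

where $\theta$ are the model parameters, $\eta$ the learning rate, $B$ the mini-batch size, and $L$ the per-sample loss.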
28. How can we parallelize learning?
(diagram: the batch split into two groups of 5 examples)
29. How can we parallelize learning?
(diagram: two workers with 5 examples each)
Parameter server:
● Distributes the data
● Aggregates the gradients
30. How can we parallelize learning?
Synchronous training: M machines, each training on its own batch of N examples through a parameter server, is mathematically equivalent to a single machine training with batch_size = M * N.
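That equivalence is easy to check numerically. A minimal sketch, assuming a linear model with mean-squared-error loss and an even split of the batch across workers (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

M, N = 4, 8                      # M workers, N examples per worker
X = rng.normal(size=(M * N, 3))  # full batch of M*N examples
y = rng.normal(size=M * N)
w = rng.normal(size=3)           # shared model parameters

def gradient(xb, yb, w):
    """Gradient of the mean squared error for a linear model."""
    err = xb @ w - yb
    return 2 * xb.T @ err / len(yb)

# Single machine: one gradient over the whole batch of M*N examples.
full_grad = gradient(X, y, w)

# Parameter server: each worker computes a gradient on its N examples,
# and the server averages them (synchronous training).
worker_grads = [gradient(X[k * N:(k + 1) * N], y[k * N:(k + 1) * N], w)
                for k in range(M)]
ps_grad = np.mean(worker_grads, axis=0)

print(np.allclose(full_grad, ps_grad))  # prints True
```

Averaging the M per-worker gradients gives exactly the gradient of a single batch of M * N examples, which is why the two setups are mathematically equivalent.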
31. How can we parallelize learning?
(diagram: synchronous training)
32. How can we parallelize learning?
(diagram: synchronous vs. asynchronous training)
33. How can we parallelize learning?
(analogy: car driver -> machine)
Fast-response driver vs. slow-response driver
Synchronous training vs. asynchronous training
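The difference between the two paradigms can be simulated on a toy problem. This is a hypothetical setup of my own, not the slides' experiment: two workers minimize f(w) = (w - 3)^2 through a shared parameter, and in the asynchronous case each worker pushes gradients computed from a stale copy of the parameters instead of waiting:

```python
def grad(w):
    """Gradient of f(w) = (w - 3)**2."""
    return 2 * (w - 3)

lr, steps, n_workers = 0.05, 200, 2

# Synchronous: all workers compute gradients at the SAME parameter value;
# the parameter server averages them before updating.
w_sync = 0.0
for _ in range(steps):
    grads = [grad(w_sync) for _ in range(n_workers)]
    w_sync -= lr * sum(grads) / n_workers

# Asynchronous: each worker pushes its gradient as soon as it is ready,
# so the update may be based on an outdated parameter value.
w_async = 0.0
stale = [0.0] * n_workers     # each worker's (possibly stale) parameter copy
for step in range(steps):
    k = step % n_workers      # workers take turns pushing updates
    w_async -= lr * grad(stale[k])  # gradient computed from a stale read
    stale[k] = w_async              # worker refreshes its local copy

print(round(w_sync, 2), round(w_async, 2))
```

Both variants converge here; in general, asynchronous training trades a little accuracy (because of stale gradients) for much higher throughput, which matches the results two slides ahead.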
34. Preliminary conclusions
● Tensorflow examples are hard to adapt to other scenarios.
○ High coupling between model, input, and parallelization paradigm.
○ Not a Deep Learning library, but a mathematical engine: very high verbosity.
○ A high-level abstraction is recommended:
■ Keras, TF-Slim, TF Learn (old skflow, now tf.contrib.learn), TFLearn, Sonnet (DeepMind).
35. Distributed Tensorflow. Results
● We were not able to use a GPU cluster on GKE (Google Container Engine).
○ Not enough documentation on this issue.
● Parallelization paradigm (single machine):
○ Asynchronous data parallelism is much faster than synchronous, and only a little less accurate.
● We tried TF-Slim first, but we were not able to make it work with multiple workers :(

paradigm   workers   accuracy   steps   time
sync.      3         0.975      5000    62.8
async.     3         0.967      5000    21.6
36. Single machine, multi-GPUs. Results (I)
● Keras was our final choice.
○ We patched an external project and made it work on AWS p2.8x :)
○ With 4 GPUs we got (only) a 30% speedup. With 8 GPUs it was even worse :(

GPUs   epochs   accuracy   time (s/epoch)
1      12       0.9884     6.8
2      12       0.9898     5.2
4      12       0.9891     4.9
8      12       0.9899     6.4
37. How can we parallelize learning?
CLUSTERS: communication issues -> latency
38. Preliminary conclusions
● The Tensorflow ecosystem is a bit immature.
○ v1.0 is not backwards compatible with v0.12.
■ Google provides tf_upgrade.py, but manual changes are sometimes necessary.
○ Many open issues awaiting a tensorflower...
39. Preliminary conclusions
● Scaling to serve models seems a solved issue.
○ Seldon, Tensorflow Serving...
● Scaling to train models efficiently is not a solved issue.
○ Our first experiments and external benchmarks confirm this point.
○ Horizontal scaling is not efficient.
○ Data parallelism (sync or async) and GPU optimization are not solved issues.
41. Are you more familiar with Amazon?
AWS -> Google Cloud Platform (GCP)
EC2 -> Google Cloud Compute Engine
S3 -> Google Cloud Storage
?? -> Google Cloud Machine Learning Engine (Cloud ML)
It's like Heroku, a PaaS, but for Machine Learning.
44. Cloud ML & Kaggle
The free trial account includes $300 in credits!
45. Pricing
It's a bit complex. Let's read it:
“Pricing for training your models in the cloud is defined in terms of ML training units, which are an abstract measurement of the processing power involved. 1 ML training unit represents a standard machine configuration used by the training service.”
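For context, a Cloud ML Engine training job was submitted from the gcloud CLI, selecting one of the scale tiers that determine how many ML units are billed. An illustrative invocation (the job name, module, paths, and bucket are placeholders, not from the slides):

```shell
# Submit a training job on the BASIC_GPU scale tier (illustrative).
gcloud ml-engine jobs submit training my_job_1 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    --job-dir gs://my-bucket/my_job_1
```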
50. Results
Tier         Duration       Price                       Accuracy
BASIC        1 h 2 min      0.01 ML units = $0.0049     0.9886
STANDARD_1   16 min 4 s     1.67 ML units = $0.818      0.99
BASIC_GPU    23 min 56 s    0.82 ML units = $0.4018     0.989

Infrastructure provisioning time is not negligible (~8 minutes).
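The prices in the table are consistent with a flat rate per consumed ML unit. The $0.49 rate below is inferred from the table itself, so treat it as an assumption rather than official pricing:

```python
# Rate inferred from the results table: each tier's dollar price equals
# its consumed ML units times a flat rate (assumption: $0.49 per unit).
RATE_USD_PER_ML_UNIT = 0.49

jobs = {
    "BASIC": 0.01,        # ML units consumed
    "STANDARD_1": 1.67,
    "BASIC_GPU": 0.82,
}

for tier, units in jobs.items():
    price = units * RATE_USD_PER_ML_UNIT
    print(f"{tier}: ${price:.4f}")
```

Note how the GPU tier was both faster and cheaper than STANDARD_1 here: fewer ML units were consumed overall.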
52. Overall Conclusions
● Distributed computing for ML is not a commodity: you need highly qualified engineers.
● Don't scale horizontally in ML. Most of the time it is not worth it, unless you have special conditions:
○ A huge dataset (really huge).
○ A medium-sized dataset + Infiniband connections + an ML/DL framework with RDMA support (to reduce latency).
53. Overall Conclusions
● Google GPUs (beta) vs. AWS GPUs: more cons than pros :(
● Tensorflow is growing fast but...
a. Not easy, but there is Keras.
b. We recommend (careful) adoption because of its big community.