H2O.ai's Distributed Deep Learning Presented at PayPal by Arno Candel 04/24/14

Deep Learning
with H2O
!
H2O.ai 
Scalable In-Memory Machine Learning
!
PayPal, San Jose, 4/24/14
Arno Candel

Who am I?
PhD in Computational Physics, 2005 
from ETH Zurich Switzerland
!
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
4 months at 0xdata/H2O - Machine Learning
!
10+ years in HPC, C++, MPI, Supercomputing
@ArnoCandel

H2O Deep Learning, A. Candel
Outline
Intro
Theory
Implementation
Results
MNIST handwritten digits classification
Live Demo
Prostate cancer classification and age regression
text classification
3

Distributed in-memory math platform  
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning 
Easy to use SDK / API 
➔ Java, R, Scala, Python, JSON, Browser-based GUI
!
Businesses can use ALL of their data (w or w/o Hadoop) 
➔ Modeling without Sampling 
 
Big Data + Better Algorithms  
➔ Better Predictions
H2O Open Source in-memory 
Prediction Engine for Big Data
4

About H20 (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
5
+1 Cyprien Noel for prior work

H2O w or w/o Hadoop
H2O
H2O H2O
HDFS HDFS HDFS
YARN Hadoop MR
R Java Scala JSON Python
Standalone Over YARN On MRv1
6

H2O Architecture
in-memory K-V store
compression
Machine
Learning
Algorithms
R Engine
Nano fast
Scoring Engine
Prediction Engine
memory manager
e.g. Deep Learning
7
MapReduce

Wikipedia: 
Deep learning is a set of algorithms in machine learning
that attempt to model high-level abstractions in data by
using architectures composed of multiple non-linear
transformations.
!
!
!
!
!
Facebook DeepFace (LeCun): “Almost as good as humans at recognising faces”
!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
!
FBI FACE: $1 billion face recognition project
What is Deep Learning?
Example:
Input data 
(facial image)
Prediction
(person’s ID)
8

Deep Learning is trending
20132012
Google trends
2011
9

Deep Learning History
slides by Yan LeCun (now Facebook)
10
Deep Learning wins competitions
AND 
makes humans, businesses and
machines (cyborgs!?) smarter

What is NOT Deep
Linear models are not deep
(by definition)
!
Neural nets with 1 hidden layer are not deep
(no feature hierarchy)
!
SVMs and Kernel methods are not deep
(2 layers: kernel + linear)
!
Classification trees are not deep
(operate on original input space)
11

1970s multi-layer feed-forward Neural Network
(supervised learning with stochastic gradient descent using back-propagation)
!
+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)
!
+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)
!
+ smart algorithms for accuracy
(weight initialization, adaptive learning, momentum, dropout, regularization)
!
= Top-notch prediction engine!
Deep Learning in H2O
12

“fully connected” directed graph of neurons
age
income
employment
married
not married
Input layer
Hidden
layer 1
Hidden
layer 2
Output layer
3x4 4x3 3x2#connections
information flow
input/output neuron
hidden neuron
4 3 2#neurons 3
Example Neural Network
13

age
income
employment
yj = tanh(sumi(xi*uij)+bj)
uij
xi
yj
per-class probabilities 
sum(pl) = 1
zk = tanh(sumj(yj*vjk)+ck)
vjk
zk
pl
pl = softmax(sumk(zk*wkl)+dl)
wkl
softmax(xk) = exp(xk) / sumk(exp(xk))
“neurons activate each other via weighted sums”
Prediction: Forward Propagation
married
not married
activation function: tanh
alternative: 
x -> max(0,x) “rectifier”
pl is a non-linear function of xi:
can approximate ANY function
with enough layers!
bj, ck, dl: bias values 
(indep. of inputs)
14

age
income
employment
xi
standardize input xi: mean = 0, stddev = 1
!
horizontalize categorical variables, e.g.
{full-time, part-time, none, self-employed}  
-> 
{0,1,0} = part-time, {0,0,0} = self-employed
Poor man’s initialization: random weights
!
Better: Uniform distribution in 
+/- sqrt(6/(#units + #units_previous_layer))
Data preparation & Initialization
Neural Networks are sensitive to numerical noise, 
operate best in the linear regime (not saturated)
married
not married
15

Mean Square Error = (0.22 + 0.22)/2 “penalize differences per-class”
!
Cross-entropy = -log(0.8) “strongly penalize non-1-ness”
Stochastic Gradient Descent
SGD: improve weights and biases for EACH training row
married
not married
For each training row, we make a prediction and compare
with the actual label (supervised training):
1
0
0.8
0.2
predicted actual
Objective: minimize prediction error (MSE or cross-entropy)
w <— w - rate * ∂E/∂w
1
16

Backward Propagation
 
!
∂E/∂wi = ∂E/∂y * ∂y/∂net * ∂net/∂wi
= ∂(error(y))/∂y * ∂(activation(net))/∂net * xi
Backprop: Compute ∂E/∂wi via chain rule going backwards
wi
net = sumi(wi*xi) + b
xi
E = error(y)
y = activation(net)
How to compute ∂E/∂wi for wi <— wi - rate * ∂E/∂wi ?
Naive: For every i, evaluate E twice at (w1,…,wi±∆,…,wN)… Slow!
17

H2O Deep Learning Architecture
K-V
K-V
HTTPD
HTTPD
nodes/JVMs: sync
threads: async
communication
w
w w
w w w w
w1
w3 w2
w4
w2+w4
w1+w3
w* = (w1+w2+w3+w4)/4
map: 
each node trains a
copy of the weights
and biases with
(some* or all of) its
local data with
asynchronous F/J
threads
initial weights and biases w
updated weights and biases w*
H2O atomic
in-memory 
K-V store
reduce: 
model averaging:
average weights and
biases from all nodes,
speedup is at least
#nodes/log(#rows)
arxiv:1209.4129v3
Keep iterating over the data (“epochs”), score from time to time
Query & display
the model via
JSON, WWW
2
2 431
1
1
1
4
3 2
1 2
1
i
*user can specify the number of total rows per MapReduce iteration
18

“Secret” Sauce to Higher Accuracy
Momentum training: 
keep changing weights and biases (even if there’s no error)  
“find other local minima, and go faster along valleys”
Adaptive learning rate - ADADELTA (Google): 
automatically set learning rate for each neuron based on its
training history, combines annealing and momentum features
Learning rate annealing: 
rate r = r0 / (1+ß*N), N = training samples
“dig deeper into local minimum”
Grid Search and Checkpointing: 
Run a grid search over multiple hyper-parameters,
then continue training the best model
L1/L2/Dropout/max_w2 regularization: 
L1: penalizes non-zero weights, L2: penalizes large weights 
Dropout: randomly ignore certain inputs “train exp. many models at once”
max_w2: Scale down all incoming weights if their squared sum > max_w2
“regularization avoids overtraining and improves generalization error”
19

Adaptive Learning Rate
!
Compute moving average of ∆wi
2 at time t for window length rho:
!
E[∆wi
2]t = rho * E[∆wi
2]t-1 + (1-rho) * ∆wi
2
!
Compute RMS of ∆wi at time t with smoothing epsilon:
!
RMS[∆wi]t = sqrt( E[∆wi
2]t + epsilon )
Adaptive annealing / progress:
Gradient-dependent learning rate,
moving window prevents “freezing”
(unlike ADAGRAD: no window)
Adaptive acceleration / momentum:
accumulate previous weight updates,
but over a window of time
RMS[∆wi]t-1
RMS[∂E/∂wi]t
rate(wi, t) =
Do the same for ∂E/∂wi, then
obtain per-weight learning rate:
cf. ADADELTA paper

Dropout Training
Training:
For each hidden neuron, for each training sample, for each iteration,
ignore (zero out) a different random fraction p of input activations.
!
age
income
employment
married
not married
X
X
X
Testing:
Use all activations, but reduce them by a factor p
(to “simulate” the missing activations during training).
cf. Geoff Hinton's paper

MNIST: digits classification
Train: 60,000 rows 784 integer columns 10 classes
Test: 10,000 rows 784 integer columns 10 classes
MNIST: Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with values in 0…255 (gray-scale)
One of the most popular multi-class classification problems
Without distortions or convolutions
(which help), the best-ever published
error rate on test set: 0.83% (Microsoft)
22

Frequent errors: confuse 2/7 and 4/9
H2O Deep Learning on MNIST:
0.87% test set error (so far)
23
test set error: 1.5% after 10 mins
1.0% after 1.5 hours 
0.87% after 4 hours
World-class
results!
No pre-training
No distortions
No convolutions
No unsupervised
training
Running on 4
nodes with 16
cores each

Parallel Scalability
(for 64 epochs on MNIST, with “0.87%” parameters)
24
Speedup
0.00
10.00
20.00
30.00
40.00
1 2 4 8 16 32 63
H2O Nodes
(4 cores per node, 1 epoch per node per MapReduce)
2.7 mins
Training Time
0
25
50
75
100
1 2 4 8 16 32 63
H2O Nodes
in minutes

Prostate Cancer Dataset
25

Live Demo: Cancer Prediction
Interactive ROC
curve with real-
time updates
26

Live Demo: Cancer Prediction
0% training error
with only 322
model parameters
in seconds!
27

H2O Deep Learning with Scala
28
Predict CAPSULE: Variable 1

H2O Deep Learning with Scala
29

Live Demo: Grid Search Regression
Doing a grid search to find good hyper-parameters
to predict AGE from other 7 features
Then continue training the best model
5 hidden 50 tanh layers, rho=0.99, 
epsilon = 1e-10, normal distribution scale=1
MSE = 0.5 for test
set ages in 44…79
Regression:
1 linear output
neuron
30

Live Demo: ebay Text Classification
Users enter a description when selling an item
Task: Predict the type of item from the words used
Data prep: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
H2O parses SVMLight sparse format: label 3:1 9:1 13:1 …
!
“Small” sample dataset on jewelry and watches:
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
!
H2O compressed columnar in-memory store:
Only needs 60MB to store 5 billion entries (never inflated)
31

Live Demo: ebay Text Classification
No tuning (results for illustration only):
11.6% test set error (<4% for top-5) after only 10 epochs!
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
32

Tips for H2O Deep Learning
!
General:
More layers for more complex functions (exp. more non-linearity)
More neurons per layer to detect finer structure in data (“memorizing”)
Add some regularization for less overfitting (smaller validation error)
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first, then Rectifier, try max_w2 = 50 and/or L1=1e-5.
Try Dropout (input: 20%, hidden: 50%) with test/validation set after
finding good parameters for convergence on training set.
Distributed: More training samples per iteration: faster, but less accuracy?
With ADADELTA: Try epsilon = 1e-4,1e-6,1e-8,1e-10, rho = 0.9,0.95,0.99
Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8,
momentum_start = 0.5, momentum_stable = 0.99, 
momentum_ramp = 1/rate_annealing
Try balance_classes = true for imbalanced classes.
Use force_load_balance and replicate_training_data for small datasets.
33

Summary
H2O is a distributed in-memory math platform that
allows fast prototyping in Java, R, Scala and Python.
!
H2o enables the development of enterprise-quality
blazingly fast machine learning applications.
!
H2O Deep Learning is distributed, easy to use, and
early results compete with the world’s best.
!
Try it yourself and join our next meetup! 
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com www.h2o.ai/community
follow us on Twitter: @hexadata
34

H2O.ai's Distributed Deep Learning Presented at PayPal by Arno Candel 04/24/14

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (18)

Mais de Sri Ambati

Mais de Sri Ambati (20)

Último

Último (20)

H2O.ai's Distributed Deep Learning Presented at PayPal by Arno Candel 04/24/14