https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018
1. [course site]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Optimization for neural network training
Day 3 Lecture 2
#DLUPC
2. Previously in DLAI…
• Multilayer perceptron
• Training: (stochastic / mini-batch) gradient descent
• Backpropagation
• Loss function
but…
What type of optimization problem?
Do local minima and saddle points cause problems?
Does gradient descent perform well?
How to set the learning rate?
How to initialize weights?
How does batch size affect training?
3. Index
• Optimization for a machine learning task; difference between learning and pure optimization
  • Expected and empirical risk
  • Surrogate loss functions and early stopping
  • Batch and mini-batch algorithms
• Challenges for deep models
  • Local minima
  • Saddle points and other flat regions
  • Cliffs and exploding gradients
• Practical algorithms
  • Stochastic Gradient Descent
  • Momentum
  • Nesterov Momentum
  • Learning rate
  • Adaptive learning rates: AdaGrad, RMSProp, Adam
• Parameter initialization
• Batch Normalization
5. Optimization for NN training
• Goal: find the parameters that minimize the expected risk (generalization error)

  J(θ) = E_{(x,y)∼p_data} [ L(f_θ(x), y) ]

  • x input, f_θ(x) predicted output, y target output, E expectation
  • p_data true (unknown) data distribution, L loss function (how wrong predictions are)
• But we only have a training set of samples: we minimize the empirical risk, the average loss on a finite dataset D

  J(θ) = E_{(x,y)∼p̂_data} [ L(f_θ(x), y) ] = (1/|D|) Σ_{(x^(i), y^(i)) ∈ D} L(f_θ(x^(i)), y^(i))

  where p̂_data is the empirical distribution and |D| is the number of examples in D
6. Surrogate loss
• Often minimizing the real loss is intractable (it can't be used with gradient descent)
  • e.g. the 0-1 loss (0 if correctly classified, 1 if not) is intractable even for linear classifiers (Marcotte 1992)
• Minimize a surrogate loss instead
  • e.g. for the 0-1 loss: hinge, square, logistic

0-1 loss (blue) and surrogate losses (green: square, purple: hinge, yellow: logistic):
  0-1:      L(f(x), y) = I(f(x) ≠ y)
  hinge:    L(f(x), y) = max(0, 1 − y f(x))
  square:   L(f(x), y) = (1 − y f(x))²
  logistic: L(f(x), y) = log(1 + e^{−y f(x)})
7. Surrogate loss functions

Probabilistic classifier
  • Binary classifier: outputs the probability of class 1, f(x) ≈ P(y=1 | x); the probability for class 0 is 1 − f(x)
    Binary cross-entropy loss: L(f(x), y) = −( y log f(x) + (1 − y) log(1 − f(x)) )
    Decision function: F(x) = I(f(x) > 0.5)
  • Multiclass classifier: outputs a vector of probabilities f(x) ≈ ( P(y=0|x), ..., P(y=m−1|x) )
    Negative conditional log-likelihood loss: L(f(x), y) = −log f(x)_y
    Decision function: F(x) = argmax(f(x))

Non-probabilistic classifier
  • Binary classifier: outputs a «score» f(x) for class 1; the score for the other class is −f(x)
    Hinge loss: L(f(x), t) = max(0, 1 − t f(x)), where t = 2y − 1
    Decision function: F(x) = I(f(x) > 0)
  • Multiclass classifier: outputs a vector f(x) of real-valued scores for the m classes
    Multiclass margin loss: L(f(x), y) = max(0, 1 + max_{k≠y} f(x)_k − f(x)_y)
    Decision function: F(x) = argmax(f(x))
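To make the table concrete, here is a minimal NumPy sketch of three of these losses. The function names and the {0,1} label convention are illustrative choices, not part of the slides.

import numpy as np

def binary_cross_entropy(p, y):
    """Probabilistic binary classifier: p = f(x) ~ P(y=1|x), y in {0,1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(score, y):
    """Non-probabilistic binary classifier: score = f(x); map y in {0,1} to t in {-1,+1}."""
    t = 2 * y - 1
    return np.maximum(0.0, 1.0 - t * score)

def multiclass_margin_loss(scores, y):
    """Non-probabilistic multiclass classifier: scores = f(x), a vector of m class scores."""
    others = np.delete(scores, y)                     # scores f(x)_k for k != y
    return max(0.0, 1.0 + np.max(others) - scores[y])

# e.g. hinge_loss(0.3, 1) -> 0.7 (the margin 1 is not reached yet)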
8. Early stopping
• Training algorithms usually do not halt at a local minimum
• Convergence criterion based on early stopping:
  • based on the surrogate loss or the true underlying loss (e.g. 0-1 loss) measured on a validation set
  • # training steps = a hyperparameter controlling the effective capacity of the model
  • simple and effective; must keep a copy of the best parameters
  • acts as a regularizer (Bishop 1995, …)

Training error decreases steadily while validation error begins to increase. Return the parameters at the point with the lowest validation error.
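A minimal sketch of such an early-stopping loop. The hooks model.get_params / model.set_params, train_step and validation_error are assumed names for illustration; they do not come from the slides.

import copy

def train_with_early_stopping(model, train_step, validation_error, max_steps, patience):
    """Keep the parameters with the lowest validation error; stop when it has not
    improved for `patience` consecutive evaluations."""
    best_error = float("inf")
    best_params = copy.deepcopy(model.get_params())
    steps_without_improvement = 0
    for step in range(max_steps):
        train_step(model)                        # one (mini-batch) gradient update
        error = validation_error(model)          # surrogate or 0-1 loss on the validation set
        if error < best_error:
            best_error = error
            best_params = copy.deepcopy(model.get_params())
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break                            # validation error has started to increase
    model.set_params(best_params)                # return parameters at lowest validation error
    return best_error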
9. Batch and mini-batch algorithms
• Gradient descent at each iteration computes gradients over the entire dataset for one update

  ∇_θ J(θ) = (1/m) Σ_i ∇_θ L(f_θ(x^(i)), y^(i))

  • ↑ Gradients are stable
  • ↓ Using the complete training set can be very expensive
    • the gain of using more samples is less than linear: the standard error of the mean estimated from m samples is SE = σ/√m (σ is the true std)
  • ↓ The training set may be redundant
• Use a subset of the training set

Mini-batch gradient descent loop:
  1. sample a subset of data
  2. forward prop through the network
  3. backprop to calculate gradients
  4. update parameters using gradients
10. Batch and mini-batch algorithms
• How many samples in each update step?
  • Deterministic or batch gradient methods: process all training samples in one large batch
  • Mini-batch stochastic methods: use several (but not all) samples
  • Stochastic methods: use a single example at a time
    • online methods: samples are drawn from a stream of continually created samples

(figure: batch vs mini-batch gradient descent)
11. Batch and mini-batch algorithms
Mini-batch size?
• Larger batches: more accurate estimate of the gradient, but with less than linear return
• Very small batches: multicore architectures are under-utilized
• Smaller batches provide noisier gradient estimates
  • small batches may offer a regularizing effect (they add noise)
  • but they may require a small learning rate
  • and may increase the number of steps needed for convergence
• If the training set is small, use batch gradient descent
• If the training set is large, use mini-batches
• Mini-batches should be selected randomly (shuffle the samples, see the sketch below)
  • unbiased estimate of the gradients
• Typical mini-batch sizes: 32, 64, 128, 256
  • (powers of 2; make sure the mini-batch fits in CPU/GPU memory)
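A small sketch of the random mini-batch selection described above, assuming the dataset fits in memory as NumPy arrays X and Y (names chosen for illustration):

import numpy as np

def minibatches(X, Y, batch_size=64, seed=0):
    """Yield randomly shuffled mini-batches of (X, Y), one pass over the data."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # shuffle so gradient estimates are unbiased
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]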
13. Convex / Non-convex optimization
A function f : X → ℝ defined on an n-dimensional interval is convex if for any x, x' ∈ X and λ ∈ [0,1]:

  f(λx + (1−λ)x') ≤ λ f(x) + (1−λ) f(x')

(the value at a convex combination of points, f(λx + (1−λ)x'), lies below the chord λ f(x) + (1−λ) f(x'))
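As a quick worked check of the definition (an added example, not from the slides), f(x) = x² is convex, since for any x, x' and λ ∈ [0,1]:

  λ f(x) + (1−λ) f(x') − f(λx + (1−λ)x')
    = λx² + (1−λ)x'² − (λx + (1−λ)x')²
    = λ(1−λ)(x − x')²  ≥ 0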
14. Convex / Non-convex optimization
• Convex optimization
  • any local minimum is a global minimum
  • there are several (polynomial-time) optimization algorithms
• Non-convex optimization
  • the objective function in deep networks is non-convex
  • deep models may have several local minima
  • but this is not necessarily a major problem!
15. Local minima and saddle points
• Critical points: points where ∇_x f(x) = 0, for f : ℝⁿ → ℝ
• For high-dimensional loss functions, local minima are rare compared to saddle points
• Hessian matrix: H_ij = ∂²f / ∂x_i ∂x_j; real and symmetric, so it has an eigenvector/eigenvalue decomposition
• Intuition: eigenvalues of the Hessian matrix
  • local minimum/maximum: all positive / all negative eigenvalues; exponentially unlikely as n grows
  • saddle points: both positive and negative eigenvalues

Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014
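A tiny NumPy illustration of this intuition: classify the critical point of the toy function f(x, y) = x² − y² at the origin by the signs of its Hessian eigenvalues (the example function is an assumption for illustration):

import numpy as np

H = np.array([[2.0,  0.0],    # d2f/dx2 = 2,  d2f/dxdy = 0
              [0.0, -2.0]])   # d2f/dydx = 0, d2f/dy2 = -2
eigenvalues = np.linalg.eigvalsh(H)   # the Hessian is symmetric -> real eigenvalues
if np.all(eigenvalues > 0):
    kind = "local minimum"
elif np.all(eigenvalues < 0):
    kind = "local maximum"
else:
    kind = "saddle point"
print(eigenvalues, kind)      # [-2.  2.] saddle point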
16. Local minima and saddle points
• It is believed that for many problems, including learning deep nets, almost all local minima have a function value very similar to that of the global optimum
  • Finding a local minimum is good enough
• For many random functions, local minima are more likely to have low cost than high cost.

(figure: value of local minima found by running SGD for 200 iterations on a simplified version of MNIST from different initial starting points; as the number of parameters increases, local minima tend to cluster more tightly)

Choromanska et al. The loss surfaces of multilayer networks, AISTATS 2015
17. Saddle points
How to escape from saddle points?
• First-order methods
  • initially attracted to saddle points, but unless the trajectory hits one exactly, it is repelled when close
  • hitting a critical point exactly is unlikely (the estimated gradient is noisy)
  • saddle points are very unstable: noise (stochastic gradient descent) helps convergence, the trajectory escapes quickly
• Second-order methods:
  • Newton's method can jump to saddle points (where the gradient is 0)

SGD tends to oscillate between slowly approaching a saddle point and quickly escaping from it.
Slide credit: K. McGuinness
18. Other difficulties
• Cliffs and exploding gradients
  • Nets with many layers / recurrent nets can contain very steep regions (cliffs): gradient descent can move the parameters too far, jumping off of the cliff (solution: gradient clipping, see the sketch below)
• Long-term dependencies
  • the computational graph becomes very deep (deep nets / recurrent nets): vanishing and exploding gradients

(figure: cost function of a highly non-linear deep net or recurrent net, Pascanu 2013)
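A minimal sketch of gradient clipping by norm, the fix mentioned above for cliffs; the threshold value is an arbitrary choice for illustration:

import numpy as np

def clip_gradient(g, max_norm=5.0):
    """Rescale the gradient if its norm exceeds max_norm, so a cliff region
    cannot throw the parameters arbitrarily far in a single step."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g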
20. Mini-batch Gradient Descent
• Most used algorithm for deep learning

Algorithm
• Require: initial parameter θ, learning rate α
• while stopping criterion not met do
  • sample a mini-batch of m examples {x^(i)}_{i=1...m} from the training set with corresponding targets {y^(i)}_{i=1...m}
  • compute gradient estimate: g ← (1/m) Σ_i ∇_θ L(f_θ(x^(i)), y^(i))
  • apply update: θ ← θ − α g
• end while
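A minimal NumPy sketch of this algorithm. grad_fn is an assumed user-supplied function that returns the averaged gradient estimate g for a mini-batch; the stopping criterion here is simply a fixed number of epochs.

import numpy as np

def sgd(grad_fn, theta, X, Y, alpha=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent: theta <- theta - alpha * g for each mini-batch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]  # sample a mini-batch of m examples
            g = grad_fn(theta, X[idx], Y[idx])     # gradient estimate g
            theta = theta - alpha * g              # apply update
    return theta

# Example grad_fn for a least-squares linear model (illustrative):
#   grad_mse = lambda theta, Xb, Yb: Xb.T @ (Xb @ theta - Yb) / len(Xb)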
21. Problems with GD
• GD can be very slow.
• It can get stuck in local minima or saddle points
• If the loss changes quickly in one direction and slowly in another, GD makes slow progress along the shallow dimension and jitters along the steep direction

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large.
22. Momentum
• Momentum is designed to accelerate learning, especially for high curvature, small but consistent gradients, or noisy gradients
• New variable: velocity v (the direction and speed at which the parameters move)
  • an exponentially decaying average of the negative gradient

Algorithm
• Require: initial parameter θ, learning rate α, momentum parameter λ, initial velocity v
• Update rule (g is the gradient estimate):
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Typical values: v₀ = 0, λ = 0.5, 0.9, 0.99 (λ ∈ [0,1))
• Read the physical analogy in the Deep Learning book (Goodfellow et al.): the velocity is the momentum of a unit-mass particle
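A one-step sketch of the momentum update; the default values of alpha and lam are illustrative:

def momentum_step(theta, v, g, alpha=0.01, lam=0.9):
    """One momentum update: v is the velocity, g the current gradient estimate."""
    v = lam * v - alpha * g     # velocity: decaying average of negative gradients
    theta = theta + v           # apply update
    return theta, v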
23. Nesterov accelerated gradient (NAG)
• A variant of momentum, where the gradient is evaluated after the current velocity is applied:
  • Approximate where the parameters will be on the next time step using the current velocity
  • Update the velocity using the gradient at the point where we predict the parameters will be

Algorithm
• Require: initial parameter θ, learning rate α, momentum parameter λ, initial velocity v
• Update:
  • apply interim update: θ̃ ← θ + λv
  • compute gradient (at the interim point): g ← (1/m) Σ_i ∇_θ̃ L(f_θ̃(x^(i)), y^(i))
  • compute velocity update: v ← λv − αg
  • apply update: θ ← θ + v
• Interpretation: add a correction factor to momentum
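A one-step sketch of the Nesterov update; grad_fn is an assumed function that returns the gradient estimate at a given parameter vector:

def nesterov_step(theta, v, grad_fn, alpha=0.01, lam=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the interim point."""
    theta_interim = theta + lam * v     # where the parameters would land with velocity alone
    g = grad_fn(theta_interim)          # gradient at the predicted (interim) point
    v = lam * v - alpha * g
    theta = theta + v
    return theta, v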
24. Nesterov accelerated gradient (NAG)
(figure: momentum computes v_{t+1} from the gradient ∇L(w_t) at the current location w_t; NAG instead uses the gradient ∇L(w_t + γv_t) at the location predicted by the velocity alone)
Slide credit: K. McGuinness
25. GD: learning rate
• The learning rate is a crucial parameter for GD
  • Too large: overshoots the local minimum, the loss increases
  • Too small: makes very slow progress, can get stuck
  • Good learning rate: makes steady progress toward the local minimum

(figure: loss curves for a learning rate that is too small vs too large)
26. GD: learning rate decay
• In practice it is necessary to gradually decrease the learning rate to speed up training (see the sketch below)
  • step decay (e.g. reduce by half every few epochs)
  • exponential decay: α = α₀ e^{−kt}
  • 1/t decay: α = α₀ / (1 + kt)
  • manual decay
  (α₀ is the initial learning rate, k the decay rate, t the iteration number)
• Sufficient conditions for convergence: Σ_{t=1}^{∞} α_t = ∞ and Σ_{t=1}^{∞} α_t² < ∞
• Usually: adapt the learning rate by monitoring learning curves that plot the objective function as a function of time (more of an art than a science!)
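Minimal sketches of the three decay schedules listed above; the default decay constants are illustrative values:

import numpy as np

def step_decay(alpha0, epoch, drop=0.5, every=10):
    """Step decay: multiply the learning rate by `drop` every `every` epochs."""
    return alpha0 * drop ** (epoch // every)

def exponential_decay(alpha0, t, k=0.01):
    """alpha = alpha0 * exp(-k t)"""
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(alpha0, t, k=0.01):
    """alpha = alpha0 / (1 + k t)"""
    return alpha0 / (1.0 + k * t)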
27. Adaptive learning rates
• The cost is often sensitive to some directions and insensitive to others
• Momentum/Nesterov mitigate this issue but introduce another hyperparameter
• Solution: use a separate learning rate for each parameter and automatically adapt it through the course of learning
• Algorithms (mini-batch based)
  • AdaGrad
  • RMSProp
  • Adam
28. AdaGrad
• Adapts the learning rate of each parameter based on the sizes of previous updates:
  • scales updates to be larger for parameters that are updated less
  • scales updates to be smaller for parameters that are updated more
• The net effect is greater progress in the more gently sloped directions of parameter space
• Require: initial parameter θ, learning rate α, small constant δ (e.g. 10⁻⁷) for numerical stability
• Update:
  • accumulate squared gradient: r ← r + g ⊙ g   (sum of all previous squared gradients)
  • compute update: Δθ ← −(α / (δ + √r)) ⊙ g   (updates inversely proportional to the square root of the sum; ⊙ is elementwise multiplication)
  • apply update: θ ← θ + Δθ

Duchi et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 2011
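A per-step AdaGrad sketch in NumPy (r is the accumulator of squared gradients, initialized to zeros with the same shape as θ):

import numpy as np

def adagrad_step(theta, r, g, alpha=0.01, delta=1e-7):
    """One AdaGrad update: r accumulates all past squared gradients (elementwise)."""
    r = r + g * g
    theta = theta - (alpha / (delta + np.sqrt(r))) * g
    return theta, r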
29. Root Mean Square Propagation (RMSProp)
• AdaGrad can result in a premature and excessive decrease in the effective learning rate
• RMSProp modifies AdaGrad to perform better on non-convex surfaces
• Changes the gradient accumulation into an exponentially decaying average of the sum of squares of gradients
• Require: initial parameter θ, learning rate α, decay rate ρ, small constant δ (e.g. 10⁻⁷)
• Update:
  • accumulate squared gradient: r ← ρr + (1−ρ) g ⊙ g
  • compute update: Δθ ← −(α / (δ + √r)) ⊙ g
  • apply update: θ ← θ + Δθ

Geoff Hinton, unpublished
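The corresponding RMSProp step; compared with the AdaGrad sketch above, only the accumulation of r changes:

import numpy as np

def rmsprop_step(theta, r, g, alpha=0.001, rho=0.9, delta=1e-7):
    """One RMSProp update: r is an exponentially decaying average of squared gradients."""
    r = rho * r + (1.0 - rho) * g * g
    theta = theta - (alpha / (delta + np.sqrt(r))) * g
    return theta, r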
30. ADAptive Moments (Adam)
• A combination of RMSProp and momentum, but:
  • Keeps a decaying average of both the first-order moment of the gradient (momentum) and the second-order moment (RMSProp)
  • Includes bias corrections (first and second moments) to account for their initialization at the origin
• Update:
  • update biased first moment estimate: s ← ρ₁ s + (1 − ρ₁) g
  • update biased second moment estimate: r ← ρ₂ r + (1 − ρ₂) g ⊙ g
  • correct biases: ŝ ← s / (1 − ρ₁^t),  r̂ ← r / (1 − ρ₂^t)
  • compute update (operations applied elementwise): Δθ ← −α ŝ / (δ + √r̂)
  • apply update: θ ← θ + Δθ
• Typical values: δ = 10⁻⁸, ρ₁ = 0.9, ρ₂ = 0.999

Kingma et al. Adam: a Method for Stochastic Optimization. ICLR 2015
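A per-step Adam sketch; s and r start at zero and t is the 1-based step counter used for the bias corrections:

import numpy as np

def adam_step(theta, s, r, g, t, alpha=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    s = rho1 * s + (1.0 - rho1) * g          # first moment (momentum-like)
    r = rho2 * r + (1.0 - rho2) * g * g      # second moment (RMSProp-like)
    s_hat = s / (1.0 - rho1 ** t)            # bias corrections
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r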
34. Parameter initialization
• Weights
  • Can't initialize the weights to 0 (gradients would be 0)
  • Can't initialize all weights to the same value (all hidden units in a layer would always behave the same; need to break symmetry)
  • Small random numbers, e.g. from a uniform or Gaussian distribution
    • if the weights start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful
  • Xavier initialization (calibrating variances, for tanh activations): scale by sqrt(1/n)
    • each neuron: w = randn(n) / sqrt(n), with n inputs
  • He initialization (for ReLU activations): scale by sqrt(2/n)
    • each neuron: w = randn(n) * sqrt(2.0/n), with n inputs
• Biases
  • initialize all to 0 (except for the output unit with skewed distributions; or 0.01 to avoid saturating ReLUs)
• Alternative: initialize using machine learning; parameters learned by an unsupervised model trained on the same inputs / trained on an unrelated task
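A sketch of the two weight-initialization rules in NumPy; the layer shape (n_in, n_out) and the bias size are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Xavier initialization (tanh activations): variance ~ 1/n_in."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out):
    """He initialization (ReLU activations): variance ~ 2/n_in."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

b = np.zeros(64)   # biases initialized to 0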
35. Normalizing inputs
• Normalize inputs to speed up learning
• For input layers: data preprocessing (mean = 0, std = 1)
• For hidden layers: batch normalization

(figure: original data → zero-centered (mean = 0) → normalized (mean = 0, std = 1); loss surface for unnormalized vs normalized data)
36. Batch normalization
• As learning progresses, the distribution of the layer inputs changes due to parameter updates (internal covariate shift)
• This can result in most inputs being in the non-linear regime of the activation function, slowing down learning
• Batch normalization is a technique to reduce this effect
• Explicitly force the layer activations to have zero mean and unit variance w.r.t. running batch estimates
• Adds a learnable scale and bias term to allow the network to still use the nonlinearity

(typical placement: FC / Conv → Batch norm → ReLU → FC / Conv → Batch norm → ReLU)

Ioffe and Szegedy, 2015. "Batch normalization: accelerating deep network training by reducing internal covariate shift"
37. Batch normalization
• Can be applied to any input or hidden layer
• For a mini-batch B = {x_i}_{i=1...m} of m activations of the layer (each x_i of dimension D):
  1. Compute the empirical mean and variance for each dimension:
     μ_B = (1/m) Σ_{i=1}^{m} x_i,   σ_B² = (1/m) Σ_{i=1}^{m} (x_i − μ_B)²
  2. Normalize:
     x̂_i = (x_i − μ_B) / √(σ_B² + ε)
  3. Scale and shift (two learnable parameters γ, β):
     y_i = γ x̂_i + β

Note: normalization can reduce the expressive power of the network (e.g. normalizing the inputs of a sigmoid would constrain them to its linear regime). The learnable parameters let the network recover the identity mapping: with γ = √(σ_B² + ε) and β = μ_B, then y_i = x_i.
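A minimal sketch of this forward pass at training time, for a mini-batch x of shape (m, D); names and shapes are illustrative:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch x of shape (m, D): normalize per dimension,
    then scale and shift with the learnable gamma and beta."""
    mu = x.mean(axis=0)                     # empirical mean per dimension
    var = x.var(axis=0)                     # empirical variance per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    y = gamma * x_hat + beta                # scale and shift
    return y, mu, var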
38. Batch normalization
Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the hidden layer's activations within that mini-batch, having a slight regularization effect:
• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the strong dependency on initialization
• Reduces the need for regularization

At test time BN layers function differently:
• The mean and std are not computed on the batch.
• Instead, a single fixed empirical mean and std of the activations computed during training is used (can be estimated with exponentially decaying weighted averages, see the sketch below).
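A sketch of the inference-time behaviour, assuming running_mu and running_var were accumulated during training with an exponential moving average (all names illustrative):

import numpy as np

def batchnorm_inference(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """At test time: use the fixed running statistics instead of batch statistics."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# During training, the running estimates can be updated after each mini-batch, e.g.:
#   running_mu  = momentum * running_mu  + (1 - momentum) * mu
#   running_var = momentum * running_var + (1 - momentum) * var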
39. Summary
• Optimization for NN training is different from pure optimization:
  • GD with mini-batches
  • early stopping
  • non-convex surface, saddle points
• The learning rate has a significant impact on model performance
• Several extensions to GD can improve convergence
• Adaptive learning-rate methods are likely to achieve the best results
  • RMSProp, Adam
• Weight initialization: He, w = randn(n) * sqrt(2/n)
• Batch normalization to reduce the internal covariate shift
40. Bibliography
• Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
• Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
• Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
• Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations.
• Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures.
• Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307.
• Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.