SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
[course	
  site]	
  
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Optimization for neural
network training
Day 3 Lecture 2
#DLUPC
Previously	
  in	
  DLAI…	
  
•  Mul.layer	
  perceptron	
  
•  Training:	
  (stochas.c	
  /	
  mini-­‐batch)	
  gradient	
  descent	
  
•  Backpropaga.on	
  
•  Loss	
  func.on	
  
	
  
but…	
  
What	
  type	
  of	
  op.miza.on	
  problem?	
  
Do	
  local	
  minima	
  and	
  saddle	
  points	
  cause	
  problems?	
  
Does	
  gradient	
  descent	
  perform	
  well?	
  
How	
  to	
  set	
  the	
  learning	
  rate?	
  
How	
  to	
  ini.alize	
  weights?	
  
How	
  does	
  batch	
  size	
  affect	
  training?	
  
	
  
	
  
	
  
2	
  
Index	
  
•  Op6miza6on	
  for	
  a	
  machine	
  learning	
  task;	
  difference	
  between	
  learning	
  and	
  pure	
  op6miza6on	
  
•  Expected	
  and	
  empirical	
  risk	
  
•  Surrogate	
  loss	
  func.ons	
  and	
  early	
  stopping	
  
•  Batch	
  and	
  mini-­‐batch	
  algorithms	
  
•  Challenges	
  for	
  deep	
  models	
  
•  Local	
  minima	
  	
  
•  Saddle	
  points	
  and	
  other	
  flat	
  regions	
  
•  Cliffs	
  and	
  exploding	
  gradients	
  
•  Prac6cal	
  algorithms	
  
•  Stochas.c	
  Gradient	
  Descent	
  
•  Momentum	
  
•  Nesterov	
  Momentum	
  
•  Learning	
  rate	
  
•  Adap.ve	
  learning	
  rates:	
  adaGrad,	
  RMSProp,	
  Adam	
  
•  Parameter	
  ini6aliza6on	
  
•  Batch	
  Normaliza6on	
  
3	
  
Differences	
  between	
  learning	
  and	
  pure	
  
op6miza6on	
  
Op6miza6on	
  for	
  NN	
  training	
  
•  Goal:	
  Find	
  the	
  parameters	
  that	
  minimize	
  the	
  expected	
  risk	
  (generaliza.on	
  error)	
  
•  x	
  input,	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  predicted	
  output,	
  y	
  target	
  output,	
  E	
  expecta.on	
  
•  pdata	
  true	
  (unknown)	
  data	
  distribu.on,	
  L	
  	
  loss	
  func6on	
  (how	
  wrong	
  predic6ons	
  are)	
  
•  But	
  we	
  only	
  have	
  a	
  training	
  set	
  of	
  samples:	
  we	
  minimize	
  the	
  empirical	
  risk,	
  average	
  
loss	
  on	
  a	
  finite	
  dataset	
  D	
  
J(θ) = Ε(x,y)∼pdata
L( fθ
(x), y)
fθ
(x)
J(θ) = Ε(x,y)∼ ˆpdata
L( fθ
(x), y) =
1
D
L( fθ
(x(i)
), y(i)
)
(x(i)
,y(i)
)∈D
∑
where	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  is	
  the	
  empirical	
  distribu.on,	
  |D|	
  is	
  the	
  number	
  of	
  examples	
  in	
  D	
  
5	
  
ˆpdata
Surrogate	
  loss	
  
•  O]en	
  minimizing	
  the	
  real	
  loss	
  is	
  intractable	
  (can’t	
  be	
  used	
  with	
  gradient	
  descent)	
  
•  e.g.	
  0-­‐1	
  loss	
  (0	
  if	
  correctly	
  classified,	
  1	
  if	
  it	
  is	
  not)	
  
	
  	
  	
  	
  	
  	
  	
  (intractable	
  even	
  for	
  linear	
  classifiers	
  (Marcobe	
  1992)	
  	
  	
  
•  Minimize	
  a	
  surrogate	
  loss	
  instead	
  
•  e.g.	
  for	
  the	
  0-­‐1	
  loss	
  
hinge	
  
	
  
square	
  
	
  
logis.c	
  
6	
  
0-­‐1	
  loss	
  (blue)	
  and	
  surrogate	
  losses	
  	
  
(green:	
  square,	
  purple:	
  hinge,	
  yellow:	
  logis.c)	
  	
  
L( f (x) , y) = I( f (x)≠y)
L( f (x), y) = max(0,1− yf (x))
L( f (x), y) = (1− yf (x))2
L( f (x), y) = log(1+ e− yf (x)
)
Surrogate	
  loss	
  func6ons	
  
7	
  
Probabilistic
classifier
Outputs	
  probability	
  of	
  class	
  1	
  
f(x) ≈ P(y=1 | x) Probability for class 0 is 1-f(x)
Binary cross-entropy loss:
L(f(x),y) = -(y log(f(x)) + (1-y) log(1-f(x))
Decision	
  func.on: F(x) = If(x)>0.5
Outputs	
  a	
  vector	
  of	
  probabili.es:	
  
f(x) ≈ ( P(y=0|x), ..., P(y=m-1|x) )
Negative conditional log likelihood loss
L(f(x),y) = -log f(x)y
Decision	
  func.on:	
  F(x) = argmax(f(x))
Non-
Hinge	
  loss:	
  probabilistic
classifier
Outputs a «score» f(x) for class 1.
score for the other class is -f(x)
L(f(x),t) = max(0, 1-t f(x)) where t=2y-1
Decision	
  func.on:	
  	
  F(x) = If(x)>0
Outputs	
  a	
  vector	
  f(x) of	
  real-­‐valued	
  
scores	
  for	
  the	
  m	
  classes.
Mul.class	
  margin	
  loss	
  
L(f(x),y) = max(0,1+max(f(x)k)-f(x)y )
k≠y
Decision	
  func.on:	
  	
  F(x) = argmax(f(x))
Binary classifier Multiclass classifier
Early	
  stopping	
  
•  Training	
  algorithms	
  usually	
  do	
  not	
  halt	
  at	
  a	
  local	
  minimum	
  
•  Convergence	
  criterion	
  based	
  on	
  early	
  stopping:	
  
•  based	
  on	
  surrogate	
  loss	
  or	
  true	
  underlying	
  loss	
  (ex	
  0-­‐1	
  loss)	
  measured	
  on	
  a	
  valida6on	
  set	
  
•  #	
  training	
  steps	
  =	
  hyperparameter	
  controlling	
  the	
  effec.ve	
  capacity	
  of	
  the	
  model	
  
•  simple,	
  effec.ve,	
  must	
  keep	
  a	
  copy	
  of	
  the	
  best	
  parameters	
  
•  acts	
  as	
  a	
  regularizer	
  (Bishop	
  1995,…)	
  
8	
  
Training	
  error	
  decreases	
  steadily	
  
Valida.on	
  error	
  begins	
  to	
  increase	
  
	
  
Return	
  parameters	
  at	
  point	
  with	
  
lowest	
  valida6on	
  error	
  
Batch	
  and	
  mini-­‐batch	
  algorithms	
  
•  Gradient	
  descent	
  at	
  each	
  itera.on	
  computes	
  gradients	
  over	
  the	
  en.re	
  dataset	
  for	
  one	
  update	
  
	
  
•  ↑	
  Gradients	
  are	
  stable	
  
•  ↓	
  Using	
  the	
  complete	
  training	
  set	
  can	
  be	
  very	
  expensive	
  	
  
•  the	
  gain	
  of	
  using	
  more	
  samples	
  is	
  less	
  than	
  linear:	
  	
  
•  standard	
  error	
  of	
  the	
  mean	
  es.mated	
  from	
  m	
  samples	
  is	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  (σ	
  is	
  true	
  std)	
  	
  
•  ↓	
  Training	
  set	
  may	
  be	
  redundant	
  
•  Use	
  a	
  subset	
  of	
  the	
  training	
  set	
  
Loop:	
  
1.  sample	
  a	
  subset	
  of	
  data	
  
2.  forward	
  prop	
  through	
  the	
  network	
  
3.  backprop	
  to	
  calculate	
  gradients	
  
4.  update	
  parameters	
  using	
  gradients	
   9	
  
∇θ
J(θ) =
1
m
∇θ
L( fθ
(x(i)
), y(i)
)i∑
SE =
σ
m
Minibatch	
  gradient	
  descent	
  
Batch	
  and	
  mini-­‐batch	
  algorithms	
  
•  How	
  many	
  samples	
  in	
  each	
  update	
  step?	
  
	
  
•  Determinis.c	
  or	
  batch	
  gradient	
  methods:	
  process	
  all	
  training	
  samples	
  in	
  a	
  large	
  batch	
  
•  Mini-­‐batch	
  stochas.c	
  methods:	
  use	
  several	
  (not	
  all)	
  samples	
  
	
  
•  Stochas.c	
  methods:	
  use	
  a	
  single	
  example	
  at	
  a	
  .me	
  
•  online	
  methods:	
  samples	
  are	
  drawn	
  from	
  a	
  stream	
  of	
  con.nually	
  created	
  samples	
  
10	
  batch	
  vs	
  minibatch	
  gradient	
  descent	
  
Batch	
  and	
  mini-­‐batch	
  algorithms	
  
Mini-­‐batch	
  size?	
  
•  Larger	
  batches:	
  more	
  accurate	
  es.mate	
  of	
  the	
  gradient	
  but	
  less	
  than	
  linear	
  return	
  	
  
•  Very	
  small	
  batches:	
  Mul.core	
  architectures	
  under-­‐u.lized	
  
•  Smaller	
  batches	
  provide	
  noisier	
  gradient	
  es.mates	
  
•  Small	
  batches	
  may	
  offer	
  a	
  regularizing	
  effect	
  	
  (add	
  noise)	
  
•  but	
  may	
  require	
  small	
  learning	
  rate	
  
•  may	
  increase	
  number	
  of	
  steps	
  for	
  convergence	
  
	
  
•  If	
  small	
  training	
  set,	
  use	
  batch	
  gradient	
  descent	
  
•  If	
  large	
  training	
  set,	
  use	
  mini	
  batches	
  
•  Minbatches	
  should	
  be	
  selected	
  randomly	
  (shuffle	
  samples)	
  
•  unbiased	
  es.mate	
  of	
  gradients	
  
•  Typical	
  mini-­‐batch	
  size:	
  32,	
  64,	
  128,	
  256	
  	
  
•  (2p,	
  make	
  sure	
  mini-­‐batch	
  fits	
  in	
  CPU-­‐GPU	
  memory)	
  
11	
  
Challenges	
  in	
  deep	
  NN	
  op6miza6on	
  
Convex	
  /	
  Non-­‐convex	
  op6miza6on	
  
A	
  func.on	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  defined	
  on	
  an	
  n-­‐dim	
  interval	
  is	
  convex	
  if	
  for	
  any	
  	
  	
  
13	
  
f : X → !
f (λx + (1− λ)x') ≤ λ f (x) + (1− λ) f (x')
x,x' ∈X λ ∈[0,1]
f (λx + (1− λ)x')
λ f (x) + (1− λ) f (x')
Convex	
  /	
  Non-­‐convex	
  op6miza6on	
  
•  Convex	
  op.miza.on	
  
•  any	
  local	
  minimum	
  is	
  a	
  global	
  minimum	
  
•  there	
  are	
  several	
  opt.	
  algorithms	
  (polynomial-­‐.me)	
  
	
  
•  Non-­‐convex	
  op.miza.on	
  
•  objec6ve	
  func6on	
  in	
  deep	
  networks	
  is	
  non-­‐convex	
  
•  deep	
  models	
  may	
  have	
  several	
  local	
  minima	
  
•  but	
  this	
  is	
  not	
  necessarily	
  a	
  major	
  problem!	
  
14	
  
Local	
  minima	
  and	
  saddle	
  points	
  
•  Cri6cal	
  points:	
  	
  
•  For	
  high	
  dimensional	
  loss	
  func.ons,	
  local	
  minima	
  are	
  rare	
  compared	
  to	
  saddle	
  points	
  
•  Hessian	
  matrix:	
  	
  
	
  real,	
  symmetric	
  
	
  eigenvector/eigenvalue	
  decomposi.on	
  
	
  
•  Intui.on:	
  eigenvalues	
  of	
  the	
  Hessian	
  matrix	
  	
  
•  local	
  minimum/maximum:	
  all	
  posi.ve	
  /	
  all	
  nega.ve	
  eigenvalues:	
  exponen.ally	
  unlikely	
  as	
  n	
  grows	
  
•  saddle	
  points:	
  both	
  posi.ve	
  and	
  nega.ve	
  eigenvalues	
  
15	
  Dauphin	
  et	
  al.	
  Iden.fying	
  and	
  abacking	
  the	
  saddle	
  point	
  problem	
  in	
  high-­‐dimensional	
  non-­‐convex	
  op.miza.on.	
  NIPS	
  2014	
  	
  
Hij
=
∂2
f
∂xi
∂xj
f :!n
→ !
∇x
f (x) = 0
Local	
  minima	
  and	
  saddle	
  points	
  
•  It	
  is	
  believed	
  that	
  for	
  many	
  problems	
  
including	
  learning	
  deep	
  nets,	
  almost	
  all	
  local	
  
minimum	
  have	
  very	
  similar	
  func.on	
  value	
  to	
  
the	
  global	
  op.mum	
  
•  Finding	
  a	
  local	
  minimum	
  is	
  good	
  enough	
  
16	
  
Value	
  of	
  local	
  minima	
  found	
  by	
  running	
  SGD	
  for	
  200	
  
itera.ons	
  on	
  a	
  simplified	
  version	
  of	
  MNIST	
  from	
  different	
  
ini.al	
  star.ng	
  points.	
  As	
  number	
  of	
  parameters	
  increases,	
  
local	
  minima	
  tend	
  to	
  cluster	
  more	
  .ghtly.	
  
•  For	
  many	
  random	
  func.ons	
  local	
  minima	
  are	
  more	
  likely	
  to	
  have	
  low	
  cost	
  than	
  high	
  
cost.	
  
Choromanska	
  et	
  al.	
  The	
  loss	
  surfaces	
  of	
  mul.layer	
  networks,	
  AISTATS	
  2015	
  
Saddle	
  points	
  
How	
  to	
  escape	
  from	
  saddle	
  points?	
  
•  First	
  order	
  methods	
  
•  ini.ally	
  abracted	
  to	
  saddle	
  points,	
  but	
  unless	
  
exact	
  hit,	
  it	
  will	
  be	
  repelled	
  when	
  close	
  
•  hitng	
  cri.cal	
  point	
  exactly	
  is	
  unlikely	
  (es.mated	
  
gradient	
  is	
  noisy)	
  
•  saddle	
  points	
  are	
  very	
  unstable:	
  noise	
  (stochas.c	
  
gradient	
  descent)	
  helps	
  convergence,	
  trajectory	
  
escapes	
  quickly	
  
•  Second	
  order	
  moments:	
  
•  Netwon’s	
  method	
  can	
  jump	
  to	
  saddle	
  points	
  
(where	
  gradient	
  is	
  0)	
  
17	
  S.	
  Credit:	
  K.McGuinness	
  
SGD	
  tends	
  to	
  oscillate	
  between	
  slowly	
  approaching	
  
a	
  saddle	
  point	
  and	
  quickly	
  escaping	
  from	
  it	
  
Other	
  difficul6es	
  
•  Cliffs	
  and	
  exploding	
  gradients	
  
•  Nets	
  with	
  many	
  layers	
  /	
  recurrent	
  nets	
  can	
  contain	
  very	
  steep	
  regions	
  (cliffs):	
  
gradient	
  descent	
  can	
  move	
  parameters	
  too	
  far,	
  jumping	
  off	
  of	
  the	
  cliff.	
  (solu.ons:	
  
gradient	
  clipping)	
  
•  Long	
  term	
  dependencies	
  
•  computa.onal	
  graph	
  becomes	
  very	
  deep	
  (deep	
  nets	
  /	
  recurrent	
  nets):	
  vanishing	
  
and	
  exploding	
  gradients	
  
18	
  
cost	
  func.on	
  of	
  highly	
  
non	
  linear	
  deep	
  nets	
  
or	
  recurrent	
  net	
  
(Pascanu2013)	
  
Algorithms	
  
Mini-­‐batch	
  Gradient	
  Descent	
  
•  Most	
  used	
  algorithm	
  for	
  deep	
  learning	
  
Algorithm	
  
•  Require:	
  ini.al	
  parameter	
  θ,	
  learning	
  rate	
  α,	
  	
  
•  while	
  stopping	
  criterion	
  not	
  met	
  do	
  
•  sample	
  a	
  minibatch	
  of	
  m	
  examples	
  from	
  the	
  training	
  set	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  with	
  
corresponding	
  targets	
  	
  
•  compute	
  gradient	
  es.mate	
  
•  apply	
  update	
  	
  
•  end	
  while	
  
20	
  
{x(i)
}i=1...m
{y(i)
}i=1...m
g ← +
1
m
∇θ
L( fθ
(x(i)
), y(i)
)i∑
θ ←θ −αg
Problems	
  with	
  GD	
  
•  GD	
  can	
  be	
  very	
  slow.	
  	
  
•  Can	
  get	
  stuck	
  in	
  local	
  minima	
  or	
  saddle	
  points	
  
•  If	
  the	
  loss	
  changes	
  quickly	
  in	
  one	
  direc.on	
  and	
  slowly	
  in	
  another,	
  GD	
  makes	
  slow	
  
progress	
  along	
  shallow	
  dimension,	
  jibers	
  along	
  steep	
  direc.on	
  	
  
21	
  
Loss	
  func.on	
  has	
  a	
  high	
  condi6on	
  number	
  (5):	
  ra.o	
  of	
  
largest	
  to	
  smallest	
  singular	
  value	
  of	
  Hessian	
  matrix	
  is	
  large	
  
Momentum	
  	
  
•  Momentum	
  is	
  designed	
  to	
  accelerate	
  learning,	
  especially	
  for	
  high	
  curvature,	
  small	
  but	
  
consistent	
  gradients	
  or	
  noisy	
  gradients	
  
•  New	
  variable	
  velocity	
  v	
  (direc.on	
  and	
  speed	
  at	
  which	
  parameters	
  move)	
  
•  exponen.ally	
  decaying	
  average	
  of	
  nega.ve	
  gradient	
  
	
  
Algorithm	
  
•  Require:	
  ini.al	
  parameter	
  θ,	
  learning	
  rate	
  α,	
  	
  momentum	
  parameter	
  λ	
  	
  ,	
  ini6al	
  velocity	
  v
•  Update	
  rule:	
  (g	
  is	
  gradient	
  es.mate)	
  
•  compute	
  velocity	
  update	
  
•  apply	
  update	
  	
  
	
  
	
  
•  Typical	
  values	
  v0=0,	
  	
  λ=0.5,	
  0.9,0.99	
  	
  	
  (in	
  [0,1})	
  
•  Read	
  physical	
  analogy	
  in	
  Deep	
  Learning	
  book	
  (Goodfellow	
  et	
  al):	
  velocity	
  =	
  momentum	
  of	
  unit	
  mass	
  par.cle	
  
22	
  
θ ←θ + v
v ← λv −αg
Nesterov	
  accelerated	
  gradient	
  (NAG)	
  
•  A	
  variant	
  of	
  momentum,	
  where	
  gradient	
  is	
  evaluated	
  a]er	
  current	
  velocity	
  is	
  applied:	
  
•  Approximate	
  where	
  the	
  parameters	
  will	
  be	
  on	
  the	
  next	
  .me	
  step	
  using	
  current	
  velocity	
  
•  Update	
  velocity	
  using	
  gradient	
  where	
  we	
  predict	
  parameters	
  will	
  be	
  
Algorithm	
  
•  Require:	
  ini.al	
  parameter	
  θ,	
  learning	
  rate	
  α,	
  momentum	
  parameter	
  λ	
  	
  ,	
  ini.al	
  velocity	
  v
•  Update:	
  
•  apply	
  interim	
  update	
  
•  compute	
  gradient	
  (at	
  interim	
  point)	
  
•  compute	
  velocity	
  update	
  
•  apply	
  update	
  	
  
•  Interpreta.on:	
  add	
  a	
  correc.on	
  factor	
  to	
  momentum	
  
23	
  
g ← +
1
m
∇!θ
L!θ
( f (x(i)
), y(i)
)i∑
θ ←θ + v
v ← λv −αg
!θ ←θ + λv
interim	
  	
  
Nesterov	
  accelerated	
  gradient	
  (NAG)	
  
24	
  
current	
  loca.on	
  wt
vt
∇L(wt) vt+1
S.	
  Credit:	
  K.	
  McGuinness	
  
predicted	
  loca.on	
  based	
  on	
  velocity	
  alone	
  wt + 𝛾v
∇L(wt + 𝛾vt)
vt
vt+1
GD:	
  learning	
  rate	
  
•  Learning	
  rate	
  is	
  a	
  crucial	
  parameter	
  for	
  GD	
  
•  Too	
  large:	
  overshoots	
  local	
  minimum,	
  loss	
  increases	
  
•  Too	
  small:	
  makes	
  very	
  slow	
  progress,	
  can	
  get	
  stuck	
  
•  Good	
  learning	
  rate:	
  makes	
  steady	
  progress	
  toward	
  local	
  minimum	
  
25	
  
too	
  small	
   too	
  large	
  
GD:	
  learning	
  rate	
  decay	
  
•  In	
  prac.ce	
  it	
  is	
  necessary	
  to	
  gradually	
  decrease	
  learning	
  rate	
  to	
  speed	
  up	
  the	
  training	
  
•  step	
  decay	
  (e.g.	
  reduce	
  by	
  half	
  every	
  few	
  epochs)	
  
•  exponen6al	
  decay	
  
•  1/t	
  decay	
  	
  
•  manual	
  decay	
  
•  Sufficient	
  condi.ons	
  for	
  convergence:	
  
•  Usually:	
  adapt	
  learning	
  rate	
  by	
  monitoring	
  learning	
  curves	
  that	
  plot	
  the	
  objec.ve	
  
func.on	
  as	
  a	
  func.on	
  of	
  .me	
  (more	
  of	
  an	
  art	
  than	
  a	
  science!)	
  
26	
  
αt
= ∞
t=1
∞
∑ αt
2
< ∞
t=1
∞
∑
α = α0
e−kt
α =
α0
1+ kt
k decay rate
t iteration number
α0
initial learning rate
Adap6ve	
  learning	
  rates	
  
•  Cost	
  if	
  o]en	
  sensi.ve	
  to	
  some	
  direc.ons	
  and	
  insensi.ve	
  to	
  others	
  
•  Momentum/Nesterov	
  mi.gate	
  this	
  issue	
  but	
  introduce	
  another	
  hyperparameter	
  
•  Solu6on:	
  Use	
  a	
  separate	
  learning	
  rate	
  for	
  each	
  parameter	
  and	
  automa6cally	
  adapt	
  it	
  
through	
  the	
  course	
  of	
  learning	
  	
  
•  Algorithms	
  (mini-­‐batch	
  based)	
  
•  AdaGrad	
  
•  RMSProp	
  
•  Adam	
  
	
  
	
  
27	
  
AdaGrad	
  
•  Adapts	
  the	
  learning	
  rate	
  of	
  each	
  parameter	
  based	
  on	
  sizes	
  of	
  previous	
  updates:	
  	
  
•  scales	
  updates	
  to	
  be	
  larger	
  for	
  parameters	
  that	
  are	
  updated	
  less	
  
•  scales	
  updates	
  to	
  be	
  smaller	
  for	
  parameters	
  that	
  are	
  updated	
  more	
  
	
  
•  The	
  net	
  effect	
  is	
  greater	
  progress	
  in	
  the	
  more	
  gently	
  sloped	
  direc.ons	
  of	
  parameter	
  space	
  	
  
•  Require:	
  ini.al	
  parameter	
  θ,	
  learning	
  rate	
  α,	
  small	
  constant	
  δ	
  (e.g.	
  10-­‐7)	
  for	
  numerical	
  stability
•  Update:	
  
•  accumulate	
  squared	
  gradient	
  
•  compute	
  update	
  
•  apply	
  update	
  	
  
28	
  
θ ←θ + Δθ
Δθ ← −
α
δ + r
⊙ g
r ← r + g ⊙ g sum	
  of	
  	
  all	
  previous	
  squared	
  gradients	
  
	
  
updates	
  inversely	
  propor.onal	
  to	
  the	
  
square	
  root	
  of	
  the	
  sum	
  
(elementwise	
  mul.plica.on)	
  
Duchi	
  et	
  al.	
  Adap.ve	
  Subgradient	
  Methods	
  for	
  Online	
  Learning	
  and	
  Stochas.c	
  Op.miza.on.	
  JMRL	
  2011	
  
Root	
  Mean	
  Square	
  Propaga6on	
  (RMSProp)	
  
•  AdaGrad	
  can	
  result	
  in	
  a	
  premature	
  and	
  excessive	
  decrease	
  in	
  effec6ve	
  learning	
  rate	
  
•  RMSProp	
  modifies	
  AdaGrad	
  to	
  perform	
  beber	
  in	
  non-­‐convex	
  surfaces	
  
•  Changes	
  gradient	
  accumula.on	
  by	
  an	
  exponen6ally	
  decaying	
  average	
  of	
  sum	
  of	
  
squares	
  of	
  gradients	
  
	
  
•  Requires:	
  ini.al	
  parameter	
  θ,	
  learning	
  rate	
  α,	
  decay	
  rate	
  ρ,	
  small	
  constant	
  δ	
  (e.g.	
  10-­‐7)	
  
•  Update:	
  
•  accumulate	
  squared	
  gradient	
  
•  compute	
  update	
  
•  apply	
  update	
  	
  
29	
  
θ ←θ + Δθ
Δθ ← −
α
δ + r
⊙ g
r ← ρr + (1− ρ)g ⊙ g
Geoff	
  Hinton,	
  Unpublished	
  
ADAp6ve	
  Moments	
  (Adam)	
  
•  Combina.on	
  of	
  RMSProp	
  and	
  momentum,	
  but:	
  
•  Keep	
  decaying	
  average	
  of	
  both	
  first-­‐order	
  moment	
  of	
  gradient	
  (momentum)	
  and	
  second-­‐
order	
  moment	
  (RMSProp)	
  
•  Includes	
  bias	
  correc.ons	
  (first	
  and	
  second	
  moments)	
  to	
  account	
  for	
  their	
  ini.aliza.on	
  at	
  
origin	
  
Update:	
  
•  updated	
  biased	
  first	
  moment	
  es6mate	
  
•  update	
  biased	
  second	
  moment	
  
•  correct	
  biases	
  
•  compute	
  update	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  (opera.ons	
  applied	
  elementwise)	
  
•  apply	
  update	
  
30	
  
θ ←θ + Δθ
Δθ ← −α
ˆs
δ + ˆr
s ← ρ1
s + (1− ρ1
)g
r ← ρ2
r + (1− ρ2
)g ⊙ g
ˆs ←
s
1− ρ1
ˆr ←
r
1− ρ2
Kingma	
  et	
  al.	
  Adam:	
  a	
  Method	
  for	
  Stochas.c	
  Op.miza.on.	
  ICLR	
  2015	
  
	
  δ=10-­‐8,	
  ρ1=0.9,	
  ρ2=0.999	
  
Example:	
  test	
  func6on	
  
31	
  
Image	
  credit:	
  Alec	
  Radford.	
  
Beale’s	
  func.on	
  
Example:	
  saddle	
  point	
  
32	
  
Image	
  credit:	
  Alec	
  Radford.	
  
Ini6aliza6on	
  -­‐	
  Normaliza6on	
  
Parameter	
  ini6aliza6on	
  
•  Weights	
  
•  Can’t	
  ini.alize	
  weights	
  to	
  0	
  	
  (gradients	
  would	
  be	
  0)	
  
•  Can’t	
  ini.alize	
  all	
  weights	
  to	
  the	
  same	
  value	
  (all	
  hidden	
  units	
  in	
  a	
  layer	
  will	
  always	
  
behave	
  the	
  same;	
  need	
  to	
  break	
  symmetry)	
  
•  Small	
  random	
  number,	
  e.g.,	
  uniform	
  or	
  gaussian	
  distribu.on	
  	
  
•  if	
  weights	
  start	
  too	
  small,	
  the	
  signal	
  shrinks	
  as	
  it	
  passes	
  through	
  each	
  layer	
  un.l	
  it	
  is	
  too	
  .ny	
  
to	
  be	
  useful	
  
•  Xavier	
  ini.aliza.on	
  (calibra.ng	
  variances,	
  for	
  tanh	
  ac.va.ons)	
  sqrt(1/n)	
  
•  each	
  neuron:	
  w	
  =	
  randn(n)	
  /	
  sqrt(n)	
  ,	
  n	
  inputs	
  
•  He	
  ini.aliza.on	
  (for	
  ReLu	
  ac.va.ons)	
  sqrt(2/n)	
  
•  each	
  neuron	
  w	
  =	
  randn(n)	
  *	
  sqrt(2.0	
  /n)	
  ,	
  n	
  inputs	
  
•  Biases	
  
•  ini.alize	
  all	
  to	
  0	
  (except	
  for	
  output	
  unit	
  for	
  skewed	
  distribu.ons,	
  0.01	
  to	
  avoid	
  satura.ng	
  RELU)	
  
•  Alterna6ve:	
  Ini.alize	
  using	
  machine	
  learning;	
  parameters	
  learned	
  by	
  unsupervised	
  model	
  
trained	
  on	
  the	
  same	
  inputs	
  /	
  trained	
  on	
  unrelated	
  task	
  
34	
  
Normalizing	
  inputs	
  
•  Normalizing	
  inputs	
  to	
  speed	
  up	
  learning	
  
•  For	
  input	
  layers:	
  data	
  preprocessing	
  mean	
  =	
  1,	
  std=1	
  	
  
	
  
•  For	
  hidden	
  layers:	
  batch	
  normaliza.on	
  
35	
  
original	
  data	
  	
   mean=0	
   mean	
  =0,	
  std=1 	
  	
  
Loss	
  for	
  unnormalized	
  data
	
  	
  
Loss	
  for	
  normalized	
  data	
  
Batch	
  normaliza6on	
  
•  As	
  learning	
  progresses,	
  the	
  distribu.on	
  of	
  the	
  layer	
  inputs	
  changes	
  due	
  
to	
  parameter	
  updates	
  (	
  internal	
  covariate	
  shi])	
  
	
  
•  This	
  can	
  result	
  in	
  most	
  inputs	
  being	
  in	
  the	
  non-­‐linear	
  regime	
  of	
  	
  
	
  the	
  ac.va.on	
  func.on,	
  slowing	
  down	
  learning	
  
	
  
•  Bach	
  normaliza.on	
  is	
  a	
  technique	
  to	
  reduce	
  this	
  effect	
  
•  Explicitly	
  force	
  the	
  layer	
  ac.va.ons	
  to	
  have	
  zero	
  mean	
  and	
  unit	
  
variance	
  w.r.t	
  running	
  batch	
  es.mates	
  
	
  
•  Adds	
  a	
  learnable	
  scale	
  and	
  bias	
  term	
  to	
  allow	
  the	
  network	
  to	
  s.ll	
  
use	
  the	
  nonlinearity	
  
36	
  	
  Ioffe	
  and	
  Szegedy,	
  2015.	
  “Batch	
  normaliza.on:	
  accelera.ng	
  deep	
  network	
  training	
  by	
  reducing	
  internal	
  covariate	
  shi]”	
  
FC	
  /	
  Conv	
  
Batch	
  norm	
  
ReLu	
  
FC	
  /	
  Conv	
  
Batch	
  norm	
  
ReLu	
  
Batch	
  normaliza6on	
  
•  Can	
  be	
  applied	
  to	
  any	
  input	
  or	
  hidden	
  layer	
  
•  For	
  a	
  mini-­‐batch	
  of	
  m	
  ac.va.ons	
  of	
  the	
  layer	
  
1.  Compute	
  empirical	
  mean	
  and	
  variance	
  for	
  each	
  dimension	
  D	
  
2.  Normalize	
  
3.  Scale	
  and	
  shi] 	
   	
   	
   	
   	
  (two	
  learnable	
  parameters	
  )	
  
	
  
37	
  
ˆxi
=
xi
− µB
σ B
2
+ ε
m
D
x
yi
= γ ˆxi
+ β
B = xi{ }i=1....m
µB
=
1
m
xi
i=1
m
∑ σ B
2
=
1
m
(xi
− µB
)2
i=1
m
∑
Note:	
  normaliza.on	
  can	
  reduce	
  the	
  expressive	
  power	
  of	
  the	
  network	
  (e.g.	
  normalize	
  inputs	
  of	
  a	
  
sigmoid	
  would	
  constrain	
  them	
  to	
  the	
  linear	
  regime	
  
To	
  recover	
  the	
  iden.ty	
  mapping.	
  The	
  network	
  can	
  lean	
  
Then	
  	
  
β = µBγ = σ B
2
+ ε
ˆyi
= xi
Batch	
  normaliza6on	
  
Each	
  mini-­‐batch	
  is	
  scaled	
  by	
  the	
  mean/variance	
  computed	
  on	
  just	
  that	
  mini-­‐batch.	
  
This	
  adds	
  some	
  noise	
  to	
  the	
  hidden	
  layer’s	
  ac.va.ons	
  within	
  that	
  minibatch,	
  having	
  a	
  
slight	
  regulariza.on	
  effect:	
  
	
  
•  Improves	
  gradient	
  flow	
  through	
  the	
  network	
  
•  Allows	
  higher	
  learning	
  rates	
  
•  Reduces	
  the	
  strong	
  dependency	
  on	
  ini.aliza.on	
  
•  Reduces	
  the	
  need	
  of	
  regulariza.on	
  
At	
  test	
  .me	
  BN	
  layers	
  func.on	
  differently:	
  
•  Mean	
  and	
  std	
  are	
  not	
  computed	
  on	
  the	
  batch.	
  
•  Instead,	
  a	
  single	
  fixed	
  empirical	
  mean	
  and	
  std	
  of	
  ac.va.ons	
  computed	
  during	
  training	
  is	
  
used	
  (can	
  be	
  es.mated	
  with	
  exponen.ally	
  decaying	
  weighted	
  averages)	
  
38	
  
Summary	
  
39	
  
•  Op.miza.on	
  for	
  NN	
  is	
  different	
  from	
  pure	
  op.miza.on:	
  
•  GD	
  with	
  mini-­‐batches	
  
•  early	
  stopping	
  
•  non-­‐convex	
  surface,	
  saddle	
  points	
  
•  Learning	
  rate	
  has	
  a	
  significant	
  impact	
  on	
  model	
  performance	
  
•  Several	
  extensions	
  to	
  GD	
  can	
  improve	
  convergence	
  
•  Adap.ve	
  learning-­‐rate	
  methods	
  are	
  likely	
  to	
  achieve	
  best	
  results	
  
•  RMSProp,	
  Adam	
  	
  
•  Weight	
  ini.aliza.on:	
  He	
  	
  	
  	
  w=	
  randn(n)/	
  sqrt(2/n)	
  	
  
•  Batch	
  normaliza.on	
  to	
  reduce	
  the	
  internal	
  covariance	
  shi]	
  
Bibliograpy	
  
•  Goodfellow,	
  I.,	
  Bengio,	
  Y.,	
  and	
  A.,	
  C.	
  (2016),	
  Deep	
  Learning,	
  MIT	
  Press.	
  
•  Choromanska,	
  A.,	
  Henaff,	
  M.,	
  Mathieu,	
  M.,	
  Arous,	
  G.	
  B.,	
  and	
  LeCun,	
  Y.	
  (2015),	
  The	
  loss	
  surfaces	
  of	
  
mul.layer	
  networks.	
  In	
  AISTATS.	
  
•  Dauphin,	
  Y.	
  N.,	
  Pascanu,	
  R.,	
  Gulcehre,	
  C.,	
  Cho,	
  K.,	
  Ganguli,	
  S.,	
  and	
  Bengio,	
  Y.	
  (2014).	
  Iden.fying	
  and	
  
abacking	
  the	
  saddle	
  point	
  problem	
  in	
  high-­‐dimensional	
  non-­‐convex	
  op.miza.on.	
  In	
  Advances	
  in	
  
Neural	
  Informa.on	
  Processing.	
  Systems,	
  pages	
  2933–2941.	
  
•  Duchi,	
  J.,	
  Hazan,	
  E.,	
  and	
  Singer,	
  Y.	
  (2011).	
  Adap.ve	
  subgradient	
  methods	
  for	
  online	
  learning	
  and	
  
stochas.c	
  op.miza.on.	
  Journal	
  of	
  Machine	
  Learning	
  Research,	
  12(Jul):2121–2159.	
  
•  Goodfellow,	
  I.	
  J.,	
  Vinyals,	
  O.,	
  and	
  Saxe,	
  A.	
  M.	
  (2015).	
  Qualita.vely	
  characterizing	
  neural	
  network	
  
op.miza.on	
  problems.	
  In	
  Interna.onal	
  Conference	
  on	
  Learning	
  Representa.ons.	
  
•  Hinton,	
  G.	
  (2012).	
  Neural	
  networks	
  for	
  machine	
  learning.	
  Coursera,	
  video	
  lectures	
  
•  Jacobs,	
  R.	
  A.	
  (1988).	
  Increased	
  rates	
  of	
  convergence	
  through	
  learning	
  rate	
  adapta.on.	
  Neural	
  
networks,	
  1(4):295–307.	
  
•  Kingma,	
  D.	
  and	
  Ba,	
  J.	
  (2014)-­‐	
  Adam:	
  A	
  method	
  for	
  stochas.c	
  op.miza.on.	
  arXiv	
  preprint	
  arXiv:
1412.6980.	
  
•  Saxe,	
  A.	
  M.,	
  McClelland,	
  J.	
  L.,	
  and	
  Ganguli,	
  S.	
  (2013).	
  Exact	
  solu.ons	
  to	
  the	
  nonlinear	
  dynamics	
  of	
  
learning	
  in	
  deep	
  linear	
  neural	
  networks.	
  In	
  Interna.onal	
  Conference	
  on	
  Learning	
  Representa.ons	
  40	
  

Mais conteúdo relacionado

Mais procurados

Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-LearningKuppusamy P
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning TechniquesBabu Priyavrat
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Edureka!
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networksAkash Goel
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
Perceptron 2015.ppt
Perceptron 2015.pptPerceptron 2015.ppt
Perceptron 2015.pptSadafAyesha9
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree LearningMilind Gokhale
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
 

Mais procurados (20)

Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Reinforcement learning, Q-Learning
Reinforcement learning, Q-LearningReinforcement learning, Q-Learning
Reinforcement learning, Q-Learning
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
Ensemble learning Techniques
Ensemble learning TechniquesEnsemble learning Techniques
Ensemble learning Techniques
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
Autoencoders Tutorial | Autoencoders In Deep Learning | Tensorflow Training |...
 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Support vector machine-SVM's
Support vector machine-SVM'sSupport vector machine-SVM's
Support vector machine-SVM's
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
 
backpropagation in neural networks
backpropagation in neural networksbackpropagation in neural networks
backpropagation in neural networks
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Perceptron 2015.ppt
Perceptron 2015.pptPerceptron 2015.ppt
Perceptron 2015.ppt
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
 

Semelhante a Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018

Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learningYogendra Singh
 
1629 stochastic subgradient approach for solving linear support vector
1629 stochastic subgradient approach for solving linear support vector1629 stochastic subgradient approach for solving linear support vector
1629 stochastic subgradient approach for solving linear support vectorDr Fereidoun Dejahang
 
Machine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University ChhattisgarhMachine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University ChhattisgarhPoorabpatel
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessIgor Sfiligoi
 
super vector machines algorithms using deep
super vector machines algorithms using deepsuper vector machines algorithms using deep
super vector machines algorithms using deepKNaveenKumarECE
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习AdaboostShocky1
 
Learning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerLearning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerSeiya Tokui
 
مدخل إلى تعلم الآلة
مدخل إلى تعلم الآلةمدخل إلى تعلم الآلة
مدخل إلى تعلم الآلةFares Al-Qunaieer
 
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNING
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNINGARTIFICIAL-NEURAL-NETWORKMACHINELEARNING
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNINGmohanapriyastp
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
An Introduction to Deep Learning
An Introduction to Deep LearningAn Introduction to Deep Learning
An Introduction to Deep Learningmilad abbasi
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningMehrnaz Faraz
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in PythonValerio Maggio
 

Semelhante a Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018 (20)

Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
 
1629 stochastic subgradient approach for solving linear support vector
1629 stochastic subgradient approach for solving linear support vector1629 stochastic subgradient approach for solving linear support vector
1629 stochastic subgradient approach for solving linear support vector
 
Mit6 094 iap10_lec03
Mit6 094 iap10_lec03Mit6 094 iap10_lec03
Mit6 094 iap10_lec03
 
Machine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University ChhattisgarhMachine Learning workshop by GDSC Amity University Chhattisgarh
Machine Learning workshop by GDSC Amity University Chhattisgarh
 
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory AccessAccelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
super vector machines algorithms using deep
super vector machines algorithms using deepsuper vector machines algorithms using deep
super vector machines algorithms using deep
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
Learning stochastic neural networks with Chainer
Learning stochastic neural networks with ChainerLearning stochastic neural networks with Chainer
Learning stochastic neural networks with Chainer
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
مدخل إلى تعلم الآلة
مدخل إلى تعلم الآلةمدخل إلى تعلم الآلة
مدخل إلى تعلم الآلة
 
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNING
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNINGARTIFICIAL-NEURAL-NETWORKMACHINELEARNING
ARTIFICIAL-NEURAL-NETWORKMACHINELEARNING
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
An Introduction to Deep Learning
An Introduction to Deep LearningAn Introduction to Deep Learning
An Introduction to Deep Learning
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
 
Number Crunching in Python
Number Crunching in PythonNumber Crunching in Python
Number Crunching in Python
 

Mais de Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 

Mais de Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 

Último

Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Último (20)

Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona 2018

  • 1. [course  site]   Verónica Vilaplana veronica.vilaplana@upc.edu Associate Professor Universitat Politecnica de Catalunya Technical University of Catalonia Optimization for neural network training Day 3 Lecture 2 #DLUPC
  • 2. Previously  in  DLAI…   •  Mul.layer  perceptron   •  Training:  (stochas.c  /  mini-­‐batch)  gradient  descent   •  Backpropaga.on   •  Loss  func.on     but…   What  type  of  op.miza.on  problem?   Do  local  minima  and  saddle  points  cause  problems?   Does  gradient  descent  perform  well?   How  to  set  the  learning  rate?   How  to  ini.alize  weights?   How  does  batch  size  affect  training?         2  
  • 3. Index   •  Op6miza6on  for  a  machine  learning  task;  difference  between  learning  and  pure  op6miza6on   •  Expected  and  empirical  risk   •  Surrogate  loss  func.ons  and  early  stopping   •  Batch  and  mini-­‐batch  algorithms   •  Challenges  for  deep  models   •  Local  minima     •  Saddle  points  and  other  flat  regions   •  Cliffs  and  exploding  gradients   •  Prac6cal  algorithms   •  Stochas.c  Gradient  Descent   •  Momentum   •  Nesterov  Momentum   •  Learning  rate   •  Adap.ve  learning  rates:  adaGrad,  RMSProp,  Adam   •  Parameter  ini6aliza6on   •  Batch  Normaliza6on   3  
  • 4. Differences  between  learning  and  pure   op6miza6on  
  • 5. Op6miza6on  for  NN  training   •  Goal:  Find  the  parameters  that  minimize  the  expected  risk  (generaliza.on  error)   •  x  input,                                    predicted  output,  y  target  output,  E  expecta.on   •  pdata  true  (unknown)  data  distribu.on,  L    loss  func6on  (how  wrong  predic6ons  are)   •  But  we  only  have  a  training  set  of  samples:  we  minimize  the  empirical  risk,  average   loss  on  a  finite  dataset  D   J(θ) = Ε(x,y)∼pdata L( fθ (x), y) fθ (x) J(θ) = Ε(x,y)∼ ˆpdata L( fθ (x), y) = 1 D L( fθ (x(i) ), y(i) ) (x(i) ,y(i) )∈D ∑ where                            is  the  empirical  distribu.on,  |D|  is  the  number  of  examples  in  D   5   ˆpdata
  • 6. Surrogate  loss   •  O]en  minimizing  the  real  loss  is  intractable  (can’t  be  used  with  gradient  descent)   •  e.g.  0-­‐1  loss  (0  if  correctly  classified,  1  if  it  is  not)                (intractable  even  for  linear  classifiers  (Marcobe  1992)       •  Minimize  a  surrogate  loss  instead   •  e.g.  for  the  0-­‐1  loss   hinge     square     logis.c   6   0-­‐1  loss  (blue)  and  surrogate  losses     (green:  square,  purple:  hinge,  yellow:  logis.c)     L( f (x) , y) = I( f (x)≠y) L( f (x), y) = max(0,1− yf (x)) L( f (x), y) = (1− yf (x))2 L( f (x), y) = log(1+ e− yf (x) )
  • 7. Surrogate  loss  func6ons   7   Probabilistic classifier Outputs  probability  of  class  1   f(x) ≈ P(y=1 | x) Probability for class 0 is 1-f(x) Binary cross-entropy loss: L(f(x),y) = -(y log(f(x)) + (1-y) log(1-f(x)) Decision  func.on: F(x) = If(x)>0.5 Outputs  a  vector  of  probabili.es:   f(x) ≈ ( P(y=0|x), ..., P(y=m-1|x) ) Negative conditional log likelihood loss L(f(x),y) = -log f(x)y Decision  func.on:  F(x) = argmax(f(x)) Non- Hinge  loss:  probabilistic classifier Outputs a «score» f(x) for class 1. score for the other class is -f(x) L(f(x),t) = max(0, 1-t f(x)) where t=2y-1 Decision  func.on:    F(x) = If(x)>0 Outputs  a  vector  f(x) of  real-­‐valued   scores  for  the  m  classes. Mul.class  margin  loss   L(f(x),y) = max(0,1+max(f(x)k)-f(x)y ) k≠y Decision  func.on:    F(x) = argmax(f(x)) Binary classifier Multiclass classifier
  • 8. Early  stopping   •  Training  algorithms  usually  do  not  halt  at  a  local  minimum   •  Convergence  criterion  based  on  early  stopping:   •  based  on  surrogate  loss  or  true  underlying  loss  (ex  0-­‐1  loss)  measured  on  a  valida6on  set   •  #  training  steps  =  hyperparameter  controlling  the  effec.ve  capacity  of  the  model   •  simple,  effec.ve,  must  keep  a  copy  of  the  best  parameters   •  acts  as  a  regularizer  (Bishop  1995,…)   8   Training  error  decreases  steadily   Valida.on  error  begins  to  increase     Return  parameters  at  point  with   lowest  valida6on  error  
  • 9. Batch  and  mini-­‐batch  algorithms   •  Gradient  descent  at  each  itera.on  computes  gradients  over  the  en.re  dataset  for  one  update     •  ↑  Gradients  are  stable   •  ↓  Using  the  complete  training  set  can  be  very  expensive     •  the  gain  of  using  more  samples  is  less  than  linear:     •  standard  error  of  the  mean  es.mated  from  m  samples  is                                          (σ  is  true  std)     •  ↓  Training  set  may  be  redundant   •  Use  a  subset  of  the  training  set   Loop:   1.  sample  a  subset  of  data   2.  forward  prop  through  the  network   3.  backprop  to  calculate  gradients   4.  update  parameters  using  gradients   9   ∇θ J(θ) = 1 m ∇θ L( fθ (x(i) ), y(i) )i∑ SE = σ m Minibatch  gradient  descent  
  • 10. Batch  and  mini-­‐batch  algorithms   •  How  many  samples  in  each  update  step?     •  Determinis.c  or  batch  gradient  methods:  process  all  training  samples  in  a  large  batch   •  Mini-­‐batch  stochas.c  methods:  use  several  (not  all)  samples     •  Stochas.c  methods:  use  a  single  example  at  a  .me   •  online  methods:  samples  are  drawn  from  a  stream  of  con.nually  created  samples   10  batch  vs  minibatch  gradient  descent  
  • 11. Batch  and  mini-­‐batch  algorithms   Mini-­‐batch  size?   •  Larger  batches:  more  accurate  es.mate  of  the  gradient  but  less  than  linear  return     •  Very  small  batches:  Mul.core  architectures  under-­‐u.lized   •  Smaller  batches  provide  noisier  gradient  es.mates   •  Small  batches  may  offer  a  regularizing  effect    (add  noise)   •  but  may  require  small  learning  rate   •  may  increase  number  of  steps  for  convergence     •  If  small  training  set,  use  batch  gradient  descent   •  If  large  training  set,  use  mini  batches   •  Minbatches  should  be  selected  randomly  (shuffle  samples)   •  unbiased  es.mate  of  gradients   •  Typical  mini-­‐batch  size:  32,  64,  128,  256     •  (2p,  make  sure  mini-­‐batch  fits  in  CPU-­‐GPU  memory)   11  
  • 12. Challenges  in  deep  NN  op6miza6on  
  • 13. Convex  /  Non-­‐convex  op6miza6on   A  func.on                                                  defined  on  an  n-­‐dim  interval  is  convex  if  for  any       13   f : X → ! f (λx + (1− λ)x') ≤ λ f (x) + (1− λ) f (x') x,x' ∈X λ ∈[0,1] f (λx + (1− λ)x') λ f (x) + (1− λ) f (x')
  • 14. Convex  /  Non-­‐convex  op6miza6on   •  Convex  op.miza.on   •  any  local  minimum  is  a  global  minimum   •  there  are  several  opt.  algorithms  (polynomial-­‐.me)     •  Non-­‐convex  op.miza.on   •  objec6ve  func6on  in  deep  networks  is  non-­‐convex   •  deep  models  may  have  several  local  minima   •  but  this  is  not  necessarily  a  major  problem!   14  
  • 15. Local  minima  and  saddle  points   •  Cri6cal  points:     •  For  high  dimensional  loss  func.ons,  local  minima  are  rare  compared  to  saddle  points   •  Hessian  matrix:      real,  symmetric    eigenvector/eigenvalue  decomposi.on     •  Intui.on:  eigenvalues  of  the  Hessian  matrix     •  local  minimum/maximum:  all  posi.ve  /  all  nega.ve  eigenvalues:  exponen.ally  unlikely  as  n  grows   •  saddle  points:  both  posi.ve  and  nega.ve  eigenvalues   15  Dauphin  et  al.  Iden.fying  and  abacking  the  saddle  point  problem  in  high-­‐dimensional  non-­‐convex  op.miza.on.  NIPS  2014     Hij = ∂2 f ∂xi ∂xj f :!n → ! ∇x f (x) = 0
  • 16. Local  minima  and  saddle  points   •  It  is  believed  that  for  many  problems   including  learning  deep  nets,  almost  all  local   minimum  have  very  similar  func.on  value  to   the  global  op.mum   •  Finding  a  local  minimum  is  good  enough   16   Value  of  local  minima  found  by  running  SGD  for  200   itera.ons  on  a  simplified  version  of  MNIST  from  different   ini.al  star.ng  points.  As  number  of  parameters  increases,   local  minima  tend  to  cluster  more  .ghtly.   •  For  many  random  func.ons  local  minima  are  more  likely  to  have  low  cost  than  high   cost.   Choromanska  et  al.  The  loss  surfaces  of  mul.layer  networks,  AISTATS  2015  
  • 17. Saddle  points   How  to  escape  from  saddle  points?   •  First  order  methods   •  ini.ally  abracted  to  saddle  points,  but  unless   exact  hit,  it  will  be  repelled  when  close   •  hitng  cri.cal  point  exactly  is  unlikely  (es.mated   gradient  is  noisy)   •  saddle  points  are  very  unstable:  noise  (stochas.c   gradient  descent)  helps  convergence,  trajectory   escapes  quickly   •  Second  order  moments:   •  Netwon’s  method  can  jump  to  saddle  points   (where  gradient  is  0)   17  S.  Credit:  K.McGuinness   SGD  tends  to  oscillate  between  slowly  approaching   a  saddle  point  and  quickly  escaping  from  it  
  • 18. Other  difficul6es   •  Cliffs  and  exploding  gradients   •  Nets  with  many  layers  /  recurrent  nets  can  contain  very  steep  regions  (cliffs):   gradient  descent  can  move  parameters  too  far,  jumping  off  of  the  cliff.  (solu.ons:   gradient  clipping)   •  Long  term  dependencies   •  computa.onal  graph  becomes  very  deep  (deep  nets  /  recurrent  nets):  vanishing   and  exploding  gradients   18   cost  func.on  of  highly   non  linear  deep  nets   or  recurrent  net   (Pascanu2013)  
  • 20. Mini-­‐batch  Gradient  Descent   •  Most  used  algorithm  for  deep  learning   Algorithm   •  Require:  ini.al  parameter  θ,  learning  rate  α,     •  while  stopping  criterion  not  met  do   •  sample  a  minibatch  of  m  examples  from  the  training  set                                          with   corresponding  targets     •  compute  gradient  es.mate   •  apply  update     •  end  while   20   {x(i) }i=1...m {y(i) }i=1...m g ← + 1 m ∇θ L( fθ (x(i) ), y(i) )i∑ θ ←θ −αg
  • 21. Problems  with  GD   •  GD  can  be  very  slow.     •  Can  get  stuck  in  local  minima  or  saddle  points   •  If  the  loss  changes  quickly  in  one  direc.on  and  slowly  in  another,  GD  makes  slow   progress  along  shallow  dimension,  jibers  along  steep  direc.on     21   Loss  func.on  has  a  high  condi6on  number  (5):  ra.o  of   largest  to  smallest  singular  value  of  Hessian  matrix  is  large  
  • 22. Momentum     •  Momentum  is  designed  to  accelerate  learning,  especially  for  high  curvature,  small  but   consistent  gradients  or  noisy  gradients   •  New  variable  velocity  v  (direc.on  and  speed  at  which  parameters  move)   •  exponen.ally  decaying  average  of  nega.ve  gradient     Algorithm   •  Require:  ini.al  parameter  θ,  learning  rate  α,    momentum  parameter  λ    ,  ini6al  velocity  v •  Update  rule:  (g  is  gradient  es.mate)   •  compute  velocity  update   •  apply  update         •  Typical  values  v0=0,    λ=0.5,  0.9,0.99      (in  [0,1})   •  Read  physical  analogy  in  Deep  Learning  book  (Goodfellow  et  al):  velocity  =  momentum  of  unit  mass  par.cle   22   θ ←θ + v v ← λv −αg
  • 23. Nesterov  accelerated  gradient  (NAG)   •  A  variant  of  momentum,  where  gradient  is  evaluated  a]er  current  velocity  is  applied:   •  Approximate  where  the  parameters  will  be  on  the  next  .me  step  using  current  velocity   •  Update  velocity  using  gradient  where  we  predict  parameters  will  be   Algorithm   •  Require:  ini.al  parameter  θ,  learning  rate  α,  momentum  parameter  λ    ,  ini.al  velocity  v •  Update:   •  apply  interim  update   •  compute  gradient  (at  interim  point)   •  compute  velocity  update   •  apply  update     •  Interpreta.on:  add  a  correc.on  factor  to  momentum   23   g ← + 1 m ∇!θ L!θ ( f (x(i) ), y(i) )i∑ θ ←θ + v v ← λv −αg !θ ←θ + λv interim    
  • 24. Nesterov  accelerated  gradient  (NAG)   24   current  loca.on  wt vt ∇L(wt) vt+1 S.  Credit:  K.  McGuinness   predicted  loca.on  based  on  velocity  alone  wt + 𝛾v ∇L(wt + 𝛾vt) vt vt+1
  • 25. GD:  learning  rate   •  Learning  rate  is  a  crucial  parameter  for  GD   •  Too  large:  overshoots  local  minimum,  loss  increases   •  Too  small:  makes  very  slow  progress,  can  get  stuck   •  Good  learning  rate:  makes  steady  progress  toward  local  minimum   25   too  small   too  large  
  • 26. GD:  learning  rate  decay   •  In  prac.ce  it  is  necessary  to  gradually  decrease  learning  rate  to  speed  up  the  training   •  step  decay  (e.g.  reduce  by  half  every  few  epochs)   •  exponen6al  decay   •  1/t  decay     •  manual  decay   •  Sufficient  condi.ons  for  convergence:   •  Usually:  adapt  learning  rate  by  monitoring  learning  curves  that  plot  the  objec.ve   func.on  as  a  func.on  of  .me  (more  of  an  art  than  a  science!)   26   αt = ∞ t=1 ∞ ∑ αt 2 < ∞ t=1 ∞ ∑ α = α0 e−kt α = α0 1+ kt k decay rate t iteration number α0 initial learning rate
  • 27. Adap6ve  learning  rates   •  Cost  if  o]en  sensi.ve  to  some  direc.ons  and  insensi.ve  to  others   •  Momentum/Nesterov  mi.gate  this  issue  but  introduce  another  hyperparameter   •  Solu6on:  Use  a  separate  learning  rate  for  each  parameter  and  automa6cally  adapt  it   through  the  course  of  learning     •  Algorithms  (mini-­‐batch  based)   •  AdaGrad   •  RMSProp   •  Adam       27  
  • 28. AdaGrad   •  Adapts  the  learning  rate  of  each  parameter  based  on  sizes  of  previous  updates:     •  scales  updates  to  be  larger  for  parameters  that  are  updated  less   •  scales  updates  to  be  smaller  for  parameters  that  are  updated  more     •  The  net  effect  is  greater  progress  in  the  more  gently  sloped  direc.ons  of  parameter  space     •  Require:  ini.al  parameter  θ,  learning  rate  α,  small  constant  δ  (e.g.  10-­‐7)  for  numerical  stability •  Update:   •  accumulate  squared  gradient   •  compute  update   •  apply  update     28   θ ←θ + Δθ Δθ ← − α δ + r ⊙ g r ← r + g ⊙ g sum  of    all  previous  squared  gradients     updates  inversely  propor.onal  to  the   square  root  of  the  sum   (elementwise  mul.plica.on)   Duchi  et  al.  Adap.ve  Subgradient  Methods  for  Online  Learning  and  Stochas.c  Op.miza.on.  JMRL  2011  
  • 29. Root  Mean  Square  Propaga6on  (RMSProp)   •  AdaGrad  can  result  in  a  premature  and  excessive  decrease  in  effec6ve  learning  rate   •  RMSProp  modifies  AdaGrad  to  perform  beber  in  non-­‐convex  surfaces   •  Changes  gradient  accumula.on  by  an  exponen6ally  decaying  average  of  sum  of   squares  of  gradients     •  Requires:  ini.al  parameter  θ,  learning  rate  α,  decay  rate  ρ,  small  constant  δ  (e.g.  10-­‐7)   •  Update:   •  accumulate  squared  gradient   •  compute  update   •  apply  update     29   θ ←θ + Δθ Δθ ← − α δ + r ⊙ g r ← ρr + (1− ρ)g ⊙ g Geoff  Hinton,  Unpublished  
  • 30. ADAp6ve  Moments  (Adam)   •  Combina.on  of  RMSProp  and  momentum,  but:   •  Keep  decaying  average  of  both  first-­‐order  moment  of  gradient  (momentum)  and  second-­‐ order  moment  (RMSProp)   •  Includes  bias  correc.ons  (first  and  second  moments)  to  account  for  their  ini.aliza.on  at   origin   Update:   •  updated  biased  first  moment  es6mate   •  update  biased  second  moment   •  correct  biases   •  compute  update                                                                                              (opera.ons  applied  elementwise)   •  apply  update   30   θ ←θ + Δθ Δθ ← −α ˆs δ + ˆr s ← ρ1 s + (1− ρ1 )g r ← ρ2 r + (1− ρ2 )g ⊙ g ˆs ← s 1− ρ1 ˆr ← r 1− ρ2 Kingma  et  al.  Adam:  a  Method  for  Stochas.c  Op.miza.on.  ICLR  2015    δ=10-­‐8,  ρ1=0.9,  ρ2=0.999  
  • 31. Example:  test  func6on   31   Image  credit:  Alec  Radford.   Beale’s  func.on  
  • 32. Example:  saddle  point   32   Image  credit:  Alec  Radford.  
  • 34. Parameter  ini6aliza6on   •  Weights   •  Can’t  ini.alize  weights  to  0    (gradients  would  be  0)   •  Can’t  ini.alize  all  weights  to  the  same  value  (all  hidden  units  in  a  layer  will  always   behave  the  same;  need  to  break  symmetry)   •  Small  random  number,  e.g.,  uniform  or  gaussian  distribu.on     •  if  weights  start  too  small,  the  signal  shrinks  as  it  passes  through  each  layer  un.l  it  is  too  .ny   to  be  useful   •  Xavier  ini.aliza.on  (calibra.ng  variances,  for  tanh  ac.va.ons)  sqrt(1/n)   •  each  neuron:  w  =  randn(n)  /  sqrt(n)  ,  n  inputs   •  He  ini.aliza.on  (for  ReLu  ac.va.ons)  sqrt(2/n)   •  each  neuron  w  =  randn(n)  *  sqrt(2.0  /n)  ,  n  inputs   •  Biases   •  ini.alize  all  to  0  (except  for  output  unit  for  skewed  distribu.ons,  0.01  to  avoid  satura.ng  RELU)   •  Alterna6ve:  Ini.alize  using  machine  learning;  parameters  learned  by  unsupervised  model   trained  on  the  same  inputs  /  trained  on  unrelated  task   34  
  • 35. Normalizing  inputs   •  Normalizing  inputs  to  speed  up  learning   •  For  input  layers:  data  preprocessing  mean  =  1,  std=1       •  For  hidden  layers:  batch  normaliza.on   35   original  data     mean=0   mean  =0,  std=1     Loss  for  unnormalized  data     Loss  for  normalized  data  
  • 36. Batch  normaliza6on   •  As  learning  progresses,  the  distribu.on  of  the  layer  inputs  changes  due   to  parameter  updates  (  internal  covariate  shi])     •  This  can  result  in  most  inputs  being  in  the  non-­‐linear  regime  of      the  ac.va.on  func.on,  slowing  down  learning     •  Bach  normaliza.on  is  a  technique  to  reduce  this  effect   •  Explicitly  force  the  layer  ac.va.ons  to  have  zero  mean  and  unit   variance  w.r.t  running  batch  es.mates     •  Adds  a  learnable  scale  and  bias  term  to  allow  the  network  to  s.ll   use  the  nonlinearity   36    Ioffe  and  Szegedy,  2015.  “Batch  normaliza.on:  accelera.ng  deep  network  training  by  reducing  internal  covariate  shi]”   FC  /  Conv   Batch  norm   ReLu   FC  /  Conv   Batch  norm   ReLu  
  • 37. Batch  normaliza6on   •  Can  be  applied  to  any  input  or  hidden  layer   •  For  a  mini-­‐batch  of  m  ac.va.ons  of  the  layer   1.  Compute  empirical  mean  and  variance  for  each  dimension  D   2.  Normalize   3.  Scale  and  shi]          (two  learnable  parameters  )     37   ˆxi = xi − µB σ B 2 + ε m D x yi = γ ˆxi + β B = xi{ }i=1....m µB = 1 m xi i=1 m ∑ σ B 2 = 1 m (xi − µB )2 i=1 m ∑ Note:  normaliza.on  can  reduce  the  expressive  power  of  the  network  (e.g.  normalize  inputs  of  a   sigmoid  would  constrain  them  to  the  linear  regime   To  recover  the  iden.ty  mapping.  The  network  can  lean   Then     β = µBγ = σ B 2 + ε ˆyi = xi
  • 38. Batch  normaliza6on   Each  mini-­‐batch  is  scaled  by  the  mean/variance  computed  on  just  that  mini-­‐batch.   This  adds  some  noise  to  the  hidden  layer’s  ac.va.ons  within  that  minibatch,  having  a   slight  regulariza.on  effect:     •  Improves  gradient  flow  through  the  network   •  Allows  higher  learning  rates   •  Reduces  the  strong  dependency  on  ini.aliza.on   •  Reduces  the  need  of  regulariza.on   At  test  .me  BN  layers  func.on  differently:   •  Mean  and  std  are  not  computed  on  the  batch.   •  Instead,  a  single  fixed  empirical  mean  and  std  of  ac.va.ons  computed  during  training  is   used  (can  be  es.mated  with  exponen.ally  decaying  weighted  averages)   38  
  • 39. Summary   39   •  Op.miza.on  for  NN  is  different  from  pure  op.miza.on:   •  GD  with  mini-­‐batches   •  early  stopping   •  non-­‐convex  surface,  saddle  points   •  Learning  rate  has  a  significant  impact  on  model  performance   •  Several  extensions  to  GD  can  improve  convergence   •  Adap.ve  learning-­‐rate  methods  are  likely  to  achieve  best  results   •  RMSProp,  Adam     •  Weight  ini.aliza.on:  He        w=  randn(n)/  sqrt(2/n)     •  Batch  normaliza.on  to  reduce  the  internal  covariance  shi]  
  • 40. Bibliograpy   •  Goodfellow,  I.,  Bengio,  Y.,  and  A.,  C.  (2016),  Deep  Learning,  MIT  Press.   •  Choromanska,  A.,  Henaff,  M.,  Mathieu,  M.,  Arous,  G.  B.,  and  LeCun,  Y.  (2015),  The  loss  surfaces  of   mul.layer  networks.  In  AISTATS.   •  Dauphin,  Y.  N.,  Pascanu,  R.,  Gulcehre,  C.,  Cho,  K.,  Ganguli,  S.,  and  Bengio,  Y.  (2014).  Iden.fying  and   abacking  the  saddle  point  problem  in  high-­‐dimensional  non-­‐convex  op.miza.on.  In  Advances  in   Neural  Informa.on  Processing.  Systems,  pages  2933–2941.   •  Duchi,  J.,  Hazan,  E.,  and  Singer,  Y.  (2011).  Adap.ve  subgradient  methods  for  online  learning  and   stochas.c  op.miza.on.  Journal  of  Machine  Learning  Research,  12(Jul):2121–2159.   •  Goodfellow,  I.  J.,  Vinyals,  O.,  and  Saxe,  A.  M.  (2015).  Qualita.vely  characterizing  neural  network   op.miza.on  problems.  In  Interna.onal  Conference  on  Learning  Representa.ons.   •  Hinton,  G.  (2012).  Neural  networks  for  machine  learning.  Coursera,  video  lectures   •  Jacobs,  R.  A.  (1988).  Increased  rates  of  convergence  through  learning  rate  adapta.on.  Neural   networks,  1(4):295–307.   •  Kingma,  D.  and  Ba,  J.  (2014)-­‐  Adam:  A  method  for  stochas.c  op.miza.on.  arXiv  preprint  arXiv: 1412.6980.   •  Saxe,  A.  M.,  McClelland,  J.  L.,  and  Ganguli,  S.  (2013).  Exact  solu.ons  to  the  nonlinear  dynamics  of   learning  in  deep  linear  neural  networks.  In  Interna.onal  Conference  on  Learning  Representa.ons  40