3. The Learning Problem
Given a set of instances S : (X, Y) drawn i.i.d. from some
distribution, predict the underlying unknown distribution D.
4. One Approach to Learning
We define a loss function L on some hypothesis h ∈ H (the
hypothesis set) and aim to minimize the loss across the sample space:

\arg\min_{\theta} \; L(Y, h(X, \theta))  (1)

where θ is the set of parameters of the hypothesis.
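As a minimal sketch of these ingredients, assume a linear hypothesis and a squared-error loss (both are illustrative choices, not fixed by the setup above):

import numpy as np

def h(X, theta):
    # A simple linear hypothesis: h(X, theta) = X @ theta
    return X @ theta

def loss(Y, Y_pred):
    # Squared-error loss L(Y, h(X, theta))
    return np.mean((Y - Y_pred) ** 2)

# Toy sample S: 100 instances with 3 features, noisy linear targets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)          # parameters of the hypothesis
print(loss(Y, h(X, theta)))  # the quantity we want to argmin over theta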
5. Minimizing the Loss
Differentiation is our tool! Compute the solution to
\nabla_{\theta} L = \frac{dL}{d\theta} = 0  (2)
And, we have solved the learning problem. But have we?
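As a worked special case, for linear regression with h(X, θ) = Xθ and squared-error loss, setting the gradient to zero does give a closed-form answer, the classic Normal Equation mentioned again under Symbolic Differentiation:

L(\theta) = \lVert Y - X\theta \rVert^{2}, \qquad \nabla_{\theta} L = -2 X^{\top}(Y - X\theta) = 0 \;\Rightarrow\; \theta^{*} = (X^{\top} X)^{-1} X^{\top} Y

For most hypotheses of interest (e.g. deep networks) no such closed form exists, so we minimize iteratively with gradient-based methods, and computing derivatives efficiently becomes the real problem.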
9. Numerical Differentiation
Method of finite differences, derived from the first-order
approximation of the Taylor series (higher-order methods exist as
well) [BF89]; a minimal code sketch follows the pros and cons below.
\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}  (6)

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h}  (7)
Pros
• Fair approximations
Cons
• Ill-conditioned and unstable
• Truncation and round-off errors
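A minimal sketch of the central-difference formula (7), assuming a scalar function f and a hand-picked step size h:

def central_difference(f, x, h=1e-5):
    # Approximates df/dx as (f(x + h) - f(x - h)) / (2h); the choice of h
    # trades truncation error (h too large) against round-off error (h too small).
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3
print(central_difference(f, 2.0))  # ~12.0; the exact derivative 3 * x**2 is 12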
10. Symbolic Differentiation
Compute the actual derivative expression from a repository of basic
rules, like the sum rule or the product rule, applied to expressions
represented as concrete data structures.
Used in computer algebra systems like Mathematica and in Theano. A
deterministic and mechanistic process, just like how one would code!
(A minimal sketch follows the pros and cons below.)
Pros
• Insight into the structure of the problem
• Build analytical solutions (e.g. the classic Normal Equation for Linear Regression)
Cons
• Expression swell
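A minimal sketch using SymPy as a stand-in for the systems above: the expression is a concrete data structure, and differentiation mechanically applies the product rule.

import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) * x ** 2   # expression held as a symbolic data structure
print(sp.diff(expr, x))     # x**2*cos(x) + 2*x*sin(x), via the product rule

For larger compositions the returned expression can grow rapidly, which is the expression swell listed above.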
11. Automatic Differentiation
Problem: Calculate the sensitivity of the output w.r.t. the input (the Jacobian)
Observations
1. We need the exact derivatives, not approximations
2. We don’t really need the symbolic form
Solution: The chain rule (just applied the smart way!)
13. Computational Graphs
Represents the flow of values across a non-trivial computation [Bau74].
At the core of modern computational libraries like PyTorch and
TensorFlow.
Consider each node as a special gate. Look familiar?
Figure 1: Computational Graph for Equation 3
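A minimal sketch of the idea, using a hypothetical toy function f(x, y) = (x + y) * x rather than the equation in Figure 1: each node is a gate that computes its local output and caches its inputs for a later backward pass.

class Add:
    def forward(self, a, b):
        self.a, self.b = a, b   # cache inputs for a later backward pass
        return a + b

class Mul:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b

# f(x, y) = (x + y) * x as a graph of two gates
add_gate, mul_gate = Add(), Mul()
x, y = 2.0, 3.0
v1 = add_gate.forward(x, y)   # intermediate node v1 = x + y
f = mul_gate.forward(v1, x)   # output node f = v1 * x
print(f)                      # 10.0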
14. Forward Mode Differentiation
Computes the sensitivity of the output w.r.t. one input parameter.
Any hypothesis h : R^m → R would require m forward mode
differentiations to compute the sensitivity w.r.t. each input parameter.
Forward Primal Trace is the algebraic version of the computational
graph. Read top-down.
Forward Tangent Trace calculates ∂v_i/∂x for each intermediate
variable v_i. Read top-down.
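A minimal sketch of forward mode via dual numbers, reusing the toy f(x, y) = (x + y) * x from the previous slide: each value carries its primal v and its tangent dv = dv/dx, propagated top-down alongside the computation.

class Dual:
    def __init__(self, v, dv):
        self.v, self.dv = v, dv   # primal value and tangent d(value)/dx

    def __add__(self, other):
        return Dual(self.v + other.v, self.dv + other.dv)

    def __mul__(self, other):
        return Dual(self.v * other.v,
                    self.dv * other.v + self.v * other.dv)

# Seed x (the parameter we differentiate w.r.t.) with tangent 1, y with 0
x = Dual(2.0, 1.0)
y = Dual(3.0, 0.0)
f = (x + y) * x
print(f.v, f.dv)   # 10.0 7.0 -> df/dx of (x + y) * x is 2x + y = 7

Recovering ∂f/∂y would need a second pass with the seeds swapped, which is exactly the m-passes-for-m-inputs point above.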
16. Reverse Mode Differentiation
Computes the sensitivity of the output w.r.t. all input parameters.
Any hypothesis h : R^m → R would require ONE reverse mode
differentiation.
Also called Reverse Mode Accumulator.
\bar{v}_i = \frac{\partial f}{\partial v_i}  (adjoint of a variable)

Reverse Adjoint Trace calculates ∂f/∂v_i. Read bottom-up.
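A minimal sketch of reverse accumulation on the same toy f(x, y) = (x + y) * x: run the primal trace top-down, then push the adjoints back bottom-up, recovering ∂f/∂x and ∂f/∂y in a single backward pass.

# Forward (primal) trace for f(x, y) = (x + y) * x
x, y = 2.0, 3.0
v1 = x + y    # v1 = 5.0
f = v1 * x    # f  = 10.0

# Reverse (adjoint) trace: seed f_bar = df/df = 1, apply the chain rule per node
f_bar = 1.0
v1_bar = f_bar * x      # df/dv1 = x -> 2.0
x_bar = f_bar * v1      # direct use of x in f = v1 * x -> 5.0
x_bar += v1_bar * 1.0   # use of x through v1 = x + y   -> 5.0 + 2.0 = 7.0
y_bar = v1_bar * 1.0    # use of y through v1           -> 2.0
print(x_bar, y_bar)     # 7.0 2.0, i.e. 2x + y and x

Both input sensitivities come out of one backward sweep, which is the ONE-pass claim above.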
18. Reverse Mode in Practice
More commonly known as the Backpropagation algorithm.
For a generic hypothesis h : R^m → R^n, we need n reverse mode
differentiations versus m forward mode differentiations. Helpful
when n ≪ m.
For instance, the Dense Interpolated Embedding Model (DIEM)
[TGR15] proposed an architecture with ∼160B parameters and
output syntactic embeddings of size 1000+.
19. Reverse Mode in PyTorch i
import torch
from torch.autograd import Variable

def main():
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Random training inputs and targets
    x = Variable(torch.randn(N, D_in))
    y = Variable(torch.randn(N, D_out))

    # Two-layer network built from standard modules
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
20. Reverse Mode in PyTorch ii
    loss_fn = torch.nn.MSELoss(size_average=False)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

    for t in range(500):
        y_pred = model(x)          # forward pass builds the graph
        loss = loss_fn(y_pred, y)

        optimizer.zero_grad()      # clear previously accumulated gradients
        loss.backward()            # Reverse Mode!
        optimizer.step()           # SGD update using the populated .grad fields

if __name__ == '__main__':
    main()
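After loss.backward(), every parameter p in model.parameters() holds ∂loss/∂p in p.grad: reverse mode fills in all of them in a single backward pass. optimizer.step() then applies the SGD update, and zero_grad() is needed on each iteration because PyTorch accumulates gradients across backward() calls.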
21. References i
F. L. Bauer, Computational graphs and rounding error, SIAM
Journal on Numerical Analysis 11 (1974), no. 1, 87–96.
Richard L. Burden and J. Douglas Faires, Numerical analysis,
4th ed., PWS Publishing Co., Boston, MA, USA, 1989.
Atilim Gunes Baydin, Barak A. Pearlmutter, and
Alexey Andreyevich Radul, Automatic differentiation in
machine learning: a survey, CoRR abs/1502.05767 (2015).
Andreas Griewank and Andrea Walther, Evaluating derivatives:
Principles and techniques of algorithmic differentiation, second
ed., Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 2008.
22. References ii
Andrew Trask, David Gilmore, and Matthew Russell, Modeling
order in neural word embeddings at scale, Proceedings of the
32nd International Conference on Machine Learning
(ICML-15) (David Blei and Francis Bach, eds.), JMLR
Workshop and Conference Proceedings, 2015, pp. 2266–2275.