3. The Learning Problem
Given a set of instances S : (X, Y) drawn i.i.d. from some
distribution, predict the underlying unknown distribution D.
4. One Approach to Learning
We define a loss function L on some hypothesis h ∈ H (the
hypothesis set) and aim to minimize the loss across the sample space:

\arg\min_{\theta} \; L(Y, h(X, \theta))  (1)

where θ is the set of parameters of the hypothesis.
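As a minimal sketch of these ingredients, assume a linear hypothesis and a squared-error loss (both are illustrative choices, not fixed by the setup above):

import numpy as np

def h(X, theta):
    # A simple linear hypothesis: h(X, theta) = X @ theta
    return X @ theta

def loss(Y, Y_pred):
    # Squared-error loss L(Y, h(X, theta))
    return np.mean((Y - Y_pred) ** 2)

# Toy sample S: 100 instances with 3 features, noisy linear targets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)          # parameters of the hypothesis
print(loss(Y, h(X, theta)))  # the quantity we want to argmin over theta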
5. Minimizing the Loss
Differentiation is our tool! Compute the solution to
\nabla_{\theta} L = \frac{dL}{d\theta} = 0  (2)
And, we have solved the learning problem. But have we?
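As a worked special case, for linear regression with h(X, θ) = Xθ and squared-error loss, setting the gradient to zero does give a closed-form answer, the classic Normal Equation mentioned again under Symbolic Differentiation:

L(\theta) = \lVert Y - X\theta \rVert^{2}, \qquad \nabla_{\theta} L = -2 X^{\top}(Y - X\theta) = 0 \;\Rightarrow\; \theta^{*} = (X^{\top} X)^{-1} X^{\top} Y

For most hypotheses of interest (e.g. deep networks) no such closed form exists, so we minimize iteratively with gradient-based methods, and computing derivatives efficiently becomes the real problem.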
9. Numerical Differentiation
Method of finite differences, derived from the first-order
approximation of the Taylor series (higher-order methods exist as
well) [BF89]; a minimal code sketch follows the pros and cons below.
\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}  (6)

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h}  (7)
Pros
• Fair approximations
Cons
• Ill-conditioned and unstable
• Truncation and round-off errors
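A minimal sketch of the central-difference formula (7), assuming a scalar function f and a hand-picked step size h:

def central_difference(f, x, h=1e-5):
    # Approximates df/dx as (f(x + h) - f(x - h)) / (2h); the choice of h
    # trades truncation error (h too large) against round-off error (h too small).
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3
print(central_difference(f, 2.0))  # ~12.0; the exact derivative 3 * x**2 is 12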
10. Symbolic Differentiation
Compute the actual derivative expression from a repository of basic
rules, like the sum rule or the product rule, applied to expressions
represented as concrete data structures.
Used in computer algebra systems like Mathematica and in Theano. A
deterministic and mechanistic process, just like how one would code!
(A minimal sketch follows the pros and cons below.)
Pros
• Insight into the structure of the problem
• Build analytical solutions (e.g. the classic Normal Equation for Linear Regression)
Cons
• Expression swell
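A minimal sketch using SymPy as a stand-in for the systems above: the expression is a concrete data structure, and differentiation mechanically applies the product rule.

import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) * x ** 2   # expression held as a symbolic data structure
print(sp.diff(expr, x))     # x**2*cos(x) + 2*x*sin(x), via the product rule

For larger compositions the returned expression can grow rapidly, which is the expression swell listed above.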
11. Automatic Differentiation
Problem: Calculate the sensitivity of the output w.r.t. the input (the Jacobian)
Observations
1. We need the exact derivatives, not approximations
2. We don’t really need the symbolic form
Solution: The chain rule (just applied the smart way!)
13. Computational Graphs
Represents the flow of values across a non-trivial computation [Bau74].
At the core of modern computational libraries like PyTorch and
TensorFlow.
Consider each node as a special gate. Look familiar?
Figure 1: Computational Graph for Equation 3
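A minimal sketch of the idea, using a hypothetical toy function f(x, y) = (x + y) * x rather than the equation in Figure 1: each node is a gate that computes its local output and caches its inputs for a later backward pass.

class Add:
    def forward(self, a, b):
        self.a, self.b = a, b   # cache inputs for a later backward pass
        return a + b

class Mul:
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b

# f(x, y) = (x + y) * x as a graph of two gates
add_gate, mul_gate = Add(), Mul()
x, y = 2.0, 3.0
v1 = add_gate.forward(x, y)   # intermediate node v1 = x + y
f = mul_gate.forward(v1, x)   # output node f = v1 * x
print(f)                      # 10.0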
14. Forward Mode Differentiation
Computes the sensitivity of the output w.r.t. one input parameter.
Any hypothesis h : R^m → R would require m forward mode
differentiations to compute the sensitivity w.r.t. each input parameter.
Forward Primal Trace is the algebraic version of the computational
graph. Read top-down.
Forward Tangent Trace calculates ∂v_i/∂x for each intermediate
variable v_i. Read top-down.
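A minimal sketch of forward mode via dual numbers, reusing the toy f(x, y) = (x + y) * x from the previous slide: each value carries its primal v and its tangent dv = dv/dx, propagated top-down alongside the computation.

class Dual:
    def __init__(self, v, dv):
        self.v, self.dv = v, dv   # primal value and tangent d(value)/dx

    def __add__(self, other):
        return Dual(self.v + other.v, self.dv + other.dv)

    def __mul__(self, other):
        return Dual(self.v * other.v,
                    self.dv * other.v + self.v * other.dv)

# Seed x (the parameter we differentiate w.r.t.) with tangent 1, y with 0
x = Dual(2.0, 1.0)
y = Dual(3.0, 0.0)
f = (x + y) * x
print(f.v, f.dv)   # 10.0 7.0 -> df/dx of (x + y) * x is 2x + y = 7

Recovering ∂f/∂y would need a second pass with the seeds swapped, which is exactly the m-passes-for-m-inputs point above.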
16. Reverse Mode Differentiation
Computes the sensitivity of the output w.r.t. all input parameters.
Any hypothesis h : R^m → R would require ONE reverse mode
differentiation.
Also called Reverse Mode Accumulator.
\bar{v}_i = \frac{\partial f}{\partial v_i}  (adjoint of a variable)

Reverse Adjoint Trace calculates ∂f/∂v_i. Read bottom-up.
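A minimal sketch of reverse accumulation on the same toy f(x, y) = (x + y) * x: run the primal trace top-down, then push the adjoints back bottom-up, recovering ∂f/∂x and ∂f/∂y in a single backward pass.

# Forward (primal) trace for f(x, y) = (x + y) * x
x, y = 2.0, 3.0
v1 = x + y    # v1 = 5.0
f = v1 * x    # f  = 10.0

# Reverse (adjoint) trace: seed f_bar = df/df = 1, apply the chain rule per node
f_bar = 1.0
v1_bar = f_bar * x      # df/dv1 = x -> 2.0
x_bar = f_bar * v1      # direct use of x in f = v1 * x -> 5.0
x_bar += v1_bar * 1.0   # use of x through v1 = x + y   -> 5.0 + 2.0 = 7.0
y_bar = v1_bar * 1.0    # use of y through v1           -> 2.0
print(x_bar, y_bar)     # 7.0 2.0, i.e. 2x + y and x

Both input sensitivities come out of one backward sweep, which is the ONE-pass claim above.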
18. Reverse Mode in Practice
More commonly known as the Backpropagation algorithm.
For a generic hypothesis h : R^m → R^n, we need n reverse mode
differentiations versus m forward mode differentiations. Helpful
when n ≪ m.
For instance, the Dense Interpolated Embedding Model (DIEM)
[TGR15] proposed an architecture with ∼160B parameters and
output syntactic embeddings of size 1000+.
19. Reverse Mode in PyTorch i
import torch
from torch.autograd import Variable

def main():
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Random training inputs and targets
    x = Variable(torch.randn(N, D_in))
    y = Variable(torch.randn(N, D_out))

    # Two-layer network built from standard modules
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
20. Reverse Mode in PyTorch ii
    loss_fn = torch.nn.MSELoss(size_average=False)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

    for t in range(500):
        y_pred = model(x)          # forward pass builds the graph
        loss = loss_fn(y_pred, y)

        optimizer.zero_grad()      # clear previously accumulated gradients
        loss.backward()            # Reverse Mode!
        optimizer.step()           # SGD update using the populated .grad fields

if __name__ == '__main__':
    main()
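After loss.backward(), every parameter p in model.parameters() holds ∂loss/∂p in p.grad: reverse mode fills in all of them in a single backward pass. optimizer.step() then applies the SGD update, and zero_grad() is needed on each iteration because PyTorch accumulates gradients across backward() calls.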
21. References i
F. L. Bauer, Computational graphs and rounding error, SIAM
Journal on Numerical Analysis 11 (1974), no. 1, 87–96.
Richard L. Burden and J. Douglas Faires, Numerical analysis,
4th ed., PWS Publishing Co., Boston, MA, USA, 1989.
Atilim Gunes Baydin, Barak A. Pearlmutter, and
Alexey Andreyevich Radul, Automatic differentiation in
machine learning: a survey, CoRR abs/1502.05767 (2015).
Andreas Griewank and Andrea Walther, Evaluating derivatives:
Principles and techniques of algorithmic differentiation, second
ed., Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, 2008.
22. References ii
Andrew Trask, David Gilmore, and Matthew Russell, Modeling
order in neural word embeddings at scale, Proceedings of the
32nd International Conference on Machine Learning
(ICML-15) (David Blei and Francis Bach, eds.), JMLR
Workshop and Conference Proceedings, 2015, pp. 2266–2275.