calculation | consulting
why deep learning works:
self-regularization in deep neural networks
(TM)
c|c
(TM)
charles@calculationconsulting.com
calculation|consulting
UC Berkeley / NERSC 2018
why deep learning works:
self-regularization in deep neural networks
(TM)
charles@calculationconsulting.com
calculation | consulting why deep learning works
Who Are We?
c|c
(TM)
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
(TM)
3
calculation | consulting why deep learning works
c|c
(TM)
(TM)
4
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Who Are We?
c|c
(TM)
Motivations: towards a Theory of Deep Learning
(TM)
5
calculation | consulting why deep learning works
NNs as spin glasses
LeCun et al. 2015
Looks exactly like the old protein folding results (late 1990s)
Energy Landscape Theory
broad questions about Why Deep Learning Works ?
MMDS talk 2016, Blog post 2015
a completely different picture of DNNs
c|c
(TM)
Motivations: towards a Theory of Deep Learning
(TM)
6
calculation | consulting why deep learning works
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
c|c
(TM)
Set up: the Energy Landscape
(TM)
7
calculation | consulting why deep learning works
minimize the Loss: but how to avoid overtraining ?
c|c
(TM)
Problem: How can this possibly work ?
(TM)
8
calculation | consulting why deep learning works
highly non-convex ? apparently not
[figure: expected (highly non-convex) vs. observed loss landscape]
it has long been suspected that local minima are not the issue
c|c
(TM)
Problem: Local Minima ?
(TM)
9
calculation | consulting why deep learning works
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
c|c
(TM)
Motivations: what is Regularization ?
(TM)
10
calculation | consulting why deep learning works
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout | Batch Size | Noisify Data
…
c|c
(TM)
(TM)
11
calculation | consulting why deep learning works
Understanding deep learning requires rethinking generalization
Problem: What is Regularization in DNNs ?
ICLR 2017 Best paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
c|c
(TM)
Motivations: what is Regularization ?
(TM)
12
calculation | consulting why deep learning works
Soften the rank of X, focus on large eigenvalues (λ)
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
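As a hedged aside (my addition, not from the deck): a minimal numpy sketch of the Tikhonov-Phillips solution in SVD form, where the penalty alpha damps small singular values, softening the rank so the large eigenvalues dominate.

import numpy as np

# Minimal sketch, assuming the generic problem: min ||X w - y||^2 + alpha ||w||^2
def ridge_solution(X, y, alpha):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    d = s / (s**2 + alpha)         # small singular values are damped out
    return Vt.T @ (d * (U.T @ y))  # large eigenvalues of X^T X dominate

X = np.random.randn(100, 20)
y = np.random.randn(100)
w = ridge_solution(X, y, alpha=1.0)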
c|c
(TM)
Motivations: how we study Regularization
(TM)
13
turn off regularization, turn it back on systematically, study W_L
and traditional regularization is applied to W_L
the Energy Landscape is determined by the layer weights W_L
c|c
(TM)
(TM)
14
calculation | consulting why deep learning works
Energy Landscape and Information flow
what happens to the layer weight matrices W_L ?
[figure labels: Information bottleneck | Entropy collapse | local minima | k = 1 saddle points | floor / ground state | k = 2 saddle points | axes: Information / Entropy]
c|c
(TM)
(TM)
15
calculation | consulting why deep learning works
Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: a 3-Layer MLP and a Mini AlexNet
And examined pre-trained models (AlexNet, Inception, …)
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
c|c
(TM)
(TM)
16
calculation | consulting why deep learning works
Matrix Complexity: Entropy and Stable Rank
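A short numpy sketch (my addition) of the two complexity measures named here, assuming stable rank = ||W||_F^2 / ||W||_2^2 and entropy taken as the Shannon entropy of the normalized singular-value spectrum:

import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2 : a soft, scale-invariant surrogate for rank
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s**2) / s[0]**2

def spectral_entropy(W):
    # Shannon entropy of the normalized singular-value spectrum
    s = np.linalg.svd(W, compute_uv=False)
    p = s**2 / np.sum(s**2)
    return -np.sum(p * np.log(p + 1e-12))

W = np.random.randn(500, 100)
print(stable_rank(W), spectral_entropy(W))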
c|c
(TM)
(TM)
17
calculation | consulting why deep learning works
Random Matrix Theory: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
…
X = np.dot(W, W.T)  # same nonzero eigenvalues as W.T W
evals = np.linalg.eigvalsh(X)  # X is symmetric; eigvalsh takes one matrix
plt.hist(evals, bins=100, density=True)  # plot the ESD
c|c
(TM)
(TM)
18
calculation | consulting why deep learning works
Random Matrix Theory: detailed insight into WL
Entropy decrease corresponds to breakdown of random structure
and the onset of a new kind of self-regularization
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
Random Matrix | Random + Spikes
c|c
(TM)
(TM)
19
calculation | consulting why deep learning works
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well defined edges (depends on Q, the aspect ratio)
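A sketch (my addition) of this limit, assuming the convention X = W^T W / N with W of shape N x M, aspect ratio Q = N/M >= 1, and element variance sigma^2; the MP density is overlaid on the ESD of a random Gaussian W:

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
Q = N / M                                # aspect ratio, Q >= 1
sigma2 = 1.0                             # element variance of W
W = np.random.randn(N, M) * np.sqrt(sigma2)
evals = np.linalg.eigvalsh(W.T @ W / N)  # ESD of X = W^T W / N

# Marchenko-Pastur density, with edges lambda_pm = sigma^2 (1 +/- 1/sqrt(Q))^2
lam_min = sigma2 * (1 - 1 / np.sqrt(Q))**2
lam_max = sigma2 * (1 + 1 / np.sqrt(Q))**2
lam = np.linspace(lam_min, lam_max, 400)
rho = Q * np.sqrt((lam_max - lam) * (lam - lam_min)) / (2 * np.pi * sigma2 * lam)

plt.hist(evals, bins=60, density=True)
plt.plot(lam, rho)
plt.show()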
c|c
(TM)
(TM)
20
calculation | consulting why deep learning works
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges
c|c
(TM)
(TM)
21
calculation | consulting why deep learning works
Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/
cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
c|c
(TM)
(TM)
22
calculation | consulting why deep learning works
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/
cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D MaxPool Conv2D MaxPool FC FC
c|c
(TM)
(TM)
23
calculation | consulting why deep learning works
Marchenko-Pastur Bulk + Spikes
Conv2D MaxPool Conv2D MaxPool FC FC
softrank = 10%
RMT: LeNet5
c|c
(TM)
(TM)
24
calculation | consulting why deep learning works
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
FC1
zoomed in
FC2
zoomed in
c|c
(TM)
(TM)
25
calculation | consulting why deep learning works
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
W226 (the weight matrix of layer 226)
c|c
(TM)
(TM)
26
calculation | consulting why deep learning works
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; do not lose hard rank
λ_min = 0 : (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0 : all smallest eigenvalues > 0, within the numerical (Numerical Recipes) threshold
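A hedged sketch (my addition) of this diagnostic: flag hard rank collapse when the smallest eigenvalue of X = W^T W is zero relative to the largest, up to a numerical-precision threshold.

import numpy as np

def has_hard_rank_collapse(W, rel_tol=1e-10):
    # Hard rank collapse: lambda_min of X = W^T W is zero to within a
    # numerical threshold, measured relative to lambda_max.
    evals = np.linalg.eigvalsh(W.T @ W)   # ascending order
    return evals[0] < rel_tol * evals[-1]

print(has_hard_rank_collapse(np.random.randn(300, 100)))  # False: full rank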
c|c
(TM)
(TM)
27
calculation | consulting why deep learning works
RMT: 5+1 Phases of Training
c|c
(TM)
(TM)
28
calculation | consulting why deep learning works
Bulk+Spikes: Small Models
Rank 1 perturbation Perturbative correction
Bulk
Spikes
Smaller, older models can be described perturbatively with RMT
c|c
(TM)
(TM)
29
calculation | consulting why deep learning works
Spikes: carry more information
Information begins to concentrate in the spikes
S(v)
spikes have less entropy, are more localized than bulk
c|c
(TM)
(TM)
30
calculation | consulting why deep learning works
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank; eigenvalues > λ (a simple scale threshold); spikes carry most information
c|c
(TM)
(TM)
31
calculation | consulting why deep learning works
Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
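A minimal sketch (my addition) of the modeling claim above: drawing W with heavy tailed entries (Pareto, with an exponent assumed purely for illustration) produces an ESD of X = W^T W that itself has a heavy tail, unlike the Gaussian MP bulk.

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
mu = 1.5  # assumed Pareto tail exponent, for illustration only
signs = np.random.choice([-1, 1], size=(N, M))
W_heavy = signs * np.random.pareto(mu, size=(N, M))
W_gauss = np.random.randn(N, M)

for W, label in [(W_heavy, "heavy tailed W"), (W_gauss, "Gaussian W")]:
    evals = np.linalg.eigvalsh(W.T @ W / N)
    plt.hist(np.log10(evals), bins=60, density=True, alpha=0.5, label=label)
plt.xlabel("log10 eigenvalue")
plt.legend()
plt.show()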
c|c
(TM)
(TM)
32
calculation | consulting why deep learning works
Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
c|c
(TM)
(TM)
33
calculation | consulting why deep learning works
Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap phenomenon
Large batch sizes => decreased generalization accuracy
Tuning the batch size from very large to very small
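A hedged Keras-style sketch (my addition; build_model, x_train, and y_train are placeholders, not from the deck) of this experiment: retrain the same small model across batch sizes and inspect a fully connected layer's ESD each time.

import numpy as np

# build_model() and (x_train, y_train) are assumed placeholders for a
# small model (e.g. a MiniAlexNet) and its training data.
for batch_size in [1024, 256, 64, 16, 4]:
    model = build_model()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=20, verbose=0)
    W = model.layers[-2].get_weights()[0]        # a fully connected layer
    evals = np.linalg.eigvalsh(np.dot(W.T, W))
    print(batch_size, evals.min(), evals.max())  # watch correlations grow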
c|c
(TM)
(TM)
34
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Random-Like | Random-Like | Bleeding-out
c|c
(TM)
(TM)
35
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
c|c
(TM)
(TM)
36
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes | Bulk+Spikes | Bulk-decay
c|c
(TM)
(TM)
37
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay | Bulk-decay | Heavy-tailed
c|c
(TM)
(TM)
38
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
How to apply RMT:
Q > 1, λ_min > 0 : Bulk+Spikes (plus Tracy-Widom fluctuations, very crisp edges)
Q > 1, λ_min = 0 : Bulk-decay ?
Q = 1 : Heavy-tailed ?
Large, well trained models approach heavy tailed self-regularization
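A rough heuristic sketch (my addition; the thresholds are assumptions, not the authors' published rule) of that decision flow, using the fraction of eigenvalues beyond an estimated MP bulk edge:

import numpy as np

def esd_phase_hint(W, rel_tol=1e-10):
    # Crude phase hint from the ESD of X = W^T W / N; heavy tails inflate
    # the variance estimate, so treat the answer as a hint only.
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    Q = N / M
    evals = np.linalg.eigvalsh(W.T @ W / N)
    if evals[0] < rel_tol * evals[-1]:
        return "rank collapse"
    lam_plus = W.var() * (1 + 1 / np.sqrt(Q))**2  # estimated MP bulk edge
    frac_out = np.mean(evals > lam_plus)
    if frac_out > 0.10:
        return "heavy-tailed / bulk-decay ?"
    if frac_out > 0:
        return "bulk + spikes"
    return "random-like"

print(esd_phase_hint(np.random.randn(1000, 250)))  # ~ "random-like"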
c|c
(TM)
(TM)
39
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
Large, well trained models approach heavy tailed self-regularization
InceptionV3 Layer 226 Q~1.3
Bulk-decay ? Heavy-tailed ?
best MP fits
bulk not captured; difficult to apply MP
c|c
(TM)
(TM)
40
calculation | consulting why deep learning works
Applying RMT: Heavy Tails ~ Q=1
Large, well trained models approach heavy tailed self-regularization
DenseNet201 typical layer w/ Q=1.92
Heavy-tailed, but seemingly
within MP eigenvalue bounds
MP fit is terrible near
eigenvalue minimum = 0
variance 1.83 is quite large
Resembles a Q = 1 fit
like a soft rank collapse
best MP fit
(Q fixed)
c|c
(TM)
(TM)
41
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
How to apply RMT
Large, well trained models approach heavy tailed self-regularization
Q = 1, λ_min = 0
The long tail takes the form of a very large variance
standard MP theory
assumes finite variance
c|c
(TM)
(TM)
42
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
Large, well trained models approach heavy tailed self-regularization
best MP fit
(Q fixed)
Heavy-tailed, but
not clean power law
no hard rank collapse
but close
λ_max ~ 30
InceptionV3 Layer 302 Q~2.048
c|c
(TM)
(TM)
43
calculation | consulting why deep learning works
Applying RMT: Should we float Q ?
Large, well trained models approach heavy tailed self-regularization
InceptionV3 Layer 302 Q~2.048
best MP fit
(Q =1)
Heavy-tailed; Q = 1 does not fit
λ_max ~ 30 (not shown)
~1.3
very large variances do not capture the bulk
c|c
(TM)
(TM)
44
calculation | consulting why deep learning works
Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
c|c
(TM)
(TM)
45
calculation | consulting why deep learning works
Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
c|c
(TM)
Energy Funnels: Minimizing Frustration
(TM)
46
calculation | consulting why deep learning works
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
c|c
(TM)
the Spin Glass of Minimal Frustration
(TM)
47
calculation | consulting why deep learning works
Conjectured in 2015 on my blog (15 minutes of fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low lying Energy state in Spin Glass ~ spikes in RMT
c|c
(TM)
RMT w/Heavy Tails: Energy Landscape ?
(TM)
48
calculation | consulting why deep learning works
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
Is the Landscape more funneled, with no 'problems' from local minima ?
(TM)
c|c
(TM)
c | c
charles@calculationconsulting.com