calculation | consulting
why deep learning works:
self-regularization in deep neural networks
(TM)
c|c
(TM)
charles@calculationconsulting.com
calculation|consulting
UC Berkeley / NERSC 2018
why deep learning works:
self-regularization in deep neural networks
(TM)
charles@calculationconsulting.com
calculation | consulting why deep learning works
Who Are We?
c|c
(TM)
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
(TM)
3
calculation | consulting why deep learning works
c|c
(TM)
(TM)
4
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Who Are We?
c|c
(TM)
Motivations: towards a Theory of Deep Learning
(TM)
5
calculation | consulting why deep learning works
NNs as spin glasses
LeCun et al. 2015
Looks exactly like the old protein folding results (late 1990s)
Energy Landscape Theory
broad questions about Why Deep Learning Works ?
MMDS talk 2016, Blog post 2015
a completely different picture of DNNs
c|c
(TM)
Motivations: towards a Theory of Deep Learning
(TM)
6
calculation | consulting why deep learning works
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
c|c
(TM)
Set up: the Energy Landscape
(TM)
7
calculation | consulting why deep learning works
minimize the Loss: but how to avoid overtraining ?
c|c
(TM)
Problem: How can this possibly work ?
(TM)
8
calculation | consulting why deep learning works
highly non-convex ? apparently not
[figure: expected (highly non-convex) vs. observed loss landscape]
it has long been suspected that local minima are not the issue
c|c
(TM)
Problem: Local Minima ?
(TM)
9
calculation | consulting why deep learning works
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
c|c
(TM)
Motivations: what is Regularization ?
(TM)
10
calculation | consulting why deep learning works
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout | Batch Size | Noisify Data
…
c|c
(TM)
(TM)
11
calculation | consulting why deep learning works
Understanding deep learning requires rethinking generalization
Problem: What is Regularization in DNNs ?
ICLR 2017 Best paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
c|c
(TM)
Motivations: what is Regularization ?
(TM)
12
calculation | consulting why deep learning works
Soften the rank of X, focus on large eigenvalues (λ)
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
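As a hedged aside (my addition, not from the deck): a minimal numpy sketch of the Tikhonov-Phillips solution in SVD form, where the penalty alpha damps small singular values, softening the rank so the large eigenvalues dominate.

import numpy as np

# Minimal sketch, assuming the generic problem: min ||X w - y||^2 + alpha ||w||^2
def ridge_solution(X, y, alpha):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    d = s / (s**2 + alpha)         # small singular values are damped out
    return Vt.T @ (d * (U.T @ y))  # large eigenvalues of X^T X dominate

X = np.random.randn(100, 20)
y = np.random.randn(100)
w = ridge_solution(X, y, alpha=1.0)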
c|c
(TM)
Motivations: how we study Regularization
(TM)
13
turn off regularization, turn it back on systematically, study W_L
and traditional regularization is applied to W_L
the Energy Landscape is determined by the layer weights W_L
c|c
(TM)
(TM)
14
calculation | consulting why deep learning works
Energy Landscape and Information flow
what happens to the layer weight matrices W_L ?
[figure labels: Information bottleneck | Entropy collapse | local minima | k = 1 saddle points | floor / ground state | k = 2 saddle points | axes: Information / Entropy]
c|c
(TM)
(TM)
15
calculation | consulting why deep learning works
Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: a 3-Layer MLP and a Mini AlexNet
And examined pre-trained models (AlexNet, Inception, …)
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
c|c
(TM)
(TM)
16
calculation | consulting why deep learning works
Matrix Complexity: Entropy and Stable Rank
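A short numpy sketch (my addition) of the two complexity measures named here, assuming stable rank = ||W||_F^2 / ||W||_2^2 and entropy taken as the Shannon entropy of the normalized singular-value spectrum:

import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2 : a soft, scale-invariant surrogate for rank
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s**2) / s[0]**2

def spectral_entropy(W):
    # Shannon entropy of the normalized singular-value spectrum
    s = np.linalg.svd(W, compute_uv=False)
    p = s**2 / np.sum(s**2)
    return -np.sum(p * np.log(p + 1e-12))

W = np.random.randn(500, 100)
print(stable_rank(W), spectral_entropy(W))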
c|c
(TM)
(TM)
17
calculation | consulting why deep learning works
Random Matrix Theory: detailed insight into WL
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
…
X = np.dot(W, W.T)  # same nonzero eigenvalues as W.T W
evals = np.linalg.eigvalsh(X)  # X is symmetric; eigvalsh takes one matrix
plt.hist(evals, bins=100, density=True)  # plot the ESD
c|c
(TM)
(TM)
18
calculation | consulting why deep learning works
Random Matrix Theory: detailed insight into WL
Entropy decrease corresponds to breakdown of random structure
and the onset of a new kind of self-regularization
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
Random Matrix | Random + Spikes
c|c
(TM)
(TM)
19
calculation | consulting why deep learning works
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well defined edges (depends on Q, the aspect ratio)
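A sketch (my addition) of this limit, assuming the convention X = W^T W / N with W of shape N x M, aspect ratio Q = N/M >= 1, and element variance sigma^2; the MP density is overlaid on the ESD of a random Gaussian W:

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
Q = N / M                                # aspect ratio, Q >= 1
sigma2 = 1.0                             # element variance of W
W = np.random.randn(N, M) * np.sqrt(sigma2)
evals = np.linalg.eigvalsh(W.T @ W / N)  # ESD of X = W^T W / N

# Marchenko-Pastur density, with edges lambda_pm = sigma^2 (1 +/- 1/sqrt(Q))^2
lam_min = sigma2 * (1 - 1 / np.sqrt(Q))**2
lam_max = sigma2 * (1 + 1 / np.sqrt(Q))**2
lam = np.linspace(lam_min, lam_max, 400)
rho = Q * np.sqrt((lam_max - lam) * (lam - lam_min)) / (2 * np.pi * sigma2 * lam)

plt.hist(evals, bins=60, density=True)
plt.plot(lam, rho)
plt.show()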
c|c
(TM)
(TM)
20
calculation | consulting why deep learning works
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges
c|c
(TM)
(TM)
21
calculation | consulting why deep learning works
Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/
cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
c|c
(TM)
(TM)
22
calculation | consulting why deep learning works
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/
cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D MaxPool Conv2D MaxPool FC FC
c|c
(TM)
(TM)
23
calculation | consulting why deep learning works
Marchenko-Pastur Bulk + Spikes
Conv2D MaxPool Conv2D MaxPool FC FC
softrank = 10%
RMT: LeNet5
c|c
(TM)
(TM)
24
calculation | consulting why deep learning works
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
FC1
zoomed in
FC2
zoomed in
c|c
(TM)
(TM)
25
calculation | consulting why deep learning works
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
W226 (the weight matrix of layer 226)
c|c
(TM)
(TM)
26
calculation | consulting why deep learning works
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; do not lose hard rank
λ_min = 0 : (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0 : all smallest eigenvalues > 0, within the numerical (Numerical Recipes) threshold
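A hedged sketch (my addition) of this diagnostic: flag hard rank collapse when the smallest eigenvalue of X = W^T W is zero relative to the largest, up to a numerical-precision threshold.

import numpy as np

def has_hard_rank_collapse(W, rel_tol=1e-10):
    # Hard rank collapse: lambda_min of X = W^T W is zero to within a
    # numerical threshold, measured relative to lambda_max.
    evals = np.linalg.eigvalsh(W.T @ W)   # ascending order
    return evals[0] < rel_tol * evals[-1]

print(has_hard_rank_collapse(np.random.randn(300, 100)))  # False: full rank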
c|c
(TM)
(TM)
27
calculation | consulting why deep learning works
RMT: 5+1 Phases of Training
c|c
(TM)
(TM)
28
calculation | consulting why deep learning works
Bulk+Spikes: Small Models
Rank 1 perturbation Perturbative correction
Bulk
Spikes
Smaller, older models can be described perturbatively with RMT
c|c
(TM)
(TM)
29
calculation | consulting why deep learning works
Spikes: carry more information
Information begins to concentrate in the spikes
S(v)
spikes have less entropy, are more localized than bulk
c|c
(TM)
(TM)
30
calculation | consulting why deep learning works
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank; eigenvalues > λ (a simple scale threshold); spikes carry most information
c|c
(TM)
(TM)
31
calculation | consulting why deep learning works
Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
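A minimal sketch (my addition) of the modeling claim above: drawing W with heavy tailed entries (Pareto, with an exponent assumed purely for illustration) produces an ESD of X = W^T W that itself has a heavy tail, unlike the Gaussian MP bulk.

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
mu = 1.5  # assumed Pareto tail exponent, for illustration only
signs = np.random.choice([-1, 1], size=(N, M))
W_heavy = signs * np.random.pareto(mu, size=(N, M))
W_gauss = np.random.randn(N, M)

for W, label in [(W_heavy, "heavy tailed W"), (W_gauss, "Gaussian W")]:
    evals = np.linalg.eigvalsh(W.T @ W / N)
    plt.hist(np.log10(evals), bins=60, density=True, alpha=0.5, label=label)
plt.xlabel("log10 eigenvalue")
plt.legend()
plt.show()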
c|c
(TM)
(TM)
32
calculation | consulting why deep learning works
Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
c|c
(TM)
(TM)
33
calculation | consulting why deep learning works
Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap phenomenon
Large batch sizes => decreased generalization accuracy
Tuning the batch size from very large to very small
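A hedged Keras-style sketch (my addition; build_model, x_train, and y_train are placeholders, not from the deck) of this experiment: retrain the same small model across batch sizes and inspect a fully connected layer's ESD each time.

import numpy as np

# build_model() and (x_train, y_train) are assumed placeholders for a
# small model (e.g. a MiniAlexNet) and its training data.
for batch_size in [1024, 256, 64, 16, 4]:
    model = build_model()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=20, verbose=0)
    W = model.layers[-2].get_weights()[0]        # a fully connected layer
    evals = np.linalg.eigvalsh(np.dot(W.T, W))
    print(batch_size, evals.min(), evals.max())  # watch correlations grow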
c|c
(TM)
(TM)
34
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Random-Like | Random-Like | Bleeding-out
c|c
(TM)
(TM)
35
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
c|c
(TM)
(TM)
36
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes | Bulk+Spikes | Bulk-decay
c|c
(TM)
(TM)
37
calculation | consulting why deep learning works
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay | Bulk-decay | Heavy-tailed
c|c
(TM)
(TM)
38
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
How to apply RMT:
Q > 1, λ_min > 0 : Bulk+Spikes (plus Tracy-Widom fluctuations, very crisp edges)
Q > 1, λ_min = 0 : Bulk-decay ?
Q = 1 : Heavy-tailed ?
Large, well trained models approach heavy tailed self-regularization
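A rough heuristic sketch (my addition; the thresholds are assumptions, not the authors' published rule) of that decision flow, using the fraction of eigenvalues beyond an estimated MP bulk edge:

import numpy as np

def esd_phase_hint(W, rel_tol=1e-10):
    # Crude phase hint from the ESD of X = W^T W / N; heavy tails inflate
    # the variance estimate, so treat the answer as a hint only.
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape
    Q = N / M
    evals = np.linalg.eigvalsh(W.T @ W / N)
    if evals[0] < rel_tol * evals[-1]:
        return "rank collapse"
    lam_plus = W.var() * (1 + 1 / np.sqrt(Q))**2  # estimated MP bulk edge
    frac_out = np.mean(evals > lam_plus)
    if frac_out > 0.10:
        return "heavy-tailed / bulk-decay ?"
    if frac_out > 0:
        return "bulk + spikes"
    return "random-like"

print(esd_phase_hint(np.random.randn(1000, 250)))  # ~ "random-like"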
c|c
(TM)
(TM)
39
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
Large, well trained models approach heavy tailed self-regularization
InceptionV3 Layer 226 Q~1.3
Bulk-decay ? Heavy-tailed ?
best MP fits
bulk not captured; difficult to apply MP
c|c
(TM)
(TM)
40
calculation | consulting why deep learning works
Applying RMT: Heavy Tails ~ Q=1
Large, well trained models approach heavy tailed self-regularization
DenseNet201 typical layer w/ Q=1.92
Heavy-tailed, but seemingly
within MP eigenvalue bounds
MP fit is terrible near
eigenvalue minimum = 0
variance 1.83 is quite large
Resembles a Q = 1 fit
like a soft rank collapse
best MP fit
(Q fixed)
c|c
(TM)
(TM)
41
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
How to apply RMT
Large, well trained models approach heavy tailed self-regularization
Q = 1, λ_min = 0
The long tail takes the form of a very large variance
standard MP theory
assumes finite variance
c|c
(TM)
(TM)
42
calculation | consulting why deep learning works
Applying RMT: What phase is your model in ?
Large, well trained models approach heavy tailed self-regularization
best MP fit
(Q fixed)
Heavy-tailed, but
not clean power law
no hard rank collapse
but close
λ_max ~ 30
InceptionV3 Layer 302 Q~2.048
c|c
(TM)
(TM)
43
calculation | consulting why deep learning works
Applying RMT: Should we float Q ?
Large, well trained models approach heavy tailed self-regularization
InceptionV3 Layer 302 Q~2.048
best MP fit
(Q =1)
Heavy-tailed; Q = 1 does not fit
λ_max ~ 30 (not shown)
~1.3
very large variances do not capture the bulk
c|c
(TM)
(TM)
44
calculation | consulting why deep learning works
Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
c|c
(TM)
(TM)
45
calculation | consulting why deep learning works
Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
c|c
(TM)
Energy Funnels: Minimizing Frustration
(TM)
46
calculation | consulting why deep learning works
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
c|c
(TM)
the Spin Glass of Minimal Frustration
(TM)
47
calculation | consulting why deep learning works
Conjectured in 2015 on my blog (15 minutes of fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low lying Energy state in Spin Glass ~ spikes in RMT
c|c
(TM)
RMT w/Heavy Tails: Energy Landscape ?
(TM)
48
calculation | consulting why deep learning works
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
Is the Landscape more funneled, with no 'problems' from local minima ?
(TM)
c|c
(TM)
c | c
charles@calculationconsulting.com