calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
UC Berkeley / NERSC 2018
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years of experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
Who Are We?
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 program on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
Problem: How can this possibly work ?
highly non-convex ? apparently not
expected vs. observed ?
it has been suspected for a long time that local minima are not the issue
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout, Batch Size, Noisify Data, …
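As a hypothetical illustration of these knobs in Keras (the model, layer sizes, and data below are placeholders, not from the talk):

import keras
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# toy stand-in data (placeholder)
X_train = np.random.rand(1000, 784).astype('float32')
y_train = keras.utils.to_categorical(np.random.randint(10, size=1000), 10)

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dropout(0.5),                                   # knob 1: Dropout
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')

X_noisy = X_train + np.random.normal(0, 0.1, X_train.shape)   # knob 2: Noisify Data
model.fit(X_noisy, y_train, batch_size=32, epochs=5)          # knob 3: Batch Size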
Problem: What is Regularization in DNNs ?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper)
Large models overfit on randomly labeled data
Regularization cannot prevent this
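A hedged sketch of the randomization test behind that result (in the spirit of Zhang et al., ICLR 2017; the architecture and epoch count below are illustrative assumptions):

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense

# MNIST with randomly permuted (i.e., meaningless) labels
(X, y), _ = keras.datasets.mnist.load_data()
X = X.reshape(-1, 784).astype('float32') / 255.0
y_random = keras.utils.to_categorical(np.random.permutation(y), 10)

model = Sequential([
    Dense(2048, activation='relu', input_shape=(784,)),
    Dense(2048, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# a big enough model memorizes the random labels: training accuracy -> ~1.0
model.fit(X, y_random, batch_size=128, epochs=100, verbose=1)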
Motivations: what is Regularization ?
Ridge Regression / Tikhonov-Phillips Regularization
Moore-Penrose pseudoinverse (1955); regularize (Phillips, 1962): a familiar optimization problem
Soften the rank of X, focus on the large eigenvalues
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
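For reference, the textbook form behind this slide, stated here because the slide's equations did not survive the transcript (a standard result, not copied from the talk):

\hat{x} \;=\; \arg\min_x \; \|Ax - b\|_2^2 + \alpha^2 \|x\|_2^2
\;=\; (A^{\top} A + \alpha^2 I)^{-1} A^{\top} b

As α → 0 this recovers the Moore-Penrose pseudoinverse solution; for α > 0 the shift α²I suppresses the small eigenvalues of A^T A, which is exactly the "soften the rank, focus on large eigenvalues" picture above.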
Motivations: how we study Regularization
turn off regularization, turn it back on systematically, study W_L
the Energy Landscape is determined by the layer weights W_L
and traditional regularization is applied to W_L
Energy Landscape: and Information flow
what happens to the layer weight matrices W_L ?
[figure: the Energy Landscape annotated with local minima, k=1 and k=2 saddle points, the floor / ground state, the Information bottleneck, and Entropy collapse, along an Information / Entropy axis]
Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: 3-Layer MLP and a Mini AlexNet
And examine pre-trained models (AlexNet, Inception, …)
Conv2D → MaxPool → Conv2D → MaxPool → FC1 → FC2 → FC
Matrix Complexity: Entropy and Stable Rank
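The slide's formulas are images that did not survive this transcript; below is a hedged numpy reconstruction of two standard definitions of the measures it names (these exact forms are assumptions, not copied from the talk):

import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2 : a smooth, scale-invariant proxy for rank
    sv = np.linalg.svd(W, compute_uv=False)
    return np.sum(sv**2) / sv[0]**2          # sv is sorted descending

def matrix_entropy(W):
    # entropy of the normalized eigenvalues of X = W^T W, scaled to [0, 1]
    sv = np.linalg.svd(W, compute_uv=False)
    p = sv**2 / np.sum(sv**2)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(sv))

W = np.random.randn(400, 100)                # an iid random layer:
print(stable_rank(W), matrix_entropy(W))     # high entropy, high stable rank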
Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# weight matrix of layer i
W = model.layers[i].get_weights()[0]
…
# X = W W^T has the same nonzero eigenvalues as W^T W
X = np.dot(W, W.T)
evals, evecs = np.linalg.eigh(X)          # eigh: X is symmetric
plt.hist(evals, bins=100, density=True)   # the ESD
Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
Entropy decrease corresponds to breakdown of random structure
and the onset of a new kind of self-regularization
[figure: ESDs shifting from Random Matrix to Random + Spikes]
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well-defined edges (which depend on Q, the aspect ratio)
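For reference, a standard statement of that deterministic limit (the Marchenko-Pastur density, written from the general RMT literature rather than the slide), with σ² the element variance and Q = N/M ≥ 1 the aspect ratio:

\rho(\lambda) \;=\; \frac{Q}{2\pi\sigma^2}\,
  \frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{\lambda},
\qquad
\lambda_{\pm} \;=\; \sigma^2 \Bigl(1 \pm \tfrac{1}{\sqrt{Q}}\Bigr)^{2},
\qquad \lambda \in [\lambda_{-}, \lambda_{+}]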
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations: very crisp edges
Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D → MaxPool → Conv2D → MaxPool → FC → FC
RMT: LeNet5
Marchenko-Pastur Bulk + Spikes
Conv2D → MaxPool → Conv2D → MaxPool → FC → FC
soft rank = 10%
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
[figures: FC1 and FC2 ESDs, each shown full scale and zoomed in]
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
[figure: ESD of layer W226]
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses, but they do not lose hard rank
λ_min = 0: (hard) rank collapse (Q > 1), which signifies over-regularization
λ_min > 0: all of the smallest eigenvalues stay above zero, within the numerical (Recipes) threshold
RMT: 5+1 Phases of Training
[figure: the 5+1 phases seen in the ESD over training: Random-Like, Bleeding-out, Bulk+Spikes, Bulk-decay, and Heavy-Tailed, plus the pathological Rank-collapse phase]
Bulk+Spikes: Small Models
Smaller, older models can be described perturbatively w/RMT:
a Rank-1 perturbation of the Bulk yields a perturbative correction, the Spikes
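A hedged numpy illustration of this picture (matrix sizes and spike strength are illustrative choices): adding a Rank-1 perturbation to an iid random matrix pushes one eigenvalue out of the MP bulk.

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250                           # aspect ratio Q = N/M = 4
W = np.random.randn(N, M) / np.sqrt(N)     # iid "random-like" matrix
u = np.random.randn(N, 1); u /= np.linalg.norm(u)
v = np.random.randn(M, 1); v /= np.linalg.norm(v)
W_spiked = W + 5.0 * u @ v.T               # rank-1 perturbation, strong enough to separate

evals = np.linalg.eigvalsh(W_spiked.T @ W_spiked)
plt.hist(evals, bins=100, density=True)    # MP bulk on [0.25, 2.25] plus one spike near ~25
plt.show()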
Spikes: carry more information
Information begins to concentrate in the spikes:
measured by the vector entropy S(v), the spikes have less entropy and are more localized than the bulk
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
softer rank, eigenvalues above a simple scale threshold, and spikes that carry most of the information
Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
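A hedged numpy sketch of the modeling claim above (the Pareto tail exponent and sizes are illustrative assumptions): filling W with heavy-tailed entries makes the ESD of X = W^T W itself heavy tailed, with no sharp MP edge.

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
mu = 2.5                                          # Pareto tail exponent (illustrative)
signs = np.random.choice([-1.0, 1.0], size=(N, M))
W = signs * np.random.pareto(mu, size=(N, M)) / np.sqrt(N)

evals = np.linalg.eigvalsh(W.T @ W)
plt.hist(evals, bins=100, density=True)
plt.yscale('log')                                 # heavy tail visible on a log scale
plt.show()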
Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap Phenomenon
Large batch sizes => decrease generalization accuracy
Tuning the batch size from very large to very small
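A hedged outline of this experiment (the model, epochs, and batch sizes below are placeholders, not the talk's exact settings): retrain the same small model with the batch size swept from very large to very small, keeping each final weight matrix for ESD analysis.

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense

(X, y), _ = keras.datasets.mnist.load_data()
X = X.reshape(-1, 784).astype('float32') / 255.0
y = keras.utils.to_categorical(y, 10)

weights_by_batch = {}
for batch_size in [4096, 1024, 256, 64, 16, 4]:       # very large -> very small
    model = Sequential([
        Dense(256, activation='relu', input_shape=(784,)),
        Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='sgd', loss='categorical_crossentropy')
    model.fit(X, y, batch_size=batch_size, epochs=10, verbose=0)
    weights_by_batch[batch_size] = model.layers[0].get_weights()[0]
# inspect the ESD of each saved W (as in the earlier snippet) to see the phases emerge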
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Random-Like → Random-Like → Bleeding-out
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes → Bulk+Spikes → Bulk-decay
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay → Bulk-decay → Heavy-tailed
Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
the Spin Glass of Minimal Frustration
Conjectured in 2015 on my blog (15 minutes of fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low-lying Energy states in the Spin Glass ~ spikes in RMT
RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993)
is the Landscape more funneled, with no 'problems' from local minima ?
c | c
charles@calculationconsulting.com