calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
UC Berkeley / NERSC 2018
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years of experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve the engineering of DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
Problem: How can this possibly work ?
highly non-convex ? apparently not
(figure: expected vs. observed loss landscape)
it has been suspected for a long time that local minima are not the issue
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout, Batch Size, Noisify Data, …
Problem: What is Regularization in DNNs ?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper)
Large models overfit on randomly labeled data
Regularization cannot prevent this
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
a familiar optimization problem
Motivations: what is Regularization ?
Soften the rank of X, focus on the large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
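A minimal numpy sketch of this idea (my own illustration, assuming the standard ridge solution x_hat = (X^T X + alpha I)^-1 X^T y): written via the SVD, each direction is shrunk by s_i / (s_i^2 + alpha), so directions with s_i^2 much larger than alpha pass through and the rest are damped, which softens the rank of X.

import numpy as np

def ridge_solution(X, y, alpha):
    # Tikhonov-Phillips / ridge regression via the SVD of the design matrix X:
    # directions with s_i^2 >> alpha are kept, small-s_i directions are damped
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    d = s / (s**2 + alpha)          # filter factors
    return Vt.T @ (d * (U.T @ y))

# toy, hypothetical data: an ill-conditioned design matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20)) @ np.diag(np.logspace(0, -6, 20))
y = rng.normal(size=100)
x_hat = ridge_solution(X, y, alpha=1e-3)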
Motivations: how we study Regularization
turn off regularization, turn it back on systematically, study W_L
the Energy Landscape is determined by the layer weights W_L,
and traditional regularization is applied to W_L
Energy Landscape and Information flow
(diagram: Information / Entropy vs. the Energy Landscape, annotated with the Information bottleneck, Entropy collapse, local minima, k=1 and k=2 saddle points, and the floor / ground state)
what happens to the layer weight matrices W_L ?
Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: 3-Layer MLP and a Mini AlexNet
And examine pre-trained models (AlexNet, Inception, …)
(architecture: Conv2D → MaxPool → Conv2D → MaxPool → FC1 → FC2 → FC)
Matrix Complexity: Entropy and Stable Rank
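The formulas on this slide are images; below is a sketch of the two quantities the title names, using what I take to be the usual definitions (stable rank = ||W||_F^2 / ||W||_2^2, and a spectral entropy over the normalized singular values).

import numpy as np

def stable_rank(W):
    # stable (numerical) rank: ||W||_F^2 / ||W||_2^2
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s**2) / np.max(s)**2

def matrix_entropy(W):
    # entropy of the normalized singular-value spectrum, scaled to [0, 1];
    # near 1 for a random matrix, it drops as correlations develop in W
    s = np.linalg.svd(W, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(s))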
Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# weight matrix of layer i (drop the bias vector)
W = model.layers[i].get_weights()[0]
…
# ESD: eigenvalues of the layer correlation matrix
X = np.dot(W, W.T)
evals, evecs = np.linalg.eigh(X)
plt.hist(evals, bins=100, density=True)
Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
Entropy decrease corresponds to breakdown of random structure
and the onset of a new kind of self-regularization
(ESD plots: Random Matrix vs. Random + Spikes)
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well-defined edges (which depend on Q, the aspect ratio)
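For reference, the standard Marchenko-Pastur statement (textbook form, not transcribed from the slide image): if W is an N x M matrix with i.i.d. entries of variance σ² and aspect ratio Q = N/M ≥ 1, the ESD of X = (1/N) W^T W converges to

\rho_{\mathrm{MP}}(\lambda)
  = \frac{Q}{2\pi\sigma^{2}}\,
    \frac{\sqrt{(\lambda^{+}-\lambda)(\lambda-\lambda^{-})}}{\lambda},
\qquad
\lambda^{\pm} = \sigma^{2}\Bigl(1 \pm \tfrac{1}{\sqrt{Q}}\Bigr)^{2},

supported on [λ-, λ+]; those λ± are the well-defined edges that depend on Q.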
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges
Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
(architecture: Conv2D → MaxPool → Conv2D → MaxPool → FC → FC)
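A sketch of how the same ESD analysis might be run over a pre-trained model from keras.applications (InceptionV3 here; the layer selection and loop are my illustration, not the talk's actual script):

import numpy as np
from tensorflow.keras.applications import InceptionV3

model = InceptionV3(weights="imagenet")
for layer in model.layers:
    weights = layer.get_weights()
    if not weights or weights[0].ndim != 2:
        continue                      # keep only the dense (FC) weight matrices
    W = weights[0]
    X = np.dot(W.T, W) if W.shape[0] >= W.shape[1] else np.dot(W, W.T)
    evals = np.linalg.eigvalsh(X)     # ESD of the layer correlation matrix
    print(layer.name, W.shape, "largest eigenvalue:", evals.max())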
RMT: LeNet5
Marchenko-Pastur Bulk + Spikes
(architecture: Conv2D → MaxPool → Conv2D → MaxPool → FC → FC)
soft rank = 10%
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
(ESD plots for FC1 and FC2, each also shown zoomed in)
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
(layer weight matrix W_226)
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; they do not lose hard rank
λ_min = 0 : (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0 : all smallest eigenvalues > 0, within a numerical (recipes) threshold
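A small sketch of that check (my own helper; the tolerance is the usual relative numerical-threshold convention):

import numpy as np

def has_hard_rank_collapse(W, rel_tol=1e-10):
    # hard rank collapse: the smallest eigenvalue of X = W^T W is zero
    # to within a relative numerical threshold
    X = np.dot(W.T, W) if W.shape[0] >= W.shape[1] else np.dot(W, W.T)
    evals = np.linalg.eigvalsh(X)
    return evals.min() < rel_tol * evals.max()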
RMT: 5+1 Phases of Training
(Random-Like, Bleeding-out, Bulk+Spikes, Bulk-decay, Heavy-Tailed, plus Rank-collapse)
Bulk+Spikes: Small Models
(ESD: Bulk + Spikes; the spikes enter as a Rank-1 perturbation / perturbative correction)
Smaller, older models can be described perturbatively with RMT
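As I read the slide, this is the standard spiked-matrix picture: the trained layer matrix is a random bulk plus a low-rank (in the simplest case rank-1) correction,

W_L \;\simeq\; W_L^{\mathrm{rand}} \;+\; \Delta,
\qquad
\Delta \;=\; \sum_{k=1}^{K} c_k\, u_k v_k^{\top}, \quad K \ \text{small},

so the ESD is the Marchenko-Pastur bulk plus a few spike eigenvalues pulled above the bulk edge λ+, and the correction can be treated perturbatively.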
Spikes: carry more information
Information begins to concentrate in the spikes
vector entropy S(v):
spikes have less entropy, are more localized than the bulk
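A sketch of what S(v) could be, assuming the usual localization entropy over squared eigenvector components (the slide's exact definition is only in the image):

import numpy as np

def vector_entropy(v):
    # Shannon entropy of the squared components of an eigenvector, scaled to [0, 1];
    # localized spike eigenvectors score lower than delocalized bulk eigenvectors
    p = np.abs(v)**2
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(v))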
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
softer rank; eigenvalues above a simple scale threshold; the spikes carry most of the information
Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
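One way to quantify how heavy the tail of a layer's ESD is: fit a power-law exponent to the eigenvalue tail. A sketch with the powerlaw package (my choice of tooling; the eigenvalues below are a stand-in, in practice use the evals from the ESD snippet earlier):

import numpy as np
import powerlaw   # pip install powerlaw

# stand-in for a layer's eigenvalue spectrum; replace with the evals
# computed from X = W^T W for a real layer
evals = np.random.pareto(2.5, size=4096) + 1.0

fit = powerlaw.Fit(evals)
print("tail exponent alpha:", fit.power_law.alpha)
print("tail starts at xmin:", fit.power_law.xmin)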
Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low-rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap Phenomena
Large batch sizes => decreased generalization accuracy
Tuning the batch size from very large to very small
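A sketch of the experimental loop (build_model, x_train and y_train are placeholders for the small MLP / Mini-AlexNet setups described earlier): retrain the same model at a range of batch sizes and save the layer weights for ESD analysis.

import numpy as np

for batch_size in [1024, 512, 256, 64, 16, 4]:       # very large -> very small
    model = build_model()                             # placeholder: one of the small models
    model.fit(x_train, y_train, epochs=50, batch_size=batch_size, verbose=0)
    for i, layer in enumerate(model.layers):
        weights = layer.get_weights()
        if weights and weights[0].ndim == 2:
            # save each FC weight matrix for later ESD / phase analysis
            np.save(f"W_layer{i}_bs{batch_size}.npy", weights[0])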
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
(ESD panels: Random-Like, Random-Like, Bleeding-out)
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
(ESD panels: Bulk+Spikes, Bulk+Spikes, Bulk-decay)
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
(ESD panels: Bulk-decay, Bulk-decay, Heavy-tailed)
Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
the Spin Glass of Minimal Frustration
Conjectured in 2015 on my blog (15 minutes of fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low-lying Energy states in the Spin Glass ~ spikes in RMT
RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
the Landscape is more funneled, with no ‘problems’ from local minima ?
charles@calculationconsulting.com