2. Today’s contents
Understanding deep learning requires rethinking generalization
by C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals
Nov. 2016: https://arxiv.org/abs/1611.03530
ICLR 2017 Best Paper
3. Questions
Why do large neural networks generalize well in practice?
Can the traditional theories of generalization actually
explain the results we are seeing these days?
What is it, then, that distinguishes neural networks
that generalize well from those that don’t?
6. Questions
“Deep neural networks TOO easily fit random labels.”
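This headline claim can be sketched with a toy randomization test: an over-parameterized two-layer ReLU net, trained by full-batch gradient descent, drives the loss down on labels that are pure noise. The sizes, init scaling, and learning rate below are illustrative choices for a minimal sketch, not the paper's CIFAR-10 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 16, 8, 200                     # p >> n: heavily over-parameterized
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)      # random labels: no signal to learn

W1 = rng.normal(size=(d, h)) * np.sqrt(2.0 / d)
w2 = rng.normal(size=h) / np.sqrt(h)

def forward(W1, w2):
    a = np.maximum(X @ W1, 0.0)          # hidden ReLU activations
    return a, a @ w2 / np.sqrt(h)        # 1/sqrt(h) output scaling

a, out = forward(W1, w2)
init_loss = 0.5 * np.mean((out - y) ** 2)

lr = 0.2
for _ in range(20000):
    a, out = forward(W1, w2)
    g = (out - y) / (n * np.sqrt(h))     # upstream gradient of the mean squared error
    W1 -= lr * (X.T @ (np.outer(g, w2) * (a > 0)))
    w2 -= lr * (a.T @ g)

a, out = forward(W1, w2)
final_loss = 0.5 * np.mean((out - y) ** 2)
train_acc = np.mean(np.sign(out) == y)   # training accuracy on the noise labels
print(init_loss, final_loss, train_acc)
```

Even though the labels carry no information, the training loss collapses: the network simply memorizes the sample.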
7. Conventional wisdom
Small generalization error due to
• Model family
• Various regularization techniques
“Generalization error”
Strategy: find a model that minimizes
training error with as few parameters as possible
9. Effective capacity of neural networks
Test error vs. the ratio p/n of parameter count to number of training samples (figure):
• MLP 1×512: p/n = 24
• AlexNet: p/n = 28
• Inception: p/n = 33
• Wide ResNet: p/n = 179
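The p/n ratio for the MLP baseline can be reproduced roughly by counting parameters. The sketch below assumes the paper's CIFAR-10 setup of 28×28×3 center crops and 50,000 training images; those dimensions are assumptions, not stated on this slide.

```python
# Parameter count for a one-hidden-layer MLP, assuming CIFAR-10 inputs
# cropped to 28x28x3 and 10 output classes.
def mlp_param_count(d_in: int, hidden: int, classes: int) -> int:
    # weights + biases of the two fully connected layers
    return (d_in * hidden + hidden) + (hidden * classes + classes)

p = mlp_param_count(28 * 28 * 3, 512, 10)   # parameter count of "MLP 1x512"
n = 50_000                                  # CIFAR-10 training set size
print(p, round(p / n))                      # p/n comes out near 24
```

Under these assumptions p is about 1.2M, giving p/n ≈ 24, in line with the figure.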
10. Effective capacity of neural networks
If counting the number of parameters is not a useful way to measure model complexity,
then how can we measure the effective capacity of a model?
16. Implications
Rademacher complexity and VC-dimension
Uniform stability
“How sensitive the algorithm is to the replacement of a single example”
Solely a property of the algorithm
(nothing to do with the data)
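Uniform stability can be made concrete with ridge regression, whose closed-form solution lets us measure directly how far the learned weights move when one training example is replaced; stability improves as regularization grows. This is a sketch, not the paper's analysis, and `ridge_fit` / `sensitivity` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sensitivity(X, y, lam):
    # replace a single training example and measure how far the learned
    # weights move -- the quantity that uniform stability bounds
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = X2[0] + 1.0, y2[0] + 1.0   # a fixed, deterministic replacement
    return np.linalg.norm(ridge_fit(X, y, lam) - ridge_fit(X2, y2, lam))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

s_weak = sensitivity(X, y, 0.01)    # almost unregularized
s_strong = sensitivity(X, y, 1e4)   # heavily regularized
print(s_weak, s_strong)             # stronger regularization -> less sensitive
```

Note that the measurement never looks at a test set: stability is a property of the algorithm, as the slide says.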
17. Implications
• The effective capacity of neural networks is sufficient for
memorizing the entire data set.
• Even optimization on random labels remains easy. In fact,
training time increases only by a small constant factor
compared with training on the true labels.
• Randomizing labels is solely a data transformation, leaving
all other properties of the learning problem unchanged.
Summary
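The randomization transformation itself is a one-liner on the labels, leaving the inputs untouched. The sketch below uses a hypothetical `corrupt_labels` helper and also covers the paper's partial-corruption variant, where each label is independently replaced with probability p.

```python
import numpy as np

def corrupt_labels(y, p, num_classes, rng):
    # with probability p, independently replace each label by a uniformly
    # random class; the inputs X are not touched at all
    y = y.copy()
    mask = rng.random(len(y)) < p
    y[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
    return y

rng = np.random.default_rng(0)
y_true = np.arange(12) % 4                      # toy labels, 4 classes
y_same = corrupt_labels(y_true, 0.0, 4, rng)    # p = 0: true labels
y_rand = corrupt_labels(y_true, 1.0, 4, rng)    # p = 1: fully random labels
print(np.array_equal(y_same, y_true))
```

Sweeping p from 0 to 1 interpolates between the true-label and random-label experiments, which is how the paper traces the transition.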
18. The role of regularization
Shall we sprinkle on some regularization?
What about regularization techniques?
When your model keeps overfitting,
this is how it can survive.
(Crisis Escape No. 1)
19. The role of regularization
What about regularization techniques?
20. The role of regularization
What about regularization techniques?
Works pretty well even without it?
21. The role of regularization
What about regularization techniques?
Still overfits just fine even with it?
22. The role of regularization
What about regularization techniques?
Regularization certainly helps generalization but
it is NOT the fundamental reason for generalization.
Still overfits just fine even with it?
23. Finite sample expressivity
Much effort has gone into studying the expressivity of NNs
However, almost all of these results are at the “population level”
Showing what functions of the entire domain can and cannot be
represented by certain classes of NNs with the same number of
parameters
What is more relevant in practice is the expressive power of
NNs on a finite sample of size n
24. Finite sample expressivity
The expressive power of NNs on a finite sample of size n?
“As soon as the number of parameters p of a network is greater than n, even
simple two-layer neural networks can represent any function of the input sample.”
“A two-layer neural network with 2n + d weights and ReLU activations can
represent any function on a sample of size n in d dimensions.”
25. Finite sample expressivity
Proof)
Lower triangular matrices…
• are invertible if and only if the
diagonal elements are nonzero
• have their eigenvalues taken directly
from the diagonal elements
∃A⁻¹ ∵ rank(A) = n ∎
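The construction behind this proof can be checked numerically: pick a random direction a, interleave thresholds b_j between the sorted projections a·x_i so that the ReLU activation matrix A is lower triangular with positive diagonal, then solve for the output weights c. The network f(x) = Σ_j c_j ReLU(a·x − b_j) uses d + 2n weights (a, b, c). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))          # n sample points in d dimensions
y = rng.normal(size=n)               # arbitrary real targets

a = rng.normal(size=d)               # random projection direction
order = np.argsort(X @ a)
X, y = X[order], y[order]
z = X @ a                            # sorted projections z_1 < ... < z_n

# thresholds interleaving the projections: b_1 < z_1 and z_{j-1} < b_j < z_j
b = np.empty(n)
b[0] = z[0] - 1.0
b[1:] = (z[:-1] + z[1:]) / 2.0

A = np.maximum(z[:, None] - b[None, :], 0.0)  # A[i, j] = ReLU(z_i - b_j)
# A is lower triangular with positive diagonal, hence invertible
c = np.linalg.solve(A, y)

# the two-layer ReLU net f(x) = sum_j c_j ReLU(a.x - b_j) interpolates exactly
preds = np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ c
print(np.max(np.abs(preds - y)))     # exact fit up to floating-point round-off
```

Any assignment of targets y, including random labels, is fitted exactly, which is the memorization capacity the paper measures.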
31. Implicit regularization
An appeal to linear models
But do all global minima generalize equally well?
How do we tell one from another?
“Curvature”
In the linear case, the curvature at
all optimal solutions is the same,
∵ the Hessian of the loss function is
not a function of the choice of w
33. Implicit regularization
The algorithm itself acts as a constraint
Here, implicit regularization = the SGD algorithm
SGD converges to the minimum-norm solution
If curvature doesn’t distinguish global minima, what does?
w_mn = Xᵀ(X Xᵀ)⁻¹ y
X Xᵀ ∈ ℝⁿˣⁿ is a kind of kernel (Gram) matrix K
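That gradient descent from zero initialization on an under-determined least-squares problem lands exactly on w_mn = Xᵀ(XXᵀ)⁻¹y can be verified directly. A sketch, with full-batch gradient descent standing in for SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                         # under-determined: more unknowns than equations
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# full-batch gradient descent on 0.5*||Xw - y||^2, starting from w = 0
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    w -= lr * (X.T @ (X @ w - y))

# minimum-norm interpolant: w_mn = X^T (X X^T)^-1 y
w_mn = X.T @ np.linalg.solve(X @ X.T, y)
print(np.linalg.norm(w - w_mn))      # gradient descent found the min-norm solution
```

The reason is that every gradient Xᵀ(Xw − y) lies in the row space of X, so the iterates never leave it, and the unique interpolant in that subspace is w_mn.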
37. • Simple experimental framework for understanding the
effective capacity of deep learning models
• Successful DeepNets are able to overfit the training set
• Other formal measures of complexity for the models/
algorithms/data distributions are needed to precisely explain
the over-parameterized regime
Conclusion
38. We believe that …
“understanding neural networks
requires rethinking generalization.”
Conclusion
40. Things to discuss…
• The effective capacity of neural networks is sufficient for
memorizing the entire data set.
• Even optimization on random labels remains easy. In fact,
training time increases only by a small constant factor
compared with training on the true labels.
• Randomizing labels is solely a data transformation, leaving
all other properties of the learning problem unchanged.
Summary
Really???
42. The standard supervised learning setting:
assume the training and test domains are the same.
Statistical Learning Theory
: SVM, …
50. Source: customer reviews of electronics (X)
with positive-or-negative labels (Y)
Target: customer reviews of video games (X)
We learn a classifier h: the target labels are unknown,
but we want an h that labels well
in both the source (X, Y) and target (X) domains,
drawn from the hypothesis space H represented by NNs…
51. The previous strategy: find a model that minimizes
training error with as few parameters as possible
52. Now the training domain (source) and the testing
domain (target) differ from each other,
so another strategy is needed in addition to the previous one.
53. A Computable Adaptation Bound
ε_T(h) ≤ ε̂_S(h) + ½ d̂_{H∆H}(S, T) + (complexity term) + λ
• d̂_{H∆H}(S, T): a divergence estimated from finite samples
• complexity term: depends on the number of unlabeled samples
54. The optimal joint hypothesis
h* = argmin_{h∈H} [ ε_S(h) + ε_T(h) ]
is the hypothesis with minimal combined error;
λ = ε_S(h*) + ε_T(h*) is that error.