7. Why Does Deep Learning Succeed? (2/4)
Beast Processor
*** 2017 Google TPU
8. Why Does Deep Learning Succeed? (3/4)
Technical Breakthrough
• Algorithm
– Stochastic gradient descent (SGD): fast convergence for learning
– ReLU activation function: mitigates the vanishing gradient problem
– Dropout: regularization
Figure: gradient descent (batch) vs. stochastic gradient descent
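The contrast between batch gradient descent and SGD can be sketched on a toy problem. This is a minimal illustration, not the training setup behind any result in these slides: the data (`y = 3x + 1` plus noise), the learning rate, and the function names are all assumptions chosen for the example. Batch gradient descent makes one update per pass over the data, while SGD makes one update per sample, so for the same number of passes SGD usually ends much closer to the optimum.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic 1-D data from y = 3x + 1 plus small noise (hypothetical)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys

def mse(w, b, xs, ys):
    """Mean squared error of the linear model w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def batch_gd(xs, ys, lr=0.1, epochs=20):
    """One parameter update per pass, using the gradient over ALL samples."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def sgd(xs, ys, lr=0.1, epochs=20, seed=1):
    """One parameter update per SAMPLE, visiting samples in shuffled order."""
    rng = random.Random(seed)
    w = b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = w * xs[i] + b - ys[i]
            w -= lr * 2 * err * xs[i]
            b -= lr * 2 * err
    return w, b

xs, ys = make_data()
loss_batch = mse(*batch_gd(xs, ys), xs, ys)
loss_sgd = mse(*sgd(xs, ys), xs, ys)
print(f"after 20 passes: batch GD loss {loss_batch:.4f}, SGD loss {loss_sgd:.4f}")
```

Both loops see the data the same number of times; the difference is only how often the parameters are updated.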
9. Why Does Deep Learning Succeed? (4/4)
Technical Breakthrough
• Architecture: hierarchical representation
"The Extraordinary Link Between Deep Neural Networks and the Nature of the Universe," MIT Technology Review, 2016/09.
13. Hubel/Wiesel Architecture
• D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981)
• Visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells
15. Computer Vision Features
• SIFT, Haar, Textons, HoG
• and many others: SURF, MSER, LBP, Color-SIFT, Color histogram, GLOH, …
Traditional recognition:
Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
16. Shallow vs Deep Architectures
Traditional recognition: “Shallow” architecture
Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
Deep learning: “Deep” architecture
Image/Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class
Image → Low-level vision features (edges, SIFT, HOG, etc.) [feature extractor] → Object detection / classification [classifier]
17. Learn Feature Hierarchy
Fill in representation gap in recognition
Feature representation: Input data (Pixels) → 1st layer “Edges” → 2nd layer “Object parts” → 3rd layer “Objects”
Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier
Learning algorithm: SGD, ReLU, dropout
"Object Detectors Emerge in Deep Scene CNNs," B. Zhou, et al., ICLR 2015
19. Taxonomy of Feature Learning Methods
Axes: Shallow vs. Deep; Supervised vs. Unsupervised
• Support Vector Machine
• Logistic Regression
• Perceptron
• Deep Neural Net
• Convolutional Neural Net (CNN)
• Recurrent Neural Net
• Autoencoder
• Restricted Boltzmann machines*
• Sparse coding*
• Generative Adversarial Net (GAN)*
• Deep Belief Nets*
• Deep Boltzmann machines*
• Hierarchical Sparse Coding*
• Siamese Net
* supervised version exists
20. CNN Applications (1/3)
• e.g. Google Photos search
• Face Verification, Taigman et al. 2014 (FAIR)
• Self-driving cars
• [Goodfellow et al. 2014]
• Ciresan et al. 2013
• Turaga et al. 2010
21. CNN Applications (2/3)
• ATARI game playing, Mnih 2013
• AlphaGo, Silver et al. 2016
• VizDoom
• StarCraft
23. CNN Example: Recognition
• OCR
• House Number
• Traffic Sign
Taigman et al. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” CVPR 2014
25. CNN Example: Scene Labeling
Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013 (LeCun)
Pinheiro et al. "Recurrent Convolutional Neural Networks for Scene Labeling" ICML 2014
26. CNN Example: Action Recognition from Videos
Simonyan et al. "Two-Stream Convolutional Networks for Action Recognition in Videos" NIPS 2014
A. Karpathy et al. "Large-scale Video Classification with Convolutional Neural Networks" CVPR 2014
28. CNN (Convnet) by LeCun in 1998
• Neural network with specialized connectivity structure
• Stack multiple stages of feature extractors
• Higher stages compute more global, more invariant features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE 86(11): 2278–2324, 1998.
30. Components of Each Layer
Pixels / Features → Filter with Dictionary (convolutional or tiled) + Non-linearity → [Optional] Spatial/Feature pooling (Sum or Max) → [Optional] Normalization between feature responses → Output Features
Slide: R. Fergus
31. Convolutional Filtering
Figure: Input → Feature Map
– Dependencies are local
– Translation equivariance
– Tied filter weights (few params)
– Stride 1, 2, … (faster, less mem.)
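The properties listed for convolutional filtering can be checked on a tiny example. The sketch below is an illustration only, written with nested lists rather than a tensor library; the 6x6 input, the 3x3 filter values, and the name `conv2d` are assumptions made for the example. It applies one shared filter at every position (tied weights), shows that a larger stride produces a smaller output, and shows that shifting the input shifts the output by the same amount (translation equivariance).

```python
def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2-D cross-correlation with a square kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    out = []
    for i in range(0, height - k + 1, stride):
        row = []
        for j in range(0, width - k + 1, stride):
            # Each output depends only on a local k x k window of the input.
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 6x6 input with a single bright pixel, and a 3x3 filter.
image = [[0.0] * 6 for _ in range(6)]
image[2][2] = 1.0
kernel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

out1 = conv2d(image, kernel, stride=1)  # 4x4 feature map
out2 = conv2d(image, kernel, stride=2)  # 2x2 feature map: fewer positions

# Shift the bright pixel by one row and one column.
shifted = [[0.0] * 6 for _ in range(6)]
shifted[3][3] = 1.0
out_shifted = conv2d(shifted, kernel, stride=1)

# Tied weights: 9 parameters regardless of image size, versus one weight per
# input-output pair for a fully connected map from 36 inputs to 16 outputs.
conv_params = len(kernel) * len(kernel[0])
dense_params = (6 * 6) * (4 * 4)
print(len(out1), len(out2), conv_params, dense_params)
```

Comparing `out_shifted[i + 1][j + 1]` with `out1[i][j]` shows the same responses, moved by one position.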
32. Non-Linearity
• Every neuron performs a non-linear operation
– Tanh
– Sigmoid: 1/(1+exp(-x))
– Rectified linear unit (ReLU)*
• ReLU simplifies backprop, makes learning faster, and avoids saturation issues
* Preferred option
Figure: a neuron with inputs x1, x2, x3, …, xd and weights w1, w2, w3, …, wd computes σ(wx+b)
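The saturation issue can be made concrete by comparing derivatives. The following is a minimal sketch using only the standard library; the function names and the sample inputs are assumptions chosen for illustration. For large inputs the sigmoid and tanh derivatives shrink toward zero, which is what starves earlier layers of gradient, while the ReLU derivative stays at 1 for any positive input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def d_relu(x):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return 1.0 if x > 0 else 0.0

def neuron(x, w, b, activation):
    """Single neuron from the slide: activation(w . x + b)."""
    return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)

for x in (0.5, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.2e}  "
          f"tanh'={d_tanh(x):.2e}  relu'={d_relu(x):.1f}")
```

The ReLU derivative is also trivially cheap to compute, which is one reading of "simplifies backprop".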
33. Pooling
• Sum or max
• Non-overlapping / overlapping regions
• Boureau et al. ICML’10 for theoretical analysis
Figure: sum pooling and max pooling examples
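Sum and max pooling over non-overlapping or overlapping regions can be sketched as follows. This is an illustrative implementation on nested lists; the function name `pool2d`, its parameters, and the 4x4 feature map are assumptions made for the example. Regions are non-overlapping when the stride equals the window size and overlapping when the stride is smaller.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Pool square windows of a 2-D feature map with max or sum."""
    reduce_fn = max if mode == "max" else sum
    height, width = len(fmap), len(fmap[0])
    out = []
    for i in range(0, height - size + 1, stride):
        row = []
        for j in range(0, width - size + 1, stride):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

max_pooled = pool2d(fmap, size=2, stride=2, mode="max")   # non-overlapping
sum_pooled = pool2d(fmap, size=2, stride=2, mode="sum")   # non-overlapping
overlapping = pool2d(fmap, size=2, stride=1, mode="max")  # overlapping windows
print(max_pooled, sum_pooled, overlapping)
```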
38. ImageNet Challenge: ILSVRC
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon Mechanical Turk
• Challenge: 1.2 million training images, 1000 classes
Karpathy et al. "Large-scale video classification with convolutional neural networks." CVPR 2014 (Fei-Fei Li)
40. Going Deeper from 2012
Clarifai
He et al. "Deep Residual Learning for Image Recognition," CVPR 2016
41. AlexNet (2012 Winner)
• Similar framework to LeCun’98 but:
• More data (10^6 vs. 10^3 images)
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Better regularization for training (DropOut)
Alex Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS 2012
42. How Important Is Depth? (1/2)
AlexNet: 8 layers total
Trained on ImageNet
16.4% top-5 error
Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output
• Remove top fully connected layer (Layer 7)
– Drop 16 million parameters
– Only 1.1% drop in performance!
• Remove layers 6 & 7
– Drop 50 million parameters
– 5.7% drop in performance
43. How Important Is Depth? (2/2)
AlexNet: 8 layers total, trained on ImageNet, 16.4% top-5 error
• Remove layers 3 & 4
– Drop 1 million parameters
– 3.0% drop in performance
• Remove layers 3, 4, 6, 7
– 33.5% drop in performance
Depth of network is key
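The parameter drops quoted for the fully connected layers can be roughly reproduced with simple arithmetic. The sketch below assumes the layer sizes from the Krizhevsky et al. paper, which are not stated on these slides: a 6x6x256 (9216-value) input to Layer 6, 4096 units in each of Layers 6 and 7, and a 1000-way softmax output. It also assumes that when a layer is removed, the next layer is reconnected directly to whatever came before it.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Assumed sizes (from the AlexNet paper, not from the slide).
POOL5, FC_UNITS, CLASSES = 6 * 6 * 256, 4096, 1000

# Full classifier: Layer 6 -> Layer 7 -> softmax output.
full = (fc_params(POOL5, FC_UNITS)
        + fc_params(FC_UNITS, FC_UNITS)
        + fc_params(FC_UNITS, CLASSES))

# Layer 7 removed: Layer 6 feeds the softmax directly.
without_7 = fc_params(POOL5, FC_UNITS) + fc_params(FC_UNITS, CLASSES)

# Layers 6 and 7 removed: Layer 5 output feeds the softmax directly.
without_6_7 = fc_params(POOL5, CLASSES)

drop_7 = full - without_7
drop_6_7 = full - without_6_7
print(f"remove layer 7:      {drop_7 / 1e6:.1f}M fewer parameters")
print(f"remove layers 6 & 7: {drop_6_7 / 1e6:.1f}M fewer parameters")
```

Under these assumptions the results land close to the rounded 16 million and 50 million figures, which shows how much of the parameter budget sits in the fully connected layers compared with the roughly 1 million dropped with Layers 3 and 4.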
44. ZFNet (2013 2nd, Improved AlexNet)
M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 14.8%
Meaning of Each Layer in ZFNet
45. VGGNet (2014 2nd)
• Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2
• 19 layers
• Top-5 error 7.3%
• TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
• TOTAL params: 138M parameters
Figure: best model
Simonyan & Zisserman, "Very deep convolutional networks for large-scale image recognition" ICLR 2015
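The 138M parameter total can be reproduced from layer shapes alone. This sketch rests on assumptions the slide does not state: it uses the 16-weight-layer configuration from the Simonyan & Zisserman paper (the one that 138M appears to correspond to), a 224x224x3 input, and three fully connected layers of 4096, 4096, and 1000 units, with biases counted. Because every convolution is 3x3 with stride 1 and pad 1, only the 2x2 pools change the spatial size.

```python
# Assumed layout: integers are conv output channels, "M" is a 2x2 max pool
# with stride 2. Taken from the paper's 16-layer configuration, not the slide.
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]
FC = [4096, 4096, 1000]

def count_params(cfg=CFG, fc=FC, in_channels=3, in_size=224):
    """Count weights and biases for 3x3 conv layers followed by FC layers."""
    channels, size, total = in_channels, in_size, 0
    for layer in cfg:
        if layer == "M":
            size //= 2  # pooling halves height and width
        else:
            total += 3 * 3 * channels * layer + layer  # 3x3 filters + biases
            channels = layer
    features = channels * size * size  # flattened input to the first FC layer
    for n_out in fc:
        total += features * n_out + n_out
        features = n_out
    return total

vgg16_params = count_params()
print(f"{vgg16_params:,} parameters (~{vgg16_params / 1e6:.0f}M)")
```

Most of the total comes from the first fully connected layer, which takes a 7x7x512 input under these assumptions.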
46. GoogLeNet (2014 Winner)
Figure: Inception module
• 22 layers
• Top-5 error 6.7%
• Important features:
– Only 5 million parameters! (Removes FC layers completely)
• Compared to AlexNet:
– 12x fewer params
– 2x more compute
– 6.67% (vs. 16.4%)
Szegedy et al. "Going deeper with convolutions" CVPR 2015
47. ResNet (2015 Winner)
• 152 layers
• Top-5 error 3.6%
• spatial dimension only 56x56!
He et al. "Deep Residual Learning for Image Recognition" CVPR 2016
48. DL Trend : CNN Models
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
49. DL Trend : Optimization algorithm
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
50. DL Trend : Top Hot Keywords
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
52. CNN for Classification
image → CNN → features → Fully connected layer → 1000-dim vector → “tabby cat”
1000-dim vector: e.g. vector of 1000 numbers giving probabilities for different classes.
end-to-end learning
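The step from the final 1000-dim vector to a label such as “tabby cat” is commonly a softmax followed by an argmax. The sketch below illustrates that step only; the raw scores are made up, and the class index used for “tabby cat” is a stand-in chosen for the example rather than something taken from the slide.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

NUM_CLASSES = 1000
TABBY_CAT = 281  # hypothetical index standing in for the "tabby cat" class

# Made-up output of the fully connected layer: one raw score per class.
logits = [0.0] * NUM_CLASSES
logits[TABBY_CAT] = 5.0

probs = softmax(logits)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```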
53. Localization / Detection
image → CNN → features → fully connected layer → Class probabilities + 4 numbers:
- X coord
- Y coord
- Width
- Height
E.g. YOLO (You Only Look Once)
(Demo: http://pjreddie.com/darknet/yolo/)
image → CNN → features → 1x1 CONV → 7x7x(5*B+C)
For each of 7x7 locations:
- [x,y,width,height,confidence]*B
- class
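The 7x7x(5*B+C) output shape can be unpacked with a little arithmetic. In the sketch below, `B = 2` boxes per cell and `C = 20` classes are assumptions (they are not given on the slide), and the per-cell layout simply follows the order listed above: B groups of [x, y, width, height, confidence] followed by C class scores. The actual memory layout used by the Darknet implementation may differ.

```python
S = 7   # grid size from the slide
B = 2   # boxes per cell (assumed)
C = 20  # number of classes (assumed)

depth = 5 * B + C            # values predicted per grid cell
total_outputs = S * S * depth

def decode_cell(cell, num_boxes, num_classes):
    """Split one cell's vector into box tuples and class scores.

    Layout is an assumption for illustration: boxes first, then classes.
    """
    assert len(cell) == 5 * num_boxes + num_classes
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(num_boxes)]
    class_scores = cell[5 * num_boxes:]
    return boxes, class_scores

# Dummy cell whose values are just their own positions, to show the slicing.
boxes, class_scores = decode_cell(list(range(depth)), B, C)
print(depth, total_outputs, boxes[1], len(class_scores))
```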
54. CNN for Pedestrian Detection
Faster R-CNN: CNN + Region Proposal Networks (RPNs)
Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” NIPS 2015
58. Semantic Segmentation
pixels in, pixels out
image (NxNx3) → CNN → features → deconv layers → image class “map”
Output: NxNx20 array of class probabilities at each pixel
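Turning the per-pixel class probabilities into a class “map” is typically an argmax over the last dimension. The sketch below shows only that final step on a tiny made-up array (2x2 pixels and 3 classes instead of NxN pixels and 20 classes); the function name and the numbers are assumptions for illustration.

```python
def class_map(probs):
    """Map an N x N x K array of class probabilities to an N x N label map."""
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in probs]

# Hypothetical 2x2 image with 3 classes per pixel.
probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]

labels = class_map(probs)
print(labels)
```

The output has the same spatial size as the input, which is the "pixels in, pixels out" property.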
65. Hardware
• Buy your own GPU machine
- NVIDIA DIGITS DevBox (TITAN X)
- NVIDIA DGX-1 (P100 GPUs)
• GPUs in the cloud
- Google Cloud Platform (GPU/TPU, TensorFlow)
- Amazon AWS EC2
- Microsoft Azure
VGG: ~2-3 weeks training with 4 GPUs
ResNet 101: 2-3 weeks with 4 GPUs
66. Q: How do I know what architecture to use?
Ans: don’t be a hero.
1. Take whatever works best on ILSVRC (latest ResNet)
2. Download a pretrained model
3. Potentially add/delete some parts of it
4. Finetune it on your application.
Andrej Karpathy, Bay Area Deep Learning School, 2016
67. Q: How do I know what hyperparameters to use?
Ans: don’t be a hero.
- Use whatever is reported to work best on ILSVRC.
- Play with the regularization strength (dropout rates)
Andrej Karpathy, Bay Area Deep Learning School, 2016
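To make "regularization strength (dropout rates)" concrete, here is a minimal sketch of dropout applied to a list of activations. It uses the common "inverted dropout" variant, which scales the surviving units during training so that nothing changes at test time; that choice, the function name, and the rates tried are assumptions for illustration, not settings recommended in the talk.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` and scale the
    survivors by 1 / (1 - rate) so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(activations)  # no-op at test time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000

for rate in (0.2, 0.5, 0.8):
    out = dropout(activations, rate, rng)
    dropped = sum(1 for o in out if o == 0.0) / len(out)
    mean = sum(out) / len(out)
    print(f"rate={rate}: dropped fraction {dropped:.2f}, mean activation {mean:.2f}")
```

A higher rate removes more units per training step, which is a stronger regularizer; the rate is the knob to tune when a fine-tuned model overfits or underfits.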