Deep Learning for Computer Vision

Yuan-Kai Wang
Fu Jen Catholic University
2017/05/26

Why Does Deep Learning Succeed? (1/4)
Big Data

Beast Processor
*** 2017 Google TPU

Technical Breakthrough
• Algorithm
• Stochastic gradient descent (SGD): fast convergence for learning
• ReLU activation function: mitigates the vanishing gradient problem
• Dropout: regularization
Figure: gradient descent (batch) vs. stochastic gradient descent
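The contrast between batch gradient descent and SGD can be sketched on a toy least-squares problem. This is a minimal illustration, not the training code of any real network: the data, the learning rates, and the function names `batch_gd` and `sgd` are all assumptions made for the example.

```python
import numpy as np

def batch_gd(X, y, lr=0.1, epochs=200):
    """Batch gradient descent: one weight update per pass over the full dataset."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error / 2
        w -= lr * grad
    return w

def sgd(X, y, lr=0.05, epochs=50, seed=0):
    """Stochastic gradient descent: one weight update per training example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit examples in random order
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

# Hypothetical toy data: y = 2*x0 - 3*x1, noise-free.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

# Same number of passes over the data, very different number of updates.
err_batch = np.linalg.norm(batch_gd(X, y, epochs=5) - true_w)
err_sgd = np.linalg.norm(sgd(X, y, epochs=5) - true_w)
print(f"after 5 epochs: batch error {err_batch:.4f}, SGD error {err_sgd:.6f}")
```

After the same five passes over the data, SGD has made 1000 updates against five for the batch version, which is one way to read "fast convergence" on this toy problem.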

Technical Breakthrough
• Architecture: hierarchical representation
"The Extraordinary Link Between Deep Neural Networks and the Nature of the Universe,"
MIT Technology Review, 2016/09.

Hubel/Wiesel Architecture
• D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981)
• Visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells

Neocognitron (Fukushima 1980)
“sandwich” architecture (SCSCSC…)
simple cells: modifiable parameters
complex cells: perform pooling

Computer Vision Features
SIFT, HoG, Haar, Textons
and many others: SURF, MSER, LBP, Color-SIFT, Color histogram, GLOH, ...
Traditional recognition:
Image/Video Pixels -> Hand-designed feature extraction -> Trainable classifier -> Object Class

Shallow vs Deep Architectures
Traditional recognition: “Shallow” architecture
Image/Video Pixels -> Hand-designed feature extraction -> Trainable classifier -> Object Class
Example: Image -> Low-level vision features (edges, SIFT, HOG, etc.) [feature extractor] -> Object detection / classification [classifier]
Deep learning: “Deep” architecture
Image/Video Pixels -> Layer 1 -> … -> Layer N -> Simple classifier -> Object Class

Learn Feature Hierarchy
Fill in the representation gap in recognition
Image/Video Pixels -> Layer 1 -> Layer 2 -> Layer 3 -> Simple Classifier
Feature representation:
Input data: Pixels
1st layer: “Edges”
2nd layer: “Object parts”
3rd layer: “Objects”
Learning algorithm: SGD, ReLU, dropout
B. Zhou, et al., "Object Detectors Emerge in Deep Scene CNNs," ICLR 2015

No More Handcrafted Features!

Taxonomy of Feature Learning Methods
Supervised, Shallow
• Support Vector Machine
• Logistic Regression
• Perceptron
Supervised, Deep
• Deep Neural Net
• Convolutional Neural Net (CNN)
• Recurrent Neural Net
Unsupervised, Shallow
• Autoencoder
• Restricted Boltzmann machines*
• Sparse coding*
Unsupervised, Deep
• Generative Adversarial Net (GAN)*
• Deep Belief Nets*
• Deep Boltzmann machines*
• Hierarchical Sparse Coding*
* supervised version exists
• Siamese Net

CNN Applications (1/3)
e.g. Google Photos search
Face Verification, Taigman et al. 2014 (FAIR)
Self-driving cars [Goodfellow et al. 2014]
Ciresan et al. 2013
Turaga et al. 2010

ATARI game playing, Mnih 2013
AlphaGo, Silver et al 2016
VizDoom
StarCraft

DeepDream reddit.com/r/deepdream NeuralStyle, Gatys et al. 2015
deepart.io, Prisma, etc.

CNN Example:
Recognition
OCR, House Number, Traffic Sign
Taigman et al. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” CVPR 2014

CNN Example:
Object/Pedestrian Detection

CNN Example:
Scene Labeling
Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013 (LeCun)
Pinheiro et al. "Recurrent Convolutional Neural Networks for Scene Labeling" ICML 2014

CNN Example:
Action Recognition from Videos
Simonyan et al. "Two-Stream Convolutional Networks for Action Recognition in Videos" NIPS 2014
A. Karpathy et al. "Large-scale Video Classification with Convolutional Neural Networks" CVPR 2014

Convolutional Neural Networks
(CNN, ConvNets)

CNN (Convnet) by LeCun in 1998
• Neural network with specialized
connectivity structure
• Stack multiple stages of feature
extractors
• Higher stages compute more
global, more invariant features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86(11): 2278–2324, 1998.

Basic Module in CNN
Input Image -> Convolution (Learned) -> Non-linearity -> Pooling -> Feature maps
• Feed-forward:
– Convolve input with learned filters
– Non-linearity (rectified linear)
– Pooling (local max)
• Supervised learning
• Train convolutional filters by back-propagating classification error
LeCun et al. 1998
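The feed-forward path of one basic module (convolve, rectify, pool) can be sketched in plain NumPy. This is only an illustration: the 6x6 input and the fixed 2x2 filter are hypothetical, whereas in a real CNN the filter values are learned by back-propagation.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation, the operation deep learning libraries call convolution."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear non-linearity."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping local max pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image" (hypothetical)
kernel = np.array([[-1.0, 1.0], [-1.0, 1.0]])       # fixed horizontal-gradient filter (hypothetical)
feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map.shape)  # (2, 2)
```

The 6x6 input becomes a 5x5 map after the 2x2 convolution and a 2x2 map after pooling, which shows how each stage trades spatial resolution for more compact features.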

Components of Each Layer
Pixels / Features
-> Filter with Dictionary (convolutional or tiled) + Non-linearity
-> [Optional] Spatial/Feature Pooling (Sum or Max)
-> [Optional] Normalization between feature responses
-> Output Features
Slide: R. Fergus

Convolutional Filtering
Input -> Feature Map
– Dependencies are local
– Translation equivariance
– Tied filter weights (few params)
– Stride 1, 2, … (faster, less mem.)
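The effect of stride and of tied weights can be made concrete with a little arithmetic. The sketch below uses the standard output-size formula; the layer sizes (32x32x3 input, six 5x5 filters) and the helper names are assumptions chosen for illustration, not figures from the slides.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Spatial output size for input width n and filter width k."""
    return (n + 2 * pad - k) // stride + 1

def conv_params(k, c_in, c_out):
    """Weights plus biases for c_out tied k x k filters over c_in input channels."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer with untied weights."""
    return n_in * n_out + n_out

# Hypothetical layer: 32x32x3 input, six 5x5 filters.
print(conv_output_size(32, 5, stride=1))   # 28
print(conv_output_size(32, 5, stride=2))   # 14
print(conv_params(5, 3, 6))                # 456
# A dense layer mapping the same input to a 28x28x6 output needs far more parameters.
print(dense_params(32 * 32 * 3, 28 * 28 * 6))
```

Doubling the stride roughly halves each spatial dimension of the feature map, and weight tying keeps the parameter count independent of the image size.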

Non-Linearity
• Every neuron performs a non-linear operation: σ(wx+b), with inputs x1, x2, x3, …, xd and weights w1, w2, w3, …, wd
– Tanh
– Sigmoid: 1/(1+exp(-x))
– Rectified linear unit (ReLU) * Preferred option
• Simplifies backprop
• Makes learning faster
• Avoids saturation issues
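The saturation issue can be seen by comparing the derivatives of the three non-linearities at large inputs. This is a small sketch; the function names and sample inputs are chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, near 0 for large |x|

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # at most 1, near 0 for large |x|

def relu(x):
    return np.maximum(x, 0.0)

def d_relu(x):
    return (x > 0).astype(float)  # exactly 1 for any positive input

def neuron(x, w, b, activation):
    """One unit: activation applied to the weighted sum w.x + b."""
    return activation(np.dot(w, x) + b)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid'", d_sigmoid(x))
print("tanh'   ", d_tanh(x))
print("relu'   ", d_relu(x))
```

For an input of 10 the sigmoid and tanh gradients are close to zero while the ReLU gradient stays at 1, so error signals passed back through many ReLU layers are not repeatedly shrunk by saturated units.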

Pooling
• Sum or max
• Non-overlapping / overlapping regions
• Boureau et al. ICML’10 for theoretical analysis
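Sum and max pooling over non-overlapping or overlapping regions differ only in the reduction and the stride. The sketch below assumes a hypothetical helper `pool2d` and a toy 4x4 feature map.

```python
import numpy as np

def pool2d(x, size, stride, op=np.max):
    """Slide a size x size window with the given stride and reduce each window with op."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map

max_nonoverlap = pool2d(x, size=2, stride=2, op=np.max)   # stride == size: regions do not overlap
sum_nonoverlap = pool2d(x, size=2, stride=2, op=np.sum)
max_overlap = pool2d(x, size=3, stride=1, op=np.max)      # stride < size: regions overlap

print(max_nonoverlap)  # [[ 5.  7.] [13. 15.]]
print(sum_nonoverlap)  # [[10. 18.] [42. 50.]]
print(max_overlap)     # [[10. 11.] [14. 15.]]
```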

Normalization
• Contrast normalization (across feature maps)
– Local mean = 0, local std. = 1 (7x7 Gaussian)
– Equalizes the feature maps
Figure: Feature Maps vs. Feature Maps After Contrast Normalization
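A simplified local contrast normalization can be sketched as a subtractive step (remove the local mean) followed by a divisive step (divide by the local standard deviation), both computed with a 7x7 Gaussian window. Several things here are assumptions: the sketch normalizes a single feature map spatially rather than pooling statistics across maps, and the Gaussian width `sigma`, the reflect padding, and the `eps` constant are illustrative choices.

```python
import numpy as np

def gaussian_kernel(size=7, sigma=2.0):
    """Normalized size x size Gaussian window (sigma is an assumed value)."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def local_filter(x, kernel):
    """Same-size weighted local average, using reflect padding at the borders."""
    r = kernel.shape[0] // 2
    padded = np.pad(x, r, mode="reflect")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kernel.shape[0], j:j + kernel.shape[1]] * kernel)
    return out

def local_contrast_normalize(x, size=7, sigma=2.0, eps=1e-6):
    """Subtract the local mean, then divide by the local standard deviation."""
    k = gaussian_kernel(size, sigma)
    centered = x - local_filter(x, k)
    local_std = np.sqrt(local_filter(centered ** 2, k))
    return centered / (local_std + eps)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(16, 16))               # stand-in for one feature map
normalized = local_contrast_normalize(fmap)
rescaled = local_contrast_normalize(3.0 * fmap + 5.0)
print(np.abs(normalized - rescaled).max())     # close to 0: gain and offset are removed
```

Rescaling and shifting the input leaves the output essentially unchanged, which is the sense in which the normalization equalizes feature maps with different gains.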

Compare: SIFT Descriptor
Image Pixels -> Apply Gabor filters -> Spatial pool (Sum) -> Normalize to unit length -> Feature Vector
Lowe [IJCV 2004]
Slide: R. Fergus

An Example of CNN
Input: e.g. 200K numbers -> Output: e.g. 10 numbers

CNN Classical Models
Comparison
AlexNet, GoogLeNet, VGG, ResNet

ImageNet Challenge: ILSVRC
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon Mechanical Turk
• Challenge: 1.2 million training images, 1000 classes
Karpathy et al. "Large-scale video classification with convolutional neural networks." CVPR 2014 (Fei-Fei Li)

car 99%
ILSVRC 2011 winner with 25.8% top-5 error rate

Going Deeper from 2012
Clarifai
He et al. "Deep Residual Learning for Image Recognition," CVPR 2016

AlexNet (2012 Winner)
• Similar framework to LeCun’98 but:
• More data (10^6 vs. 10^3 images)
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Better regularization for training (DropOut)
Alex Krizhevsky, I. Sutskever, and G. Hinton,
"ImageNet Classification with Deep Convolutional Neural Networks"
NIPS 2012

How Important Is Depth? (1/2)
AlexNet: 8 layers total
Trained on ImageNet
16.4% top-5 error
Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 3: Conv -> Layer 4: Conv -> Layer 5: Conv + Pool -> Layer 6: Full -> Layer 7: Full -> Softmax Output
• Remove top fully connected layer (Layer 7)
– Drop 16 million parameters
– Only 1.1% drop in performance!
• Remove layers 6 & 7
– 5.7% drop in performance

How Important Is Depth? (2/2)
AlexNet: 8 layers total
Trained on ImageNet
16.4% top-5 error
• Remove layers 3 & 4
• Remove layers 3, 4, 6, 7
Depth of network is key

ZFNet (2013 2nd, Improved AlexNet)
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 14.8%
Meaning of Each Layer in ZFNet
M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014

VGGNet (2014 2nd)
best model
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
19 layers
Top-5 error 7.3%
TOTAL memory: 24M * 4 bytes ~= 93MB / image
(only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Simonyan & Zisserman, "Very deep convolutional networks for large-scale image recognition" ICLR 2015
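The 138M parameter figure can be reproduced by counting weights and biases layer by layer. The sketch below assumes the 16-layer VGG layout (configuration D in the paper) with a 224x224x3 input and three fully connected layers of 4096, 4096, and 1000 units; the 19-layer variant has three more conv layers and a somewhat larger count.

```python
# Assumed VGG-16 layout: numbers are 3x3 conv output channels, "M" marks a 2x2 max pool.
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_params(cfg=CFG, in_channels=3, in_size=224, fc=(4096, 4096, 1000)):
    """Count weights and biases of all conv and fully connected layers."""
    total, channels, size = 0, in_channels, in_size
    for layer in cfg:
        if layer == "M":
            size //= 2                                   # pooling halves the spatial size, no parameters
        else:
            total += 3 * 3 * channels * layer + layer    # 3x3 filters plus one bias per filter
            channels = layer
    features = channels * size * size                    # flattened input to the first FC layer
    for units in fc:
        total += features * units + units
        features = units
    return total

print(vgg16_params())  # 138357544
```

Most of the count comes from the first fully connected layer (7 x 7 x 512 inputs to 4096 units), not from the convolutions.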

GoogLeNet (2014 Winner)
Inception module
22 layers
Top-5 error 6.7%
• Important features:
– Only 5 million parameters! (Removes FC layers completely)
• Compared to AlexNet:
– 12x fewer params
– 2x more compute
– 6.67% top-5 error (vs. 16.4%)
Szegedy et al. "Going deeper with convolutions" CVPR 2015

ResNet (2015 Winner)
152 layers
Top-5 error 3.6%
spatial dimension only 56x56!
He et al. "Deep Residual Learning for Image Recognition" CVPR 2016

DL Trend: CNN Models
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.

DL Trend: Optimization Algorithm
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.

DL Trend: Top Hot Keywords
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.

CNN Architectures for
Different Applications

CNN for Classification
image -> CNN features -> Fully connected layer -> 1000-dim vector -> “tabby cat”
end-to-end learning
e.g. vector of 1000 numbers giving probabilities for different classes.

Localization / Detection
image -> CNN features -> fully connected layer ->
Class probabilities
4 numbers:
- X coord
- Y coord
- Width
- Height
image -> CNN features -> 1x1 CONV -> 7x7x(5*B+C)
E.g. YOLO (You Only Look Once)
(Demo: http://pjreddie.com/darknet/yolo/)
For each of 7x7 locations:
- [x,y,width,height,confidence]*B
- class
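The 7x7x(5*B+C) output can be unpacked as follows. This is a sketch of the tensor layout only: the values B=2 and C=20 are the ones used in the original YOLO paper for PASCAL VOC, and the helper names and the per-cell ordering (boxes first, then class scores) are assumptions for illustration.

```python
def yolo_output_shape(grid=7, boxes=2, classes=20):
    """Shape of the detection tensor: one vector of 5*B + C numbers per grid cell."""
    return (grid, grid, 5 * boxes + classes)

def split_cell(cell, boxes=2, classes=20):
    """Split one cell's vector into B boxes [x, y, width, height, confidence] and C class scores."""
    assert len(cell) == 5 * boxes + classes
    box_list = [cell[5 * b:5 * b + 5] for b in range(boxes)]
    class_scores = cell[5 * boxes:]
    return box_list, class_scores

print(yolo_output_shape())                  # (7, 7, 30)

cell = list(range(30))                      # placeholder values for one grid cell
box_list, class_scores = split_cell(cell)
print(box_list)                             # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(len(class_scores))                    # 20
```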

CNN for Pedestrian Detection
Faster R-CNN: CNN + Region Proposal Networks (RPNs)
Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” NIPS 2015

Multi-scale Architecture
Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013 (LeCun)

Multi-modal Architecture
Frome et al. "Devise: A deep visual-semantic embedding model." NIPS 2013 (Bengio)

Multi-task Architecture
Zhang et al. "Panda: Pose aligned networks for deep attribute modeling" CVPR 2014.

Semantic Segmentation
pixels in, pixels out
image (NxNx3) -> CNN features -> deconv layers -> image class “map”
NxNx20 array of class probabilities at each pixel
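Turning an NxNx20 array of per-pixel scores into a class map amounts to a softmax over the class axis followed by an argmax. In the sketch below the random scores merely stand in for the output of the deconv layers, and N=8 is an arbitrary choice.

```python
import numpy as np

def pixelwise_softmax(scores):
    """scores: (N, N, C) raw class scores -> (N, N, C) probabilities at each pixel."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def class_map(probs):
    """Pick the most probable class at each pixel -> (N, N) integer label map."""
    return probs.argmax(axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 8, 20))   # stand-in for the network output, N=8, 20 classes
probs = pixelwise_softmax(scores)
labels = class_map(probs)
print(probs.shape, labels.shape)       # (8, 8, 20) (8, 8)
```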

Convolution and Deconvolution
ConvNet (CNN) -> DeconvNet (Deconvolutional Layer)
Convolutional Autoencoder, Variational Autoencoder
Generative Adversarial Net

Image Denoise by
Generative Adversarial Net

Object Tracking
by CNN and RNN

Person Reidentification by
CNN, RNN and Siamese

Deep Neural Networks
in Practice

CNN Libraries (open source)
• TensorFlow (Google): C++, Python
• Torch: Lua (PyTorch: Python)
• Keras: Python
• Cuda-convnet (Google): C/C++, Python
• Caffe2 (Facebook): C/C++, Matlab, Python
• Caffe (Berkeley): C/C++, Matlab, Python
• Overfeat (NYU): C/C++
• ConvNetJS: JavaScript
• MatConvNet (VLFeat): Matlab
• DeepLearn Toolbox: Matlab

Hardware
• Buy your own GPU machine
- NVIDIA DIGITS DevBox (TITAN X)
- NVIDIA DGX-1 (P100 GPUs)
• GPUs in the cloud
- Google Cloud Platform (GPU/TPU, TensorFlow)
- Amazon AWS EC2
- Microsoft Azure
VGG: ~2-3 weeks training with 4 GPUs
ResNet 101: 2-3 weeks with 4 GPUs

Q: How do I know
what architecture to use?
Ans: don’t be a hero.
1. Take whatever works best on ILSVRC (latest ResNet)
2. Download a pretrained model
3. Potentially add/delete some parts of it
4. Finetune it on your application.
Andrej Karpathy, Bay Area Deep Learning School, 2016

Q: How do I know
what hyperparameters to use?
Ans: don’t be a hero.
- Use whatever is reported to work best on ILSVRC.
- Play with the regularization strength (dropout rates)
Andrej Karpathy, Bay Area Deep Learning School, 2016
