Deep Learning for Computer Vision
Yuan-Kai Wang
Fu Jen Catholic University
2017/05/26
1
What Is Deep Learning
2
3
Google Lens
4
Google I/O 2017
5
Why Does Deep Learning Succeed? (1/4)
Big Data
6
Why Does Deep Learning Succeed? (2/4)
Beast Processor
7
*** 2017 Google TPU
Why Does Deep Learning Succeed? (3/4)
• Algorithm
• Stochastic gradient descent (SGD): fast convergence for learning
• ReLU activation function: mitigates the vanishing gradient problem
• Dropout: regularization
Technical Breakthrough
gradient descent(batch)
stochastic gradient descent
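To make the contrast between the two update rules concrete, here is a minimal sketch in plain Python. The toy problem (fitting y = 2x + 1 with a little noise), the learning rates, and every function name are assumptions chosen for illustration; none of them come from the slides. Batch gradient descent makes one update per pass over all the data, while SGD makes one update per sample (or small mini-batch).

```python
import random


def make_data(n=200, seed=0):
    """Toy regression data: y = 2x + 1 plus small Gaussian noise (illustrative)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def grad(w, b, xs, ys):
    """Gradient of 0.5 * mean((w*x + b - y)^2) with respect to (w, b)."""
    n = len(xs)
    gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def batch_gd(xs, ys, lr=0.5, epochs=100):
    """Batch gradient descent: one parameter update per full pass."""
    w = b = 0.0
    for _ in range(epochs):
        gw, gb = grad(w, b, xs, ys)
        w, b = w - lr * gw, b - lr * gb
    return w, b


def sgd(xs, ys, lr=0.1, epochs=50, batch_size=1, seed=1):
    """Stochastic gradient descent: one update per shuffled mini-batch."""
    rng = random.Random(seed)
    w = b = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, len(idx), batch_size):
            chunk = idx[start:start + batch_size]
            gw, gb = grad(w, b, [xs[i] for i in chunk], [ys[i] for i in chunk])
            w, b = w - lr * gw, b - lr * gb
    return w, b


if __name__ == "__main__":
    xs, ys = make_data()
    print("batch GD:", batch_gd(xs, ys))
    print("SGD     :", sgd(xs, ys))
```

Both routines should land close to w = 2, b = 1 on this toy problem; the SGD path is noisier because each step sees only part of the data.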
8
Why Does Deep Learning Succeed? (4/4)
• Architecture: hierarchical representation
Technical Breakthrough
The Extraordinary Link Between Deep Neural Networks and the Nature of the Universe
MIT Technology Review, 2016/09.
9
10
Neural Network Evolution
11
12
Hubel/Wiesel Architecture
• D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981)
• Visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells
13
“sandwich” architecture (SCSCSC…)
simple cells: modifiable parameters
complex cells: perform pooling
Neocognitron
Fukushima 1980
14
SIFT Haar
Textons
Computer Vision Features
and many others: SURF, MSER, LBP, Color-SIFT, Color histogram, GLOH, …
15
Hand-designed
feature extraction
Trainable
classifier
Image/
Video
Pixels
Object
Class
Traditional recognition
HoG
Shallow vs Deep Architectures
Hand-designed
feature extraction
Trainable
classifier
Image/
Video
Pixels
Object
Class
Layer 1 Layer N
Simple
classifier
Object
Class
Image/
Video
Pixels
Traditional recognition: “Shallow” architecture
Deep learning: “Deep” architecture
…
16
Image Low-level
vision features
(edges, SIFT, HOG, etc.)
Object detection
/ classification
feature extractor classifier
Learn Feature Hierarchy
Fill in representation gap in recognition
Feature representation
Input data
1st layer: “Edges”
2nd layer: “Object parts”
3rd layer: “Objects”
Pixels
Layer 1
Simple
Classifier
Image/Video
Pixels
Layer 2
Layer 3
"Object Detectors Emerge in Deep Scene CNNs,"
B. Zhou, et al., ICLR 2015
17
Learning algorithm: SGD, ReLU, dropout
No More Handcrafted Features!
18
Taxonomy of Feature Learning Methods
• Support Vector Machine
• Logistic Regression
• Perceptron
• Deep Neural Net
• Convolutional Neural Net (CNN)
• Recurrent Neural Net
• Autoencoder
• Restricted Boltzmann machines*
• Sparse coding*
• Generative Adversarial Net (GAN)*
• Deep Belief Nets*
• Deep Boltzmann machines*
• Hierarchical Sparse Coding*
Deep
Shallow
Supervised
Unsupervised
* supervised version exists
19
• Siamese Net
e.g. Google Photos search
Face Verification, Taigman et al. 2014 (FAIR)
Self-driving cars[Goodfellow et al. 2014]
Ciresan et al. 2013
Turaga et al 2010
CNN Applications (1/3)
20
ATARI game playing, Mnih 2013
AlphaGo, Silver et al 2016
VizDoom
StarCraft
CNN Applications (2/3)
21
DeepDream reddit.com/r/deepdream NeuralStyle, Gatys et al. 2015
deepart.io, Prisma, etc.
CNN Applications (3/3)
22
CNN Example:
Recognition
OCR House Number Traffic Sign
Taigman et al. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” CVPR 2014
23
CNN Example:
Object/Pedestrian Detection
24
CNN Example:
Scene Labeling
25
Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013 (LeCun)
Pinheiro et al. "Recurrent Convolutional Neural Networks for Scene Labeling" ICML 2014
CNN Example:
Action Recognition from Videos
26
Simonyan et al. "Two-Stream Convolutional Networks for Action Recognition in Videos" NIPS 2014
A. Karpathy et al. "Large-scale Video Classification with Convolutional Neural Networks" CVPR 2014
Convolutional Neural Networks
(CNN, ConvNets)
27
CNN (Convnet) by LeCun in 1998
• Neural network with specialized
connectivity structure
• Stack multiple stages of feature
extractors
• Higher stages compute more
global, more invariant features
• Classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
Gradient-based learning applied to document recognition,
Proceedings of the IEEE 86(11): 2278–2324, 1998.
28
Basic Module in CNN
Input Image
Convolution
(Learned)
Non-linearity
Pooling
• Feed-forward:
– Convolve input with learned filters
– Non-linearity (rectified linear)
– Pooling (local max)
• Supervised learning
• Train convolutional filters by
back-propagating classification error
LeCun et al. 1998
Feature maps
29
Components of Each Layer
Pixels /
Features
Filter with
Dictionary
(convolutional
or tiled)
Spatial/Feature
(Sum or Max)
Normalization
between
feature
responses
Output
Features
+ Non-linearity
[Optional]
Slide: R. Fergus
[Optional]
30
Convolutional Filtering
Input Feature Map
– Dependencies are local
– Translation equivariance
– Tied filter weights (few params)
– Stride 1,2,… (faster, less mem.)
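The properties listed above can be seen in a small sketch of a single-channel convolution in plain Python. The function name `conv2d`, the "valid" (no padding) boundary handling, and the toy inputs are assumptions for illustration. As in most deep learning libraries, the operation shown is technically cross-correlation (the filter is not flipped).

```python
def conv2d(image, kernel, stride=1):
    """Valid 2-D cross-correlation of one channel with one shared filter.

    image and kernel are lists of rows. The same kernel weights are reused
    at every position (tied weights), and each output value depends only on
    a local kh x kw window of the input (local dependencies).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    kernel = [[1, 0],
              [0, 1]]
    print(conv2d(image, kernel, stride=1))  # 3x3 output
    print(conv2d(image, kernel, stride=2))  # 2x2 output: fewer positions, less memory
```

A 2x2 filter has only four weights regardless of image size, which is where the "few params" point comes from; a larger stride evaluates fewer windows, so the output map is smaller and cheaper to compute.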
31
Non-Linearity
• Every neuron performs a
non-linear operation
– Tanh
– Sigmoid: 1/(1+exp(-x))
– Rectified linear unit (ReLU)
• Simplifies backprop
• Makes learning faster
• Avoids saturation issues
* Preferred option
Neuron diagram: inputs x1, x2, x3, …, xd with weights w1, w2, w3, …, wd; output σ(wx+b)
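The three non-linearities and the single-neuron computation σ(wx+b) can be sketched as follows. The helper names and the sample inputs are assumptions for illustration. The gradient functions are included to show the saturation point: for large inputs the sigmoid and tanh gradients shrink toward zero, while the ReLU gradient stays at 1 for any positive input.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)). Not guarded against overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def neuron(x, w, b, activation=relu):
    """One neuron: weighted sum of inputs plus bias, then a non-linearity."""
    return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)


if __name__ == "__main__":
    for z in (0.0, 2.0, 10.0):
        print(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
    print(neuron([1.0, 2.0], [0.5, -1.0], 0.5))
```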
32
Pooling
• Sum or max
• Non-overlapping / overlapping regions
• Boureau et al. ICML’10 for theoretical analysis
Sum
Max
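A minimal sketch of sum and max pooling over one feature map, in plain Python. The function name `pool2d` and its parameters are assumptions for illustration. Setting `stride` equal to `size` gives non-overlapping regions; a smaller stride gives overlapping ones.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Pool a 2-D feature map (list of rows) over size x size windows."""
    if mode == "max":
        reduce = max
    elif mode == "sum":
        reduce = sum
    else:
        raise ValueError("mode must be 'max' or 'sum'")
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            reduce(
                fmap[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    fmap = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
    print(pool2d(fmap, size=2, stride=2, mode="max"))  # non-overlapping max
    print(pool2d(fmap, size=2, stride=2, mode="sum"))  # non-overlapping sum
    print(pool2d(fmap, size=3, stride=1, mode="max"))  # overlapping max
```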
33
Normalization
• Contrast normalization (across feature maps)
– Local mean = 0, local std. = 1 (7x7 Gaussian)
– Equalizes the feature maps
Feature Maps
Feature Maps
After Contrast Normalization
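The idea of "local mean = 0, local std. = 1" can be sketched with a simplified, single-map version of local contrast normalization. This is an assumption-laden illustration, not the exact scheme on the slide: it works on one feature map only (the slide normalizes across maps), and the Gaussian width `sigma`, the border handling, and the `eps` floor are choices made here for the example.

```python
import math


def gaussian_kernel(size=7, sigma=2.0):
    """size x size Gaussian weights that sum to 1."""
    c = size // 2
    k = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]


def local_contrast_normalize(fmap, size=7, sigma=2.0, eps=1e-6):
    """Subtract a Gaussian-weighted local mean, then divide by the local std."""
    h, w = len(fmap), len(fmap[0])
    k = gaussian_kernel(size, sigma)
    c = size // 2

    def weighted(value_at, i, j):
        # Gaussian-weighted average around (i, j); weights are renormalized
        # near the border so only in-bounds pixels contribute.
        num = den = 0.0
        for di in range(size):
            for dj in range(size):
                y, x = i + di - c, j + dj - c
                if 0 <= y < h and 0 <= x < w:
                    num += k[di][dj] * value_at(y, x)
                    den += k[di][dj]
        return num / den

    mean = [[weighted(lambda y, x: fmap[y][x], i, j) for j in range(w)]
            for i in range(h)]
    centered = [[fmap[i][j] - mean[i][j] for j in range(w)] for i in range(h)]
    std = [[math.sqrt(weighted(lambda y, x: centered[y][x] ** 2, i, j))
            for j in range(w)] for i in range(h)]
    return [[centered[i][j] / max(std[i][j], eps) for j in range(w)]
            for i in range(h)]
```

One consequence worth noting: multiplying the whole map by a constant leaves the normalized output essentially unchanged, which is the sense in which the step "equalizes" feature maps with different response scales.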
34
Compare: SIFT Descriptor
Image
Pixels Apply
Gabor filters
Spatial pool
(Sum)
Normalize to
unit length
Feature
Vector
Slide: R. Fergus
Lowe
[IJCV 2004]
35
e.g. 200K numbers e.g. 10 numbers
An Example of CNN
36
CNN Classical Models
Comparison
AlexNet, GoogLeNet, VGG, ResNet
37
ImageNet Challenge: ILSVRC
38
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon Turk
• Challenge: 1.2 million training images, 1000 classes
Karpathy et al. "Large-scale video classification with convolutional neural networks." CVPR 2014 (Fei-Fei Li)
car 99%
ILSVRC 2011 winner with 25.8% top-5 error rate
39
Going Deeper from 2012
40
Clarifai
He et al. "Deep Residual Learning for Image Recognition," CVPR 2016
AlexNet (2012 Winner)
• Similar framework to LeCun’98 but:
• More data (10^6 vs. 10^3 images)
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Better regularization for training (DropOut)
Alex Krizhevsky, I. Sutskever, and G. Hinton,
"ImageNet Classification with Deep Convolutional Neural Networks"
NIPS 2012
41
AlexNet: 8 layers total
Trained on ImageNet
16.4% top-5 error
Layer 4: Conv
Layer 3: Conv
Layer 2: Conv + Pool
Layer 1: Conv + Pool
Layer 6: Full
Softmax Output
Layer 5: Conv + Pool
Layer 7: Full
Input Image
How Important Is Depth? (1/2)
42
Layer 6: Full
Layer 5: Conv +
Pool
Layer 4: Conv
Layer 3: Conv
Layer 2: Conv + Pool
Layer 1: Conv + Pool
Softmax Output
Input Image
Remove top fully connected layer(Layer 7)
Drop 16 million parameters
Only 1.1% drop in performance!
Layer 5: Conv +
Pool
Layer 4: Conv
Layer 3: Conv
Layer 2: Conv + Pool
Layer 1: Conv + Pool
Softmax Output
Input Image
Remove layers 6 & 7
Drop 50 million parameters
5.7% drop in performance
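The parameter counts quoted in this ablation can be checked with a little arithmetic. The sketch below assumes the standard AlexNet layer sizes from Krizhevsky et al. (a 6x6x256 feature map after layer 5, two 4096-unit fully connected layers, 1000 output classes); these sizes are not stated on the slide, and the function and constant names are illustrative.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


POOL5 = 6 * 6 * 256   # flattened output of layer 5 (assumed standard AlexNet)
FC = 4096             # width of layers 6 and 7
CLASSES = 1000        # softmax outputs

# Full network head: layer 6 -> layer 7 -> softmax
full = fc_params(POOL5, FC) + fc_params(FC, FC) + fc_params(FC, CLASSES)

# Layer 7 removed: layer 6 feeds the softmax directly
without_7 = fc_params(POOL5, FC) + fc_params(FC, CLASSES)

# Layers 6 and 7 removed: layer 5 feeds the softmax directly
without_6_7 = fc_params(POOL5, CLASSES)

dropped_7 = full - without_7        # roughly 16.8 million
dropped_6_7 = full - without_6_7    # roughly 49.4 million

if __name__ == "__main__":
    print(f"removing layer 7 drops {dropped_7:,} parameters")
    print(f"removing layers 6 and 7 drops {dropped_6_7:,} parameters")
```

Under these assumptions the results line up with the "16 million" and "50 million" figures on the slide.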
AlexNet: 8 layers total
Trained on ImageNet
16.4% top-5 error
Layer 4: Conv
Layer 3: Conv
Layer 2: Conv + Pool
Layer 1: Conv + Pool
Layer 6: Full
Softmax Output
Layer 5: Conv + Pool
Layer 7: Full
Input Image
How Important Is Depth? (2/2)
43
Remove layers 3 & 4
Drop 1 million parameters
3.0% drop in performance
Layer 6: Full
Softmax Output
Layer 5: Conv + Pool
Layer 7: Full
Input Image
Layer 2: Conv + Pool
Layer 1: Conv + Pool
Layer 1: Conv + Pool
Layer 2: Conv + Pool
Layer 5: Conv + Pool
Input Image
Softmax Output
Remove layers 3, 4, 6 ,7
33.5% drop in performance
Depth of network is key
ZFNet (2013 2nd, Improved AlexNet)
44
M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 14.8%
Meaning of Each Layer in ZFNet
best model
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
VGGNet (2014 2nd)
45
Simonyan & Zisserman, "Very deep convolutional networks for large-scale image recognition" ICLR 2015
TOTAL memory: 24M * 4 bytes ~= 93MB / image
(only forward! ~*2 for bwd)
TOTAL params: 138M parameters
19 layers
Top-5 error 7.3%
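The 138M parameter total can be reproduced by counting weights layer by layer. The sketch below assumes the 16-weight-layer configuration from Simonyan & Zisserman (often called VGG-16), a 224x224 RGB input, and a 4096-4096-1000 fully connected head; the slide does not spell out the configuration, so treat the layer list and the function name as assumptions.

```python
# Channel counts of the 3x3 conv layers; "M" marks a 2x2 max pool with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg_param_count(cfg=VGG16_CFG, in_channels=3, input_size=224,
                    fc_sizes=(4096, 4096, 1000)):
    """Count weights and biases for a VGG-style network."""
    total = 0
    channels, size = in_channels, input_size
    for item in cfg:
        if item == "M":
            size //= 2                            # pooling halves width and height
        else:
            total += 3 * 3 * channels * item + item   # 3x3 filters plus biases
            channels = item                       # pad 1, stride 1 keeps spatial size
    n_in = channels * size * size                 # flattened input to the first FC layer
    for n_out in fc_sizes:
        total += n_in * n_out + n_out
        n_in = n_out
    return total


if __name__ == "__main__":
    print(f"{vgg_param_count():,} parameters")
```

Most of the total sits in the first fully connected layer (7x7x512 inputs to 4096 units), not in the convolutions.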
Inception module
GoogLeNet (2014 Winner)
46
Szegedy et al. "Going deeper with convolutions" CVPR 2015
• Important features:
Only 5 million parameters!
(Removes FC layers completely)
• Compared to AlexNet:
12x fewer params
2x more compute
6.67% top-5 error (vs. 16.4%)
22 layers
Top-5 error 6.7%
152 layers
Top-5 error 3.6%
ResNet (2015 Winner)
47
He et al. "Deep Residual Learning for Image Recognition" CVPR 2016
spatial dimension
only 56x56!
DL Trend : CNN Models
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
48
DL Trend : Optimization algorithm
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
49
DL Trend : Top Hot Keywords
Source: arxiv-sanity database of 28,303 arXiv machine learning papers over the last 5 years, 2017/04.
50
CNN Architectures for
Different Applications
51
CNN for Classification
52
“tabby cat”
1000-dim vector
end-to-end learning
image CNN features
e.g. vector of 1000 numbers giving
probabilities for different classes.
Fully connected layer
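The last step of the classifier, turning the fully connected layer's raw scores into class probabilities, is usually a softmax. A minimal sketch follows; the three-class label list and the score values are made up for illustration, where a real ImageNet classifier would have 1000 entries.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1.

    Subtracting the maximum first avoids overflow in exp() without
    changing the result.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, labels):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = ["dog", "tabby cat", "car"]   # hypothetical classes
    scores = [0.1, 3.0, -1.0]              # hypothetical network outputs
    print(predict(scores, labels))
```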
Localization / Detection
image CNN features
fully connected layer
Class
probabilities
4 numbers:
- X coord
- Y coord
- Width
- Height
53
image CNN features
1x1 CONV
E.g. YOLO (You Only Look Once)
(Demo: http://pjreddie.com/darknet/yolo/)
7x7x(5*B+C)
For each of 7x7 locations:
- [x,y,width,height,confidence]*B
- class
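The output size 7x7x(5*B+C) is easy to work through with numbers. The sketch below uses B = 2 boxes per cell and C = 20 classes, the values from the original YOLO paper for PASCAL VOC; they are not given on the slide. The ordering of values inside each cell vector (boxes first, then class scores) is an assumption made for readability, and real implementations may lay the vector out differently.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S grid cells, 5*B + C values each."""
    return (S, S, 5 * B + C)


def split_cell(cell, B=2, C=20):
    """Split one grid cell's vector into B boxes and C class scores.

    Each box is (x, y, width, height, confidence). The layout is assumed.
    """
    if len(cell) != 5 * B + C:
        raise ValueError("cell vector has the wrong length")
    boxes = [tuple(cell[5 * k:5 * k + 5]) for k in range(B)]
    class_scores = list(cell[5 * B:])
    return boxes, class_scores


if __name__ == "__main__":
    S, _, depth = yolo_output_shape()
    print("values per cell:", depth)          # 5*2 + 20
    print("values per image:", S * S * depth)
```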
CNN for Pedestrian Detection
54
Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” NIPS 2015
Region Proposal Networks (RPNs)
CNN
Faster R-CNN
Multi-scale Architecture
55
Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013 (LeCun)
Multi-modal Architecture
56
Frome et al. "Devise: A deep visual-semantic embedding model." NIPS 2013 (Bengio)
Multi-task Architecture
57
Zhang et al. "Panda: Pose aligned networks for deep attribute modeling" CVPR 2014.
Semantic Segmentation
pixels in, pixels out
58
image CNN features
NxNx3
deconv layers
NxNx20 array of class
probabilities at each pixel
image class “map”
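Turning the NxNx20 array of per-pixel class probabilities into a class "map" is a per-pixel argmax. A minimal sketch, using a tiny 2x2 image with 3 classes instead of 20 so the example stays readable; the function name and the sample values are assumptions.

```python
def class_map(probs):
    """probs[y][x] is a list of class probabilities for pixel (y, x).

    Returns a 2-D list holding the index of the most probable class
    at each pixel.
    """
    return [
        [max(range(len(p)), key=p.__getitem__) for p in row]
        for row in probs
    ]


if __name__ == "__main__":
    probs = [
        [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
        [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
    ]
    print(class_map(probs))
```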
Convolution and Deconvolution
ConvNet
(CNN)
DeconvNet
(Deconvolutional
Layer)
Convolutional Autoencoder, Variational Autoencoder
Generative Adversarial Net
59
ConvNet
DeconvNet
Image Denoising by
Generative Adversarial Net
60
Object Tracking
by CNN and RNN
61
Person Reidentification by
CNN, RNN and Siamese
62
Deep Neural Networks
in Practice
63
CNN Libraries (open source)
64
• TensorFlow (Google): C++, Python
• Torch: Lua (PyTorch: Python)
• Keras: Python
• Cuda-convnet (Google): C/C++, Python
• Caffe2 (Facebook): C/C++, Matlab, Python
• Caffe (Berkeley): C/C++, Matlab, Python
• Overfeat (NYU): C/C++
• ConvNetJS: JavaScript
• MatConvNet (VLFeat): Matlab
• DeepLearn Toolbox: Matlab
Hardware
65
• Buy your own GPU machine
- NVIDIA DIGITS DevBox (TITAN X)
- NVIDIA DGX-1 (P100 GPUs)
• GPUs in the cloud
- Google Cloud Platform (GPU/TPU, TensorFlow)
- Amazon AWS EC2
- Microsoft Azure
VGG: ~2-3 weeks training with 4 GPUs
ResNet 101: 2-3 weeks with 4 GPUs
Q: How do I know
what architecture to use?
Ans: don’t be a hero.
1. Take whatever works best on ILSVRC (latest ResNet)
2. Download a pretrained model
3. Potentially add/delete some parts of it
4. Fine-tune it on your application.
66
Andrej Karpathy, Bay Area Deep Learning School, 2016
Q: How do I know
what hyperparameters to use?
Ans: don’t be a hero.
- Use whatever is reported to work best on ILSVRC.
- Play with the regularization strength (dropout rates)
67
Andrej Karpathy, Bay Area Deep Learning School, 2016
68
Thank
you!
