Object detection
Agenda
• Selective search
• RCNN family (Two stage)
• Retinanet (One stage)
• Anchors
• Losses
• Stats
• mAP
Classification vs Detection
Classification
Object Detection
Goal:
Problem: Where do we look in the image for the object?
(Example image label: Kitten)
Segmentation
Idea: If we correctly segment the image before running object
recognition, we can use our segmentations as candidate objects.
Advantages: Can be efficient, makes no assumptions about object
sizes or shapes.
Selective search
• Start by oversegmenting the input image
“Efficient graph-based image segmentation”, Felzenszwalb and Huttenlocher, IJCV 2004
Image gradients
Image gradients
Similarity measures
• Color: 25-bin color histogram for each channel = 75-dim vector (RGB)
• Texture: HOG-like Gaussian derivatives of the image in 8 directions for each channel, with a 10-bin histogram for each direction and channel = 8 × 10 × 3 = 240-dim vector per region.
• Size: size similarity encourages smaller regions to merge early. It ensures that region proposals at all scales are formed in all parts of the image.
• Shape: measures how well two regions (ri and rj) fit into each other. If ri fits into rj, merge them to fill gaps.
Selective search
1. Merge the two most similar regions based on S.
2. Update similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
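The merge loop above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published implementation: it uses only a color-histogram cue (histogram intersection) as the similarity S, and the region dictionaries, `hierarchical_grouping`, and the other names are hypothetical.

```python
import numpy as np

def color_similarity(h1, h2):
    """Histogram intersection of two normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def merge(ra, rb):
    """Merge two regions: size-weighted histogram, union bounding box."""
    size = ra["size"] + rb["size"]
    (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) = ra["box"], rb["box"]
    return {
        "size": size,
        "hist": (ra["size"] * ra["hist"] + rb["size"] * rb["hist"]) / size,
        "box": (min(ax0, bx0), min(ay0, by0), max(ax1, bx1), max(ay1, by1)),
    }

def hierarchical_grouping(regions, neighbors):
    """regions: dict id -> region; neighbors: iterable of adjacent id pairs.

    Returns the bounding boxes of every region that ever existed,
    which serve as the object-location proposals.
    """
    regions = dict(regions)
    pairs = {frozenset(p) for p in neighbors}
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while pairs:
        # Step 1: pick the most similar pair of neighbouring regions.
        best = max(pairs, key=lambda p: color_similarity(*[regions[i]["hist"] for i in p]))
        a, b = best
        regions[next_id] = merge(regions.pop(a), regions.pop(b))
        proposals.append(regions[next_id]["box"])
        # Step 2: neighbours of a or b become neighbours of the new region.
        pairs = {
            frozenset({next_id} | (p - best)) if p & best else p
            for p in pairs if p != best
        }
        next_id += 1  # Step 3: repeat until no neighbouring pairs remain.
    return proposals
```

Similarities are recomputed on every pass for brevity; a real implementation would cache them and update only the pairs touching the new region.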
Selective search
• Use hierarchical segmentation: start with small superpixels and merge based on diverse cues
• Take bounding boxes of all generated regions and treat them as possible object locations
Selective search
Stats
• Recall is the proportion of objects that are covered by some box with >0.5 overlap
(Table: selected settings)
Region proposals!
R-CNN: Region proposals + CNN features
R-CNN details
• Cons
• Training is slow (84h) and takes a lot of disk space
• 2000 CNN passes per image
• Inference (detection) is slow (47s / image with VGG16)
• The selective search algorithm is fixed: no learning happens at the proposal stage, which can lead to bad candidate region proposals.
Fast R-CNN
• ConvNet: forward the whole image through the ConvNet
• “conv5” feature map of the image
• Region proposals (input to RoI pooling)
• “RoI Pooling” layer
• FCs: fully-connected layers
• Linear + softmax: softmax classifier
• Linear: bounding-box regressors
Fast R-CNN
• Pros
• Less compute overhead
• 2.3 seconds per image inference time
• Cons
• Inference at 2.3 seconds per image is still slow for real-life use!
• The selective search algorithm is fixed: no learning happens at the proposal stage, which can lead to bad candidate region proposals.
Fast R-CNN training
• ConvNet → FCs → two output heads (all trainable)
• Linear + softmax head: log loss
• Linear head: smooth L1 loss
• Multi-task loss: log loss + smooth L1 loss
Speed comparison
Faster R-CNN
Region proposal network (RPN)
• Slide a small window over the feature map
• Predict object/no object
• Regress bounding box coordinates
• Box regression is with reference to anchors (3 scales x 3 aspect ratios)
Loss
i: index of an anchor in a mini-batch
p_i: predicted probability that anchor i is an object
p*_i: ground-truth label, 1 if the anchor is positive and 0 if the anchor is negative
t_i: the 4 predicted bounding-box coordinates
t*_i: the ground-truth box coordinates associated with a positive anchor
L_reg(t_i, t*_i) = R(t_i − t*_i), where R is the robust loss function (smooth L1)
Classification+Regression
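For reference, these definitions fit together in the multi-task objective given in the Faster R-CNN paper. The normalizers N_cls and N_reg and the balancing weight λ are symbols from that paper, not from this slide:

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

The factor p*_i switches the regression term on only for positive anchors.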
Online hard example mining
• Class imbalance hurts training.
• With mostly background anchors, the model is trained to learn background rather than to detect objects.
• Sort anchors by their calculated loss, apply NMS
• Pick the top ones such that the ratio between the picked negatives and positives is at most 3:1.
• Faster R-CNN samples 256 anchors per image: up to 128 positive, the rest negative
Speed comparison
Faster R-CNN
• Pros
• 0.2 seconds per image inference time, fast enough for many real-life uses
• Uses a trainable RPN instead of selective search, so the proposals improve with training
Strided convolutions (refresher)
• Stride 1 convolution with 3x3 kernel
• Stride 2 convolution with 3x3 kernel
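As a quick check on how stride changes the output resolution, here is the standard output-size formula as a small sketch. The function name and the default of no padding are assumptions for illustration.

```python
def conv_output_size(n, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution over an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1
```

For a 5-pixel input and a 3x3 kernel, stride 1 gives 3 outputs per axis and stride 2 gives 2.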
IOU: Intersection over union (refresher)
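A minimal sketch of IoU for two axis-aligned boxes, assuming the corner format (x1, y1, x2, y2); the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0
```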
NMS: non max suppression (refresher)
Initial predicted boxes → boxes filtered (suppressed) by IOU
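Greedy NMS can be sketched as follows. This assumes boxes in (x1, y1, x2, y2) format with positive area; the function name and the default threshold are illustrative choices.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-max suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # IoU of the current best box with all remaining boxes.
        x0 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y0 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x1 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y1 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        overlap = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[overlap <= iou_threshold]
    return keep
```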
Why do one-stage detectors trail in accuracy?
Two-stage:
• The proposal stage rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples.
• In the classification stage, sampling heuristics are applied: a fixed foreground-to-background ratio of 1:3, or online hard example mining (OHEM).
One-stage:
• Has to process a much larger set of candidate object locations regularly sampled across an image, which amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios.
• Extreme foreground-background class imbalance is encountered.
Activation maps
How about predicting from multiple maps?
As the image goes deeper through the network, resolution decreases and semantic value increases.
Feature pyramid networks (FPN)
• Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
• Top-down + lateral connections
Retinanet
• Backbone: produces activation maps at different pyramid levels
• Can be: DenseNet, VGG, MobileNet
Retinanet - Architecture
Anchors
• Aspect ratios: 0.5, 1, 2
• Scales: 2^0, 2^(1/3), 2^(2/3) ≈ 1, 1.26, 1.59
• Strides: 8, 16, 32, 64, 128 (pyramid levels P3 to P7)
• Sizes: 32, 64, 128, 256, 512 (one per pyramid level)
• Total (A): ratios × scales = 3 × 3 = 9 anchors per pixel location
• (K) object classes
Anchors - Example
• Anchor width = (size × scale) / sqrt(ratio), height = width × ratio
• E.g. for anchor size 32 (box corners [x1 y1 x2 y2] relative to the anchor center, values truncated; the three boxes per row are the three scales):
• ratio 0.5: [-22 -11 22 11] 44×22, [-28 -14 28 14] 56×28, [-35 -17 35 17] 70×34
• ratio 1: [-16 -16 16 16] 32×32, [-20 -20 20 20] 40×40, [-25 -25 25 25] 50×50
• ratio 2: [-11 -22 11 22] 22×44, [-14 -28 14 28] 28×56, [-17 -35 17 35] 34×70
For an 800×600 input image:
• P3 activation map shape: 100×75
• Stride: 8
• Total (A) = 9 anchors per pixel location
• Total anchors at P3 level = 100 × 75 × 9 = 67500
• Summing similarly over all pyramid levels P3, P4, P5, P6, P7 gives 90360 anchors per image!
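The anchor arithmetic above can be reproduced with a short sketch. It assumes ratio = height / width, scales of 2^0, 2^(1/3), 2^(2/3), and feature-map sizes rounded up at each stride; `generate_anchors` and `count_anchors` are illustrative names.

```python
import math
import numpy as np

def generate_anchors(size, ratios=(0.5, 1.0, 2.0),
                     scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Anchors [x1, y1, x2, y2] centered at the origin for one pyramid level."""
    anchors = []
    for ratio in ratios:
        for scale in scales:
            w = size * scale / math.sqrt(ratio)
            h = w * ratio  # assumption: ratio is height / width
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def count_anchors(width, height, strides=(8, 16, 32, 64, 128), per_location=9):
    """Total anchors over all pyramid levels for a width x height image."""
    return sum(
        math.ceil(width / s) * math.ceil(height / s) * per_location
        for s in strides
    )

print(np.trunc(generate_anchors(32)))  # truncated corners, one row per anchor
print(count_anchors(800, 600))
```

The per-level map sizes for an 800×600 input come out as 100×75, 50×38, 25×19, 13×10, and 7×5, which sum to the 90360 anchors quoted above.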
Shift anchors
Anchors are shifted from activation-map locations to input-image coordinates.
• Shift the anchor centered at (0,0) on the P3 activation map (stride 8) by [ 4. 4. 4. 4.]: the box (-22,-11), (22,11) becomes (-18,-7), (26,15), centered at (4,4).
• Next shifts: [ 12. 4. 12. 4.], [ 20. 4. 20. 4.], …, so the centers (4,4), (12,4), … are 8 pixels apart in the input image.
Anchors are applied with respect to the input image!
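A minimal sketch of the shifting step, assuming each activation-map cell maps to the center of its stride × stride patch in the input image (hence the offset of half a stride); `shift_anchors` is an illustrative name.

```python
import numpy as np

def shift_anchors(base_anchors, map_width, map_height, stride):
    """Place origin-centered anchors at every activation-map location."""
    xs = (np.arange(map_width) + 0.5) * stride   # 4, 12, 20, ... for stride 8
    ys = (np.arange(map_height) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)                 # x varies fastest
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    # Broadcast: (locations, 1, 4) + (1, anchors, 4) -> (locations, anchors, 4)
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)

base = np.array([[-22, -11, 22, 11]])
shifted = shift_anchors(base, map_width=100, map_height=75, stride=8)
print(shifted[0])  # first location: shifted by [4, 4, 4, 4]
```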
Cross Entropy loss
Even examples that are easily classified (pt ≫ 0.5) incur a loss with non-trivial magnitude; summed over a large number of easy examples, these small loss values can overwhelm the rare class.
Balanced cross entropy loss
• Weighting factor α for foreground, 1 − α for background
• α is a hyperparameter
• While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples!
Example
• The loss from easy examples = 100000 × 0.1 = 10000
• The loss from hard examples = 100 × 2.3 = 230
• 10000 / 230 ≈ 43: the easy examples contribute about 40× more loss than the hard ones.
Focal loss!
• Misclassified: pt is small, the modulating factor is near 1, and the loss is unaffected.
• As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
• With γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared to CE, and with pt ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples!
• Every sample is weighted according to its error!
• Modulating factor (1 − pt)^γ added
• Focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted
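The down-weighting figures above can be checked numerically. This sketch assumes the focal loss form FL(pt) = −α (1 − pt)^γ log(pt) from the focal loss paper, with α defaulting to 1 here purely for illustration; the function names are hypothetical.

```python
import math

def cross_entropy(pt):
    """CE loss for an example whose true class has probability pt."""
    return -math.log(pt)

def focal_loss(pt, gamma=2.0, alpha=1.0):
    """Focal loss: CE scaled by the modulating factor (1 - pt) ** gamma."""
    return -alpha * (1.0 - pt) ** gamma * math.log(pt)

for pt in (0.1, 0.9, 0.968):
    ratio = cross_entropy(pt) / focal_loss(pt)
    print(f"pt={pt}: CE={cross_entropy(pt):.4f}, "
          f"FL={focal_loss(pt):.6f}, CE/FL={ratio:.0f}")
```

With γ = 0 the modulating factor is 1 and the focal loss reduces to plain cross entropy.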
Focal loss!
Unlike FL, OHEM completely discards easy examples
Focal loss!
Smooth L1 loss: Bounding boxes
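A small sketch of the smooth L1 function applied to each box-coordinate error x: quadratic near zero, linear for large errors. The `beta` transition point is an added parameter (an assumption for illustration); with `beta=1` this is the form used in Fast R-CNN.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1: 0.5 * x^2 / beta if |x| < beta, else |x| - 0.5 * beta."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta
```

The linear tail keeps outlier boxes from dominating the gradient, which is why it is called a robust loss.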
Prediction pipeline
The network predicts regression deltas relative to the anchor boxes!
• Filter by a 0.05 score threshold
• Keep the top 1000 boxes per pyramid level, then merge all levels
• Apply NMS at an IoU threshold of 0.5
• Keep up to 300 final boxes and display them to the user
Stats
Stats
MAP: mean average precision
Precision = TP/(TP+FP) = 2/3 = 0.67
Recall is the proportion of TP out of the ground truth labels = 2/5 = 0.4
MAP: Interpolation approach (old 2007)
MAP: Interpolation approach (old 2007)
MAP: AUC approach (new 2011)
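The two ways of computing average precision for one class can be sketched as below: the 11-point interpolation and the area under the interpolated precision-recall curve. The inputs are assumed to be recall and precision values accumulated over detections sorted by descending score; the function names are illustrative. mAP is then the mean of the per-class AP values.

```python
import numpy as np

def precision_recall(tp, fp, num_gt):
    """Precision = TP / (TP + FP); recall = TP / number of ground truths."""
    return tp / (tp + fp), tp / num_gt

def ap_11_point(recalls, precisions):
    """Mean of the max precision at recall >= 0, 0.1, ..., 1.0."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    total = 0.0
    for t in [i / 10 for i in range(11)]:
        mask = recalls >= t
        total += precisions[mask].max() if mask.any() else 0.0
    return total / 11

def ap_area(recalls, precisions):
    """Area under the monotonically decreasing precision envelope."""
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(precision_recall(tp=2, fp=1, num_gt=5))
```

The `precision_recall` call uses the counts from the slide above: 2 true positives, 1 false positive, 5 ground-truth objects.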
Thank You

Object detection - RCNNs vs Retinanet


Editor's notes

  1. At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal.