Intro to selective search for object proposals, deep dives into the R-CNN family and the state-of-the-art RetinaNet model for object detection, along with the mAP concept for evaluating models and how anchor boxes help the model learn where to draw bounding boxes.
5. Segmentation
Idea: If we correctly segment the image before running object
recognition, we can use our segmentations as candidate objects.
Advantages: Can be efficient, makes no assumptions about object
sizes or shapes.
6. Selective search
• Start by oversegmenting the input image ("Efficient graph-based image segmentation", Felzenszwalb and Huttenlocher, IJCV 2004)
9. Similarity measures
• Color: 25-bin color histogram for each channel = 75 dims (RGB)
• Texture: HOG-like Gaussian derivatives of the image in 8 directions for each channel; construct a 10-bin histogram for each region = 240-dim vector
• Size: size similarity encourages smaller regions to merge early. It ensures that region proposals at all scales are formed at all parts of the image.
• Shape: measures how well two regions (ri and rj) fit into each other; if ri fits into rj, merge them to fill gaps
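A minimal sketch of how these measures can be computed; the function names and the L1-normalized histogram inputs are illustrative assumptions, not from the paper's code:

```python
import numpy as np

def sim_hist(hist_i, hist_j):
    # Histogram intersection of L1-normalized histograms
    # (75-dim for color, 240-dim for texture -- same formula).
    return np.minimum(hist_i, hist_j).sum()

def sim_size(size_i, size_j, size_image):
    # Close to 1 when both regions are small -> small regions merge early.
    return 1.0 - (size_i + size_j) / size_image

def sim_fill(size_i, size_j, size_bbox_ij, size_image):
    # Close to 1 when ri and rj together tightly fill their joint
    # bounding box, i.e., the two regions fit into each other.
    return 1.0 - (size_bbox_ij - size_i - size_j) / size_image
```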
10. Selective search
1. Merge the two most similar regions based on S.
2. Update similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
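A sketch of this greedy grouping loop, assuming hypothetical helpers similarity() and merge(), and regions that carry a bounding box and a neighbour set:

```python
def hierarchical_grouping(initial_pairs, similarity, merge):
    # initial_pairs: (ri, rj) tuples for all neighbouring regions.
    S = {pair: similarity(*pair) for pair in initial_pairs}
    boxes = []                                  # proposals at every scale
    while S:
        ri, rj = max(S, key=S.get)              # 1. most similar pair
        rt = merge(ri, rj)                      # new region rt = ri U rj
        boxes.append(rt.bbox)
        # remove every pair involving ri or rj ...
        S = {p: v for p, v in S.items() if ri not in p and rj not in p}
        # 2. ... and add similarities between rt and its neighbours
        for rn in rt.neighbours:
            S[(rt, rn)] = similarity(rt, rn)
    return boxes                                # 3. loop until one region left
```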
11. Selective search
• Use hierarchical segmentation: start with small superpixels and merge based on diverse cues
• Take bounding boxes of all generated regions and treat them as possible object locations
16. R-CNN details
• Cons
• Training is slow (84h) and takes a lot of disk space
• ~2000 CNN passes per image
• Inference (detection) is slow (47s / image with VGG16)
• The selective search algorithm is a fixed algorithm; no learning is happening! This can lead to the generation of bad candidate region proposals.
17. Fast R-CNN
[Architecture figure] Forward the whole image through a ConvNet once to get a "conv5" feature map; project the region proposals onto the feature map and extract fixed-size features with an "RoI pooling" layer; pass them through fully-connected layers (FCs) into two heads: a linear + softmax classifier and linear bounding-box regressors.
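A minimal Fast R-CNN style sketch using torchvision's RoI pooling op; the truncated VGG16 backbone and the example boxes are placeholders:

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Run the whole image through the ConvNet once (VGG16 without its
# last max-pool, so the feature map has stride 16).
backbone = torchvision.models.vgg16(weights=None).features[:-1]
image = torch.randn(1, 3, 600, 800)
feat = backbone(image)                          # "conv5"-style feature map

# Region proposals in input-image coordinates: (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0., 10., 20., 200., 300.],
                     [0., 50., 60., 400., 500.]])

# Pool each RoI to a fixed 7x7 grid; spatial_scale maps image coordinates
# onto the feature map (1/16 for the conv5 stride).
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                             # (2, 512, 7, 7) -> FCs -> 2 heads
```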
18. Fast R-CNN
• Pros
• Less compute overhead
• 2.3 seconds per image inference time
• Cons
• 2.3 s inference is still too slow for real-life use!
• The selective search algorithm is a fixed algorithm; no learning is happening! This can lead to the generation of bad candidate region proposals.
22. Region proposal network (RPN)
• Slide a small window over the feature map
• Predict object/no object
• Regress bounding box coordinates
• Box regression is with reference to anchors (3 scales x 3 aspect ratios)
At each sliding-window location we simultaneously predict multiple region proposals, where the maximum number of possible proposals per location is denoted k. The reg layer therefore has 4k outputs encoding the coordinates of the k boxes, and the cls layer outputs 2k scores estimating the probability of object vs. not object for each proposal.
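A minimal sketch of such an RPN head in PyTorch: a 3x3 "sliding window" conv followed by two 1x1 convs producing the 2k scores and 4k deltas (the 512-channel input is a VGG16-style assumption):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    # k = 9 anchors per location (3 scales x 3 aspect ratios).
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)  # sliding window
        self.cls = nn.Conv2d(512, 2 * k, 1)   # 2k scores: object / not object
        self.reg = nn.Conv2d(512, 4 * k, 1)   # 4k box deltas w.r.t. anchors

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 37, 50))
print(scores.shape, deltas.shape)  # (1, 18, 37, 50) (1, 36, 37, 50)
```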
23. Loss
• i: index of an anchor in a mini-batch
• pi: predicted probability of anchor i being an object
• p*i: 1 if the anchor is positive, 0 if the anchor is negative
• ti: the 4 predicted bounding-box coordinates
• t*i: coordinates of the ground-truth box associated with a positive anchor
• Lreg(ti, t*i) = R(ti − t*i), where R is the robust loss function (smooth L1)
• Classification + regression
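Putting the definitions together, the full multi-task objective from the Faster R-CNN paper is:

```latex
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*)
  + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*),
\qquad
\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1\\
|x| - 0.5 & \text{otherwise}
\end{cases}
```

The p*i factor in front of Lreg means the regression loss is active only for positive anchors.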
24. Online hard example mining
• Class imbalance hurts training: we end up training the model to learn the background space rather than to detect objects.
• Sort anchors by their calculated loss, apply NMS, then pick the top ones such that the ratio between the picked negatives and positives is at most 3:1.
• Faster R-CNN instead samples 256 anchors per image: 128 positive, 128 negative.
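A sketch of the hard-negative selection step under these rules (the NMS step over anchors is omitted for brevity; names are illustrative):

```python
import torch

def sample_hard_negatives(losses, labels, neg_pos_ratio=3):
    # losses: per-anchor loss values; labels: 1 = positive, 0 = negative anchor.
    pos = labels == 1
    num_neg = int(neg_pos_ratio * max(int(pos.sum()), 1))
    neg_losses = losses.clone()
    neg_losses[pos] = -1.0                       # exclude positives from ranking
    hard_neg = neg_losses.argsort(descending=True)[:num_neg]
    keep = pos.clone()
    keep[hard_neg] = True                        # positives + hardest negatives
    return keep

losses = torch.rand(1000)
labels = (torch.rand(1000) < 0.05).long()
mask = sample_hard_negatives(losses, labels)     # neg:pos ratio at most 3:1
```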
29. NMS: non max suppression (refresher)
[Figure: initial predicted boxes vs. the boxes that survive suppression by IoU]
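A minimal greedy NMS sketch (torchvision.ops.nms is the production equivalent):

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) as (x1, y1, x2, y2); greedily keep the highest-scoring
    # box and drop all remaining boxes that overlap it above iou_thresh.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        xy1 = torch.maximum(boxes[i, :2], boxes[rest, :2])
        xy2 = torch.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
        area_i = (boxes[i, 2:] - boxes[i, :2]).prod()
        area_r = (boxes[rest, 2:] - boxes[rest, :2]).prod(dim=1)
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]          # suppress overlapping boxes
    return torch.tensor(keep)
```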
30. Why does a one-stage detector trail in accuracy?
Two-stage:
• The proposal stage rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples.
• The classification stage then fixes the foreground-to-background ratio to 1:3, or uses online hard example mining (OHEM).
One-stage:
• Has to process a much larger set of candidate object locations regularly sampled across the image, which amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios.
• Extreme foreground-background class imbalance is encountered.
31. Activation maps
How about predicting from multiple maps? As the image goes deeper through the network, resolution decreases and semantic value increases.
32. Feature pyramid networks (FPN)
• Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
• Top-down + lateral connections
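A minimal sketch of the top-down pathway with lateral connections; the input channel sizes (e.g., ResNet C3-C5) and the 256 output channels are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                     for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3,
                                              padding=1) for _ in in_channels)

    def forward(self, c3, c4, c5):
        # assumes each level is exactly 2x the spatial size of the next
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        # 3x3 smoothing convs reduce the aliasing from upsampling
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```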
36. Anchors - Example
• Anchor width = (size × scale)/sqrt(ratio); height = width × ratio, so the area stays ≈ (size × scale)²
• E.g., for anchor size 32:
• ratio 0.5: [-22 -11 22 11] → 44×22, [-28 -14 28 14] → 56×28, [-35 -17 35 17] → 70×34
• ratio 1: [-16 -16 16 16] → 32×32, [-20 -20 20 20] → 40×40, [-25 -25 25 25] → 50×50
• ratio 2: [-11 -22 11 22] → 22×44, [-14 -28 14 28] → 28×56, [-17 -35 17 35] → 34×70
For an 800×600 input image:
• P3 activation map shape: 100×75
• Stride: 8
• Total (A) = 9 anchors per pixel location
• Total anchors at the P3 level = 100 × 75 × 9 = 67,500
• Summing similarly over all pyramid levels P3, P4, P5, P6, P7 = 90,360 anchors per image!
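A sketch that reproduces the numbers above; the scale set {2^0, 2^(1/3), 2^(2/3)} is inferred from the slide's table (it matches the standard RetinaNet setting), and half-extents are truncated to integers as on the slide:

```python
import numpy as np

def anchors_at_origin(size=32, scales=(1, 2**(1/3), 2**(2/3)),
                      ratios=(0.5, 1, 2)):
    # width = size*scale/sqrt(ratio), height = width*ratio
    boxes = []
    for r in ratios:
        for s in scales:
            w = size * s / np.sqrt(r)
            h = w * r
            boxes.append([-int(w / 2), -int(h / 2), int(w / 2), int(h / 2)])
    return np.array(boxes)

print(anchors_at_origin())        # first row: [-22 -11 22 11]  (44x22)

# Anchor count for an 800x600 input over P3-P7 (strides 8, 16, 32, 64, 128):
total = sum(int(np.ceil(800 / s)) * int(np.ceil(600 / s)) * 9
            for s in (8, 16, 32, 64, 128))
print(total)                      # 90360
```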
37. Shift anchors
Shift anchors according to the input image from the activation map.
[Figure] On P3 (stride 8), the anchor [-22 -11 22 11] centered at (0,0) is shifted by [4 4 4 4] to [-18 -7 26 15], centering it on the first activation-map cell at (4,4); the next shifts are [12 4 12 4], [20 4 20 4], ....
Anchors are applied w.r.t. the input image!
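A sketch of the shifting step, assuming base anchors centred at (0,0):

```python
import torch

def shift_anchors(base_anchors, feat_h, feat_w, stride=8):
    # base_anchors: (A, 4) boxes centred at (0, 0).
    # Place one copy of every anchor at each activation-map cell centre.
    xs = torch.arange(feat_w) * stride + stride // 2   # 4, 12, 20, ...
    ys = torch.arange(feat_h) * stride + stride // 2
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    shifts = torch.stack([cx, cy, cx, cy], dim=-1).reshape(-1, 4)  # (H*W, 4)
    return (shifts[:, None, :] + base_anchors[None, :, :]).reshape(-1, 4)

base = torch.tensor([[-22., -11., 22., 11.]])
anchors = shift_anchors(base, feat_h=75, feat_w=100)
print(anchors[0])   # tensor([-18.,  -7.,  26.,  15.])
```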
38. Cross entropy loss
Even examples that are easily classified (pt > 0.5) incur a loss with non-trivial magnitude; summed over a large number of easy examples, these small loss values can overwhelm the rare class.
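For reference, cross entropy in the focal loss paper's notation, where pt is the model's probability for the true class:

```latex
p_t = \begin{cases} p & \text{if } y = 1\\ 1 - p & \text{otherwise,} \end{cases}
\qquad CE(p_t) = -\log(p_t)
```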
39. Balanced cross entropy loss
• Weight α for foreground, 1 − α for background
• α is a hyperparameter
• While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples!
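Written in the same notation, with αt defined analogously to pt:

```latex
CE(p_t) = -\alpha_t \log(p_t)
```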
40. Example
• The loss from easy examples = 100,000 × 0.1 = 10,000 (CE ≈ 0.1 at pt ≈ 0.9)
• The loss from hard examples = 100 × 2.3 = 230 (CE ≈ 2.3 at pt = 0.1)
• 10,000 / 230 ≈ 43, i.e., about 40× more total loss comes from the easy examples.
41. Focal loss!
• A modulating factor is added to the cross entropy loss: every sample is weighted according to its error!
• Misclassified: pt is small, the modulating factor is near 1, and the loss is unaffected.
• As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
• The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted.
• With γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared to CE, and with pt ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples!
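A minimal binary focal loss sketch in PyTorch (RetinaNet additionally normalizes by the number of positive anchors, omitted here):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # FL(pt) = -alpha_t * (1 - pt)**gamma * log(pt), binary case;
    # targets: 1.0 for foreground, 0.0 for background.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)            # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - pt) ** gamma * ce).sum()

logits = torch.randn(10)
targets = (torch.rand(10) < 0.2).float()
print(focal_loss(logits, targets))
```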
45. Prediction pipeline
• Predict regressions (deltas) to the anchor boxes!
• Filter by a 0.05 anchor score threshold
• Take the top 1000 boxes per level, merge all levels
• Apply NMS at 0.5
• 300 final boxes! Display to the user
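A single-class sketch of this pipeline using torchvision's NMS; box decoding from the predicted deltas is assumed to have happened already:

```python
import torch
from torchvision.ops import nms

def predict(scores_per_level, boxes_per_level,
            score_thresh=0.05, top_k=1000, iou_thresh=0.5, max_dets=300):
    all_scores, all_boxes = [], []
    for scores, boxes in zip(scores_per_level, boxes_per_level):
        keep = scores > score_thresh                     # 0.05 score threshold
        scores, boxes = scores[keep], boxes[keep]
        order = scores.argsort(descending=True)[:top_k]  # top 1000 per level
        all_scores.append(scores[order])
        all_boxes.append(boxes[order])
    scores, boxes = torch.cat(all_scores), torch.cat(all_boxes)  # merge levels
    keep = nms(boxes, scores, iou_thresh)[:max_dets]     # NMS at 0.5, keep 300
    return boxes[keep], scores[keep]
```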