I will talk about the object and face detection problems, the evolution of different approaches to solving them, and the ideas behind each approach. I will also describe a meta-architecture that achieves state-of-the-art results on the face detection problem and runs faster than real time.
2. About me
● Junior Researcher @ Ring Ukraine
● Student of “Kyiv Polytechnic Institute” (B.S.E. Software Engineering)
● Love algorithms and programming competitions
3. Agenda
1. Object detection problem
a. Why is the detection problem important?
b. Face detection problem
c. Datasets
d. How to evaluate different object detection approaches?
2. History of object detection architectures
a. Viola–Jones object detection
b. Classification based
c. Regression based
d. Cascade classification based
7. Why is object detection so important?
Object detection results are mostly used as an input for other tasks:
● face recognition
● person recognition
● self-driving cars
● …
14. WIDER FACE: A Face Detection Benchmark
● Consists of 32,203 images with 393,703 labeled faces
● The faces vary greatly in appearance, pose and scale
● Multiple annotated attributes: occlusion, pose and event categories, which allow in-depth analysis of existing algorithms
15. WIDER FACE: Annotations
WIDER FACE: A Face Detection Benchmark
26. Viola–Jones detector (5): Cascaded classifier
Rapid Object Detection using a Boosted Cascade of Simple Features
27. Viola–Jones detector (5)
Pros:
● Really fast (can run in real time on embedded devices)
● Low false-positive rate
● Easy to tune
Cons:
● Hand-crafted features
● Hard to train
● Low detection rate on non-frontal faces
● Detects only simple objects
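The attentional-cascade idea behind the Viola–Jones detector can be sketched as follows; the stage classifiers and thresholds here are hypothetical stand-ins for the boosted Haar-feature stages of the real detector:

```python
def cascade_detect(window, stages, thresholds):
    """Attentional cascade: reject a window at the first failing stage.

    `stages` is a list of scoring callables (boosted classifiers in the
    original paper) and `thresholds` the per-stage decision thresholds --
    both are illustrative stand-ins, not the real trained stages.
    """
    for stage, threshold in zip(stages, thresholds):
        if stage(window) < threshold:
            return False  # early rejection: most windows exit here cheaply
    return True  # only windows passing every stage are reported as faces
```

Because the overwhelming majority of windows in an image contain no face, early rejection by the cheap first stages is what makes the detector fast enough for embedded devices.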
33. R-CNN (1)
● Regions: ~2000 Selective Search proposals
● Feature extractor: AlexNet pre-trained on ImageNet, fine-tuned on PASCAL 2007
● Bounding-box regression to refine box locations
● Performance: mAP of 53.7% on PASCAL 2007
Rich feature hierarchies for accurate object detection and semantic segmentation
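The R-CNN pipeline can be sketched as below; every callable is a hypothetical stand-in for the real component (Selective Search, the fine-tuned AlexNet, per-class linear SVMs, and the bounding-box regressor):

```python
def rcnn_detect(image, propose_regions, extract_features, classify, regress_box):
    """Sketch of R-CNN inference with stand-in callables."""
    detections = []
    for box in propose_regions(image):        # ~2000 Selective Search proposals
        feats = extract_features(image, box)  # warp the crop, run the CNN
        label, score = classify(feats)        # per-class linear SVMs
        detections.append((label, score, regress_box(feats, box)))
    return detections
```

Running the CNN once per proposal (about 2000 forward passes per image) is exactly why inference takes tens of seconds; SPP-net and Fast R-CNN remove this redundancy.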
34. R-CNN (2)
Pros:
● Accurate
● Any architecture can be used as a feature extractor
Cons:
● Hard to train (lots of training objectives: softmax classifier, linear SVMs, bounding-box regressors, many of them trained separately)
● Slow training (84 h on GPU)
● Slow inference (detection): 47 s/image with a VGG-16 feature extractor
38. SPP-Net (1): Spatial Pyramid Pooling layer
● In each region proposal, use a 4-level spatial pyramid with grids:
■ 1×1
■ 2×2
■ 3×3
■ 6×6
● To each grid cell we apply a global pooling operation.
● In total we get 50 bins (1 + 4 + 9 + 36) to pool the features from each feature map.
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
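The pooling described above can be sketched in plain Python (max pooling is assumed as the global pooling operation, matching the paper's default):

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 3, 6)):
    """Max-pool one feature map (a list of rows) over each pyramid grid.

    With levels 1x1, 2x2, 3x3 and 6x6 this yields 1 + 4 + 9 + 36 = 50
    bins per channel, regardless of the input spatial size -- which is
    what lets SPP-net feed variable-size regions into fixed-size FC layers.
    """
    h, w = len(feature_map), len(feature_map[0])
    bins = []
    for n in levels:
        # split rows/cols into n roughly equal chunks; max-pool each cell
        r = [round(h * k / n) for k in range(n + 1)]
        c = [round(w * k / n) for k in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = [feature_map[y][x]
                        for y in range(r[i], r[i + 1])
                        for x in range(c[j], c[j + 1])]
                bins.append(max(cell))
    return bins
```

The fixed 50-bin output is the key property: the following fully-connected layers always see the same input size, no matter how large the pooled region was.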
39. SPP-Net (2)
● Input image
● Forward the whole image through a convolutional network
● Get the feature map of the image
● Take ROIs from a proposal method
● Apply the Spatial Pyramid Pooling layer to the feature map
● Fully-connected layers
● Classify regions and apply bounding-box regressors
40. SPP-Net (3)
What’s good about SPP-net? It’s much faster…
Pascal VOC 2007 results
41. SPP-Net (4)
What’s wrong with SPP-net?
● Inherits the rest of R-CNN’s problems
● Introduces a new problem: cannot update parameters below the SPP layer during training (only the 3 fully-connected layers are trainable; the 13 convolutional layers stay frozen)
43. Fast R-CNN (1)
● Fast test time, like SPP-net
● One network, trained in one stage
● Higher mean average precision than R-CNN and SPP-net
Fast R-CNN
44. Fast R-CNN (2)
Comparison of R-CNN and Fast R-CNN (both use a VGG-16 feature extractor):

                                    R-CNN       Fast R-CNN
Training time                       84 hours    9.5 hours
(Speedup)                           1x          8.8x
Test time per image (network only)  47 seconds  0.32 seconds
(Speedup)                           1x          146x
mAP (VOC 2007)                      53.7%       66.9%
45. But the run time does not include time for Selective Search…
46. Fast R-CNN (3)
Comparison of R-CNN and Fast R-CNN (both use a VGG-16 feature extractor):

                                         R-CNN       Fast R-CNN
Test time per image (network only)       47 seconds  0.32 seconds
(Speedup)                                1x          146x
Test time per image (with Sel. Search)   50 seconds  2 seconds
(Speedup)                                1x          25x
52. Faster R-CNN (1): Region Proposal Network
~100 FPS
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
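The RPN replaces Selective Search by scoring a fixed grid of reference boxes. A sketch of anchor generation with the values from the Faster R-CNN paper (3 scales × 3 aspect ratios = 9 anchors at every feature-map position); the exact ratio convention varies between implementations:

```python
def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchor boxes in input-image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
            for s in scales:
                for r in ratios:
                    # keep the area ~ s*s while varying the aspect ratio
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

The RPN then predicts an objectness score and box offsets for each anchor; the top-scoring refined anchors become the region proposals fed to the Fast R-CNN head.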
53. Faster R-CNN (2)
54. Faster R-CNN (3)
Comparison of R-CNN / Fast R-CNN / Faster R-CNN (all use a VGG-16 feature extractor):

                                     R-CNN       Fast R-CNN  Faster R-CNN
Test time per image (with proposals) 50 seconds  2 seconds   0.2 seconds
(Speedup)                            1x          25x         250x
mAP (VOC 2007)                       53.7%       66.9%       69.9%
57. YOLO (1): YOLO’s pipeline
You Only Look Once: Unified, Real-Time Object Detection
58. YOLO (2): YOLO architecture
Bottom layers from GoogLeNet + custom layers
59. YOLO (3)
Pros:
● Quite fast (~40 FPS on an Nvidia Titan Black)
● End-to-end training
● Low error rate for foreground/background misclassification
● Learns a very general representation of objects
Cons:
● Less accurate than Fast R-CNN (63.9% mAP compared to 66.9%)
● Loss function is an approximation
● Cannot detect small objects
● Low detection rate for objects located close to each other
Error type comparison: Fast R-CNN vs YOLO
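YOLO v1 divides the image into an S×S grid; each cell predicts B boxes plus C shared class probabilities (S=7, B=2, C=20 in the paper, giving a 7×7×30 output tensor). A minimal sketch of decoding one cell, following that layout:

```python
def decode_yolo_cell(cell, s=7, b=2, c=20, row=0, col=0):
    """Decode one YOLO v1 grid cell (a flat list of B*5 + C numbers).

    Layout per the paper: B boxes of (x, y, w, h, confidence) followed by
    C class probabilities; x, y are offsets within the cell, while w, h
    are relative to the whole image. A sketch, not the full model.
    """
    boxes = []
    for k in range(b):
        x, y, w, h, conf = cell[k * 5:k * 5 + 5]
        cx = (col + x) / s  # cell-local offset -> image-relative center
        cy = (row + y) / s
        boxes.append((cx, cy, w, h, conf))
    class_probs = cell[b * 5:b * 5 + c]
    return boxes, class_probs
```

Because each cell predicts only B boxes and one shared class distribution, two small objects falling into the same cell compete for the same slots, which explains the poor detection rate for small and tightly packed objects noted above.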
62. SSD (2): SSD detector example
● 3 default boxes for each cell
● Apply regressors to each default box to get the result
● Confidences for 21 classes (20 Pascal VOC 2007 classes + background)
SSD: Single Shot MultiBox Detector
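A hedged sketch of how a regressor output is applied to a default box, using the standard center-offset / log-size parameterization shared by SSD and Faster R-CNN (the variance scaling used in real implementations is omitted for clarity):

```python
import math

def decode_ssd_box(default_box, offsets):
    """Decode one regressor output against its default (anchor) box.

    Both arguments are (cx, cy, w, h) tuples; `offsets` holds the
    predicted (tx, ty, tw, th) deltas.
    """
    dcx, dcy, dw, dh = default_box
    tx, ty, tw, th = offsets
    cx = dcx + tx * dw     # shift the center, scaled by the box size
    cy = dcy + ty * dh
    w = dw * math.exp(tw)  # width/height deltas live in log space
    h = dh * math.exp(th)
    return (cx, cy, w, h)
```

Predicting deltas against a fixed set of default boxes, rather than raw coordinates, is what makes the regression targets small and easy to learn; the choice of default boxes, however, becomes a hyper-parameter (a con listed on the next slide).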
63. SSD (3): Comparison of SSD with other detectors

Model                  mAP    FPS
Faster R-CNN (VGG-16)  73.2%  7
Faster R-CNN (ZF)      62.1%  17
YOLO                   63.4%  45
Tiny YOLO              52.7%  155
SSD300 (VGG-16)        72.1%  58
SSD500 (VGG-16)        75.1%  23

Pros:
● The best speed/accuracy trade-offs
● State-of-the-art results on all object detection datasets
● Works quite well with light feature extractors (InceptionV2, SqueezeNet, MobileNet, ShuffleNet, etc.)
Cons:
● Default boxes as hyper-parameters
● Works poorly with heavy feature extractors (ResNet-101, InceptionV4, VGG-16, etc.)
66. MTCNN (1)

Network  Input size  FPS*  Validation accuracy
P-Net    12×12       8000  94.6%
R-Net    24×24       650   95.4%
O-Net    48×48       220   95.4%

Network speed and accuracy on crops
* for the original network input and batch size 1
Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
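The three networks in the table form a cascade. A sketch of the pipeline, with all networks and the NMS step as hypothetical stand-in callables:

```python
def mtcnn_cascade(image_pyramid, pnet, rnet, onet, nms):
    """Sketch of the MTCNN three-stage cascade.

    P-Net scans every pyramid scale cheaply (12x12 input), and each
    later, heavier stage only sees the candidates the previous stage
    kept -- the attentional-cascade idea of Viola-Jones, but with CNNs.
    """
    candidates = []
    for scaled_image in image_pyramid:    # multi-scale to catch all face sizes
        candidates += pnet(scaled_image)  # fast, high-recall proposals
    candidates = nms(candidates)
    candidates = nms(rnet(candidates))    # 24x24 refinement network
    faces = nms(onet(candidates))         # 48x48 output net (boxes + landmarks)
    return faces
```

The speed/accuracy trade-off comes from how aggressively each stage filters: the 8000-FPS P-Net discards most of the image, so the slow 220-FPS O-Net only runs on a handful of candidates.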
67. MTCNN (2)
MTCNN’s network architectures; landmarks example
68. MTCNN (3)
Recall/precision curve on the test set of the WIDER FACE dataset
69. MTCNN (4)
Pros:
● Really fast (100 FPS on GPU)
● Lots of speed/accuracy trade-offs
● State-of-the-art results on a large share of face detection datasets (CelebA, FDDB, etc.)
Cons:
● Hard to train
● Lots of hyper-parameters
● Low detection rate for small faces
● Works poorly without landmarks

Model                  mAP    FPS
MTCNN                  85.1%  100
Faster R-CNN (VGG-16)  93.2%  5
SSH (VGG-16)           91.9%  10

Comparison of different face detector models on the WIDER FACE test set