I will talk about the object and face detection problems, the evolution of different approaches to solving them, and the ideas behind each approach. I will also describe a meta-architecture that achieves state-of-the-art results on the face detection problem and runs faster than real time.
2. About me
● Junior Researcher @ Ring Ukraine
● Student of “Kyiv Polytechnic Institute” (B.S.E. Software Engineering)
● Love algorithms and programming competitions
3. Agenda
1. Object detection problem
a. Why is the detection problem important?
b. Face detection problem
c. Datasets
d. How to evaluate different object detection approaches?
2. History of object detection architectures
a. Viola–Jones object detection
b. Classification based
c. Regression based
d. Cascade classification based
7. Why is object detection so important?
Object detection results are mostly used as an input for other tasks:
● face recognition
● person recognition
● self-driving cars
● …
14. WIDER FACE: A Face Detection Benchmark
● Consists of 32,203 images with 393,703 labeled faces
● The faces vary greatly in appearance, pose and scale
● Multiple annotated attributes: occlusion, pose and event categories, which allow in-depth analysis of existing algorithms
15. WIDER FACE: Annotations
WIDER FACE: A Face Detection Benchmark
26. Viola–Jones detector (5): Cascaded classifier
Rapid Object Detection using a Boosted Cascade of Simple Features
27. Viola–Jones detector (5)
Pros:
● Really fast (can run in real time on embedded devices)
● Low false-positive rate
● Easy to tune
Cons:
● Hand-crafted features
● Hard to train
● Low detection rate on non-frontal faces
● Detects only simple objects
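The attentional-cascade idea behind the Viola–Jones detector can be sketched as follows; the stage classifiers and thresholds here are hypothetical stand-ins for the boosted Haar-feature stages of the real detector:

```python
def cascade_detect(window, stages, thresholds):
    """Attentional cascade: reject a window at the first failing stage.

    `stages` is a list of scoring callables (boosted classifiers in the
    original paper) and `thresholds` the per-stage decision thresholds --
    both are illustrative stand-ins, not the real trained stages.
    """
    for stage, threshold in zip(stages, thresholds):
        if stage(window) < threshold:
            return False  # early rejection: most windows exit here cheaply
    return True  # only windows passing every stage are reported as faces
```

Because the overwhelming majority of windows in an image contain no face, early rejection by the cheap first stages is what makes the detector fast enough for embedded devices.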
33. R-CNN (1)
● Regions: ~2000 Selective Search proposals
● Feature extractor: AlexNet pre-trained on ImageNet, fine-tuned on PASCAL 2007
● Bounding-box regression to refine box locations
● Performance: mAP of 53.7% on PASCAL 2007
Rich feature hierarchies for accurate object detection and semantic segmentation
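The R-CNN pipeline can be sketched as below; every callable is a hypothetical stand-in for the real component (Selective Search, the fine-tuned AlexNet, per-class linear SVMs, and the bounding-box regressor):

```python
def rcnn_detect(image, propose_regions, extract_features, classify, regress_box):
    """Sketch of R-CNN inference with stand-in callables."""
    detections = []
    for box in propose_regions(image):        # ~2000 Selective Search proposals
        feats = extract_features(image, box)  # warp the crop, run the CNN
        label, score = classify(feats)        # per-class linear SVMs
        detections.append((label, score, regress_box(feats, box)))
    return detections
```

Running the CNN once per proposal (about 2000 forward passes per image) is exactly why inference takes tens of seconds; SPP-net and Fast R-CNN remove this redundancy.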
34. R-CNN (2)
Pros:
● Accurate
● Any architecture can be used as a feature extractor
Cons:
● Hard to train (lots of training objectives: softmax classifier, linear SVMs, bounding-box regressors, many of them trained separately)
● Slow training (84 h on GPU)
● Slow inference (detection): 47 s/image with a VGG-16 feature extractor
38. SPP-Net (1): Spatial Pyramid Pooling layer
● In each region proposal, use a 4-level spatial pyramid with grids:
■ 1×1
■ 2×2
■ 3×3
■ 6×6
● To each grid cell we apply a global pooling operation.
● In total we get 50 bins (1 + 4 + 9 + 36) to pool the features from each feature map.
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
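The pooling described above can be sketched in plain Python (max pooling is assumed as the global pooling operation, matching the paper's default):

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 3, 6)):
    """Max-pool one feature map (a list of rows) over each pyramid grid.

    With levels 1x1, 2x2, 3x3 and 6x6 this yields 1 + 4 + 9 + 36 = 50
    bins per channel, regardless of the input spatial size -- which is
    what lets SPP-net feed variable-size regions into fixed-size FC layers.
    """
    h, w = len(feature_map), len(feature_map[0])
    bins = []
    for n in levels:
        # split rows/cols into n roughly equal chunks; max-pool each cell
        r = [round(h * k / n) for k in range(n + 1)]
        c = [round(w * k / n) for k in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = [feature_map[y][x]
                        for y in range(r[i], r[i + 1])
                        for x in range(c[j], c[j + 1])]
                bins.append(max(cell))
    return bins
```

The fixed 50-bin output is the key property: the following fully-connected layers always see the same input size, no matter how large the pooled region was.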
39. SPP-Net (2)
● Input image
● Forward the whole image through a convolutional network
● Get the feature map of the image
● Take ROIs from a proposal method
● Apply the Spatial Pyramid Pooling layer to the feature map
● Fully-connected layers
● Classify regions and apply bounding-box regressors
40. SPP-Net (3)
What’s good about SPP-net? It’s much faster…
Pascal VOC 2007 results
41. SPP-Net (4)
What’s wrong with SPP-net?
● Inherits the rest of R-CNN’s problems
● Introduces a new problem: cannot update parameters below the SPP layer during training (only the 3 fully-connected layers are trainable; the 13 convolutional layers stay frozen)
43. Fast R-CNN (1)
● Fast test time, like SPP-net
● One network, trained in one stage
● Higher mean average precision than R-CNN and SPP-net
Fast R-CNN
44. Fast R-CNN (2)
Comparison of R-CNN and Fast R-CNN (both use a VGG-16 feature extractor):

                                    R-CNN       Fast R-CNN
Training time                       84 hours    9.5 hours
(Speedup)                           1x          8.8x
Test time per image (network only)  47 seconds  0.32 seconds
(Speedup)                           1x          146x
mAP (VOC 2007)                      53.7%       66.9%
45. But the run time does not include time for Selective Search…
46. Fast R-CNN (3)
Comparison of R-CNN and Fast R-CNN (both use a VGG-16 feature extractor):

                                         R-CNN       Fast R-CNN
Test time per image (network only)       47 seconds  0.32 seconds
(Speedup)                                1x          146x
Test time per image (with Sel. Search)   50 seconds  2 seconds
(Speedup)                                1x          25x
52. Faster R-CNN (1): Region Proposal Network
~100 FPS
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
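The RPN replaces Selective Search by scoring a fixed grid of reference boxes. A sketch of anchor generation with the values from the Faster R-CNN paper (3 scales × 3 aspect ratios = 9 anchors at every feature-map position); the exact ratio convention varies between implementations:

```python
def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchor boxes in input-image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center
            for s in scales:
                for r in ratios:
                    # keep the area ~ s*s while varying the aspect ratio
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

The RPN then predicts an objectness score and box offsets for each anchor; the top-scoring refined anchors become the region proposals fed to the Fast R-CNN head.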
53. Faster R-CNN (2)
54. Faster R-CNN (3)
Comparison of R-CNN / Fast R-CNN / Faster R-CNN (all use a VGG-16 feature extractor):

                                     R-CNN       Fast R-CNN  Faster R-CNN
Test time per image (with proposals) 50 seconds  2 seconds   0.2 seconds
(Speedup)                            1x          25x         250x
mAP (VOC 2007)                       53.7%       66.9%       69.9%
57. YOLO (1): YOLO’s pipeline
You Only Look Once: Unified, Real-Time Object Detection
58. YOLO (2): YOLO architecture
Bottom layers from GoogLeNet + custom layers
59. YOLO (3)
Pros:
● Quite fast (~40 FPS on an Nvidia Titan Black)
● End-to-end training
● Low error rate for foreground/background misclassification
● Learns a very general representation of objects
Cons:
● Less accurate than Fast R-CNN (63.9% mAP compared to 66.9%)
● Loss function is an approximation
● Cannot detect small objects
● Low detection rate for objects located close to each other
Error type comparison: Fast R-CNN vs YOLO
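YOLO v1 divides the image into an S×S grid; each cell predicts B boxes plus C shared class probabilities (S=7, B=2, C=20 in the paper, giving a 7×7×30 output tensor). A minimal sketch of decoding one cell, following that layout:

```python
def decode_yolo_cell(cell, s=7, b=2, c=20, row=0, col=0):
    """Decode one YOLO v1 grid cell (a flat list of B*5 + C numbers).

    Layout per the paper: B boxes of (x, y, w, h, confidence) followed by
    C class probabilities; x, y are offsets within the cell, while w, h
    are relative to the whole image. A sketch, not the full model.
    """
    boxes = []
    for k in range(b):
        x, y, w, h, conf = cell[k * 5:k * 5 + 5]
        cx = (col + x) / s  # cell-local offset -> image-relative center
        cy = (row + y) / s
        boxes.append((cx, cy, w, h, conf))
    class_probs = cell[b * 5:b * 5 + c]
    return boxes, class_probs
```

Because each cell predicts only B boxes and one shared class distribution, two small objects falling into the same cell compete for the same slots, which explains the poor detection rate for small and tightly packed objects noted above.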
62. SSD (2): SSD detector example
● 3 default boxes for each cell
● Apply regressors to each default box to get the result
● Confidences for 21 classes (20 Pascal VOC 2007 classes + background)
SSD: Single Shot MultiBox Detector
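A hedged sketch of how a regressor output is applied to a default box, using the standard center-offset / log-size parameterization shared by SSD and Faster R-CNN (the variance scaling used in real implementations is omitted for clarity):

```python
import math

def decode_ssd_box(default_box, offsets):
    """Decode one regressor output against its default (anchor) box.

    Both arguments are (cx, cy, w, h) tuples; `offsets` holds the
    predicted (tx, ty, tw, th) deltas.
    """
    dcx, dcy, dw, dh = default_box
    tx, ty, tw, th = offsets
    cx = dcx + tx * dw     # shift the center, scaled by the box size
    cy = dcy + ty * dh
    w = dw * math.exp(tw)  # width/height deltas live in log space
    h = dh * math.exp(th)
    return (cx, cy, w, h)
```

Predicting deltas against a fixed set of default boxes, rather than raw coordinates, is what makes the regression targets small and easy to learn; the choice of default boxes, however, becomes a hyper-parameter (a con listed on the next slide).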
63. SSD (3): Comparison of SSD with other detectors

Model                  mAP    FPS
Faster R-CNN (VGG-16)  73.2%  7
Faster R-CNN (ZF)      62.1%  17
YOLO                   63.4%  45
Tiny YOLO              52.7%  155
SSD300 (VGG-16)        72.1%  58
SSD500 (VGG-16)        75.1%  23

Pros:
● The best speed/accuracy trade-offs
● State-of-the-art results on all object detection datasets
● Works quite well with light feature extractors (InceptionV2, SqueezeNet, MobileNet, ShuffleNet, etc.)
Cons:
● Default boxes as hyper-parameters
● Works poorly with heavy feature extractors (ResNet-101, InceptionV4, VGG-16, etc.)
66. MTCNN (1)

Network  Input size  FPS*  Validation accuracy
P-Net    12×12       8000  94.6%
R-Net    24×24       650   95.4%
O-Net    48×48       220   95.4%

Network speed and accuracy on crops
* for the original network input and batch size 1
Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
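The three networks in the table form a cascade. A sketch of the pipeline, with all networks and the NMS step as hypothetical stand-in callables:

```python
def mtcnn_cascade(image_pyramid, pnet, rnet, onet, nms):
    """Sketch of the MTCNN three-stage cascade.

    P-Net scans every pyramid scale cheaply (12x12 input), and each
    later, heavier stage only sees the candidates the previous stage
    kept -- the attentional-cascade idea of Viola-Jones, but with CNNs.
    """
    candidates = []
    for scaled_image in image_pyramid:    # multi-scale to catch all face sizes
        candidates += pnet(scaled_image)  # fast, high-recall proposals
    candidates = nms(candidates)
    candidates = nms(rnet(candidates))    # 24x24 refinement network
    faces = nms(onet(candidates))         # 48x48 output net (boxes + landmarks)
    return faces
```

The speed/accuracy trade-off comes from how aggressively each stage filters: the 8000-FPS P-Net discards most of the image, so the slow 220-FPS O-Net only runs on a handful of candidates.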
67. MTCNN (2)
MTCNN’s network architectures; landmarks example
68. MTCNN (3)
Recall/precision curve on the test set of the WIDER FACE dataset
69. MTCNN (4)
Pros:
● Really fast (100 FPS on GPU)
● Lots of speed/accuracy trade-offs
● State-of-the-art results on a large share of face detection datasets (CelebA, FDDB, etc.)
Cons:
● Hard to train
● Lots of hyper-parameters
● Low detection rate for small faces
● Works poorly without landmarks

Model                  mAP    FPS
MTCNN                  85.1%  100
Faster R-CNN (VGG-16)  93.2%  5
SSH (VGG-16)           91.9%  10

Comparison of different face detector models on the WIDER FACE test set