Mask R-CNN
1. Mask R-CNN
ICCV 2017 (Oral)
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
Facebook AI Research (FAIR)
Chanuk Lim
KEPRI
2017.08.10
2. 1. Abstract
Our approach efficiently detects objects in an image while
simultaneously generating a high-quality segmentation mask
for each instance.
Mask R-CNN extends Faster R-CNN by adding a branch for
predicting an object mask in parallel with the existing branch
for bounding box recognition.
Mask R-CNN is simple to train and adds only a small overhead
to Faster R-CNN, running at 5 fps.
3. 2. Review - Instance Segmentation
http://blog.naver.com/sogangori/221012300995
Instance segmentation combines elements from the classical computer vision
tasks of object detection, where the goal is to classify individual objects and
localize each using a bounding box, and semantic segmentation, where the
goal is to classify each pixel into a fixed set of categories without
differentiating object instances.
Instance segmentation is challenging because it requires the correct detection
of all objects in an image while also precisely segmenting each instance.
4. 2. Review - Fast R-CNN & Faster R-CNN
Background: R-CNN architecture
[Diagram] ~2k region proposals (from an independent algorithm) → warped region proposals → convolutional feature extraction (CNN) → SVM classification + class-specific LSR box regression.
- Based on proposed Regions of Interest (RoIs)
- Requires region warping for fixed-size features
- Very inefficient pipeline
Background: Fast R-CNN
[Diagram] CNN convolutional backbone → feature map → RoIPool layer → fixed-size feature map → fully connected layers → classification + box regression. RoIs still come from an independent method.
Background: Faster R-CNN
[Diagram] CNN convolutional backbone → feature map → RPN (replaces the external proposal method) → RoIPool layer → fixed-size feature map → fully connected layers → classification + box regression.
Mask R-CNN: overview
[Diagram] CNN convolutional backbone → feature map → RPN → RoIAlign layer → fixed-size feature map → fully connected head → classification + box regression, plus a parallel mask branch.
Evolution: R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN
Curtis Kim, kakao, https://www.slideshare.net/IldooKim/deep-object-detectors-1-20166
5. 4. Contribution
1. To fix the misalignment, we propose a simple, quantization-free
layer, called RoIAlign, that faithfully preserves exact spatial locations.
2. We add a branch for predicting segmentation masks on each Region
of Interest (RoI), in parallel with the existing branch for classification
and bounding-box regression.
7. 4. Contribution
1) RoIAlign
• Previous work - RoIPool
[Diagram] A 112×112 input region is max-pooled to a 7×7 output; the input coordinates are rounded off.
We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the
extracted features with the input.
RoIAlign improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter
localization metrics.
Ex) quantized coordinate [x/16] (rounded off) vs. exact x/16 on a stride-16 feature map:
[32/16] = 2, 32/16 = 2
[33/16] = 2, 33/16 = 2.06
[34/16] = 2, 34/16 = 2.12
[35/16] = 2, 35/16 = 2.18
[36/16] = 2, 36/16 = 2.25
[37/16] = 2, 37/16 = 2.31
[38/16] = 2, 38/16 = 2.37
[39/16] = 2, 39/16 = 2.43
[40/16] = 3, 40/16 = 2.5
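A minimal plain-Python sketch reproducing the table above (the round-half-up convention is our reading of "input is rounded off"): RoIPool quantizes the projected coordinate, while RoIAlign keeps the continuous value.

```python
import math

# RoIPool projects an image coordinate x onto a stride-16 feature map
# and rounds it off; RoIAlign keeps the continuous value x/16.
stride = 16

for x in range(32, 41):
    exact = x / stride                 # RoIAlign: no quantization
    rounded = math.floor(exact + 0.5)  # RoIPool-style round-off
    print(f"[{x}/{stride}] = {rounded},  {x}/{stride} = {exact:.2f}")
```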
8. 4. Contribution
1) RoIAlign
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
We use bilinear interpolation (Spatial Transformer Networks, Jaderberg et al.) to compute the exact
values of the input features at four regularly sampled locations in each RoI bin, and aggregate the
result (using max or average).
[Figure 2, STN paper] The architecture of a spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters \theta. The regular spatial grid G over V is transformed to the sampling grid T_\theta(G), which is applied to U by the sampler, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.

[Figure 3, STN paper] Two examples of applying the parameterised sampling grid to an image U producing the output V. (a) The sampling grid is the regular grid G = T_I(G), where I is the identity transformation parameters. (b) The sampling grid is the result of warping the regular grid with an affine transformation T_\theta(G).

For a 2D affine transformation A_\theta, the pointwise transformation is

\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \quad (1)

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_\theta is the affine transformation matrix. Height- and width-normalised coordinates are used, such that -1 \le x_i^t, y_i^t \le 1 within the spatial bounds of the output and -1 \le x_i^s, y_i^s \le 1 within the spatial bounds of the input. This transform allows cropping, translation, rotation, scale, and skew to be applied to the input feature map, and requires only 6 parameters (the 6 elements of A_\theta) to be produced by the localisation network.

The class of transformations T_\theta may be more constrained, such as that used for attention,

A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \quad (2)

allowing cropping, translation, and isotropic scaling by varying s, t_x, and t_y. T_\theta can also be more general, such as a plane projective transformation with 8 parameters, piecewise affine, or a thin plate spline. Indeed, the transformation can have any parameterised form, provided that it is differentiable with respect to the parameters; this crucially allows gradients to be backpropagated from the sample points T_\theta(G_i) to the localisation network output \theta.

Differentiable image sampling: each (x_i^s, y_i^s) coordinate in T_\theta(G) defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output V. This can be written as

V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \quad \forall i \in [1 \dots H'W'], \; \forall c \in [1 \dots C] \quad (3)

where k() is a generic sampling kernel with parameters \Phi_x, \Phi_y that defines the image interpolation (e.g. bilinear), U_{nm}^c is the input feature value at location (n, m) in channel c, and V_i^c is the target (output) feature value for pixel i in channel c. The sampling is done identically for each channel of the input, so every channel is transformed in an identical way, preserving spatial consistency between channels; any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to x_i^s and y_i^s.

Using the integer sampling kernel reduces (3) to copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location:

V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m) \, \delta(\lfloor y_i^s + 0.5 \rfloor - n) \quad (4)

where \lfloor x + 0.5 \rfloor rounds x to the nearest integer and \delta() is the Kronecker delta function. Alternatively, a bilinear sampling kernel can be used, giving

V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|) \quad (5)
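As a concrete illustration, here is a small NumPy sketch of the bilinear sampling kernel in Eq. (5), written as the literal double sum over the feature map; the helper name and test array are ours, not from the paper.

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel map U (H x W) at continuous coords (xs, ys) using
    the bilinear kernel of Eq. (5):
        sum_n sum_m U[n, m] * max(0, 1-|xs-m|) * max(0, 1-|ys-n|)
    where m indexes columns (x) and n indexes rows (y)."""
    H, W = U.shape
    m = np.arange(W)                              # column coordinates
    n = np.arange(H)                              # row coordinates
    kx = np.maximum(0.0, 1.0 - np.abs(xs - m))    # (W,) weights along x
    ky = np.maximum(0.0, 1.0 - np.abs(ys - n))    # (H,) weights along y
    return ky @ U @ kx                            # the double sum

U = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(U, 1.5, 2.25))  # interpolates the 4 neighbours -> 10.5
```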
9. 4. Contribution
1) RoIAlign
RoIPool (Faster R-CNN)
[Diagram] Input activation → region projection and pooling sections → max-pooling output.
Source: https://deepsense.io/region-of-interest-pooling-explained/
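Below is a hedged NumPy sketch of how RoIAlign-style pooling of one RoI could look, assuming average aggregation and a 2×2 grid of sample points per bin (the paper samples four regular locations per bin); the function names and test RoI are ours, not from the paper.

```python
import numpy as np

def bilinear(U, y, x):
    """Bilinear interpolation of feature map U (H x W) at continuous (y, x)."""
    H, W = U.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * U[y0, x0] + (1 - wy) * wx * U[y0, x1]
            + wy * (1 - wx) * U[y1, x0] + wy * wx * U[y1, x1])

def roi_align(U, roi, out_size=7, samples=2):
    """Sketch of RoIAlign pooling for one RoI.
    roi = (y0, x0, y1, x1) in *continuous* feature-map coordinates;
    no rounding anywhere (use x/16, not [x/16]). Each output bin
    averages samples x samples bilinearly interpolated points."""
    y0, x0, y1, x1 = roi
    bin_h = (y1 - y0) / out_size
    bin_w = (x1 - x0) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):              # bin row
        for j in range(out_size):          # bin column
            vals = []
            for sy in range(samples):      # regularly spaced sample points
                for sx in range(samples):
                    y = y0 + (i + (sy + 0.5) / samples) * bin_h
                    x = x0 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(U, y, x))
            out[i, j] = np.mean(vals)      # average aggregation (Table 2c)
    return out

U = np.random.rand(50, 50)
print(roi_align(U, (10.4, 7.9, 31.7, 24.2)).shape)  # (7, 7)
```

RoIPool, by contrast, would round the RoI boundaries and bin edges to integers before max-pooling, which is exactly the misalignment RoIAlign removes.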
11. 4. Contribution
2) Network architecture
[Diagram] Mask R-CNN = Faster R-CNN (with a ResNet or ResNeXt backbone) + an instance segmentation branch.
Figure 3. Head Architecture: We extend two existing Faster R-CNN heads. To demonstrate the generality of Mask R-CNN with multiple architectures, we differentiate between: (i) the convolutional backbone used for feature extraction over the entire image, and (ii) the network head (classification, box regression, and mask prediction) applied separately to each RoI.
- Faster R-CNN w/ ResNet [19] (C4 backbone): RoI → res5 (7×7×1024 → 7×7×2048) → ave → 2048 → class + box; mask branch: 7×7×2048 → 14×14×256 → 14×14×80 mask.
- Faster R-CNN w/ FPN [27] (Feature Pyramid Network backbone): RoI → 7×7×256 → fc 1024 → fc 1024 → class + box; mask branch: RoI → 14×14×256 (×4 convs) → 28×28×256 → 28×28×80 mask.
12. 4. Contribution
2) Network architecture
• Backbone - ResNet
layer name | output size | 18-layer              | 34-layer              | 50-layer                        | 101-layer                        | 152-layer
conv1      | 112×112     | 7×7, 64, stride 2 (all variants)
conv2_x    | 56×56       | 3×3 max pool, stride 2, then:
           |             | [3×3,64; 3×3,64]×2    | [3×3,64; 3×3,64]×3    | [1×1,64; 3×3,64; 1×1,256]×3     | [1×1,64; 3×3,64; 1×1,256]×3      | [1×1,64; 3×3,64; 1×1,256]×3
conv3_x    | 28×28       | [3×3,128; 3×3,128]×2  | [3×3,128; 3×3,128]×4  | [1×1,128; 3×3,128; 1×1,512]×4   | [1×1,128; 3×3,128; 1×1,512]×4    | [1×1,128; 3×3,128; 1×1,512]×8
conv4_x    | 14×14       | [3×3,256; 3×3,256]×2  | [3×3,256; 3×3,256]×6  | [1×1,256; 3×3,256; 1×1,1024]×6  | [1×1,256; 3×3,256; 1×1,1024]×23  | [1×1,256; 3×3,256; 1×1,1024]×36
conv5_x    | 7×7         | [3×3,512; 3×3,512]×2  | [3×3,512; 3×3,512]×3  | [1×1,512; 3×3,512; 1×1,2048]×3  | [1×1,512; 3×3,512; 1×1,2048]×3   | [1×1,512; 3×3,512; 1×1,2048]×3
           | 1×1         | average pool, 1000-d fc, softmax (all variants)
FLOPs      |             | 1.8×10⁹               | 3.6×10⁹               | 3.8×10⁹                         | 7.6×10⁹                          | 11.3×10⁹
Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
[Figure 4. Training on ImageNet (error % vs. iterations ×1e4). Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain-18 vs. plain-34; right: ResNet-18 vs. ResNet-34.]
• Backbone - FPN (Feature Pyramid Networks for RPN)
- FPN exploits the inherent multi-scale hierarchy of CNNs to compute multi-scale features.
- Replace the single-scale feature map (e.g. ResNet-50-C4, ResNet-101-C4) with FPN, using single-scale anchors at each pyramid level.
Source: Lin et al., Feature Pyramid Networks for Object Detection
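The deck does not show how an RoI is routed to a pyramid level; for context, here is a sketch of the assignment heuristic from the FPN paper (Lin et al., their Eq. 1), with our function name and clamping bounds for levels P2-P5.

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """FPN heuristic: assign an RoI of size w x h to pyramid level
    k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the available
    levels; 224 is the canonical ImageNet pre-training size."""
    k = int(math.floor(k0 + math.log2(math.sqrt(w * h) / 224)))
    return max(k_min, min(k, k_max))

print(fpn_level(224, 224))  # -> 4 (canonical size maps to P4)
print(fpn_level(112, 112))  # -> 3 (smaller RoIs map to finer levels)
```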
13. 4. Contribution
2) Network architecture
A. Fast R-CNN loss (Fast R-CNN: Details #4)
- Objective function: a loss defined over each RoI
- Class loss + localization loss
- Class loss: log loss
- Localization loss: Smooth L1 loss on the box position
  - Less sensitive than L2, whose gradient can become very large
  - The background class (u = 0) is excluded from the localization loss
B. Mask R-CNN multi-task loss
Formally, during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. The classification loss Lcls and bounding-box loss Lbox are identical to those defined in [12]. The mask branch has a K·m²-dimensional output for each RoI, which encodes K binary masks of resolution m×m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define Lmask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss).
This definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss; in that case, masks across classes compete, while with a per-pixel sigmoid and a binary loss they do not.
[Paper excerpt on RoIAlign:] These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks. To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]) and use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, aggregating the result (using max or average). We also compare to the RoIWarp operation of [10]: unlike RoIAlign, RoIWarp overlooked the alignment issue and quantizes the RoI just like RoIPool, so even though it adopts bilinear resampling it performs on par with RoIPool (Table 2c), demonstrating the crucial role of alignment.
Mask branch features:
• Fully convolutional
• K·(m×m) sigmoid outputs → pixel-wise binary classification → one mask for each class, no competition
• Lmask: mean binary cross-entropy
Overall head loss: L = Lcls + Lbox + Lmask (log loss + Smooth L1 loss + mean binary cross-entropy)
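A toy NumPy sketch of this multi-task loss under the definitions above; the shapes follow the paper (K = 80 classes, m = 28 mask resolution), but the tensors, helper names, and class index are made-up illustrative values.

```python
import numpy as np

def log_loss(p, u):
    """L_cls: log loss on the true class u (p = class probabilities)."""
    return -np.log(p[u])

def smooth_l1(t, v):
    """L_box: Smooth L1, less sensitive than L2, whose gradient can
    become very large for outliers."""
    d = np.abs(t - v)
    return np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

def mask_loss(mask_logits, gt_mask, k):
    """L_mask: mean per-pixel sigmoid binary cross-entropy, computed only
    on the k-th (ground-truth class) mask; the other K-1 mask outputs
    do not contribute to the loss."""
    p = 1.0 / (1.0 + np.exp(-mask_logits[k]))   # per-pixel sigmoid
    bce = -(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p))
    return bce.mean()

# One sampled foreground RoI with toy values.
K, m = 80, 28
probs = np.full(K + 1, 1.0 / (K + 1))                 # classes 0..K, 0 = background
t, v = np.zeros(4), np.array([0.3, -0.2, 0.1, 0.05])  # box targets vs. predictions
logits = np.random.randn(K, m, m)                     # K binary mask logits
gt = (np.random.rand(m, m) > 0.5).astype(float)       # ground-truth binary mask
u = 17                                                # ground-truth class label
L = log_loss(probs, u) + smooth_l1(t, v) + mask_loss(logits, gt, u - 1)
print(L)  # masks are indexed 0..K-1, hence k = u - 1 here
```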
14. 4. Contribution
2) Network architecture
[Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.]
Our definition of Lmask allows the network to generate masks for every class without competition
among classes; we rely on the dedicated classification branch to predict the class label used to
select the output mask. This decouples mask and class prediction. This is different from common
practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel
softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our
case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this
formulation is key for good instance segmentation results.
15. 5. Experiments
• Main dataset: MS COCO
• 80 classes
• 80k train images
• 35k subset of val images (train on trainval35k)
• 5k val images for ablation experiments (minival)
• Metric: COCO-style mask AP (AP, AP50, AP75, APS, APM, APL)
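AP here is averaged over IoU thresholds 0.5:0.95 in the standard COCO protocol, with IoU computed between binary masks rather than boxes. A minimal sketch of mask IoU (the helper name and toy masks are ours):

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary masks: |A ∩ B| / |A ∪ B|."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

a = np.zeros((28, 28), bool); a[5:20, 5:20] = True    # predicted mask
b = np.zeros((28, 28), bool); b[10:25, 10:25] = True  # ground-truth mask
print(mask_iou(a, b))  # two 15x15 squares overlapping in 10x10 -> 100/350
```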
17. 5. Experiments
Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
net-depth-features AP AP50 AP75
ResNet-50-C4 30.3 51.2 31.5
ResNet-101-C4 32.7 54.2 34.3
ResNet-50-FPN 33.6 55.2 35.3
ResNet-101-FPN 35.4 57.3 37.5
ResNeXt-101-FPN 36.7 59.5 38.9
(a) Backbone Architecture: Better back-
bones bring expected gains: deeper networks
do better, FPN outperforms C4 features, and
ResNeXt improves on ResNet.
AP AP50 AP75
softmax 24.8 44.1 25.1
sigmoid 30.3 51.2 31.5
+5.5 +7.1 +6.4
(b) Multinomial vs. Independent Masks
(ResNet-50-C4): Decoupling via per-
class binary masks (sigmoid) gives large
gains over multinomial masks (softmax).
             align? bilinear? agg.   AP   AP50  AP75
RoIPool [12]                  max   26.9  48.8  26.4
RoIWarp [10]        ✓         max   27.2  49.2  27.1
                    ✓         ave   27.1  48.9  27.1
RoIAlign      ✓     ✓         max   30.2  51.0  31.8
              ✓     ✓         ave   30.3  51.2  31.5
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

           AP   AP50  AP75   APbb  APbb50  APbb75
RoIPool   23.6  46.5  21.6   28.2  52.7    26.9
RoIAlign  30.9  51.8  32.1   34.0  55.3    36.4
          +7.3  +5.3  +10.5  +5.8  +2.6    +9.5
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in massive accuracy gaps.

mask branch                            AP   AP50  AP75
MLP fc: 1024→1024→80·28²              31.5  53.7  32.8
MLP fc: 1024→1024→1024→80·28²         31.5  54.0  32.6
FCN conv: 256→256→256→256→256→80      33.6  55.2  35.3
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
Table 2. Ablations for Mask R-CNN. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [29]). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP.
RoIAlign (Table 2c) improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to max/average pooling; we use average in the rest of the paper. Additionally, we compare with RoIWarp proposed in MNC [10], which also adopts bilinear sampling. As discussed in §3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign.
18. 5. Experiments
Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).
                     backbone                 AP    AP50  AP75  APS   APM   APL
MNC [10]             ResNet-101-C4            24.6  44.3  24.8   4.7  25.9  43.6
FCIS [26] +OHEM      ResNet-101-C5-dilated    29.2  49.5   -     7.1  31.3  50.0
FCIS+++ [26] +OHEM   ResNet-101-C5-dilated    33.6  54.5   -     -     -     -
Mask R-CNN           ResNet-101-C4            33.1  54.9  34.8  12.1  35.6  51.1
Mask R-CNN           ResNet-101-FPN           35.7  58.0  37.8  15.5  38.1  52.4
Mask R-CNN           ResNeXt-101-FPN          37.1  60.0  39.4  16.9  39.9  53.5
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results. Unless noted, AP is evaluated using mask IoU.
• Instance segmentation
19. 5. Experiments
                            backbone                  APbb  APbb50  APbb75  APbbS  APbbM  APbbL
Faster R-CNN+++ [19]        ResNet-101-C4             34.9  55.7    37.4    15.6   38.7   50.9
Faster R-CNN w FPN [27]     ResNet-101-FPN            36.2  59.1    39.0    18.2   39.0   48.2
Faster R-CNN by G-RMI [21]  Inception-ResNet-v2 [37]  34.7  55.5    36.7    13.5   38.1   52.0
Faster R-CNN w TDM [36]     Inception-ResNet-v2-TDM   36.8  57.7    39.2    16.2   39.8   52.1
Faster R-CNN, RoIAlign      ResNet-101-FPN            37.3  59.6    40.3    19.8   40.2   48.8
Mask R-CNN                  ResNet-101-FPN            38.2  60.3    41.7    20.1   41.1   50.2
Mask R-CNN                  ResNeXt-101-FPN           39.8  62.3    43.4    22.1   43.2   51.2
Table 3. Object detection single-model results (bounding box AP), vs. state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
Mask Branch: Segmentation is a pixel-to-pixel task and
we exploit the spatial layout of masks by using an FCN.
In Table 2e, we compare multi-layer perceptrons (MLP)
and FCNs, using a ResNet-50-FPN backbone. Using FCNs
gives a 2.1 mask AP gain over MLPs. We note that we
choose this backbone so that the conv layers of the FCN
head are not pre-trained, for a fair comparison with MLP.
4.3. Bounding Box Detection Results
We compare Mask R-CNN to the state-of-the-art COCO
bounding-box object detection in Table 3. For this result,
even though the full Mask R-CNN model is trained, only
the classification and box outputs are used at inference (the
mask output is ignored). Mask R-CNN using ResNet-101-
FPN outperforms the base variants of all previous state-of-
the-art models, including the single-model variant of G-
RMI [21], the winner of the COCO 2016 Detection Challenge. [...] The ResNet-101-C4 variant takes ∼400ms as it has a heavier box head (Figure 3), so
we do not recommend using the C4 variant in practice.
Although Mask R-CNN is fast, we note that our design
is not optimized for speed, and better speed/accuracy trade-
offs could be achieved [21], e.g., by varying image sizes and
proposal numbers, which is beyond the scope of this paper.
Training: Mask R-CNN is also fast to train. Training with
ResNet-50-FPN on COCO trainval35k takes 32 hours
in our synchronized 8-GPU implementation (0.72s per 16-
image mini-batch), and 44 hours with ResNet-101-FPN. In
fact, fast prototyping can be completed in less than one day
when training on the train set. We hope such rapid train-
ing will remove a major hurdle in this area and encourage
more people to perform research on this challenging topic.
• Object detection
20. Reference paper
• R. Girshick. Fast R-CNN. In ICCV, 2015.
• S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
• M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
• T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
• J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
• R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.