Mask R-CNN
ICCV 2017(Oral)
Kaiming He Georgia Gkioxari Piotr Dollár Ross Girshick
Facebook AI Research (FAIR)
Chanuk Lim
KEPRI
2017.08.10
1. Abstract
Our approach efficiently detects objects in an image while
simultaneously generating a high-quality segmentation mask
for each instance.
Mask R-CNN extends Faster R-CNN by adding a branch for
predicting an object mask in parallel with the existing branch
for bounding box recognition.
Mask R-CNN is simple to train and adds only a small overhead
to Faster R-CNN, running at 5fps.
2. Review - Instance Segmentation
http://blog.naver.com/sogangori/221012300995
Instance segmentation combines elements from the classical computer vision
tasks of object detection, where the goal is to classify individual objects and
localize each using a bounding box, and semantic segmentation, where the
goal is to classify each pixel into a fixed set of categories without
differentiating object instances.
Instance segmentation is challenging because it requires the correct detection
of all objects in an image while also precisely segmenting each instance.
2. Review - Fast R-CNN & Faster R-CNN
Background: R-CNN architecture
- ~2k region proposals from an independent algorithm
- Warped region proposals, per-region convolutional feature extraction (CNN)
- SVM classification and class-specific box regression (LSR)
- Based on proposed Regions of Interest (RoI); requires region warping for fixed-size features; very inefficient pipeline
Background: Fast R-CNN
- Convolutional backbone (CNN) computes a shared feature map
- RoIs come from an independent method
- RoIPool layer extracts a fixed-size feature map per RoI
- Fully connected layers perform classification and box regression
Background: Faster R-CNN
- Same as Fast R-CNN, but the RoIs come from a Region Proposal Network (RPN) on the shared feature map
Mask R-CNN: overview
- RPN and convolutional backbone as in Faster R-CNN
- RoIAlign layer (instead of RoIPool) extracts a fixed-size feature map per RoI
- Head: fully connected layers for classification and box regression, plus a mask branch
Lineage: R-CNN -> Fast R-CNN -> Faster R-CNN -> Mask R-CNN
Curtis Kim, kakao, https://www.slideshare.net/IldooKim/deep-object-detectors-1-20166
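To make the data flow above concrete, here is a minimal, framework-agnostic Python sketch of the Mask R-CNN inference pipeline. The function names (backbone, rpn, roi_align, box_head, mask_head) are hypothetical placeholders standing in for real modules, not the API of any particular library.

# Schematic Mask R-CNN forward pass (inference only). All callables below are
# hypothetical placeholders for actual backbone / RPN / head implementations.

def mask_rcnn_forward(image, backbone, rpn, roi_align, box_head, mask_head):
    # 1. Shared convolutional backbone (e.g. ResNet/ResNeXt, optionally with FPN)
    feature_map = backbone(image)

    # 2. Region Proposal Network produces candidate boxes (RoIs)
    rois = rpn(feature_map)

    # 3. RoIAlign extracts a fixed-size feature per RoI without quantization
    roi_features = [roi_align(feature_map, roi, output_size=(7, 7)) for roi in rois]

    # 4. Heads run in parallel on each RoI: class + box branch, and mask branch
    outputs = []
    for feat in roi_features:
        class_scores, box_deltas = box_head(feat)   # classification + box regression
        masks = mask_head(feat)                     # K binary masks of size m x m
        outputs.append((class_scores, box_deltas, masks))
    return outputs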
4. Contribution
1. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
2. Adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
4. Contribution
http://blog.naver.com/sogangori/221012300995
4. Contribution
1) RoIAlign
• Previous works - RoIPool
Example: RoIPool max-pools a 112 x 112 RoI region down to a 7 x 7 output; the input coordinates are rounded off (quantized) before pooling.
We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the
extracted features with the input.
RoIAlign improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter
localization metrics.
Ex) feature stride 16, [.] = rounding to the nearest integer:
[32/16] = 2, 32/16 = 2
[33/16] = 2, 33/16 = 2.06
[34/16] = 2, 34/16 = 2.12
[35/16] = 2, 35/16 = 2.18
[36/16] = 2, 36/16 = 2.25
[37/16] = 2, 37/16 = 2.31
[38/16] = 2, 38/16 = 2.37
[39/16] = 2, 39/16 = 2.43
[40/16] = 3, 40/16 = 2.5
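A small Python check of the quantization above: RoIPool snaps continuous RoI coordinates to the integer feature grid, so pixel coordinates that differ by several pixels map to the same feature cell, while RoIAlign keeps the exact fractional coordinate. This is only a numerical illustration of the rounding, not an implementation of either layer.

import math

STRIDE = 16  # feature stride of the backbone (conv feature map vs. input image)

for x in range(32, 41):
    exact = x / STRIDE                   # RoIAlign keeps the fractional coordinate
    quantized = math.floor(exact + 0.5)  # RoIPool-style round-to-nearest, i.e. [x/16]
    print(f"x={x:2d}  [x/16]={quantized}  x/16={exact:.2f}")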
4. Contribution
1) RoIAlign
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
We use bilinear interpolation (Spatial Transformer Networks, Jaderberg et al.) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
Spatial Transformer module (Jaderberg et al., NIPS 2015): the input feature map U is passed to a localisation network which regresses the transformation parameters \theta; a grid generator builds the sampling grid T_\theta(G) over the output V; and a sampler applies this grid to U, producing the warped output feature map V. The same warping is applied to each channel, and any sampling kernel can be used as long as (sub-)gradients can be defined with respect to the sampling coordinates.

Pointwise affine transformation (STN Eq. 1):
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_\theta is the affine transformation matrix. Height- and width-normalised coordinates are used, so that -1 \le x_i^t, y_i^t \le 1 within the spatial bounds of the output and -1 \le x_i^s, y_i^s \le 1 within the bounds of the input. The transformation class may be more constrained, e.g. the attention form A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} (isotropic scaling and translation), or more general (plane projective, piecewise affine, thin plate spline); any parameterised form works provided it is differentiable with respect to its parameters.

Generic differentiable sampling (STN Eq. 3):
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y), \quad \forall i \in [1 \ldots H'W'], \; \forall c \in [1 \ldots C]

Integer sampling kernel (STN Eq. 4), which just copies the value at the nearest pixel to (x_i^s, y_i^s) to the output location:
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m) \, \delta(\lfloor y_i^s + 0.5 \rfloor - n)

Bilinear sampling kernel (STN Eq. 5), the one used by RoIAlign:
V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)

Notation:
• V_i^c: target feature value at output location i in channel c
• U_{nm}^c: input feature value at location (n, m) in channel c
• k(\cdot; \Phi): sampling kernel with parameters \Phi
• (x_i^s, y_i^s): sampling coordinate
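Below is a minimal NumPy sketch of the bilinear sampling kernel above, applied the way RoIAlign uses it: sample four regularly spaced points inside one RoI bin at fractional coordinates and average them. It is a sketch under those assumptions, not the implementation used in the paper or in any particular framework.

import numpy as np

def bilinear_sample(feature, y, x):
    """Bilinear interpolation of a 2-D feature map at fractional (y, x)."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feature[y0, x0] * (1 - dy) * (1 - dx) +
            feature[y0, x1] * (1 - dy) * dx +
            feature[y1, x0] * dy * (1 - dx) +
            feature[y1, x1] * dy * dx)

def roi_align_bin(feature, y_start, x_start, bin_h, bin_w):
    """Average of 2x2 regularly sampled bilinear values inside one RoI bin.

    (y_start, x_start) and (bin_h, bin_w) are continuous feature-map
    coordinates; nothing is rounded, which is the point of RoIAlign.
    """
    samples = []
    for iy in range(2):
        for ix in range(2):
            sy = y_start + (iy + 0.5) * bin_h / 2.0
            sx = x_start + (ix + 0.5) * bin_w / 2.0
            samples.append(bilinear_sample(feature, sy, sx))
    return float(np.mean(samples))

# Toy usage: an 8x8 feature map, one bin of size 2.3x2.3 starting at (1.7, 3.2)
feat = np.arange(64, dtype=float).reshape(8, 8)
print(roi_align_bin(feat, 1.7, 3.2, 2.3, 2.3))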
4. Contribution
1) RoIAlign
RoIPool (Faster R-CNN): input activation -> region projection and pooling sections (snapped to the feature grid) -> max pooling output.
Source: https://deepsense.io/region-of-interest-pooling-explained/
RoIAlign (Mask R-CNN): input activation -> region projection and pooling sections (no quantization) -> sampling locations -> bilinear interpolated values -> max pooling output.
Silvio Galesso, https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
4. Contribution
2) Network architecture
Mask R-CNN = Faster R-CNN + an instance segmentation (mask) branch, with a ResNet or ResNeXt backbone.
[Figure 3. Head Architecture: the two existing Faster R-CNN heads are extended with a mask branch. The backbone is used for feature extraction over the whole image; the head (classification, box regression, and mask prediction) is applied separately to each RoI.
- Faster R-CNN w/ ResNet [19] (C4 backbone): the RoI feature (7x7x1024) goes through res5 to 7x7x2048, then average pooling feeds the class and box outputs (2048-d); a 14x14x256 conv branch produces the 14x14x80 mask.
- Faster R-CNN w/ FPN [27] (Feature Pyramid Network backbone): the RoI feature (7x7x256) goes through two 1024-d fc layers to the class and box outputs; in parallel, a 14x14x256 RoI feature passes through four convs (14x14x256), is upsampled to 28x28x256, and produces the 28x28x80 mask.]
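As a concrete reading of the FPN head in Figure 3, here is a hedged PyTorch-style sketch of the mask branch (four 3x3 convs at 256 channels, a stride-2 deconv to 28x28, and a 1x1 conv to K = 80 class-specific masks). The layer sizes follow the figure; the class and attribute names are mine, and this is not the authors' code.

import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """FPN-style mask branch: 14x14x256 RoI feature -> 28x28x80 mask logits (sketch)."""

    def __init__(self, in_channels: int = 256, num_classes: int = 80):
        super().__init__()
        convs = []
        for _ in range(4):                      # four 3x3 convs, 256 channels each
            convs += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                      nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 14x14 -> 28x28
        self.predictor = nn.Conv2d(256, num_classes, kernel_size=1)          # one mask per class

    def forward(self, roi_features: torch.Tensor) -> torch.Tensor:
        # roi_features: (num_rois, 256, 14, 14) coming from RoIAlign
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.predictor(x)                # (num_rois, 80, 28, 28), per-pixel logits

# Toy usage with random RoI features
masks = MaskHead()(torch.randn(3, 256, 14, 14))
print(masks.shape)  # torch.Size([3, 80, 28, 28])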
4. Contribution
2) Network architecture
• Backbone - ResNet
Table 1 (He et al., Deep Residual Learning, CVPR 2016). Architectures for ImageNet. Building blocks are shown in brackets, with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

layer name | output size | 18-layer              | 34-layer | 50-layer                        | 101-layer | 152-layer
conv1      | 112x112     | 7x7, 64, stride 2 (all variants)
conv2_x    | 56x56       | 3x3 max pool, stride 2 (all variants), then:
           |             | [3x3,64; 3x3,64] x2   | x3       | [1x1,64; 3x3,64; 1x1,256] x3    | x3        | x3
conv3_x    | 28x28       | [3x3,128; 3x3,128] x2 | x4       | [1x1,128; 3x3,128; 1x1,512] x4  | x4        | x8
conv4_x    | 14x14       | [3x3,256; 3x3,256] x2 | x6       | [1x1,256; 3x3,256; 1x1,1024] x6 | x23       | x36
conv5_x    | 7x7         | [3x3,512; 3x3,512] x2 | x3       | [1x1,512; 3x3,512; 1x1,2048] x3 | x3        | x3
           | 1x1         | average pool, 1000-d fc, softmax (all variants)
FLOPs      |             | 1.8x10^9              | 3.6x10^9 | 3.8x10^9                        | 7.6x10^9  | 11.3x10^9
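For reference, a hedged PyTorch-style sketch of the bottleneck building block listed in the table ([1x1, 64; 3x3, 64; 1x1, 256] for conv2_x of ResNet-50/101/152). Naming and details such as where batch norm sits are my assumptions, not copied from the original implementation.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck block, e.g. [1x1,64; 3x3,64; 1x1,256] (sketch)."""

    def __init__(self, in_channels=256, mid_channels=64, out_channels=256, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Projection shortcut when the shape changes (e.g. first block of a stage)
        self.shortcut = (nn.Identity() if stride == 1 and in_channels == out_channels
                         else nn.Sequential(nn.Conv2d(in_channels, out_channels, 1,
                                                      stride=stride, bias=False),
                                            nn.BatchNorm2d(out_channels)))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))

# conv2_x of ResNet-50: three stacked bottleneck blocks on a 56x56 feature map
stage = nn.Sequential(Bottleneck(64, 64, 256), Bottleneck(256, 64, 256), Bottleneck(256, 64, 256))
print(stage(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])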
[Figure 4 (He et al., 2016). Training on ImageNet: error (%) vs. iterations (1e4) for plain-18/plain-34 (left) and ResNet-18/ResNet-34 (right). Thin curves denote training error, bold curves denote validation error of the center crops.]
• Backbone - FPN (Feature Pyramid Networks for RPN)
- FPN exploits the inherent hierarchy of CNNs to compute multi-scale features.
- It replaces the single-scale feature map (ResNet-50-C4 / ResNet-101-C4) with a feature pyramid, with single-scale anchors assigned to each pyramid level.
Source: Lin et al., Feature Pyramid Networks for Object Detection
4. Contribution
2) Network architecture
Fast R-CNN: Details #4 - Objective Function
- A loss function is defined over each RoI region: class loss + localization loss
- The class loss is a log loss
- The localization loss is a smooth L1 loss on the box location
  - Less sensitive than L2, whose gradient can become very large
  - The background class (u = 0) is excluded from the localization term
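For reference, the Fast R-CNN multi-task loss that the bullets above describe can be written out as follows (as defined in Girshick, Fast R-CNN, ICCV 2015; reproduced from memory, so treat the exact notation as a paraphrase rather than a quotation):

L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\text{loc}}(t^u, v), \qquad L_{\text{cls}}(p, u) = -\log p_u

L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L_1}(t_i^u - v_i), \qquad \operatorname{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}

The Iverson bracket [u \ge 1] turns the localization term off for the background class u = 0, matching the last bullet above.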
Mask R-CNN loss (from the paper): classification and box regression are performed in parallel (which largely simplifies the multi-stage pipeline of the original R-CNN [13]). Formally, during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. The classification loss Lcls and bounding-box loss Lbox are identical to those defined in [12] (log loss and smooth L1 loss, as above). The mask branch has a K*m^2-dimensional output for each RoI, which encodes K binary masks of resolution m x m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define Lmask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss).
RoIAlign and the mask loss: the quantizations of RoIPool introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks. To address this, the RoIAlign layer removes the harsh quantization of RoIPool, properly aligning the extracted features with the input: no quantization is applied to the RoI boundaries or bins (i.e., x/16 is used instead of [x/16]), and bilinear interpolation [22] computes the exact values of the input features at four regularly sampled locations in each RoI bin, aggregated using max or average pooling. (Four regular locations are sampled per bin so that either max or average pooling can be evaluated.) The RoIWarp operation proposed in [10] also adopts bilinear resampling, but unlike RoIAlign it overlooked the alignment issue and was implemented with quantization just like RoIPool, so it performs on par with RoIPool as shown by experiments (Table 2c), demonstrating the crucial role of alignment.
Summary of the mask branch:
• Fully convolutional
• K * (m x m) sigmoid outputs -> pixel-wise binary classification, one mask for each class, no competition between classes
• Lmask: mean binary cross-entropy
• Overall head loss per RoI: L = Lcls + Lbox + Lmask (classification: log loss; box regression: smooth L1 loss)
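A hedged PyTorch-style sketch of Lmask as described above: per-pixel sigmoid plus binary cross-entropy, evaluated only on the mask channel of the ground-truth class. Tensor shapes follow the FPN head (K = 80 classes, m = 28), and the helper function is my own illustration, not the authors' code.

import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """L_mask: average binary cross-entropy on the ground-truth class channel.

    mask_logits: (num_rois, K, m, m) raw outputs of the mask branch
    gt_classes:  (num_rois,) ground-truth class index k for each RoI
    gt_masks:    (num_rois, m, m) binary ground-truth masks in {0, 1}
    """
    num_rois = mask_logits.shape[0]
    # Select only the k-th mask per RoI; the other class channels do not contribute.
    selected = mask_logits[torch.arange(num_rois), gt_classes]   # (num_rois, m, m)
    return F.binary_cross_entropy_with_logits(selected, gt_masks)

# Toy usage: 3 RoIs, 80 classes, 28x28 masks
logits = torch.randn(3, 80, 28, 28)
classes = torch.tensor([5, 17, 0])
targets = (torch.rand(3, 28, 28) > 0.5).float()
print(mask_loss(logits, classes, targets))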
Our definition of Lmask allows the network to generate masks for every class without competition
among classes; we rely on the dedicated classification branch to predict the class label used to
select the output mask. This decouples mask and class prediction. This is different from common
practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel
softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our
case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this
formulation is key for good instance segmentation results.
5. Experiments
• Main dataset: MS COCO
  • 80 classes
  • 80k train images
  • 35k subset of val images (trainval35k, used together with train for training)
  • 5k images (minival) for ablation experiments
• Metric: AP evaluated using mask IoU (AP50, AP75, APS, APM, APL)
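Since all results below are reported as AP evaluated with mask IoU, here is a small NumPy sketch of the mask IoU computation itself (my own illustration of the standard definition, not COCO's evaluation code):

import numpy as np

def mask_iou(mask_a, mask_b):
    """IoU between two binary masks of the same shape (boolean arrays)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 0.0

# Toy usage: two overlapping 3x3 square masks on a 5x5 grid
a = np.zeros((5, 5), dtype=bool); a[0:3, 0:3] = True   # 9 pixels
b = np.zeros((5, 5), dtype=bool); b[1:4, 1:4] = True   # 9 pixels, 4 shared
print(mask_iou(a, b))  # 4 / 14 ~= 0.286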
5. Experiments
[Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1): qualitative results showing per-instance masks with class labels and confidence scores (person, car, horse, kite, donut, zebra, etc.).]
5. Experiments
Table 2. Ablations for Mask R-CNN. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.

(a) Backbone Architecture: better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
net-depth-features   AP    AP50   AP75
ResNet-50-C4         30.3  51.2   31.5
ResNet-101-C4        32.7  54.2   34.3
ResNet-50-FPN        33.6  55.2   35.3
ResNet-101-FPN       35.4  57.3   37.5
ResNeXt-101-FPN      36.7  59.5   38.9

(b) Multinomial vs. Independent Masks (ResNet-50-C4): decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
          AP    AP50   AP75
softmax   24.8  44.1   25.1
sigmoid   30.3  51.2   31.5
gain      +5.5  +7.1   +6.4

(c) RoIAlign (ResNet-50-C4): mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.
               align?  bilinear?  agg.  AP    AP50   AP75
RoIPool [12]   no      no         max   26.9  48.8   26.4
RoIWarp [10]   no      yes        max   27.2  49.2   27.1
RoIWarp [10]   no      yes        ave   27.1  48.9   27.1
RoIAlign       yes     yes        max   30.2  51.0   31.8
RoIAlign       yes     yes        ave   30.3  51.2   31.5

(d) RoIAlign (ResNet-50-C5, stride 32): mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in massive accuracy gaps.
           AP    AP50   AP75   APbb   APbb50  APbb75
RoIPool    23.6  46.5   21.6   28.2   52.7    26.9
RoIAlign   30.9  51.8   32.1   34.0   55.3    36.4
gain       +7.3  +5.3   +10.5  +5.8   +2.6    +9.5

(e) Mask Branch (ResNet-50-FPN): fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
mask branch                                        AP    AP50   AP75
MLP  fc: 1024 -> 1024 -> 80*28^2                   31.5  53.7   32.8
MLP  fc: 1024 -> 1024 -> 1024 -> 80*28^2           31.5  54.0   32.6
FCN  conv: 256 -> 256 -> 256 -> 256 -> 256 -> 80   33.6  55.2   35.3

Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [29]). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP.
RoIAlign (Table 2c): RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to max/average pooling; we use average in the rest of the paper. Additionally, we compare with RoIWarp proposed in MNC [10], which also adopts bilinear sampling. As discussed in §3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign.
5. Experiments
• Instance segmentation

Table 1. Instance segmentation mask AP on COCO test-dev (AP is evaluated using mask IoU). MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results.

                      backbone                AP    AP50   AP75   APS   APM   APL
MNC [10]              ResNet-101-C4           24.6  44.3   24.8   4.7   25.9  43.6
FCIS [26] + OHEM      ResNet-101-C5-dilated   29.2  49.5   -      7.1   31.3  50.0
FCIS+++ [26] + OHEM   ResNet-101-C5-dilated   33.6  54.5   -      -     -     -
Mask R-CNN            ResNet-101-C4           33.1  54.9   34.8   12.1  35.6  51.1
Mask R-CNN            ResNet-101-FPN          35.7  58.0   37.8   15.5  38.1  52.4
Mask R-CNN            ResNeXt-101-FPN         37.1  60.0   39.4   16.9  39.9  53.5
Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
5. Experiments
• Object detection

Table 3. Object detection single-model results (bounding box AP), vs. state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).

                             backbone                   APbb  APbb50  APbb75  APbbS  APbbM  APbbL
Faster R-CNN+++ [19]         ResNet-101-C4              34.9  55.7    37.4    15.6   38.7   50.9
Faster R-CNN w FPN [27]      ResNet-101-FPN             36.2  59.1    39.0    18.2   39.0   48.2
Faster R-CNN by G-RMI [21]   Inception-ResNet-v2 [37]   34.7  55.5    36.7    13.5   38.1   52.0
Faster R-CNN w TDM [36]      Inception-ResNet-v2-TDM    36.8  57.7    39.2    16.2   39.8   52.1
Faster R-CNN, RoIAlign       ResNet-101-FPN             37.3  59.6    40.3    19.8   40.2   48.8
Mask R-CNN                   ResNet-101-FPN             38.2  60.3    41.7    20.1   41.1   50.2
Mask R-CNN                   ResNeXt-101-FPN            39.8  62.3    43.4    22.1   43.2   51.2
Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2e, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.

4.3. Bounding Box Detection Results: We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI [21], the winner of the COCO 2016 Detection Challenge. [...] The C4 variant takes ~400ms as it has a heavier box head (Figure 3), so we do not recommend using the C4 variant in practice. Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy trade-offs could be achieved [21], e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.

Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.
5. Mask R-CNN for Human Pose Estimation
Reference papers
• R. Girshick. Fast R-CNN. In ICCV, 2015.
• S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
• M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
• T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
• J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
• R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
Mais conteúdo relacionado

Mais procurados

【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識Hirokatsu Kataoka
 
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentationTakuya Minagawa
 
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision TransformerYusuke Uchida
 
【DL輪読会】ViT + Self Supervised Learningまとめ
【DL輪読会】ViT + Self Supervised Learningまとめ【DL輪読会】ViT + Self Supervised Learningまとめ
【DL輪読会】ViT + Self Supervised LearningまとめDeep Learning JP
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Yusuke Uchida
 
【DL輪読会】Segment Anything
【DL輪読会】Segment Anything【DL輪読会】Segment Anything
【DL輪読会】Segment AnythingDeep Learning JP
 
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII
 
【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fields【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fieldscvpaper. challenge
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyNUPUR YADAV
 
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向SSII
 
論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNN論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNNTakashi Abe
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked AutoencodersDeep Learning JP
 
【チュートリアル】コンピュータビジョンによる動画認識 v2
【チュートリアル】コンピュータビジョンによる動画認識 v2【チュートリアル】コンピュータビジョンによる動画認識 v2
【チュートリアル】コンピュータビジョンによる動画認識 v2Hirokatsu Kataoka
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Universitat Politècnica de Catalunya
 
Domain Adaptation 発展と動向まとめ(サーベイ資料)
Domain Adaptation 発展と動向まとめ(サーベイ資料)Domain Adaptation 発展と動向まとめ(サーベイ資料)
Domain Adaptation 発展と動向まとめ(サーベイ資料)Yamato OKAMOTO
 
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習SSII
 
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話Yusuke Uchida
 

Mais procurados (20)

【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識【チュートリアル】コンピュータビジョンによる動画認識
【チュートリアル】コンピュータビジョンによる動画認識
 
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation
2018/12/28 LiDARで取得した道路上点群に対するsemantic segmentation
 
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer
 
【DL輪読会】ViT + Self Supervised Learningまとめ
【DL輪読会】ViT + Self Supervised Learningまとめ【DL輪読会】ViT + Self Supervised Learningまとめ
【DL輪読会】ViT + Self Supervised Learningまとめ
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
 
【DL輪読会】Segment Anything
【DL輪読会】Segment Anything【DL輪読会】Segment Anything
【DL輪読会】Segment Anything
 
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
SSII2021 [SS1] Transformer x Computer Visionの 実活用可能性と展望 〜 TransformerのCompute...
 
【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fields【メタサーベイ】Neural Fields
【メタサーベイ】Neural Fields
 
Image Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A surveyImage Segmentation Using Deep Learning : A survey
Image Segmentation Using Deep Learning : A survey
 
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向
SSII2021 [OS2-02] 深層学習におけるデータ拡張の原理と最新動向
 
論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNN論文紹介: Fast R-CNN&Faster R-CNN
論文紹介: Fast R-CNN&Faster R-CNN
 
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
【DL輪読会】ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
 
【チュートリアル】コンピュータビジョンによる動画認識 v2
【チュートリアル】コンピュータビジョンによる動画認識 v2【チュートリアル】コンピュータビジョンによる動画認識 v2
【チュートリアル】コンピュータビジョンによる動画認識 v2
 
The Origin of Grad-CAM
The Origin of Grad-CAMThe Origin of Grad-CAM
The Origin of Grad-CAM
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Domain Adaptation 発展と動向まとめ(サーベイ資料)
Domain Adaptation 発展と動向まとめ(サーベイ資料)Domain Adaptation 発展と動向まとめ(サーベイ資料)
Domain Adaptation 発展と動向まとめ(サーベイ資料)
 
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習
SSII2020 [OS2-02] 教師あり事前学習を凌駕する「弱」教師あり事前学習
 
実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE
 
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
 

Semelhante a Mask R-CNN

Illustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece PsychotherapyIllustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece PsychotherapyIJMER
 
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...sipij
 
DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTAM Publications
 
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION cscpconf
 
Research Paper v2.0
Research Paper v2.0Research Paper v2.0
Research Paper v2.0Kapil Tiwari
 
An Efficient Algorithm for the Segmentation of Astronomical Images
An Efficient Algorithm for the Segmentation of Astronomical  ImagesAn Efficient Algorithm for the Segmentation of Astronomical  Images
An Efficient Algorithm for the Segmentation of Astronomical ImagesIOSR Journals
 
ANALYSIS OF INTEREST POINTS OF CURVELET COEFFICIENTS CONTRIBUTIONS OF MICROS...
ANALYSIS OF INTEREST POINTS OF CURVELET  COEFFICIENTS CONTRIBUTIONS OF MICROS...ANALYSIS OF INTEREST POINTS OF CURVELET  COEFFICIENTS CONTRIBUTIONS OF MICROS...
ANALYSIS OF INTEREST POINTS OF CURVELET COEFFICIENTS CONTRIBUTIONS OF MICROS...sipij
 
Higher dimension raster 5d
Higher dimension raster 5dHigher dimension raster 5d
Higher dimension raster 5dPedram Mazloom
 
Empirical Analysis of Invariance of Transform Coefficients under Rotation
Empirical Analysis of Invariance of Transform Coefficients under RotationEmpirical Analysis of Invariance of Transform Coefficients under Rotation
Empirical Analysis of Invariance of Transform Coefficients under RotationIJERD Editor
 
Improved Characters Feature Extraction and Matching Algorithm Based on SIFT
Improved Characters Feature Extraction and Matching Algorithm Based on SIFTImproved Characters Feature Extraction and Matching Algorithm Based on SIFT
Improved Characters Feature Extraction and Matching Algorithm Based on SIFTNooria Sukmaningtyas
 
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...Medial Axis Transformation based Skeletonzation of Image Patterns using Image...
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...IOSR Journals
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Global threshold and region based active contour model for accurate image seg...
Global threshold and region based active contour model for accurate image seg...Global threshold and region based active contour model for accurate image seg...
Global threshold and region based active contour model for accurate image seg...sipij
 
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...ijistjournal
 

Semelhante a Mask R-CNN (20)

Illustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece PsychotherapyIllustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
 
paper
paperpaper
paper
 
N045077984
N045077984N045077984
N045077984
 
R044120124
R044120124R044120124
R044120124
 
ijecct
ijecctijecct
ijecct
 
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...
ALEXANDER FRACTIONAL INTEGRAL FILTERING OF WAVELET COEFFICIENTS FOR IMAGE DEN...
 
DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECT
 
www.ijerd.com
www.ijerd.comwww.ijerd.com
www.ijerd.com
 
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
 
Research Paper v2.0
Research Paper v2.0Research Paper v2.0
Research Paper v2.0
 
An Efficient Algorithm for the Segmentation of Astronomical Images
An Efficient Algorithm for the Segmentation of Astronomical  ImagesAn Efficient Algorithm for the Segmentation of Astronomical  Images
An Efficient Algorithm for the Segmentation of Astronomical Images
 
ANALYSIS OF INTEREST POINTS OF CURVELET COEFFICIENTS CONTRIBUTIONS OF MICROS...
ANALYSIS OF INTEREST POINTS OF CURVELET  COEFFICIENTS CONTRIBUTIONS OF MICROS...ANALYSIS OF INTEREST POINTS OF CURVELET  COEFFICIENTS CONTRIBUTIONS OF MICROS...
ANALYSIS OF INTEREST POINTS OF CURVELET COEFFICIENTS CONTRIBUTIONS OF MICROS...
 
Higher dimension raster 5d
Higher dimension raster 5dHigher dimension raster 5d
Higher dimension raster 5d
 
A0280105
A0280105A0280105
A0280105
 
Empirical Analysis of Invariance of Transform Coefficients under Rotation
Empirical Analysis of Invariance of Transform Coefficients under RotationEmpirical Analysis of Invariance of Transform Coefficients under Rotation
Empirical Analysis of Invariance of Transform Coefficients under Rotation
 
Improved Characters Feature Extraction and Matching Algorithm Based on SIFT
Improved Characters Feature Extraction and Matching Algorithm Based on SIFTImproved Characters Feature Extraction and Matching Algorithm Based on SIFT
Improved Characters Feature Extraction and Matching Algorithm Based on SIFT
 
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...Medial Axis Transformation based Skeletonzation of Image Patterns using Image...
Medial Axis Transformation based Skeletonzation of Image Patterns using Image...
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Global threshold and region based active contour model for accurate image seg...
Global threshold and region based active contour model for accurate image seg...Global threshold and region based active contour model for accurate image seg...
Global threshold and region based active contour model for accurate image seg...
 
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...
DESPECKLING OF SAR IMAGES BY OPTIMIZING AVERAGED POWER SPECTRAL VALUE IN CURV...
 

Último

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Mask R-CNN

  • 1. Mask R-CNN ICCV 2017(Oral) Kaiming He Georgia Gkioxari Piotr Dollár Ross Girshick Facebook AI Research (FAIR) Chanuk Lim KEPRI 2017.08.10
  • 2. 1. Abstract Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5fps.
  • 3. 2. Review - Instance Segmentation http://blog.naver.com/sogangori/221012300995 Instance segmentation combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
  • 4. 2. Review - Fast R-CNN & Faster R-CNN. Background: R-CNN architecture - ~2k region proposals from an independent algorithm, per-region convolutional feature extraction on warped region proposals, SVM classification and class-specific box regression; it is based on proposed Regions of Interest (RoI), requires region warping for fixed-size features, and is a very inefficient pipeline. Background: Fast R-CNN - a shared convolutional backbone produces a feature map, an RoIPool layer extracts a fixed-size feature per RoI (RoIs still come from an independent method), and fully connected layers perform classification and box regression. Background: Faster R-CNN - same as Fast R-CNN, but the RoIs come from a Region Proposal Network (RPN) on the shared feature map. Mask R-CNN overview - Faster R-CNN with RoIAlign in place of RoIPool and an additional mask branch alongside the classification and box-regression head. [Architecture diagrams: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN.] Curtis Kim, kakao, https://www.slideshare.net/IldooKim/deep-object-detectors-1-20166
  • 5. 4. Contribution 1. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. 2. Adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
  • 7. 4. Contribution 1) RoIAlign • Previous works - RoIPool: 112 x 112 → 7 x 7 via max-pooling (input coordinates are rounded off). We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. RoIAlign improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Ex) [32/16] = 2, 32/16 = 2; [33/16] = 2, 33/16 = 2.06; [34/16] = 2, 34/16 = 2.12; [35/16] = 2, 35/16 = 2.18; [36/16] = 2, 36/16 = 2.25; [37/16] = 2, 37/16 = 2.31; [38/16] = 2, 38/16 = 2.37; [39/16] = 2, 39/16 = 2.43; [40/16] = 3, 40/16 = 2.5
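To make the quantization concrete, here is a tiny Python sketch (illustrative only, not the authors' code) that reproduces the example above: RoIPool rounds x/16 to an integer feature-map cell, while RoIAlign keeps the continuous coordinate for later bilinear interpolation.

```python
# Tiny illustration (not the authors' code) of the coordinate handling above.
import math

STRIDE = 16  # feature-map stride used in the [x/16] example

def roipool_cell(x):
    # RoIPool: round the continuous coordinate to an integer cell ("rounded off").
    return math.floor(x / STRIDE + 0.5)   # round-half-up, matching [40/16] = 3

def roialign_coord(x):
    # RoIAlign: keep x/16 as a continuous coordinate; the feature value there is
    # obtained by bilinear interpolation instead of snapping to a cell.
    return x / STRIDE

for x in range(32, 41):
    print(f"x={x:2d}: RoIPool cell {roipool_cell(x)}, RoIAlign coord {roialign_coord(x):.2f}")
```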
  • 8. 4. Contribution 1) RoIAlign. We use bilinear interpolation (Spatial Transformer Networks, Jaderberg et al.) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in Neural Information Processing Systems. 2015.
[Figure: the spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters $\theta$; a grid generator builds the sampling grid $T_\theta(G)$ over the output V; a sampler applies it to U, producing the warped output feature map V.]
For an affine transform $A_\theta$, the pointwise mapping from output (target) grid coordinates to input (source) sampling coordinates is
$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}.$
Sampling the input feature map U with a generic kernel $k(\cdot;\Phi)$ gives
$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x)\, k(y_i^s - n; \Phi_y),$
where $V_i^c$ is the target feature value for pixel i in channel c, $U_{nm}^c$ is the input feature value at location (n, m) in channel c, $(x_i^s, y_i^s)$ is the sampling coordinate, and $\Phi$ are the kernel parameters.
With the integer (nearest-neighbour) kernel this reduces to copying the value at the nearest pixel to $(x_i^s, y_i^s)$ to the output location $(x_i^t, y_i^t)$:
$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m)\, \delta(\lfloor y_i^s + 0.5 \rfloor - n).$
With the bilinear sampling kernel, used by RoIAlign,
$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|).$
Because the kernel is (sub-)differentiable with respect to $x_i^s$ and $y_i^s$, gradients can be backpropagated through the sampling points.
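As a concrete illustration of the bilinear kernel above, the following NumPy sketch (an illustration under assumed shapes, not the paper's implementation) samples a feature map at a continuous location, which is exactly what RoIAlign does at each of its four sampling points per bin.

```python
# Minimal NumPy sketch of the bilinear sampling kernel
# V = sum_{n,m} U[n, m] * max(0, 1 - |x - m|) * max(0, 1 - |y - n|).
import numpy as np

def bilinear_sample(feature_map, x, y):
    """feature_map: (H, W) array; (x, y) a continuous location, x along width."""
    H, W = feature_map.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    value = 0.0
    for n in (y0, y0 + 1):          # the two rows that bracket y
        for m in (x0, x0 + 1):      # the two columns that bracket x
            if 0 <= n < H and 0 <= m < W:
                wx = max(0.0, 1.0 - abs(x - m))
                wy = max(0.0, 1.0 - abs(y - n))
                value += feature_map[n, m] * wx * wy
    return value

# Example: sample at the slide's non-integer coordinate 33/16 = 2.06.
fmap = np.arange(49, dtype=float).reshape(7, 7)
print(bilinear_sample(fmap, 33 / 16, 33 / 16))
```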
  • 9. 4. Contribution 1) RoIAlign - RoIPool in Faster R-CNN. [Illustration panels: input activation → region projection and pooling sections → max-pooling output.] Source: https://deepsense.io/region-of-interest-pooling-explained/
  • 10. 4. Contribution 1) RoIAlign - RoIAlign in Mask R-CNN. [Illustration panels: input activation → region projection and pooling sections → sampling locations → bilinear interpolated values → max-pooling output.] Silvio Galesso, https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
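Both RoI layers are also available as library operators, which makes the comparison easy to try directly. A minimal usage sketch with torchvision's ops (assuming a reasonably recent torchvision; treat the exact argument set as an assumption for older versions):

```python
# Minimal sketch comparing the two RoI layers via torchvision's ops.
import torch
from torchvision.ops import roi_pool, roi_align

features = torch.randn(1, 256, 50, 50)          # backbone feature map, stride 16
# One RoI in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0.0, 33.0, 33.0, 145.0, 145.0]])

pooled  = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(features, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2)           # 2x2 sampling points per bin
print(pooled.shape, aligned.shape)              # both: torch.Size([1, 256, 7, 7])
```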
  • 11. 4. Contribution 2) Network architecture. Instance segmentation = Faster R-CNN + an instance-segmentation (mask) branch, with a ResNet or ResNeXt backbone. We differentiate between: (i) the convolutional backbone used for feature extraction over the entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI. [Figure 3. Head Architecture: we extend two existing Faster R-CNN heads. In Faster R-CNN w/ ResNet [19], res5 is applied per RoI before the class/box outputs and the mask branch; in Faster R-CNN w/ FPN [27], a 7×7×256 RoI feature feeds two 1024-d fully connected layers for class/box, while a separate 14×14×256 convolutional stack is upsampled to a 28×28×80 mask. Numbers denote spatial resolution and channels.] FPN: Feature Pyramid Network.
  • 12. 4. Contribution 2) Network architecture • Backbone - ResNet. [Table: ImageNet architectures for ResNet-18/34/50/101/152 (conv1 through conv5_x plus average pool and 1000-d fc); building blocks are shown in brackets with the numbers of blocks stacked, downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2; FLOPs range from 1.8×10^9 (ResNet-18) to 11.3×10^9 (ResNet-152).] [Figure: training on ImageNet, training and validation error curves for plain vs. residual 18- and 34-layer networks.] • Backbone - FPN: FPN exploits the inherent hierarchy of CNNs to compute multi-scale features; the single-scale feature map (e.g. ResNet-50-C4 / ResNet-101-C4) is replaced with FPN, with single-scale anchors at each pyramid level. Source: Lin et al., Feature Pyramid Networks for Object Detection.
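These pieces (ResNet-FPN backbone, RPN, RoIAlign, box and mask heads) are also bundled in torchvision's reference Mask R-CNN. A minimal inference sketch, assuming torchvision 0.13 or later (the `weights` argument is an assumption for older releases):

```python
# Minimal sketch using torchvision's reference Mask R-CNN (ResNet-50-FPN backbone,
# RPN, RoIAlign, box head, mask head). Pass weights="DEFAULT" instead of None
# to load COCO-pretrained weights.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None)
model.eval()

image = torch.rand(3, 480, 640)          # dummy RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])             # list with one dict per input image

print(sorted(outputs[0].keys()))         # ['boxes', 'labels', 'masks', 'scores']
print(outputs[0]["masks"].shape)         # (num_detections, 1, 480, 640)
```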
  • 13. 4. Contribution 2) Network architecture - Objective function. A. Fast R-CNN (review): the loss is defined over each RoI as class loss + localization loss. The class loss is a log loss; the localization loss is a smooth L1 loss on the box coordinates, which is less sensitive to outliers than an L2 loss whose gradients can grow very large; the background class (u = 0) is excluded from the localization term. B. Mask R-CNN: during training, we define a multi-task loss on each sampled RoI as L = Lcls + Lbox + Lmask. The classification loss Lcls and bounding-box loss Lbox are identical to those defined in [12]. The mask branch has a K·m² dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. To this we apply a per-pixel sigmoid, and define Lmask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, Lmask is only defined on the k-th mask (other mask outputs do not contribute to the loss). Mask branch features: fully convolutional; K·(m × m) sigmoid outputs, i.e. pixel-wise binary classification with one mask per class and no competition among classes; Lmask is the mean binary cross-entropy.
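A minimal sketch of this multi-task loss (the shapes, names, and the simplification that every RoI is a foreground example are assumptions, not the reference implementation):

```python
# Illustrative sketch of the Mask R-CNN multi-task loss L = Lcls + Lbox + Lmask.
import torch
import torch.nn.functional as F

def mask_rcnn_loss(cls_logits, box_deltas, mask_logits,
                   gt_classes, gt_box_targets, gt_masks):
    """
    cls_logits:     (N, K+1)      class scores per RoI (K classes + background)
    box_deltas:     (N, 4)        predicted box deltas for the gt class (simplified)
    mask_logits:    (N, K, m, m)  one m x m mask per class, per RoI
    gt_classes:     (N,)          ground-truth class index per RoI
    gt_box_targets: (N, 4)        regression targets
    gt_masks:       (N, m, m)     binary ground-truth masks (foreground RoIs only)
    """
    l_cls = F.cross_entropy(cls_logits, gt_classes)            # log loss
    l_box = F.smooth_l1_loss(box_deltas, gt_box_targets)       # smooth L1
    # Lmask: per-pixel sigmoid + binary cross-entropy, computed only on the
    # mask that belongs to each RoI's ground-truth class.
    idx = torch.arange(mask_logits.shape[0])
    selected = mask_logits[idx, gt_classes]                    # (N, m, m)
    l_mask = F.binary_cross_entropy_with_logits(selected, gt_masks.float())
    return l_cls + l_box + l_mask
```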
  • 14. 4. Contribution 2) Network architecture. Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
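A sketch of such a decoupled mask head in the FPN style described on the previous slide: a small FCN predicts K sigmoid masks per RoI, and the classification branch's predicted label selects which mask to output (layer sizes follow Figure 3 approximately; this is an illustration, not the released model):

```python
# Illustrative FPN-style mask head: small FCN -> K per-class masks, with the
# predicted class choosing one mask at inference.
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        convs = []
        for _ in range(4):                           # 4 x (3x3 conv, 256 channels)
            convs += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*convs)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14x14 -> 28x28
        self.predictor = nn.Conv2d(256, num_classes, 1)          # K masks per RoI

    def forward(self, roi_features):                 # (N, 256, 14, 14) from RoIAlign
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.predictor(x)                     # (N, K, 28, 28) mask logits

head = MaskHead()
rois = torch.randn(5, 256, 14, 14)
mask_logits = head(rois)
pred_classes = torch.tensor([3, 0, 7, 3, 42])        # from the classification branch
masks = torch.sigmoid(mask_logits[torch.arange(5), pred_classes])   # (5, 28, 28)
print(masks.shape)
```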
  • 15. 5. Experiments • Main dataset: MS COCO • 80 classes • 80k train images • 35k subset of val images • 5k images for ablation experiments • Metric: COCO-style AP (AP, AP50, AP75, APS, APM, APL), evaluated using mask IoU for instance segmentation
  • 16. 5. Experiments [Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1). Detections are overlaid with class labels and confidences.]
  • 17. 5. Experiments [Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.]
Table 2. Ablations for Mask R-CNN. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
(a) Backbone Architecture (AP / AP50 / AP75): ResNet-50-C4 30.3 / 51.2 / 31.5; ResNet-101-C4 32.7 / 54.2 / 34.3; ResNet-50-FPN 33.6 / 55.2 / 35.3; ResNet-101-FPN 35.4 / 57.3 / 37.5; ResNeXt-101-FPN 36.7 / 59.5 / 38.9. Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
(b) Multinomial vs. Independent Masks (ResNet-50-C4), AP / AP50 / AP75: softmax 24.8 / 44.1 / 25.1; sigmoid 30.3 / 51.2 / 31.5 (+5.5 / +7.1 / +6.4). Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
(c) RoIAlign (ResNet-50-C4), AP / AP50 / AP75: RoIPool [12] (max) 26.9 / 48.8 / 26.4; RoIWarp [10] (bilinear, max) 27.2 / 49.2 / 27.1; RoIWarp [10] (bilinear, ave) 27.1 / 48.9 / 27.1; RoIAlign (aligned, bilinear, max) 30.2 / 51.0 / 31.8; RoIAlign (aligned, bilinear, ave) 30.3 / 51.2 / 31.5. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.
(d) RoIAlign (ResNet-50-C5, stride 32), AP / AP50 / AP75 / APbb / APbb50 / APbb75: RoIPool 23.6 / 46.5 / 21.6 / 28.2 / 52.7 / 26.9; RoIAlign 30.9 / 51.8 / 32.1 / 34.0 / 55.3 / 36.4 (+7.3 / +5.3 / +10.5 / +5.8 / +2.6 / +9.5). Mask-level and box-level AP using large-stride features; misalignments are more severe than with stride-16 features (Table 2c), resulting in massive accuracy gaps.
(e) Mask Branch (ResNet-50-FPN), AP / AP50 / AP75: MLP, fc 1024→1024→80·28²: 31.5 / 53.7 / 32.8; MLP, fc 1024→1024→1024→80·28²: 31.5 / 54.0 / 32.6; FCN, conv 256→256→256→256→256→80: 33.6 / 55.2 / 35.3. Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully connected) for mask prediction; FCNs improve results as they take advantage of explicitly encoding spatial layout.
Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [29]). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP.
RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to max/average pool; we use average in the rest of the paper. Additionally, we compare with RoIWarp proposed in MNC [10] that also adopts bilinear sampling. As discussed in §3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign.
  • 18. 5. Experiments • Instance segmentation
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results. AP is evaluated using mask IoU.
backbone / AP / AP50 / AP75 / APS / APM / APL:
MNC [10], ResNet-101-C4: 24.6 / 44.3 / 24.8 / 4.7 / 25.9 / 43.6
FCIS [26] +OHEM, ResNet-101-C5-dilated: 29.2 / 49.5 / - / 7.1 / 31.3 / 50.0
FCIS+++ [26] +OHEM, ResNet-101-C5-dilated: 33.6 / 54.5 / - / - / - / -
Mask R-CNN, ResNet-101-C4: 33.1 / 54.9 / 34.8 / 12.1 / 35.6 / 51.1
Mask R-CNN, ResNet-101-FPN: 35.7 / 58.0 / 37.8 / 15.5 / 38.1 / 52.4
Mask R-CNN, ResNeXt-101-FPN: 37.1 / 60.0 / 39.4 / 16.9 / 39.9 / 53.5
[Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.]
  • 19. 5. Experiments • Object detection
Table 3. Object detection single-model results (bounding box AP), vs. state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
backbone / APbb / APbb50 / APbb75 / APbbS / APbbM / APbbL:
Faster R-CNN+++ [19], ResNet-101-C4: 34.9 / 55.7 / 37.4 / 15.6 / 38.7 / 50.9
Faster R-CNN w FPN [27], ResNet-101-FPN: 36.2 / 59.1 / 39.0 / 18.2 / 39.0 / 48.2
Faster R-CNN by G-RMI [21], Inception-ResNet-v2 [37]: 34.7 / 55.5 / 36.7 / 13.5 / 38.1 / 52.0
Faster R-CNN w TDM [36], Inception-ResNet-v2-TDM: 36.8 / 57.7 / 39.2 / 16.2 / 39.8 / 52.1
Faster R-CNN, RoIAlign, ResNet-101-FPN: 37.3 / 59.6 / 40.3 / 19.8 / 40.2 / 48.8
Mask R-CNN, ResNet-101-FPN: 38.2 / 60.3 / 41.7 / 20.1 / 41.1 / 50.2
Mask R-CNN, ResNeXt-101-FPN: 39.8 / 62.3 / 43.4 / 22.1 / 43.2 / 51.2
Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2e, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.
4.3. Bounding Box Detection Results: We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI [21], the winner of the COCO 2016 Detection Challenge. The C4 variant takes ~400 ms as it has a heavier box head (Figure 3), so we do not recommend using the C4 variant in practice. Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy trade-offs could be achieved [21], e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.
Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.
  • 20. Reference paper
• R. Girshick. Fast R-CNN. In ICCV, 2015.
• S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
• M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
• T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
• J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
• R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.