Ji Lin, Chuang Gan, Song Han; TSM: Temporal Shift Module for Efficient Video Understanding, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7083-7093
https://openaccess.thecvf.com/content_ICCV_2019/html/Lin_TSM_Temporal_Shift_Module_for_Efficient_Video_Understanding_ICCV_2019_paper.html
2. Overview
■ Trade-off (see the rough FLOPs comparison below)
• 2D CNN: high efficiency, but low performance
• 3D CNN: high performance, but low efficiency
■ Proposal: Temporal Shift Module (TSM)
• Takes temporal information into account on a 2D CNN backbone
■ Real-time, low-latency video recognition
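As a rough illustration of this trade-off, the following sketch compares the multiply-accumulate count of a 2D convolution applied to every frame with that of a 3D convolution over the same clip. The layer sizes are illustrative assumptions, not values from the paper.

# Rough per-layer FLOPs comparison behind the 2D vs. 3D CNN trade-off.
# All sizes are made-up example values (assumptions), not taken from the paper.
T, H, W = 8, 56, 56          # frames and spatial size of the feature map
C_in, C_out, k = 64, 64, 3   # input/output channels and kernel size

# 2D convolution applied to each of the T frames independently (TSN/TSM-style backbone)
flops_2d = T * H * W * C_in * C_out * k * k

# 3D convolution over the whole clip: the temporal kernel dimension multiplies the cost by k
flops_3d = T * H * W * C_in * C_out * k * k * k

print(f"2D: {flops_2d / 1e9:.2f} GFLOPs, 3D: {flops_3d / 1e9:.2f} GFLOPs, ratio: {flops_3d / flops_2d:.0f}x")

The 3D kernel's extra temporal dimension multiplies the per-layer cost by k, which is the efficiency gap TSM tries to close without paying for 3D convolutions.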
[Table 4 and Figure 5 from the paper, reproduced in full in Section 9: TSM outperforms TSN and TRN on Something-Something-V2, and offers a better accuracy-cost trade-off than the I3D and ECO families on Something-Something-V1.]
[Diagram: efficiency (high/low) vs. performance (high/low) trade-off.]
3. Reducing the computation of 3D CNNs
■ Reduce the computation of 3D CNNs
• Non-local module (NL) (Wang+, CVPR2018)
• ECO (Zolfaghari+, ECCV2018): 2D CNN for the first half of the network, 3D CNN for the second half
[Excerpt from Wang+, CVPR2018 (Non-local Neural Networks): unlike a fully-connected layer, the non-local operation computes responses from relationships between positions rather than from learned weights, supports variable-sized inputs while preserving positional correspondence, and can be inserted into earlier layers of a network alongside convolutional/recurrent layers.]
[Figure 2 (Wang+, CVPR2018). A spacetime non-local block, embedded Gaussian version with a 512-channel bottleneck. Feature maps are shown as tensor shapes, e.g., T×H×W×1024; ⊗ denotes matrix multiplication and ⊕ element-wise sum; softmax is applied to each row; blue boxes are 1×1×1 convolutions. The block computes y = softmax(xᵀ Wθᵀ Wφ x) g(x), the self-attention form of [49].]
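For concreteness, here is a minimal PyTorch sketch of the embedded-Gaussian non-local block shown above. The 1024-to-512 bottleneck follows the figure; the channels-first [N, C, T, H, W] layout and the residual output connection are assumptions of this sketch, not necessarily the authors' exact implementation.

import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian spacetime non-local block (sketch after Wang+, CVPR2018, Fig. 2)."""
    def __init__(self, channels=1024, bottleneck=512):
        super().__init__()
        self.theta = nn.Conv3d(channels, bottleneck, kernel_size=1)  # blue 1x1x1 convs in the figure
        self.phi = nn.Conv3d(channels, bottleneck, kernel_size=1)
        self.g = nn.Conv3d(channels, bottleneck, kernel_size=1)
        self.out = nn.Conv3d(bottleneck, channels, kernel_size=1)

    def forward(self, x):                                      # x: [N, C, T, H, W]
        n, c, t, h, w = x.shape
        th = self.theta(x).flatten(2)                          # [N, C', THW]
        ph = self.phi(x).flatten(2)                            # [N, C', THW]
        gx = self.g(x).flatten(2)                              # [N, C', THW]
        attn = torch.softmax(th.transpose(1, 2) @ ph, dim=-1)  # [N, THW, THW], softmax per row
        y = (attn @ gx.transpose(1, 2)).transpose(1, 2)        # [N, C', THW]
        y = y.reshape(n, -1, t, h, w)
        return x + self.out(y)                                 # add back to the input (residual form)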
[Figure from Zolfaghari+, ECCV2018 (ECO, p. 5): each sampled frame s1 ... sN is processed by a weight-shared 2D Net (H2d) into K feature maps of size 28×28; the per-frame maps are temporally stacked into a K×[N×28×28] tensor and fed to a 3D Net (H3d) that predicts the action. Slide annotation: the first half of the network is 2D, the second half 3D.]
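The pipeline in the figure can be sketched roughly as below; the layer configuration and class count are placeholders chosen for illustration, not ECO's actual architecture.

import torch
import torch.nn as nn

class ECOSketch(nn.Module):
    """Rough sketch of the ECO idea: shared 2D net per frame, temporal stacking, then a 3D net."""
    def __init__(self, num_classes=174):
        super().__init__()
        self.h2d = nn.Sequential(                        # "2D Net (H2d)", weights shared across frames
            nn.Conv2d(3, 96, kernel_size=7, stride=4), nn.ReLU(),
            nn.Conv2d(96, 96, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.h3d = nn.Sequential(                        # "3D Net (H3d)" over the stacked feature maps
            nn.Conv3d(96, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)            # action prediction

    def forward(self, frames):                           # frames: [N, T, 3, H, W]
        n, t = frames.shape[:2]
        maps = self.h2d(frames.flatten(0, 1))            # run the shared 2D net on every frame
        maps = maps.reshape(n, t, *maps.shape[1:])       # [N, T, C, h, w]
        stack = maps.permute(0, 2, 1, 3, 4)              # temporal stacking -> [N, C, T, h, w]
        return self.fc(self.h3d(stack).flatten(1))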
4. Shifting 2D CNN features for video recognition
■ How to shift (see the sketch after this list)
• Offline: exchange part of the feature channels with the previous and next frames
• Online: hand part of the feature channels over to the next frame
■ Where to insert
• In-place: the shifted features are carried by the skip connection
• Residual: the pre-shift features are carried by the skip connection
■ Residual performs better
• The pre-shift and post-shift features are added together
• Temporal information is fused without destroying spatial information
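A minimal sketch of the offline (bi-directional) shift and of the residual-style insertion, assuming features shaped [N, T, C, H, W]. The fraction of shifted channels (1/8 per direction) and the single 3x3 convolution in the branch are assumptions for illustration, not the authors' exact implementation.

import torch
import torch.nn as nn

def temporal_shift(x, fold_div=8):
    """Offline bi-directional shift: swap a fraction of channels with neighbouring frames."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)                              # zero padding at the clip boundaries
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # frame t receives these channels from frame t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # frame t receives these channels from frame t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # the remaining channels stay in place
    return out

class ResidualTSMBlock(nn.Module):
    """Residual insertion: the shift happens only inside the branch; the identity path keeps the original features."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                                  # x: [N, T, C, H, W]
        n, t, c, h, w = x.shape
        branch = temporal_shift(x).reshape(n * t, c, h, w)
        return x + self.conv(branch).reshape(n, t, c, h, w)

Because the identity path carries the unshifted tensor, spatial information is preserved while the branch mixes in information from neighbouring frames.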
[Figure 3. Residual shift is better than in-place shift. In-place shift happens before a convolution layer (or a residual block); residual shift fuses temporal information inside a residual branch. (a) In-place TSM: X → shift → conv → Y. (b) Residual TSM: X plus a (shift → conv) branch → Y.]
[Figure 1. Temporal Shift Module (TSM) performs efficient temporal modeling by moving the feature map along the temporal dimension; it is computationally free on top of a 2D convolution. (a) The original tensor (temporal T × channel C × H,W) without shift. (b) Offline temporal shift (bi-direction): pad zero, shift along T, truncate. (c) Online temporal shift (uni-direction), t = 0, 1, 2, 3, ...]
5. Spatial shift in image recognition and its problems
■ Shift all of the features instead of convolving them (see the sketch after this list)
■ We would like to apply this to video recognition as well
■ Problems
• Increased latency
• Spatial features can no longer be captured properly
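A minimal NumPy sketch of the per-channel spatial shift described above: each channel is displaced by one pixel in its own direction instead of being convolved. The assignment of directions to channels and the zero padding at the border are illustrative assumptions.

import numpy as np

def spatial_shift(x):
    """Shift every channel of x ([C, H, W]) by one pixel in a per-channel direction, padding with zeros."""
    dirs = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (1, -1), (-1, 1), (-1, -1)]
    out = np.zeros_like(x)
    for c in range(x.shape[0]):
        dy, dx = dirs[c % len(dirs)]                      # pick a direction for this channel (assumed cycle)
        shifted = np.roll(x[c], shift=(dy, dx), axis=(0, 1))
        # zero out the wrapped-around rows/columns so the shift pads with zeros instead of wrapping
        if dy > 0: shifted[:dy, :] = 0
        if dy < 0: shifted[dy:, :] = 0
        if dx > 0: shifted[:, :dx] = 0
        if dx < 0: shifted[:, dx:] = 0
        out[c] = shifted
    return out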
Excerpt from the paper (Sec. 2.3, Efficient Neural Networks): "The efficiency of 2D CNN has been extensively studied. Some works focused on designing an efficient model [21, 20, 6, 56]. Recently neural architecture search [62, 63, 31] has been introduced to find an efficient architecture automatically [44, 3]. Another way is to prune, quantize and compress an existing model for efficient deployment [16, 15, 29, 59, 18, 47]. Address shift, which is a hardware-friendly primitive, has also been exploited for compact 2D CNN design on image recognition tasks [51, 57]."
[Figure: (a) Overhead vs. proportion: latency overhead (0-15%) on P100, TX2, and CPU as a function of the shift proportion (0, 1/8, 1/4, 1/2, 1); naive shift (shifting all channels) incurs a large overhead; the paper's chosen proportion is marked. (b) Residual vs. in-place: accuracy (67-75%) vs. shift proportion for in-place and residual TSM against the 2D baseline; naive shift gives low accuracy; the paper's chosen proportion is marked.]
[Figure 2 (Wu+, CVPR2018). Illustration of (a) spatial convolutions (M input channels, N output channels, D_K × D_K kernels applied over a D_F × D_F feature map), (b) depth-wise convolutions, and (c) shift. In (c), the 3×3 grids denote a shift matrix with a kernel size of 3; the lighted cell denotes a 1 at that position and white cells denote 0s.]
From the same paper (Wu+, CVPR2018): "In this paper, we present the shift operation (Figure 1) as an alternative to spatial convolutions. The shift operation moves each channel of its input tensor in a different spatial direction. A shift-based module interleaves shift operations with point-wise convolutions, which further mixes spatial information across channels. Unlike spatial convolutions, the shift operation itself requires zero FLOPs and zero parameters. As opposed to depth-wise convolutions, shift operations can be easily and efficiently implemented. Our approach is orthogonal to model compression [4], tensor factorization [27] and low-bit networks [16]. As a result, any of these techniques could be composed with our proposed method to further reduce model size. We introduce a new hyperparameter for shift-based modules ..."
"... where î = i − ⌊D_K/2⌋ and ĵ = j − ⌊D_K/2⌋ are the re-centered spatial indices; k, l and i, j index along spatial dimensions and n, m index into channels. The number of parameters required by a spatial convolution is M × N × D_K², and the computational cost is M × N × D_K² × D_F². As the kernel size D_K increases, the number of parameters and the computational cost grow quadratically. A popular variant of the spatial convolution is the depth-wise convolution [7, 1], which is usually followed by a point-wise convolution (1×1 convolution); altogether, the module is called the depth-wise separable convolution. A depth-wise convolution, as shown in Figure 2(b), aggregates spatial information from a D_K × D_K patch within each channel ..."
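Plugging illustrative sizes into the formulas quoted above makes the gap concrete; the numbers below are example values, not figures from either paper.

# Parameter counts for the three operations in the figure above (example sizes, an assumption).
M, N, D_K = 64, 64, 3              # input channels, output channels, kernel size

params_spatial = M * N * D_K ** 2  # standard spatial convolution: M x N x D_K^2
params_depthwise = M * D_K ** 2    # depth-wise convolution: one D_K x D_K filter per channel
params_shift = 0                   # shift: a fixed displacement per channel, no learned weights

print(params_spatial, params_depthwise, params_shift)   # 36864 576 0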
9. Trade-off
■ Comparison on Something-Something V1
• 3× fewer FLOPs than ECO
• Several times fewer FLOPs than NL I3D (Wang+, CVPR2018)
Table 4. Results on Something-Something-V2. Our TSM achieves state-of-the-art performance.

Method              | Val Top-1 | Val Top-5 | Test Top-1 | Test Top-5
TSN (our impl.)     | 30.0      | 60.5      | -          | -
MultiScale TRN [58] | 48.8      | 77.6      | 50.9       | 79.3
2-Stream TRN [58]   | 55.5      | 83.1      | 56.2       | 83.2
TSM8F               | 59.1      | 85.6      | -          | -
TSM16F              | 63.4      | 88.5      | 64.3       | 89.6
TSM RGB+Flow        | 66.0      | 90.5      | 66.6       | 91.3
[Figure 5. TSM enjoys a better accuracy-cost trade-off than the I3D family and the ECO family on the Something-Something-V1 [14] dataset. Accuracy (%) vs. FLOPs/Video (G), with circle area indicating the number of parameters (30M / 100M / 150M); models shown: ECOEnLite, TSMEn, NL I3D+GCN, NL I3D, I3D, ECO16F, ECO8F, TSM16F, TSM8F. (GCN includes the cost of the ResNet-50 RPN used to generate region proposals.)]
Table 5. TSM enjoys low GPU inference latency and high throughput. V/s means videos per second, the higher the better (measured on an NVIDIA Tesla P100 GPU).

Model         | FLOPs | Param. | Latency | Thrput.  | Sth.  | Kinetics
I3D from [50] | 306G  | 35.3M  | 165.3ms | 6.1 V/s  | 41.6% | -
ECO16F [61]   | 64G   | 47.5M  | 30.6ms  | 45.6 V/s | 41.4% | -
I3D from [49] | 33G   | 29.3M  | 25.8ms  | 42.4 V/s | -     | 73.3%
I3Dreplace    | 48G   | 33.0M  | 28.0ms  | 37.9 V/s | 44.9% | -
TSM8F         | 33G   | 24.3M  | 17.4ms  | 77.4 V/s | 45.6% | 74.1%
TSM16F        | 65G   | 24.3M  | 29.0ms  | 39.5 V/s | 47.2% | 74.7%
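A quick sanity check on the FLOPs column, assuming a ResNet-50 backbone at 224x224 input (about 4.1 GFLOPs per frame, a commonly cited figure that is an assumption here rather than a number from the table): because the shift itself costs nothing, an 8-frame TSM should cost roughly the same as eight independent 2D forward passes.

# Shift adds no FLOPs, so TSM's cost is just the 2D backbone cost times the number of frames.
per_frame_gflops = 4.1   # assumed ResNet-50 @ 224x224
frames = 8
print(frames * per_frame_gflops)   # ~32.8 GFLOPs, in line with the 33G reported for TSM8F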
"... work [34] to extract bounding boxes, whose cost is also considered in the chart. Note that the computation cost of optical flow extraction is usually larger than the video recognition model itself. Therefore, we do not report the FLOPs of two-stream based methods.

We show the accuracy, FLOPs, and number of parameters trade-off in Figure 5. The accuracy is tested on the validation set of the Something-Something-V1 dataset, and the number of parameters is indicated by the area of the circles. We can see that our TSM based methods have a better Pareto curve than both previous state-of-the-art efficient models (ECO based models) and high-performance models (non-local I3D based models). TSM models are both efficient and accurate."
10. Advantages of fast recognition
■ Recognition accuracy on UCF101 as a function of the observed portion of the video
■ After observing only the first 10% of the video
• TSM is markedly (several points) more accurate than ECO
■ High accuracy from the very start of observation
■ In the online setting, the previous frame's features can be taken into account from the second frame onward (see the sketch after this list)
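A minimal sketch of the online (uni-directional) shift used for streaming inference, assuming per-frame features shaped [C, H, W]; the cache layout and the 1/8 fraction are illustrative assumptions rather than the authors' exact implementation.

import torch

class OnlineShift:
    """Replace a fraction of the current frame's channels with channels cached from the previous frame."""
    def __init__(self, fold_div=8):
        self.fold_div = fold_div
        self.cache = None                     # channels handed over from the previous frame

    def __call__(self, feat):                 # feat: [C, H, W]
        fold = feat.shape[0] // self.fold_div
        out = feat.clone()
        if self.cache is None:
            out[:fold] = 0.0                  # first frame: no history yet, pad with zeros
        else:
            out[:fold] = self.cache           # from the second frame on, reuse the previous frame's channels
        self.cache = feat[:fold].clone()      # hand the current channels over to the next frame
        return out

shift = OnlineShift()
out0 = shift(torch.randn(64, 56, 56))         # frame 1: shifted slots are zeros
out1 = shift(torch.randn(64, 56, 56))         # frame 2: shifted slots hold frame 1's channels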
Table 6. Comparing the accuracy of offline TSM and online TSM on different datasets. Online TSM brings negligible latency overhead.

Model    | Latency | Kinetics | UCF101 | HMDB51 | Something
TSN      | 4.7ms   | 70.6%    | 91.7%  | 64.7%  | 20.5%
+Offline | -       | 74.1%    | 95.9%  | 73.5%  | 47.3%
+Online  | 4.8ms   | 74.3%    | 95.5%  | 73.6%  | 46.3%
[Figure 6. Early recognition on UCF101. TSM gives high prediction accuracy after observing only a small portion of the video. Accuracy (80-96%) vs. video observation (10-100%); curves: ECO (s=8), ECO (s=12), ECO (s=20), TSM.]
"... of backbone design, we replace every TSM primitive with a 3 × 1 × 1 convolution and denote this model as I3Dreplace ..."