!"#$%!&'()*+,%"-./0%#)12,&%
/)*%3//.4.&50%6.1&)%
751&*80+51.59
!"#$"%&#'()*%+#,*%&#-.%+#/*%&#0''12345
橋口凌大(名工大玉木研)
英語論文紹介2324643647
Overview
■ Trade-off in video recognition
  • 2D CNN: high efficiency, low accuracy
  • 3D CNN: low efficiency, high accuracy
■ Proposes the Temporal Shift Module (TSM)
  • Builds on a 2D CNN while still modeling temporal information
■ Enables real-time, low-latency video recognition
[Slide figure: Figure 5 of the paper — "TSM enjoys better accuracy-cost trade-off than I3D family and ECO family on Something-Something-V1 [14] dataset. (GCN includes the cost of ResNet-50 RPN to generate region proposals.)" Axes: FLOPs/Video (G) vs. Accuracy (%); circle area indicates the number of parameters.]
[Slide diagram: efficiency (high–low) vs. accuracy (low–high) trade-off axes.]
!"#$$の計算量削減
n:8'99の計算量を減らす
• 9.%#@.E*@ =.D)@<#F9$G
HI*%+J&#'1KL234MN
• O'P#HQ.@A*+(*?"J&#O''1234MN
• 前半28'99
• 後半:8'99
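For reference, the operation computed by the non-local module can be written compactly. The generic form below is Eq. (1) of Wang et al., and the second expression is the embedded-Gaussian (self-attention) form quoted in the figure text pasted below, re-set here in clean LaTeX:

    % Generic non-local operation (Eq. (1) of Wang et al.) and the
    % embedded-Gaussian / self-attention instantiation shown in the figure.
    y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),
    \qquad
    y = \mathrm{softmax}\!\left(x^{\top} W_\theta^{\top} W_\phi\, x\right) g(x)

Every position attends to every other space-time position, which is what makes the block heavy compared with a purely local 2D convolution.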
[Slide figures (pasted from the cited papers): the spacetime non-local block of Wang et al. (Figure 2: embedded-Gaussian version with a 512-channel bottleneck; θ, φ, g are 1×1×1 convolutions, softmax applied to each row, matrix multiplication and an element-wise sum back onto the T×H×W×1024 input), annotated "3D"; and the ECO architecture of Zolfaghari et al. (a weight-shared 2D network H2d produces per-frame feature maps that are stacked temporally and fed to a 3D network H3d for the action prediction), annotated "2D".]
Shifting 2D CNN features for video recognition
■ Shift directions (see the sketch after this list)
  • Offline: exchange part of the feature channels with the previous and next frames
  • Online: pass part of the feature channels on to the next frame
■ Insertion styles
  • In-place: the skip connection carries the shifted features
  • Residual: the skip connection carries the features from before the shift
■ Residual TSM performs better
  • Adds together the features from before and after the shift
  • Fuses temporal information without corrupting spatial information
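To make the offline (bi-directional) shift and the residual insertion concrete, here is a minimal PyTorch sketch. It assumes features laid out as [N*T, C, H, W] with T frames per video and shifts 1/8 of the channels in each temporal direction (the proportion marked as "Our Choice" in the paper's ablation); temporal_shift and conv_branch are illustrative names, not the authors' code.

    import torch

    def temporal_shift(x: torch.Tensor, n_segment: int, fold_div: int = 8) -> torch.Tensor:
        # x: [N*T, C, H, W] feature maps, with T = n_segment frames per video.
        nt, c, h, w = x.size()
        n = nt // n_segment
        x = x.view(n, n_segment, c, h, w)

        fold = c // fold_div
        out = torch.zeros_like(x)
        # Channels [0, fold): shifted forward in time (frame t receives frame t-1).
        out[:, 1:, :fold] = x[:, :-1, :fold]
        # Channels [fold, 2*fold): shifted backward in time (frame t receives frame t+1).
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]
        # Remaining channels stay in place; the shift itself adds zero FLOPs
        # and zero parameters, with zero-padding at the temporal boundaries.
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
        return out.view(nt, c, h, w)

    # Residual-style insertion (sketch): the identity path carries the un-shifted
    # features, so spatial information is preserved while the conv branch fuses
    # temporal information:
    #   y = x + conv_branch(temporal_shift(x, n_segment))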
[Slide figures (pasted from the TSM paper): Figure 3 — "Residual shift is better than in-place shift. In-place shift happens before a convolution layer (or a residual block). Residual shift fuses temporal information inside a residual branch." — and Figure 1 — "Temporal Shift Module (TSM) performs efficient temporal modeling by moving the feature map along the temporal dimension. It is computationally free on top of a 2D convolution." — showing (a) the original tensor without shift, (b) offline temporal shift (bi-direction, pad zero / truncate), and (c) online temporal shift (uni-direction).]
Spatial shift in image recognition and its problems
■ Shifts all feature channels instead of applying a spatial convolution (cost comparison below)
■ We would like to apply the same idea to video recognition
■ Problems with shifting naively
  • Increased latency
  • Spatial features can no longer be captured
!"#$%&'()*+,-./
[Slide figures (pasted from the cited papers): a related-work excerpt of the TSM paper on efficient neural networks; an ablation figure from the TSM paper — (a) latency overhead vs. shift proportion (naive shift: large overhead; measured on P100, TX2, and CPU; "Our Choice" marked) and (b) accuracy of residual vs. in-place TSM vs. shift proportion (naive shift: low accuracy; 2D baseline shown); and Figure 2 of Wu et al. — "Illustration of (a) spatial convolutions, (b) depth-wise convolutions and (c) shift. In (c), the 3x3 grids denote a shift matrix with a kernel size of 3. The lighted cell denotes a 1 at that position and white cells denote 0s." — together with the surrounding text giving the spatial-convolution parameter count M × N × D_K^2 and computational cost M × N × D_K^2 × D_F^2.]
Datasets and baseline
■ Temporal information is important
  • Something-Something V1/V2 [Goyal+, ICCV2017]
  • Charades [Gunnar+, 2016]
  • Jester [Materzynska+]
■ Temporal information is less important
  • UCF101 [Soomro+, 2012]
  • Kinetics [Kay+, 2017]
  • HMDB51 [Kuehne+, 2011]
■ Baseline
  • Temporal Segment Network (TSN) [Wang+, ECCV2016]
[Slide label: Something-Something]
Effect of TSM on each dataset
■ Large accuracy gains on the "More Temporal" datasets
■ Accuracy gains on the "Less Temporal" datasets as well
Table 1. Our method consistently outperforms 2D counterparts on multiple datasets at zero extra computation (protocol: ResNet-50 8f input, 10 clips for Kinetics, 2 for others, full-resolution).
  Dataset (Less Temporal)  Model  Acc1   Acc5   ΔAcc1
  Kinetics                 TSN    70.6   89.2
                           Ours   74.1   91.2   +3.5
  UCF101                   TSN    91.7   99.2
                           Ours   95.9   99.7   +4.2
  HMDB51                   TSN    64.7   89.9
                           Ours   73.5   94.3   +8.8
  Dataset (More Temporal)
  Something V1             TSN    20.5   47.5
                           Ours   47.3   76.2   +28.0
  Something V2             TSN    30.4   61.0
                           Ours   61.7   87.4   +31.3
  Jester                   TSN    83.9   99.6
                           Ours   97.0   99.9   +11.7

Table 3. TSM can consistently improve the performance over different backbones on Kinetics dataset.
           Mb-V2   R-50   RX-101   NL R-50
  TSN      66.5    70.7   72.4     74.6
  TSM      69.5    74.1   76.3     75.7
  ΔAcc.    +3.0    +3.4   +3.9     +1.1
■ Improvement over all baselines and backbones
■ Improvement even over a model that already performs temporal modeling
■ TSM has strong temporal-modeling ability
Gains even for models that already account for temporal information
')&*との比較
n28ベースラインを大幅に改善
n:8の手法より性能が良い
Table 2. Comparing TSM against other methods on Something-Something dataset (center crop, 1 clip/video unless otherwise specified).
  Model                          Backbone           #Frame     FLOPs/Video  #Param.  Val Top-1  Val Top-5  Test Top-1
  [2D-based]
  TSN [58]                       BNInception        8          16G          10.7M    19.5       -          -
  TSN (our impl.)                ResNet-50          8          33G          24.3M    19.7       46.6       -
  TRN-Multiscale [58]            BNInception        8          16G          18.3M    34.4       -          33.6
  TRN-Multiscale (our impl.)     ResNet-50          8          33G          31.8M    38.9       68.1       -
  Two-stream TRN RGB+Flow [58]   BNInception        8+8        -            36.6M    42.0       -          40.7
  [3D-based]
  ECO [61]                       BNIncep+3D Res18   8          32G          47.5M    39.6       -          -
  ECO [61]                       BNIncep+3D Res18   16         64G          47.5M    41.4       -          -
  ECOEnLite [61]                 BNIncep+3D Res18   92         267G         150M     46.4       -          42.3
  ECOEnLite RGB+Flow [61]        BNIncep+3D Res18   92+92      -            300M     49.5       -          43.9
  I3D from [50]                  3D ResNet-50       32×2clip   153G×2       28.0M    41.6       72.2       -
  Non-local I3D from [50]        3D ResNet-50       32×2clip   168G×2       35.3M    44.4       76.0       -
  Non-local I3D + GCN [50]       3D ResNet-50+GCN   32×2clip   303G×2       62.2M    46.1       76.8       45.0
  [Proposed]
  TSM                            ResNet-50          8          33G          24.3M    45.6       74.2       -
  TSM                            ResNet-50          16         65G          24.3M    47.2       77.1       46.0
  TSMEn                          ResNet-50          24         98G          48.6M    49.7       78.5       -
  TSM RGB+Flow                   ResNet-50          16+16      -            48.6M    52.6       81.9       50.7
Trade-off
■ Comparison on Something-Something V1 (see the check after this list)
  • 3× less computation than ECO
  • 6× less computation than NL-I3D [Wang+, CVPR2018]
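As a quick sanity check of these factors against the FLOPs listed in Table 2 above, assuming the ensembles are the models being compared (TSMEn at 98G vs. ECOEnLite at 267G and Non-local I3D + GCN at 303G × 2 clips); this pairing is my reading of the slide, not something it states explicitly:

    % Approximate FLOP ratios taken from Table 2.
    \frac{267\,\text{G}}{98\,\text{G}} \approx 2.7 \approx 3\times,
    \qquad
    \frac{303\,\text{G} \times 2}{98\,\text{G}} \approx 6.2 \approx 6\times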
Table 4. Results on Something-Something-V2. Our TSM achieves state-of-the-art performance.
  Method               Val Top-1  Val Top-5  Test Top-1  Test Top-5
  TSN (our impl.)      30.0       60.5       -           -
  MultiScale TRN [58]  48.8       77.6       50.9        79.3
  2-Stream TRN [58]    55.5       83.1       56.2        83.2
  TSM8F                59.1       85.6       -           -
  TSM16F               63.4       88.5       64.3        89.6
  TSM RGB+Flow         66.0       90.5       66.6        91.3

[Slide figure: Figure 5 of the paper — "TSM enjoys better accuracy-cost trade-off than I3D family and ECO family on Something-Something-V1 [14] dataset. (GCN includes the cost of ResNet-50 RPN to generate region proposals.)" Axes: FLOPs/Video (G) vs. Accuracy (%); circle area indicates the number of parameters.]

Table 5. TSM enjoys low GPU inference latency and high throughput. V/s means videos per second, higher the better (measured on an NVIDIA Tesla P100 GPU).
  Model           FLOPs  Param.  Latency   Thrput.  Sth.   Kinetics
  I3D from [50]   306G   35.3M   165.3ms   6.1V/s   41.6%  -
  ECO16F [61]     64G    47.5M   30.6ms    45.6V/s  41.4%  -
  I3D from [49]   33G    29.3M   25.8ms    42.4V/s  -      73.3%
  I3Dreplace      48G    33.0M   28.0ms    37.9V/s  44.9%  -
  TSM8F           33G    24.3M   17.4ms    77.4V/s  45.6%  74.1%
  TSM16F          65G    24.3M   29.0ms    39.5V/s  47.2%  74.7%

[Accompanying paper text: the cost of the region-proposal network used to extract bounding boxes is included in the chart; FLOPs of two-stream methods are not reported because optical-flow extraction usually costs more than the recognition model itself. Figure 5 shows that TSM-based models have a better Pareto curve than both the efficient ECO family and the high-performance non-local I3D family: TSM models are both efficient and accurate.]
Advantage of fast recognition
■ Accuracy on UCF101 as a function of the observed fraction of the video
■ After observing only the first 10% of the video
  • TSM is about 7% higher than ECO
■ High accuracy from the very beginning of the observation
■ In the online setting, features from the previous frame can be used from the second frame onward (see the sketch below)
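Below is a minimal sketch of the uni-directional (online) shift this bullet relies on, assuming per-frame features of shape [C, H, W] and a cache holding the first C/8 channels of the previous frame; online_shift, backbone_stage, and video_stream are illustrative names, not part of the authors' code.

    import torch

    def online_shift(x_t: torch.Tensor, cache, fold_div: int = 8):
        # x_t:   feature map of the current frame, shape [C, H, W].
        # cache: channels [0, C/fold_div) saved from the previous frame,
        #        or None for the very first frame.
        c = x_t.size(0)
        fold = c // fold_div
        new_cache = x_t[:fold].clone()   # handed on to the next frame
        out = x_t.clone()
        if cache is None:
            out[:fold] = 0               # first frame: nothing to shift in yet
        else:
            out[:fold] = cache           # from frame 2 on, previous-frame features
        return out, new_cache

    # Streaming usage (sketch):
    #   cache = None
    #   for frame in video_stream:
    #       feat = backbone_stage(frame)           # hypothetical per-frame 2D features
    #       shifted, cache = online_shift(feat, cache)

Because only a small fraction of the channels is cached and replaced, the extra latency over plain per-frame 2D inference is negligible, which matches the +Online row in Table 6 below.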
Table 6. Comparing the accuracy of offline TSM and online TSM on different datasets. Online TSM brings negligible latency overhead.
  Model     Latency  Kinetics  UCF101  HMDB51  Something
  TSN       4.7ms    70.6%     91.7%   64.7%   20.5%
  +Offline  -        74.1%     95.9%   73.5%   47.3%
  +Online   4.8ms    74.3%     95.5%   73.6%   46.3%

[Slide figure: Figure 6 of the paper — "Early recognition on UCF101. TSM gives high prediction accuracy after only observing a small portion of the video." Accuracy (%) vs. fraction of the video observed (%), comparing TSM with ECO (s=8, 12, 20).]
[Paper note: I3Dreplace in Table 5 replaces every TSM primitive with a 3 × 1 × 1 convolution.]
Summary
■ Proposed the Temporal Shift Module (TSM)
  • Inserted into a 2D CNN, it enables temporal modeling
  • No additional cost
    • Zero extra FLOPs
    • Zero extra parameters
■ Low-latency video recognition
  • Efficient and accurate
  • Enables low-latency video recognition on edge devices