Paper review: Elaborative Rehearsal for Zero-Shot Action Recognition

Elaborative Rehearsal for
Zero-shot Action Recognition
ICCV2021
Shizhe Chen, Dong Huang
細谷優 (Tamaki Lab, Nagoya Institute of Technology)
2022/10/14

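Zero-shot Action Recognition (ZSAR)
◼Zero-shot Learning (ZSL)
• Recognizes classes for which no training data is available
• Learns from side information instead
• ZSAR is ZSL applied to the action recognition task
◼Training
• Embed the videos and the class labels (or other class information)
• Learn to associate these features with each other
◼Inference
• Search for the class label that matches the video feature
• Classes that were not used during training can be recognized
• because the information about the classes used at inference is embedded as well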
[Figure 1: Schematic representation of a ZSL human action recognition framework. Step 1: Visual Embedding (visual feature extractor); Step 2: Semantic Label Embedding (semantic feature extractor over side information such as Google News, Wikipedia, ImageNet, WordNet, WikiHow); Step 3: Training (a mapping function f(·) that associates visual features with class prototypes).]
[Estevam+, Neurocomputing](arXiv)

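Overview
◼Challenges in video understanding
• The number of classes keeps growing
• The training cost of supervised learning increases
• Actions are complex and diverse
• How should the description of an action class be represented?
◼Zero-shot learning
• Learning that aims to recognize actions never seen during training
• A countermeasure against the growing number of classes
◼Proposes Elaborative Description (ED)
• Uses text from the web to express each class as a sentence
• GitHub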

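Related work
◼Early methods (a)
• Define actions by enumerating their attributes
[Liu+, CVPR2011]
◼Detect objects and embed them (b)
• Zero-shot learning [Xu+, IJCV2017]
• Adopts a state-of-the-art action recognition (AR) network
[Brattoli+, CVPR2020]
• Suffers from overfitting
◼Embedding action classes (c)
• This method
• Elaborative Rehearsal (ER), a human memory technique
[Benjamin&Bjork, Journal of Experimental Psychology: Learning, Memory, and Cognition, 2000]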
[Excerpt from the first page of the paper]
Inria, France (shizhe.chen@inria.fr); Carnegie Mellon University, USA (donghuang@cmu.edu)
Abstract: The growing number of action classes has posed a new challenge for video understanding, making Zero-Shot Action Recognition (ZSAR) a thriving direction. The ZSAR task aims to recognize target (unseen) actions without training examples by leveraging semantic representations to bridge seen and unseen actions. However, due to the complexity and diversity of actions, it remains challenging to semantically represent action classes and transfer knowledge from seen data. In this work, we propose an ER-enhanced ZSAR model inspired by an effective human memory technique Elaborative Rehearsal (ER), which involves elaborating a new concept and relating it to known concepts. Specifically, we expand each action class as an Elaborative Description (ED) sentence, which is more discriminative than a class name and less costly than manual-defined attributes. Besides directly aligning class semantics with videos, we incorporate objects from the video as Elaborative Concepts (EC) to improve video semantics and ...
Figure 1: Attributes and word embeddings are insufficient to semantically represent action classes. Our Elaborative Rehearsal approach defines actions by Elaborative Descriptions (EDs) and associates videos with Elaborative Concepts (ECs, known concepts detected from the video), which improve video semantics and generalization video-action association for ZSAR. (▶ for videos, △ for seen actions, ◦ for unseen actions, and ∗ for ECs)

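Elaborative Description (ED)
◼Define each class name with a sentence
◼How EDs are built
• Collect text from the web, dictionary definitions, and similar sources
• Split the text into candidate sentences and remove unneeded characters
• A human annotator picks the appropriate candidate sentence
• If no good candidate exists, write one
[Figure labels: final ED; candidate sentences after removing unneeded characters]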
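The automatic part of the ED pipeline (splitting collected text into candidate sentences and stripping unneeded characters) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `candidate_sentences` is hypothetical, and treating bracketed markers and extra whitespace as the "unneeded characters" is an assumption. Selecting or writing the final ED remains a manual step.

```python
import re

def candidate_sentences(raw_text):
    """Split raw collected text into cleaned candidate sentences."""
    # Assumption: bracketed markers such as [1] are noise to be removed.
    text = re.sub(r"\[[^\]]*\]", "", raw_text)
    # Collapse whitespace runs, including line breaks from scraped pages.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence-ending punctuation that is followed by a space.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

# Illustrative input only; a human would pick the final ED from the output.
raw = "Ice dancing is a discipline of figure skating[1].\nIt draws  from ballroom dancing."
for sentence in candidate_sentences(raw):
    print(sentence)
```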

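Text and video embedding
◼ED embedding
• Uses BERT [Devlin+, NAACL2019]
• Average the features of all words
• Project with a fully connected layer to a dimension equal to the number of classes, then normalize
◼Temporal feature extraction
• Take the final output of Temporal Shift Module [Lin+, ICCV2019] as a 2048-dimensional feature and normalize it
◼Embedding the names of objects recognized in the frames
• Predict object probabilities with BiT [Kolesnikov+, arXiv2019] and take the objects with the highest probabilities
• Represent each object as an ED and embed it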
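The embedding steps listed here (average the word features, apply a fully connected layer, normalize, and keep the most probable objects) can be sketched in plain Python. This is a toy sketch under stated assumptions: the token features stand in for BERT outputs, the weights are made up, normalization is assumed to be L2, and all function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def mean_pool(token_features):
    """Average per-word feature vectors into one sentence vector."""
    n = len(token_features)
    return [sum(col) / n for col in zip(*token_features)]

def linear(vec, weight, bias):
    """Fully connected layer; weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def embed_description(token_features, weight, bias):
    """Mean-pool word features, project them, then normalize."""
    return l2_normalize(linear(mean_pool(token_features), weight, bias))

def top_k_objects(object_probs, k):
    """Return the k object names with the highest predicted probability."""
    ranked = sorted(object_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy stand-ins for BERT word features and learned projection weights.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
print(embed_description(tokens, weight, bias))
print(top_k_objects({"horse": 0.7, "guitar": 0.1, "saddle": 0.2}, k=2))
```

Each selected object name would then be turned into its own ED and passed through the same `embed_description` path as the class labels.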

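Overall view of the embeddings and outputs
[Figure labels: ED embedding (labels); spatio-temporal feature extraction; object probability prediction for each frame; ED embedding (objects)]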

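Training (Elaborative Rehearsal) and loss
◼Similarity computation
◼AR loss (contrastive loss)
◼ER loss
The subscript c means "the similarity was computed with the object embeddings rather than the label embeddings"
The names differ, but the computation is the same
(the only difference is whether the ground truth is a label or an object)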
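The loss equations themselves are not reproduced in this text, so the following is only a sketch of one common form of contrastive loss: softmax cross-entropy over temperature-scaled dot-product similarities. It assumes L2-normalized embeddings, a single target index per video for both losses, and a total loss of AR + λ·ER with the listed values τ = 0.1 and λ = 1. The function names are hypothetical, and the exact formulation in the paper may differ.

```python
import math

def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(query, candidates, target_index, tau=0.1):
    """Softmax cross-entropy over temperature-scaled similarities."""
    logits = [dot(query, c) / tau for c in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def total_loss(video, class_embs, label, object_embs, object_target,
               tau=0.1, lam=1.0):
    """AR loss against class labels plus weighted ER loss against objects."""
    ar = contrastive_loss(video, class_embs, label, tau)            # target: label
    er = contrastive_loss(video, object_embs, object_target, tau)   # target: object
    return ar + lam * er

# Toy example: the video embedding matches class 0 and object 0.
video = [1.0, 0.0]
embs = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(video, embs, 0))
print(total_loss(video, embs, 0, embs, 0))
```

The same `contrastive_loss` function serves both terms, which mirrors the note that only the ground truth (label or object) changes between the two losses.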

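Experimental settings
◼Datasets
• Olympic Sports, HMDB51, UCF101, Kinetics
• Uses the standard zero-shot learning splits [Xu+, IJCV2017]
• Kinetics uses its own split (described later)
◼Models (from page 6 of the paper)
• Text embedding: BERT
• Video embedding: TSM (pre-trained on Kinetics400)
• Object extraction: BiT (pre-trained on ImageNet21k)
◼Loss coefficients
• 𝜏 = 0.1, 𝜆 = 1
◼Other parameters
• Epochs: 10 (100 for Olympic Sports)
• Optimizer: Adam
• Weight decay: 1e-4
• Learning rate: 1e-4
• Scheduler: warm-up + cosine annealing
◼Evaluation
• Top-1 and top-5 accuracy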
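A warm-up plus cosine annealing schedule of the kind listed under the scheduler setting can be sketched as a single function of the training step. The base learning rate of 1e-4 comes from the settings above; the linear warm-up shape, the warm-up length of 100 steps, and decay to zero are assumptions, and `learning_rate` is a hypothetical name.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-4, warmup_steps=100):
    """Linear warm-up to base_lr, then cosine annealing down to zero."""
    if step < warmup_steps:
        # Ramp up linearly; the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Print the rate at a few points of a 1000-step run.
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```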

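Experimental results (Olympics, HMDB, UCF)
◼Comparison on the standard ZSAR benchmarks
• Results are reported as mean accuracy ± standard deviation
• Demonstrates the effectiveness of the proposed method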
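For reference, top-k accuracy and the "mean ± standard deviation" summary over several splits can be computed as below. This is a generic sketch with made-up scores; whether the paper uses the sample or the population standard deviation is not stated here, so the use of `statistics.stdev` (sample) is an assumption.

```python
import statistics

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

def mean_and_std(values):
    """Mean and sample standard deviation of per-split accuracies."""
    return statistics.mean(values), statistics.stdev(values)

# Made-up similarity scores for 3 videos over 3 classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
print(mean_and_std([0.5, 0.6, 0.7]))
```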

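Experimental results (Kinetics)
◼Proposes the Kinetics ZSAR Benchmark
• Train on the Kinetics400 train set
• Of the 220 classes added in Kinetics600, 60 classes are used for val and 160 classes for test
[Table labels: split details of the Kinetics ZSAR Benchmark; accuracy averaged over the 3 splits, using the val set]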
Ablation study
◼各提案手法の有効性を検証
• ED，BERT，ER loss，
Videoの埋め込み，
object classのEDとしての埋め込み

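Comparison with few-shot learning
◼Run few-shot learning while varying the number of labeled videos
• Feature extraction is the same
• Only the linear classification layer is trained
• to avoid overfitting
◼The proposed method falls behind when two or more videos per class are used as supervision
[Table label: details of the comparison with few-shot learning]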
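The few-shot baseline described here (frozen features, only a linear classification layer trained) can be sketched as softmax regression fitted with plain SGD. This is a toy illustration, not the experimental code: the features are made up, the bias term is omitted for brevity, and the learning rate, epoch count, and function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_linear_classifier(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit a bias-free linear layer on fixed features with SGD."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
            for c in range(n_classes):
                # Cross-entropy gradient w.r.t. the logit of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                weights[c] = [w - lr * grad * v for w, v in zip(weights[c], x)]
    return weights

def predict(weights, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return scores.index(max(scores))

# One labeled "video feature" per class (1-shot), features kept frozen.
w = train_linear_classifier([[1.0, 0.0], [0.0, 1.0]], [0, 1], n_classes=2)
print(predict(w, [0.9, 0.1]), predict(w, [0.2, 0.8]))
```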

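Summary
◼Proposes Elaborative Description
• Used for zero-shot learning
• Yields better text representations
◼Proposes Elaborative Rehearsal
• Training with the AR loss and the ER loss
• Adopting the ER loss improves accuracy
◼Ablation study
• Shows the effectiveness of each component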

Inference
◼The action class is recognized as the one with the highest similarity score
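This inference rule amounts to an argmax over similarity scores between the video embedding and the embeddings of the candidate (unseen) classes. The sketch below assumes L2-normalized embeddings so that the dot product acts as the similarity; the vectors are made up and `recognize` is a hypothetical name.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))

def recognize(video_emb, class_embs):
    """Return the class with the highest similarity, plus all scores."""
    scores = {name: dot(video_emb, emb) for name, emb in class_embs.items()}
    return max(scores, key=scores.get), scores

# Toy unit-norm embeddings for two unseen classes.
unseen = {"Ice Dancing": [0.0, 1.0], "Apply Eye Makeup": [1.0, 0.0]}
name, scores = recognize([0.6, 0.8], unseen)
print(name, scores)
```

Because the class embeddings come from text (EDs), new classes can be added at inference time without retraining, simply by adding entries to the dictionary.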

Experimental results (Olympics, HMDB, UCF)
◼Comparison on the standard ZSAR benchmarks
• Video
• FV: Fisher vector
• BoW: bag of words
• Obj: object
• S: image spatial feature
• Class
• A: attribute
• WN: word embedding of class name
• WT: word embedding of class text
