[DL Reading Group] Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos

Deep Learning JP

2023/3/3 Deep Learning JP http://deeplearning.jp/seminar-2/

DEEP LEARNING JP
[DL Papers]
Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos (CVPR 2022)
Yoshifumi Seki
http://deeplearning.jp/

Bibliographic information
● Venue
○ CVPR 2022
● Authors
○ Tsinghua University
● Reason for selection
○ I have recently been working on action analysis from video
https://github.com/ttlmh/Bridge-Prompt

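Background and goal
● We want to analyze actions in video well
● Actions have continuity (they occur in sequence)
○ e.g., drinking water
■ pick up a cup -> pour water -> drink the water
○ e.g., eating bread
■ spread butter -> spread jam -> eat the bread
● We want to build this continuity into the model
○ Several graph models have appeared recently, but they cannot handle unknown labels
● Use prompt engineering to exploit the strengths of large language models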

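What is prompt engineering?
● The idea of inserting a given input (such as label information) into a template so that it is fed to the model as a well-formed sentence, which lets us benefit from large language models
● Adopted as the mechanism for few-shot learning in GPT-3
● Used for text-image matching in image classification with OpenAI's CLIP
● Applied to video as well in ActionCLIP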
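The template idea above can be sketched in a few lines. This is a minimal illustration only: the two template strings and the helper name `fill_templates` are hypothetical and are not taken from GPT-3, CLIP, or ActionCLIP.

```python
# Hypothetical templates for illustration; real systems use their own wording.
TEMPLATES = [
    "a video of a person {label}.",
    "the person is {label}.",
]


def fill_templates(label, templates=TEMPLATES):
    """Turn a bare class label into natural-language sentences."""
    return [template.format(label=label) for template in templates]


if __name__ == "__main__":
    for sentence in fill_templates("pouring water"):
        print(sentence)
```

Each resulting sentence, rather than the bare label, is what would be passed to a text encoder.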

ActionCLIP
● Generates sentences from labels via prompt engineering, then estimates the label by measuring the similarity between the outputs of the Text Encoder and the Video Encoder
https://arxiv.org/abs/2109.08472
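The similarity-based label estimation described above can be sketched as follows. This is not ActionCLIP's code: the embeddings are hand-written toy vectors standing in for encoder outputs, the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(video_emb, text_embs, temperature=0.07):
    """Pick the label whose text embedding is most similar to the video embedding."""
    labels = list(text_embs)
    logits = [cosine(video_emb, text_embs[label]) / temperature for label in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


if __name__ == "__main__":
    # Toy vectors standing in for Text Encoder / Video Encoder outputs.
    text_embs = {"drink water": [0.9, 0.1, 0.0], "eat bread": [0.1, 0.9, 0.2]}
    video_emb = [0.8, 0.2, 0.1]
    label, probs = classify(video_emb, text_embs)
    print(label, probs)
```

Because the labels enter only as text, adding a new class amounts to adding one more sentence embedding to `text_embs`.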

Details of the prompt component
● 1. Statistical Prompt
○ How many actions are in the video
○ The video has {num} actions.
● 2. Ordinal Prompt
○ Which action in the sequence this is
○ This is the {ord_i} action in the video.
● 3. Semantic Prompt
○ “{ord_i}, the person is performing the action step of {vp_i}”
● 3+1. Integrated Prompt
○ Everything combined
○ All the Semantic Prompts lined up as sentences
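The four prompt types can be assembled from an ordered list of action phrases, as in the sketch below. The template strings are the ones listed on the slide; the ordinal word list, the use of a digit for {num}, the capitalization, and the way the integrated prompt joins sentences are assumptions made for illustration and may differ from the authors' repository.

```python
# Assumed ordinal words; the slide does not specify how {ord_i} is rendered.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def build_prompts(actions):
    """Build statistical, ordinal, semantic and integrated prompts for one clip."""
    n = len(actions)
    if n == 0 or n > len(ORDINALS):
        raise ValueError("this sketch supports 1 to %d actions" % len(ORDINALS))
    statistical = f"The video has {n} actions."
    ordinal = [f"This is the {ORDINALS[i]} action in the video." for i in range(n)]
    semantic = [
        f"{ORDINALS[i].capitalize()}, the person is performing the action step of {vp}"
        for i, vp in enumerate(actions)
    ]
    # Assumed joining rule: semantic prompts concatenated as consecutive sentences.
    integrated = ". ".join(semantic) + "."
    return {
        "statistical": statistical,
        "ordinal": ordinal,
        "semantic": semantic,
        "integrated": integrated,
    }


if __name__ == "__main__":
    prompts = build_prompts(["taking a cup", "pouring water", "drinking water"])
    for name, value in prompts.items():
        print(name, "->", value)
```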

Evaluation datasets
● 50Salads: 50 top-view 30-fps instructional videos of salad preparation
○ 19 action classes
● Georgia Tech Egocentric Activities (GTEA): 28 egocentric 15-fps instructional videos of daily kitchen activities
○ 74 action classes
● Breakfast: 1,712 third-person 15-fps videos of breakfast preparation activities
○ 48 action classes

Implementation
● Videos are split into 16-frame segments
● Pretrained on Kinetics-400 using ActionCLIP
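As a rough sketch of the 16-frame splitting step: the slide gives only the segment length, so the non-overlapping windows and the handling of a short final segment below are assumptions, and `split_into_clips` is a hypothetical helper name.

```python
def split_into_clips(num_frames, clip_len=16, keep_remainder=True):
    """Return (start, end) frame index pairs, end exclusive, for consecutive clips."""
    clips = []
    for start in range(0, num_frames, clip_len):
        end = min(start + clip_len, num_frames)
        if end - start < clip_len and not keep_remainder:
            break  # drop the short tail when the caller asks for full clips only
        clips.append((start, end))
    return clips


if __name__ == "__main__":
    print(split_into_clips(40))
    print(split_into_clips(40, keep_remainder=False))
```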

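Handling unknown IDs
● If only specific actions are learned during fine-tuning, can similar actions still be estimated?
○ coffee2tea: fine-tune only on making coffee, then check whether making tea is predicted correctly
○ AKL: overall accuracy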

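Summary and impressions
● I learned for the first time that prompt engineering is being used outside NLP, which was instructive
● It is a pity that these experiments did not make clear what effect imposing an order actually has
● Handling unknown IDs is impressive, but I doubt whether this experimental setup is appropriate for measuring it
● I would have liked to read more from the results about how the method differs from existing models
○ Accuracy alone does not show where the improvement comes from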

5-6. CLIP (ICML 2021), from the 2021/1/15 presentation

8. Proposed method

9. Overall diagram of the proposed method

15. Comparison on long-term videos

17. Comparison and analysis of the Fusion Module
