Susang Kim(healess1@gmail.com)
Video Understanding(1)
Quo Vadis, Action Recognition?
A New Model and the Kinetics Dataset
Action Recognition Paper
A DeepMind paper (CVPR 2017) that proposes the Two-Stream
Inflated 3D ConvNets (I3D) for action recognition and releases
the Kinetics Dataset.
Action Recognition: classifying which action a person performs
in a given video (a video goes in, a predicted action label
comes out).
A scene from Quo Vadis:
Are these actors about to kiss each other,
or have they just done so?
⇒ Actions can be ambiguous in individual
frames
Kinetics Dataset - a human action video dataset released with this paper
Motivated by the fact that models pre-trained on ImageNet (roughly 1,000 images per category over 1,000
categories) perform well not only on classification but also on object detection and segmentation, the authors
build a comparably large video dataset. Pre-training on Kinetics and then fine-tuning on the existing HMDB-51
and UCF-101 benchmarks reaches state of the art, demonstrating through this work how much large-scale
training data matters for action recognition.
Kinetics Dataset: trimmed video clips of roughly 10 seconds centered on human actions (single-person actions,
person-person interactions, and handling of objects), with hundreds of clips per class; 400 classes were
released first and later extended to 600 classes.
For faster experiments, this paper runs most ablations on the miniKinetics dataset (a preliminary subset of the
full Kinetics): 213 classes and about 120k clips, with 150-1000 training clips per class (validation: 25 clips / test:
75 clips per class).
A Short Note about Kinetics-600 https://arxiv.org/pdf/1808.01340.pdf
Action Recognition Benchmark
DATASET YEAR # ACTIONS # CLIPS PER ACTION
KTH 2004 6 10
Weizmann 2005 9 9
IXMAS 2006 11 33
Hollywood 2008 8 30-140
UCF Sports 2009 9 14-35
Hollywood2 2009 12 61-278
UCF YouTube 2009 11 100
MSR 2009 3 14-25
Olympic 2010 16 50
UCF50 2010 50 min. 100
HMDB51 2011 51 min. 101
First pretraining on Kinetics and then fine-tuning on HMDB-51 and UCF-101 gives a boost in performance.
HMDB-51 Dataset
Released at ICCV 2011: 6,849 video clips of human motion labeled with 51 action categories, each
category containing at least 101 clips.
http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#introduction
UCF101 Dataset
UCF101 Dataset: released in 2012 by UCF's Center for Research in Computer Vision, with 13,320 videos covering 101 action classes.
The 101 categories are divided into 25 groups, and each group contains 4-7 videos of an action.
https://www.crcv.ucf.edu/data/UCF101.php
Video Architecture
Using an ImageNet pre-trained model (Inception-v1) as the backbone, the paper compares the existing
architectures (a-d) against the proposed I3D (e) with pre-training on Kinetics, showing how changing the
network structure (together with large-scale pre-training) improves performance.
The Old I: ConvNet+LSTM
Long-term Recurrent Convolutional Networks for Visual Recognition
and Description(CVPR 2015) (https://arxiv.org/pdf/1411.4389.pdf)
Frames are taken from the 25 fps video, per-frame features are extracted with a CNN (Inception-V1),
and an LSTM (512 hidden units, with batch norm) models the temporal information;
training uses a cross-entropy loss.
Because the LSTM must be unrolled sequentially through time, training is computationally expensive.
(A minimal sketch of this pipeline follows.)
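As a rough illustration of this ConvNet+LSTM pipeline (a minimal sketch, not the paper's exact configuration: the frame encoder, feature dimension, and class count below are placeholders; only the 512-unit LSTM matches the slide):

```python
import torch
import torch.nn as nn

class ConvNetLSTM(nn.Module):
    """Per-frame 2D CNN features -> LSTM -> action logits (illustrative sketch)."""
    def __init__(self, feat_dim=1024, hidden=512, num_classes=400):
        super().__init__()
        # Placeholder frame encoder; the paper uses ImageNet-pretrained Inception-V1.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))  # (B*T, feat_dim)
        out, _ = self.lstm(feats.view(b, t, -1))   # (B, T, hidden)
        return self.classifier(out[:, -1])         # classify from the last time step

logits = ConvNetLSTM()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 400])
```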
The Old II: 3D ConvNets
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
Learning spatiotemporal features with 3d convolutional networks (ICCV 2015)
https://arxiv.org/pdf/1412.0767.pdf
The 3D CNN defined here as C3D has many more parameters than a 2D network because of the extra
kernel dimension, which makes it hard to train; it was trained with batches of 15 videos on K40 GPUs.
H x W x D => T x H x W x D (a time axis is added, so the kernels slide through the clip in 3D).
Since well-optimized 2D networks such as ResNet did not exist at the time, a newly defined network
like the one below was trained from scratch. (A small 3D-conv sketch follows.)
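To make the added time axis concrete, here is a small hedged sketch of a C3D-style block (channel counts, depth, and pooling choices are illustrative, not the actual C3D configuration): the kernels become k × k × k and slide over time as well as height and width.

```python
import torch
import torch.nn as nn

# Input is (B, C, T, H, W) instead of the 2D case's (B, C, H, W).
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),    # 3x3x3 kernel over (T, H, W)
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),           # pool space only in the first block
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                   # later, pool time and space together
)

clip = torch.randn(1, 3, 16, 112, 112)             # a 16-frame RGB clip
print(block(clip).shape)                           # torch.Size([1, 128, 8, 28, 28])
```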
The Old III: Two-Stream Networks
Applies optical flow, an efficient approach to motion estimation: a two-stream design with an RGB (spatial) stream and an optical-flow (temporal) stream. (A minimal late-fusion sketch follows the references below.)
Two-Stream Convolutional Networks for Action Recognition in Videos (NIPS 2014)
(https://papers.nips.cc/paper/2014/file/00ec53c4682d36f5c4359f4ae7bd7ba1-Paper.pdf)
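A minimal sketch of the two-stream idea under assumptions (placeholder backbones; 10 stacked flow fields fed as 20 input channels, following the NIPS 2014 setup): the spatial stream sees a single RGB frame, the temporal stream sees stacked optical flow, and the two softmax outputs are averaged.

```python
import torch
import torch.nn as nn

def tiny_backbone(in_ch, num_classes=101):
    """Placeholder 2D ConvNet standing in for the per-stream networks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 7, stride=2, padding=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

spatial_net  = tiny_backbone(in_ch=3)    # one RGB frame
temporal_net = tiny_backbone(in_ch=20)   # 10 flow fields x (dx, dy) stacked as channels

rgb_frame  = torch.randn(1, 3, 224, 224)
flow_stack = torch.randn(1, 20, 224, 224)

# Late fusion: average the per-stream class probabilities.
probs = (spatial_net(rgb_frame).softmax(-1) + temporal_net(flow_stack).softmax(-1)) / 2
print(probs.shape)  # torch.Size([1, 101])
```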
The Old IV: 3D-Fused Two-Stream
Convolutional two-stream network fusion for video action recognition (CVPR 2016)
https://arxiv.org/pdf/1604.06573.pdf
This variant improved performance on HMDB (the Human Motion DataBase) by taking the features produced by the
two streams, multiple RGB frames and optical flow, and fusing them with a 3D convolution (the 3D conv is used to
fuse the streams near the output, not to extract features from pixels).
The network is Inception-V1; 5 consecutive RGB frames sampled 10 frames apart are combined with the
corresponding optical-flow features and trained end to end. (A rough sketch of the fusion step follows.)
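A rough sketch of the fusion step only, under assumptions (the channel size 832 and the head below are placeholders, not the paper's exact fusion network): per-frame feature maps from the RGB and flow streams are stacked over 5 time steps and merged with a learned 3D convolution near the top of the network.

```python
import torch
import torch.nn as nn

# Assume each 2D stream has produced per-frame feature maps for 5 consecutive
# time steps, stacked into (B, C, T, H, W); C=832 is just a placeholder here.
rgb_feats  = torch.randn(2, 832, 5, 7, 7)
flow_feats = torch.randn(2, 832, 5, 7, 7)

fused = torch.cat([rgb_feats, flow_feats], dim=1)              # concat on channels
fuse_conv = nn.Conv3d(2 * 832, 512, kernel_size=3, padding=1)  # learned 3D fusion
head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(512, 101))

print(head(fuse_conv(fused)).shape)  # torch.Size([2, 101])
```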
The New: Two-Stream Inflated 3D ConvNets(I3D)
The method proposed in this paper: instead of extracting 2D features frame by frame, multi-frame
RGB and optical-flow inputs are each processed directly with 3D convolutions and the resulting
predictions are combined, so that continuous motion in RGB and the corresponding changes in
optical flow are captured accurately. (Rather than designing a new 3D network, the proven
Inception-V1 architecture is inflated into 3D.)
2D -> 3D (N × N filters become N × N × N: the filters gain an extra dimension).
Because the 3D convolutions take multiple frames and flow fields as input, the ImageNet pre-trained
model is reused by copying each 2D filter N times along the time axis, so the existing weights
can still be used:
repeating the weights of the 2D filters N times along the time dimension (and rescaling them by 1/N)
Proposed under the name Two-Stream Inflated 3D ConvNet (I3D).
One I3D network is trained on RGB inputs, and another on flow inputs which carry optimized, smooth flow
information. The two networks are trained separately and their predictions averaged at test time. (A small inflation sketch follows.)
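A minimal sketch of the inflation ("bootstrapping") step under assumptions (hypothetical helper, written in PyTorch rather than the paper's TensorFlow code): a pretrained 2D kernel of shape (out, in, k, k) is repeated N times along a new time dimension and divided by N, so a "boring" video made of one repeated frame produces the same activations as the 2D model on that single frame.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Inflate a pretrained 2D conv into a 3D conv by repeating weights over time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride), padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, k, k) -> (out, in, N, k, k), rescaled by 1/N.
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Sanity check on a "boring video": repeating one frame over time should give the
# same response at an interior time step as the 2D conv gives on the single frame.
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d, time_dim=3)
frame  = torch.randn(1, 3, 32, 32)
boring = frame.unsqueeze(2).repeat(1, 1, 9, 1, 1)    # (1, 3, 9, 32, 32)
print(torch.allclose(conv2d(frame), conv3d(boring)[:, :, 4], atol=1e-5))  # True
```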
Implementation Details
The Inception-V1 network is modified for better performance. Because spatial and temporal information are learned together, choosing
the strides is important: pool too aggressively over time and spatial detail suffers, pool too little and the motion (temporal) information is
not captured well (for this reason the first two max-pooling layers do not pool over the time axis).
RGB alone can already predict motion from the frame-to-frame changes in the features, but optical flow lets the model capture
finer-grained motion. (A short pooling sketch follows.)
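To illustrate the stride point (a hedged sketch with made-up tensor sizes, not the exact I3D layer list): the early max-pooling layers keep a temporal stride of 1 so fine motion is preserved, while later pools also downsample time.

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 64, 32, 56, 56)   # (B, C, T, H, W)

early_pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
late_pool  = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1))

print(early_pool(clip).shape)  # torch.Size([1, 64, 32, 28, 28]) -- time untouched
print(late_pool(clip).shape)   # torch.Size([1, 64, 16, 28, 28]) -- time halved
```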
Implementation Details
All models but the C3D-like 3D ConvNet use ImageNet-pretrained Inception-V1 as the base network.
For all architectures, each convolutional layer is followed by a batch-normalization layer and a ReLU
activation function, except for the last convolutional layers, which produce the class scores (1x1x1 convolutions) for
each network.
Training on videos used standard SGD with momentum set to 0.9 in all cases, with synchronous
parallelization across 32 GPUs for all models except the 3D ConvNets (64 GPUs).
We trained models on miniKinetics for up to 35k steps, and for 110k steps on Kinetics, with a 10%
reduction of learning rate when validation loss saturated.
We tuned the learning rate hyperparameter on the validation set of miniKinetics. Models were trained for
up to 5k steps on UCF-101 and HMDB-51 using a similar learning rate adaptation procedure as for
Kinetics but using just 16 GPUs. All the models were implemented in TensorFlow
(https://github.com/deepmind/kinetics-i3d)
During training, data augmentation uses random cropping both spatially (resizing the smaller video side to 256
pixels, then randomly cropping a 224 × 224 patch) and temporally (choosing the starting frame at random),
together with random left-right flipping and photometric augmentation.
At test time, the models are applied convolutionally over the whole video, taking 224 × 224 center
crops, and the predictions are averaged. Optical flow was computed with a TV-L1 algorithm. (A flow-computation sketch follows.)
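A hedged sketch of computing TV-L1 flow with OpenCV (requires an opencv-contrib build; the factory function's location differs between OpenCV versions, and the [-20, 20] truncation follows the released kinetics-i3d preprocessing rather than anything stated on this slide):

```python
import cv2
import numpy as np

def tvl1_flow(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Dense TV-L1 optical flow between two frames, rescaled to roughly [-1, 1]."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # The factory lives in cv2.optflow in recent opencv-contrib builds,
    # and directly on cv2 in some older 3.x builds.
    create = getattr(cv2, "optflow", cv2).DualTVL1OpticalFlow_create
    flow = create().calc(prev_gray, next_gray, None)   # (H, W, 2) pixel displacements
    return np.clip(flow, -20, 20) / 20.0               # truncate and rescale
```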
Experimental Comparison of Architectures
Compares models trained only on UCF-101/HMDB-51 against the same architectures pre-trained on Kinetics,
showing the performance difference each architecture gains from the pre-training.
Comparison with the SOTA and Next
Comparison with state-of-the-art on the UCF-101
and HMDB-51 datasets, averaged over three
splits. First set of rows contains results of models
trained without labeled external data.
As future work, the authors see promise in transferring the learned features to other video tasks such as
semantic video segmentation, video object detection, or optical flow computation; they have not yet employed
action tubes or attention mechanisms to focus in on the human actors; and they plan to repeat all experiments
using Kinetics instead of miniKinetics, with and without ImageNet pre-training, and to explore inflating other
state-of-the-art 2D ConvNets.
Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
