I3D and Kinetics datasets (Action Recognition)

Susang Kim(healess1@gmail.com)
Video Understanding(1)
Quo Vadis, Action Recognition?
A New Model and the Kinetics Dataset

Action Recognition Paper
A paper from DeepMind (CVPR 2017) that proposes
Two-Stream Inflated 3D ConvNets (I3D) for action
recognition and releases the Kinetics dataset
Action Recognition: classifying which action a
person is performing in a given video (a video
clip goes in, a predicted action class comes out)
A scene from Quo Vadis:
Are these actors about to kiss each other,
or have they just done so?
⇒ Actions can be ambiguous in individual
frames

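Kinetics Dataset - Human Action Video Dataset (released with this paper)
Motivation: models pre-trained on ImageNet (about 1,000 images for each of 1,000 categories) perform well
not only on classification but also on object detection and segmentation. Kinetics was built to play the
same role for action recognition. Models pre-trained on Kinetics and then fine-tuned on the established
HMDB-51 and UCF-101 benchmarks reach SOTA, which shows how much large-scale training data matters.
Kinetics Dataset: trimmed video clips of about 10 seconds each, centred on human actions (single-person
actions, person-person actions, and person-object actions). It was first released with 400 classes (at
least 400 clips per class) and later extended to 600 classes (at least 600 clips per class).
miniKinetics (a subset used for preliminary experiments before full Kinetics) is a smaller dataset for
faster experimentation: 213 classes and about 120k clips in total, with 150-1000 training clips per class
(validation: 25 clips per class / test: 75 clips per class)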
A Short Note about Kinetics-600 https://arxiv.org/pdf/1808.01340.pdf

Action Recognition Benchmark
DATASET       YEAR   # ACTIONS   # CLIPS PER ACTION
KTH           2004   6           10
Weizmann      2005   9           9
IXMAS         2006   11          33
Hollywood     2008   8           30-140
UCF Sports    2009   9           14-35
Hollywood2    2009   12          61-278
UCF YouTube   2009   11          100
MSR           2009   3           14-25
Olympic       2010   16          50
UCF50         2010   50          min. 100
HMDB51        2011   51          min. 101
First pre-training on Kinetics and then fine-tuning on HMDB-51 and UCF-101 gives a boost in performance

HMDB-51 Dataset
Released at ICCV 2011: 6,849 video clips of human motion in 51 action categories.
Each category contains at least 101 clips
http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#introduction

UCF101 Dataset
UCF101 Dataset: released at ICCV 2013, with 13,320 videos and 101 action classes
The videos of each of the 101 categories are divided into 25 groups, and each group contains 4-7 videos of that action.
https://www.crcv.ucf.edu/data/UCF101.php

Video Architecture
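Using an ImageNet pre-trained model (Inception-v1) as the base, the paper compares the existing architectures (a-d)
with the proposed I3D (e) plus pre-training on Kinetics, and shows the gain that comes from changing the network structure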

The Old I: ConvNet+LSTM
Long-term Recurrent Convolutional Networks for Visual Recognition
and Description(CVPR 2015) (https://arxiv.org/pdf/1411.4389.pdf)
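Frames are extracted at 25 fps and a CNN (Inception-V1) produces features
for each frame. An LSTM (512 hidden units, with batch norm) then models the temporal sequence
Trained with cross-entropy loss
Because the LSTM has to process the frames sequentially, training is computationally expensive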

The Old II: 3D ConvNets
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
Learning spatiotemporal features with 3d convolutional networks (ICCV 2015)
https://arxiv.org/pdf/1412.0767.pdf
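The 3D CNN known as C3D has many more parameters than a 2D network because of the
extra kernel dimension, which makes it harder to train. Trained on K40 GPUs with 15 videos per batch
H x W x D => T x H x W x D (a time axis is added, so the kernel also slides along time, not only across height and width)
Because no optimized 2D network such as ResNet was available at the time, the authors
defined a new network for training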
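The parameter growth from 2D to 3D kernels can be made concrete with a small count. This is a minimal sketch: the layer sizes (64 input channels, 128 output channels, kernel size 3) are hypothetical values chosen for illustration and are not taken from C3D or from the slides.

```python
def conv_params(kernel_dims, in_channels, out_channels, bias=True):
    """Number of learnable parameters in a single convolution layer."""
    weights = in_channels * out_channels
    for k in kernel_dims:
        weights *= k
    return weights + (out_channels if bias else 0)


# Hypothetical layer sizes, for illustration only.
params_2d = conv_params((3, 3), 64, 128)      # H x W kernel
params_3d = conv_params((3, 3, 3), 64, 128)   # T x H x W kernel

print(params_2d, params_3d, round(params_3d / params_2d, 2))
```

With a kernel size of 3, adding the time axis roughly triples the weights of each layer, which is one reason 3D networks are harder to train on small datasets.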

The Old III: Two-Stream Networks
Uses optical flow, an efficient way to estimate motion: two streams, RGB (spatial) and optical flow (temporal)
Two-Stream Convolutional Networks for Action Recognition in Videos (NIPS 2014)
(https://papers.nips.cc/paper/2014/file/00ec53c4682d36f5c4359f4ae7bd7ba1-Paper.pdf)
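A two-stream model has to merge the outputs of its RGB and optical-flow streams. The sketch below shows one simple option, late fusion by averaging class probabilities. Averaging softmax outputs (rather than logits) is an assumption here, and the logits are made-up numbers for three hypothetical classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(rgb_logits, flow_logits):
    """Late fusion: average the class probabilities of the two streams."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(r + f) / 2.0 for r, f in zip(rgb_probs, flow_probs)]


# Made-up scores for 3 classes (illustration only).
rgb_logits = [2.0, 1.0, 0.1]    # spatial stream prefers class 0
flow_logits = [0.5, 2.5, 0.3]   # temporal stream prefers class 1

fused = fuse_two_streams(rgb_logits, flow_logits)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

In this toy case the flow stream is more confident than the RGB stream, so the fused prediction follows it.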

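The Old IV: 3D-Fused Two-Stream
Convolutional two-stream network fusion for video action recognition (CVPR 2016)
https://arxiv.org/pdf/1604.06573.pdf
An approach that improved results on HMDB (Human Motion Database): features from multiple RGB frames and optical-flow
inputs are extracted by the existing 2D streams and then fused with a 3D conv (the 3D conv is used to fuse the streams and reduce the loss, not to extract features)
The network uses Inception-V1. 5 consecutive RGB frames sampled 10 frames apart are combined with the corresponding optical-flow features
and the whole model is trained end-to-end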
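The RGB sampling pattern for this architecture (5 frames taken 10 frames apart) can be sketched as an index computation. The function names and the `start` and `video_length` parameters are hypothetical and exist only for this illustration.

```python
def sample_frame_indices(start, num_frames=5, stride=10):
    """Indices of `num_frames` frames taken `stride` frames apart."""
    return [start + i * stride for i in range(num_frames)]


def last_valid_start(video_length, num_frames=5, stride=10):
    """Largest start index for which every sampled frame exists."""
    return video_length - 1 - (num_frames - 1) * stride


indices = sample_frame_indices(start=0)
print(indices)                  # five indices, 10 frames apart
print(last_valid_start(250))    # a 250-frame clip (10 s at 25 fps)
```

The five sampled frames span 41 frames of video, so a clip needs at least that many frames to be sampled without looping or padding.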

The New: Two-Stream Inflated 3D ConvNets(I3D)
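The method proposed in this paper. Instead of extracting features
with 2D convolutions, stacks of RGB frames and stacks of optical-flow
frames are each passed through 3D convolutions, and the outputs of the
two streams are combined. According to the paper, this captures both the
continuous motion in RGB and the changes in optical flow more accurately.
(Not a new 3D network: Inception v1 is inflated to 3D, reusing a proven network)
2D->3D (N × N filters become N × N × N: a time dimension is added)
Because the 3D convs take multiple frames (RGB or optical flow) as input,
the ImageNet pre-trained 2D weights are copied N times along the new time
dimension so that the existing weights can be reused for initialization
(repeating the weights of the 2D filters N times along the time dimension,
and rescaling them by dividing by N)
The resulting model is named Two-Stream Inflated 3D ConvNet (I3D)
One I3D network is trained on RGB inputs, and another on flow inputs which carry optimized, smooth flow
information. The two networks are trained separately and their predictions are averaged at test time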
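The weight inflation can be sketched in a few lines of plain Python. The paper repeats each 2D filter N times along time and also rescales it by 1/N, so that a "boring" video made of one repeated frame gives the same response as the 2D filter gives on that frame. The kernel and frame values below are made up, and the function names are hypothetical.

```python
def inflate_2d_kernel(kernel_2d, n):
    """Repeat a 2D kernel n times along time and rescale by 1/n."""
    return [[[w / n for w in row] for row in kernel_2d] for _ in range(n)]


def response_2d(kernel_2d, patch_2d):
    """Filter response at one position: sum of elementwise products."""
    return sum(w * x
               for k_row, p_row in zip(kernel_2d, patch_2d)
               for w, x in zip(k_row, p_row))


def response_3d(kernel_3d, patch_3d):
    """3D filter response: sum of the per-frame 2D responses."""
    return sum(response_2d(k, p) for k, p in zip(kernel_3d, patch_3d))


# Made-up 3x3 kernel and image patch (illustration only).
kernel_2d = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
frame = [[3.0, 1.0, 0.0], [4.0, 1.0, 1.0], [5.0, 9.0, 2.0]]

n = 3
kernel_3d = inflate_2d_kernel(kernel_2d, n)
boring_video = [frame] * n   # the same frame repeated n times

out_2d = response_2d(kernel_2d, frame)
out_3d = response_3d(kernel_3d, boring_video)
print(out_2d, out_3d)        # the two responses agree up to rounding
```

Without the 1/N rescaling, the inflated filter would respond N times more strongly than the pre-trained 2D filter, and the activations seen by later layers would no longer match what the ImageNet weights expect.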

Implementation Details
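The Inception-V1 network is modified to improve performance. Because spatial and temporal information are learned together, the choice of strides matters:
if the temporal stride is too large, spatial detail is lost, and if it is too small, motion over time is not captured (the first and second max-pool layers do not pool along the time axis)
RGB alone can capture motion through changes in per-frame features, but optical flow allows finer-grained motion to be captured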
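The effect of not pooling over time in the first two max-pool layers can be traced with a small size calculation. The layer list and strides below are assumptions, written from memory of the paper's architecture figure rather than taken from the slides, and "same"-style padding (ceil division) is also assumed.

```python
def output_size(size, stride):
    """Size after a 'same'-padded layer with the given stride (ceil division)."""
    return -(-size // stride)


# (layer name, temporal stride, spatial stride): assumed values.
layers = [
    ("conv 7x7x7", 2, 2),
    ("max-pool 1 (1x3x3)", 1, 2),   # no temporal pooling
    ("max-pool 2 (1x3x3)", 1, 2),   # no temporal pooling
    ("max-pool 3 (3x3x3)", 2, 2),
    ("max-pool 4 (2x2x2)", 2, 2),
]

t, s = 64, 224   # 64 input frames of 224 x 224 pixels
trace = []
for name, t_stride, s_stride in layers:
    t, s = output_size(t, t_stride), output_size(s, s_stride)
    trace.append((name, t, s))

for name, frames, side in trace:
    print(f"{name:20s} T={frames:3d}  HxW={side}x{side}")
```

Under these assumptions the spatial size shrinks by 32x while the temporal size shrinks by only 8x, so the early layers keep temporal resolution while the spatial resolution is reduced.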

Implementation Details
All models but the C3D-like 3D ConvNet use ImageNet pretrained Inception-V1 as base network.
For all architectures we follow each convolutional layer by a batch normalization layer and a ReLU
activation function, except for the last convolutional layers which produce the class scores(1x1x1) for
each network.
Training on videos used standard SGD with momentum set to 0.9 in all cases, with synchronous
parallelization across 32 GPUs for all models except the 3D ConvNets(64 GPUs).
We trained models on miniKinetics for up to 35k steps, and for 110k steps on Kinetics, with a 10x
reduction of learning rate when validation loss saturated.
We tuned the learning rate hyperparameter on the validation set of miniKinetics. Models were trained for
up to 5k steps on UCF-101 and HMDB-51 using a similar learning rate adaptation procedure as for
Kinetics but using just 16 GPUs. All the models were implemented in TensorFlow
(https://github.com/deepmind/kinetics-i3d)
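Cutting the learning rate when validation loss saturates can be sketched as a reduce-on-plateau rule that divides the learning rate by 10 (factor 0.1). The `patience` and `min_delta` thresholds, the initial learning rate, and the loss values are hypothetical. The slides do not say how saturation was detected, so this is only one plausible rule and not the authors' procedure.

```python
def reduce_on_plateau(val_losses, initial_lr, factor=0.1, patience=2, min_delta=1e-3):
    """Return the learning rate in effect after each validation check."""
    lr = initial_lr
    best = float("inf")
    stale = 0                       # checks since the last real improvement
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:   # loss has saturated: cut the learning rate
                lr *= factor
                stale = 0
        schedule.append(lr)
    return schedule


# Made-up validation losses (illustration only).
val_losses = [2.0, 1.5, 1.2, 1.2, 1.2, 1.0, 1.0, 1.0]
schedule = reduce_on_plateau(val_losses, initial_lr=0.1)
print(schedule)
```

Here the loss stalls twice, so the learning rate is cut twice, by a factor of 10 each time.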
During training, data augmentation used random cropping both spatially (resizing the smaller video side to 256
pixels, then randomly cropping a 224 × 224 patch) and temporally, together with random left-right flipping.
Photometric augmentation is mentioned in the paper only as a possible further improvement.
During test time the models are applied convolutionally over the whole video taking 224 × 224 center
crops, and the predictions are averaged. Optical flow was computed with a TV-L1 algorithm
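The spatial part of this pipeline (resize the smaller side to 256, random 224 × 224 crop for training, center crop for testing) comes down to computing crop boxes. This sketch works only on coordinates and does not touch pixels. The 640 × 360 input size and the function names are hypothetical.

```python
import random


def resize_smaller_side(width, height, target=256):
    """New (width, height) after scaling so the smaller side equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def random_crop_box(width, height, crop=224, rng=random):
    """Training: a random crop box (left, top, right, bottom)."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def center_crop_box(width, height, crop=224):
    """Testing: a centered crop box (left, top, right, bottom)."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


w, h = resize_smaller_side(640, 360)           # hypothetical source resolution
train_box = random_crop_box(w, h, rng=random.Random(0))
test_box = center_crop_box(w, h)
print((w, h), train_box, test_box)
```

The same crop box would normally be applied to every frame of a clip so that the spatial layout stays consistent over time.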

Experimental Comparison of Architectures
Compares models trained only on
UCF-101/HMDB-51 with models
that use Kinetics pre-training,
showing the difference in
performance

Comparison with the SOTA and Next
Comparison with state-of-the-art on the UCF-101
and HMDB-51 datasets, averaged over three
splits. First set of rows contains results of models
trained without labeled external data.
As future work:
it remains to be seen whether Kinetics pre-training
helps other video tasks such as semantic video
segmentation, video object detection, or optical
flow computation
we have not employed action tubes or attention
mechanisms to focus in on the human actors.
we plan to repeat all experiments using Kinetics
instead of miniKinetics, with and without
ImageNet pre-training, and explore inflating other
state-of-the-art 2D ConvNets

Thanks
Any Questions?
You can send mail to
Susang Kim(healess1@gmail.com)
