Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I
1. Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition
Ting Yao
Principal Researcher, Vision and Multimedia Lab, JD AI Research
Tutorial @ ICME, July 8th, 2019
7. 2D CNN + LSTM (LRCN)
[Timeline: 2011 to 2016]
Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Donahue et al. CVPR 2015]
8. 3D convolutional network (C3D)
[Timeline: 2011 to 2016]
Learning Spatiotemporal Features with 3D Convolutional Networks. [Tran et al. ICCV 2015]
9. Temporal segment networks (TSN)
[Timeline: 2011 to 2016]
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. [Wang et al. ECCV 2016]
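A minimal sketch of the sparse sampling idea behind TSN: split the video into equal segments, draw one snippet per segment, and average the per-snippet class scores. The function names, the fixed random seed, and the averaging consensus are illustrative assumptions, not the authors' code.

```python
import random

def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of num_segments equal segments."""
    rng = rng or random.Random(0)  # fixed seed, only so the sketch is repeatable
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # keep the range non-empty
        indices.append(rng.randrange(start, end))
    return indices

def segmental_consensus(snippet_scores):
    """Average the class scores of all snippets (one simple consensus choice)."""
    num_snippets = len(snippet_scores)
    return [sum(column) / num_snippets for column in zip(*snippet_scores)]

# Hypothetical usage: a 90-frame video, 3 segments, 2 classes.
frame_ids = sample_segment_indices(90, 3)
video_scores = segmental_consensus([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

In practice the per-snippet scores would come from a 2D CNN applied to each sampled snippet; here they are hard-coded placeholders.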
13. 2D Convolution vs. 3D Convolution (2D ResNet vs. 3D ResNet)
ResNet-152:
Time Cost: 9 x C^2 x H x W
Model size: 230MB
3D ResNet-152:
Time Cost: 27 x C^2 x T x H x W
Model size: 690MB
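The cost expressions above can be checked with a short calculation. This sketch assumes a single convolution layer with C input and C output channels and a 3 x 3 (or 3 x 3 x 3) kernel; the concrete sizes are illustrative, not taken from the slide.

```python
def conv2d_cost(c, h, w, k=3):
    """Multiply-accumulates of one k x k convolution, C -> C channels, on one frame."""
    return k * k * c * c * h * w

def conv3d_cost(c, t, h, w, k=3):
    """Multiply-accumulates of one k x k x k convolution, C -> C channels, on T frames."""
    return k ** 3 * c * c * t * h * w

def conv2d_params(c, k=3):
    """Weight count of one k x k convolution (bias ignored)."""
    return k * k * c * c

def conv3d_params(c, k=3):
    """Weight count of one k x k x k convolution (bias ignored)."""
    return k ** 3 * c * c

# Illustrative sizes (assumed): 64 channels, 16 frames, 56 x 56 feature map.
c, t, h, w = 64, 16, 56, 56
cost_ratio = conv3d_cost(c, t, h, w) / conv2d_cost(c, h, w)   # 3 * T
param_ratio = conv3d_params(c) / conv2d_params(c)             # 27 / 9
print(cost_ratio, param_ratio)
```

The weight ratio of 3 per layer is in line with the 230MB vs. 690MB model sizes quoted on the slide. Note that the 2D cost is for a single frame; running the 2D layer on all T frames narrows the cost gap to a factor of 3.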
20. [Figure: FV-VAE framework. Diagram labels: Frames (Nf); 2D CNN / 3D CNN; Convolutional Activations; Spatial Pyramid Pooling; Local Feature Set; FV-VAE; Gradient Vector; Visual Representation; Loss Function; Video Label; Training Epoch; Extraction Epoch]
21. [Figure: FV-VAE framework. Diagram labels: Frames (Nf); 2D CNN / 3D CNN; Convolutional Activations; Spatial Pyramid Pooling; Local Feature Set; FV-VAE; Gradient Vector; Visual Representation; Loss Function; Video Label; Training Epoch; Extraction Epoch]
• Global Average Pooling
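For reference, a minimal sketch of what global average pooling does to a set of convolutional activations: each C x H x W activation map is reduced to a C-dimensional vector by averaging over all spatial positions. The nested-list layout and the function name are assumptions made for illustration.

```python
def global_average_pool(activations):
    """Reduce C x H x W activations (nested lists) to a list of C channel means."""
    pooled = []
    for channel in activations:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled

# Hypothetical 2-channel, 2 x 2 activation map.
features = global_average_pool([[[1, 2], [3, 4]],
                                [[0, 0], [0, 8]]])
```

Averaging discards the spatial layout of the activations, which is the usual trade-off of this pooling choice.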
31. [Figure: example actions at four granularities]
Example actions: Playing Guitar (human, guitar), Jumping Jack, Cliff diving, Basketball dunk
Granularities: single frame, consecutive frames, clip (multiple adjacent frames), whole video
Different actions may span different granularities!
32.
• Multi-granular spatio-temporal architecture for video action recognition
• Hierarchical modeling (4 granularities)
• Fusion based on multi-granular score distribution
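One simple way to combine per-granularity predictions is a weighted late fusion of the class score distributions. The sketch below is a stand-in under stated assumptions: the granularity names follow the previous slide, while the weights, the scores, and the linear fusion rule are hypothetical and not necessarily the fusion scheme referred to in the bullets above.

```python
def fuse_scores(scores_by_granularity, weights):
    """Weighted average of class score distributions, one per granularity."""
    total_weight = sum(weights[name] for name in scores_by_granularity)
    num_classes = len(next(iter(scores_by_granularity.values())))
    fused = [0.0] * num_classes
    for name, scores in scores_by_granularity.items():
        w = weights[name] / total_weight  # normalize so the weights sum to 1
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Hypothetical scores over 3 classes from the 4 granularities.
scores = {
    "frame": [0.7, 0.2, 0.1],
    "consecutive_frames": [0.5, 0.4, 0.1],
    "clip": [0.3, 0.6, 0.1],
    "video": [0.4, 0.5, 0.1],
}
weights = {"frame": 1.0, "consecutive_frames": 1.0, "clip": 1.0, "video": 1.0}
fused = fuse_scores(scores, weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights this reduces to plain averaging; learned weights would let the model favor the granularity that best suits each action.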