Human Behavior Understanding:
From Human-Oriented Analysis to Action Recognition
Ting Yao
Principal Researcher, Vision and Multimedia Lab, JD AI Research
Tutorial @ ICME, July 8th, 2019
Figure: examples of visual understanding outputs: object labels (horse, grass, person), a caption (“a boy is cleaning the floor”) and the phrase “not just beautiful”.
Timeline of action recognition methods (2011 to 2016):
• Hand-crafted feature: Action recognition by dense trajectories. [Wang et al. CVPR 2011]
• 2D convolutional network: Large-scale Video Classification with Convolutional Neural Networks. [Karpathy et al. CVPR 2014]; Two-Stream Convolutional Networks for Action Recognition in Videos. [Simonyan et al. NIPS 2014]
• 2D CNN + LSTM (LRCN): Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Donahue et al. CVPR 2015]
• 3D convolutional network (C3D): Learning Spatiotemporal Features with 3D Convolutional Networks. [Tran et al. ICCV 2015]
• Temporal segment networks (TSN): Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. [Wang et al. ECCV 2016]
Action recognition pipeline:
• Input Video
• Preprocessing: Frame, Optical Flow
• Sample Strategy: Sparse Sample, Dense Sample
• Backbone Network: 2D CNN, 3D CNN
• Feature Aggregation: Pooling, Quantization, Attention, RNN
• Stream Fusion
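The two sample strategies named in the pipeline can be illustrated with a minimal sketch. It assumes that dense sampling means taking every window of consecutive frames and that sparse sampling means taking one frame per equal-length segment, in the spirit of TSN. The function names and the choice of the centre frame of each segment are illustrative assumptions, not taken from the slides.

```python
def dense_sample(num_frames, clip_len, stride=1):
    """Return every clip of `clip_len` consecutive frame indices (dense sampling)."""
    return [list(range(start, start + clip_len))
            for start in range(0, num_frames - clip_len + 1, stride)]


def sparse_sample(num_frames, num_segments):
    """Return one frame index per segment (sparse sampling).

    Assumption: the centre frame of each segment is used; a training-time
    variant would typically draw a random frame inside each segment instead.
    """
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]


if __name__ == "__main__":
    print(dense_sample(5, 3))    # three overlapping 3-frame clips
    print(sparse_sample(30, 3))  # one index from each third of the video
```

Sparse sampling keeps the number of processed frames fixed regardless of video length, whereas the number of dense clips grows with the video.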
Backbone Network
State-of-the-Arts
Image Domain:
• VGG [Simonyan et al. ICLR 2015]
• Inception [Szegedy et al. CVPR 2015]
• ResNet [He et al. CVPR 2016]
Video Domain:
• C3D [Tran et al. ICCV 2015]
• I3D [Carreira et al. CVPR 2017]
• P3D [Qiu et al. ICCV 2017]
2D Convolution vs. 3D Convolution
2D ResNet-152:
Time Cost: 9 x C^2 x H x W
Model size: 230MB
3D ResNet-152:
Time Cost: 27 x C^2 x T x H x W
Model size: 690MB
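The time-cost expressions above can be checked with a small sketch. It assumes a single convolutional layer with C input and C output channels, a 3x3 (or 3x3x3) kernel, and an output of the same spatial size as the input. The function names are illustrative.

```python
def conv2d_cost(channels, height, width, k=3):
    """Multiply-accumulates of one k x k 2D convolution: k^2 * C^2 * H * W."""
    return k * k * channels ** 2 * height * width


def conv3d_cost(channels, frames, height, width, k=3):
    """Multiply-accumulates of one k x k x k 3D convolution: k^3 * C^2 * T * H * W."""
    return k ** 3 * channels ** 2 * frames * height * width


if __name__ == "__main__":
    c, t, h, w = 64, 16, 56, 56
    per_frame_2d = conv2d_cost(c, h, w)
    total_3d = conv3d_cost(c, t, h, w)
    # Applying the 2D layer to each of the T frames costs T * 9 * C^2 * H * W,
    # so the 3D layer is 27 / 9 = 3 times more expensive on the same clip.
    print(total_3d / (t * per_frame_2d))
```

The same factor of 27 / 9 = 3 matches the ratio between the quoted model sizes (690MB versus 230MB).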
Bottleneck Architecture (each unit has an identity shortcut added to its output, and a ReLU follows each convolution):
(a) Residual Unit: 1x1 conv, 3x3 conv, 1x1 conv
(b) P3D-A: 1x1x1 conv, 1x3x3 conv, 3x1x1 conv, 1x1x1 conv (spatial and temporal convolutions in cascade)
(c) P3D-B: 1x1x1 conv, then 1x3x3 conv and 3x1x1 conv in parallel with their outputs summed, 1x1x1 conv
(d) P3D-C: 1x1x1 conv, 1x3x3 conv, then 3x1x1 conv whose output is summed with the 1x3x3 output, 1x1x1 conv
Very deep 3D CNN but still lighter weights than C3D
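The three ways of combining the spatial 1x3x3 convolution (called S here) and the temporal 3x1x1 convolution (called T here) can be written as a short sketch. To stay runnable without a deep learning library, S and T are passed in as plain callables. The 1x1x1 convolutions, the ReLUs and the outer identity shortcut are deliberately left out, and all names are illustrative assumptions.

```python
def p3d_a(x, S, T):
    """P3D-A: cascade, temporal convolution applied to the spatial output."""
    return T(S(x))


def p3d_b(x, S, T):
    """P3D-B: parallel, spatial and temporal outputs are summed."""
    return S(x) + T(x)


def p3d_c(x, S, T):
    """P3D-C: cascade plus a shortcut from the spatial output to the sum."""
    s = S(x)
    return s + T(s)


def factorized_weights(channels):
    """Weights of a 1x3x3 plus a 3x1x1 kernel with C input and C output channels."""
    return (1 * 3 * 3 + 3 * 1 * 1) * channels ** 2


def full_3d_weights(channels):
    """Weights of a single 3x3x3 kernel with C input and C output channels."""
    return 3 * 3 * 3 * channels ** 2


if __name__ == "__main__":
    S = lambda v: 2 * v   # stand-in for the spatial convolution
    T = lambda v: v + 1   # stand-in for the temporal convolution
    print(p3d_a(3, S, T), p3d_b(3, S, T), p3d_c(3, S, T))
    print(full_3d_weights(64) / factorized_weights(64))
```

Under these assumptions the factorized pair needs 12 x C^2 weights instead of 27 x C^2 for a full 3x3x3 kernel, which is one reason such a network can be deep and still compact.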
• R(2+1)D > MCx > rMCx > R3D > R2D
• ResNeXt-101 > Wide ResNet-50 > ResNet-200 > ResNet-152 > ResNet-101 > ResNet-50 > DenseNet-201 > DenseNet-121
• Incorporate long-range (global) context into representation learning
• Model the diffusion between local and global features
Feature Aggregation
• Global Average Pooling
• Attention
  • Visual Attention [Sharma et al. ICLR workshop 2015]
  • Recurrent Attention [Du et al. TIP 2018]
  • Unified Attention [Li et al. TMM 2018]
• RNN
  • LRCN [Donahue et al. CVPR 2015]
  • Hybrid Framework (LSTM) [Wu et al. ACM MM 2015]
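The difference between global average pooling and attention-based aggregation can be sketched in a few lines. The example assumes each frame has already been mapped to a feature vector by the backbone, and that one relevance score per frame is available. In a real attention model these scores would be predicted by a learned module, whereas here they are plain inputs, and all names are illustrative.

```python
import math


def average_pool(features):
    """Global average pooling: every frame feature gets the same weight."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def attention_pool(features, scores):
    """Attention pooling: softmax over per-frame scores gives the weights."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]
    print(average_pool(frames))
    print(attention_pool(frames, [0.0, 0.0]))    # equal scores: same as averaging
    print(attention_pool(frames, [0.0, 100.0]))  # second frame dominates
```

With equal scores the attention weights are uniform and the result coincides with average pooling, so average pooling can be read as a special case of attention.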
• Global Activations
  • Fully connected layer
  • Global pooling layer
• Fisher Vector with Variational Auto-Encoder (FV-VAE)
• Fisher Vector (FV)
Figure: global activations compared with FV encoding and FV-VAE encoding of convolutional activations.
Comparison of FV and FV-VAE (normalization term, generative model): FV uses a GMM as the generative model, FV-VAE uses a VAE.
• Assumption of FV
  • Data is generated from a Gaussian Mixture Model, which may not hold in practice
• VAE
  • Encoder ($q_\phi(\mathbf{z}|\mathbf{x})$): learn new representations $\mathbf{z}$ for the given input $\mathbf{x}$
  • Decoder ($p_\theta(\mathbf{x}|\mathbf{z})$): generate FV of new representations $\mathbf{z}$
Training: Encoder, Sampling, Decoder, with Reconstruct Loss, Regularization Loss and Classification Loss
Extraction: Encoder, Identity, Decoder, with Reconstruct Loss, Back Propagation and Gradient Vector Accumulator
FV: $\mathcal{G}_\theta^X = F_\theta^{-\frac{1}{2}} \nabla_\theta \log u_\theta(X) = -F_\theta^{-\frac{1}{2}} \sum_{t=1}^{T_x} \nabla_\theta \mathcal{L}_{rec}(\mathbf{x}_t; \theta, \phi)$
Reconstruct loss: $\mathcal{L}_{rec} = -\log u_\theta(\mathbf{x}_t) = -\log p_\theta(\mathbf{x}_t | \mathbf{z}_t)$
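The idea of a Fisher vector as an accumulated, normalized gradient of the log-likelihood can be illustrated on a far simpler generative model than a VAE. The sketch below swaps in a single diagonal Gaussian with parameters mu and sigma, and takes the gradient with respect to mu only. For this model the gradient of log N(x; mu, sigma^2) with respect to mu is (x - mu) / sigma^2 and the per-sample Fisher information is 1 / sigma^2, so the normalized gradient is (x - mu) / sigma. The function name, the use of per-sample Fisher information and the absence of any normalization by the number of samples are assumptions of this sketch.

```python
def fisher_vector_mean(xs, mu, sigma):
    """Accumulate F^{-1/2} * grad_mu log N(x_t; mu, sigma^2) over all x_t.

    xs:    list of local feature vectors x_t
    mu:    mean of the diagonal Gaussian (one value per dimension)
    sigma: standard deviation per dimension
    """
    dim = len(mu)
    fv = [0.0] * dim  # gradient vector accumulator
    for x in xs:
        for d in range(dim):
            # (x - mu) / sigma^2 scaled by F^{-1/2} = sigma gives (x - mu) / sigma
            fv[d] += (x[d] - mu[d]) / sigma[d]
    return fv


if __name__ == "__main__":
    local_features = [[1.0, 2.0], [3.0, 4.0]]
    print(fisher_vector_mean(local_features, mu=[1.0, 1.0], sigma=[1.0, 2.0]))
```

In FV-VAE the Gaussian is replaced by the VAE decoder and the gradient is obtained by back-propagating the reconstruction loss, but the accumulate-over-local-features structure is the same.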
Figure: a CNN produces convolutional features, spatial pyramid pooling turns them into a region feature set, and FV-VAE encodes the set. In the training epoch a loss function is computed against the video label (e.g. Ice Dancing); in the extraction epoch the gradient vector is used as the video representation.
• FV-VAE based action recognition framework
  • CNN as convolutional feature extractor
  • Encoding SPP output using FV-VAE
Stream Fusion
Different actions may span different granularities!
• Single Frame: Playing Guitar (human, guitar)
• Consecutive Frames: Jumping Jack
• Clip (multiple adjacent frames): Cliff diving
• Whole video: Basketball dunk
• Multi-granular spatio-temporal architecture for video action recognition
• Hierarchical modeling (4 granularities)
• Fusion based on multi-granular score distribution
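Fusion based on multi-granular score distribution (example for the class Surfing):
• Softmax scores for Surfing from the Single Frame, Consecutive Frames, Clips and Video streams: 0.4, …, 0.2, …, 0.7, …, 0.9
• Sort the scores in descending order: 0.9, …, 0.7, …, 0.4, …, 0.2
• Combine the sorted scores with a weight vector w to obtain the improved Surfing score: 0.8
• w=[1, 0, …, 0]: Max-pooling
• w=[1, 1, …, 1]: Ave-pooling
• optimized w: Distribution-based classifier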
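The relation between the weight vector w and max or average pooling can be made concrete with a short sketch. It assumes the fused score is the weighted sum of the class scores after sorting them in descending order. The function name is illustrative, and the learned weights of the distribution-based classifier are not reproduced here.

```python
def fuse_scores(scores, weights):
    """Weighted sum of the class scores sorted in descending order."""
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked))


if __name__ == "__main__":
    surfing = [0.4, 0.2, 0.7, 0.9]  # scores from the four granularities

    # w = [1, 0, ..., 0] keeps only the largest score (max-pooling).
    print(fuse_scores(surfing, [1, 0, 0, 0]))

    # w = [1, 1, ..., 1] sums all scores, which is average pooling up to
    # the constant factor 1/n; the factor is folded into w here.
    n = len(surfing)
    print(fuse_scores(surfing, [1 / n] * n))
```

Any other choice of w weights the scores by their rank rather than by the stream they came from, which is what lets a learned w exploit the shape of the score distribution.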
Method | UCF101 | HMDB51
Improved dense trajectories (IDT) [Wang et al. ICCV 2013] | 85.9% | 57.2%
Higher dimensional IDT [Peng et al. CVIU 2016] | 87.9% | 61.1%
2D CNN Slow Fusion [Karpathy et al. CVPR 2014] | 65.4% | --
Two-stream ConvNet [Simonyan et al. NIPS 2014] | 88.0% | 59.4%
Factorized ST-ConvNet [Sun et al. ICCV 2015] | 88.1% | 59.1%
Two-stream + LSTM [Yue-Hei Ng et al. CVPR 2015] | 88.6% | --
Two-stream Conv fusion [Feichtenhofer et al. CVPR 2016] | 92.5% | 67.3%
Two-stream ST Residual Networks [Feichtenhofer et al. NIPS 2016] | 93.4% | 66.4%
Temporal Segment Networks [Wang et al. ECCV 2016] | 94.0% | 68.5%
C3D [Tran et al. ICCV 2015] | 82.3% | 56.8%
P3D ResNet [Qiu et al. ICCV 2017] | 89.8% | 58.6%
Two-stream P3D ResNet [Qiu et al. ICCV 2017] | 94.5% | 71.8%
I3D [Carreira et al. CVPR 2017] | 93.4% | 66.4%
I3D + Kinetics pre-train [Carreira et al. CVPR 2017] | 97.9% | 80.2%
LGD-3D + Kinetics pre-train [Qiu et al. CVPR 2019] | 98.2% | 80.5%
Figure: a feature extractor assigns the label Pole vault to three candidates with scores 0.61, 0.83 and 0.51; a 3D CNN with a Gaussian Kernel assigns Pole vault a score of 0.96.
Thanks!
tingyao.ustc@gmail.com

Mais conteúdo relacionado

Mais procurados

IGARSS-SAR-Pritt.pptx
IGARSS-SAR-Pritt.pptxIGARSS-SAR-Pritt.pptx
IGARSS-SAR-Pritt.pptx
grssieee
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
ozlael ozlael
 
Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)
Tiago Sousa
 
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
inside-BigData.com
 

Mais procurados (20)

Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)
 
Non-line-of-sight Imaging with Partial Occluders and Surface Normals | TOG 2019
Non-line-of-sight Imaging with Partial Occluders and Surface Normals | TOG 2019Non-line-of-sight Imaging with Partial Occluders and Surface Normals | TOG 2019
Non-line-of-sight Imaging with Partial Occluders and Surface Normals | TOG 2019
 
GDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game DevelopmentGDC2019 - SEED - Towards Deep Generative Models in Game Development
GDC2019 - SEED - Towards Deep Generative Models in Game Development
 
Wave-Based Non-Line-of-Sight Imaging Using Fast f–k Migration | SIGGRAPH 2019
Wave-Based Non-Line-of-Sight Imaging Using Fast f–k Migration | SIGGRAPH 2019Wave-Based Non-Line-of-Sight Imaging Using Fast f–k Migration | SIGGRAPH 2019
Wave-Based Non-Line-of-Sight Imaging Using Fast f–k Migration | SIGGRAPH 2019
 
Past, Present and Future Challenges of Global Illumination in Games
Past, Present and Future Challenges of Global Illumination in GamesPast, Present and Future Challenges of Global Illumination in Games
Past, Present and Future Challenges of Global Illumination in Games
 
A Real-time Radiosity Architecture
A Real-time Radiosity ArchitectureA Real-time Radiosity Architecture
A Real-time Radiosity Architecture
 
Bending the Graphics Pipeline
Bending the Graphics PipelineBending the Graphics Pipeline
Bending the Graphics Pipeline
 
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro..."High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
"High-resolution 3D Reconstruction on a Mobile Processor," a Presentation fro...
 
Lighting you up in Battlefield 3
Lighting you up in Battlefield 3Lighting you up in Battlefield 3
Lighting you up in Battlefield 3
 
IGARSS-SAR-Pritt.pptx
IGARSS-SAR-Pritt.pptxIGARSS-SAR-Pritt.pptx
IGARSS-SAR-Pritt.pptx
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 
Siggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan WakeSiggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan Wake
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)
 
Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)Crysis Next-Gen Effects (GDC 2008)
Crysis Next-Gen Effects (GDC 2008)
 
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
 
Global illumination
Global illuminationGlobal illumination
Global illumination
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
 
Introduction to Point Based Global Illumination (PBGI)
Introduction to Point Based Global Illumination (PBGI)Introduction to Point Based Global Illumination (PBGI)
Introduction to Point Based Global Illumination (PBGI)
 
The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2
 

Semelhante a Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I

A Multimodal Approach for Video Geocoding
A Multimodal Approach for   Video Geocoding A Multimodal Approach for   Video Geocoding
A Multimodal Approach for Video Geocoding
MediaEval2012
 
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformHuman Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Fadwa Fouad
 
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
Edge AI and Vision Alliance
 

Semelhante a Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I (20)

Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptx
 
med_poster_spie
med_poster_spiemed_poster_spie
med_poster_spie
 
Cycle-Contrast for Self-Supervised Video Represenation Learning
Cycle-Contrast for Self-Supervised Video Represenation LearningCycle-Contrast for Self-Supervised Video Represenation Learning
Cycle-Contrast for Self-Supervised Video Represenation Learning
 
Learning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networksLearning spatiotemporal features with 3 d convolutional networks
Learning spatiotemporal features with 3 d convolutional networks
 
A Multimodal Approach for Video Geocoding
A Multimodal Approach for   Video Geocoding A Multimodal Approach for   Video Geocoding
A Multimodal Approach for Video Geocoding
 
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon TransformHuman Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform
 
Video Classification Basic
Video Classification Basic Video Classification Basic
Video Classification Basic
 
Video Classification: Human Action Recognition on HMDB-51 dataset
Video Classification: Human Action Recognition on HMDB-51 datasetVideo Classification: Human Action Recognition on HMDB-51 dataset
Video Classification: Human Action Recognition on HMDB-51 dataset
 
JASLA_presentation.pdf
JASLA_presentation.pdfJASLA_presentation.pdf
JASLA_presentation.pdf
 
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
 
OPTE: Online Per-title Encoding for Live Video Streaming
OPTE: Online Per-title Encoding for Live Video StreamingOPTE: Online Per-title Encoding for Live Video Streaming
OPTE: Online Per-title Encoding for Live Video Streaming
 
OPTE: Online Per-title Encoding for Live Video Streaming.pdf
OPTE: Online Per-title Encoding for Live Video Streaming.pdfOPTE: Online Per-title Encoding for Live Video Streaming.pdf
OPTE: Online Per-title Encoding for Live Video Streaming.pdf
 
Emc san Online Training in Hyderabad
Emc san Online Training in HyderabadEmc san Online Training in Hyderabad
Emc san Online Training in Hyderabad
 
EMC SAN Online Training in India
EMC SAN Online Training in IndiaEMC SAN Online Training in India
EMC SAN Online Training in India
 
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
 
YolactEdge Review [cdm]
YolactEdge Review [cdm]YolactEdge Review [cdm]
YolactEdge Review [cdm]
 
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
VEKG: Video Event Knowledge Graph to Represent Video Streams for Complex Even...
 
Introduction to 360 Video
Introduction to  360 VideoIntroduction to  360 Video
Introduction to 360 Video
 
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)
 
Real-time Bangla License Plate Recognition System for Low Resource Video-base...
Real-time Bangla License Plate Recognition System for Low Resource Video-base...Real-time Bangla License Plate Recognition System for Low Resource Video-base...
Real-time Bangla License Plate Recognition System for Low Resource Video-base...
 

Mais de Wanjin Yu

Mais de Wanjin Yu (15)

Architecture Design for Deep Neural Networks III
Architecture Design for Deep Neural Networks IIIArchitecture Design for Deep Neural Networks III
Architecture Design for Deep Neural Networks III
 
Intelligent Multimedia Recommendation
Intelligent Multimedia RecommendationIntelligent Multimedia Recommendation
Intelligent Multimedia Recommendation
 
Architecture Design for Deep Neural Networks II
Architecture Design for Deep Neural Networks IIArchitecture Design for Deep Neural Networks II
Architecture Design for Deep Neural Networks II
 
Architecture Design for Deep Neural Networks I
Architecture Design for Deep Neural Networks IArchitecture Design for Deep Neural Networks I
Architecture Design for Deep Neural Networks I
 
Causally regularized machine learning
Causally regularized machine learningCausally regularized machine learning
Causally regularized machine learning
 
Computer vision for transportation
Computer vision for transportationComputer vision for transportation
Computer vision for transportation
 
Object Detection Beyond Mask R-CNN and RetinaNet III
Object Detection Beyond Mask R-CNN and RetinaNet IIIObject Detection Beyond Mask R-CNN and RetinaNet III
Object Detection Beyond Mask R-CNN and RetinaNet III
 
Object Detection Beyond Mask R-CNN and RetinaNet II
Object Detection Beyond Mask R-CNN and RetinaNet IIObject Detection Beyond Mask R-CNN and RetinaNet II
Object Detection Beyond Mask R-CNN and RetinaNet II
 
Object Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet IObject Detection Beyond Mask R-CNN and RetinaNet I
Object Detection Beyond Mask R-CNN and RetinaNet I
 
Visual Search and Question Answering II
Visual Search and Question Answering IIVisual Search and Question Answering II
Visual Search and Question Answering II
 
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
 
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
 
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
 
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
 
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning Big Data Intelligence: from Correlation Discovery to Causal Reasoning
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
 

Último

在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
ydyuyu
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Monica Sydney
 
一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理
F
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理
F
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
ayvbos
 

Último (20)

Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理
 
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理
 
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime BalliaBallia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 

Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition I

  • 1. Human Behavior Understanding: From Human-Oriented Analysis to Action Recognition Ting Yao Principal Researcher, Vision and Multimedia Lab, JD AI Research Tutorial @ ICME, July 8th, 2019
  • 2. horse grass person “a boy is cleaning the floor” “not just beautiful”
  • 4. 4
  • 5. 5 2011 2012 2013 2014 2015 Action recognition by dense trajectories. [Wang et al. CVPR 2011] Hand-crafted feature 2016
  • 6. 2011 2012 2013 2014 2015 2016 Large-scale Video Classification with Convolutional Neural Networks. [Karpathy et al. CVPR 2014] Two-Stream Convolutional Networks for Action Recognition in Videos. [Simonyan et al. NIPS 2014] 2D convolutional network
  • 7. 2D CNN + LSTM (LRCN)2011 2012 2013 2014 2015 2016 Long-term Recurrent Convolutional Networks for Visual Recognition and Description. [Donahue et al. CVPR 2015]
  • 8. 3D convolutional network (C3D)2011 2012 2013 2014 2015 2016 Learning Spatiotemporal Features with 3D Convolutional Networks. [Tran et al. ICCV 2015]
  • 9. Temporal segment networks (TSN)2011 2012 2013 2014 2015 2016 Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. [Wang et al. ECCV 2016]
  • 10. 10 Sparse Sample Dense Sample 2D CNN 3D CNN Frame Optical Flow Input Video Preprocessing Sample Strategy Backbone Network Feature Aggregation Pooling Quantization Attention RNN Stream Fusion
  • 11. 11 Backbone Network Sparse Sample Dense Sample 2D CNN 3D CNN Frame Optical Flow Input Video Preprocessing Sample Strategy Backbone Network Feature Aggregation Pooling Quantization Attention RNN Stream Fusion
  • 12. 12 State-of-the-Arts Image Domain Video Domain VGG [Simonyan et al. ICLR 2015] C3D [Tran et al. ICCV 2015] Inception [Szegedy et al. CVPR 2015] I3D [Carreira et al. CVPR 2017] ResNet [He et al. CVPR 2016] P3D [Qiu et al. ICCV 2017]
  • 13. 13 Convolution 3D Convolution 2D Convolution 3D Convolution 3D ResNet 2D ResNet ResNet-152: Time Cost: 9 x C2 x H x W Model size: 230MB 3D ResNet-152: Time Cost: 27 x C2 x T x H x W Model size: 690MB
  • 15. 15 Bottleneck Architecture: + 1x1 conv 1x1 conv 3x3 conv ReLU ReLU ReLU (a) Residual Unit + 1x1x1 conv 1x1x1 conv 1x3x3 conv ReLU ReLU 3x1x1 conv ReLU ReLU (b) P3D-A + 1x1x1 conv 1x1x1 conv ReLU ReLU 1x3x3 conv 3x1x1 conv + ReLU ReLU (c) P3D-B + 1x1x1 conv 1x1x1 conv 1x3x3 conv ReLU ReLU 3x1x1 conv ReLU + ReLU (d) P3D-C (a) (b) (c) (d) Very deep 3D CNN but still lighter weights than C3D
  • 16. 16 •R(2+1)D > MCx > rMCx > R3D > R2D
  • 17. 17 • ResNeXt-101 > Wide ResNet-50 > ResNet-200 > ResNet-152 > ResNet-101 > ResNet-50 > DenseNet-201 > DenseNet-121
  • 18. 18 • Involve large-range (global) context into representation learning • Model the diffusions between local and global features
  • 19. 19 Feature Aggregation Sparse Sample Dense Sample 2D CNN 3D CNN Frame Optical Flow Input Video Preprocessing Sample Strategy Backbone Network Feature Aggregation Pooling Quantization Attention RNN Stream Fusion
  • 20. 20 Frames (Nf) 2D CNN 2D CNN 2D CNN ... ... Convolutional Activations ... Spatial Pyramid Pooling FV-VAE Gradient Vector Visual Representation Loss Function Video Label Training Epoch Extraction Epoch Local Feature Set Frames (Nf) 2D CNN 2D CNN 2D CNN ... ... Convolutional Activations ... Spatial Pyramid Pooling FV-VAE Gradient Vector Visual Representation Loss Function Video Label Training Epoch Extraction Epoch Local Feature Set 2D CNN / 3D CNN
  • 21. 21 Frames (Nf) 2D CNN 2D CNN 2D CNN ... ... Convolutional Activations ... Spatial Pyramid Pooling FV-VAE Gradient Vector Visual Representation Loss Function Video Label Training Epoch Extraction Epoch Local Feature Set Frames (Nf) 2D CNN 2D CNN 2D CNN ... ... Convolutional Activations ... Spatial Pyramid Pooling FV-VAE Gradient Vector Visual Representation Loss Function Video Label Training Epoch Extraction Epoch Local Feature Set 2D CNN / 3D CNN • Global Average Pooling
  • 23. 23
  • 24. 24 2D CNN / 3D CNN • Global Average Pooling • Attention • Visual Attention [Sharma et al. ICLR workshop 2015] • Recurrent Attention [Du et al. TIP 2018] • Unified Attention [Li et al. TMM 2018]
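Attention-based aggregation replaces the uniform average with a weighted one. The sketch below shows generic soft attention only and is not the specific mechanism of any of the cited papers: the per-frame relevance scores are assumed to be given (in practice a learned sub-network would produce them), and the function name is illustrative.

```python
import math


def attention_pool(frame_features, relevance_scores):
    """Softmax the per-frame scores and return (weighted feature sum, weights)."""
    if len(frame_features) != len(relevance_scores) or not frame_features:
        raise ValueError("need one score per frame feature")
    top = max(relevance_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in relevance_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    print(attention_pool(feats, [0.0, 0.0]))   # equal scores: plain average
    print(attention_pool(feats, [0.0, 10.0]))  # second frame dominates
```

With equal scores the result reduces to global average pooling; as one score grows, the pooled vector approaches that frame's feature.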
  • 25. 25 2D CNN / 3D CNN • Global Average Pooling • Attention • Visual Attention [Sharma et al. ICLR workshop 2015] • Recurrent Attention [Du et al. TIP 2018] • Unified Attention [Li et al. TMM 2018] • RNN • LRCN [Donahue et al. CVPR 2015] • Hybrid Framework (LSTM) [Wu et al. ACM MM 2015]
  • 26. 26
  • 27. 27 • Global Activations • Fully connected layer • Global pooling layer • Fisher Vector with Variational Auto-Encoder (FV-VAE) • Fisher Vector (FV) • [Figure: convolutional activations encoded as global activations, by FV encoding, or by FV-VAE encoding] • [Table: FV vs. FV-VAE, compared by normalization term and by generative model (GMM for FV, VAE for FV-VAE)]
  • 28. 28 • Assumption of FV • Data is generated from a Gaussian Mixture Model, which may not hold in practice • VAE • Encoder ($q_\phi(\mathbf{z}|\mathbf{x})$): learns a new representation $\mathbf{z}$ for the given input $\mathbf{x}$ • Decoder ($p_\theta(\mathbf{x}|\mathbf{z})$): generates the FV of the new representation $\mathbf{z}$ • Training: Encoder → Sampling → Decoder, with reconstruction, regularization, and classification losses • Extraction: Encoder → Identity → Decoder; the reconstruction loss is back-propagated and the gradient vectors are accumulated • FV: $\mathcal{G}_\theta^X = F_\theta^{-1/2} \nabla_\theta \log u_\theta(X) = -F_\theta^{-1/2} \sum_{t=1}^{T_x} \nabla_\theta \mathcal{L}_{rec}(\mathbf{x}_t; \theta, \phi)$ • Reconstruction loss: $\mathcal{L}_{rec} = -\log u_\theta(\mathbf{x}_t) = -\log p_\theta(\mathbf{x}_t|\mathbf{z}_t)$
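The extraction step, accumulating the gradient of the reconstruction loss with respect to the decoder parameters over all local features of a video, can be illustrated with a toy model. This is a sketch under strong assumptions, not the FV-VAE implementation: the decoder is a single linear map `W` with unit-variance Gaussian output (so the reconstruction loss is 0.5 * ||x - W z||² up to a constant), the codes `z` are assumed to be given by the encoder, and the Fisher normalization is taken to be the identity. All names are hypothetical.

```python
def reconstruction_grad(W, x, z):
    """Gradient of 0.5 * ||x - W z||^2 with respect to W (same shape as W)."""
    recon = [sum(W[i][j] * z[j] for j in range(len(z))) for i in range(len(W))]
    resid = [x[i] - recon[i] for i in range(len(x))]
    return [[-resid[i] * z[j] for j in range(len(z))] for i in range(len(W))]


def accumulate_gradient_vector(W, xs, zs):
    """Sum the negative loss gradients over all local features, then flatten."""
    acc = [[0.0] * len(W[0]) for _ in W]
    for x, z in zip(xs, zs):
        grad = reconstruction_grad(W, x, z)
        for i in range(len(W)):
            for j in range(len(W[0])):
                acc[i][j] -= grad[i][j]  # minus sign: G = -sum_t grad L_rec
    return [value for row in acc for value in row]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # toy decoder weights
    xs = [[1.0, 2.0], [0.0, 1.0]]  # local features x_t
    zs = [[1.0, 1.0], [0.0, 1.0]]  # their codes z_t (assumed given)
    print(accumulate_gradient_vector(W, xs, zs))
```

The resulting vector has one entry per decoder parameter, so its length is fixed regardless of how many local features $T_x$ the video contributes.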
  • 29. 29 • FV-VAE based action recognition framework • CNN as convolutional feature extractor • Encoding spatial pyramid pooling (SPP) output using FV-VAE • [Figure labels: CNN, Convolutional Feature, Spatial Pyramid Pooling, Region Feature Set, FV-VAE, Gradient Vector, Video Representation, Loss Function, Training Epoch, Extraction Epoch; example class: Ice Dancing]
  • 30. 30 Stream Fusion (pipeline overview: Input Video → Preprocessing (Frame, Optical Flow) → Sample Strategy (Sparse Sample, Dense Sample) → Backbone Network (2D CNN, 3D CNN) → Feature Aggregation (Pooling, Quantization, Attention, RNN) → Stream Fusion)
  • 31. 31 Different actions may span different granularities! • Example actions: Playing Guitar (cues: human, guitar), Jumping Jack, Cliff diving, Basketball dunk • Granularities: single frame, consecutive frames, clip (multiple adjacent frames), whole video
  • 32. 32 • Multi-granular spatio-temporal architecture for video action recognition • Hierarchical modeling (4 granularities) • Fusion based on multi-granular score distribution
  • 33. 33
  • 34. 34
  • 35. 35 Distribution-based classifier • Softmax scores come from four granularities: single frame, consecutive frames, clips, and video • Example "Surfing" scores: 0.4 … 0.2 … 0.7 … 0.9 • Sorted: 0.9 … 0.7 … 0.4 … 0.2 • The sorted scores are combined with a weight vector w: w=[1, 0, …, 0] gives max-pooling, w=[1, 1, …, 1] gives ave-pooling, and an optimized w gives an improved "Surfing" score of 0.8
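The fusion on this slide can be sketched as a weighted sum over the sorted scores. The code below is an illustration only: it uses just the four scores visible on the slide (the ellipses indicate there are more), it normalizes by the sum of the weights so that w=[1, 1, …, 1] yields the average (an assumption), and the last weight vector is hand-picked rather than optimized.

```python
def fuse_sorted_scores(scores, weights):
    """Sort scores in descending order and take a normalized weighted sum."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("weights must not sum to zero")
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked)) / weight_sum


if __name__ == "__main__":
    surfing = [0.4, 0.2, 0.7, 0.9]                   # scores shown on the slide
    print(fuse_sorted_scores(surfing, [1, 0, 0, 0]))  # max-pooling
    print(fuse_sorted_scores(surfing, [1, 1, 1, 1]))  # average pooling
    print(fuse_sorted_scores(surfing, [1, 1, 0, 0]))  # hypothetical w, about 0.8
```

Because the weights act on rank positions rather than on fixed granularities, max-pooling and average pooling are both special cases, and a learned w can sit anywhere in between.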
  • 36. 36
  • 37. 37 2011 2012 2014 2015 2017 2018 2019
  • 38. 38
    Method | UCF101 | HMDB51
    Improved dense trajectories (IDT) [Wang et al. ICCV 2011] | 85.9% | 57.2%
    Higher dimensional IDT [Peng et al. CVIU 2016] | 87.9% | 61.1%
    2D CNN Slow Fusion [Karpathy et al. CVPR 2014] | 65.4% | --
    Two-stream ConvNet [Simonyan et al. NIPS 2014] | 88.0% | 59.4%
    Factorized ST-ConvNet [Sun et al. ICCV 2015] | 88.1% | 59.1%
    Two-stream + LSTM [Yue-Hei et al. CVPR 2015] | 88.6% | --
    Two-stream Conv fusion [Feichtenhofer et al. CVPR 2016] | 92.5% | 67.3%
    Two-stream ST Residual Networks [Feichtenhofer et al. NIPS 2016] | 93.4% | 66.4%
    Temporal Segment Networks [Wang et al. ECCV 2016] | 94.0% | 68.5%
    C3D [Tran et al. ICCV 2015] | 82.3% | 56.8%
    P3D ResNet [Qiu et al. ICCV 2017] | 89.8% | 58.6%
    Two-stream P3D ResNet [Qiu et al. ICCV 2017] | 94.5% | 71.8%
    I3D [Carreira et al. CVPR 2017] | 93.4% | 66.4%
    I3D + Kinetics pre-train [Carreira et al. CVPR 2017] | 97.9% | 80.2%
    LGD-3D + Kinetics pre-train [Qiu et al. CVPR 2019] | 98.2% | 80.5%
  • 39. 39
  • 40. 40
  • 41. 41
  • 42. 42
  • 43. 43 Feature Extractor Pole vault 0.61 Pole vault 0.83 Pole vault 0.51
  • 44. 44 3D CNN Pole vault 0.96 Gaussian Kernel
  • 45. 45
  • 46. 46
  • 47. 47
  • 48. 48
  • 49. 49
  • 50. 50
  • 51. 51
  • 52. 52 (Pipeline overview: Input Video → Preprocessing (Frame, Optical Flow) → Sample Strategy (Sparse Sample, Dense Sample) → Backbone Network (2D CNN, 3D CNN) → Feature Aggregation (Pooling, Quantization, Attention, RNN) → Stream Fusion)
  • 53. 53
  • 54. 54
  • 55. 55