【BMVC2016】Recognition of Transitional Action for Short-Term Action Prediction using Discriminative Temporal CNN Feature

Hirokatsu Kataoka, Ph.D.
Computer Vision Research Group (CVRG), AIST
http://www.hirokatsukataoka.net/
Yudai Miyashita (TDU), Masaki Hayashi (Liquid Inc., Keio Univ.),
Kenji Iwata, Yutaka Satoh (AIST)

Related work: Early Action Recognition
•  [Ryoo, ICCV2011]
M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.

Related work: Action Prediction
•  [Kataoka+, VISAPP2016]
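[Figure: along the time series, the time zone $x_{\mathrm{timezone}}$ (daytime), the previous activity $x_{\mathrm{previous}}$ (walking) and the current activity $x_{\mathrm{current}}$ (sitting) are given; the next activity $\theta$ (e.g., “Using a PC”) is not given and must be predicted.]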
H. Kataoka, Y. Aoki, K. Iwata, Y. Satoh, “Activity Prediction using a Space-Time CNN and Bayesian Framework”, in VISAPP, 2016.

Problems of related works
•  Early action recognition
–  Recognizes an action from its early frames
–  Sufficient cues are still required, so it is nearly equivalent to action recognition
•  Action prediction
–  Predicts the future outright, which makes the prediction unstable

Proposal
•  Transitional Action (TA): an action class for the interval during which one action changes into another
–  TA contains cues for prediction, so it can be recognized earlier than with early action recognition
–  Recognition-like future action prediction, which gives more stable prediction
[Applications] Autonomous driving, active safety and robotics
[Figure: timeline t1 to t12 with the actions “Walk straight” and “Cross” and, between them, the transitional action “Walk straight – Cross”. Short-term action prediction (proposal) recognizes “cross” at t5, whereas early action recognition (previous works) recognizes it at t9; the difference is marked Δt.]

Problem settings
•  Action recognition: $f(F^{A}_{1 \ldots t}) \rightarrow A_t$
•  Early action recognition: $f(F^{A}_{1 \ldots t-L}) \rightarrow A_t$
•  Action prediction: $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$
•  Transitional action recognition: $f(F^{TA}_{1 \ldots t}) \rightarrow A_{t+L}$
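The four settings differ only in which frames f may observe and which time index it must label. The minimal Python sketch below makes these windows explicit for a current time t and horizon L; the names `Setting` and `problem_settings` are hypothetical and not from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """Observed frame window and target time index for one problem setting."""
    name: str
    features: str      # "A" = normal-action features, "TA" = transitional-action features
    observed: range    # frame indices 1..end that the model f may use
    target_time: int   # time index of the action label A that f outputs


def problem_settings(t: int, L: int) -> list[Setting]:
    """Return the four settings f(F_{1...}) -> A_{...} for current time t and offset L."""
    return [
        Setting("Action recognition", "A", range(1, t + 1), t),
        Setting("Early action recognition", "A", range(1, t - L + 1), t),
        Setting("Action prediction", "A", range(1, t + 1), t + L),
        Setting("Transitional action recognition", "TA", range(1, t + 1), t + L),
    ]


if __name__ == "__main__":
    for s in problem_settings(t=5, L=4):
        # Prints e.g. "Action prediction  F^A_{1..5} -> A_9"
        print(f"{s.name:32s} F^{s.features}_{{1..{s.observed[-1]}}} -> A_{s.target_time}")
```

With t = 5 and L = 4, action prediction and transitional action recognition both observe frames 1 to 5 and output A_9; they differ only in the feature superscript (A vs. TA).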

Difference
[Figure: timeline t1 to t12 with the action “Walk straight” followed by the action “Cross”.]
-  Early action recognition responds late: $f(F^{A}_{1 \ldots t-L}) \rightarrow A_t$, where the objective action is A (cross)

Difference
[Figure: timeline t1 to t12 with the action “Walk straight” followed by the action “Cross”.]
-  Action prediction is unstable: $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$

Difference
[Figure: timeline t1 to t12 with the action “Walk straight”, the transitional action “Walk straight – Cross” and the action “Cross”.]
-  Transitional action recognition is reasonable: $f(F^{TA}_{1 \ldots t}) \rightarrow A_{t+L}$

Details of transitional action (TA)
•  Annotation for TA
–  TA and normal action (NA) classes partially overlap each other
•  Difficulty of TA
–  NA and TA are temporally intermingled
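The slides do not give the exact annotation rule. As a hedged illustration only, the sketch below assumes that a TA label covers a hypothetical `margin` of frames on each side of the boundary between two consecutive NA segments, so frames near the boundary carry both an NA and a TA label. The function `annotate` and the `margin` parameter are assumptions, not the authors' procedure.

```python
def annotate(segments: list[tuple[str, int, int]], margin: int) -> list[set[str]]:
    """Per-frame label sets in which NA and TA classes partially overlap.

    segments: (label, start, end) normal-action segments, inclusive, 0-based, in temporal order.
    margin:   hypothetical number of frames on each side of a boundary labelled as TA.
    """
    n_frames = max(end for _, _, end in segments) + 1
    labels: list[set[str]] = [set() for _ in range(n_frames)]
    for name, start, end in segments:
        for f in range(start, end + 1):
            labels[f].add(name)  # NA label
    for (prev, _, prev_end), (nxt, nxt_start, _) in zip(segments, segments[1:]):
        ta = f"{prev} - {nxt}"  # TA class named after the two NAs it connects
        for f in range(max(0, prev_end - margin + 1), min(n_frames, nxt_start + margin)):
            labels[f].add(ta)  # frames near the boundary also receive the TA label
    return labels
```

For example, with “walk straight” on frames 0 to 5, “cross” on frames 6 to 11 and a margin of 2, frames 4 to 7 carry both their NA label and the TA label “walk straight - cross”, which is the kind of overlap that makes NA and TA hard to separate.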

Subtle Motion Descriptor (SMD)
•  A discriminative temporal CNN feature
–  Designed to separate the NA and TA classes

•  Activation features from VGG-16
–  Fully-connected layer (N = 4,096 dimensions)
–  Based on pooled time series (PoT) [Ryoo+, CVPR2015]

•  Temporal difference $\Delta V_t$ is calculated
–  $\Delta V_t = V_t - V_{t-1}$, where $V_t$ is the activation vector of frame t

•  Temporal pooling of $\Delta V_t$
–  Positive (plus) and negative (minus) changes are pooled
–  Values around zero are also pooled (this is the contribution of SMD)
–  The threshold TH is fixed experimentally
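A minimal NumPy sketch of the pooling described above, assuming a (T, N) array of per-frame fc activations. The slides do not say whether counts or sums are pooled, so this version assumes counts per dimension: changes above TH (plus), below −TH (minus), and within ±TH (around zero). The function name `smd` and the three-block output layout are assumptions; feature extraction from VGG-16 is omitted.

```python
import numpy as np


def smd(features: np.ndarray, th: float) -> np.ndarray:
    """Subtle-motion-style pooling over a (T, N) sequence of per-frame CNN activations.

    Assumed interpretation: for each of the N dimensions, count how many temporal
    differences are clearly positive (> th), clearly negative (< -th), or around
    zero (|dV| <= th), then concatenate the three count vectors.
    """
    if features.ndim != 2 or features.shape[0] < 2:
        raise ValueError("expected a (T, N) array with T >= 2")
    dv = np.diff(features, axis=0)                # dV_t = V_t - V_{t-1}, shape (T-1, N)
    plus = (dv > th).sum(axis=0)                  # pooled positive changes
    minus = (dv < -th).sum(axis=0)                # pooled negative changes
    around_zero = (np.abs(dv) <= th).sum(axis=0)  # pooled subtle (near-zero) changes
    return np.concatenate([plus, minus, around_zero]).astype(np.float32)
```

For a clip of fc activations with N = 4,096, this assumed layout yields a 3 × 4,096 = 12,288-dimensional descriptor; the around-zero block is what distinguishes it from pooling only the plus and minus changes.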

Datasets
•  Temporal action datasets
–  NTSEL [Kataoka+, ITSC2015]
•  Walk (NA), cross (NA), bicycle (NA) and turn (TA), with human bounding boxes
–  UTKinect-Action [Xia+, CVPRW2012]
•  10 ordered NAs (e.g., walk, throw, sit)
•  8 TAs (excluding push/pull; listed on the next slide)
•  Without human bounding boxes
–  Watch-n-Patch [Wu+, CVPR2015]
•  10 daily NAs (e.g., read, turn on monitor, leave office)
•  10 most frequent TAs (listed on the next slide)
•  Without human bounding boxes

Experimental settings (list of TAs)
[Tables: lists of TAs for UTKinect-Action and Watch-n-Patch]

Implementation
•  Action recognition approaches
–  Temporal CNN models
•  Pooled time series (PoT) [Ryoo+, CVPR2015]
•  CNN accumulation
•  CNN + IDT [Jain+, ECCVW2014]
–  Improved dense trajectories (IDT) and IDT with improved features
•  IDT [Wang+, ICCV2013]
•  IDT + co-occurrence feature [Kataoka+, ACCV2014]
•  All features in IDT

Exploratory experiments
•  Parameters
–  Frame accumulation
–  Threshold value TH
–  Layer: fc6 vs. fc7

•  Temporal accumulation [frames]
–  Faster prediction: 3 [frames] (0.1 s)
–  Toward state-of-the-art accuracy: 10 [frames] (0.33 s)
–  Both 3- and 10-frame accumulation are used as baselines
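The durations in parentheses are consistent with an assumed frame rate of 30 fps, which the slide does not state; under that assumption the accumulation window length $T_{\mathrm{acc}}$ is simply the frame count divided by the frame rate:

```latex
T_{\mathrm{acc}} = \frac{n_{\mathrm{frames}}}{f_{\mathrm{fps}}}, \qquad
\frac{3}{30\,\mathrm{fps}} = 0.1\,\mathrm{s}, \qquad
\frac{10}{30\,\mathrm{fps}} \approx 0.33\,\mathrm{s}
```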

•  Threshold value TH
–  Depends on the dataset

•  Layer: fc6 vs. fc7
–  fc6 performs better

Results
•  SMD (ours) achieves state-of-the-art performance in transitional action recognition

Comparison with PoT
•  Pooling subtle motion is effective for transitional action recognition (improvement of SMD over PoT)
–  NTSEL: +2.18%, +8.63%
–  UTKinect: +7.19%, +4.31%
–  Watch-n-Patch: +4.82%, +5.12%

Conclusion
•  Two contributions:
1.  Definition of transitional action for short-term action prediction
2.  Subtle Motion Descriptor (SMD) to classify transitional and normal actions
