1. Structural methods use human pose as a representation for action recognition by capturing information about actions while being invariant to clothing and lighting variations.
2. There are two broad classes of approaches: matching templates or exemplars, and fitting a human body model.
3. Recent work has focused on recognizing actions from still images by retrieving similar poses and analyzing image evidence, showing that pose is essential for video-based action recognition as well.
3. G. Johansson, Moving Light Displays, 1973
• Pose
seems
sufficient
for
certain
ac5on
categories
• Remove
effects
of
clothing,
ligh5ng
varia5on
from
representa5on
4.
5.
6.
7.
8. • Two
broad
classes
of
approaches
– Match
templates
(exemplar-‐based)
– Fit
a
human
body
model
9. Mori & Malik PAMI 2005
Shakhnarovich, Viola & Darrell ICCV 2003
Bourdev & Malik ICCV 2009
Database of Exemplars
11. • Detect
corners
in
images/video
• Assess
likelihood
under
ac5on-‐specific
pose
model
• Discriminate
between
walking
direc5ons,
bicycle
riding
Song, Goncalves & Perona NIPS 2001, PAMI 2003
12. test sequence
key frame
• Key
frame
matching
to
test
sequence
to
find
similar
poses
– Shape
matching
on
edge
maps
using
order
structure
Sullivan & Carlsson ECCV 2002
13. • M
is
quan5zed
3d
pose
• T
is
root
orienta5on
• Automa5c
person
detec5on-‐tracking
• Compare
quan5zed
pose
to
labeled
training
poses
– Smooth
over
5me
Ramanan & Forsyth NIPS 2003
14. • Video
shot
retrieval
from
pose
– Either
query-‐by-‐example
or
classifica5on
– Focus
on
upper
body
pose
• Pictorial
structures
model
Ac5on
Label
Ferrari, Marin & Zisserman CVPR 2009
15. query
CODE AVAILABLE ONLINE
• SVM
on
descriptors
of
absolute
&
rela5ve
part
loca5ons,
segmenta5ons
– Include
short
tracks
for
robustness
16. Golfing?
Walking?
• Focus
on
discrimina5ve
elements
of
pose
for
ac5on
classifica5on
• Use
exemplar-‐based
“poselet”
representa5on
Yang, Wang & Mori CVPR 2010
18. A:
Activity
A
Tennis Croquet Volleyball Human pose
forehand shot smash
H
O: Object
O
Tennis Croquet Volleyball Body parts
racket mallet
P1 P2 PN
H:
fO
f1 f2 fN
Intra-class variations
• More than one H for each A; Image evidence
• Unobserved during training.
P: lP: location; θP: orientation; sP: scale.
f: Shape context. [Belongie et al, 2002]
Yao & Fei-Fei CVPR 2010
20. • Build
ac5on
models
from
web
search
results
Ikizler-Cinbis, Cinbis, Sclaroff ICCV 2009
21. SLAG
• Find
repeated
poses
in
a
dataset
Wang, Jiang, Drew, Li, Mori CVPR 2006
22. • Person
loca5on
Riding horse Reading book Taking photo
given
• Classify
into
one
of
9
categories
Riding bike Play instrument Running
Phoning Use computer Walking
23. • Pose
as
representa5on
for
ac5on
recogni5on
– Captures
much
informa5on
about
ac5on
– Invariance
to
clothing
/
ligh5ng
effects
– Model
and
exemplar
based
representa5ons
• New
direc5on:
Ac5on
recogni5on
from
s5ll
images
– Image
retrieval
and
analysis
– An
important
cue
for
video-‐based
ac5on
recogni5on
– Pose
seems
essen5al
24.
25.
26.
27. • Describe
low-‐level
components
– Ac5ons
of
individual
people
– Movement
of
pixels
• Iden5fy
key
objects
or
loca5ons
in
scene
– Buildings,
roads,
etc.
• Model
interac5ons
between
people,
objects,
and
loca5ons
28. • Detect
and
track
moving
objects
• Manually
iden5fy
key
regions
in
scene
– E.g.
road,
checkpoint
• Scenarios
describe
rela5ve
arrangements
of
objects
in
scene
– E.g.
proximity
of
car
to
checkpoint
– No5ons
of
scene
context
Medioni, Cohen, Bremond, Hongeng, Nevatia PAMI 2001
29. • Detect
and
track
players,
ball
• Low-‐level
ac5on
detectors
for
individual
players
• Hand-‐constructed
Bayes
net
for
each
ac5vity
– Spa5al
and
temporal
rela5ons
between
low-‐
level
ac5ons
Intille & Bobick CVPR 1999
30. • Global,
frame-‐level
feature
– Bag-‐of-‐words
representa5on
• Detect
unusual
events
by
clustering
– Isolated,
varied
clusters
are
unusual
Zhong, Shi & Visontai CVPR 2004
31. • Chea5ng
detec5on
in
• Real-‐world
highway
dataset
simulated
card
game
– Cars
pulling
off
road,
backing
up,
U-‐turns
32. • Describe
moving
pixels
by
loca5on
and
mo5on
direc5on
– No
object
detec5on
• Use
as
visual
words
in
Latent
Dirichlet
Alloca5on
(LDA)
type
model
– Infer
low-‐level
ac5ons
from
words
Wang, Ma, Grimson PAMI 2009 Blei, Ng, Jordan JMLR 2003
33. • Higher-‐level
ac5vity
analysis
– Distribu5on
of
low-‐level
ac5ons
over
en5re
scene
• Applica5ons
– Temporal
segmenta5on
by
ac5vity
– Abnormality
detec5on
34. 0.8
A
B
0.3 0.5
C
• Hierarchical
Dirichlet
Process
model
– Learn
number
of
ac5vi5es
automa5cally
Kuettel, Breitenstein, van Gool & Ferrari CVPR 2010
35. traffic
light
controlled
scene
• con5nuous
video
• annotated
with
states
and
history
• 3x
speed
38. y
activity class
h1 h2 y
h action class
…
x1 x2 xn image
x0
Choi, Shahid, & Savarese VS 2009
Lan, Wang, Yang, & Mori SGA 2010, NIPS 2010
39. • Cap5oned
baseball
videos
in
training
• Build
AND-‐OR
graph
representa5on
of
ac5vi5es
– AND
specifies
elements
of
an
ac5vity
that
must
occur
– OR
allows
varia5on
in
how
an
element
appears
• Describe
low-‐level
tracks
using
STIPs
• Match
tracks
to
ac5ons
in
AND-‐OR
graph
Gupta, Srinivasan, Shi, Davis CVPR 2009
40. • Scene
modeling
to
look
at
the
big
picture
• Feature
representa5ons
– Holis5c:
describe
en5re
scene,
irrespec5ve
of
individuals
– Local:
describe
ac5ons
of
individuals
• Structure
of
ac5vi5es
– Model
free:
clustering-‐type
approaches
– Strong
models:
grammars,
probabilis5c
models
41.
42.
43. Objects: Actions:
cars, glasses, drinking, running,
people, etc… door exit, car enter
constraints
Scene categories:
Geometry:
indoors,
outdoors, street Street, wall,
scene, etc… field, stair, etc…
44. Riding horse Reading book
A 0.8 B
0.3 C 0.5
Riding bike Play instrument
45. • Standardiza5on
of
datasets
for
field
– Allow
comparison
of
algorithms
• E.g.
KTH
for
low-‐level
features,
atomic
ac5ons
– Fair
tuning
of
model
parameters
• New
algorithms
compare
to
baselines
– Bag-‐of-‐words
on
densely
sampled
STIPs
– Pose
es5ma5on
(Ferrari
et
al.
code)
– HOG
SVM
(Dalal
&
Triggs
code,
Ramanan
code)
46. • Standardiza5on
of
datasets
for
field
– Don’t
feel
constrained
by
the
exis5ng
problem
defini5ons
– Do
make
your
new
dataset
available
• Should
clearly
specify
separate
training
and
test
sets
• New
algorithms
compare
to
baselines
– Do
use
reasonable
variant
of
standard
baselines
for
your
new
problem
47.
48. • Even
atomic
low-‐level
ac5ons
are
very
difficult
to
detect
reliably
– Far
more
work
needed
on
representa5ons
for
the
ac5on
of
a
single
person
– Features
– Temporal
representa5on,
smoothing
– Tracking
– …
49. 1. Cameras and bandwidth are cheap
2. Lots of training data is potentially available
Training
+ =
data
Potential for the huge progress
… if we can get the data
50. Aligned with video
Describes visual Source
content
Subtitles
Yes No
DVD, Internet
Scripts for No
Yes
Internet, e.g.
TV series, movies and www.dailyscript.com
sport games
Plot summaries and No
Yes, sparsely
Internet
synopses (e.g. IMDB)
Instruction videos No
Yes
Internet, e.g.
www.videojug.com
Descriptive Video Service Yes
Yes
DVD, rare
Word tags No
Yes, sparsely
Internet
(e.g. YouTube)
Manual labelling, ??
??
Mechanical Turk,
Human Computation ESP Game,
Grad undergrad students
51. Open questions:
How to benefit from the structure of the human body in
complex situations, e.g. heavy occlusions, uniformly colored
clothing?
Will action classification generalize over different video
domains: Movies, TV, YouTube, Surveillance video?
What is the useful action vocabulary? Are we trying to solve the
right problem? How can we visualize/display the results?
Interesting novel directions:
Use actions for recognizing functional and physical object
properties, e.g. “sitable”, “eatable”, “heavy”, “solid” objects…
Action prediction, i.e. what can happen in the given situation:
e.g. is it dangerous to cross this road?
Explore more sources of strong and weak supervision: Manual
surveillance, Descriptive Video Service (DVS), YouTube tags;
Transcripts of sports games; Instruction videos.
52. • P.
Viola,
M.
Jones,
and
D.
Snow.
Detec5ng
pedestrians
using
paeerns
of
mo5on
and
appearance.
In
Proc.
9th
Int.
Conf.
Computer
Vision,
pages
734–741,
2003.
• N.
Dalal
and
B.
Triggs.
Histograms
of
oriented
gradients
for
human
detec5on.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2005.
• Bo
Wu
and
Ram
Neva5a.
Detec5on
of
mul5ple,
par5ally
occluded
humans
in
a
single
image
by
bayesian
combina5on
of
edgelet
part
detectors.
In
Proc.
10th
Int.
Conf.
Computer
Vision,
2005.
• Pedro
Felzenszwalb,
David
McAllester,
and
Deva
Ramanan.
A
discrimina5vely
trained,
mul5scale,
deformable
part
model.
In
IEEE
Computer
Society
Conference
on
Computer
Vision
and
Paeern
Recogni5on,
2008.
• Chris
Stauffer
and
W.E.L.
Grimson.
Adap5ve
background
mixture
models
for
real-‐
5me
tracking.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
1999.
• Kentaro
Toyama,
John
Krumm,
Barry
Brumie,
and
Brian
Meyers.
Wallflower:
Principles
and
prac5ce
of
background
maintenance.
In
Proc.
7th
Int.
Conf.
Computer
Vision,
1999.
• J.L.
Barron,
D.J.
Fleet,
and
S.S.
Beauchemin.
Performance
of
op5cal
flow
techniques.
Int.
Journal
of
Computer
Vision,
12(1):43–77,
1994.
• T.
Brox,
C.
Bregler,
and
J.
Malik.
Large
displacement
op5cal
flow.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2009.
• M.
Isard
and
A.
Blake.
Condensa5on
–
condi5onal
density
propaga5on
for
visual
tracking.
Int.
Journal
of
Computer
Vision,
29(1):5–28,
1998.
• Yuan
Li,
Chang
Huang,
and
Ram
Neva5a.
Learning
to
associate:
Hybridboosted
mul5-‐target
tracker
for
crowded
scene.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2009.
53. • W.
T.
Freeman,
K.
Tanaka,
J.Ohta,
and
K.
Kyuma.
Computer
vision
for
computer
games.
In
IEEE
2nd
Intl.
Conf.
on
Automa5c
Face
and
Gesture
Recogni5on,
1996.
• J.
Sullivan
and
S.
Carlsson.
Recognizing
and
tracking
human
ac5on.
In
ECCV
2002
• A.A.
Efros,
A.C.
Berg,
G.
Mori,
and
J.
Malik.
Recognizing
ac5on
at
a
distance.
In
ICCV
2003
• A.
Bobick
and
J.
Davis.
The
recogni5on
of
human
movement
using
temporal
templates.
IEEE
Trans.
PAMI,
23(3):257–267,
2001.
• L.
Zelnik-‐Manor
and
M.
Irani.
Event-‐based
video
analysis.
In
CVPR
2001
• E.
Shechtman
and
M.
Irani.
Space-‐5me
behavior
based
correla5on.
In
CVPR
2005
• O.
Boiman
and
M.
Irani.
Detec5ng
irregulari5es
in
images
and
in
video.
In
Proc.
ICCV,
2005.
• M.
Blank,
L.
Gorelick,
E.
Shechtman,
M.
Irani,
and
R.
Basri.
Ac5ons
as
space-‐5me
shapes.
In
Proc.
ICCV,
2005.
• Y.
Ke,
R.
Sukthankar,
and
M.
Hebert.
Efficient
Visual
Event
Detec5on
using
Volumetric
Features
.
In
Proc.
ICCV
2005.
• Y.
Ke,
R.
Sukthankar,
and
M.
Hebert.
Event
detec5on
in
crowded
videos.
In
Proc.
ICCV,
2007.
• I.
Laptev
and
P.
Pérez.
Retrieving
ac5ons
in
movies.
In
Proc.
ICCV
2007
• D.
Weinland
and
E.
Boyer.
Ac5on
recogni5on
using
exemplar-‐based
embedding.
In
Proc.
CVPR,
2008.
• Z.
Lin,
Z.
Jiang,
and
L.
S.
Davis.
Recognizing
ac5ons
by
shape-‐mo5on
prototype
trees.
In
Proc.
ICCV,
2009.
54. • I.
Laptev
and
T.
Lindeberg.
Space-‐5me
interest
points.
In
Proc.
ICCV
2003.
• C.
Schuldt,
I.
Laptev,
and
B.
Caputo.
Recognizing
human
ac5ons:
A
local
svm
approach.
In
Proc.
ICPR,
2004.
• P.
Dollar,
V.
Rabaud,
G.
Coerell,
and
Serge
Belongie.
Behavior
recogni5on
via
sparse
spa5o-‐temporal
features.
In
VS-‐PETS,
2005.
• H.
Jhuang,
T.
Serre,
L.
Wolf
and
T.
Poggio.
A
Biologically
Inspired
System
for
Ac5on
Recogni5on.
In
Proc.
ICCV
2007
• P.
Scovanner,
S.
Ali,
and
M.
Shah,
A
3-‐Dimensional
SIFT
descriptor
and
its
applica5on
to
ac5on
recogni5on,
ACM
MM
2007.
• J.
C.
Niebles,
H.
Wang,
and
L.
Fei-‐Fei.
Unsupervised
learning
of
human
ac5on
categories
using
spa5al-‐
temporal
words.
In
IJCV
2008.
• I.
Laptev,
M.
Marszalek,
C.
Schmid,
and
B.
Rozenfeld.
Learning
realis5c
human
ac5ons
from
movies.
In
Proc.
CVPR
2008.
• A.
Klaeser,
M.
Marszałek
and
C.
Schmid.
A
spa5o-‐temporal
descriptor
based
on
3D-‐gradients.
In
Proc.
BMVC
2008
• G.
Willems,
T.
Tuytelaars
and
L.
Van
Gool.
An
Efficient
Dense
and
Scale-‐Invariant
Spa5o-‐Temporal
Interest
Point
Detector.
In
Proc.
ECCV
2008
• H.
Wang,
M.
M.
Ullah,
A.
Kläser,
I.
Laptev
and
C.
Schmid.
Evalua5on
of
local
spa5o-‐temporal
features
for
ac5on
recogni5on.
In
Proc.
BMVC
2009.
• L.
Yeffet
and
L.
Wolf.
Local
Trinary
Paeerns
for
Human
Ac5on
Recogni5on.
In
Proc.
ICCV
2009.
• A.
Gilbert,
J.
Illingworth,
R.
Bowden.
Fast
realis5c
mul5-‐ac5on
recogni5on
using
mined
dense
spa5o-‐
temporal
features,
In
Proc.
ICCV
2009.
• P.
Ma5kainen,
M.
Hebert,
R.
Sukthankar.
Trajectons:
Ac5on
recogni5on
through
the
mo5on
analysis
of
tracked
features.
ICCV
workshop
on
Video-‐oriented
Object
and
Event
Classifica5on,
2009
• M.
M.
Ullah,
S.
N.
Parizi,
I.
Laptev.
Improving
bag-‐of-‐features
ac5on
recogni5on
with
non-‐local
cues.
In
Proc.
BMVC
2010
55. • Y.
Song,
L.
Goncalves,
and
P.
Perona.
Unsupervised
learning
of
human
mo5on.
IEEE
Trans.
PAMI,
25
(7):814–827,
2003.
• D.
Ramanan
and
D.
A.
Forsyth.
Automa5c
annota5on
of
everyday
movements.
In
Advances
in
Neural
Informa5on
Processing
Systems
16,
2003.
• V.
Ferrari,
M.
Marin,
and
A.
Zisserman.
Pose
search:
retrieving
people
using
their
pose.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2009.
• Yang
Wang,
Hao
Jiang,
Mark
S.
Drew,
Ze-‐Nian
Li,
and
Greg
Mori.
Unsupervised
discovery
of
ac5on
classes.
In
CVPR,
2006.
• Nazli
Ikizler-‐Cinbis,
R.
Gokberk
Cinbis,
and
Stan
Sclaroff.
Learning
ac5ons
from
the
web.
In
IEEE
Interna5onal
Conference
on
Computer
Vision,
2009.
• Weilong
Yang,
Yang
Wang,
and
Greg
Mori.
Recognizing
human
ac5ons
from
s5ll
images
with
latent
poses.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2010.
• Bangpeng
Yao
and
Li
Fei-‐Fei.
Modeling
mutual
context
of
object
and
human
pose
in
human-‐object
interac5on
ac5vi5es.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2010.
56. • R.
Polana
and
R.C.
Nelson.
Detec5on
and
recogni5on
of
periodic,
nonrigid
mo5on.
In
IJCV
1997.
• S.M.
Seitz
and
C.R.
Dyer.
View
invariant
analysis
of
cyclic
mo5on.
In
IJCV
1997
• A.
Thangali
and
S.
Sclaroff.
Periodic
mo5on
detec5on
and
es5ma5on
via
space-‐5me
sampling.
In
IEEE
Workshop
on
Mo5on
and
Video
Compu5ng,
2005.
• I.
Laptev,
S.J.
Belongie,
P.
Pérez
and
J.
Wills.
Periodic
mo5on
detec5on
and
segmenta5on
via
approximate
sequence
alignment,
In
Proc.
ICCV
2005
• P.
Wang,
G.D.
Abowd
and
J.M.
Rehg.
Quasi-‐periodic
event
analysis
for
social
game
retrieval.
In
Proc
ICCV
2009
• D.
Weinland,
E.
Boyer,
and
R.
Ronfard.
Ac5on
recogni5on
from
arbitrary
views
using
3D
exemplars.
in
Proc.
ICCV
2007.
• A.
Farhadi
and
M.
Tabrizi.
Learning
to
recognize
ac5vi5es
from
the
wrong
view
point.
In
Proc.
ECCV
2008.
• I.
Junejo,
E.
Dexter,
I.
Laptev
and
Patrick
Pérez.
Cross-‐view
ac5on
recogni5on
from
temporal
self-‐
similari5es.
In
Proc.
ECCV
2008
• A.
Farhadi,
M.
Kamali,
I.
Endres,
D.
Forsyth.
A
latent
model
of
discrimina5ve
aspect.
In
Proc.
ICCV
2009.
57. • X.
Wang,
X.
Ma,
and
E.
Grimson.
Unsupervised
ac5vity
percep5on
in
crowded
and
complicated
scenes
using
hierarchical
bayesian
models.
IEEE
Trans.
PAMI,
31(3):539–
555,
2009.
• Abhinav
Gupta,
Praveen
Srinivasan,
Jianbo
Shi,
and
Larry
S.
Davis.
Understanding
videos,
construc5ng
plots
-‐
learning
a
visually
grounded
storyline
model
from
annotated
videos.
In
CVPR,
2009.
• T.
Xiang
and
S.
Gong.
Beyond
tracking:
Modelling
ac5vity
and
understanding
behaviour.
Int.
Journal
of
Computer
Vision,
67(1):21–51,
2006.
• G.
Medioni,
I.
Cohen,
F.
Bre
́mond,
S.
Hongeng,
and
R.
Neva5a.
Event
detec5on
and
analysis
from
video
streams.
IEEE
Trans.
PAMI,
23(8):873–889,
2001.
• Y.
A.
Ivanov
and
A.
F.
Bobick.
Recogni5on
of
visual
ac5vi5es
and
interac5ons
by
stochas5c
parsing.
IEEE
Trans.
PAMI,
22(8):852–872,
2000.
• D.
Moore
and
I.
Essa.
Recognizing
mul5tasked
ac5vi5es
using
stochas5c
context-‐free
grammar
using
video.
In
AAAI,
2002.
• Chen
Change
Loy,
Tao
Xiang,
and
Shaogang
Gong.
Modelling
ac5vity
global
temporal
dependencies
using
5me
delayed
probabilis5c
graphical
model.
In
ICCV,
2009.
• Xiaogang
Wang,
Keng
Teck
Ma,
Gee
Wah
Ng,
and
W.
Eric
L.
Grimson.
Trajectory
analysis
and
seman5c
region
modeling
using
a
nonparametric
bayesian
model.
In
Proc.
IEEE
Comput.
Soc.
Conf.
Comput.
Vision
and
Paeern
Recogn.,
2008.
• W.
Choi,
K.
Shahid,
and
S.
Savarese.
”what
are
they
doing?
:
Collec5ve
ac5vity
classifica5on
using
spa5o-‐temporal
rela5onship
among
people”.
In
9th
Interna5onal
Workshop
on
Visual
Surveillance,
2009.
• Ramin
Mehran,
Alexis
Oyama,
and
Mubarak
Shah.
Abnormal
crowd
behavior
detec5on
using
social
force
model.
In
CVPR,
2009.