Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu

1

Human-Object Interaction

• Robots interact with objects
• Automatic sports commentary: “Kobe is dunking the ball.”
• Medical care

2

Holistic image based classification (Previous talk: Grouplet)
Playing bassoon
Playing saxophone

Vs.

Detailed understanding and reasoning
Playing saxophone
HOI activity: Tennis Forehand

Grouplet is a generic feature for structured objects, or interactions of groups of objects.

Results on Caltech101:
Berg & Malik, 2005         48%
Grauman & Darrell, 2005    59%
Gehler & Nowozin, 2009     77%
OURS                       62%

3

Holistic image based classification

• Human pose estimation

Head

Torso

4


• Object detection

Tennis racket

5


• Object detection

Head

Torso

Tennis racket

HOI activity: Tennis Forehand

6

Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
  – Model Representation
  – Model Learning
  – Model Inference
• Experiments
• Conclusion

7


Human pose estimation & Object detection

Human pose estimation is challenging.

Difficult part appearance

Self-occlusion

Image region looks like a body part

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

9




Facilitate: human pose estimation becomes easier given that the object is detected.

11


Object detection is challenging.

Small, low-resolution, partially occluded

Image region similar to detection target

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009
12




Facilitate: object detection becomes easier given that the pose is estimated.

14


Mutual Context

15

Context in Computer Vision
Previous work – Use context cues to facilitate object detection:

Helpful, but only moderately: detection with context outperforms detection without context by ~3-4%.

• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010
• Viola & Jones, 2001
• Lampert et al, 2008

16

Context in Computer Vision
Previous work – Use context cues to facilitate object detection:

Helpful, but only moderately: detection with context outperforms detection without context by ~3-4%.

• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Divvala et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010

Our approach – Two challenging tasks serve as mutual context of each other:

[Figure: results with mutual context vs. without context]

17


Mutual Context Model Representation

• A: Activity (tennis forehand, croquet shot, volleyball smash)
• O: Object (tennis racket, croquet mallet, volleyball)
• H: Human pose
  – Intra-class variations: more than one H for each A;
  – Unobserved during training.
• P: Body parts P1, P2, …, PN. lP: location; θP: orientation; sP: scale.
• f: Image evidence (fO, f1, f2, …, fN). Shape context. [Belongie et al, 2002]

19
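The variables listed on this slide can be held in a small data structure. The following is only an illustrative sketch: the class names `Part` and `Configuration`, the field names, the units, and the example values are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Part:
    """One body part P_n (also reused here for the object O)."""
    location: Tuple[float, float]  # l_P, assumed to be pixel coordinates
    orientation: float             # theta_P, assumed to be in radians
    scale: float                   # s_P


@dataclass
class Configuration:
    """One joint assignment of the model variables for an image."""
    activity: str      # A
    pose_id: int       # H, index of a (hidden) pose of this activity
    obj: Part          # O
    parts: List[Part]  # P_1 ... P_N


# Hypothetical example values, for illustration only.
config = Configuration(
    activity="tennis forehand",
    pose_id=0,
    obj=Part((120.0, 80.0), 0.5, 1.0),
    parts=[Part((100.0, 40.0), 0.0, 1.0), Part((100.0, 90.0), 0.0, 1.5)],
)
print(len(config.parts))  # 2
```

Treating the object as a `Part` is a simplification of this sketch; it works because the object also carries a location, an orientation and a scale.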


Markov Random Field

$\Psi = \sum_{e \in E} w_e \phi_e$, where $w_e$ is the clique weight and $\phi_e$ is the clique potential.

• $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

[Figure: graphical model over A, O, H, P1 … PN and the image evidence fO, f1 … fN]

20
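The model score on this slide is a weighted sum of clique potentials over the edges. A minimal sketch of that sum, assuming the potentials have already been evaluated for one configuration; the function name `model_score`, the edge keys, and all numbers are hypothetical:

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def model_score(weights: Dict[Edge, float], potentials: Dict[Edge, float]) -> float:
    """Return Psi = sum over edges e of w_e * phi_e."""
    return sum(weights[e] * potentials[e] for e in potentials)


# Hypothetical potential values and weights for the three co-occurrence edges.
potentials = {("A", "O"): 0.5, ("A", "H"): 0.25, ("O", "H"): 0.75}
weights = {("A", "O"): 1.0, ("A", "H"): 2.0, ("O", "H"): 1.0}

score = model_score(weights, potentials)
print(score)  # 1.75
```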


Markov Random Field

• $\phi_e(O, P_n)$, $\phi_e(H, P_n)$, $\phi_e(P_m, P_n)$: Spatial relationship among object and body parts.

For the object-part edge: $\phi_e(O, P_n) = \mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$

The three factors correspond to location, orientation, and size.

21
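The spatial potential multiplies a location term, an orientation term and a size term. The sketch below is one possible reading, not the authors' implementation: the binning scheme (20-pixel cells, 8 orientation bins), the lookup tables of learned bin values, the Gaussian on the scale ratio, and every name in the code are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

Pose = Tuple[float, float, float, float]  # (x, y, theta, scale)


def location_bin(dx: float, dy: float, cell: float = 20.0) -> Tuple[int, int]:
    """Quantize a location offset into a grid cell (cell size is assumed)."""
    return (math.floor(dx / cell), math.floor(dy / cell))


def orientation_bin(dtheta: float, n_bins: int = 8) -> int:
    """Quantize an orientation difference into one of n_bins sectors."""
    width = 2 * math.pi / n_bins
    return int((dtheta % (2 * math.pi)) // width)


def gaussian(x: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def spatial_potential(obj: Pose, part: Pose,
                      loc_table: Dict[Tuple[int, int], float],
                      ori_table: Dict[int, float],
                      scale_mean: float, scale_std: float) -> float:
    """Product of location, orientation and size terms for one edge."""
    ox, oy, otheta, oscale = obj
    px, py, ptheta, pscale = part
    loc = loc_table.get(location_bin(ox - px, oy - py), 0.0)
    ori = ori_table.get(orientation_bin(otheta - ptheta), 0.0)
    return loc * ori * gaussian(oscale / pscale, scale_mean, scale_std)


# Hypothetical learned tables and one object/part pair.
value = spatial_potential(
    obj=(100.0, 100.0, 0.0, 1.0),
    part=(90.0, 130.0, 0.0, 1.0),
    loc_table={(0, -2): 0.5},
    ori_table={0: 0.5},
    scale_mean=1.0,
    scale_std=1.0,
)
print(value)
```

Unseen bins fall back to 0.0 here, which makes the whole potential vanish; a smoothed table would be a reasonable alternative.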


Markov Random Field

• Learn structural connectivity among the body parts and the object: the edges $\phi_e(H, P_n)$, $\phi_e(O, P_n)$, and $\phi_e(P_m, P_n)$ are obtained by structure learning.

22


Markov Random Field

• Learn structural connectivity among the body parts and the object.
• $\phi_e(O, f_O)$ and $\phi_e(P_n, f_{P_n})$: Discriminative part detection scores.
  – Shape context + AdaBoost
  – [Andriluka et al, 2009]
  – [Belongie et al, 2002]
  – [Viola & Jones, 2001]

23


Model Learning

$\Psi = \sum_{e \in E} w_e \phi_e$

Input: training images (e.g. cricket shot, cricket bowling).

Goals:
• Hidden human poses (hidden variables)
• Structural connectivity (structure learning)
• Potential parameters (parameter estimation)
• Potential weights (parameter estimation)

28

Model Learning

$\Psi = \sum_{e \in E} w_e \phi_e$

Approach (hidden human poses): [Figure: example images of croquet shot]

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights

29

Model Learning

$\Psi = \sum_{e \in E} w_e \phi_e$

Approach (structural connectivity): Hill-climbing

$\max_E \left[ \sum_{e \in E} w_e \phi_e - \frac{|E|^2}{2\sigma^2} \right]$

The first term is the joint density of the model; the second term is a Gaussian prior on the number of edges.

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights
30
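Hill-climbing over edge sets can be sketched as a greedy search that toggles one edge at a time and keeps any change that raises the penalized objective. This is a simplified illustration, not the authors' procedure: it assumes each candidate edge contributes a fixed, precomputed value of w_e * phi_e (called `gain` here), whereas a real learner would re-estimate potentials as the structure changes. All names and numbers are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def objective(edges: Set[Edge], gain: Dict[Edge, float], sigma: float) -> float:
    """Sum of per-edge contributions minus the penalty |E|^2 / (2 sigma^2)."""
    return sum(gain[e] for e in edges) - len(edges) ** 2 / (2 * sigma ** 2)


def hill_climb(candidates: List[Edge], gain: Dict[Edge, float],
               sigma: float) -> Tuple[Set[Edge], float]:
    """Toggle single edges until no toggle improves the objective."""
    current: Set[Edge] = set()
    best = objective(current, gain, sigma)
    improved = True
    while improved:
        improved = False
        for e in candidates:
            neighbor = current ^ {e}  # add e if absent, remove it if present
            value = objective(neighbor, gain, sigma)
            if value > best:
                current, best, improved = neighbor, value, True
    return current, best


# Hypothetical per-edge contributions.
gain = {("O", "P1"): 2.0, ("P1", "P2"): 1.0, ("O", "P2"): 0.1}
edges, value = hill_climb(list(gain), gain, sigma=1.0)
print(edges, value)  # {('O', 'P1')} 1.5
```

With sigma = 1.0 the penalty grows quickly, so only the strongest edge survives; a larger sigma would admit more edges. Like any hill-climbing search, this can stop at a local optimum.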

Model Learning

$\Psi = \sum_{e \in E} w_e \phi_e$

Approach (potential parameters):
• Maximum likelihood: $\phi_e(A, O)$, $\phi_e(A, H)$, $\phi_e(O, H)$, $\phi_e(H, P_n)$, $\phi_e(O, P_n)$, $\phi_e(P_m, P_n)$
• Standard AdaBoost: $\phi_e(O, f_O)$, $\phi_e(P_n, f_{P_n})$

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights

31
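For the co-occurrence potentials, a maximum likelihood estimate of a categorical distribution is simply the relative frequency of each observed pair. The sketch below assumes that reading; the function name `cooccurrence_potential` and the toy training labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def cooccurrence_potential(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Relative frequency of each (activity, object) pair in the training set."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical training labels: (activity, object) per image.
labels = [("tennis forehand", "tennis racket")] * 3 + [("croquet shot", "croquet mallet")]
phi_AO = cooccurrence_potential(labels)
print(phi_AO[("tennis forehand", "tennis racket")])  # 0.75
```

The same counting applies to (A, H) and (O, H) pairs once the hidden poses have been assigned.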

Model Learning

$\Psi = \sum_{e \in E} w_e \phi_e$

Approach (potential weights): Max-margin learning

$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \sum_r \|\mathbf{w}_r\|_2^2 + \beta \sum_i \xi_i$

s.t. $\forall i, r$ where $y(r) \neq y(c_i)$: $\mathbf{w}_{c_i} \cdot \mathbf{x}_i - \mathbf{w}_r \cdot \mathbf{x}_i \geq 1 - \xi_i$; $\forall i$: $\xi_i \geq 0$

Notations
• xi: Potential values of the i-th image.
• wr: Potential weights of the r-th pose.
• y(r): Activity of the r-th pose.
• ξi: A slack variable for the i-th image.

Goals:
• Hidden human poses
• Structural connectivity
• Potential parameters
• Potential weights

32
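To make the max-margin objective concrete, the sketch below evaluates it for fixed weights rather than solving it. It assumes the slack of image i is the smallest non-negative value satisfying every margin constraint, and it calls the trade-off constant `beta`; the function names, the pose-to-activity mapping and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def slack(w: List[List[float]], x_i: Sequence[float], c_i: int,
          activity_of: Sequence[int]) -> float:
    """Smallest xi_i >= 0 satisfying all margin constraints of image i."""
    worst = 0.0
    for r in range(len(w)):
        if activity_of[r] != activity_of[c_i]:  # only poses of other activities
            margin = dot(w[c_i], x_i) - dot(w[r], x_i)
            worst = max(worst, 1.0 - margin)
    return worst


def objective(w: List[List[float]], xs: List[Sequence[float]], cs: List[int],
              activity_of: Sequence[int], beta: float) -> float:
    """0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i for fixed weights w."""
    reg = 0.5 * sum(dot(w_r, w_r) for w_r in w)
    return reg + beta * sum(slack(w, x, c, activity_of) for x, c in zip(xs, cs))


# Three poses: poses 0 and 1 belong to activity 0, pose 2 to activity 1.
activity_of = [0, 0, 1]
w = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
xs = [[2.0, 0.0], [0.0, 1.0]]  # potential values of two images
cs = [0, 2]                    # pose index assigned to each image

print(objective(w, xs, cs, activity_of, beta=2.0))  # 2.25
```

Poses that share the activity of the assigned pose are skipped, so an image is never penalized for scoring well under a sibling pose of its own class.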

Learning Results

Cricket defensive shot

Cricket bowling

Croquet shot

33

Learning Results

Tennis forehand

Tennis serve

Volleyball smash

34



Model Inference

Input: image I and the learned models.

For the first learned model:
• Head detection, torso detection, and tennis racket detection
• Compositional inference [Chen et al, 2007]
• Output: $\left(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n\right)$, the layout of the object and body parts.
37

Model Inference

Input: image I and the learned models.

Output:
$\left(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n\right)$
…
$\left(A_K, H_K, O_K^*, \{P_{K,n}^*\}_n\right)$

38
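Inference produces one output per learned model. A plausible final step, assumed here rather than stated on the slide, is to keep the output whose model score is highest and read the activity from it. The class `Hypothesis`, the function `select_best` and the scores below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    activity: str  # A_k
    pose_id: int   # H_k
    score: float   # model score of the best layout found under model k


def select_best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest score."""
    return max(hypotheses, key=lambda h: h.score)


# Hypothetical outputs of K = 3 learned models on one image.
outputs = [
    Hypothesis("tennis forehand", 0, 4.2),
    Hypothesis("tennis serve", 1, 3.1),
    Hypothesis("croquet shot", 0, 1.7),
]
best = select_best(outputs)
print(best.activity)  # tennis forehand
```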


Dataset and Experiment Setup

Sport data set: 6 classes [Gupta et al, 2009]
• Cricket defensive shot
• Cricket bowling
• Croquet shot
• Tennis forehand
• Tennis serve
• Volleyball smash

180 training (supervised with object and part locations) & 120 testing images

Tasks:
• Object detection;
• Pose estimation;
• Activity classification.
40


Object Detection Results

[Figure: precision-recall curves (precision vs. recall, both axes from 0 to 1) for cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball. Methods compared: sliding window [Andriluka et al, 2009], pedestrian context [Dalal & Triggs, 2006] (figure label: "Valid region"), and our method.]

42
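The plots on this slide report precision against recall. As a reminder of how such a curve is usually traced from ranked detections, here is a minimal sketch; it is not the evaluation code behind the slide, and the function name, the scores and the 0/1 correctness labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall(scores: Sequence[float],
                     labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each detection, highest score first.

    labels[i] is 1 if detection i is correct and 0 otherwise; at least one
    label is assumed to be 1.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    tp = fp = 0
    curve = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / n_pos))
    return curve


# Hypothetical detector scores and correctness flags.
curve = precision_recall([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(curve[0])  # (1.0, 0.5)
```

This simplified version counts recall only over the listed detections, so ground-truth objects that were never detected are not represented.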

Object Detection Results

[Figure: precision-recall curves (precision vs. recall, both axes from 0 to 1) comparing our method, pedestrian as context, and the scanning window detector. Highlighted cases: cricket ball (small object) and volleyball (background clutter).]

43


Human Pose Estimation Results

Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58

45

[Figure: for the tennis serve model and the volleyball smash model, our estimation result shown next to the result of Andriluka et al, 2009]
46

Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
One pose per class      .63    .40  .36    .41  .31    .38  .35    .21  .23    .52

[Figure: four estimation results]
47
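For a quick comparison across methods, the ten per-part values in each table row can be averaged. These unweighted means are derived here for illustration and are not reported on the slides; the variable names are hypothetical.

```python
# Per-part values copied from the table, in column order:
# torso, upper legs (2), lower legs (2), upper arms (2), lower arms (2), head.
rows = {
    "Ramanan, 2006":         [.52, .22, .22, .21, .28, .24, .28, .17, .14, .42],
    "Andriluka et al, 2009": [.50, .31, .30, .31, .27, .18, .19, .11, .11, .45],
    "Our full model":        [.66, .43, .39, .44, .34, .44, .40, .27, .29, .58],
    "One pose per class":    [.63, .40, .36, .41, .31, .38, .35, .21, .23, .52],
}

# Unweighted mean over the ten parts, rounded to three decimals.
means = {method: round(sum(vals) / len(vals), 3) for method, vals in rows.items()}

for method, mean in means.items():
    print(f"{method}: {mean}")
```

Under this simple average the full model scores about .424, against .27 for Ramanan, 2006 and about .273 for Andriluka et al, 2009, and using one pose per class lowers the average to .38.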


Activity Classification Results

Classification accuracy:
Our model                    83.3%
Gupta et al, 2009            78.9%
Bag-of-words (SIFT+SVM)      52.5%

Annotations: "No scene information"; "Scene is critical!!"
Example images: cricket shot, tennis forehand.

49

Conclusion

Grouplet representation

Vs.

Mutual context model

Next Steps
• Pose estimation & Object detection on PPMI images.
• Modeling multiple objects and humans.

50

Acknowledgment
• Stanford Vision Lab reviewers:
– Barry Chai (1985-2010)
– Juan Carlos Niebles
– Hao Su
• Silvio Savarese, U. Michigan
• Anonymous reviewers

51
