Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Lecturer at M H Saboo Siddik College of Engineering
May 20, 2013

  1. Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses. Presented by Arwa Chittalwala, Irfan Shaikh, and Heena Patel.
  2. Human-Object Interaction. Applications: robots interacting with objects; automatic sports commentary (“Kobe is dunking the ball.”); medical care.
  3. Human-Object Interaction: holistic image-based classification vs. detailed understanding and reasoning.
     • Image labels: playing saxophone; playing bassoon; playing saxophone. (Previous talk: Grouplet, a generic feature for structured objects or interactions of groups of objects.)
     • HOI activity: tennis forehand.
     • Caltech101 methods shown: Berg & Malik, 2005; Grauman & Darrell, 2005; Gehler & Nowozin, 2009; OURS. Accuracies shown, in slide order: 48%, 59%, 77%, 62%.
  4. Human-Object Interaction: from holistic image-based classification to detailed understanding and reasoning.
     • Human pose estimation (torso, head)
  5. Human-Object Interaction: from holistic image-based classification to detailed understanding and reasoning.
     • Human pose estimation
     • Object detection (tennis racket)
  6. Human-Object Interaction: from holistic image-based classification to detailed understanding and reasoning.
     • Human pose estimation (torso, head)
     • Object detection (tennis racket)
     • HOI activity: tennis forehand
  7. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  8. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  9. Human pose estimation & Object detection. Human pose estimation is challenging: difficult part appearance; self-occlusion; image regions that look like a body part. Related work: Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009.
  10. Human pose estimation & Object detection. Human pose estimation is challenging. Related work: Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009.
  11. Human pose estimation & Object detection. Pose estimation is facilitated given that the object is detected.
  12. Human pose estimation & Object detection. Object detection is challenging: objects are small, low-resolution, or partially occluded; image regions can look similar to the detection target. Related work: Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009.
  13. Human pose estimation & Object detection. Object detection is challenging. Related work: Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009.
  14. Human pose estimation & Object detection. Object detection is facilitated given that the pose is estimated.
  15. Human pose estimation & Object detection: mutual context.
  16. Context in Computer Vision. Previous work uses context cues to facilitate object detection: Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010. Context is helpful, but only moderately: detection with context outperforms detection without context by about 3-4%. Also cited: Viola & Jones, 2001; Lampert et al, 2008.
  17. Context in Computer Vision. Previous work uses context cues to facilitate object detection: Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010. Context is helpful, but only moderately: detection with context outperforms detection without context by about 3-4%. Our approach: two challenging tasks serve as mutual context for each other (results shown with mutual context and without context).
  18. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  19. Mutual Context Model Representation
     • A: activity (e.g. croquet shot, volleyball smash, tennis forehand).
     • O: object (e.g. croquet mallet, volleyball, tennis racket).
     • H: human pose. There is more than one H for each A, reflecting intra-class variations; H is unobserved during training.
     • P: body parts $P_1, P_2, \dots, P_N$, each with $l_P$: location; $\theta_P$: orientation; $s_P$: scale.
     • f: image evidence $f_O, f_1, f_2, \dots, f_N$, using shape context [Belongie et al, 2002].
  20. Mutual Context Model Representation. Markov Random Field: $\Psi = \sum_{e \in E} w_e \psi_e$, where $\psi_e$ is a clique potential and $w_e$ is its clique weight.
     • $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H.
  21. Mutual Context Model Representation. Markov Random Field: $\Psi = \sum_{e \in E} w_e \psi_e$ (clique potential $\psi_e$, clique weight $w_e$).
     • $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H.
     • $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: spatial relationship among object and body parts, e.g. $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ for location, orientation, and size.
  22. Mutual Context Model Representation. Same Markov Random Field and potentials as slide 21, plus:
     • Learn structural connectivity among the body parts and the object (obtained by structure learning).
  23. Mutual Context Model Representation. Same Markov Random Field, potentials, and structure learning as slide 22, plus:
     • $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: discriminative part detection scores, using shape context [Belongie et al, 2002] + AdaBoost [Viola & Jones, 2001]; see also [Andriluka et al, 2009].
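The weighted sum $\Psi = \sum_{e \in E} w_e \psi_e$ from these slides can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes every potential has already been evaluated to a number for one candidate assignment, and the edge names, potential values, weights, and the `bin_index` helper with its bin width are all hypothetical.

```python
import math

def bin_index(offset, bin_width):
    """Quantize a relative offset into an integer bin (hypothetical binning rule)."""
    return math.floor(offset / bin_width)

def model_score(potentials, weights):
    """Psi = sum over edges e of w_e * psi_e, for one fixed assignment."""
    return sum(weights[e] * potentials[e] for e in potentials)

# Made-up potential values for one candidate (A, O, H, P_1..P_N) assignment.
potentials = {
    ("A", "O"): 0.8,    # co-occurrence of activity and object
    ("A", "H"): 0.5,    # co-occurrence of activity and pose
    ("O", "H"): 0.6,    # co-occurrence of object and pose
    ("O", "P1"): 0.4,   # spatial relationship, object vs. body part 1
    ("O", "f_O"): 0.9,  # object detection score
}
# Made-up clique weights: every edge weighs 1.0 except the detection score.
weights = {e: 1.0 for e in potentials}
weights[("O", "f_O")] = 2.0

score = model_score(potentials, weights)
```

In a fuller version, `bin_index` would be applied to quantities such as the location offset between the object and a part, and the resulting bin would index a learned table of spatial potential values.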
  24. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  25. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Input: training images (e.g. cricket shot, cricket bowling). Goals: hidden human poses.
  26. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Input: training images (e.g. cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity.
  27. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Input: training images (e.g. cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
  28. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Input: training images (e.g. cricket shot, cricket bowling). Goals: hidden human poses (hidden variables); structural connectivity (structure learning); potential parameters and potential weights (parameter estimation).
  29. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: example shown for croquet shot.
  30. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach for structural connectivity: hill-climbing on $\max_{E} \; \sum_{e \in E} w_e \psi_e - \frac{|E|^2}{2\sigma^2}$, where the first term is the joint density of the model and the second term is a Gaussian prior on the number of edges.
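A hill-climbing search over edge sets, with a penalty of the form $|E|^2 / (2\sigma^2)$ on the number of edges, might look like the sketch below. It is only an illustration under strong assumptions: each candidate edge is given a fixed, made-up gain in `edge_gain`, whereas in the actual model the first term depends on the whole learned structure. The function names and the toy node set are hypothetical.

```python
from itertools import combinations

def objective(edges, edge_gain, sigma):
    """Sum of per-edge gains minus the |E|^2 / (2 sigma^2) penalty on edge count."""
    return sum(edge_gain[e] for e in edges) - len(edges) ** 2 / (2 * sigma ** 2)

def hill_climb(nodes, edge_gain, sigma):
    """Greedy search: repeatedly toggle the single edge that most improves the objective."""
    candidates = [frozenset(pair) for pair in combinations(nodes, 2)]
    edges = set()
    best = objective(edges, edge_gain, sigma)
    while True:
        # Score every neighbor reachable by adding or removing one edge.
        moves = [(objective(edges ^ {e}, edge_gain, sigma), e) for e in candidates]
        top_score, top_edge = max(moves, key=lambda move: move[0])
        if top_score <= best:
            return edges, best  # local optimum: no single toggle helps
        edges, best = edges ^ {top_edge}, top_score

# Toy example: one object node and two body-part nodes with made-up gains.
edge_gain = {
    frozenset({"O", "P1"}): 1.0,
    frozenset({"O", "P2"}): 0.2,
    frozenset({"P1", "P2"}): 0.6,
}
learned_edges, learned_score = hill_climb(["O", "P1", "P2"], edge_gain, sigma=2.0)
```

A smaller `sigma` makes the penalty steeper, so the search stops with fewer edges.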
  31. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach for potential parameters:
     • Maximum likelihood for $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$, $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$.
     • Standard AdaBoost for $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$.
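For a co-occurrence potential, a maximum-likelihood estimate reduces to counting how often each pair of labels appears together in the training set. The sketch below shows that counting step for activity-object pairs; the training labels are invented for illustration, and the exact normalization used by the authors is not stated on the slides.

```python
from collections import Counter

def cooccurrence_frequency(pairs):
    """Relative frequency of each (a, b) pair: the ML estimate of a joint table."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical (activity, object) labels for five training images.
training_pairs = [
    ("tennis forehand", "tennis racket"),
    ("tennis serve", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("croquet shot", "croquet mallet"),
    ("volleyball smash", "volleyball"),
]
psi_AO = cooccurrence_frequency(training_pairs)
```

Pairs that never occur in training are simply absent from the table here; a practical system would likely need some smoothing for them.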
  32. Model Learning for $\Psi = \sum_{e \in E} w_e \psi_e$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach for potential weights: max-margin learning,
     $\min_{\mathbf{w}, \xi} \; \frac{1}{2} \sum_r \|\mathbf{w}_r\|_2^2 + \beta \sum_i \xi_i$
     s.t. $\forall i, r$ where $y(r) \neq y(c_i)$: $\mathbf{w}_{c_i} \cdot \mathbf{x}_i - \mathbf{w}_r \cdot \mathbf{x}_i \geq 1 - \xi_i$; $\forall i$: $\xi_i \geq 0$.
     Notation:
     • $\mathbf{x}_i$: potential values of the i-th image.
     • $\mathbf{w}_r$: potential weights of the r-th pose.
     • $y(r)$: activity of the r-th pose.
     • $\xi_i$: a slack variable for the i-th image.
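To make the max-margin objective concrete, the sketch below evaluates it for fixed weights, choosing for each image the smallest slack that satisfies the margin constraints. It assumes that `pose_of_image[i]` plays the role of $c_i$ (the pose assigned to image i) and that `beta` is the trade-off constant in front of the slack sum; both names, and all the numbers, are illustrative. It evaluates the objective only and does not solve the optimization.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def max_margin_objective(W, X, pose_of_image, activity_of_pose, beta):
    """Return (objective, slacks) for fixed weights W.

    W[r] is the weight vector of pose r, X[i] the potential values of image i.
    The margin is enforced only against poses whose activity differs from the
    activity of the image's own pose.
    """
    slacks = []
    for x, c in zip(X, pose_of_image):
        own = dot(W[c], x)
        margins = [own - dot(W[r], x)
                   for r in range(len(W))
                   if activity_of_pose[r] != activity_of_pose[c]]
        # Smallest xi >= 0 with margin >= 1 - xi for every rival pose.
        slacks.append(max([0.0] + [1.0 - m for m in margins]))
    regularizer = 0.5 * sum(dot(w, w) for w in W)
    return regularizer + beta * sum(slacks), slacks

# Toy setup: three poses (poses 0 and 1 belong to activity 0, pose 2 to activity 1).
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
activity_of_pose = [0, 0, 1]
X = [[2.0, 0.0], [0.0, 0.5]]
pose_of_image = [0, 2]
value, slacks = max_margin_objective(W, X, pose_of_image, activity_of_pose, beta=1.0)
```

Note that poses sharing the image's activity impose no constraint, which is what lets several pose models coexist within one activity class.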
  33. Learning Results: cricket defensive shot; cricket bowling; croquet shot.
  34. Learning Results: tennis serve; volleyball smash; tennis forehand.
  35. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  36. Model Inference. Input: an image I and the learned models.
  37. Model Inference. For each learned model, head detection, torso detection, and tennis racket detection on image I give the layout of the object and body parts through compositional inference [Chen et al, 2007]. Result for the first model: $(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n)$.
  38. Model Inference. Applying the learned models to image I gives $(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n), \dots, (A_K, H_K, O_K^*, \{P_{K,n}^*\}_n)$, from which the output is produced.
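The inference loop over the K learned models can be sketched as below. The slide only says that the per-model results lead to an output; this sketch assumes the output is the result of the highest-scoring model, which is an assumption rather than something the slide states. The stub models, their scores, and the function names are hypothetical stand-ins for the real per-model layout search.

```python
def make_stub_model(activity, pose_id, score):
    """Build a stand-in for one learned (activity, pose) model."""
    def model(image):
        # A real model would search object and body-part layouts in the image
        # and return the best score it found; here the score is fixed.
        return score, {"A": activity, "H": pose_id}
    return model

def infer(image, models):
    """Evaluate every learned model on the image and keep the best-scoring result."""
    results = [model(image) for model in models]
    return max(results, key=lambda result: result[0])

models = [
    make_stub_model("tennis forehand", 0, 1.7),
    make_stub_model("tennis serve", 1, 2.3),
    make_stub_model("volleyball smash", 2, 0.9),
]
best_score, best_labels = infer(image=None, models=models)
```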
  39. Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion.
  40. Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: object detection; pose estimation; activity classification.
  41. Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: object detection; pose estimation; activity classification.
  42. Object Detection Results. [Figure: precision-recall curves (axes: recall, precision) for cricket bat, croquet mallet, tennis racket, volleyball, and cricket ball, with the valid region marked. Methods compared: our method; sliding window; pedestrian context. Cited: Andriluka et al, 2009; Dalal & Triggs, 2006.]
  43. Object Detection Results. [Figure: precision-recall curves (axes: recall, precision) for volleyball and cricket ball. Methods compared: our method; pedestrian as context; scanning (sliding) window detector. Annotations: small object; background clutter.]
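Precision-recall curves like the ones on these slides are typically built by ranking detections by score and sweeping a threshold down the ranking. The sketch below shows that computation; the detections and the ground-truth count are invented, and the rule that decides whether a detection counts as correct is assumed to have been applied already, since the slides do not state it.

```python
def precision_recall(scored_detections, num_ground_truth):
    """Return (recall, precision) points, one per rank in the score-sorted list.

    scored_detections: list of (score, is_true_positive) pairs.
    num_ground_truth: number of annotated objects in the test set.
    """
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    curve, true_positives = [], 0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        true_positives += int(is_true_positive)
        curve.append((true_positives / num_ground_truth, true_positives / rank))
    return curve

# Hypothetical detections for one object class with three annotated instances.
detections = [(0.9, True), (0.4, False), (0.7, True), (0.8, False)]
curve = precision_recall(detections, num_ground_truth=3)
```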
  44. Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: object detection; pose estimation; activity classification.
  45. Human Pose Estimation Results
     Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
     Ramanan, 2006           .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
     Andriluka et al, 2009   .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
     Our full model          .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
  46. Human Pose Estimation Results. Same table as slide 45. Example comparisons of Andriluka et al, 2009 against our estimation result, for the tennis serve model and the volleyball smash model.
  47. Human Pose Estimation Results. Same table as slide 45, with one added row, followed by four example estimation results.
     Method                  Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
     One pose per class      .63    .40  .36    .41  .31    .38  .35    .21  .23    .52
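As a quick way to compare the rows of the table, the sketch below computes an unweighted mean over the ten per-part values of each method. The per-part numbers are copied from the slide; the mean itself is a summary computed here for illustration and does not appear in the slides.

```python
# Per-part accuracies copied from the table, in column order:
# torso, upper leg (2), lower leg (2), upper arm (2), lower arm (2), head.
results = {
    "Ramanan, 2006":         [.52, .22, .22, .21, .28, .24, .28, .17, .14, .42],
    "Andriluka et al, 2009": [.50, .31, .30, .31, .27, .18, .19, .11, .11, .45],
    "Our full model":        [.66, .43, .39, .44, .34, .44, .40, .27, .29, .58],
    "One pose per class":    [.63, .40, .36, .41, .31, .38, .35, .21, .23, .52],
}

def mean_accuracy(values):
    """Unweighted mean over all body-part columns."""
    return sum(values) / len(values)

summary = {method: round(mean_accuracy(values), 3)
           for method, values in results.items()}
best_method = max(summary, key=summary.get)
```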
  48. Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: object detection; pose estimation; activity classification.
  49. Activity Classification Results. Classification accuracy: our model 83.3%; Gupta et al, 2009 78.9%; bag-of-words (SIFT+SVM) 52.5%. Slide annotations: “No scene information”; “Scene is critical!!” (cricket shot, tennis forehand).
  50. Conclusion. Human-Object Interaction: Grouplet representation vs. mutual context model. Next steps: pose estimation and object detection on PPMI images; modeling multiple objects and humans.