2. Visual representations
Geometric models (1970s-1990s)
Statistical classifiers (1990s-present)
Hand-coded models
Appearance-based representations
Large-scale training
[Figure labels: positives, negatives; positive weights, negative weights]
Learned model
fw (x) = w · Φ(x)
Training
• Training data consists of images with labeled bounding boxes
• Need to learn the model structure, filters and deformation costs
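As a rough illustration of the learned model fw (x) = w · Φ(x), the sketch below trains a linear scoring function on labeled positive and negative feature vectors. It assumes the features Φ(x) are already computed, and it uses stochastic subgradient descent on a regularized hinge loss as a stand-in; the slides do not say which optimizer was used. The names `score`, `train_linear` and the toy data are hypothetical.

```python
import random

def score(w, phi):
    """Linear score f_w(x) = w . phi(x) for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def train_linear(examples, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss.

    examples: list of (phi, label) pairs, label +1 (positive) or -1 (negative).
    This optimizer is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for phi, y in data:
            # Margin is measured before this step's update.
            margin = y * score(w, phi)
            # Shrink the weights (regularizer), then step on margin violations.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, phi)]
    return w

# Toy features; the last entry is a constant 1.0 acting as a bias term.
toy = [([2.0, 1.0, 1.0], +1), ([1.5, 2.0, 1.0], +1),
       ([-1.0, -1.5, 1.0], -1), ([-2.0, -0.5, 1.0], -1)]
w = train_linear(toy)
predictions = [1 if score(w, phi) > 0 else -1 for phi, _ in toy]
```

In a detector, each feature vector would come from an image window cut out around a labeled bounding box (positives) or from background (negatives).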
3. Learned visual representations
Where is invariance built in?
Representation
(linear classifier, ...)
Features
Viola Jones, Dalal Triggs
[Figure labels: positive weights, negative weights]
4. Learned visual representations
Where is invariance built in?
Representation
(latent-variable classifier)
Features
Felzenszwalb et al 09
Fig. 1. Detections obtained with a single component person model. The model is defined by a coarse root filter (a), several higher resolution part filters (b) and a spatial model for the location of each part relative to the root (c). The filters specify weights for histogram of oriented gradients features. Their visualization shows the positive weights at different orientations. The visualization of the spatial models reflects the “cost” of placing the center of a part at different locations relative to the root.
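The caption describes a score built from a root filter, part filters, and a cost for placing each part away from its expected location relative to the root. A minimal sketch of that scoring rule follows. It assumes the filter responses are precomputed 2D grids at a single resolution (the caption says the part filters are higher resolution, which is ignored here) and a quadratic deformation cost; `part_contribution`, `model_score`, the search radius and the toy numbers are all hypothetical.

```python
def part_contribution(response, anchor, root_pos, deform, radius=2):
    """Best (filter response - deformation cost) for one part near its anchor.

    response: 2D list of part-filter scores indexed [row][col].
    anchor:   ideal (drow, dcol) offset of the part from the root.
    deform:   (a, b) weights of the assumed quadratic cost a*dy^2 + b*dx^2.
    """
    rows, cols = len(response), len(response[0])
    ideal_r = root_pos[0] + anchor[0]
    ideal_c = root_pos[1] + anchor[1]
    best, best_loc = float("-inf"), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r, c = ideal_r + dy, ideal_c + dx
            if 0 <= r < rows and 0 <= c < cols:
                cost = deform[0] * dy * dy + deform[1] * dx * dx
                value = response[r][c] - cost
                if value > best:
                    best, best_loc = value, (r, c)
    return best, best_loc

def model_score(root_response, root_pos, parts):
    """Root filter score plus the best placement of every part."""
    total = root_response[root_pos[0]][root_pos[1]]
    placements = []
    for response, anchor, deform in parts:
        value, loc = part_contribution(response, anchor, root_pos, deform)
        total += value
        placements.append(loc)
    return total, placements

# Toy 5x5 response maps: the part fires one cell to the right of its anchor.
root_map = [[0.0] * 5 for _ in range(5)]
root_map[2][2] = 1.0
part_map = [[0.0] * 5 for _ in range(5)]
part_map[1][3] = 3.0
total, placements = model_score(root_map, (2, 2), [(part_map, (-1, 0), (0.5, 0.5))])
```

The part location is never labeled in training data; it is chosen by the max inside `part_contribution`, which is what makes the classifier latent-variable.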
5. Where does learning fit in?
[Diagram: training images → matching algorithm → algorithm output, compared against ground truth; example labels: person, bottle, cat]
Tune parameters ( , ) till desired output on training set
‘Graduate Student Descent’ might take a while
(phrase from Marshall Tappen)
6. 5 years of PASCAL people detection
[Plot: average precision (axis 0 to 50) by year, 2005 to 2010]
Matching results (after non-maximum suppression)
~1 second to search all scales
1% to 47% in 5 years
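For readers unfamiliar with the metric on the plot, here is a minimal sketch of average precision over a ranked list of detections. It computes the plain, non-interpolated form and assumes each detection has already been marked as a true or false positive. PASCAL's official evaluation interpolates the precision-recall curve, so its numbers would differ somewhat. The function name and toy data are hypothetical.

```python
def average_precision(scored, num_positives):
    """Non-interpolated AP for a ranked list of detections.

    scored: list of (confidence, is_true_positive) pairs.
    num_positives: number of ground-truth objects; missed ones lower the AP.
    """
    if num_positives == 0:
        return 0.0
    ranked = sorted(scored, key=lambda d: d[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
            # Precision at the rank where this true positive is retrieved.
            precision_sum += true_pos / rank
    return precision_sum / num_positives

# Four detections, three ground-truth people (one is never found).
detections = [(0.9, True), (0.8, False), (0.7, True), (0.4, False)]
ap = average_precision(detections, num_positives=3)
```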
How do we move beyond the plateau?
7. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
9. Invariance vs Parametric Search
Part-Based Models
[Figure: example detections labeled person, bottle and cat; part-based model panels (a), (b), (c)]
10. Learned visual representations
Where is invariance built in?
Representation
(latent-variable classifier)
Features
Yi & Ramanan 11
Buffy performance: 88% vs 73%
12. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
13. The forgotten challenge....
Results
Two methods, both using detectors trained on other data
Oxford method does not attempt to detect feet
                     Head   Hand   Foot
BCNPCL_HumanLayout   74.4   3.3    1.2
OXFORD_SBD           52.7   10.4   0.0
14. Structured classifiers
[Diagram labels: shape, reflectance, classifier, Estimated shape, Estimated reflectance]
[Slide background: a page of the Pinocchio paper (Baran and Popović 2007), transcribed below]
Figure 8: Top: heat equilibrium for two bones. Bottom: the result of rotating the right bone with the heat-based attachment
... the character volume as an insulated heat-conducting body and force the temperature of bone i to be 1 while keeping the temperature of all of the other bones at 0. Then we can take the equilibrium temperature at each vertex on the surface as the weight of bone i at that vertex. Figure 8 illustrates this in two dimensions.
Solving for heat equilibrium over a volume would require tessellating the volume and would be slow. Therefore, for simplicity, Pinocchio solves for equilibrium over the surface only, but at some vertices, it adds the heat transferred from the nearest bone. The equilibrium over the surface for bone i is given by ∂w^i/∂t = ∆w^i + H(p^i − w^i) = 0, which can be written as
−∆w^i + Hw^i = Hp^i, (1)
where ∆ is the discrete surface Laplacian, calculated with the cotangent formula [Meyer et al. 2003], p^i is a vector with p^i_j = 1 if the nearest bone to vertex j is i and p^i_j = 0 otherwise, and H is the diagonal matrix with H_jj being the heat contribution weight of the nearest bone to vertex j. Because ∆ has units of length^−2, so must H. Letting d(j) be the distance from vertex j to the nearest bone, Pinocchio uses H_jj = c/d(j)^2 if the shortest line segment from the vertex to the bone is contained in the character volume and H_jj = 0 if it is not. It uses the precomputed distance field to determine whether a line segment is entirely contained in the character volume. For c ≈ 0.22, this method gives weights with similar transitions to those computed by finding the equilibrium over the volume. Pinocchio uses c = 1 (corresponding to anisotropic heat diffusion) because the results look more natural. When k bones are equidistant from vertex j, heat contributions from all of them are used: p_j is 1/k for all of them, and H_jj = kc/d(j)^2.
Equation (1) is a sparse linear system, and the left hand side matrix −∆ + H does not depend on i, the bone we are interested in. Thus we can factor the system once and back-substitute to find the weights for each bone. Botsch et al. [2005] show how to use a sparse Cholesky solver to compute the factorization for this kind of system. Pinocchio uses the TAUCS [Toledo 2003] library for this computation. Note also that the weights w^i sum to 1 for each vertex: if we sum (1) over i, we get (−∆ + H) Σ_i w^i = H · 1, which yields Σ_i w^i = 1.
It is possible to speed up this method slightly by finding vertices that are unambiguously attached to a single bone and forcing their weight to 1. An earlier variant of our algorithm did this, but the improvement was negligible, and this introduced occasional artifacts.
5 Results
We evaluate Pinocchio with respect to the three criteria stated in the introduction: generality, quality, and performance. To ensure an objective evaluation, we use inputs that were not used during development. To this end, once the development was complete, we tested Pinocchio on 16 biped Cosmic Blobs models that we had not previously tried.
Figure 10: A centaur pirate with a centaur skeleton embedded looks at a cat with a quadruped skeleton embedded
Figure 11: The human scan on the left is rigged by Pinocchio and is posed on the right by changing joint angles in the embedded skeleton. The well-known deficiencies of LBS can be seen in the right knee and hip areas.
5.1 Generality
Figure 9 shows our 16 test characters and the skeletons Pinocchio embedded. The skeleton was correctly embedded into 13 of these models (81% success). For Models 7, 10 and 13, a hint for a single joint was sufficient to produce a good embedding.
These tests demonstrate the range of proportions that our method can tolerate: we have a well-proportioned human (Models 1–4, 8), large arms and tiny legs (6; in 10, this causes problems), and large legs and small arms (15; in 13, the small arms cause problems). For other characters we tested, skeletons were almost always correctly embedded into well-proportioned characters whose pose matched the given skeleton. Pinocchio was even able to transfer a biped walk onto a human hand, a cat on its hind legs, and a donut.
The most common issues we ran into on other characters were:
• The thinnest limb into which we may hope to embed a bone has a radius of 2τ. Characters with extremely thin limbs often fail because the graph we extract is disconnected. Reducing τ, however, hurts performance.
• Degree 2 joints such as knees and elbows are often positioned incorrectly within a limb. We do not know of a reliable way to identify the right locations for them: on some characters they are thicker than the rest of the limb, and on others they are thinner.
Although most of our tests were done with the biped skeleton, we have also used other skeletons for other characters (Figure 10).
5.2 Quality
Figure 11 shows the results of manually posing a human scan using our attachment. Our video [Baran and Popović 2007b] demonstrates the quality of the animation produced by Pinocchio.
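To make Equation (1) on the background page concrete, the sketch below solves (−∆ + H) w^i = H p^i on a toy chain of six vertices and checks that the weights sum to 1 at every vertex. Several things are assumptions made for illustration: the path-graph Laplacian stands in for the cotangent surface Laplacian, dense Gaussian elimination stands in for the sparse Cholesky factorization, and the system is re-solved per bone rather than factored once. The function names and toy distances are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def bone_weights(nearest_bone, dist, num_bones, c=1.0):
    """Weights w^i from (-Lap + H) w^i = H p^i on a chain of vertices."""
    n = len(nearest_bone)
    # Path-graph Laplacian, an assumed stand-in for -Lap (cotangent formula).
    A = [[0.0] * n for _ in range(n)]
    for j in range(n - 1):
        A[j][j] += 1.0
        A[j + 1][j + 1] += 1.0
        A[j][j + 1] -= 1.0
        A[j + 1][j] -= 1.0
    H = [c / (d * d) for d in dist]  # H_jj = c / d(j)^2
    for j in range(n):
        A[j][j] += H[j]
    weights = []
    for i in range(num_bones):
        # Right-hand side H p^i: H_jj where bone i is nearest, else 0.
        rhs = [H[j] if nearest_bone[j] == i else 0.0 for j in range(n)]
        weights.append(solve(A, rhs))
    return weights

nearest = [0, 0, 0, 1, 1, 1]           # nearest bone for each vertex
dist = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]  # distance to that bone
w = bone_weights(nearest, dist, num_bones=2)
sums = [w[0][j] + w[1][j] for j in range(len(nearest))]
```

The sum-to-one property follows because the Laplacian annihilates the constant vector, which is the same argument the page gives by summing (1) over i.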
15. Structured object reports
“If you’re not winning the game, change the rules”
[Slide background: project description, truncated at both margins]
Lead: Jitendra Malik (UC Berkeley)
Participants: Deva Ramanan (UC Irvine), Steve Seitz (U Washington ...
Introduction/goal: Human detection and pose estimation are tasks with many applications, including next-generation human-computer interfaces and activity understanding. Detection ... cast as a classification problem (does this window contain a person or not?), while pose estimation is often cast as a regression problem, where given an image or sequence of frames, one m... joint angles. This project will take a more general view and cast both tasks as one of “p... a full syntactic parse will report the number of people present (if any), their body ...
16. Caveat: we need more pixels
We should focus on high-resolution data
(in contrast to most learning methods)
[Slide background: poster “Multiresolution models for object detection”, Dennis Park, Deva Ramanan, Charless Fowlkes]
Motivation & Goal
Objects in images come with various resolutions.
Most recognition systems are scale-invariant, i.e. fixed-size template.
More pixels mean more information!
We want to use the information when it is available.
Goal:
1. We want to use more pixels.
2. We want to detect small instances as well.
3. In addition, we try to address the correlation between resolution and the role of context.
Building blocks
HOG features [1]
SVM
[Remaining poster columns are truncated; surviving fragments mention a test image, steps S3 and S4, a star model, LR and HR templates, part locations, features Φ(x, s, z) and a scoring function f (x, s)]
17. Caltech Pedestrian Benchmark
Multiresolution model
Park et al. 2010
[Figure labels: missed detections, ...d detections]
On the left, we show the result of our low-resolution rigid-template baseline. ... fails to detect large instances. On the right, we show detections of our high-resolution, part-based baseline, which fails to find small instances. On the ... detections of our multiresolution model that is able to detect both large and small instances. The threshold of each model is set to yield the same rate of ...
Multiresolution representations decrease error by 2X compared to previous work
18. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
3. Generate ground-truth datasets of structured labels
25. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score “nuisance” variables as meaningful output
3. Generate ground-truth datasets of structured labels
26. Diagram for Eero
Machine Learning
Vision
Vision as applied machine learning
27. Diagram for Eero
Vision
Graphics Machine Learning
(shape & appearance)
Vision as structured pattern recognition