A biologically-motivated approach to computer vision
1. A biologically-motivated approach to computer vision
Thomas Serre
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
2. The problem: invariant
recognition in natural scenes
• Object recognition is hard!
• Our visual capabilities are
computationally amazing
• Reverse-engineer the visual
system and build machines that
see and interpret the visual
world as well as we do
5. The recipe
Lots of simple features + fancy classifier + lots of training examples
Example: face detection (Schneiderman & Kanade '99; Viola & Jones '01)
[Figure: excerpts from the Viola & Jones '01 paper: the AdaBoost feature-selection algorithm (their Table 1), the first two rectangle features selected, which measure the intensity difference between the eye region and the upper cheeks and between the eyes and the bridge of the nose (their Figure 3), example frontal upright training faces (their Figure 5), and the attentional cascade of boosted classifiers that rejects most sub-windows in the first layers, roughly 15 times faster than the Rowley-Baluja-Kanade detector and roughly 600 times faster than the Schneiderman-Kanade detector]
• Tens of thousands of manually
annotated training examples
• ~30,000 object categories
(Biederman, 1987)
• Approach unlikely to scale up ...
What’s wrong with this
picture?
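The recipe on this slide (many simple features, selected and combined by a boosted classifier, as in Viola & Jones '01) can be sketched with plain decision stumps standing in for rectangle features. This is a minimal illustration of AdaBoost with one-feature threshold classifiers, not the paper's implementation; all function names are illustrative:

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=10):
    """AdaBoost over decision stumps: each round picks the single
    feature/threshold whose prediction best separates the weighted
    examples, then up-weights the examples it got wrong."""
    n, d = X.shape
    w = np.ones(n) / n                       # example weights
    stumps = []                              # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        w = w / w.sum()                      # normalize to a distribution
        best = None
        for j in range(d):                   # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump
        w = w * np.exp(-alpha * y * pred)      # up-weight mistakes
        stumps.append((j, thr, pol, alpha))
    return stumps

def predict(stumps, X):
    """Strong classifier: sign of the weighted vote of all stumps."""
    score = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1)
                for j, t, p, a in stumps)
    return np.sign(score)
```

The attentional-cascade idea of the paper then chains such classifiers so that cheap early stages reject most negative sub-windows before expensive later stages run.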
18. What are the
computational
mechanisms
underlying this
amazing feat?
1. Organization of the
visual system
2. Computational model of
the visual cortex
3. Application to computer
vision
source: cerebral cortex
20. Hierarchical architecture: Anatomy
Rockland & Pandya '79; Maunsell & Van Essen '83; Felleman & Van Essen '91
22. Hierarchical architecture: Latencies
Nowak & Bullier '97; Schmolesky et al '98
source: Thorpe & Fabre-Thorpe '01
28. Hierarchical architecture: Function
Simple cells and complex cells
Hubel & Wiesel 1959, 1962, 1965, 1968 (Nobel prize 1981)
29. Hierarchical architecture: Function
Gradual increase in complexity of preferred stimulus
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
30. Hierarchical architecture: Function
Parallel increase in invariance properties (position and scale) of neurons
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
34. Hierarchical architecture: Function
• Invariant object recognition in IT:
  • Robust, invariant readout of category information from a small population of neurons
  • Single spikes after response onset carry most of the information
Hung*, Kreiman*, Poggio & DiCarlo 2005
39. What are the
computational
mechanisms used by
brains to achieve this
amazing feat?
1. Organization of the
visual system
2. Computational model of
the visual cortex
3. Application to computer
vision
source: cerebral cortex
40. Feedforward hierarchical model of object recognition
• Qualitative neurobiological models (Hubel & Wiesel '58; Perrett & Oram '93)
• Biologically-inspired models (Fukushima '80; Mel '97; LeCun et al '98; Thorpe '02; Ullman et al '02; Wersing & Koerner '03)
• Quantitative neurobiological models (Wallis & Rolls '97; Riesenhuber & Poggio '99; Amit & Mascaro '03; Deco & Rolls '06)
41. Feedforward hierarchical model
• Large-scale (~10^8 units); spans several areas of the visual cortex
• Combination of forward and reverse engineering
• Shown to be consistent with many experimental data across areas of visual cortex
[Figure: model layers mapped onto the ventral ('what') stream from V1 to prefrontal cortex, next to the dorsal ('where') pathway; increase in complexity (number of subunits), RF size and invariance along the hierarchy; tuning (simple cells) and MAX (complex cells) operations, main and bypass routes; unsupervised, task-independent learning up through the intermediate layers and supervised, task-dependent learning for the animal vs. non-animal classification units in prefrontal cortex]

Model layer   RF size        Num. units
S1            0.2° - 1.1°    10^6
C1            0.4° - 1.6°    10^4
S2            0.6° - 2.4°    10^7
C2            1.1° - 3.0°    10^5
S2b           0.9° - 4.4°    10^7
S3            1.2° - 3.2°    10^4
C2b           7°             10^3
C3            7°             10^3
S4            7°             10^2
42. Selective pooling mechanisms
[Figure: simple units and complex units]
Riesenhuber & Poggio 1999 (building on Fukushima '80 and Hubel & Wiesel '62)
43. Selective pooling mechanisms
• Simple units: template matching, Gaussian-like tuning (~ "AND")
• Complex units: invariance, max-like operation (~ "OR")
Riesenhuber & Poggio 1999 (building on Fukushima '80 and Hubel & Wiesel '62)
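The two operations on this slide can be sketched in a few lines. This is a minimal illustration; the Gaussian width sigma and the pooling neighborhood are illustrative choices, not the model's actual parameters:

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Template matching: Gaussian-like tuning of an input patch x
    to a stored template w. Bell-shaped response, ~ 'AND': the unit
    fires most when all inputs match the template."""
    return np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2))

def complex_unit(responses):
    """Invariance: max-like pooling over simple units tuned to the
    same template at nearby positions/scales, ~ 'OR': the unit fires
    if any afferent fires."""
    return np.max(responses)
```

Shifting the stimulus changes which simple unit responds most, but the complex unit's max output is unchanged, which is how the pooling stage buys position invariance.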
44. Feedforward hierarchical model
[Figure: same large-scale model diagram as slide 41]
45. Basic circuit for the two operations
Both operations can be approximated by gain control circuits using shunting inhibition
Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007
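One commonly cited form of such a circuit is a divisively normalized sum of powered inputs, y = Σ x_j^(q+1) / (k + Σ x_j^q): for small q the response behaves like a tuned, normalized summation, and as q grows it approaches max_j x_j. The sketch below is illustrative only (the exponent q and constant k are free parameters here, not the published circuit's values):

```python
import numpy as np

def gain_control(x, q=2, k=1e-9):
    """Divisive-normalization ('shunting inhibition'-like) pooling.
    One circuit interpolates between the two operations:
    q small -> graded, tuning-like response; q large -> max-like."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

x = [0.2, 0.5, 1.0]
print(gain_control(x, q=1))    # graded, average-like response
print(gain_control(x, q=20))   # close to max(x)
```

Because the numerator is bounded by max(x) times the denominator, the output never exceeds the largest input, and raising q simply sharpens the competition among inputs.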
46. Learning and plasticity
[Figure: model layers (S1 - S4) mapped onto ventral-stream areas from V1 to PFC, with RF sizes; animal vs. non-animal classification units in PFC]
47. Learning and plasticity
• PFC, IT: very likely (evidence for adult plasticity)
• V4: likely
• V1/V2: limited evidence
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
48. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
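One concrete reading of this developmental-like stage, in the spirit of the model's template imprinting, is to sample patches of unlabeled natural images and store them as the templates of intermediate units. The sketch below is an assumption-laden illustration (patch size, template count, and the uniform sampling rule are all illustrative; the actual model also tracks how frequently features occur):

```python
import numpy as np

def learn_templates(images, n_templates=4, patch=3, rng=None):
    """Task-independent, unsupervised stage: imprint unit templates
    by sampling patches from unlabeled images. No labels are used."""
    rng = np.random.default_rng(rng)
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]       # pick an image
        r = rng.integers(img.shape[0] - patch + 1)    # pick a location
        c = rng.integers(img.shape[1] - patch + 1)
        templates.append(img[r:r + patch, c:c + patch].copy())
    return templates
```

The stored patches later serve as the preferred stimuli of S2-like units; only the final classification stage (next slides) sees any labels.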
51. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Learned V2/V4 units: stronger facilitation, stronger suppression
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
52. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Beyond V4: combinations of those...
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
53. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Supervised learning from a handful of training examples (~ linear perceptron)
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
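The supervised stage (~ linear perceptron) can be sketched as follows. The feature vectors here stand in for the model's top-level (C2b/C3-like) unit responses, and the names are illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Linear perceptron readout: the only task-dependent, supervised
    stage, sitting on top of the task-independent feature hierarchy.
    y is +/-1 (e.g. animal vs. non-animal)."""
    w = np.zeros(X.shape[1] + 1)                  # weights + bias
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias input
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                # misclassified -> update
                w += lr * yi * xi
    return w

def classify(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```

Because the features are already selective and invariant, a linear rule trained on a handful of labeled examples can suffice for the categorization task.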
55. Feedforward hierarchical model
[Figure: same large-scale model diagram as slide 41: model layers S1 - S4 with RF sizes and numbers of units, dorsal 'where' and ventral 'what' pathways, tuning and MAX operations, main and bypass routes; unsupervised, task-independent learning followed by supervised, task-dependent learning for the animal vs. non-animal classification units]