A biologically-motivated approach to computer vision
1. A biologically-motivated approach to computer vision
Thomas Serre
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
2. The problem: invariant
recognition in natural scenes
• Object recognition is hard!
• Our visual capabilities are
computationally amazing
• Reverse-engineer the visual
system and build machines that
see and interpret the visual
world as well as we do
5. The recipe
Lots of simple features + fancy classifier + lots of training examples
Example: face detection (Schneiderman & Kanade '99; Viola & Jones '01)
[Figure: excerpts from the Viola & Jones '01 paper: the AdaBoost feature-selection algorithm (their Table 1), the first two rectangle features selected, which measure the intensity difference between the eye region and the upper cheeks and between the eyes and the bridge of the nose (their Figure 3), example frontal upright training faces (their Figure 5), and the attentional cascade of boosted classifiers that rejects most sub-windows in the first layers, roughly 15 times faster than the Rowley-Baluja-Kanade detector and roughly 600 times faster than the Schneiderman-Kanade detector]
• Tens of thousands of manually
annotated training examples
• ~30,000 object categories
(Biederman, 1987)
• Approach unlikely to scale up ...
What’s wrong with this
picture?
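The recipe on this slide (many simple features, selected and combined by a boosted classifier, as in Viola & Jones '01) can be sketched with plain decision stumps standing in for rectangle features. This is a minimal illustration of AdaBoost with one-feature threshold classifiers, not the paper's implementation; all function names are illustrative:

```python
import numpy as np

def train_adaboost_stumps(X, y, n_rounds=10):
    """AdaBoost over decision stumps: each round picks the single
    feature/threshold whose prediction best separates the weighted
    examples, then up-weights the examples it got wrong."""
    n, d = X.shape
    w = np.ones(n) / n                       # example weights
    stumps = []                              # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        w = w / w.sum()                      # normalize to a distribution
        best = None
        for j in range(d):                   # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump
        w = w * np.exp(-alpha * y * pred)      # up-weight mistakes
        stumps.append((j, thr, pol, alpha))
    return stumps

def predict(stumps, X):
    """Strong classifier: sign of the weighted vote of all stumps."""
    score = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1)
                for j, t, p, a in stumps)
    return np.sign(score)
```

The attentional-cascade idea of the paper then chains such classifiers so that cheap early stages reject most negative sub-windows before expensive later stages run.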
18. What are the
computational
mechanisms
underlying this
amazing feat?
1. Organization of the
visual system
2. Computational model of
the visual cortex
3. Application to computer
vision
source: cerebral cortex
20. Hierarchical architecture: Anatomy
Rockland & Pandya '79; Maunsell & Van Essen '83; Felleman & Van Essen '91
22. Hierarchical architecture: Latencies
Nowak & Bullier '97; Schmolesky et al '98
source: Thorpe & Fabre-Thorpe '01
28. Hierarchical architecture: Function
Simple cells and complex cells
Hubel & Wiesel 1959, 1962, 1965, 1968 (Nobel prize 1981)
29. Hierarchical architecture: Function
Gradual increase in complexity of preferred stimulus
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
30. Hierarchical architecture: Function
Parallel increase in invariance properties (position and scale) of neurons
Kobatake & Tanaka 1994; see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
34. Hierarchical architecture: Function
• Invariant object recognition in IT:
  • Robust, invariant readout of category information from a small population of neurons
  • Single spikes after response onset carry most of the information
Hung*, Kreiman*, Poggio & DiCarlo 2005
39. What are the
computational
mechanisms used by
brains to achieve this
amazing feat?
1. Organization of the
visual system
2. Computational model of
the visual cortex
3. Application to computer
vision
source: cerebral cortex
40. Feedforward hierarchical model of object recognition
• Qualitative neurobiological models (Hubel & Wiesel '58; Perrett & Oram '93)
• Biologically-inspired models (Fukushima '80; Mel '97; LeCun et al '98; Thorpe '02; Ullman et al '02; Wersing & Koerner '03)
• Quantitative neurobiological models (Wallis & Rolls '97; Riesenhuber & Poggio '99; Amit & Mascaro '03; Deco & Rolls '06)
41. Feedforward hierarchical model
• Large-scale (~10^8 units); spans several areas of the visual cortex
• Combination of forward and reverse engineering
• Shown to be consistent with many experimental data across areas of visual cortex
[Figure: model layers mapped onto the ventral ('what') stream from V1 to prefrontal cortex, next to the dorsal ('where') pathway; increase in complexity (number of subunits), RF size and invariance along the hierarchy; tuning (simple cells) and MAX (complex cells) operations, main and bypass routes; unsupervised, task-independent learning up through the intermediate layers and supervised, task-dependent learning for the animal vs. non-animal classification units in prefrontal cortex]

Model layer   RF size        Num. units
S1            0.2° - 1.1°    10^6
C1            0.4° - 1.6°    10^4
S2            0.6° - 2.4°    10^7
C2            1.1° - 3.0°    10^5
S2b           0.9° - 4.4°    10^7
S3            1.2° - 3.2°    10^4
C2b           7°             10^3
C3            7°             10^3
S4            7°             10^2
42. Selective pooling mechanisms
[Figure: simple units and complex units]
Riesenhuber & Poggio 1999 (building on Fukushima '80 and Hubel & Wiesel '62)
43. Selective pooling mechanisms
• Simple units: template matching, Gaussian-like tuning (~ "AND")
• Complex units: invariance, max-like operation (~ "OR")
Riesenhuber & Poggio 1999 (building on Fukushima '80 and Hubel & Wiesel '62)
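The two operations on this slide can be sketched in a few lines. This is a minimal illustration; the Gaussian width sigma and the pooling neighborhood are illustrative choices, not the model's actual parameters:

```python
import numpy as np

def simple_unit(x, w, sigma=1.0):
    """Template matching: Gaussian-like tuning of an input patch x
    to a stored template w. Bell-shaped response, ~ 'AND': the unit
    fires most when all inputs match the template."""
    return np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2))

def complex_unit(responses):
    """Invariance: max-like pooling over simple units tuned to the
    same template at nearby positions/scales, ~ 'OR': the unit fires
    if any afferent fires."""
    return np.max(responses)
```

Shifting the stimulus changes which simple unit responds most, but the complex unit's max output is unchanged, which is how the pooling stage buys position invariance.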
44. Feedforward hierarchical model
[Figure: same large-scale model diagram as slide 41]
45. Basic circuit for the two operations
Both operations can be approximated by gain control circuits using shunting inhibition
Kouh & Poggio 2007; Knoblich, Bouvrie & Poggio 2007
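One commonly cited form of such a circuit is a divisively normalized sum of powered inputs, y = Σ x_j^(q+1) / (k + Σ x_j^q): for small q the response behaves like a tuned, normalized summation, and as q grows it approaches max_j x_j. The sketch below is illustrative only (the exponent q and constant k are free parameters here, not the published circuit's values):

```python
import numpy as np

def gain_control(x, q=2, k=1e-9):
    """Divisive-normalization ('shunting inhibition'-like) pooling.
    One circuit interpolates between the two operations:
    q small -> graded, tuning-like response; q large -> max-like."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** (q + 1)) / (k + np.sum(x ** q))

x = [0.2, 0.5, 1.0]
print(gain_control(x, q=1))    # graded, average-like response
print(gain_control(x, q=20))   # close to max(x)
```

Because the numerator is bounded by max(x) times the denominator, the output never exceeds the largest input, and raising q simply sharpens the competition among inputs.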
46. Learning and plasticity
[Figure: model layers (S1 - S4) mapped onto ventral-stream areas from V1 to PFC, with RF sizes; animal vs. non-animal classification units in PFC]
47. Learning and plasticity
• PFC, IT: very likely (evidence for adult plasticity)
• V4: likely
• V1/V2: limited evidence
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
48. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
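One concrete reading of this developmental-like stage, in the spirit of the model's template imprinting, is to sample patches of unlabeled natural images and store them as the templates of intermediate units. The sketch below is an assumption-laden illustration (patch size, template count, and the uniform sampling rule are all illustrative; the actual model also tracks how frequently features occur):

```python
import numpy as np

def learn_templates(images, n_templates=4, patch=3, rng=None):
    """Task-independent, unsupervised stage: imprint unit templates
    by sampling patches from unlabeled images. No labels are used."""
    rng = np.random.default_rng(rng)
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]       # pick an image
        r = rng.integers(img.shape[0] - patch + 1)    # pick a location
        c = rng.integers(img.shape[1] - patch + 1)
        templates.append(img[r:r + patch, c:c + patch].copy())
    return templates
```

The stored patches later serve as the preferred stimuli of S2-like units; only the final classification stage (next slides) sees any labels.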
51. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Learned V2/V4 units: stronger facilitation, stronger suppression
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
52. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Beyond V4: combinations of those...
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
53. Learning and plasticity
• Unsupervised developmental-like learning stage: frequent image features
• Supervised learning from a handful of training examples (~ linear perceptron)
[Figure: model layers mapped onto ventral-stream areas from V1 to PFC]
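The supervised stage (~ linear perceptron) can be sketched as follows. The feature vectors here stand in for the model's top-level (C2b/C3-like) unit responses, and the names are illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Linear perceptron readout: the only task-dependent, supervised
    stage, sitting on top of the task-independent feature hierarchy.
    y is +/-1 (e.g. animal vs. non-animal)."""
    w = np.zeros(X.shape[1] + 1)                  # weights + bias
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias input
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                # misclassified -> update
                w += lr * yi * xi
    return w

def classify(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```

Because the features are already selective and invariant, a linear rule trained on a handful of labeled examples can suffice for the categorization task.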
55. Feedforward hierarchical model
[Figure: same large-scale model diagram as slide 41: model layers S1 - S4 with RF sizes and numbers of units, dorsal 'where' and ventral 'what' pathways, tuning and MAX operations, main and bypass routes; unsupervised, task-independent learning followed by supervised, task-dependent learning for the animal vs. non-animal classification units]