Mechanisms of bottom-up and top-down processing in visual perception
1. Mechanisms of bottom-up and top-down processing in visual perception
Thomas Serre
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
3. Rapid recognition: human behavior
Potter 1971, 1975; see also Biederman 1972; Thorpe 1996. Movie courtesy of Jim DiCarlo
6. Rapid recognition: human behavior
Gist of the scene at 7 images/s from an unpredictable, random sequence of images
No time for eye movements
No top-down processing / expectations
Feedforward processing: coarse / base image representation
Potter 1971, 1975; see also Biederman 1972; Thorpe 1996. Movie courtesy of Jim DiCarlo
11. Outline
1. Rapid recognition and feedforward processing:
Loose hierarchy of image fragments
“Clutter problem”
2. Beyond feedforward processing:
Top-down cortical feedback and attention to solve the “clutter problem”
Predicting human eye movements
17. Object recognition in the visual cortex
Hierarchical architecture of the ventral visual stream:
Latencies
Anatomy
Function
source: Jim DiCarlo
18. Object recognition in the visual cortex
Nobel prize 1981
Hubel & Wiesel 1959, 1962, 1965, 1968
19. Object recognition in the visual cortex
Gradual increase in complexity of the preferred stimulus
Kobatake & Tanaka 1994
see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
20. Object recognition in the visual cortex
Parallel increase in the invariance properties (position and scale) of neurons
Kobatake & Tanaka 1994
see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
21. Model
[Figure: the model (right) set against the anatomy of the primate visual cortex (left). Dorsal 'where' pathway: PO, V3A, MT, MSTc/MSTp, FST, VIP, LIP, DP, 7a, PP, STP/rostral STS, PG cortex. Ventral 'what' pathway: V1, V2, V3, V4, PIT/TF, AIT, 36, 35, TG, TE, TPO, PGa, IPa, TEa, TEm, feeding prefrontal areas 8, 11, 12, 13, 45, 46 for the animal vs. non-animal classification. The model alternates simple-cell layers (Tuning) and complex-cell layers (MAX), connected by main routes and bypass routes; complexity (number of subunits), RF size and invariance increase along the hierarchy. Stages up to IT use task-independent, unsupervised learning; the final classification stage uses task-dependent, supervised learning.]
Model layers, RF sizes and numbers of units (as recoverable from the figure):
S1: 0.2°–1.1°, ~10^6 units
C1: 0.4°–1.6°, ~10^4 units
S2: 0.6°–2.4°, ~10^7 units
C2: 1.1°–3.0°, ~10^5 units
S2b: 0.9°–4.4°, ~10^7 units
S3: 1.2°–3.2°, ~10^4 units
C2b: ~7°, ~10^3 units
C3: ~7°, ~10^3 units
S4: ~7°, ~10^2 units
classification units: ~10^0
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
22. Model
Large-scale (10^8 units); spans several areas of the visual cortex.
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
23. Model
Combination of forward and reverse engineering.
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
24. Model
Shown to be consistent with many experimental data across areas of visual cortex (V1, V2, V4, MT and IT).
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
25. Two functional classes of cells
Simple cells: template matching; Gaussian-like tuning (~“AND”)
Complex cells: invariance; max-like operation (~“OR”)
Riesenhuber & Poggio 1999 (building on Fukushima 1980 and Hubel & Wiesel 1962)
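To make these two operations concrete, here is a minimal Python sketch (a toy illustration under my own naming, not the authors' implementation) of a simple unit's Gaussian-like tuning and a complex unit's max-like pooling:

import numpy as np

def simple_unit(afferents, template, sigma=1.0):
    # Gaussian-like tuning (~"AND"): the response peaks when the pattern
    # of afferent activity matches the stored template.
    return np.exp(-np.sum((afferents - template) ** 2) / (2 * sigma ** 2))

def complex_unit(afferents):
    # Max-like pooling (~"OR"): the unit inherits the selectivity of its
    # afferents but gains position/scale tolerance, since the response
    # follows the single strongest afferent.
    return np.max(afferents)

# Example: a complex unit pooling simple units tuned to the same template
# at three nearby positions stays active as the stimulus shifts.
template = np.array([1.0, 0.0, 1.0])
responses = [simple_unit(np.roll(template, k), template) for k in range(3)]
print(complex_unit(responses))  # ~1.0: tolerant to the shift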
26. Model (figure and layer table repeated from slide 21)
34. Hierarchy of image fragments
Unsupervised learning of frequent image fragments during development
Reusable fragments shared across categories
Large redundant vocabulary for implicit geometry
[Figure: hierarchy of fragments from V1 up to IT, with category-selective units read out by a linear perceptron.]
see also Ullman et al 2002
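A minimal sketch of the kind of unsupervised fragment learning described here (my toy illustration: the dictionary is built simply by sampling patches from natural images during a "developmental" phase, so frequently occurring structure dominates; names and sizes are hypothetical):

import numpy as np

def learn_fragment_dictionary(images, n_fragments=1000, patch_size=8, seed=0):
    # Sample patches at random positions from natural images; because the
    # patches follow natural image statistics, frequent fragments are
    # well represented in the resulting dictionary.
    rng = np.random.default_rng(seed)
    fragments = []
    for _ in range(n_fragments):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch_size + 1)
        x = rng.integers(img.shape[1] - patch_size + 1)
        fragments.append(img[y:y + patch_size, x:x + patch_size].copy())
    return fragments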
35. Model vs. IT
[Bar plot: classification performance (0–1) for IT neurons and for the model. A linear classifier is trained at size 3.4°, center position (TRAIN), and tested at new sizes (1.7°, 6.8°) and new positions (2° horz., 4° horz.).]
Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
Experimental data: Hung* Kreiman* Poggio & DiCarlo 2005
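The logic of this readout comparison fits in a few lines: train a linear classifier on population responses (neural or model) at the trained size/position, then test at untrained sizes/positions; above-chance generalization measures the invariance carried by the representation. A hedged sketch (hypothetical array shapes; scikit-learn's linear SVM stands in for the regularized classifiers typically used in such studies):

import numpy as np
from sklearn.svm import LinearSVC

def readout_generalization(resp_train, labels, resp_test):
    # resp_train: (n_stimuli, n_units) responses at the TRAIN size/position;
    # resp_test: the same stimuli presented at a new size or position.
    clf = LinearSVC(C=1.0).fit(resp_train, labels)
    return clf.score(resp_test, labels)  # accuracy at the untrained condition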
36. Is this model sufficient to explain performance in rapid categorization tasks?
Image–mask paradigm: image (20 ms), blank interval (30 ms ISI), 1/f-noise mask (80 ms); report “animal present or not?”
Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005
41. Rapid categorization
[Bar plot: performance (d', roughly 1.0–2.6) for the model (82% correct) and human observers (80% correct), broken down by animal subcategory: head, close-body, medium-body, far-body, against natural distractors.]
Serre Oliva & Poggio 2007
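For reference, d' is the standard signal-detection sensitivity index computed from hit and false-alarm rates; a quick sketch (the example numbers are illustrative, not from the study):

from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    # Sensitivity index: z(hits) - z(false alarms).
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# e.g. 90% hits and 30% false alarms:
print(d_prime(0.90, 0.30))  # ~1.81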
42. “Clutter effect”
This limitation of the feedforward model is compatible with the reduced selectivity observed in the presence of clutter in V4 (Reynolds et al 1999) and IT (Zoccolan et al 2005, 2007; Rolls et al 2003).
[Figure: recording site in a monkey’s IT; comparison of model, IT neurons and fMRI.]
Meyers Freiwald Embark Kreiman Serre Poggio in prep
44. Summary I
Rapid categorization seems compatible with a model based on a feedforward hierarchy of image fragments.
Consistent with psychophysics, a key limitation of the architecture is recognition in clutter.
How does the visual system overcome this limitation?
45. Outline
1. Rapid recognition and feedforward processing:
Loose hierarchy of image fragments
“Clutter problem”
2. Beyond feedforward processing:
Top-down cortical feedback and attention to solve the “clutter problem”
Predicting human eye movements
46. Spatial attention solves the “clutter problem”
[Figure: attending to the foreground separates it from the background clutter.]
Problem: how do we know where to attend?
see also Broadbent 1952, 1954; Treisman 1960; Treisman & Gelade 1980; Duncan & Desimone 1995; Wolfe 1997; and many others
51. Spatial attention solves the “clutter problem”
Bichot, Rossi & Desimone, “Parallel and Serial Neural Mechanisms for Visual Search in Macaque Area V4”, Science, 22 April 2005, Vol. 308, no. 5721, pp. 529–534
Answer: parallel feature-based attention
see also Broadbent 1952, 1954; Treisman 1960; Treisman & Gelade 1980; Duncan & Desimone 1995; Wolfe 1997; and many others
53. Parallel feature-based attention modulation
[Plots: normalized spike activity (0–2) vs. time from fixation (0–200 ms), two panels.]
54. Serial spatial attention modulation
Test for serial (spatial) selection: attend within the RF vs. attend away from the RF.
[Plot: normalized spike activity (0–2) vs. time from fixation (0–200 ms), comparing trials where the RF stimulus is the target of the saccade vs. trials where it is not.]
Fig. 4 caption (from the source figure): “Illustration of the saccade enhancement analysis. We compared neuronal measures when the monkey made a saccade to an RF stimulus versus a saccade away from the RF. In this dis…”
55. Attention as Bayesian inference
[Diagram: graphical model mapped onto cortical areas. PFC encodes object priors O; IT encodes features F^i; FEF/LIP encodes location priors L and mediates spatial attention; V4/PIT encodes location-tagged features F^i_l; V2 provides the image evidence I. Feature-based attention descends from PFC through IT to V4/PIT; a plate indicates the N feature types.]
Chikkerur Serre & Poggio in prep
see also Rao 2005; Lee & Mumford 2003
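The messages on the next slide are consistent with the following factorization of the joint distribution over the variables in the diagram (my reconstruction from the messages shown, not a formula printed on the slides):

$$P(I, \{F^i_l\}, \{F^i\}, L, O) \;=\; P(O)\, P(L) \prod_{i=1}^{N} P(F^i \mid O)\, P(F^i_l \mid F^i, L)\, P(I \mid F^i_l)$$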
60. Attention as Bayesian inference
Feature-based attention (“Where is object O?”) via belief propagation (messages shown for a single feature for clarity):
$m_{\mathrm{LIP}\to\mathrm{V4}} = P(L)$
$m_{\mathrm{IT}\to\mathrm{V4}} = P(F^i \mid O)$
$m_{\mathrm{V4}\to\mathrm{IT}} = \sum_{L}\sum_{F^i_l} P(F^i_l \mid F^i, L)\, P(L)\, P(I \mid F^i_l)$
$m_{\mathrm{V4}\to\mathrm{LIP}} = \sum_{F^i}\sum_{F^i_l} P(F^i_l \mid F^i, L)\, P(F^i \mid O)\, P(I \mid F^i_l)$

61. Attention as Bayesian inference
Spatial attention (“What is at location L?”) passes the same messages in the opposite direction through the network.
Chikkerur Serre & Poggio in prep
see also Rao 2005; Lee & Mumford 2003
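As a concrete illustration of the inference these messages implement, here is a toy exact computation of the spatial attention map, P(L | I, O) ∝ P(L) Σ_{F^i} Σ_{F^i_l} P(F^i | O) P(F^i_l | F^i, L) P(I | F^i_l), for a single feature with small discrete variables (my sketch with made-up conditional tables, not the authors' code):

import numpy as np

n_loc, n_feat = 4, 3                    # |L| locations, |F| feature identities
n_fl = n_feat * n_loc                   # |F_l| = feature-at-location states

rng = np.random.default_rng(0)
P_L = np.ones(n_loc) / n_loc            # location prior (LIP/FEF)
P_F_given_O = rng.dirichlet(np.ones(n_feat))  # feature prior for target O (PFC/IT)
lik_I_given_Fl = rng.random(n_fl)       # bottom-up evidence P(I | F_l) (V2 -> V4)

# P(F_l | F, L): "feature f appears at location l" as deterministic indexing.
P_Fl_given_FL = np.zeros((n_fl, n_feat, n_loc))
for f in range(n_feat):
    for l in range(n_loc):
        P_Fl_given_FL[f * n_loc + l, f, l] = 1.0

# Posterior over locations = spatial attention map.
evidence = np.einsum('xfl,f,x->l', P_Fl_given_FL, P_F_given_O, lik_I_given_Fl)
post_L = P_L * evidence
post_L /= post_L.sum()
print(post_L)  # largest where the image supports the target's features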
62. Model performance improves with attention
[Bar plot: performance (d', 0–3) for no attention vs. one shift of attention, for the model and for human observers (mask and no-mask conditions).]
Chikkerur Serre & Poggio in prep
67. Agreement with neurophysiology data
Feature-based attention:
Differential modulation for preferred vs. non-preferred stimulus (Bichot et al 2005)
Spatial attention:
Gain modulation of neurons’ tuning curves (McAdams & Maunsell 1999)
Competitive mechanisms in V2 and V4 (Reynolds et al 1999)
Improved readout in clutter (being tested in collaboration with the Desimone lab)
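The McAdams & Maunsell result amounts to a multiplicative scaling of the tuning curve without sharpening; a small sketch (the tuning parameters and the gain value here are illustrative only):

import numpy as np

def orientation_tuning(theta_deg, pref=0.0, sigma=20.0, r_max=30.0, base=2.0):
    # Gaussian orientation tuning curve, in spikes/s.
    return base + r_max * np.exp(-(theta_deg - pref) ** 2 / (2 * sigma ** 2))

theta = np.linspace(-90, 90, 181)
unattended = orientation_tuning(theta)
attended = 1.3 * unattended  # pure gain change: same preferred orientation
                             # and tuning width, larger amplitude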
68. IT readout improves with attention
Train the readout classifier on the isolated object; test while the monkey attends toward or away from the object in a cluttered display.
[Plot: average rank of the correct object (7–9, lower is better) vs. time (0–2000 ms), aligned to cue and transient change, for attention on the object, attention away from the object, and object not shown; n = 34.]
Zhang Meyers Serre Bichot Desimone Poggio in prep
75. Matching human eye movements
Dataset: 100 street-scene images with cars & pedestrians, and 20 without.
Experiment: 8 participants asked to count the number of cars/pedestrians; blocked / randomized presentations; each image presented twice; eye movements recorded using an infrared eye tracker.
Eye movements as a proxy for attention.
Chikkerur Tan Serre & Poggio in sub
80. Matching human eye movements
[Plot: fraction of human fixations (25%–100%) falling inside the model’s saliency map vs. % of the image covered by the map (10%–30%); the area under this ROC-style curve summarizes the agreement.]
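A sketch of this evaluation (my toy implementation of the standard saliency-ROC idea, not the authors' code): threshold the saliency map at decreasing levels and, for each coverage fraction, count how many human fixations land inside the selected region.

import numpy as np

def fixations_covered(saliency, fixations, coverages=(0.10, 0.20, 0.30)):
    # For each target coverage, keep the most-salient pixels covering that
    # fraction of the image, then report the fraction of human fixations
    # that land inside the selected region.
    sorted_vals = np.sort(saliency.ravel())[::-1]
    out = {}
    for cov in coverages:
        thresh = sorted_vals[int(cov * sorted_vals.size) - 1]
        inside = [saliency[r, c] >= thresh for (r, c) in fixations]
        out[cov] = float(np.mean(inside))
    return out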
82. Results
[Bar plot: ROC area (0–1) for car and pedestrian search, comparing humans, a bottom-up saliency model, and the top-down (feature-based) model.]
Chikkerur Tan Serre & Poggio in sub
Editor's Notes
Thank you very much, Charles, for inviting me. I am delighted to be here, enjoying weather that we could never hope for in the spring in Boston...
Here is the problem I am trying to solve: You give me an image and I tell you for instance whether or not it contains an animal.
Object recognition is a very hard computational problem. The reason is that, despite the fact that all of these are images of a giraffe, they look quite different at the pixel level. Objects in the real world, and these animal images in particular, can vary drastically in their appearance, shape and texture.
In particular, changes in position and scale can create very large changes in the pattern of activity they elicit on the retina. Think about that: even a small shift in position of 2 degrees of visual angle corresponds to shifting the image on the retina by more than 120 photoreceptors!
This is an extremely difficult task, and today no artificial computer vision system can do it as robustly and accurately as the primate visual system.
However, as primates, we are extremely good at solving this task despite all these variations...
A classical paradigm that has been extensively used to study object recognition and visual perception is what I would call the rapid recognition paradigm.
Here I am flashing images in rapid succession. This paradigm is called RSVP and was introduced by Molly Potter in the 70’s. Images are presented at a rate of 7/s. At this speed you probably don’t get every detail in the image, but at the very least you are able to build a coarse description of the scene. For instance, most of you should be able to recognize and perhaps memorize objects in these images...
While these types of task do not necessarily reflect natural everyday vision, in which the visual world moves continuously and you are free to move your eyes and shift your attention, they isolate the first 100-150 ms of visual processing, during which a base representation for images is formed before more complex visual routines can come into play...
In this talk I will argue that this base representation corresponds to the activation of a hierarchy of image fragments following a single feedforward sweep through the visual system. This bottom-up feedforward sweep rapidly activates specific sub-populations of neurons in the ventral stream of the visual cortex that are tuned to image fragments with different levels of selectivity and invariance.
I will show that, consistent with human psychophysics, a key limitation of this architecture is that it is susceptible to clutter. While it does relatively well on images that contain a single object and limited clutter (like the ones I just showed you), we found that performance decreases significantly with increased amounts of clutter.
In the second part of my talk I will argue that the way the visual system solves this clutter problem is via cortical feedback and shifts of attention. I will outline an integrated model of object recognition and attention. I will show that the object recognition performance of the model increases significantly when used in conjunction with attentional mechanisms. Using eye movements as a proxy for attention, I will show that the resulting model can account for a significant fraction of human eye movements during search tasks in complex natural images.
We have implemented a computational model (shown on the right) that implements these principles.
Van Essen on the left. We do not try to account for the whole visual cortex, only the ventral stream of the visual cortex...
The model is hierarchical with only feedforward connections.
Computational considerations suggest that you need two types of operations, and therefore two functional classes of cells, to explain those data.
By analogy with Hubel & Wiesel’s hierarchical model of processing in the visual cortex, we have called these two classes of cells simple and complex. The scheme that I am going to describe essentially extends their proposal from striate to extrastriate visual areas.
We have assumed that these two types of functional units implement two types of computations or mathematical operations: Gaussian-like (bell-shaped) tuning and a max-like operation.
The Gaussian tuning was motivated by a learning algorithm called Radial Basis Functions, while the max operation was motivated by the standard scanning approach in computer vision and by theoretical arguments from signal processing.
The goal of the simple units is to increase the complexity of the representation, here in this example by pooling together the activity of afferent units with different orientations via Gaussian-like tuning. This Gaussian tuning is ubiquitous in the visual cortex, from orientation tuning in V1 to tuning for complex objects around certain poses in IT.
The complex units pool together afferent units with the same preferred stimulus, e.g. a vertical bar, but slightly different positions and scales. At the complex-unit level we thus build some tolerance with respect to the exact position and scale of the stimulus within the receptive field of the unit.
EMPHASIZE AFTER TRAINING: NO DATA FITTING
MENTION CHARLES
It builds a simple-to-complex cell hierarchy.
It mimics as closely as possible the tuning properties of neurons in various areas of the ventral stream.
Builds on earlier work in the lab by Max Riesenhuber.
-- I would argue that a key aspect of this model is the learning of a large dictionary of reusable features (I would call them shape components) from V1 to IT. These features represent a basic vocabulary of shape components that can be used to represent any visual input. They correspond to patches of images which appear with high probability in the natural world. We argue that learning of this dictionary is done UNSUPERVISED during a developmental period.
-- In this model, the goal of the ventral stream of the visual cortex, from V1 to IT, is to build a good representation for images, i.e. a representation which is compact and invariant with respect to 2D transformations such as translation and scale.
-- With a good image representation, learning a new image category is relatively easy. We speculate that this can be done from a handful of labeled examples by training task-specific circuits running from IT to the PFC.
We showed that it worked well on multiple object categories on standard computer vision databases.
For the sake of time, I am only going to show you that you can simulate a neurophysiology experiment with this model. You can record from populations of random neurons and perform the exact same analysis as in a real experiment. On the bar plot shown here we performed the exact same readout experiment as in the study by Hung et al. What is shown is the classification performance when training at a specific position and scale and evaluating the generalization capability of the classifier to positions and scales not presented during training. This measures the built-in invariance inherited from the response properties of populations of neurons, and you can see that the fit is quite good.
In parallel, we have used this model in real-world computer vision applications. For instance, we have developed a computer vision system for the automatic parsing of street-scene images. Here are examples of automatic parsing by the system, overlaid on the original images. The colors and bounding boxes indicate predictions from the model (e.g. green for trees, etc.).
The computer vision system shown here is based exclusively on the response properties of the model units.
More recently we have extended the approach for the recognition of human actions such as running, walking, jogging, jumping, waving etc...
In all cases we have shown that the resulting biologically motivated computer vision systems performed on par with, or better than, state-of-the-art computer vision systems.
The goal of the model was not to explain natural everyday vision, when you are free to move your eyes and shift your attention, but rather what is often called rapid or immediate recognition, which corresponds to the first 100-150 ms of visual processing (when an image is briefly presented), i.e. when the visual system is forced to operate in a feedforward mode, before eye movements and shifts of attention.
Here is an example on the left. I flash an image for a couple of ms; you probably don’t have time to get every fine detail of this image, but most people are able to say whether it contains an animal or not.
Here we divided our dataset into 4 subcategories: head... Overall, both the model and humans do about 80% on this very difficult task, and you can see that they agree quite well in terms of how they perform on these 4 subcategories...
We have seen that in the model and in the visual cortex, when two stimuli fall within the receptive field of a neuron, the two stimuli “compete”, that is they reduce the selectivity of the neurons. I just showed you that at the psychophysical level, the amount of clutter in an image largely determines the performance of the model and of human observers during rapid categorization tasks.
Using eye movements as a correlate of attention. The assumption is that attention reaches an item just before the eyes move, so when the eyes move we can assume that attention was there just beforehand.
Here is the original model: we added back-projections to account for these attentional modulations.
We assume that feature-based attention acts through a cascade of top-down connections through the ventral stream, originating in the PFC, where a template of the target object is held in memory, all the way down to V4 and possibly lower areas. We also assume a spatial attention modulation originating from the parietal cortex (here I am assuming LIP, based on limited experimental evidence).
These attentional mechanisms can be cast in a probabilistic Bayesian framework whereby the parietal cortex represents location variables and the ventral stream represents feature variables. These are our image fragments. Variables for the target object are encoded in higher areas such as PFC...
This framework is inspired by an earlier model by Rao to explain spatial attention, and it is a special case of the computational model of the visual cortex described by David Mumford, which probably most of you know...
Here, the way we implemented that is via belief propagation in polytrees (the messages are shown for the simplified case of a single feature for clarity).
Within this framework, spatial attention can be described as a series of messages from L to F^i_l to F^i to O, while feature-based attention goes the opposite way.
Thus the model makes specific predictions about the timing of visual areas in the ventral stream and the parietal cortex, depending on the task at hand.
Obviously I am leaving a lot of details open, unfortunately...
We have implemented the approach in the context of our animal search task.
The model mostly improves in the medium- and far-body conditions.
Unlike artificial search arrays, where arbitrary objects are simply placed at random on a display, natural scenes are highly structured.
This is a point that has been made by Antonio Torralba and Aude Oliva: global features can provide a good representation of the gist of the scene, which is sufficient to associate contextual information from the visual scene with likely object locations, like here, for instance, where you would expect people mostly in these darker regions...