Mechanisms of bottom-up and top-down processing in visual perception
1. Mechanisms of bottom-up and top-down processing in visual perception
Thomas Serre
McGovern Institute for Brain Research
Department of Brain & Cognitive Sciences
Massachusetts Institute of Technology
3. Rapid recognition: human behavior
Potter 1971, 1975; see also Biederman 1972; Thorpe 1996. Movie courtesy of Jim DiCarlo
6. Rapid recognition: human behavior
Gist of the scene at 7 images/s from an unpredictable, random sequence of images
No time for eye movements
No top-down processing / expectations
Feedforward processing: coarse / base image representation
Potter 1971, 1975; see also Biederman 1972; Thorpe 1996. Movie courtesy of Jim DiCarlo
11. Outline
1. Rapid recognition and feedforward processing:
Loose hierarchy of image fragments
“Clutter problem”
2. Beyond feedforward processing:
Top-down cortical feedback and attention to solve the “clutter problem”
Predicting human eye movements
17. Object recognition in the visual cortex
Hierarchical architecture of the ventral visual stream:
Latencies
Anatomy
Function
source: Jim DiCarlo
18. Object recognition in the visual cortex
Nobel prize 1981
Hubel & Wiesel 1959, 1962, 1965, 1968
19. Object recognition in the visual cortex
Gradual increase in complexity of the preferred stimulus
Kobatake & Tanaka 1994
see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
20. Object recognition in the visual cortex
Parallel increase in the invariance properties (position and scale) of neurons
Kobatake & Tanaka 1994
see also Oram & Perrett 1993; Sheinberg & Logothetis 1996; Gallant et al 1996; Riesenhuber & Poggio 1999
21. Model
[Figure: the model (right) set against the anatomy of the primate visual cortex (left). Dorsal 'where' pathway: PO, V3A, MT, MSTc/MSTp, FST, VIP, LIP, DP, 7a, PP, STP/rostral STS, PG cortex. Ventral 'what' pathway: V1, V2, V3, V4, PIT/TF, AIT, 36, 35, TG, TE, TPO, PGa, IPa, TEa, TEm, feeding prefrontal areas 8, 11, 12, 13, 45, 46 for the animal vs. non-animal classification. The model alternates simple-cell layers (Tuning) and complex-cell layers (MAX), connected by main routes and bypass routes; complexity (number of subunits), RF size and invariance increase along the hierarchy. Stages up to IT use task-independent, unsupervised learning; the final classification stage uses task-dependent, supervised learning.]
Model layers, RF sizes and numbers of units (as recoverable from the figure):
S1: 0.2°–1.1°, ~10^6 units
C1: 0.4°–1.6°, ~10^4 units
S2: 0.6°–2.4°, ~10^7 units
C2: 1.1°–3.0°, ~10^5 units
S2b: 0.9°–4.4°, ~10^7 units
S3: 1.2°–3.2°, ~10^4 units
C2b: ~7°, ~10^3 units
C3: ~7°, ~10^3 units
S4: ~7°, ~10^2 units
classification units: ~10^0
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
22. Model
Large-scale (10^8 units); spans several areas of the visual cortex.
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
23. Model
Combination of forward and reverse engineering.
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
24. Model
Shown to be consistent with many experimental data across areas of visual cortex (V1, V2, V4, MT and IT).
(Same model figure and layer table as slide 21.)
Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
25. Two functional classes of cells
Simple cells: template matching; Gaussian-like tuning (~“AND”)
Complex cells: invariance; max-like operation (~“OR”)
Riesenhuber & Poggio 1999 (building on Fukushima 1980 and Hubel & Wiesel 1962)
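To make these two operations concrete, here is a minimal Python sketch (a toy illustration under my own naming, not the authors' implementation) of a simple unit's Gaussian-like tuning and a complex unit's max-like pooling:

import numpy as np

def simple_unit(afferents, template, sigma=1.0):
    # Gaussian-like tuning (~"AND"): the response peaks when the pattern
    # of afferent activity matches the stored template.
    return np.exp(-np.sum((afferents - template) ** 2) / (2 * sigma ** 2))

def complex_unit(afferents):
    # Max-like pooling (~"OR"): the unit inherits the selectivity of its
    # afferents but gains position/scale tolerance, since the response
    # follows the single strongest afferent.
    return np.max(afferents)

# Example: a complex unit pooling simple units tuned to the same template
# at three nearby positions stays active as the stimulus shifts.
template = np.array([1.0, 0.0, 1.0])
responses = [simple_unit(np.roll(template, k), template) for k in range(3)]
print(complex_unit(responses))  # ~1.0: tolerant to the shift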
26. Model (figure and layer table repeated from slide 21)
34. Hierarchy of image fragments
Unsupervised learning of frequent image fragments during development
Reusable fragments shared across categories
Large redundant vocabulary for implicit geometry
[Figure: hierarchy of fragments from V1 up to IT, with category-selective units read out by a linear perceptron.]
see also Ullman et al 2002
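A minimal sketch of the kind of unsupervised fragment learning described here (my toy illustration: the dictionary is built simply by sampling patches from natural images during a "developmental" phase, so frequently occurring structure dominates; names and sizes are hypothetical):

import numpy as np

def learn_fragment_dictionary(images, n_fragments=1000, patch_size=8, seed=0):
    # Sample patches at random positions from natural images; because the
    # patches follow natural image statistics, frequent fragments are
    # well represented in the resulting dictionary.
    rng = np.random.default_rng(seed)
    fragments = []
    for _ in range(n_fragments):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch_size + 1)
        x = rng.integers(img.shape[1] - patch_size + 1)
        fragments.append(img[y:y + patch_size, x:x + patch_size].copy())
    return fragments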
35. Model vs. IT
[Bar plot: classification performance (0–1) for IT neurons and for the model. A linear classifier is trained at size 3.4°, center position (TRAIN), and tested at new sizes (1.7°, 6.8°) and new positions (2° horz., 4° horz.).]
Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005
Experimental data: Hung* Kreiman* Poggio & DiCarlo 2005
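The logic of this readout comparison fits in a few lines: train a linear classifier on population responses (neural or model) at the trained size/position, then test at untrained sizes/positions; above-chance generalization measures the invariance carried by the representation. A hedged sketch (hypothetical array shapes; scikit-learn's linear SVM stands in for the regularized classifiers typically used in such studies):

import numpy as np
from sklearn.svm import LinearSVC

def readout_generalization(resp_train, labels, resp_test):
    # resp_train: (n_stimuli, n_units) responses at the TRAIN size/position;
    # resp_test: the same stimuli presented at a new size or position.
    clf = LinearSVC(C=1.0).fit(resp_train, labels)
    return clf.score(resp_test, labels)  # accuracy at the untrained condition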
36. Is this model sufficient to explain performance in rapid categorization tasks?
Image–mask paradigm: image (20 ms), blank interval (30 ms ISI), 1/f-noise mask (80 ms); report “animal present or not?”
Thorpe et al 1996; Van Rullen & Koch 2003; Bacon-Mace et al 2005
41. Rapid categorization
[Bar plot: performance (d', roughly 1.0–2.6) for the model (82% correct) and human observers (80% correct), broken down by animal subcategory: head, close-body, medium-body, far-body, against natural distractors.]
Serre Oliva & Poggio 2007
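For reference, d' is the standard signal-detection sensitivity index computed from hit and false-alarm rates; a quick sketch (the example numbers are illustrative, not from the study):

from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    # Sensitivity index: z(hits) - z(false alarms).
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# e.g. 90% hits and 30% false alarms:
print(d_prime(0.90, 0.30))  # ~1.81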
42. “Clutter effect”
This limitation of the feedforward model is compatible with the reduced selectivity observed in the presence of clutter in V4 (Reynolds et al 1999) and IT (Zoccolan et al 2005, 2007; Rolls et al 2003).
[Figure: recording site in a monkey’s IT; comparison of model, IT neurons and fMRI.]
Meyers Freiwald Embark Kreiman Serre Poggio in prep
44. Summary I
Rapid categorization seems compatible with a model based on a feedforward hierarchy of image fragments.
Consistent with psychophysics, a key limitation of the architecture is recognition in clutter.
How does the visual system overcome this limitation?
45. Outline
1. Rapid recognition and feedforward processing:
Loose hierarchy of image fragments
“Clutter problem”
2. Beyond feedforward processing:
Top-down cortical feedback and attention to solve the “clutter problem”
Predicting human eye movements
46. Spatial attention solves the “clutter problem”
[Figure: attending to the foreground separates it from the background clutter.]
Problem: how do we know where to attend?
see also Broadbent 1952, 1954; Treisman 1960; Treisman & Gelade 1980; Duncan & Desimone 1995; Wolfe 1997; and many others
51. Spatial attention solves the “clutter problem”
Bichot, Rossi & Desimone, “Parallel and Serial Neural Mechanisms for Visual Search in Macaque Area V4”, Science, 22 April 2005, Vol. 308, no. 5721, pp. 529–534
Answer: parallel feature-based attention
see also Broadbent 1952, 1954; Treisman 1960; Treisman & Gelade 1980; Duncan & Desimone 1995; Wolfe 1997; and many others
53. Parallel feature-based attention modulation
[Plots: normalized spike activity (0–2) vs. time from fixation (0–200 ms), two panels.]
54. Serial spatial attention modulation
Test for serial (spatial) selection: attend within the RF vs. attend away from the RF.
[Plot: normalized spike activity (0–2) vs. time from fixation (0–200 ms), comparing trials where the RF stimulus is the target of the saccade vs. trials where it is not.]
Fig. 4 caption (from the source figure): “Illustration of the saccade enhancement analysis. We compared neuronal measures when the monkey made a saccade to an RF stimulus versus a saccade away from the RF. In this dis…”
55. Attention as Bayesian inference
[Diagram: graphical model mapped onto cortical areas. PFC encodes object priors O; IT encodes features F^i; FEF/LIP encodes location priors L and mediates spatial attention; V4/PIT encodes location-tagged features F^i_l; V2 provides the image evidence I. Feature-based attention descends from PFC through IT to V4/PIT; a plate indicates the N feature types.]
Chikkerur Serre & Poggio in prep
see also Rao 2005; Lee & Mumford 2003
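The messages on the next slide are consistent with the following factorization of the joint distribution over the variables in the diagram (my reconstruction from the messages shown, not a formula printed on the slides):

$$P(I, \{F^i_l\}, \{F^i\}, L, O) \;=\; P(O)\, P(L) \prod_{i=1}^{N} P(F^i \mid O)\, P(F^i_l \mid F^i, L)\, P(I \mid F^i_l)$$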
60. Attention as Bayesian inference
Feature-based attention (“Where is object O?”) via belief propagation (messages shown for a single feature for clarity):
$m_{\mathrm{LIP}\to\mathrm{V4}} = P(L)$
$m_{\mathrm{IT}\to\mathrm{V4}} = P(F^i \mid O)$
$m_{\mathrm{V4}\to\mathrm{IT}} = \sum_{L}\sum_{F^i_l} P(F^i_l \mid F^i, L)\, P(L)\, P(I \mid F^i_l)$
$m_{\mathrm{V4}\to\mathrm{LIP}} = \sum_{F^i}\sum_{F^i_l} P(F^i_l \mid F^i, L)\, P(F^i \mid O)\, P(I \mid F^i_l)$

61. Attention as Bayesian inference
Spatial attention (“What is at location L?”) passes the same messages in the opposite direction through the network.
Chikkerur Serre & Poggio in prep
see also Rao 2005; Lee & Mumford 2003
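As a concrete illustration of the inference these messages implement, here is a toy exact computation of the spatial attention map, P(L | I, O) ∝ P(L) Σ_{F^i} Σ_{F^i_l} P(F^i | O) P(F^i_l | F^i, L) P(I | F^i_l), for a single feature with small discrete variables (my sketch with made-up conditional tables, not the authors' code):

import numpy as np

n_loc, n_feat = 4, 3                    # |L| locations, |F| feature identities
n_fl = n_feat * n_loc                   # |F_l| = feature-at-location states

rng = np.random.default_rng(0)
P_L = np.ones(n_loc) / n_loc            # location prior (LIP/FEF)
P_F_given_O = rng.dirichlet(np.ones(n_feat))  # feature prior for target O (PFC/IT)
lik_I_given_Fl = rng.random(n_fl)       # bottom-up evidence P(I | F_l) (V2 -> V4)

# P(F_l | F, L): "feature f appears at location l" as deterministic indexing.
P_Fl_given_FL = np.zeros((n_fl, n_feat, n_loc))
for f in range(n_feat):
    for l in range(n_loc):
        P_Fl_given_FL[f * n_loc + l, f, l] = 1.0

# Posterior over locations = spatial attention map.
evidence = np.einsum('xfl,f,x->l', P_Fl_given_FL, P_F_given_O, lik_I_given_Fl)
post_L = P_L * evidence
post_L /= post_L.sum()
print(post_L)  # largest where the image supports the target's features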
62. Model performance improves with attention
[Bar plot: performance (d', 0–3) for no attention vs. one shift of attention, for the model and for human observers (mask and no-mask conditions).]
Chikkerur Serre & Poggio in prep
67. Agreement with neurophysiology data
Feature-based attention:
Differential modulation for preferred vs. non-preferred stimulus (Bichot et al 2005)
Spatial attention:
Gain modulation of neurons’ tuning curves (McAdams & Maunsell 1999)
Competitive mechanisms in V2 and V4 (Reynolds et al 1999)
Improved readout in clutter (being tested in collaboration with the Desimone lab)
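The McAdams & Maunsell result amounts to a multiplicative scaling of the tuning curve without sharpening; a small sketch (the tuning parameters and the gain value here are illustrative only):

import numpy as np

def orientation_tuning(theta_deg, pref=0.0, sigma=20.0, r_max=30.0, base=2.0):
    # Gaussian orientation tuning curve, in spikes/s.
    return base + r_max * np.exp(-(theta_deg - pref) ** 2 / (2 * sigma ** 2))

theta = np.linspace(-90, 90, 181)
unattended = orientation_tuning(theta)
attended = 1.3 * unattended  # pure gain change: same preferred orientation
                             # and tuning width, larger amplitude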
68. IT readout improves with attention
Train the readout classifier on the isolated object; test while the monkey attends toward or away from the object in a cluttered display.
[Plot: average rank of the correct object (7–9, lower is better) vs. time (0–2000 ms), aligned to cue and transient change, for attention on the object, attention away from the object, and object not shown; n = 34.]
Zhang Meyers Serre Bichot Desimone Poggio in prep
75. Matching human eye movements
Dataset: 100 street-scene images with cars & pedestrians, and 20 without.
Experiment: 8 participants asked to count the number of cars/pedestrians; blocked / randomized presentations; each image presented twice; eye movements recorded using an infrared eye tracker.
Eye movements as a proxy for attention.
Chikkerur Tan Serre & Poggio in sub
80. Matching human eye movements
[Plot: fraction of human fixations (25%–100%) falling inside the model’s saliency map vs. % of the image covered by the map (10%–30%); the area under this ROC-style curve summarizes the agreement.]
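A sketch of this evaluation (my toy implementation of the standard saliency-ROC idea, not the authors' code): threshold the saliency map at decreasing levels and, for each coverage fraction, count how many human fixations land inside the selected region.

import numpy as np

def fixations_covered(saliency, fixations, coverages=(0.10, 0.20, 0.30)):
    # For each target coverage, keep the most-salient pixels covering that
    # fraction of the image, then report the fraction of human fixations
    # that land inside the selected region.
    sorted_vals = np.sort(saliency.ravel())[::-1]
    out = {}
    for cov in coverages:
        thresh = sorted_vals[int(cov * sorted_vals.size) - 1]
        inside = [saliency[r, c] >= thresh for (r, c) in fixations]
        out[cov] = float(np.mean(inside))
    return out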
82. Results
[Bar plot: ROC area (0–1) for car and pedestrian search, comparing humans, a bottom-up saliency model, and the top-down (feature-based) model.]
Chikkerur Tan Serre & Poggio in sub
Editor's Notes
Thank you very much, Charles, for inviting me. I am delighted to be here, enjoying weather that we could never hope for in the spring in Boston...
Here is the problem I am trying to solve: You give me an image and I tell you for instance whether or not it contains an animal.
Object recognition is a very hard computational problem. The reason is that, despite the fact that all of these are images of a giraffe, they look quite different at the pixel level. Objects in the real world, and these animal images in particular, can vary drastically in their appearance, shape and texture.
In particular, changes in position and scale can create very large changes in the pattern of activity they elicit on the retina. Think about that: even a small shift in position of 2 degrees of visual angle corresponds to shifting the image on the retina by more than 120 photoreceptors!
This is an extremely difficult task, and today no artificial computer vision system can do it as robustly and accurately as the primate visual system.
However, as primates, we are extremely good at solving this task despite all these variations...
A classical paradigm that has been extensively used to study object recognition and visual perception is what I would call the rapid recognition paradigm.
Here I am flashing images in rapid succession. This paradigm is called RSVP and was introduced by Molly Potter in the 70’s. Images are presented at a rate of 7/s. At this speed you probably don’t get every detail in the image, but at the very least you are able to build a coarse description of the scene. For instance, most of you should be able to recognize and perhaps memorize objects in these images...
While these types of task do not necessarily reflect natural everyday vision, in which the visual world moves continuously and you are free to move your eyes and shift your attention, they isolate the first 100-150 ms of visual processing, during which a base representation for images is formed before more complex visual routines can come into play...
In this talk I will argue that this base representation corresponds to the activation of a hierarchy of image fragments following a single feedforward sweep through the visual system. This bottom-up feedforward sweep rapidly activates specific sub-populations of neurons in the ventral stream of the visual cortex that are tuned to image fragments with different levels of selectivity and invariance.
I will show that, consistent with human psychophysics, a key limitation of this architecture is that it is susceptible to clutter. While it does relatively well on images that contain a single object and limited clutter (like the ones I just showed you), we found that performance decreases significantly with increased amounts of clutter.
In the second part of my talk I will argue that the way the visual system solves this clutter problem is via cortical feedback and shifts of attention. I will outline an integrated model of object recognition and attention. I will show that the object recognition performance of the model increases significantly when used in conjunction with attentional mechanisms. Using eye movements as a proxy for attention, I will show that the resulting model can account for a significant fraction of human eye movements during search tasks in complex natural images.
We have implemented a computational model (shown on the right) that implements these principles.
Van Essen on the left. We do not try to account for the whole visual cortex, only the ventral stream of the visual cortex...
The model is hierarchical with only feedforward connections.
Computational considerations suggest that you need two types of operations, and therefore two functional classes of cells, to explain those data.
By analogy with Hubel & Wiesel’s hierarchical model of processing in the visual cortex, we have called these two classes of cells simple and complex. The scheme that I am going to describe essentially extends their proposal from striate to extrastriate visual areas.
We have assumed that these two types of functional units implement two types of computations or mathematical operations: Gaussian-like (bell-shaped) tuning and a max-like operation.
The Gaussian tuning was motivated by a learning algorithm called Radial Basis Functions, while the max operation was motivated by the standard scanning approach in computer vision and by theoretical arguments from signal processing.
The goal of the simple units is to increase the complexity of the representation, here in this example by pooling together the activity of afferent units with different orientations via Gaussian-like tuning. This Gaussian tuning is ubiquitous in the visual cortex, from orientation tuning in V1 to tuning for complex objects around certain poses in IT.
The complex units pool together afferent units with the same preferred stimulus, e.g. a vertical bar, but slightly different positions and scales. At the complex-unit level we thus build some tolerance with respect to the exact position and scale of the stimulus within the receptive field of the unit.
EMPHASIZE AFTER TRAINING: NO DATA FITTING
MENTION CHARLES
It builds a simple-to-complex cell hierarchy.
It mimics as closely as possible the tuning properties of neurons in various areas of the ventral stream.
Builds on earlier work in the lab by Max Riesenhuber.
-- I would argue that a key aspect of this model is the learning of a large dictionary of reusable features (I would call them shape components) from V1 to IT. These features represent a basic vocabulary of shape components that can be used to represent any visual input. They correspond to patches of images which appear with high probability in the natural world. We argue that learning of this dictionary is done UNSUPERVISED during a developmental period.
-- In this model, the goal of the ventral stream of the visual cortex, from V1 to IT, is to build a good representation for images, i.e. a representation which is compact and invariant with respect to 2D transformations such as translation and scale.
-- With a good image representation, learning a new image category is relatively easy. We speculate that this can be done from a handful of labeled examples by training task-specific circuits running from IT to the PFC.
We showed that it worked well on multiple object categories on standard computer vision databases.
For the sake of time, I am only going to show you that you can simulate a neurophysiology experiment with this model. You can record from populations of random neurons and perform the exact same analysis as in a real experiment. On the bar plot shown here we performed the exact same readout experiment as in the study by Hung et al. What is shown is the classification performance when training at a specific position and scale and evaluating the generalization capability of the classifier to positions and scales not presented during training. This measures the built-in invariance inherited from the response properties of populations of neurons, and you can see that the fit is quite good.
In parallel, we have used this model in real-world computer vision applications. For instance, we have developed a computer vision system for the automatic parsing of street-scene images. Here are examples of automatic parsing by the system, overlaid on the original images. The colors and bounding boxes indicate predictions from the model (e.g. green for trees, etc.).
The computer vision system shown here is based exclusively on the response properties of the model units.
More recently we have extended the approach for the recognition of human actions such as running, walking, jogging, jumping, waving etc...
In all cases we have shown that the resulting biologically motivated computer vision systems performed on par with, or better than, state-of-the-art computer vision systems.
The goal of the model was not to explain natural everyday vision, when you are free to move your eyes and shift your attention, but rather what is often called rapid or immediate recognition, which corresponds to the first 100-150 ms of visual processing (when an image is briefly presented), i.e. when the visual system is forced to operate in a feedforward mode, before eye movements and shifts of attention.
Here is an example on the left. I flash an image for a couple of ms; you probably don’t have time to get every fine detail of this image, but most people are able to say whether it contains an animal or not.
Here we divided our dataset into 4 subcategories: head... Overall, both the model and humans do about 80% on this very difficult task, and you can see that they agree quite well in terms of how they perform on these 4 subcategories...
We have seen that in the model and in the visual cortex, when two stimuli fall within the receptive field of a neuron, the two stimuli “compete”, that is they reduce the selectivity of the neurons. I just showed you that at the psychophysical level, the amount of clutter in an image largely determines the performance of the model and of human observers during rapid categorization tasks.
Using eye movements as a correlate of attention. The assumption is that attention reaches an item just before the eyes move, so when the eyes move we can assume that attention was there just beforehand.
Here is the original model: we added back-projections to account for these attentional modulations.
We assume that feature-based attention acts through a cascade of top-down connections through the ventral stream, originating in the PFC, where a template of the target object is held in memory, all the way down to V4 and possibly lower areas. We also assume a spatial attention modulation originating from the parietal cortex (here I am assuming LIP, based on limited experimental evidence).
These attentional mechanisms can be cast in a probabilistic Bayesian framework whereby the parietal cortex represents location variables and the ventral stream represents feature variables. These are our image fragments. Variables for the target object are encoded in higher areas such as PFC...
This framework is inspired by an earlier model by Rao to explain spatial attention, and it is a special case of the computational model of the visual cortex described by David Mumford, which probably most of you know...
Here, the way we implemented that is via belief propagation in polytrees (the messages are shown for the simplified case of a single feature for clarity).
Within this framework, spatial attention can be described as a series of messages from L to F^i_l to F^i to O, while feature-based attention goes the opposite way.
Thus the model makes specific predictions about the timing of visual areas in the ventral stream and the parietal cortex, depending on the task at hand.
Obviously I am leaving a lot of details open, unfortunately...
We have implemented the approach in the context of our animal search task.
The model mostly improves in the medium- and far-body conditions.
Unlike artificial search arrays, where arbitrary objects are simply placed at random on a display, natural scenes are highly structured.
This is a point that has been made by Antonio Torralba and Aude Oliva: global features can provide a good representation of the gist of the scene, which is sufficient to associate contextual information from the visual scene with likely object locations, like here, for instance, where you would expect people mostly in these darker regions...