IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

A High-Throughput Approach to
Discovering Good Forms of Visual
Representation

David Cox
The Rowland Institute at
Harvard

Nicolas Pinto
Jim DiCarlo
MIT BCS

The Rowland Institute at Harvard
HARVARD UNIVERSITY

Goals

1) “Building Brains”
Concrete example of real-world experiments
fundamental enabled by stream processing
hardware

Goals

1) “Building Brains”
Concrete example of real-world experiments
fundamental enabled by stream processing
hardware

2) Tricks of the trade
Some high-level highlights for how we
leverage CUDA to achieve our goals

The Problem
Why is vision hard?

The Problem
Why is vision hard?

World is 3D, but retina is 2D

The Problem
Why is vision hard?

World is 3D, but retina is 2D

The same object can cast an
inﬁnite number of dierent
images onto the retina

Transformation Identity

Bigger diﬀerence
in pixel space

The Approach:
Reverse Engineering the Brain

REVERSE

Study
Natural System

The Approach:

REVERSE FORWARD

Study Build
Natural System Artiﬁcial System

IT Neurons: Complex Stimuli

Desimone et al.

IT Cortex can do object recognition

Hung et al., 2005


REVERSE FORWARD

Study Build
Natural System Artiﬁcial System

Object Recognition

Training Set

Object Recognition

Training Set Representation

Object Recognition


Classiﬁer

Object Recognition


Classiﬁer

Test Example Representation

Object Recognition


Classiﬁer

Test Example Representation

Guess

Object Recognition
Training Set

Object Recognition
Testing Set

Object Recognition
Testing Set

“Siri”

Object Recognition
Testing Set

“Not Siri”
“Siri”

How are things done normally?

Usual Formula:


Usual Formula:

1) One grad student


Usual Formula:

1) One grad student
2) One Model (size limited by runtime)


Usual Formula:

1) One grad student
3) Performance numbers on a few
standard test sets


Usual Formula:

1) One grad student
standard test sets
4) yay. we. rock.


Usual Formula:

1) One grad student
standard test sets
4) yay. we. rock.
5) One Ph.D.

Why is this not optimal?

• Lots of parameters – can’t explore easily


• Big models are paralyzingly slow to run


• Advice from my friend:


• Advice from my friend:
“Don’t run anything that takes longer than a
week to complete, because it will just crash
halfway through anyways (or you’ll discover
a bug) and you’ll never ﬁnish your Ph.D.”

Doing things a little bit dierently


1) One grad student


1) One grad student
2) One Hundreds of Thousands of
BIG Models


1) One grad student
BIG Models
standard test sets*


1) One grad student
BIG Models
standard test sets*
4) yay. we. rock.


1) One grad student
BIG Models
standard test sets*
4) yay. we. rock.
5) One Ph.D.?

Pipeline: Biology

“Plate” a Diversity
of Organisms

Pipeline: Biology

“Plate” a Diversity Allow them
of Organisms to grow

Pipeline: Biology

Apply Challenge

Pipeline: Biology

Apply Challenge

Collect Surviving
Colonies

Pipeline: Biology

Apply Challenge

Collect Surviving
Study / Repeat
Colonies

Pipeline: Biology-Inspired Vision


Generate
Random Models


Generate Unsupervised
Random Models Learning (Video)


Generate Unsupervised Test with
Random Models Learning (Video) “screening” task



Skim o best
models



Validate on other Skim o best
tasks models

Need to break all of the implicit rules


1. Test tens to hundreds of thousands
of instantiations of biologically-inspired
hierarchical models


hierarchical models
2. Use more realistic inputs (e.g. large
quantities of video)


hierarchical models
2. Use more realistic inputs (e.g. large
quantities of video)
3. Test models which begin to approach
the scale of natural systems

Massive Computation

If you work for it, a multi-TeraFLOPS
cluster is doable for a modest price

Long, winding road of stream processing

A Match Made in Heaven
Brains are parallel, GPUs are parallel

≈


≈
Multiple scales of parallelism:


≈
“Embarrasingly” parallel: video
frames, regions


≈
“Embarrasingly” parallel: video
frames, regions
Fine-grained: independent “neurons,”
operating on overlapping inputs

Images In, Images Out

≈


≈
Image processing particularly well-suited


≈
Excellent Arithmetic Intensity: very
natural to load image patches into
shared memory


≈
Excellent Arithmetic Intensity: very
natural to load image patches into
shared memory
Data: 2D / 3D locality

These things are REALLY fast
Performance (g ops) Development Time (hours)


Matlab

C/SSE

PS3

GT200


0.3
Matlab

C/SSE

PS3

GT200


0.3
Matlab

9.0
C/SSE

PS3

GT200


0.3
Matlab

9.0
C/SSE

110.0
PS3

GT200


0.3
Matlab

9.0
C/SSE

110.0
PS3

330.0
GT200


0.3
Matlab
0.5

9.0
C/SSE

110.0
PS3

330.0
GT200


0.3
Matlab
0.5

9.0
C/SSE
10.0

110.0
PS3

330.0
GT200


0.3
Matlab
0.5

9.0
C/SSE
10.0

110.0
PS3
30.0

330.0
GT200


0.3
Matlab
0.5

9.0
C/SSE
10.0

110.0
PS3
30.0

330.0
GT200
10.0

Pipeline


tasks models

Read-out

L3
thresh/sat norm strength

Learning
normalization
neighborhood Rate
Trace
“Temp. Adv.”
“Auto-reset”
...
number of lters

L2

Learning
normalization
Rate
neighborhood
Trace
kernel
“Temp. Adv.”
size
“Auto-reset”
...
n. of lters

L1
Learning
Rate
normalization
Trace
neighborhood
“Temp. Adv.”
“Auto-reset”
...
kernel
size

number of lters

input
kernel
size

neighborhood Rate
Trace
“Temp. Adv.”
“Auto-reset”
...
number of lters

L2

Learning
normalization
Rate
neighborhood
Trace
kernel
“Temp. Adv.”
size
“Auto-reset”
...
n. of lters

L1
Learning
Rate
normalization
Trace
neighborhood
“Temp. Adv.”
“Auto-reset”
...
kernel
size

A Broad Parametric Model

Normalize
Ni = Inputi / norm(Inputneighborhd)

Compute Filter Responses
Ri = Fi ⊗ N
Ri thresh: Ri = thresh
Ri sat: Ri = sat

Determine a “Winning Filter”
Ri’ = (∑ Tk * Hk) * Ri
winner: max(Ri’)

Update Filter
Fwinning = Fwinning + learning rate * N


Normalize • Optimize “Coverage”
Ni = Inputi / norm(Inputneighborhd)

Ri = Fi ⊗ N
Ri sat: Ri = sat

Ri’ = (∑ Tk * Hk) * Ri
winner: max(Ri’)

Update Filter


Ni = Inputi / norm(Inputneighborhd) ( lters span the range of
observed inputs)
Ri = Fi ⊗ N
Ri sat: Ri = sat

Ri’ = (∑ Tk * Hk) * Ri
winner: max(Ri’)

Update Filter


observed inputs)
Ri = Fi ⊗ N
• Privilege movement of
lters in certain
Ri sat: Ri = sat
directions using
Determine a “Winning Filter” temporal information
Ri’ = (∑ Tk * Hk) * Ri
winner: max(Ri’)

Update Filter


observed inputs)
Ri = Fi ⊗ N
• Privilege movement of
lters in certain
Ri sat: Ri = sat
directions using
Determine a “Winning Filter” temporal information
Ri’ = (∑ Tk * Hk) * Ri
winner: max(Ri’)
• Expand dimensionality
greatly and then scale
Update Filter
Fwinning = Fwinning + learning rate * N back as layers progress

Dealing with Parametric Complexity


Throwing a wide net comes with its own
challenges:


challenges:
Complexity
•


challenges:
Complexity
•

Kernels must perform under widely
•
varying conditions


challenges:
Complexity
•

Kernels must perform under widely
•
varying conditions
Best kernel for a 3x3 conv may not be the
same as the best kernel for a 17x17 one;
more complex operations are even hairier

Meta-programming

Leave the grunt-programming to the
computer

Meta-programming

computer
Dynamically compile specialized versions of
•
the same kernel for dierent conditions

Meta-programming

computer
•
Smooth syntactic ugliness: unroll loops,
•
index un-indexable registers

Meta-programming

computer
•
Smooth syntactic ugliness: unroll loops,
•
index un-indexable registers
Dynamic, empirical run-time tuning
•

texturefloat4, 1, cudaReadModeElementType tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern quot;Cquot; {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output)
{

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
__shared__ float shared_in[$INPUT_BLOCK_W][4+1];

// -- input/output offsets
const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
float4 input_v4;

// -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
if((threadIdx.x+$BLOCK_W*$i)$INPUT_BLOCK_W)
#end if
{
input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;

conv_kernel_4x4x4.cu
conv_kernel_template.cu #include stdio.h

texturefloat4, 1, cudaReadModeElementType tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
__constant__ float constant[$FILTER_D][$FILTER_W]
[$N_FILTERS]; __global__ void convolve_beta_j0(float4 *input, float4 *output)
{
__shared__ float shared_in[131][4+1];
#for j in xrange($FILTER_H) const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
float4 input_v4;
__global__ void convolve_beta_j${j}(float4 *input, float4
*output)
{
{

input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
shared_in[threadIdx.x+128*0][0] = input_v4.x;

shared_in[threadIdx.x+128*0][1] = input_v4.y;

shared_in[threadIdx.x+128*0][2] = input_v4.z;

shared_in[threadIdx.x+128*0][3] = input_v4.w;
}
const uint in_idx = (blockIdx.y+$j)*INPUT_W + if((threadIdx.x+128*1)131)
blockIdx.x*blockDim.x + threadIdx.x; {

input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
const uint out_idx = blockIdx.y*OUTPUT_W +

shared_in[threadIdx.x+128*1][0] = input_v4.x;
blockIdx.x*blockDim.x + threadIdx.x;

shared_in[threadIdx.x+128*1][1] = input_v4.y;
float4 input_v4;

shared_in[threadIdx.x+128*1][2] = input_v4.z;

shared_in[threadIdx.x+128*1][3] = input_v4.w;
// -- load input to shared memory }
__syncthreads();
// -- compute dot products
float v, w;
#end if
{ float sum0 = 0;
float sum1 = 0;
input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*
float sum2 = 0;
$i);
float sum3 = 0;
shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y; v = shared_in[threadIdx.x+0][0];
shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z; w = constant[0][0][0];
sum0 += v*w;
shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
w = constant[0][0][1];
}
sum1 += v*w;
#end for
sum2 += v*w;
sum3 += v*w;
v = shared_in[threadIdx.x+1][0];
sum0 += v*w;
sum1 += v*w;
sum2 += v*w;
sum3 += v*w;
v = shared_in[threadIdx.x+2][0];
sum0 += v*w;
sum1 += v*w;

conv_kernel_template.cu

[$N_FILTERS];




20 kB
*output)
{


const uint in_idx = (blockIdx.y+$j)*INPUT_W +
float4 input_v4;

#end if

{
$i);
}

64 kB
#end for

conv_kernel_beta_template.cu
[$N_FILTERS];



*output)
{


float4 input_v4;

#end if
{
$i);
}
#end for

conv_kernel_beta_template.cu
[$N_FILTERS];



*output)
{



seemingly

innocuous “ﬂow”
float4 input_v4;

parameter change
#end if
{
$i);
}
#end for

version A
...
conv_kernel_beta_template.cu mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4

mov.b32 $r1, c0[$ofs2+0x000c]
[$N_FILTERS];

mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4

mov.b32 $r1, c0[$ofs2+0x0010]

mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
*output)

...
{



seemingly

float4 input_v4;

parameter change
#end if
{
$i);
}
#end for

version A
...
conv_kernel_beta_template.cu mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4

mov.b32 $r1, c0[$ofs2+0x000c]
[$N_FILTERS];

mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4

mov.b32 $r1, c0[$ofs2+0x0010]

mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
*output)

...
{



seemingly

float4 input_v4;

parameter change

version B
#end if
{

...
$i);

mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
}
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
#end for

mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1

...

Unsupervised Experience
Learn Filter Kernels from Temporal Input Statistics

Unsupervised Experience
Learn Filter Kernels from Temporal Input Statistics
law order

Screening
A quick object rec. test to ﬁnd promising models


tasks models

Screening

“cars” “planes”

Screening

Parametric Variation

no variation

Screening


no variation more variation

Screening


no variation more variation lots of variation

Screening

250

200
N=2500
150

100

50

0
50 60 70 80 90 100

Screening

100
250
90

80
200
N=2500
70
150
60
100 50
V1S 1 2 3 4 5
50

0
50 60 70 80 90 100

Validation
See how we do on other test sets


tasks models

Validation

cars vs. planes (validate) boats vs. animals

100 100

90 90

80 80

70 70

60 60

50 50
V1S 1 2 3 4 5 V1S 1 2 3 4 5

Validation

synthetic face
webcam faces
discrimination
100 100

90 90

80 80

70 70

60 60

50 50
5 V1S 1 2 3 4 5 V1S 1 2 3 4 5

Other “Petri Dishes”

boats


cars and planes
boats

Unsupervised
Screen Distribution Validation
Training “World”
synthetic face
cars vs. planes (validate) boats vs. animals webcam faces
discrimination
cars and planes 250

100 100 100 100
200
N=2500 90 90 90 90
150 80 80 80 80

70 70 70 70
100
60 60 60 60
50 50 50 50 50
V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5

50 60 70 80 90 100

boats 250

200 100 100 100 100
N=2500
90 90 90 90
150
80 80 80 80

70 70 70 70
100
60 60 60 60
50 50 50 50 50
V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5

50 60 70 80 90 100

law order
250
100 100 100 100
200
N=2500 90 90 90 90

80 80 80 80
150
70 70 70 70
100 60 60 60 60

50 50 50 50
50 V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5 V1S 1 2 3 4 5

0
50 60 70 80 90 100

Screening Basically Works

3

2

1
cars vs. planes
(validate)

0

-1

r=0.84
-2

-2 -1 0 1 2 3
cars vs. planes (screen)

Screening Basically Works

3
3

2
2

1
1

boats vs. animals
cars vs. planes

0
(validate)

0

-1
-1

r=0.60
r=0.84 -2
-2

-2 -1 0 1 2 3
-2 -1 0 1 2 3

3
4
2
3
1
2
0
discrimination

1
synthetic face

webcam faces

-1
0

-2
-1
r=0.23
r=0.49 -3
-2

-2 -1 0 1 2 3
-2 -1 0 1 2 3

Screening Works on Average

3

2
noramalized average

1

0

-1

r=0.69
-2 -1 0 1 2 3

What does it all mean?

Preprocessing / Normalization / Zero-mean Layer 1 / Linear Filtering / RF size
0.9 0.9
Performance (% correct)

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5

False True 3 5 7 9
Parameter value Parameter value

What does it all mean?

Layer 1 / Normalization / Threshold Layer 3 / Learning / Neighborhood size
0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5

01 10 1 3 5 7 9
Parameter value Parameter value

Summary

• GPUs allow us to engage a qualitatively
dierent pace of experimentation

Summary

• CUDA provides unprecedented
performance / eort ratio

Summary

• CUDA provides unprecedented
performance / eort ratio
• New laboratory for studying visual
representation

Shameless Recruitment

http://www.rowland.harvard.edu/rjf/cox

Shameless Recruitment

Now
with
E
NVID xtra
Good IA
ness!

http://www.rowland.harvard.edu/rjf/cox

Acknowledgments

Cox Lab @
• Davide Zoccolan
• Nadja Oertelt

Acknowledgments

Cox Lab @
• Davide Zoccolan
• Nadja Oertelt

DiCarlo Lab @ MIT
• Jim DiCarlo
• Nicolas Pinto

Acknowledgments

Cox Lab @
• Davide Zoccolan
• Nadja Oertelt

DiCarlo Lab @ MIT
• Jim DiCarlo
• Nicolas Pinto

HARVARD UNIVERSITY

The Problem with Evaluation

What set to use?

faces_easy faces car_side airplanes accordion
Caltech 101

Caltech 101

from A. Torralba

The ﬁeld does well on the ‘101

Performance
100
100%

8080
1] Wang et al. 2006
60 60
[2] Grauman and Darrell 2005
[3] Mutch and Lowe 2006
[4] Lazebnik et al. 2006
40 40 [5] Zhang et al. 2006

20 20

0 0
1 2 3 4 5

state of the art systems

The “start-simple” model

• Input divisive normalization


• Threshold, saturation


• Output normalization



• Downsampling
• Dimensionality Reduction
• Classi cation



• Downsampling
• Dimensionality Reduction
• Classi cation

• Really Big (~2.1 million dim)

V1-like model does shockingly well

Performance
100
100%

8080

60 60

40 40

20 20

0 0
1 2 3 4 5



Performance
100
100%

8080

60 60

40 40

20 20

0 0
1 2 3 4 5
V1-like


Performance
100
100%

8080

60 60

40 40

20 20

0 0
1 2 3 4 5
V1-like V1+

But the model is stupid

It’s just V1.

It has no speci c
mechanisms for achieving
invariance.

But the model is stupid

It’s just V1.

It has no speci c
mechanisms for achieving
invariance.

The Fight is
Rigged?

Refocusing on the real problem


Just two classes: should be easy, no?



no variation



no variation more variation



no variation more variation lots of variation

Pulling performance back to chance
100

90
Performance (%)

80

70

60

50

0% 10% 20% 30% 40% 50% 60%
position (x-axis)
0% 20% 40% 60% 80% 100% 120%
position (y-axis)
0% 10% 20% 30% 40% 50% 60%
scale
0° 15° 30° 45° 60° 75° 90°
rotation
0° 15° 30° 45° 60° 75° 90°
view

Intra-class variation

Just one bad database?

Caltech 256


Caltech 256
Face Databases: ORL, Yale, AR, CVL


Caltech 256
Face Databases: ORL, Yale, AR, CVL

See us @ ECCV 08
Faces in the Wild
Workshop

Face Databases

ORL 4 training examples 8 training examples

100

80

60

40

20

0
V1-like V1-like
1 2 3 4 1 2 3 4

1. pixel space 3. and 4. Noushath et al. 2007
2. Savvides et al. 2007

Face Databases

AR 5 training 8 training
examples examples
100

80

60

40

20
1. pixel space
2. Liang et al. 2007 0
3 V1-like 3 V1-like
1 2 1 2
3. Zhang et al. 2007

Face Databases

YALE 4 training 8 training
examples examples
100

90

80

70

60
1. pixel space
2.,3.,4. and 5. Noushath et al. 2006 50
V1-like V1-like
1 2 34567 1 2 3456
6. Ben et al. 2006
7. Wang et al. 2007 chance (1/15=6.67%)

Face Databases

CVL 2 training 3 training
examples examples
(frontal only) (all)
100

80

60

40

chance
20 (1/114=0.88%)
1. pixel space
2. Goel et al. 2005 0
3. Gokberk et al. 2002 V1-like V1-like
12 1 3

Face Databases

CAS-PEAL
4 training examples
facing facing facing
100
down forward up

80

60

40

20
1. pixel space
2.,3.,4 and 5. Cao et al. 2004 0
1 2 3 4 5 V1 1 2 3 4 5 V1 1 2 3 4 5 V1

Synthetic Variation
some variation
100

90

80 scene background
white noise background
70 phase scrambled background

60

chance
50
more variation
40

30

position (x-axis) 0% 20% 40% 60% 80% 100% 120%
position (y-axis) 0% 10% 20% 30% 40% 50% 60%
scale 0% 10% 20% 30% 40% 50% 60%
in-plane rotation 0° 15° 30° 45° 60° 75° 90°
depth-rotation 0° 15° 30° 45° 60° 75° 90°

Increasing Variation

IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

Recomendados

Recomendados

Mais conteúdo relacionado

Mais de npinto

Mais de npinto (20)

Último

Último (20)

IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)