A High-Throughput Approach to
Discovering Good Forms of Visual
Representation

David Cox
The Rowland Institute at
Harvard

Nicolas Pinto
Jim DiCarlo
MIT BCS



   The Rowland Institute at Harvard
   HARVARD UNIVERSITY
Goals


1) “Building Brains”
A concrete example of real-world experiments fundamentally enabled by stream-processing hardware

2) Tricks of the trade
Some high-level highlights of how we leverage CUDA to achieve our goals
The Problem
Why is vision hard?



              World is 3D, but retina is 2D


              The same object can cast an
              infinite number of different
              images onto the retina
The Problem
Object Variation

[Figure: example images arranged along “Transformation” and “Identity” axes — a bigger difference in pixel space can come from transforming one object than from changing identity]
The Approach:
Reverse Engineering the Brain



    REVERSE                FORWARD

       Study                   Build
   Natural System         Artificial System
Reverse Engineering the Brain
Monkeys
Rats
Visual System
IT Neurons: Complex Stimuli




                              Desimone et al.
IT Cortex can do object recognition




                                Hung et al., 2005
Reverse Engineering the Brain



    REVERSE                FORWARD

       Study                   Build
   Natural System         Artificial System
Object Recognition


  Training Set   Representation




                                  Classifier


  Test Example   Representation



                                    Guess
Object Recognition
Training Set
Object Recognition
Testing Set




                     “Not Siri”
       “Siri”
How are things done normally?

  Usual Formula:

  1) One grad student
  2) One Model (size limited by runtime)
  3) Performance numbers on a few
  standard test sets
  4) yay. we. rock.
  5) One Ph.D.
Why is this not optimal?


• Lots of parameters – can’t explore easily
• Big models are paralyzingly slow to run
• Advice from my friend:
  “Don’t run anything that takes longer than a
  week to complete, because it will just crash
  halfway through anyways (or you’ll discover
  a bug) and you’ll never finish your Ph.D.”
Doing things a little bit differently


  1) One grad student
  2) One → Hundreds of Thousands of
  BIG Models
  3) Performance numbers on a few
  standard test sets*
  4) yay. we. rock.
  5) One Ph.D.?
High-Throughput Screening
Pipeline: Biology

“Plate” a Diversity of Organisms → Allow them to grow → Apply Challenge
→ Collect Surviving Colonies → Study / Repeat
Pipeline: Biology-Inspired Vision

Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task
→ Skim off best models → Validate on other tasks
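The screening pipeline above can be sketched as a simple generate–score–skim loop. This is an illustrative stand-in: the function names are hypothetical, and the toy scoring function replaces the real unsupervised learning and screening task:

```python
import random

def generate_random_model(rng):
    """Draw one model's parameters at random from the search space."""
    return {
        "n_filters": rng.choice([16, 32, 64, 128]),
        "kernel_size": rng.choice([3, 5, 7, 9]),
        "norm_strength": rng.uniform(0.0, 10.0),
        "learning_rate": rng.uniform(1e-4, 1e-1),
    }

def screen_score(params, rng):
    """Stand-in for unsupervised learning on video plus a screening task.
    In the real pipeline this is the expensive, GPU-accelerated part."""
    return rng.random()

def high_throughput_screen(n_models=1000, keep=10, seed=0):
    """Generate many random models, score each, skim off the best."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_models):
        params = generate_random_model(rng)
        scored.append((screen_score(params, rng), params))
    # Sort by screening score, best first; survivors go on to validation
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]

best = high_throughput_screen(n_models=200, keep=5)
```

The point of the structure is that each candidate is independent, so the loop body parallelizes trivially across GPUs or cluster nodes.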
Need to break all of the implicit rules



   1. Test tens to hundreds of thousands
      of instantiations of biologically-inspired
      hierarchical models
   2. Use more realistic inputs (e.g. large
      quantities of video)
   3. Test models which begin to approach
      the scale of natural systems
Massive Computation




If you work for it, a multi-TeraFLOPS
cluster is doable for a modest price
Long, winding road of stream processing
GPU pr0n
A Match Made in Heaven
Brains are parallel, GPUs are parallel



                     ≈
   Multiple scales of parallelism:
     “Embarrassingly” parallel: video
     frames, regions
     Fine-grained: independent “neurons,”
     operating on overlapping inputs
A Match Made in Heaven
Images In, Images Out



                    ≈
   Image processing particularly well-suited
    Excellent Arithmetic Intensity: very
    natural to load image patches into
    shared memory
    Data: 2D / 3D locality
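A minimal sketch (in Python/NumPy rather than CUDA) of why filtering has good arithmetic intensity: neighbouring output pixels reuse overlapping input patches, so a block of the image loaded once into fast memory feeds many multiply-adds. The function name is illustrative:

```python
import numpy as np

def filter_response_2d(image, kernel):
    """'Valid'-mode 2D filter response (correlation). Each (i, j) patch
    overlaps its neighbours, so on a GPU a tile of the image staged in
    shared memory is reused by kh*kw multiply-adds per output pixel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            # One output pixel = dot product of a patch with the kernel
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
box = np.ones((2, 2))
result = filter_response_2d(img, box)   # 3x3 map of 2x2 box sums
```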
These things are REALLY fast

         Performance (GFLOPS)   Development Time (hours)
Matlab           0.3                     0.5
C/SSE            9.0                    10.0
PS3            110.0                    30.0
GT200          330.0                    10.0
Pipeline

Generate Random Models → Unsupervised Learning (Video) → Test with “screening” task
→ Skim off best models → Validate on other tasks
Read-out

[Model schematic: stacked layers L1 → L2 → L3 above the input, with read-out at the top. Each layer is parameterized by kernel size, number of filters, thresh/sat, norm strength, normalization neighborhood, and learning parameters: Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...]
A Broad Parametric Model

Normalize                                 • Optimize “Coverage”
Ni = Inputi / norm(Input_neighborhood)    (filters span the range of
                                          observed inputs)
Compute Filter Responses
Ri = Fi ⊗ N                               • Privilege movement of
Ri < thresh: Ri = thresh                  filters in certain
Ri > sat: Ri = sat                        directions using
                                          temporal information
Determine a “Winning Filter”
Ri’ = (∑ Tk * Hk) * Ri                    • Expand dimensionality
winner: max(Ri’)                          greatly and then scale
                                          back as layers progress
Update Filter
Fwinning = Fwinning + learning rate * N
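The recipe above can be sketched in NumPy. This is an illustrative reduction, not the authors' implementation: it drops the temporal-bias term (∑ Tk * Hk) and normalizes by the whole patch rather than a local neighborhood:

```python
import numpy as np

def learn_step(filters, patch, thresh=0.0, sat=1.0, lr=0.05, eps=1e-6):
    """One winner-take-all update following the slide's recipe (sketch):
    normalize, compute clamped filter responses, pick a winner, and
    move the winning filter toward the normalized input."""
    # Normalize (whole-patch norm here; the model uses a neighborhood)
    n = patch / (np.linalg.norm(patch) + eps)
    # Compute filter responses, clamped between thresh and sat
    r = filters @ n
    r = np.clip(r, thresh, sat)
    # Determine the "winning" (most responsive) filter
    winner = int(np.argmax(r))
    # Update: F_winning += learning_rate * N
    filters[winner] += lr * n
    return winner, r

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 9))     # 4 filters over 3x3 patches
winner, r = learn_step(filters, rng.standard_normal(9))
```

Run over many video frames, this drives the filters to "cover" the range of observed inputs, since each input pulls its best-matching filter toward itself.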
Dealing with Parametric Complexity

Throwing a wide net comes with its own challenges:
 • Complexity
 • Kernels must perform under widely varying conditions
   The best kernel for a 3x3 conv may not be the same as the best
   kernel for a 17x17 one; more complex operations are even hairier
Meta-programming

Leave the grunt-programming to the computer
 • Dynamically compile specialized versions of the same kernel for different conditions
 • Smooth syntactic ugliness: unroll loops, index un-indexable registers
 • Dynamic, empirical run-time tuning
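A toy version of this meta-programming idea in pure Python, standing in for the Cheetah-templated CUDA used here: the filter width is baked into generated source as a constant and the inner loop is fully unrolled before "compiling". All names in this sketch are illustrative:

```python
from string import Template

# Miniature kernel template: ${width} and ${body} get substituted
# at generation time, like $FILTER_W and the unrolled loop in the
# Cheetah-templated .cu files.
KERNEL_TEMPLATE = Template("""\
def convolve_w${width}(row, taps):
    # specialized for FILTER_W = ${width}; inner loop fully unrolled
    return ${body}
""")

def specialize(width):
    """Generate and compile a kernel specialized for one filter width."""
    body = " + ".join(f"row[{i}] * taps[{i}]" for i in range(width))
    src = KERNEL_TEMPLATE.substitute(width=width, body=body)
    namespace = {}
    exec(src, namespace)              # "compile" the generated source
    return namespace[f"convolve_w{width}"]

conv3 = specialize(3)
print(conv3([1.0, 2.0, 3.0], [1.0, 0.0, 1.0]))  # -> 4.0
```

In the real pipeline the same trick generates several CUDA variants (different block sizes, unroll factors) and an empirical timing pass picks the fastest one at run time.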
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i) < $INPUT_BLOCK_W)
#end if
      {
	   input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
	   shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
	   shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
	   shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
	   shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for
conv_kernel_template.cu (the template above), specialized into conv_kernel_4x4x4.cu:

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output)
  {
    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
        shared_in[threadIdx.x+128*0][0] = input_v4.x;
        shared_in[threadIdx.x+128*0][1] = input_v4.y;
        shared_in[threadIdx.x+128*0][2] = input_v4.z;
        shared_in[threadIdx.x+128*0][3] = input_v4.w;
      }
    if((threadIdx.x+128*1) < 131)
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
        shared_in[threadIdx.x+128*1][0] = input_v4.x;
        shared_in[threadIdx.x+128*1][1] = input_v4.y;
        shared_in[threadIdx.x+128*1][2] = input_v4.z;
        shared_in[threadIdx.x+128*1][3] = input_v4.w;
      }
    __syncthreads();

    // -- compute dot products (fully unrolled)
    float v, w;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;

    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0];
    sum0 += v*w;
    w = constant[0][0][1];
    sum1 += v*w;
    w = constant[0][0][2];
    sum2 += v*w;
    w = constant[0][0][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0];
    sum0 += v*w;
    w = constant[0][1][1];
    sum1 += v*w;
    ...
conv_kernel_template.cu expands into specialized sources of very different sizes:

  conv_kernel_4x4x4.cu — 20 kB of generated code
  conv_kernel_8x8x4.cu — 64 kB of generated code
conv_kernel_beta_template.cu (the same template, with a seemingly innocuous “flow” parameter change)
version A
                                                                                            ...
conv_kernel_beta_template.cu                                             mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
                                                                         mov.b32 $r1, c0[$ofs2+0x0008]
                                                                         mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
 texturefloat4, 1, cudaReadModeElementType tex_float4;
 __constant__ float constant[$FILTER_D][$FILTER_W]

                                                                         mov.b32 $r1, c0[$ofs2+0x000c]
 [$N_FILTERS];


                                                                         mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
 #define IMUL(a, b) __mul24(a, b)
 extern quot;Cquot; {

                                                                         mov.b32 $r1, c0[$ofs2+0x0010]
 #for j in xrange($FILTER_H)

                                                                         mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
   __global__ void convolve_beta_j${j}(float4 *input, float4
 *output)


                                                                                            ...
   {

 #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
     __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

     // -- input/output offsets

                                                                    seemingly
     const uint in_idx = (blockIdx.y+$j)*INPUT_W +
 blockIdx.x*blockDim.x + threadIdx.x;
     const uint out_idx = blockIdx.y*OUTPUT_W +


                                                                innocuous “flow”
 blockIdx.x*blockDim.x + threadIdx.x;
     float4 input_v4;



                                                                parameter change
     // -- load input to shared memory
 #for i in xrange($LOAD_ITERATIONS)
 #if $i==($LOAD_ITERATIONS-1)
     if((threadIdx.x+$BLOCK_W*$i)$INPUT_BLOCK_W)
 #end if
       {
 	        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*
 $i);
 	        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
 	        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
 	        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
 	        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
       }
 #end for
version A
                                                                                                  ...
conv_kernel_beta_template.cu                                                 mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
                                                                             mov.b32 $r1, c0[$ofs2+0x0008]
                                                                             mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
 texturefloat4, 1, cudaReadModeElementType tex_float4;
 __constant__ float constant[$FILTER_D][$FILTER_W]

                                                                             mov.b32 $r1, c0[$ofs2+0x000c]
 [$N_FILTERS];


                                                                             mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
 #define IMUL(a, b) __mul24(a, b)
 extern quot;Cquot; {

                                                                             mov.b32 $r1, c0[$ofs2+0x0010]
 #for j in xrange($FILTER_H)

                                                                             mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
   __global__ void convolve_beta_j${j}(float4 *input, float4
 *output)


                                                                                                  ...
   {

 #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
     __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

     // -- input/output offsets

                                                                    seemingly
     const uint in_idx = (blockIdx.y+$j)*INPUT_W +
 blockIdx.x*blockDim.x + threadIdx.x;
     const uint out_idx = blockIdx.y*OUTPUT_W +


                                                                innocuous “flow”
 blockIdx.x*blockDim.x + threadIdx.x;
     float4 input_v4;



                                                                parameter change
     // -- load input to shared memory
 #for i in xrange($LOAD_ITERATIONS)
 #if $i==($LOAD_ITERATIONS-1)

                                                                                                        version B
     if((threadIdx.x+$BLOCK_W*$i)$INPUT_BLOCK_W)
 #end if
       {
 	        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*



                                                                                               ...
 $i);
 	        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
 	        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
 	        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;

                                                                     mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
 	        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
       }
                                                                     mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
 #end for


                                                                     mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
                                                                     mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
                                                                     mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1

                                                                                               ...
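Because a seemingly innocuous code-generation choice can change the compiled output this much, variants are best compared empirically. A minimal auto-tuning sketch, with stand-in Python callables in place of compiled CUDA kernels (`make_variant` and its `work` parameter are illustrative assumptions):

```python
import time

# Hypothetical sketch: choose the fastest of several generated kernel
# variants by timing them empirically. Stand-in Python callables are used
# here; in the real pipeline each variant would be a compiled CUDA kernel.
def make_variant(work):
    def variant(n):
        acc = 0.0
        for i in range(n * work):  # 'work' mimics better or worse codegen
            acc += i * 0.5
        return acc
    return variant

def autotune(variants, n=20_000, reps=3):
    # Keep the variant with the best (minimum) time over a few repetitions.
    best_name, best_t = None, float("inf")
    for name, fn in variants.items():
        best_rep = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            fn(n)
            best_rep = min(best_rep, time.perf_counter() - t0)
        if best_rep < best_t:
            best_name, best_t = name, best_rep
    return best_name

variants = {"version_A": make_variant(4), "version_B": make_variant(1)}
print(autotune(variants))
```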
Pipeline



   Generate         Unsupervised            Test with
Random Models      Learning (Video)     “screening” task




        Validate on other        Skim off best
              tasks                 models
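The pipeline above can be sketched as a random search: draw many candidate models, score each on the screening task, and skim off the top fraction for validation. A toy sketch (the parameter names and scoring function are stand-ins, not the talk's actual model space):

```python
import random

# Hypothetical sketch of the slide's pipeline: generate random model
# parameters, score each on a screening task, and skim off the best few.
def generate_random_model(rng):
    return {"rf_size": rng.choice([3, 5, 7, 9]),
            "threshold": rng.uniform(0.0, 1.0),
            "normalize": rng.choice([True, False])}

def screen_score(model, rng):
    # Stand-in for "% correct" on the screening task (cars vs. planes).
    return 50.0 + 10.0 * model["normalize"] + model["rf_size"] + rng.gauss(0, 2)

def pipeline(n_models=2500, n_keep=5, seed=0):
    rng = random.Random(seed)
    models = [generate_random_model(rng) for _ in range(n_models)]
    models.sort(key=lambda m: screen_score(m, rng), reverse=True)
    return models[:n_keep]  # skim off the best models for validation

best = pipeline()
print(best[0])
```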
Unsupervised Experience
Learn Filter Kernels from Temporal Input Statistics
(training video: "Law & Order")
Screening
A quick object rec. test to find promising models




   Generate         Unsupervised            Test with
Random Models      Learning (Video)     “screening” task




        Validate on other        Skim off best
              tasks                 models
Screening
A quick object rec. test to find promising models


            “cars”          “planes”
Screening
A quick object rec. test to find promising models


Parametric Variation




no variation    more variation     lots of variation
Screening

[Figure: histogram of screening-task performance for N=2,500 randomly generated models (% correct, 50 to 100), with a bar chart comparing the V1S baseline to the five best models]
Validation
See how we do on other test sets




   Generate         Unsupervised            Test with
Random Models      Learning (Video)     “screening” task




        Validate on other        Skim off best
              tasks                 models
Validation

[Figure: validation performance (% correct, 50 to 100) of V1S and the five best screened models on four test sets: cars vs. planes (validate), boats vs. animals, synthetic face discrimination, and webcam faces]
Other “Petri Dishes”


                       cars and planes
       boats
Other "Petri Dishes"

[Figure: for each unsupervised training "world" (cars and planes; boats; "Law & Order"): the screening distribution over N=2,500 models, and validation bar charts (V1S vs. the five best models) on cars vs. planes, boats vs. animals, synthetic face discrimination, and webcam faces]
Screening Basically Works

[Figure: scatter plots of normalized validation performance vs. cars vs. planes screening performance. cars vs. planes (validate): r=0.84; boats vs. animals: r=0.60; synthetic face discrimination: r=0.49; webcam faces: r=0.23]
Screening Works on Average

[Figure: normalized average validation performance vs. cars vs. planes screening performance; r=0.69]
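The r values above summarize the screen-vs-validate scatter plots with a Pearson correlation, which can be computed with the stdlib alone (the toy scores below are illustrative):

```python
import math

# Sketch: each scatter plot is summarized by the Pearson correlation
# between screening and validation performance.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

screen = [55.0, 62.0, 70.0, 78.0, 85.0]    # toy screening scores
validate = [52.0, 60.0, 74.0, 75.0, 88.0]  # toy validation scores
print(round(pearson_r(screen, validate), 2))
```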
What does it all mean?

[Figure: performance (% correct, 0.5 to 0.9) as a function of individual model parameters: Preprocessing / Normalization / Zero-mean (False, True); Layer 1 / Linear Filtering / RF size (3, 5, 7, 9); Layer 1 / Normalization / Threshold; Layer 3 / Learning / Neighborhood size (1, 3, 5, 7, 9)]
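Plots like these marginalize performance over all the other parameters: group the models by one parameter's value and average their scores. A minimal sketch (the toy results list is illustrative):

```python
from collections import defaultdict

# Sketch: average performance grouped by a single parameter's value,
# marginalizing over all other parameters.
def marginalize(results, param):
    groups = defaultdict(list)
    for params, score in results:
        groups[params[param]].append(score)
    return {value: sum(scores) / len(scores)
            for value, scores in sorted(groups.items())}

results = [({"rf_size": 3, "normalize": True}, 0.62),
           ({"rf_size": 3, "normalize": False}, 0.55),
           ({"rf_size": 9, "normalize": True}, 0.80),
           ({"rf_size": 9, "normalize": False}, 0.71)]
print(marginalize(results, "rf_size"))
```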
Summary

  • GPUs allow us to engage a qualitatively
    different pace of experimentation
  • CUDA provides unprecedented
    performance / effort ratio
  • New laboratory for studying visual
    representation
Shameless Recruitment

                Now with Extra
                NVIDIA Goodness!

 http://www.rowland.harvard.edu/rjf/cox
Acknowledgments

 Cox Lab @
 The Rowland Institute at Harvard
• Davide Zoccolan
• Nadja Oertelt

DiCarlo Lab @ MIT
• Jim DiCarlo
• Nicolas Pinto



                               The Rowland Institute at Harvard
                               HARVARD UNIVERSITY
The Problem with Evaluation

What set to use?
Caltech 101

[Figure: example and average images from Caltech 101 categories: faces_easy, faces, car_side, airplanes, accordion (from A. Torralba)]
The field does well on the '101

[Figure: performance (% correct) of five state-of-the-art systems]

[1] Wang et al. 2006
[2] Grauman and Darrell 2005
[3] Mutch and Lowe 2006
[4] Lazebnik et al. 2006
[5] Zhang et al. 2006
The "start-simple" model

                     • Input divisive normalization
                     • Threshold, saturation
                     • Output normalization

                     • Downsampling
                     • Dimensionality Reduction
                     • Classification

                     • Really Big (~2.1 million dim)
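A minimal sketch of the first stage, assuming a standard divisive normalization in which each value in a patch is divided by the patch's overall norm (the exact normalization used in the talk's model may differ):

```python
import math

# Sketch of the model's first stage, assuming standard divisive
# normalization: each value divided by the norm of its local patch.
# The exact form in the talk's model may differ.
def divisive_normalize(patch, eps=1e-6):
    norm = math.sqrt(sum(v * v for row in patch for v in row))
    return [[v / (norm + eps) for v in row] for row in patch]

out = divisive_normalize([[1.0, 2.0], [2.0, 4.0]])  # patch norm is 5.0
print(out)
```

After this step the patch has (approximately) unit norm, so downstream filter responses are insensitive to local contrast.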
V1-like model does shockingly well

[Figure: performance (% correct) of the five state-of-the-art systems alongside the V1-like and V1+ models]
But the model is stupid

It's just V1.

It has no specific
mechanisms for achieving
invariance.

 The Fight is
 Rigged?
Refocusing on the real problem




  Just two classes: should be easy, no?
Refocusing on the real problem


Parametric Variation




no variation   more variation   lots of variation
Pulling performance back to chance

[Figure: performance (%) falling from near 100% toward chance (50%) as intra-class variation increases; position (x-axis): 0% to 60%; position (y-axis): 0% to 120%; scale: 0% to 60%; rotation: 0° to 90°; view: 0° to 90°]
Just one bad database?

Caltech 256
Face Databases: ORL, Yale, AR, CVL




                                See us @ ECCV 08
                                Faces in the Wild
                                Workshop
Face Databases

ORL

[Figure: performance (% correct) with 4 and 8 training examples; 1. pixel space; 2. Savvides et al. 2007; 3. and 4. Noushath et al. 2007; V1-like]
Face Databases

AR

[Figure: performance (% correct) with 5 and 8 training examples; 1. pixel space; 2. Liang et al. 2007; 3. Zhang et al. 2007; V1-like]
Face Databases

YALE

[Figure: performance (% correct) with 4 and 8 training examples; 1. pixel space; 2. to 5. Noushath et al. 2006; 6. Ben et al. 2006; 7. Wang et al. 2007; V1-like; chance = 1/15 = 6.67%]
Face Databases

CVL

[Figure: performance (% correct) with 2 training examples (frontal only) and 3 training examples (all); 1. pixel space; 2. Goel et al. 2005; 3. Gokberk et al. 2002; V1-like; chance = 1/114 = 0.88%]
Face Databases

CAS-PEAL

[Figure: performance (% correct) with 4 training examples, for faces facing down, forward, and up; 1. pixel space; 2. to 5. Cao et al. 2004; V1-like]
[Figure: Simulation with synthetic variation. Performance (% correct; chance = 50%) declines with increasing variation, for scene, white-noise, and phase-scrambled backgrounds. Variation ranges: position (x-axis) 0-120%, position (y-axis) 0-60%, scale 0-60%, in-plane rotation 0-90°, depth rotation 0-90°.]
IAP09 CUDA@MIT 6.963 - Guest Lecture: Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach (David Cox, Harvard | Jim DiCarlo, Nicolas Pinto, MIT)

  • 1. A High-Throughput Approach to Discovering Good Forms of Visual Representation David Cox The Rowland Institute at Harvard Nicolas Pinto Jim DiCarlo MIT BCS The Rowland Institute at Harvard HARVARD UNIVERSITY
  • 3. Goals 1) “Building Brains” Concrete example of real-world experiments fundamental enabled by stream processing hardware
  • 4. Goals 1) “Building Brains” Concrete example of real-world experiments fundamental enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights for how we leverage CUDA to achieve our goals
  • 5. Goals 1) “Building Brains” Concrete example of real-world experiments fundamental enabled by stream processing hardware 2) Tricks of the trade Some high-level highlights for how we leverage CUDA to achieve our goals
  • 6. The Problem Why is vision hard?
  • 7. The Problem Why is vision hard? World is 3D, but retina is 2D
  • 8. The Problem Why is vision hard? World is 3D, but retina is 2D The same object can cast an infinite number of different images onto the retina
  • 18. Transformation Identity Bigger difference in pixel space
  • 19. The Approach: Reverse Engineering the Brain REVERSE Study Natural System
  • 20. The Approach: Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 21. The Approach: Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 25. Rats
  • 26. Rats
  • 29. IT Neurons: Complex Stimuli Desimone et al.
  • 30. IT Cortex can do object recognition Hung et al., 2005
  • 31. Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 32. Reverse Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 33. Object Recognition Training Set
  • 34. Object Recognition Training Set Representation
  • 35. Object Recognition Training Set Representation Classifier
  • 36. Object Recognition Training Set Representation Classifier
  • 37. Object Recognition Training Set Representation Classifier Test Example Representation
  • 38. Object Recognition Training Set Representation Classifier Test Example Representation Guess
  • 49. Object Recognition Testing Set “Not Siri” “Siri”
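The train/classify/test recipe in the slides above can be sketched end to end. This is a minimal stand-in, not the actual system: `represent` here just flattens pixels (a real model would be the multi-layer filtering hierarchy described later), a nearest-centroid rule stands in for the classifier, and the function names are invented for illustration.

```python
import numpy as np

def represent(image):
    # Stand-in feature map (the slides' "Representation" box).
    return image.ravel().astype(float)

def train(images, labels):
    # Fit one centroid per class in representation space
    # (a stand-in for the classifier in the slides).
    feats = np.array([represent(im) for im in images])
    classes = sorted(set(labels))
    return {c: feats[[l == c for l in labels]].mean(axis=0) for c in classes}

def guess(centroids, image):
    # Test example -> representation -> guess
    f = represent(image)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))

# Toy usage: two trivially separable "image" classes
a = [np.zeros((4, 4)), np.zeros((4, 4)) + 0.1]
b = [np.ones((4, 4)), np.ones((4, 4)) - 0.1]
model = train(a + b, ["not_siri"] * 2 + ["siri"] * 2)
```

The whole question of the deck is what goes inside `represent`: with a good representation even a simple classifier succeeds, and with a bad one no classifier can recover identity across variation.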
  • 50. How are things done normally?
  • 51. How are things done normally? Usual Formula:
  • 52. How are things done normally? Usual Formula: 1) One grad student
  • 53. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime)
  • 54. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets
  • 55. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock.
  • 56. How are things done normally? Usual Formula: 1) One grad student 2) One Model (size limited by runtime) 3) Performance numbers on a few standard test sets 4) yay. we. rock. 5) One Ph.D.
  • 57. Why is this not optimal?
  • 58. Why is this not optimal? • Lots of parameters – can’t explore easily
  • 59. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run
  • 60. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend:
  • 61. Why is this not optimal? • Lots of parameters – can’t explore easily • Big models are paralyzingly slow to run • Advice from my friend: “Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
  • 62. Doing things a little bit differently
  • 63. Doing things a little bit differently 1) One grad student
  • 64. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models
  • 65. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  • 66. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets*
  • 67. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock.
  • 68. Doing things a little bit differently 1) One grad student 2) One Hundreds of Thousands of BIG Models 3) Performance numbers on a few standard test sets* 4) yay. we. rock. 5) One Ph.D.?
  • 72. Pipeline: Biology “Plate” a Diversity of Organisms
  • 73. Pipeline: Biology “Plate” a Diversity of Organisms, Allow them to grow
  • 74. Pipeline: Biology “Plate” a Diversity of Organisms, Allow them to grow, Apply Challenge
  • 75. Pipeline: Biology “Plate” a Diversity of Organisms, Allow them to grow, Apply Challenge, Collect Surviving Colonies
  • 76. Pipeline: Biology “Plate” a Diversity of Organisms, Allow them to grow, Apply Challenge, Collect Surviving Colonies, Study / Repeat
  • 78. Pipeline: Biology-Inspired Vision Generate Random Models
  • 79. Pipeline: Biology-Inspired Vision Generate Random Models, Unsupervised Learning (Video)
  • 80. Pipeline: Biology-Inspired Vision Generate Random Models, Unsupervised Learning (Video), Test with “screening” task
  • 81. Pipeline: Biology-Inspired Vision Generate Random Models, Unsupervised Learning (Video), Test with “screening” task, Skim off best models
  • 82. Pipeline: Biology-Inspired Vision Generate Random Models, Unsupervised Learning (Video), Test with “screening” task, Skim off best models, Validate on other tasks
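The five-step screening pipeline above can be sketched as a loop. Everything in this sketch is hypothetical scaffolding: the parameter names, their ranges, and the toy scoring function are placeholders for a real model family and a real screening task (the unsupervised-learning stage is assumed to be folded into `score_fn`).

```python
import random

def random_model_params(rng):
    # Draw one point from a (hypothetical) model-family parameter space.
    return {
        "n_filters": rng.choice([16, 32, 64]),
        "kernel_size": rng.choice([3, 5, 7, 9]),
        "norm_strength": rng.uniform(0.0, 1.0),
    }

def screen(score_fn, n_models=100, top_k=5, seed=0):
    # "Plate" a diversity of models, apply the screening task,
    # and skim off the best performers for later validation.
    rng = random.Random(seed)
    candidates = [random_model_params(rng) for _ in range(n_models)]
    scored = [(score_fn(p), p) for p in candidates]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[:top_k]

# Toy "screening task": pretend larger kernels with moderate
# normalization do best (a placeholder for real task accuracy).
def toy_score(p):
    return p["kernel_size"] - abs(p["norm_strength"] - 0.5)

best = screen(toy_score, n_models=200, top_k=3)
```

The skimmed-off `best` models would then be re-evaluated on held-out validation tasks, exactly as in the biology analogy of collecting surviving colonies and studying them.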
  • 83. Need to break all of the implicit rules
  • 84. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models
  • 85. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video)
  • 86. Need to break all of the implicit rules 1. Test tens to hundreds of thousands of instantiations of biologically-inspired hierarchical models 2. Use more realistic inputs (e.g. large quantities of video) 3. Test models which begin to approach the scale of natural systems
  • 88. Massive Computation If you work for it, a multi-TeraFLOPS cluster is doable for a modest price
  • 89. Long, winding road of stream processing
  • 92. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈
  • 93. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism:
  • 94. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions
  • 95. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrassingly” parallel: video frames, regions Fine-grained: independent “neurons,” operating on overlapping inputs
  • 96. A Match Made in Heaven Images In, Images Out ≈
  • 97. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited
  • 98. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory
  • 99. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory Data: 2D / 3D locality
  • 100–109. These things are REALLY fast Performance (GFLOPS) / Development Time (hours): Matlab 0.3 / 0.5; C/SSE 9.0 / 10.0; PS3 110.0 / 30.0; GT200 330.0 / 10.0
  • 110. Pipeline Generate Random Models, Unsupervised Learning (Video), Test with “screening” task, Skim off best models, Validate on other tasks
  • 111. Pipeline Generate Random Models, Unsupervised Learning (Video), Test with “screening” task, Skim off best models, Validate on other tasks
  • 112. Read-out L3 thresh/sat norm strength Learning normalization neighborhood Rate Trace “Temp. Adv.” “Auto-reset” ... number of filters L2 thresh/sat norm strength Learning normalization Rate neighborhood Trace kernel “Temp. Adv.” size “Auto-reset” ... n. of filters L1 Learning thresh/sat norm strength Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” ... kernel size number of filters input kernel size
  • 113. neighborhood Rate Trace “Temp. Adv.” “Auto-reset” ... number of filters L2 thresh/sat norm strength Learning normalization Rate neighborhood Trace kernel “Temp. Adv.” size “Auto-reset” ... n. of filters L1 Learning thresh/sat norm strength Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” ... kernel size
  • 114. A Broad Parametric Model Normalize: Ni = Inputi / norm(Inputneighborhood) Compute Filter Responses: Ri = Fi ⊗ N Ri < thresh: Ri = thresh Ri > sat: Ri = sat Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri winner: max(Ri’) Update Filter: Fwinning = Fwinning + learning rate * N
  • 115. A Broad Parametric Model • Optimize “Coverage” Normalize: Ni = Inputi / norm(Inputneighborhood) Compute Filter Responses: Ri = Fi ⊗ N Ri < thresh: Ri = thresh Ri > sat: Ri = sat Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri winner: max(Ri’) Update Filter: Fwinning = Fwinning + learning rate * N
  • 116. A Broad Parametric Model • Optimize “Coverage” (filters span the range of observed inputs) Normalize: Ni = Inputi / norm(Inputneighborhood) Compute Filter Responses: Ri = Fi ⊗ N Ri < thresh: Ri = thresh Ri > sat: Ri = sat Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri winner: max(Ri’) Update Filter: Fwinning = Fwinning + learning rate * N
  • 117. A Broad Parametric Model • Optimize “Coverage” (filters span the range of observed inputs) • Privilege movement of filters in certain directions using temporal information Normalize: Ni = Inputi / norm(Inputneighborhood) Compute Filter Responses: Ri = Fi ⊗ N Ri < thresh: Ri = thresh Ri > sat: Ri = sat Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri winner: max(Ri’) Update Filter: Fwinning = Fwinning + learning rate * N
  • 118. A Broad Parametric Model • Optimize “Coverage” (filters span the range of observed inputs) • Privilege movement of filters in certain directions using temporal information • Expand dimensionality greatly and then scale back as layers progress Normalize: Ni = Inputi / norm(Inputneighborhood) Compute Filter Responses: Ri = Fi ⊗ N Ri < thresh: Ri = thresh Ri > sat: Ri = sat Determine a “Winning Filter”: Ri’ = (∑ Tk * Hk) * Ri winner: max(Ri’) Update Filter: Fwinning = Fwinning + learning rate * N
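The update rule on these slides can be sketched for a single input patch. This is a simplified reading, with several stated assumptions: a dot product stands in for the convolution ⊗, the temporal trace terms (Tk, Hk) are omitted, and all parameter values and function names are invented for illustration.

```python
import numpy as np

def learn_step(filters, patch, thresh=0.0, sat=1.0, lr=0.05, eps=1e-6):
    # Normalize: Ni = Input_i / norm(Input_neighborhood)
    n = patch / (np.linalg.norm(patch) + eps)
    # Filter responses: dot product stands in for Ri = Fi ⊗ N
    r = filters @ n
    # Clip: Ri < thresh -> thresh ; Ri > sat -> sat
    r = np.clip(r, thresh, sat)
    # Winner-take-all (the trace/temporal terms Tk, Hk are omitted here)
    winner = int(np.argmax(r))
    # Update: F_winning = F_winning + learning_rate * N
    filters[winner] += lr * n
    return winner, filters

rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 16))   # 8 filters over 16-dim patches
patch = rng.standard_normal(16)
winner, filters = learn_step(filters, patch)
```

Iterating this step over many patches (e.g. video frames) nudges each winning filter toward the inputs it responds to, so the filter bank comes to cover the range of observed inputs.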
• 119-123. Dealing with Parametric Complexity
  Throwing a wide net comes with its own challenges: complexity.
  • Kernels must perform under widely varying conditions
  • The best kernel for a 3x3 convolution may not be the best kernel for a 17x17 one; more complex operations are even hairier
• 127-129. Meta-programming: leave the grunt-programming to the computer
  • Dynamically compile specialized versions of the same kernel for different conditions
  • Smooth over syntactic ugliness: unroll loops, index otherwise un-indexable registers
  • Dynamic, empirical run-time tuning
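The flavor of this approach can be conveyed with a plain-Python sketch. Note this is not the Cheetah-based templating used in the actual kernels shown later in the deck; the kernel skeleton and names below are invented for illustration.

```python
from string import Template

# A made-up kernel skeleton; ${body} is filled with fully unrolled code.
KERNEL = Template("""__global__ void convolve_${fw}x${fh}(const float* in, float* out)
{
    float sum = 0.0f;
${body}
    // (output indexing elided)
}
""")

def specialize(fw, fh):
    """Generate a kernel source string with the filter loops unrolled
    at code-generation time, one variant per (fw, fh)."""
    body = "\n".join(
        f"    sum += in[{y} * {fw} + {x}] * c_filter[{y}][{x}];"
        for y in range(fh) for x in range(fw)
    )
    return KERNEL.substitute(fw=fw, fh=fh, body=body)

src_3x3 = specialize(3, 3)
src_7x7 = specialize(7, 7)
```

In a full run-time tuning loop, each generated variant would be compiled and timed on the actual hardware, and the fastest kept.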
• 130-131. conv_kernel_template.cu (templated CUDA convolution kernel; listing truncated on the slide)

  texture<float4, 1, cudaReadModeElementType> tex_float4;
  __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

  #define IMUL(a, b) __mul24(a, b)
  extern "C" {

  #for j in xrange($FILTER_H)
  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {
    #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
    #for i in xrange($LOAD_ITERATIONS)
    #if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i) < $INPUT_BLOCK_W)
    #end if
    {
      input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
      shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
      shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
      shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
• 132. (figure: conv_kernel_template.cu shown side by side with a generated instance, conv_kernel_4x4x4.cu, in which the template's #for loops have been expanded into straight-line code: repeated v = shared_in[...]; w = constant[...]; sum += v*w; statements, one per filter tap)
• 133. (figure: conv_kernel_template.cu and two generated instances: conv_kernel_4x4x4.cu at 20 kB and conv_kernel_8x8x4.cu at 64 kB of source)
• 134-137. A seemingly innocuous "flow" parameter change in conv_kernel_beta_template.cu produces very different generated code:

  version A:
    ...
    mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
    mov.b32 $r1, c0[$ofs2+0x0008]
    mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x000c]
    mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x0010]
    mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
    ...

  version B:
    ...
    mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
    mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
    ...
• 138-139. Pipeline: Generate Random Models → Unsupervised Learning (Video) → Test with "screening" task → Skim off best models → Validate on other tasks
• 140-141. Unsupervised Experience: learn filter kernels from temporal input statistics (unsupervised training video: Law & Order)
• 142-143. Screening: a quick object recognition test to find promising models. Pipeline: Generate Random Models → Unsupervised Learning (Video) → Test with "screening" task → Skim off best models → Validate on other tasks
• 144. Screening: a quick object recognition test to find promising models ("cars" vs. "planes")
• 145-147. Screening: a quick object recognition test to find promising models. Parametric variation: no variation → more variation → lots of variation
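The generate → screen → skim pipeline can be sketched in a few lines of Python. The parameter names and the toy scoring function below are placeholders; in the real experiments the score is accuracy on the cars-vs-planes screening task.

```python
import random

# Hypothetical parameter space (names are illustrative only)
PARAM_SPACE = {
    "n_filters": [16, 32, 64],
    "kernel_size": [3, 5, 7, 9],
    "norm_strength": [0.0, 0.5, 1.0],
}

def sample_model(rng):
    """Draw one random model: an independent choice for each parameter."""
    return {k: rng.choice(v) for k, v in PARAM_SPACE.items()}

def screen(models, score_fn, keep=5):
    """Score every candidate on the cheap screening task, skim off the best."""
    return sorted(models, key=score_fn, reverse=True)[:keep]

rng = random.Random(0)
candidates = [sample_model(rng) for _ in range(100)]

# Toy stand-in for screening-task accuracy
def toy_score(m):
    return m["n_filters"] / m["kernel_size"]

best = screen(candidates, toy_score)
```

The skimmed-off models are then carried forward to the validation tasks; only the screening step has to be cheap enough to run thousands of times.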
• 148-150. (figure: histogram of screening performance, % correct from 50 to 100, across N=2500 candidate models, with the V1S baseline and the top five models marked)
• 151-152. Validation: see how we do on other test sets. Pipeline: Generate Random Models → Unsupervised Learning (Video) → Test with "screening" task → Skim off best models → Validate on other tasks
• 153. (figure: validation performance, % correct, of the V1S baseline and the top five screened models on cars vs. planes (validate) and boats vs. animals)
• 154. (figure: validation performance, % correct, of the V1S baseline and the top five screened models on synthetic face discrimination and webcam faces)
• 156. Other "Petri Dishes": cars and planes; boats
• 157. Other "Petri Dishes" (figure: for each unsupervised training "world" - cars and planes, boats, and Law & Order - the screening distribution, N=2500, and validation performance on cars vs. planes (validate), boats vs. animals, synthetic face discrimination, and webcam faces)
• 158. Screening Basically Works (figure: validation performance vs. cars-vs-planes screening performance; cars vs. planes (validate), r=0.84)
• 159. (figure: the same scatter for all four validation tasks: cars vs. planes (validate) r=0.84, boats vs. animals r=0.60, synthetic face discrimination and webcam faces r=0.23 and r=0.49)
• 160. Screening Works on Average (figure: normalized average validation performance vs. cars-vs-planes screening performance, r=0.69)
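The reported r values are ordinary Pearson correlations between screening-task and validation-task scores; a minimal sketch, with made-up scores:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xz = (x - x.mean()) / x.std()
    yz = (y - y.mean()) / y.std()
    return float(np.mean(xz * yz))

# Made-up data: validation score = screening score plus noise
rng = np.random.default_rng(1)
screen_scores = rng.uniform(50, 100, size=200)
valid_scores = screen_scores + rng.normal(0.0, 10.0, size=200)
r = pearson_r(screen_scores, valid_scores)
```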
  • 161. What does it all mean?
• 162. (figure: performance, % correct, as a function of parameter value for Preprocessing / Normalization / Zero-mean (False vs. True) and Layer 1 / Linear Filtering / RF size (3, 5, 7, 9))
• 163. (figure: performance, % correct, as a function of parameter value for Layer 1 / Normalization / Threshold and Layer 3 / Learning / Neighborhood size (1, 3, 5, 7, 9))
• 165-167. Summary
  • GPUs allow us to engage a qualitatively different pace of experimentation
  • CUDA provides an unprecedented performance / effort ratio
  • A new laboratory for studying visual representation
• 170. Shameless Recruitment: Now with Extra NVIDIA Goodness! http://www.rowland.harvard.edu/rjf/cox
• 171-174. Acknowledgments
  Cox Lab @ The Rowland Institute at Harvard: Davide Zoccolan, Nadja Oertelt
  DiCarlo Lab @ MIT: Jim DiCarlo, Nicolas Pinto
  The Rowland Institute at Harvard, HARVARD UNIVERSITY
• 175-177. The Problem with Evaluation: what set to use?
• 179. Caltech 101 (example categories: faces_easy, faces, car_side, airplanes, accordion)
  • 180. Caltech 101 from A. Torralba
• 181. The field does well on the '101 (figure: performance, % correct, of state-of-the-art systems: [1] Wang et al. 2006, [2] Grauman and Darrell 2005, [3] Mutch and Lowe 2006, [4] Lazebnik et al. 2006, [5] Zhang et al. 2006)
• 183-187. The "start-simple" model
  • Input divisive normalization
  • Threshold, saturation
  • Output normalization
  • Downsampling
  • Dimensionality reduction
  • Classification
  • Really big (~2.1 million dimensions)
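As a rough illustration of the first stage, here is a naive numpy sketch of divisive normalization; the 3x3 neighborhood, zero padding, and epsilon are arbitrary choices for the example, not the model's actual parameters.

```python
import numpy as np

def divisive_normalize(x, size=3, eps=1e-6):
    """Divide each pixel by the L2 norm of its local neighborhood."""
    pad = size // 2
    xp = np.pad(x, pad, mode="constant")  # zero-pad the borders
    out = np.empty_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            nb = xp[i:i + size, j:j + size]  # size x size neighborhood
            out[i, j] = x[i, j] / (np.linalg.norm(nb) + eps)
    return out

img = np.ones((8, 8))
norm_img = divisive_normalize(img)
```

Dividing out local contrast like this is what makes the later filtering stages insensitive to overall luminance and contrast changes.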
• 188-190. V1-like model does shockingly well (figure: on the same benchmark, the V1-like and V1+ models match or exceed the state-of-the-art systems [1-5])
• 191-194. But the model is stupid: it's just V1, with no specific mechanisms for achieving invariance. The fight is rigged?
• 195-199. Refocusing on the real problem: just two classes should be easy, no? Parametric variation: no variation → more variation → lots of variation
• 200-201. Pulling performance back to chance (figure: performance, %, falls from near 100 toward 50 as intra-class variation increases in position (x-axis, 0-60%), position (y-axis, 0-120%), scale (0-60%), rotation (0°-90°), and view (0°-90°))
• 202-205. Just one bad database? The same pattern holds for Caltech 256 and for face databases (ORL, Yale, AR, CVL). See us @ ECCV 08 Faces in the Wild Workshop.
• 206. Face databases: ORL (figure: % correct with 4 and 8 training examples; comparisons: 1. pixel space, 2. Savvides et al. 2007, 3.-4. Noushath et al. 2007; V1-like model)
• 207. Face databases: AR (figure: % correct with 5 and 8 training examples; comparisons: 1. pixel space, 2. Liang et al. 2007, 3. Zhang et al. 2007; V1-like model)
• 208. Face databases: YALE (figure: % correct with 4 and 8 training examples; chance = 1/15 ≈ 6.7%; comparisons: 1. pixel space, 2.-5. Noushath et al. 2006, 6. Ben et al. 2006, 7. Wang et al. 2007; V1-like model)
• 209. Face databases: CVL (figure: % correct with 2 training examples, frontal only, and 3 training examples, all; chance = 1/114 ≈ 0.88%; comparisons: 1. pixel space, 2. Goel et al. 2005, 3. Gokberk et al. 2002; V1-like model)
• 210. Face databases: CAS-PEAL (figure: % correct with 4 training examples, facing down / forward / up; comparisons: 1. pixel space, 2.-5. Cao et al. 2004; V1 model)
  • 211. Simulation vs.
• 212. Synthetic Variation (figure: performance, % correct, vs. increasing variation in position (x- and y-axis), scale, in-plane rotation, and depth rotation, for scene, white-noise, and phase-scrambled backgrounds; performance falls toward chance, 50%, as variation increases)