The R-Stream High-Level Program Transformation Tool
      N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin




Reservoir Labs   Harvard 04/12/2011
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Power efficiency driving architectures




[Figure: a representative power-efficiency-driven architecture built from FPGAs, GPPs, and SIMD units with DMA engines and local memories. Called-out characteristics: heterogeneous processing, distributed local memories, explicitly managed architecture, bandwidth starved, multiple spatial dimensions, NUMA, hierarchical (including board, chassis, cabinet), multiple execution models, mixed parallelism types.]
Computation choreography


        • Expressing it
          • Annotations and pragma dialects for C
          • Explicitly (e.g., new languages like CUDA and OpenCL)

        • But before expressing it, how can programmers find it?
          • Manual constructive procedures: art, sweat, time
            – Artisans get complete control over every detail
          • Automatically (our focus)
            – An operations research problem, like scheduling trucks to save fuel
            – Model, solve, implement
            – Faster, and sometimes better, than a human
How to do automatic scheduling?


        • Naïve approach
          • Model
            – Tasks, dependences
            – Resource use, latencies
            – Machine model with connectivity, resource capacities
          • Solve with ILP
            – Minimize overall task length
            – Subject to dependences, resource use
          • Problems
            – Complexity: the task graph is huge!
            – Dynamics: loop trip counts are unknown

        • So we do something much cooler.
Program Transformations Specification


[Figure: the iteration space of a statement S(i,j) is mapped by a schedule Θ : Z² → Z² from iteration coordinates (i, j) to multi-dimensional time (t1, t2).]

        • Schedule maps iterations to multi-dimensional time
          – A feasible schedule preserves dependences
        • Placement maps iterations to multi-dimensional space
          – UHPC in progress, partially done
        • Layout maps data elements to multi-dimensional space
          – UHPC in progress
        • Hierarchical by design; tiling serves separation of concerns
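For intuition, a small worked example of a feasible schedule (ours, not from the deck): take the skewing schedule

    Θ(i, j) = (i, i + j)

A dependence from iteration (i, j) to (i, j + 1) is preserved, since Θ(i, j + 1) = (i, i + j + 1) comes after Θ(i, j) = (i, i + j) in lexicographic time order.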
Loop transformations

 for(i=0; i<N; i++)
   for(j=0; j<N; j++)
     S(i,j);
 permutation    for(j=0; j<N; j++)            Θ(i,j) = [0 1; 1 0] (i,j)ᵀ    (unimodular)
                  for(i=0; i<N; i++)
                    S(i,j);

 reversal       for(i=N-1; i>=0; i--)         Θ(i,j) = [-1 0; 0 1] (i,j)ᵀ
                  for(j=0; j<N; j++)
                    S(i,j);

 skewing        for(i=0; i<N; i++)            Θ(i,j) = [1 0; a 1] (i,j)ᵀ
                  for(j=a*i; j<N+a*i; j++)
                    S(i,j-a*i);

 scaling        for(i=0; i<a*N; i+=a)         Θ(i,j) = [a 0; 0 1] (i,j)ᵀ
                  for(j=0; j<N; j++)
                    S(i/a,j);

 (a denotes an integer skew/scale factor)
Loop fusion and distribution

 for(i=0; i<N; i++) {                                  for(i=0; i<N; i++)
   for(j=0; j<N; j++)               fusion -->           for(j=0; j<N; j++) {
     S1(i,j);                                              S1(i,j);
   for(j=0; j<N; j++)               <-- distribution       S2(i,j);
     S2(i,j);                                            }
 }

 The corresponding schedules (each matrix applied to (i, j, 1)ᵀ), written as vectors:

   distributed:  Θ1(i,j) = (0, i, 0, j, 0),   Θ2(i,j) = (0, i, 1, j, 0)
   fused:        Θ1(i,j) = (0, i, 0, j, 0),   Θ2(i,j) = (0, i, 0, j, 1)
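Fusion is legal only when every dependence is still satisfied afterwards; a minimal counterexample (ours, for illustration):

 /* Fusing these two loops is illegal: at iteration i, S2 reads a[i+1],
  * which S1 only writes at iteration i+1, so after fusion the read
  * would execute before the write it depends on. */
 for (i = 0; i < N; i++) a[i] = f(i);      /* S1 */
 for (i = 0; i < N; i++) b[i] = a[i+1];    /* S2 */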
Enabling technology is new compiler math

Uniform Recurrence Equations [Karp et al. 1970]

Loop Transformations and Parallelization [1970-]
   Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
   Vectorization, SMP, locality optimizations
   Dependence summary: direction/distance vectors
   Unimodular transformations

Systolic Array Mapping
   Mostly linear-algebraic

Polyhedral Model [1980-]
   Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
   Exact dependence analysis
   General affine transformations
   Loop synthesis via polyhedral scanning
   New computational techniques based on polyhedral representations
R-Stream model: polyhedra

n = f();
for (i=5; i<=n; i+=2) {
  A[i][i] = A[i][i]/B[i];
  for (j=0; j<=i; j++) {
    if (j<=10) {
      … A[i+2j+n][i+3] …
    }
  }
}

Iteration domain:  {(i, j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n; 0 ≤ j ≤ i; j ≤ 10; i = 2k+1}

Access function for A[i+2j+n][i+3], applied to (i, j, n, 1)ᵀ:

   A0 = (1 2 1 0)  →  i + 2j + n
   A1 = (1 0 0 3)  →  i + 3

The representation captures affine and non-affine transformations, and the
order and place of operations and data.

Loop code is represented (exactly or conservatively) with polyhedra:
   a high-level, mathematical view of a mapping
   that targets concrete properties: parallelism, locality, memory footprint
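A worked instance of the domain and access function above (our arithmetic): for n = 9, the point (i, j) = (7, 3) lies in the domain, since 5 ≤ 7 ≤ 9, 0 ≤ 3 ≤ 7, 3 ≤ 10, and 7 = 2·3 + 1. The access function then gives A[7 + 2·3 + 9][7 + 3] = A[22][10].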
Polyhedral slogans


        • Parametric imperfect loop nests

        • Subsumes classical transformations

        • Compacts the transformation search space

        • Parallelization, locality optimization (communication avoiding)

        • Preserves semantics

        • Analytic joint formulations of optimizations

        • Not just for affine static control programs
Polyhedral model – challenges in building a compiler


        • Killer math

        • Scalability of optimizations/code generation

        • Mostly confined to dependence-preserving transformations

        • Code can be radically transformed: outputs can look wildly different

        • Modeling indirections, pointers, non-affine code

        • Many of these challenges are solved
R-Stream blueprint




[Figure: R-Stream blueprint. The EDG C front end produces a scalar representation; raising lifts it into the polyhedral mapper, which is guided by a machine model; lowering brings the mapped program back to the scalar representation, and a pretty printer emits the output code.]
Inside the polyhedral mapper




[Figure: inside the polyhedral mapper. A GDG representation feeds a tactics module, which drives the optimization modules: parallelization/locality optimization, tiling, placement, communication generation, memory promotion, sync generation, layout optimization, polyhedral scanning, and more, all built on Jolylib.]
Inside the polyhedral mapper
Optimization modules are engineered to expose "knobs" that an auto-tuner can drive.

[Figure: the same polyhedral mapper diagram as on the previous slide.]
Driving the mapping: the machine model


• Target machine characteristics that influence how the mapping should be done:
  • Local memory / cache sizes
  • Communication facilities: DMA, cache(s)
  • Synchronization capabilities
  • Symmetrical or not
  • SIMD width
  • Bandwidths

• Currently: a two-level model (host and accelerators)
• XML schema and graphical rendering (illustrative sketch below)
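A minimal sketch of what a two-level machine-model XML might contain; the element and attribute names below are invented for illustration and are not R-Stream's actual schema:

<machine name="host-plus-teslas">
  <!-- hypothetical host level: OpenMP morph, one thread per GPU -->
  <level name="host" morph="openmp" cores="8" memory="9GB"/>
  <!-- hypothetical accelerator level: CUDA morph -->
  <level name="accelerator" morph="cuda" count="2">
    <scratchpad size="16KB"/>
    <simd width="32"/>
    <dma/>
  </level>
</machine>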
Machine model example: multi-Tesla


[Figure: machine model for a multi-Tesla system, rendered from the XML file: the host (OpenMP morph) runs one thread per GPU; each GPU is described by a CUDA morph.]
Mapping process

Starting from dependences:

   1. Scheduling: parallelism, locality, tilability
   2. Task formation:
      – coarse-grain atomic tasks
      – master/slave side operations
   3. Placement: assign tasks to blocks/threads
   4. Then:
      – local / global data layout optimization
      – multi-buffering (explicitly managed)
      – synchronization (barriers)
      – bulk communications
      – thread generation -> master/slave
      – CUDA-specific optimizations
Program Transformations Specification


[Figure: the iteration space of a statement S(i,j) is mapped by a schedule Θ : Z² → Z² from iteration coordinates (i, j) to multi-dimensional time (t1, t2).]

        • Schedule maps iterations to multi-dimensional time
          – A feasible schedule preserves dependences
        • Placement maps iterations to multi-dimensional space
          – UHPC in progress, partially done
        • Layout maps data elements to multi-dimensional space
          – UHPC in progress
        • Hierarchical by design; tiling serves separation of concerns
Model for scheduling trades 3 objectives jointly


[Figure: the scheduling model trades three objectives jointly. Loop fusion moves toward more locality and fewer global memory accesses; loop fission moves toward more parallelism and sufficient occupancy; adding successive-thread contiguity on either side yields memory coalescing and better effective bandwidth. Patent pending.]
Optimization with BLAS vs. global optimization


/* Optimization with BLAS: the outer loop(s) stay sequential, and data Z
   is retrieved from disk and stored back around each call, causing
   numerous cache misses. */
for loop {
  …
  BLAS call 1     /* retrieve Z from disk, store Z back to disk */
  …
  BLAS call 2     /* retrieve Z from disk again! */
  …
  BLAS call n
  …
}

                         VS.

/* Global optimization: the outer loop(s) can be parallelized, and loop
   fusion can improve locality. */
doall loop {
  …
  for loop {
    …
    [read from Z]
    …
    [write to Z]
    …
    [read from Z]
  }
  …
}

Global optimization can expose better parallelism and locality.
Tradeoffs between parallelism and locality

        • Significant parallelism is needed to fully utilize all resources
        • Locality is also critical to minimize communication
        • Parallelism can come at the expense of locality
          [Figure: high on-chip parallelism, but limited bandwidth at the chip border]

        • Our approach: the R-Stream compiler exposes parallelism via affine scheduling
          that simultaneously augments locality using loop fusion
          [Figure: reuse data once loaded on chip = locality]
Parallelism/locality tradeoff example
/*
 * Original code:
 * Simplified CSLC LMS
 */
for (k=0; k<400; k++) {
  for (i=0; i<3997; i++) {
    z[i]=0;
    for (j=0; j<4000; j++)
      z[i]=z[i]+B[i][j]*x[k][j];
  }
  for (i=0; i<3997; i++)
    w[i]=w[i]+z[i];
}

Maximum parallelism (no fusion): array z gets expanded (to z_e) to introduce
another level of parallelism, but maximum distribution destroys locality:

doall (i=0; i<400; i++)
  doall (j=0; j<3997; j++)
    z_e[j][i]=0;
doall (i=0; i<400; i++)
  doall (j=0; j<3997; j++)
    for (k=0; k<4000; k++)
      z_e[j][i]=z_e[j][i]+B[j][k]*x[i][k];
doall (i=0; i<3997; i++)
  for (j=0; j<400; j++)
    w[i]=w[i]+z_e[i][j];
doall (i=0; i<3997; i++)      /* data accumulation */
  z[i]=z_e[i][399];

2 levels of parallelism, but poor data reuse (on array z_e)
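The storage cost of that expansion is easy to quantify (our arithmetic; the radar benchmarks later in the deck use single-precision data): z_e holds 3997 × 400 ≈ 1.6M elements, about 6.4 MB at 4 bytes per element, versus roughly 16 KB for the original z.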
Parallelism/locality tradeoff example (cont.)


/*
 * Original code:
 * Simplified CSLC LMS
 */
for (k=0; k<400; k++) {
  for (i=0; i<3997; i++) {
    z[i]=0;
    for (j=0; j<4000; j++)
      z[i]=z[i]+B[i][j]*x[k][j];
  }
  for (i=0; i<3997; i++)
    w[i]=w[i]+z[i];
}

Maximum fusion: aggressive loop fusion destroys parallelism
(i.e., only 1 degree of parallelism remains):

doall (i=0; i<3997; i++)
  for (j=0; j<400; j++) {
    z[i]=0;
    for (k=0; k<4000; k++)
      z[i]=z[i]+B[i][k]*x[j][k];
    w[i]=w[i]+z[i];
  }

Very good data reuse (on array z), but only 1 level of parallelism
Parallelism/locality tradeoff example (cont.)

/*
 * Original code:
 * Simplified CSLC LMS
 */
for (k=0; k<400; k++) {
  for (i=0; i<3997; i++) {
    z[i]=0;
    for (j=0; j<4000; j++)
      z[i]=z[i]+B[i][j]*x[k][j];
  }
  for (i=0; i<3997; i++)
    w[i]=w[i]+z[i];
}

Parallelism with partial fusion: array z is expanded, and partial fusion
doesn't decrease parallelism:

doall (i=0; i<3997; i++) {
  doall (j=0; j<400; j++) {
    z_e[i][j]=0;
    for (k=0; k<4000; k++)
      z_e[i][j]=z_e[i][j]+B[i][k]*x[j][k];
  }
  for (j=0; j<400; j++)
    w[i]=w[i]+z_e[i][j];
}
doall (i=0; i<3997; i++)      /* data accumulation */
  z[i]=z_e[i][399];

2 levels of parallelism with good data reuse (on array z_e)
Parallelism/locality tradeoffs: performance numbers

[Figure: performance comparison of the three mapping strategies.]

 Code with a good balance between parallelism and fusion performs best.
 On explicitly managed memory/scratchpad architectures this is even more true.
R-Stream: affine scheduling and fusion

    • R-Stream uses a heuristic based on an objective function with several cost
      coefficients:
      • the slowdown in execution if a loop l is executed sequentially rather than in parallel
      • the cost in performance if two loops p and q remain unfused rather than fused

                  minimize   Σ_{l ∈ loops} w_l·p_l  +  Σ_{e ∈ loop edges} u_e·f_e

                             (slowdown in sequential     (cost of unfusing
                              execution)                  two loops)

    • These two cost coefficients address parallelism and locality in a unified and
      unbiased manner (as opposed to traditional compilers)
    • Fine-grained parallelism, such as SIMD, can also be modeled using a similar
      formulation

                                                                    Patent pending
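Reading the objective as a sketch (the deck does not spell out the variable domains, so this is our gloss): p_l and f_e can be viewed as 0/1 decisions, with w_l and u_e their cost coefficients. With two loops and one fusion edge,

    cost = w_1·p_1 + w_2·p_2 + u_12·f_12

so keeping both loops parallel and fused costs 0, while serializing loop 2 but keeping the fusion costs w_2.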
Parallelism + locality + spatial locality


              Hypothesis: auto-tuning should adjust these parameters

                  minimize   Σ_{l ∈ loops} w_l·p_l  +  Σ_{e ∈ loop edges} u_e·f_e

                  w_l captures the benefits of parallel execution;
                  u_e captures the benefits of improved locality.

              A new algorithm (unpublished) balances contiguity to
              enhance coalescing for GPUs and SIMDization,
              modulo data-layout transformations.
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
What R-Stream does for you – in a nutshell

     • Input
       • Sequential
         – Short and simple textbook C code
         – Just add a "#pragma map" and R-Stream figures out the rest
           (see the sketch below)
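A minimal sketch of such an input (the kernel is ours for illustration; the pragma spelling follows the RTM example later in this deck):

#define N 1024

#pragma rstream map
void matmul(float A[N][N], float B[N][N], float C[N][N]) {
  int i, j, k;
  /* plain textbook loop nest: no parallel annotations needed */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j];
}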
What R-Stream does for you – in a nutshell

     • Input: sequential, short and simple textbook C code; just add a
       "#pragma map" and R-Stream figures out the rest
     • Output: OpenMP + CUDA code
       – Hundreds of lines of tightly optimized GPU-side CUDA code
       – A few lines of host-side OpenMP C code

     • Example: a Gauss-Seidel 9-point stencil (sketched below)
       – Used in iterative PDE solvers for scientific modeling (heat, fluid
         flow, waves, etc.)
       – A building block for faster iterative solvers like Multigrid or AMR
       – Very difficult to hand-optimize; not available in any standard library
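A hedged sketch of such an input kernel (ours; the deck's exact benchmark code is not shown): one sweep of 9-point Gauss-Seidel relaxation for Laplace's equation. The in-place update is what makes it Gauss-Seidel rather than Jacobi, and what makes it hard to parallelize by hand:

#define N 2048

#pragma rstream map
void gs9(double u[N][N], int sweeps) {
  int t, i, j;
  for (t = 0; t < sweeps; t++)
    for (i = 1; i < N-1; i++)
      for (j = 1; j < N-1; j++)
        /* standard 9-point weights for the Laplacian: edge neighbors
         * weighted 4, corner neighbors weighted 1, divided by 20 */
        u[i][j] = (4.0*(u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
                      + u[i-1][j-1] + u[i-1][j+1]
                      + u[i+1][j-1] + u[i+1][j+1]) / 20.0;
}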
What R-Stream does for you – in a nutshell

     • Input: sequential, short and simple textbook C code with a "#pragma map"
     • Output: OpenMP + CUDA code (hundreds of lines of tightly optimized
       GPU-side CUDA code, a few lines of host-side OpenMP C code)

     • Achieving up to
       – 20 GFLOPS on a GTX 285
       – 25 GFLOPS on a GTX 480
       (illustrated in the next few slides)
Finding and utilizing available parallelism

[Figure: excerpt of automatically generated code, beside a GPU block diagram:
SMs each with shared memory, per-SP registers, an instruction unit, constant
and texture caches, and off-chip device memory (global, constant, texture).]

R-Stream AUTOMATICALLY finds and forms parallelism:
extracting and mapping parallel loops.
Memory compaction on GPU scratchpad


[Figure: excerpt of automatically generated code, beside the same GPU diagram.]

R-Stream AUTOMATICALLY manages the local scratchpad.
GPU DRAM to scratchpad coalesced communication

[Figure: excerpt of automatically generated code, beside the same GPU diagram.]

R-Stream AUTOMATICALLY chooses parallelism to favor coalescing:
coalesced GPU DRAM accesses.
Host-to-GPU communication

[Figure: excerpt of automatically generated code. The CPU and host memory
connect to the GPU's off-chip device memory over PCI Express.]

R-Stream AUTOMATICALLY chooses the partition and sets up
host-to-GPU communication.
Multi-GPU mapping
[Figure: excerpt of automatically generated code; a mapping across all GPUs,
with multiple CPUs/host memories each driving GPUs with their own memories.]

R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs.
Multi-GPU mapping
[Figure: excerpt of automatically generated code; multi-streaming of
host-GPU communication across the same multi-GPU system.]

R-Stream AUTOMATICALLY creates n-way software pipelines for communications.
Future capabilities – mapping to CPU-GPU clusters

[Figure: CPU + GPU nodes connected by a high-speed interconnect (e.g.,
InfiniBand). On each node, the program runs as an MPI process containing an
OpenMP process that launches CUDA kernels; each node has CPU DRAM plus
per-GPU DRAM.]
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Experimental evaluation

   Configuration 1: MKL                    radar code -> MKL calls
   Configuration 2: Low-level compilers    radar code -> GCC / ICC
   Configuration 3: R-Stream               radar code -> R-Stream -> optimized radar code -> GCC / ICC

   • Main comparisons:
     • R-Stream High-Level C Transformation Tool 3.1.2
     • Intel MKL 10.2.1
Experimental evaluation (cont.)

   • Intel Xeon workstation:
     • Dual quad-core E5405 Xeon processors (8 cores total)
     • 9 GB memory
   • 8 OpenMP threads
   • Single-precision floating-point data
   • Low-level compilers and the flags used (example invocations below):
     • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
     • ICC: -fast -openmp
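For concreteness, those flags correspond to invocations like the following (file names are hypothetical; note that GCC treats any optimization level above -O3 as -O3):

  gcc -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp radar.c -o radar
  icc -fast -openmp radar.c -o radar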
Radar benchmarks


     • Beamforming algorithms:
       • MVDR-SER: Minimum Variance Distortionless Response using
         Sequential Regression
       • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
       • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
     • Expressed in sequential ANSI C
     • 400 radar iterations
     • Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)
MVDR-SER




[Figure: MVDR-SER performance results.]
CSLC-LMS




[Figure: CSLC-LMS performance results.]
CSLC-RLS




[Figure: CSLC-RLS performance results.]
3D Discretized wave equation input code (RTM)
#pragma rstream map
void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
            int pX, int pY, int pZ) {
  double temp;
  int i, j, k;

  for (k=4; k<pZ-4; k++) {
    for (j=4; j<pY-4; j++) {
      for (i=4; i<pX-4; i++) {
        /* 25-point, 8th order (in space) stencil */
        temp = C0 * U2[k][j][i] +
           C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                 U2[k][j-1][i] + U2[k][j+1][i] +
                 U2[k][j][i-1] + U2[k][j][i+1]) +
           C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                 U2[k][j-2][i] + U2[k][j+2][i] +
                 U2[k][j][i-2] + U2[k][j][i+2]) +
           C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                 U2[k][j-3][i] + U2[k][j+3][i] +
                 U2[k][j][i-3] + U2[k][j][i+3]) +
           C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                 U2[k][j-4][i] + U2[k][j+4][i] +
                 U2[k][j][i-4] + U2[k][j][i+4]);

        U1[k][j][i] =
          2.0f * U2[k][j][i] - U1[k][j][i] +
          V[k][j][i] * temp;
} } } }
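The excerpt assumes the grid extents X, Y and the stencil coefficients C0..C4 are compile-time constants; a hypothetical prelude (values chosen for illustration, using the standard 8th-order central-difference weights) might read:

#define X  512                 /* hypothetical grid extents */
#define Y  512
#define C0 (-8.541667)         /* 3 * (-205/72), illustrative */
#define C1 ( 1.600000)         /*  8/5   */
#define C2 (-0.200000)         /* -1/5   */
#define C3 ( 0.025397)         /*  8/315 */
#define C4 (-0.001786)         /* -1/560 */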
3D Discretized wave equation input code (RTM)




[Figure: excerpt of the automatically generated CUDA code. Callouts: the
generated mapping is not so naïve; communication autotuning knobs are
exposed; threadIdx.x divergence is expensive.]
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Current status


        • Ongoing development also supported by DOE and Reservoir
          • Improvements in scope, stability, performance

        • Installations/evaluations at US government laboratories

        • Forward collaboration with Georgia Tech on Keeneland
          • HP SL390: 3 Fermi GPUs and 2 Westmere CPUs per node

        • Basis of the compiler for the DARPA UHPC Intel Corporation team
Availability



        • Per-developer-seat licensing model

        • Support (releases, bug fixes, services)

        • Available with commercial-grade external solvers

        • Government has limited rights / SBIR data rights

        • Academic source licenses with collaborators

        • Professional team, continuity, software engineering
R-Stream Gives More Insight About Programs


        • Teaches you about parallelism and the polyhedral model:
          • Generates correct code for any transformation
            – The transformation may be incorrect if specified by the user
          • Imperative code generation was the bottleneck until 2005
            – 3 theses have been written on the topic, and it is still not completely covered
            – It is good to have an intuition of how these things work
        • R-Stream has meaningful metrics to represent your program:
          • Maximal amount of parallelism given minimal expansion
          • Tradeoffs between coarse-grained and fine-grained parallelism
          • Loop types (doall, red, perm, seq)
        • Helps with algorithm selection
        • Tiling of imperfectly nested loop nests
        • Generates code for explicitly managed memory
How to Use R-Stream Successfully


        • R-Stream is a great transformation tool; it is also a great learning tool:
          • Takes "simple C" input code
          • Applies multiple transformations and lists them explicitly
          • Can dump code at any step in the process
          • It's the tool I wish I had during my PhD
            – To be fair, I already had a great set of tools
        • R-Stream can be used in multiple modes:
          • Fully automatic, with compile-flag options
          • Autotuning mode (more than just tile sizes and unrolling …)
          • Scripting / programmatic mode (BeanShell + interfaces)
          • A mix of these modes, plus manual post-processing
Use Case: Fully Automatic Mode + Compile Flag Options


        • Akin to traditional gcc / icc compilation with flags:
          • Predefined transformations can be parameterized
            – Except with fewer (or no) phase-ordering issues
          • Except you can see what each transformation does at each step
          • Except you can generate compilable and executable code at (almost)
            any step in the process
          • Except you can control code generation for compactness or
            performance
Use Case: Autotuning Mode


        • More advanced than traditional approaches:
          • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
          • Knobs are based on well-understood models
          • Knobs target high-level properties of the program
            – Amount of parallelism, amount of memory expansion, depth of
              pipelining of communications and computations …
          • Knobs depend on the target machine, the program, and the state of
            the mapping process:
            – Our tool has introspection
        • We are really building a hierarchical autotuning transformation tool
Use Case: “Power user” interactive interface


        • BeanShell access to optimizations

        • Can direct and review the process of compilation
          • Automatic tactics (affine scheduling)

        • Can direct code generation

        • Access to "tuning parameters"

        • All options / commands available on the command-line interface
Conclusion

       • R-Stream simplifies software development and maintenance

       • Porting: reduces expense and delivery delays

       • Does this by automatically parallelizing loop code
         • While optimizing for data locality, coalescing, etc.

       • Addresses dense, loop-intensive computations

       • Extensions:
         • Data-parallel programming idioms
         • Sparse representations
         • Dynamic runtime execution
Contact us


        • Per-developer seat, floating, and cloud-based licensing

        • Discounts for academic users

        • Research collaborations with academic partners

        • For more information:
          • Call us at 212-780-0527, or
          • See Rich or Ann, or
          • E-mail us at {sales,lethin,johnson}@reservoir.com


[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimization[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimization
 
Cacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svccCacheconcurrencyconsistency cassandra svcc
Cacheconcurrencyconsistency cassandra svcc
 
Work items
Work itemsWork items
Work items
 
Work items
Work itemsWork items
Work items
 
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Rise of the scientific database
Rise of the scientific databaseRise of the scientific database
Rise of the scientific database
 
8085 MICROPROCESSOR
8085 MICROPROCESSOR 8085 MICROPROCESSOR
8085 MICROPROCESSOR
 
Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012Hummingbird - Open Source for Small Satellites - GSAW 2012
Hummingbird - Open Source for Small Satellites - GSAW 2012
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAsScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
 
3D-IC Designs require 3D tools
3D-IC Designs require 3D tools3D-IC Designs require 3D tools
3D-IC Designs require 3D tools
 
MDE based FPGA physical Design Fast prototyping with Smalltalk
MDE based FPGA physical Design Fast prototyping with SmalltalkMDE based FPGA physical Design Fast prototyping with Smalltalk
MDE based FPGA physical Design Fast prototyping with Smalltalk
 
Intro to threp
Intro to threpIntro to threp
Intro to threp
 
Postgres Plus Advanced Server 9.2新機能ご紹介
Postgres Plus Advanced Server 9.2新機能ご紹介Postgres Plus Advanced Server 9.2新機能ご紹介
Postgres Plus Advanced Server 9.2新機能ご紹介
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
 
MathWorks Interview Lecture
MathWorks Interview LectureMathWorks Interview Lecture
MathWorks Interview Lecture
 
Progress_190213
Progress_190213Progress_190213
Progress_190213
 
Java 8 Lambda
Java 8 LambdaJava 8 Lambda
Java 8 Lambda
 
NIAR_VRC_2010
NIAR_VRC_2010NIAR_VRC_2010
NIAR_VRC_2010
 
Extend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop EcosystemExtend starfish to Support the Growing Hadoop Ecosystem
Extend starfish to Support the Growing Hadoop Ecosystem
 

Mais de npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patternsnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...npinto
 

Mais de npinto (16)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 

Último

Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxSayali Powar
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Mental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsMental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsPooky Knightsmith
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 

Último (20)

Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptxBIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
BIOCHEMISTRY-CARBOHYDRATE METABOLISM CHAPTER 2.pptx
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Mental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsMental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young minds
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 

[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)

  • 6. Program Transformations Specification
    [Figure: the iteration space of a statement S(i,j), mapped by a schedule Θ : Z² → Z² from (i, j) coordinates to time coordinates (t1, t2).]
    •• Schedule maps iterations to multi-dimensional time; a feasible schedule preserves dependences
    •• Placement maps iterations to multi-dimensional space (UHPC in progress, partially done)
    •• Layout maps data elements to multi-dimensional space (UHPC in progress)
    •• Hierarchical by design; tiling serves separation of concerns
  • 7. Loop transformations
    Starting nest:
        for (i=0; i<N; i++)
          for (j=0; j<N; j++)
            S(i,j);
    Each classical transformation is an affine map Θ on the iteration vector:
    •• Permutation (unimodular): Θ(i,j) = (j, i), matrix [0 1; 1 0]
        for (j=0; j<N; j++)
          for (i=0; i<N; i++)
            S(i,j);
    •• Reversal: Θ(i,j) = (-i, j), matrix [-1 0; 0 1]
        for (i=N-1; i>=0; i--)
          for (j=0; j<N; j++)
            S(i,j);
    •• Skewing by a factor α: Θ(i,j) = (i, α·i + j), matrix [1 0; α 1]
        for (i=0; i<N; i++)
          for (j=α*i; j<N+α*i; j++)
            S(i, j-α*i);
    •• Scaling by a factor β: Θ(i,j) = (β·i, j), matrix [β 0; 0 1]
        for (i=0; i<β*N; i+=β)
          for (j=0; j<N; j++)
            S(i/β, j);
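    To make the skewing row concrete, here is a minimal hand-worked C sketch (illustrative, not R-Stream output; the array, the bounds, and the skew Θ(i,j) = (i+j, i) are my own choices): skewing a 2-D recurrence turns the inner loop into an independent wavefront that a doall could exploit.

        #include <stdio.h>
        #define N 6
        #define M 6

        int main(void) {
            static double A[N][M];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < M; j++)
                    A[i][j] = (i == 0 || j == 0) ? 1.0 : 0.0;  /* boundary init */

            /* Original nest: A[i][j] = A[i-1][j] + A[i][j-1] carries a
             * dependence on both loops, so neither is parallel as written.
             * Skewed nest: t = i + j enumerates anti-diagonal wavefronts;
             * all iterations at a fixed t are independent of each other. */
            for (int t = 2; t <= (N - 1) + (M - 1); t++) {
                int lo = (t - (M - 1) > 1) ? t - (M - 1) : 1;
                int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
                /* this inner loop is now a doall over the wavefront */
                for (int i = lo; i <= hi; i++) {
                    int j = t - i;  /* recover the original j coordinate */
                    A[i][j] = A[i - 1][j] + A[i][j - 1];
                }
            }
            /* Pascal's recurrence: A[5][5] = C(10,5) = 252 for N = M = 6 */
            printf("A[N-1][M-1] = %g\n", A[N - 1][M - 1]);
            return 0;
        }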
  • 8. Loop fusion and distribution
        for (i=0; i<N; i++) {
          for (j=0; j<N; j++)
            S1(i,j);
          for (j=0; j<N; j++)
            S2(i,j);
        }
    ⟵ distribution / fusion ⟶
        for (i=0; i<N; i++)
          for (j=0; j<N; j++) {
            S1(i,j);
            S2(i,j);
          }
    Schedule dimensions alternate scalar and loop levels (β0, i, β1, j, β2). After fusion, S1 and S2 share all loop dimensions and are ordered only by the innermost scalar: Θ1(i,j) = (0, i, 0, j, 0), Θ2(i,j) = (0, i, 0, j, 1). After distribution at depth 1, they are separated by the scalar between the i and j loops: Θ1(i,j) = (0, i, 0, j, 0), Θ2(i,j) = (0, i, 1, j, 0).
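    A minimal sketch of why the fusion direction pays off (array names and sizes are illustrative, not from the deck): after fusing the two j-loops, each intermediate value is produced and consumed in the same iteration instead of making a round trip through memory.

        #include <stddef.h>
        #define N 1024
        static double a[N][N], b[N][N], c[N][N];

        /* Distributed form: each b[i][j] is written in one j-sweep and
         * re-read in the next, after N doubles have gone through cache. */
        void distributed(void) {
            for (size_t i = 0; i < N; i++) {
                for (size_t j = 0; j < N; j++)
                    b[i][j] = 2.0 * a[i][j];
                for (size_t j = 0; j < N; j++)
                    c[i][j] = b[i][j] + 1.0;
            }
        }

        /* Fused form: producer and consumer share the same iteration,
         * so the intermediate value can live in a register. */
        void fused(void) {
            for (size_t i = 0; i < N; i++)
                for (size_t j = 0; j < N; j++) {
                    double t = 2.0 * a[i][j];
                    b[i][j] = t;          /* keep the store if b is live-out */
                    c[i][j] = t + 1.0;
                }
        }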
  • 9. Enabling technology is new compiler math
    •• Uniform Recurrence Equations [Karp et al. 1970]
    •• Loop Transformations and Parallelization [1970–]: vectorization, SMP, locality optimizations; dependence summaries (direction/distance vectors); unimodular transformations; systolic array mapping; mostly linear-algebraic. Many contributors: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
    •• Polyhedral Model [1980–]: exact dependence analysis; general affine transformations; loop synthesis via polyhedral scanning; new computational techniques based on polyhedral representations. Many contributors: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
  • 10. R-Stream model: polyhedra
        n = f();
        for (i=5; i<=n; i+=2) {
          A[i][i] = A[i][i] / B[i];
          for (j=0; j<=i; j++) {
            if (j<=10) {
              ... A[i+2*j+n][i+3] ...
            }
          }
        }
    The iteration domain of the inner statement is the set
        { (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n, 0 ≤ j ≤ i, j ≤ 10, i = 2k+1 }
    and the access A[i+2*j+n][i+3] is the affine function
        (A0, A1) = (i + 2j + n, i + 3),
    i.e. the matrix rows (1 2 1 0) and (1 0 0 3) applied to (i, j, n, 1).
    •• Affine and non-affine transformations
    •• Order and place of operations and data
    •• Loop code represented (exactly or conservatively) with polyhedra
    •• High-level, mathematical view of a mapping
    •• But targets concrete properties: parallelism, locality, memory footprint
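    As a sanity check on the reconstruction above, the domain can be enumerated directly; this small C driver (hypothetical, with the parameter n fixed for illustration) scans exactly the integer points of the polyhedron, which is the kind of loop nest polyhedral scanning regenerates from the set description.

        #include <stdio.h>

        int main(void) {
            int n = 15;  /* in the slide's example, n = f() is a runtime parameter */
            /* Scan { (i,j) : 5 <= i <= n, 0 <= j <= min(i,10), i odd } */
            for (int i = 5; i <= n; i += 2)      /* i = 2k+1 handled by the stride */
                for (int j = 0; j <= (i < 10 ? i : 10); j++)
                    printf("(%d, %d) -> A[%d][%d]\n", i, j, i + 2*j + n, i + 3);
            return 0;
        }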
  • 11. Polyhedral slogans
    •• Parametric imperfect loop nests
    •• Subsumes classical transformations
    •• Compacts the transformation search space
    •• Parallelization, locality optimization (communication avoiding)
    •• Preserves semantics
    •• Analytic joint formulations of optimizations
    •• Not just for affine static control programs
  • 12. Polyhedral model – challenges in building a compiler
    •• Killer math
    •• Scalability of optimizations/code generation
    •• Mostly confined to dependence-preserving transformations
    •• Code can be radically transformed – outputs can look wildly different
    •• Modeling indirections, pointers, non-affine code
    •• Many of these challenges are solved
  • 13. R-Stream blueprint
    [Diagram: EDG C front end → scalar representation → raising → polyhedral mapper (driven by the machine model) → lowering → pretty printer.]
  • 14. Inside the polyhedral mapper
    [Diagram: a GDG representation driven by a tactics module, with passes for parallelization, tiling, placement, communication generation, locality optimization, memory promotion, synchronization generation, layout optimization, and polyhedral scanning (Jolylib, ...).]
  • 15. Inside the polyhedral mapper (cont.)
    Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner.
    [Same diagram as slide 14.]
  • 16. Driving the mapping: the machine model
    •• Target machine characteristics that influence how the mapping should be done:
      • Local memory / cache sizes
      • Communication facilities: DMA, cache(s)
      • Synchronization capabilities
      • Symmetrical or not
      • SIMD width
      • Bandwidths
    •• Currently: two-level model (Host and Accelerators)
    •• XML schema and graphical rendering
  • 17. Machine model example: multi-Tesla
    [Diagram: an XML machine-model file describing a host that runs one thread per GPU; an OpenMP morph for the host side and a CUDA morph for the accelerators.]
  • 18. Mapping process
    1. Scheduling: parallelism, locality, tilability
    2. Task formation (driven by dependencies): coarse-grain atomic tasks; master/slave side operations
    3. Placement: assign tasks to blocks/threads
    Then: local/global data layout optimization; multi-buffering (explicitly managed); synchronization (barriers); bulk communications; thread generation → master/slave; CUDA-specific optimizations
  • 19. Program Transformations Specification (recap of slide 6: schedule, placement, and layout maps; hierarchical by design, tiling serves separation of concerns)
  • 20. Model for scheduling trades 3 objectives jointly
    [Diagram: loop fission yields more parallelism and sufficient occupancy; loop fusion yields more locality and fewer global memory accesses; adding successive-thread contiguity to either enables memory coalescing and better effective bandwidth.]
    Patent pending.
  • 21. Optimization with BLAS vs. global optimization
        /* Optimization with BLAS */      // numerous cache misses
        for loop {
          ...
          BLAS call 1                     // retrieve data Z, store Z back
          ...
          BLAS call 2                     // [read from Z] retrieve Z again!
          ...
          BLAS call n                     // [write to Z] ... [read from Z]
          ...
        }
    vs.
        /* Global optimization */
        doall loop {                      // can parallelize outer loop(s)
          outer loop(s)
          for loop {                      // loop fusion can improve locality
            ... [read from Z] ... [write to Z] ... [read from Z] ...
          }
        }
    Global optimization can expose better parallelism and locality.
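    A concrete, hand-written illustration of the contrast (the functions scale and add stand in for library routines and are not from the slides): composing two calls streams the intermediate z through memory twice, while global optimization fuses across the call boundary.

        #include <stddef.h>
        #define N (1 << 20)
        static double x[N], y[N], z[N];

        /* Library-call style: z is streamed out by scale() and streamed
         * back in by add(), like back-to-back BLAS calls. */
        void scale(double *out, const double *in, double a, size_t n) {
            for (size_t i = 0; i < n; i++) out[i] = a * in[i];
        }
        void add(double *out, const double *a, const double *b, size_t n) {
            for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i];
        }
        void with_calls(void) {
            scale(z, x, 2.0, N);   /* write z to memory */
            add(y, y, z, N);       /* re-read z from memory */
        }

        /* Globally optimized: fused across the call boundary, so the
         * intermediate value never leaves a register. */
        void fused(void) {
            for (size_t i = 0; i < N; i++)
                y[i] = y[i] + 2.0 * x[i];
        }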
  • 22. Tradeoffs between parallelism and locality
    •• Significant parallelism is needed to fully utilize all resources
    •• Locality is also critical to minimize communication
    •• Parallelism can come at the expense of locality
    [Diagram: high on-chip parallelism vs. limited bandwidth at the chip border; reusing data once loaded on chip = locality.]
    •• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously improves locality using loop fusion
  • 23. Parallelism/locality tradeoff example
        /* Original code: simplified CSLC LMS */
        for (k=0; k<400; k++) {
          for (i=0; i<3997; i++) {
            z[i] = 0;
            for (j=0; j<4000; j++)
              z[i] = z[i] + B[i][j]*x[k][j];
          }
          for (i=0; i<3997; i++)
            w[i] = w[i] + z[i];
        }
    Maximum distribution destroys locality; array z gets expanded to introduce another level of parallelism:
        /* Max. parallelism (no fusion) */
        doall (i=0; i<400; i++)
          doall (j=0; j<3997; j++)
            z_e[j][i] = 0;
        doall (i=0; i<400; i++)
          doall (j=0; j<3997; j++)
            for (k=0; k<4000; k++)
              z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
        doall (i=0; i<3997; i++)
          for (j=0; j<400; j++)
            w[i] = w[i] + z_e[i][j];     /* data accumulation */
        doall (i=0; i<3997; i++)
          z[i] = z_e[i][399];
    2 levels of parallelism, but poor data reuse (on array z_e).
  • 24. Parallelism/locality tradeoff example (cont.)
    Aggressive loop fusion destroys parallelism (i.e., only 1 degree of parallelism):
        /* Max. fusion (original code as on slide 23) */
        doall (i=0; i<3997; i++)
          for (j=0; j<400; j++) {
            z[i] = 0;
            for (k=0; k<4000; k++)
              z[i] = z[i] + B[i][k]*x[j][k];
            w[i] = w[i] + z[i];
          }
    Very good data reuse (on array z), but only 1 level of parallelism.
  • 25. Parallelism/locality tradeoff example (cont.)
    Partial fusion doesn't decrease parallelism; array z is expanded:
        /* Parallelism with partial fusion */
        doall (i=0; i<3997; i++) {
          doall (j=0; j<400; j++) {
            z_e[i][j] = 0;
            for (k=0; k<4000; k++)
              z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
          }
          for (j=0; j<400; j++)
            w[i] = w[i] + z_e[i][j];     /* data accumulation */
        }
        doall (i=0; i<3997; i++)
          z[i] = z_e[i][399];
    2 levels of parallelism with good data reuse (on array z_e).
  • 26. Parallelism/locality tradeoffs: performance numbers
    [Performance chart.] Code with a good balance between parallelism and fusion performs best. On explicitly managed memory/scratchpad architectures this is even more true.
  • 27. R-Stream: affine scheduling and fusion
    •• R-Stream uses a heuristic based on an objective function with several cost coefficients:
      • the slowdown in execution if a loop p is executed sequentially rather than in parallel
      • the cost in performance if two loops p and q remain unfused rather than fused

        minimize   Σ_{l ∈ loops} w_l·p_l   +   Σ_{e ∈ loop edges} u_e·f_e
                   (slowdown in sequential      (cost of unfusing
                    execution)                   two loops)

    •• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
    •• Fine-grained parallelism, such as SIMD, can also be modeled with a similar formulation
    Patent pending.
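    To fix ideas, here is a toy C evaluation of that objective for one candidate mapping (all weights and decisions below are made-up numbers; R-Stream's actual heuristic searches the space of legal schedules rather than scoring a fixed assignment):

        #include <stdio.h>
        #define NLOOPS 3
        #define NEDGES 2

        int main(void) {
            /* w[l]: slowdown if loop l runs sequentially; p[l] = 1 if sequential */
            double w[NLOOPS] = {10.0, 4.0, 6.0};
            int    p[NLOOPS] = {0, 1, 0};
            /* u[e]: cost if the loops joined by edge e stay unfused; f[e] = 1 if unfused */
            double u[NEDGES] = {3.0, 8.0};
            int    f[NEDGES] = {1, 0};

            double cost = 0.0;
            for (int l = 0; l < NLOOPS; l++) cost += w[l] * p[l];
            for (int e = 0; e < NEDGES; e++) cost += u[e] * f[e];
            printf("objective = %g\n", cost);  /* 4 + 3 = 7 for this assignment */
            return 0;
        }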
  • 28. Parallelism + locality + spatial locality
    Hypothesis: auto-tuning should adjust these parameters.

        minimize   Σ_{l ∈ loops} w_l·p_l   +   Σ_{e ∈ loop edges} u_e·f_e
                   (benefits of parallel        (benefits of improved
                    execution)                   locality)

    A new algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization modulo data-layout transformations.
  • 29. Outline (recap): R-Stream Overview; Compilation Walk-through; Performance Results; Getting R-Stream
  • 30. What R-Stream does for you – in a nutshell
    •• Input: sequential code
      – Short and simple textbook C code
      – Just add a "#pragma map" and R-Stream figures out the rest
  • 31. What R-Stream does for you – in a nutshell (cont.)
    •• Input: sequential, short and simple textbook C code; just add a "#pragma map"
      – Example: a Gauss-Seidel 9-point stencil – used in iterative PDE solvers for scientific modeling (heat, fluid flow, waves, etc.); a building block for faster iterative solvers like Multigrid or AMR
      – Very difficult to hand-optimize; not available in any standard library
    •• Output: OpenMP + CUDA code
      – Hundreds of lines of tightly optimized GPU-side CUDA code
      – A few lines of host-side OpenMP C code
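    The input side might look like the following sketch (a plausible 9-point Gauss-Seidel nest written for illustration; the deck does not show the benchmark source, and the pragma spelling follows the RTM example on slide 47):

        /* Assumes row length 1024 and n <= 1024; all names illustrative. */
        #pragma rstream map
        void gs9(double (*A)[1024], int n, int t) {
            /* t sweeps of 9-point Gauss-Seidel relaxation: each point is
             * replaced by the average of itself and its 8 neighbors, in
             * place, so updated values are reused within the same sweep. */
            for (int s = 0; s < t; s++)
                for (int i = 1; i < n - 1; i++)
                    for (int j = 1; j < n - 1; j++)
                        A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1] +
                                   A[i][j-1]   + A[i][j]   + A[i][j+1]   +
                                   A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
        }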
  • 32. What R-Stream does for you – in a nutshell (cont.)
    •• Input: sequential, short and simple textbook C code with a "#pragma map"
    •• Output: OpenMP + CUDA code – hundreds of lines of tightly optimized GPU-side CUDA, a few lines of host-side OpenMP C
    •• Achieving up to 20 GFLOPS on a GTX 285 and 25 GFLOPS on a GTX 480 (illustrated in the next few slides)
  • 33. Finding and utilizing available parallelism
    [Excerpt of automatically generated code, shown against a GPU block diagram: streaming multiprocessors (SM 1..N) with shared memory, registers, SPs, an instruction unit, constant and texture caches, and off-chip device memory (global, constant, texture).]
    R-Stream AUTOMATICALLY finds and forms parallelism: extracting and mapping parallel loops.
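    For intuition only (hand-written; the slide's actual generated excerpt is not reproduced here), the doall extraction described corresponds to annotating a proven-parallel loop, e.g. with OpenMP on the host side:

        #include <omp.h>

        /* Illustrative: once dependence analysis proves the i-loop is a
         * doall, its iterations can be distributed across threads (or,
         * on a GPU target, across thread blocks). */
        void saxpy(float *y, const float *x, float a, int n) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }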
  • 34. Memory compaction on GPU scratchpad
    [Excerpt of automatically generated code, same GPU diagram.]
    R-Stream AUTOMATICALLY manages the local scratchpad (shared memory).
  • 35. GPU DRAM to scratchpad coalesced communication
    [Excerpt of automatically generated code, same GPU diagram.]
    R-Stream AUTOMATICALLY chooses parallelism to favor coalescing: coalesced GPU DRAM accesses.
  • 36. Host-to-GPU communication
    [Excerpt of automatically generated code; the diagram adds the CPU, host memory, and the PCI Express link to the GPU.]
    R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.
  • 37. Multi-GPU mapping
    [Excerpt of automatically generated code; the diagram shows hosts whose CPUs each drive several GPUs with their own memories.]
    R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs: mapping across all GPUs.
  • 38. Multi-GPU mapping (cont.)
    [Same diagram.]
    R-Stream AUTOMATICALLY creates n-way software pipelines for communications: multi-streaming of host-GPU communication.
  • 39. Future capabilities – mapping to CPU-GPU clusters
    [Diagram: CPU + GPU nodes linked by a high-speed interconnect (e.g., InfiniBand); on each node an MPI process launches an OpenMP process, which in turn launches CUDA on the GPUs; DRAM at each level.]
  • 40. Outline (recap): R-Stream Overview; Compilation Walk-through; Performance Results; Getting R-Stream
  • 41. Experimental evaluation
    •• Configuration 1 (MKL): radar code with MKL calls
    •• Configuration 2 (low-level compilers): radar code compiled directly with GCC / ICC
    •• Configuration 3 (R-Stream): radar code optimized by R-Stream, then compiled with GCC / ICC
    •• Main comparisons:
      • R-Stream High-Level C Transformation Tool 3.1.2
      • Intel MKL 10.2.1
  • 42. Experimental evaluation (cont.)
    •• Intel Xeon workstation: dual quad-core E5405 Xeon processors (8 cores total), 9 GB memory
    •• 8 OpenMP threads
    •• Single-precision floating-point data
    •• Low-level compilers and the flags used:
      • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
      • ICC: -fast -openmp
  • 43. Radar benchmarks
    •• Beamforming algorithms:
      • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
      • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
      • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
    •• Expressed in sequential ANSI C
    •• 400 radar iterations
    •• Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)
  • 44. MVDR-SER [performance chart]
  • 45. CSLC-LMS [performance chart]
  • 46. CSLC-RLS [performance chart]
  • 47. 3D discretized wave equation input code (RTM)
    A 25-point, 8th-order (in space) stencil:

        #pragma rstream map
        void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                    int pX, int pY, int pZ) {
          double temp;
          int i, j, k;
          for (k=4; k<pZ-4; k++) {
            for (j=4; j<pY-4; j++) {
              for (i=4; i<pX-4; i++) {
                temp = C0 * U2[k][j][i] +
                       C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                             U2[k][j-1][i] + U2[k][j+1][i] +
                             U2[k][j][i-1] + U2[k][j][i+1]) +
                       C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                             U2[k][j-2][i] + U2[k][j+2][i] +
                             U2[k][j][i-2] + U2[k][j][i+2]) +
                       C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                             U2[k][j-3][i] + U2[k][j+3][i] +
                             U2[k][j][i-3] + U2[k][j][i+3]) +
                       C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                             U2[k][j-4][i] + U2[k][j+4][i] +
                             U2[k][j][i-4] + U2[k][j][i+4]);
                U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] +
                              V[k][j][i] * temp;
              }
            }
          }
        }
  • 48. 3D discretized wave equation mapped code (RTM)
    [Excerpt of automatically generated CUDA code.] Not so naïve: communication autotuning knobs; threadIdx.x divergence is expensive.
  • 49. Outline (recap): R-Stream Overview; Compilation Walk-through; Performance Results; Getting R-Stream
  • 50. Current status
    •• Ongoing development also supported by DOE and Reservoir: improvements in scope, stability, performance
    •• Installations/evaluations at US government laboratories
    •• Forward collaboration with Georgia Tech on Keeneland: HP SL390 – 3 Fermi GPUs, 2 Westmeres per node
    •• Basis of the compiler for the DARPA UHPC Intel Corporation team
  • 51. Availability
    •• Per-developer-seat licensing model
    •• Support (releases, bug fixes, services)
    •• Available with commercial-grade external solvers
    •• Government has limited rights / SBIR data rights
    •• Academic source licenses with collaborators
    •• Professional team, continuity, software engineering
  • 52. R-Stream gives more insight about programs
    •• Teaches you about parallelism and the polyhedral model:
      • Generates correct code for any transformation – the transformation may be incorrect if specified by the user
      • Imperative code generation was the bottleneck until 2005 – 3 theses have been written on the topic and it's still not completely covered; it is good to have an intuition of how these things work
    •• R-Stream has meaningful metrics to represent your program:
      • Maximal amount of parallelism given minimal expansion
      • Tradeoffs between coarse-grained and fine-grained parallelism
      • Loop types (doall, red, perm, seq)
    •• Helps with algorithm selection
    •• Tiling of imperfectly nested loops
    •• Generates code for explicitly managed memory
  • 53. How to use R-Stream successfully
    •• R-Stream is a great transformation tool; it is also a great learning tool:
      • Takes "simple C" input code
      • Applies multiple transformations and lists them explicitly
      • Can dump code at any step in the process
      • It's the tool I wish I had during my PhD (to be fair, I already had a great set of tools)
    •• R-Stream can be used in multiple modes:
      • Fully automatic + compile-flag options
      • Autotuning mode (more than just tile sizes and unrolling ...)
      • Scripting / programmatic mode (BeanShell + interfaces)
      • A mix of these modes + manual post-processing
  • 54. Use case: fully automatic mode + compile-flag options
    •• Akin to traditional gcc / icc compilation with flags:
      • Predefined transformations can be parameterized, with less/no phase-ordering issues
      • Except you can see what each transformation does at each step
      • Except you can generate compilable and executable code at (almost) any step in the process
      • Except you can control code generation for compactness or performance
  • 55. Use case: autotuning mode
    •• More advanced than traditional approaches:
      • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
      • Knobs are based on well-understood models
      • Knobs target high-level properties of the program – amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations ...
      • Knobs depend on the target machine, the program, and the state of the mapping process: the tool has introspection
    •• We are really building a hierarchical autotuning transformation tool
  • 56. Use case: "power user" interactive interface
    •• BeanShell access to optimizations
    •• Can direct and review the process of compilation, including automatic tactics (affine scheduling)
    •• Can direct code generation
    •• Access to "tuning parameters"
    •• All options / commands available on the command-line interface
  • 57. Conclusion
    •• R-Stream simplifies software development and maintenance
    •• Porting: reduces expense and delivery delays
    •• Does this by automatically parallelizing loop code, while optimizing for data locality, coalescing, etc.
    •• Addresses dense loop-intensive computations
    •• Extensions: data-parallel programming idioms; sparse representations; dynamic runtime execution
  • 58. Contact us
    •• Per-developer seat, floating, and cloud-based licensing
    •• Discounts for academic users
    •• Research collaborations with academic partners
    •• For more information:
      • Call us at 212-780-0527, or
      • See Rich, Ann
      • E-mail us at {sales,lethin,johnson}@reservoir.com