[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
–  Data-independent tasks

–  Tasks with statically-known data dependences



–  SIMD divergence

–  Lacking fine-grained synchronization

–  Lacking writeable, coherent caches
DEVICE            32-bit Key-value Sorting (10^6 pairs/sec)    Keys-only Sorting (10^6 keys/sec)
NVIDIA GTX 280    449  (3.8x speedup*)                         534  (2.9x speedup*)

* speedup vs. Satish et al., "Designing efficient sorting algorithms for manycore GPUs," IPDPS '09
DEVICE                              32-bit Key-value Sorting (10^6 pairs/sec)    Keys-only Sorting (10^6 keys/sec)
NVIDIA GTX 480                      775                                          1005
NVIDIA GTX 280                      449                                           534
NVIDIA 8800 GT                      129                                           171
Intel Knight's Ferry MIC 32-core*    --                                           560
Intel Core i7 quad-core*             --                                           240
Intel Core-2 quad-core*              --                                           138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
 
[Figure: input elements mapped through independent threads to output elements]
–  Each output is dependent upon a finite subset of the input
    •  Threads are decomposed by output element
    •  The output (and at least one input) index is a static function of thread-id
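
A minimal CUDA sketch of this case (illustrative names, not code from the talk): each thread's output index, and the single input index it reads, are static functions of its thread id.

// Each output element depends on exactly one input element;
// indices are a static function of the thread id.
__global__ void elementwise_square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // static function of thread-id
    if (i < n)
        out[i] = in[i] * in[i];                      // output i reads only input i
}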
[Figure: each output element may depend on any / all input elements]
–  Each output element has dependences upon any / all input elements
–  E.g., sorting, reduction, compaction, duplicate removal, histogram generation,
   map-reduce, etc.
–  Threads are decomposed by output element
–  Repeatedly iterate over recycled input streams
–  Output stream size is statically known before each pass

[Figure: successive passes of thread grids consuming and producing recycled streams]
[Figure: pairwise-neighbor reduction tree]

–  O(n) global work from passes of pairwise-neighbor-reduction
–  Static dependences, uniform output allocation
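
As a hedged sketch of this pattern (not the talk's kernels): each pass produces output[i] from its two neighboring inputs, the stream halves every pass, and total global work stays O(n). Buffer names and launch parameters are illustrative.

#include <cuda_runtime.h>

// One pass of pairwise-neighbor reduction: static dependences, uniform output.
__global__ void pairwise_pass(const int *in, int *out, int out_n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < out_n)
        out[i] = in[2 * i] + in[2 * i + 1];
}

// Host side: ping-pong between two device buffers until one element remains
// (assumes n is a power of two, for brevity).
int reduce(int *d_a, int *d_b, int n)
{
    while (n > 1) {
        int out_n = n / 2;
        pairwise_pass<<<(out_n + 255) / 256, 256>>>(d_a, d_b, out_n);
        int *tmp = d_a; d_a = d_b; d_b = tmp;    // swap buffers
        n = out_n;
    }
    int result;
    cudaMemcpy(&result, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    return result;
}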

–  Repeated pairwise swapping
    •  Bubble sort is O(n^2)
    •  Bitonic sort is O(n log^2 n)
–  Need partitioning: dynamic, cooperative allocation

–  Repeatedly check each vertex or edge
    •  Breadth-first search becomes O(V^2)
    •  O(V+E) is work-optimal
–  Need queue: dynamic, cooperative allocation

	
  
–  Variable output per thread
–  Need dynamic, cooperative allocation

[Figure: many threads producing variable amounts of output into an output stream of unknown length]
•  Where do I put something in a list?
    –  Duplicate removal
    –  Sorting
    –  Histogram compilation

•  Where do I enqueue something?
    –  Search space exploration
    –  Graph traversal
    –  General work queues

•  For 30,000 producers and consumers?
    –  Locks serialize everything
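
For contrast with the scan-based allocation that follows, a sketch of the naive alternative (illustrative names): a single global counter advanced with atomicAdd. It is correct, but every producer serializes on the same memory location, and the output order is nondeterministic.

// Naive cooperative allocation: one global counter bumped atomically.
__global__ void enqueue_if_flagged(const int *items, const int *keep,
                                   int *queue, int *queue_counter, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && keep[i]) {
        int slot = atomicAdd(queue_counter, 1);   // global serialization point
        queue[slot] = items[i];
    }
}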
[Figure: one thread per input element reports its allocation requirement, the requirements are prefix-summed, and each thread scatters its output starting at its scan offset]

Input (& allocation requirement)    2   1   0   3   2
Result of prefix scan (sum)         0   2   3   3   6
Output (slots)                      0   1   2   3   4   5   6   7

–  O(n) work
–  For allocation: use scan results as a scattering vector
–  Popularized by Blelloch et al. in the '90s
–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia, 2009.
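
A hedged sketch of the scattering-vector idea (using Thrust's exclusive_scan for the scan itself; names like alloc_req and offsets are illustrative): each producer writes its variable-length output starting at its exclusive-scan offset, reproducing the slide's example.

// Sketch: an exclusive prefix sum over per-element allocation
// requirements serves as a scattering vector.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

// Thread i owns alloc_req[i] output items and writes them contiguously
// starting at offsets[i] (its exclusive-scan result).
__global__ void distribute(const int *alloc_req, const int *offsets,
                           int *output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int base = offsets[i];
    for (int j = 0; j < alloc_req[i]; ++j)
        output[base + j] = i;                  // e.g., tag items with producer id
}

int main()
{
    // Matches the slide: requirements {2,1,0,3,2} scan to offsets {0,2,3,3,6}
    int h_req[] = {2, 1, 0, 3, 2};
    thrust::device_vector<int> req(h_req, h_req + 5), off(5);
    thrust::exclusive_scan(req.begin(), req.end(), off.begin());

    int total = off.back() + req.back();       // 6 + 2 = 8 output slots
    thrust::device_vector<int> out(total);
    distribute<<<1, 32>>>(thrust::raw_pointer_cast(req.data()),
                          thrust::raw_pointer_cast(off.data()),
                          thrust::raw_pointer_cast(out.data()), 5);
    cudaDeviceSynchronize();
    for (int i = 0; i < total; ++i)
        printf("%d ", (int)out[i]);            // 0 0 1 3 3 3 4 4
    printf("\n");
    return 0;
}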
Key sequence                   1110   0011   1010   0111   1100   1000   0101   0001
(positions)                       0      1      2      3      4      5      6      7

The 0s bin (keys whose current digit is 0) precedes the 1s bin in the output.

Allocation requirements        0s bin:  1  0  1  0  1  1  0  0
(flag per key, per digit)      1s bin:  0  1  0  1  0  0  1  1

Scanned allocations            0s bin:  0  1  1  2  2  3  4  4
(bin relocation offsets)       1s bin:  0  0  1  1  2  2  2  3

Adjusted allocations           0s bin:  0  1  1  2  2  3  4  4
(global relocation offsets)    1s bin:  4  4  5  5  6  6  6  7

Scatter offset per key            0      4      1      5      2      3      6      7

Output key sequence            1110   1010   1100   1000   0011   0111   0101   0001
(positions)                       0      1      2      3      4      5      6      7
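
A hedged, host-side (sequential) sketch of the split shown above, to make the offset arithmetic concrete; the function name is illustrative, and this stands in for a single radix digit pass, not the talk's GPU code.

// Split keys by the current radix digit: flag keys per bin, exclusive-scan
// each bin, offset the 1s bin by the total number of 0s, then scatter.
#include <cstdio>

void split_by_bit(const unsigned *keys, unsigned *out, int n, int bit)
{
    int zeros = 0;
    for (int i = 0; i < n; ++i)                 // size of the 0s bin
        zeros += !((keys[i] >> bit) & 1);

    int off0 = 0, off1 = zeros;                 // running exclusive-scan offsets
    for (int i = 0; i < n; ++i) {
        if ((keys[i] >> bit) & 1) out[off1++] = keys[i];
        else                      out[off0++] = keys[i];
    }
}

int main()
{
    unsigned keys[] = {0xE, 0x3, 0xA, 0x7, 0xC, 0x8, 0x5, 0x1};  // the slide's keys
    unsigned out[8];
    split_by_bit(keys, out, 8, 0);              // split on the least-significant bit
    for (unsigned k : out) printf("%X ", k);    // E A C 8 3 7 5 1
    printf("\n");
}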
 
[Figure: un-fused approach — the host program launches separate GPU kernels (determine allocation size, CUDPP scan passes, distribute output), each communicating through global device memory]
[Figure: the same pipeline un-fused (left, determine allocation size / CUDPP scan passes / distribute output as separate kernels) and fused (right, determine allocation, scan, and distribute output combined into fewer kernels), both driven by the host program through global device memory]
[Figure: the fused pipeline — determine allocation, scan passes, and distribute output]

1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation
2.  Propagate live data between steps in fast registers / smem
3.  Use scan (or variant) as a "runtime" for everything
Device            Memory Bandwidth    Compute Throughput        Memory wall      Memory wall
                  (10^9 bytes/s)      (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)

GTX 480           169.0               672.0                     0.251            15.9
GTX 285           159.0               354.2                     0.449             8.9
GTX 280           141.7               311.0                     0.456             8.8
Tesla C1060       102.0               312.0                     0.327            12.2
9800 GTX+          70.4               235.0                     0.300            13.4
8800 GT            57.6               168.0                     0.343            11.7
9800 GT            57.6               168.0                     0.343            11.7
8800 GTX           86.4               172.8                     0.500             8.0
Quadro FX 5600     76.8               152.3                     0.504             7.9
[Chart: thread-instructions per 32-bit scan element vs. problem size (millions) — the GTX 285 read+write memory wall sits at 17.8 instructions per input word, a bare data-movement skeleton sits below it, and the gap between them is labeled "insert work here"]
[Chart: the same plot with our scan kernel added just above the data-movement skeleton; the remaining gap up to the 17.8-instruction memory wall is labeled "insert work here"]
–  Increase granularity / redundant computation
    •  ghost cells
    •  radix bits
–  Orthogonal kernel fusion

[Chart: same plot as above — our scan kernel, the data-movement skeleton, the GTX 285 memory wall (17.8), and the "insert work here" headroom]
[Chart: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the CUDPP scan kernel with our scan kernel]
–  Partially-coalesced writes
–  2x write overhead
–  4 total concurrent scan operations (radix 16)

[Chart: thread-instructions per 32-bit scan element vs. problem size (millions) — the GTX 285 radix scatter kernel wall, the GTX 285 scan kernel wall, our scan kernel, and the "insert work here" headroom]
–  Need kernels with tunable local (or redundant) work
    •  ghost cells
    •  radix bits

[Chart: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 480 and GTX 285 radix scatter kernel walls]
 
–  Virtual processors abstract a diversity of hardware configurations



–  Leads to a host of inefficiencies




–  E.g., only several hundred CTAs
[Figure: Grid A — grid-size = (N / tilesize) CTAs, one threadblock per tile; Grid B — grid-size = 150 CTAs (or other small constant)]
[Figure: one threadblock within a large grid]

–  Thread-dependent predicates
–  Setup and initialization code (notably for smem)
–  Offset calculations (notably for smem)

–  Common values are hoisted and kept live
–  Spills are really bad
log_tilesize(N)-level tree:
–  O(N / tilesize) gmem accesses
–  2-4 instructions per access (offset calcs, load, store)

Two-level tree:
–  GPU is least efficient here: get it over with as quickly as possible
[Chart: thread-instructions per element vs. grid size (number of threadblocks), plotting compute load against the GTX 285 scan kernel wall]
C = number of CTAs,  N = problem size,  T = tile size,  B = tiles per CTA

–  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
–  conditional evaluation
–  singleton loads
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA
–  16.1M % (1024 * 150) = 136.4 extra tiles
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
–  109 + 1 = 110 tiles per CTA (136 CTAs)
–  16.1M % (1024 * 150) = 0.4 extra tiles
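
A host-side sketch of this even-share decomposition (EvenShare and the function name are illustrative, not the talk's code): most CTAs take the floor number of tiles, and the leftover tiles are spread one per CTA, so no CTA does more than one extra tile.

#include <cstdio>

struct EvenShare {
    int normal_tiles;   // tiles per "normal" CTA
    int big_ctas;       // number of CTAs that take one extra tile
};

EvenShare even_share(long long num_elements, int tile_size, int grid_size)
{
    long long total_tiles = (num_elements + tile_size - 1) / tile_size;
    EvenShare s;
    s.normal_tiles = (int)(total_tiles / grid_size);
    s.big_ctas     = (int)(total_tiles % grid_size);   // these CTAs do +1 tile
    return s;
}

int main()
{
    // A configuration in the spirit of the slide's example:
    // ~16M elements, 1024-element tiles, a fixed grid of 150 CTAs.
    EvenShare s = even_share(16100000LL, 1024, 150);
    printf("%d CTAs process %d tiles, %d CTAs process %d tiles\n",
           s.big_ctas, s.normal_tiles + 1, 150 - s.big_ctas, s.normal_tiles);
}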
 
–  If you breathe on your code, run it through the VP
    •  Kernel runtimes
    •  Instruction counts

–  Indispensable for tuning
    •  Host-side timing requires too many iterations
    •  Only 1-2 cudaprof iterations for consistent counter-based perf data

–  Write tools to parse the output
    •  "Dummy" kernels useful for demarcation
[Chart: sorting rate (10^6 keys/sec) vs. problem size (millions), for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+]
[Chart: sorting rate (millions of pairs/sec) vs. problem size (millions), for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+]
[Chart: kernel bandwidth (GiBytes/sec) vs. problem size (millions), for the merrill_tree Reduce and merrill_rts Scan kernels]
[Chart: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions), for the merrill_linear Reduce and merrill_linear Scan kernels]
–  Implement device “memcpy” for tile-processing
    •  Optimize for “full tiles”

–  Specialize for different SM versions, input types, etc.
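
A minimal sketch of such a tile-processing "memcpy", assuming illustrative template parameters (CTA_THREADS, TILE_ELEMENTS): full tiles take an unguarded fast path, and only the last, partial tile pays for per-element bounds checks.

// Device "memcpy" structured as tile processing, with a fast path for full tiles.
// Template parameters and names are illustrative, not the talk's actual kernels.
template <typename T, int CTA_THREADS, int TILE_ELEMENTS>
__global__ void tile_copy(const T *in, T *out, long long num_elements)
{
    long long tile_base = (long long)blockIdx.x * TILE_ELEMENTS;

    for (; tile_base < num_elements; tile_base += (long long)gridDim.x * TILE_ELEMENTS) {
        if (tile_base + TILE_ELEMENTS <= num_elements) {
            // Full tile: no per-element guards
            for (int i = threadIdx.x; i < TILE_ELEMENTS; i += CTA_THREADS)
                out[tile_base + i] = in[tile_base + i];
        } else {
            // Partial last tile: guarded
            for (int i = threadIdx.x; tile_base + i < num_elements; i += CTA_THREADS)
                out[tile_base + i] = in[tile_base + i];
        }
    }
}

// Example launch (assumed sizes): tile_copy<int, 128, 1024><<<150, 128>>>(d_in, d_out, n);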
–  Use templated code to
   generate various
   instances


–  Run with cudaprof env
   vars to collect data
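
A sketch of the template-instance idea (the kernel body here is a trivial placeholder, and all names are illustrative): each instantiation carries its tuning parameters in its mangled kernel name, so the cudaprof log separates the configurations automatically.

template <int CTA_THREADS, int TILE_ELEMENTS>
__global__ void scan_tiles(const int *in, int *out, int n)
{
    // Placeholder body standing in for a real tuned kernel (grid-stride copy)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i];
}

template <int CTA_THREADS, int TILE_ELEMENTS>
void run_instance(const int *d_in, int *d_out, int n)
{
    scan_tiles<CTA_THREADS, TILE_ELEMENTS><<<150, CTA_THREADS>>>(d_in, d_out, n);
}

void run_all(const int *d_in, int *d_out, int n)
{
    // Each instantiation shows up under its own mangled name in the profiler output
    run_instance<64,  512 >(d_in, d_out, n);
    run_instance<128, 1024>(d_in, d_out, n);
    run_instance<256, 2048>(d_in, d_out, n);
}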
[Chart: one-way copy bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 64B loads — single, double, and quad issue, each with and without overlap, vs. cudaMemcpy()]

[Chart: two-way copy bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA with 128B loads/stores — single, double, and quad issue, each with and without overlap, vs. an intrinsic copy]
 
[Figure: step-by-step warp scan of x0..x7 — a Brent-Kung network (left, over memory slots m0..m7) vs. a Kogge-Stone network (right, over slots m0..m11, with the low slots seeded with the identity "i")]

–  SIMD lanes are wasted on the O(n)-work Brent-Kung network (left), but it does less work when n > warp size
–  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
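
A hedged sketch of a Kogge-Stone-style warp scan in the spirit of the right-hand diagram (warp-synchronous shared-memory style of that hardware generation; constants and names are illustrative): the low half-warp of shared memory is seeded with the identity so every step can read a fixed distance to the left without branching.

#define WARP_SIZE 32
#define HALF_WARP 16

// Inclusive warp scan (sum). 's' points at this warp's shared-memory region of
// (HALF_WARP + WARP_SIZE) ints. Assumes lockstep warp execution (pre-Volta style).
__device__ int warp_scan_inclusive(int value, volatile int *s)
{
    int lane = threadIdx.x & (WARP_SIZE - 1);

    if (lane < HALF_WARP)
        s[lane] = 0;                         // identity padding ("i" cells in the diagram)

    volatile int *p = s + HALF_WARP + lane;
    p[0] = value;

    for (int offset = 1; offset < WARP_SIZE; offset <<= 1)
        p[0] += p[-offset];                  // padding absorbs out-of-range reads

    return p[0];
}

// Example driver for a single 32-thread block.
__global__ void scan_one_warp(const int *in, int *out, int n)
{
    __shared__ int smem[HALF_WARP + WARP_SIZE];
    int i = threadIdx.x;
    int v = (i < n) ? in[i] : 0;
    int r = warp_scan_inclusive(v, smem);
    if (i < n) out[i] = r;
}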
[Figure: tree-based scan, in which all T threads (t0 … tT-1) cooperate across repeated barriers, vs. raking-based scan, in which a few worker threads (t0 … t3) serially rake shared memory between barriers while the remaining threads act only as "DMA engine" threads]
–  Barriers make O(n) code O(n log n)
–  Only the worker threads rake; the rest are "DMA engine" threads
–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:
    •  2 worker warps per CTA
    •  6-7 CTAs
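
A minimal sketch of the raking idea, assuming an illustrative 128-thread CTA with a single 32-thread raking warp (the names and sizes below are mine, not the lecture's):

    // All 128 threads act as "DMA engines" that stage the tile into shared
    // memory; afterwards only one worker warp rakes serially, so the kernel
    // needs a single barrier rather than a log(n)-deep tree of them.
    #define CTA_THREADS    128
    #define RAKING_THREADS  32
    #define SEG_LENGTH     (CTA_THREADS / RAKING_THREADS)   // 4 items per raking thread

    __global__ void raking_block_reduce(const int *d_in, int *d_out)
    {
        __shared__ int smem[CTA_THREADS];

        // (1) DMA phase: every thread copies one element into smem.
        smem[threadIdx.x] = d_in[blockIdx.x * CTA_THREADS + threadIdx.x];
        __syncthreads();

        // (2) Raking phase: only the worker warp proceeds.
        if (threadIdx.x < RAKING_THREADS) {
            int *seg = smem + threadIdx.x * SEG_LENGTH;
            int partial = seg[0];
            #pragma unroll
            for (int i = 1; i < SEG_LENGTH; ++i) partial += seg[i];

            // (3) Warp-synchronous reduction of the 32 partials (no barrier).
            for (int offset = 16; offset > 0; offset >>= 1)
                partial += __shfl_down_sync(0xffffffffu, partial, offset);

            if (threadIdx.x == 0) d_out[blockIdx.x] = partial;
        }
    }

Launched as raking_block_reduce<<<num_tiles, CTA_THREADS>>>(d_in, d_partials); a second, smaller kernel (or a single CTA) would then combine the per-block partials.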
 
–  Different SMs (varied local storage: registers/smem)
–  Different input types (e.g., sorting chars vs. ulongs)
–  The number of steps in each algorithm phase is configuration-driven
–  Template expansion + constant propagation + static loop unrolling + preprocessor macros
–  The compiler produces target assembly that is well tuned for the specific hardware and problem
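
As a hedged illustration of that meta-programming style (the identifiers below are invented for this sketch; the real tuning structures live in the author's sorting library), the tuning knobs become template parameters so the compiler can constant-propagate and fully unroll:

    template <typename T, int CTA_THREADS, int ITEMS_PER_THREAD>
    __global__ void copy_tiles(const T *d_in, T *d_out)
    {
        const int TILE_ITEMS = CTA_THREADS * ITEMS_PER_THREAD;   // compile-time constant
        int tile_base = blockIdx.x * TILE_ITEMS;

        T items[ITEMS_PER_THREAD];                                // held in registers

        #pragma unroll                                            // trip count is static, so fully unrolled
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            items[i] = d_in[tile_base + i * CTA_THREADS + threadIdx.x];

        #pragma unroll
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            d_out[tile_base + i * CTA_THREADS + threadIdx.x] = items[i];
    }

    // One specialization per (SM generation x key type), chosen at dispatch, e.g.:
    //   copy_tiles<unsigned int, 128, 8><<<grid, 128>>>(d_in, d_out);   // a hypothetical Fermi config
    //   copy_tiles<unsigned char, 64, 4><<<grid,  64>>>(d_in, d_out);   // a hypothetical small-key config

Because CTA_THREADS and ITEMS_PER_THREAD are compile-time constants, the indexing arithmetic constant-folds and each configuration gets its own well-tuned machine code.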
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

  • 2. –  Data-independent tasks –  Tasks with statically-known data dependences –  SIMD divergence –  Lacking fine-grained synchronization –  Lacking writeable, coherent caches
  • 3. –  Data-independent tasks –  Tasks with statically-known data dependences –  SIMD divergence –  Lacking fine-grained synchronization –  Lacking writeable, coherent caches
  • 5.
    DEVICE              Key-value Sorting (10^6 pairs/sec)   Keys-only Sorting (10^6 keys/sec)
    NVIDIA GTX 280      449  (3.8x speedup*)                 534  (2.9x speedup*)
    * Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09
  • 6.
    DEVICE              Key-value Sorting (10^6 pairs/sec)   Keys-only Sorting (10^6 keys/sec)
    NVIDIA GTX 480      775                                  1005
    NVIDIA GTX 280      449                                   534
    NVIDIA 8800 GT      129                                   171
  • 7.
    DEVICE                              Key-value Sorting (10^6 pairs/sec)   Keys-only Sorting (10^6 keys/sec)
    NVIDIA GTX 480                      775                                  1005
    NVIDIA GTX 280                      449                                   534
    NVIDIA 8800 GT                      129                                   171
    Intel Knight's Ferry MIC 32-core*                                         560
    Intel Core i7 quad-core*                                                  240
    Intel Core-2 quad-core*                                                   138
    * Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010
  • 8.  
  • 9. [Input, four threads, Output] –  Each output is dependent upon a finite subset of the input •  Threads are decomposed by output element •  The output (and at least one input) index is a static function of thread-id
  • 10. [Input, ?, Output] –  Each output element has dependences upon any / all input elements –  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
  • 11. –  Threads are decomposed by output element –  Repeatedly iterate over recycled input streams –  Output stream size is statically known before each pass
  • 12. [Diagram: pairwise + of neighboring elements] –  O(n) global work from passes of pairwise-neighbor reduction –  Static dependences, uniform output
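
For instance, one pass of the pairwise-neighbor reduction on slide 12 might look like this sketch (illustrative, not the lecture's code; it assumes the element count handed to each pass is even):

    // Each thread produces one output from exactly two statically-known inputs,
    // so dependences are a static function of thread id.  A host loop that
    // launches passes with n/2, n/4, ... outputs does O(n) total work.
    __global__ void pairwise_reduce_pass(const int *d_in, int *d_out, int n_out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_out)
            d_out[i] = d_in[2 * i] + d_in[2 * i + 1];
    }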
  • 13. –  Repeated pairwise swapping: bubble sort is O(n^2), bitonic sort is O(n log^2 n); need partitioning with dynamic, cooperative allocation –  Repeatedly check each vertex or edge: breadth-first search becomes O(V^2), while O(V+E) is work-optimal; need a queue with dynamic, cooperative allocation
  • 14. (same as slide 13)
  • 15.    –  Variable output per thread –  Need dynamic, cooperative allocation
  • 16. [Input, many threads, ?, Output] •  Where do I put something in a list? Where do I enqueue something? –  Duplicate removal –  Search space exploration –  Sorting –  Graph traversal –  Histogram compilation –  General work queues
  • 17. • For 30,000 producers and consumers? –  Locks serialize everything
  • 18. Prefix sum: input 2 1 0 3 2, exclusive scan 0 2 3 3 6 –  O(n) work –  For allocation: use scan results as a scattering vector –  Popularized by Blelloch et al. in the '90s –  Merrill et al., Parallel Scan for Stream Architectures, Technical Report CS2009-14, University of Virginia, 2009
  • 19. One thread per element: input (& allocation requirement) 2 1 0 3 2, result of prefix scan (sum) 0 2 3 3 6 –  O(n) work –  For allocation: use scan results as a scattering vector –  Popularized by Blelloch et al. in the '90s –  Merrill et al., Parallel Scan for Stream Architectures, Technical Report CS2009-14, University of Virginia, 2009
  • 20. Input (& allocation requirement) 2 1 0 3 2, prefix scan (sum) 0 2 3 3 6; threads then scatter into output positions 0 1 2 3 4 5 6 7 –  O(n) work –  For allocation: use scan results as a scattering vector –  Popularized by Blelloch et al. in the '90s –  Merrill et al., Parallel Scan for Stream Architectures, Technical Report CS2009-14, University of Virginia, 2009
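
As a hedged sketch of slides 18-20 (not the lecture's kernels; Thrust's exclusive_scan stands in for the custom scan, and the function names are mine), the scanned counts become each thread's private write offset:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Each input thread produces a variable number of outputs; the exclusive
    // prefix sum of those counts gives every thread a contiguous write offset,
    // i.e., the "scattering vector."
    __global__ void scatter_outputs(const int *d_counts, const int *d_offsets,
                                    const int *d_items, int *d_out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int base = d_offsets[i];                   // e.g., 0 2 3 3 6 for counts 2 1 0 3 2
        for (int j = 0; j < d_counts[i]; ++j)
            d_out[base + j] = d_items[i];          // stand-in for the real per-thread outputs
    }

    void expand(const thrust::device_vector<int> &counts,
                const thrust::device_vector<int> &items,
                thrust::device_vector<int> &out)
    {
        int n = (int)counts.size();
        if (n == 0) return;
        thrust::device_vector<int> offsets(n);
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
        int total = (int)offsets[n - 1] + (int)counts[n - 1];   // total allocation size
        out.resize(total);
        scatter_outputs<<<(n + 255) / 256, 256>>>(
            thrust::raw_pointer_cast(counts.data()),
            thrust::raw_pointer_cast(offsets.data()),
            thrust::raw_pointer_cast(items.data()),
            thrust::raw_pointer_cast(out.data()), n);
    }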
  • 21. Key sequence: 1110 0011 1010 0111 1100 1000 0101 0001; split on the least-significant bit (0s before 1s); output key sequence: 1110 1010 1100 1000 0011 0111 0101 0001
  • 22. Key sequence (positions 0-7): 1110 0011 1010 0111 1100 1000 0101 0001. Allocation requirements: 0s flags 1 0 1 0 1 1 0 0, 1s flags 0 1 0 1 0 0 1 1. Scanned allocations (relocation offsets): 0s 0 1 1 2 2 3 4 4, 1s 0 0 1 1 2 2 2 3
  • 23. Allocation requirements: 0s 1 0 1 0 1 1 0 0, 1s 0 1 0 1 0 0 1 1. Scanned allocations (bin relocation offsets): 0s 0 1 1 2 2 3 4 4, 1s 0 0 1 1 2 2 2 3. Adjusted allocations (global relocation offsets; the 1s bin is biased by the 4 total zeros): 0s 0 1 1 2 2 3 4 4, 1s 4 4 5 5 6 6 6 7. Per-key scatter offsets: 0 4 1 5 2 3 6 7. Key sequence 1110 0011 1010 0111 1100 1000 0101 0001 becomes output key sequence 1110 1010 1100 1000 0011 0111 0101 0001
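
A host-side rendition of the split step in slides 21-23, written sequentially just to pin down the offset arithmetic (the GPU version computes the same offsets with the parallel scans shown above):

    #include <cstdint>
    #include <vector>

    // Stable 1-bit split: keys whose selected bit is 0 keep their order and come
    // first; keys whose bit is 1 follow, biased by the total number of zeros.
    std::vector<uint32_t> split_on_bit(const std::vector<uint32_t> &keys, int bit)
    {
        std::vector<uint32_t> out(keys.size());

        size_t zeros = 0;                               // total "allocation" for the 0s bin
        for (uint32_t k : keys) zeros += ((k >> bit) & 1u) == 0u;

        size_t next0 = 0, next1 = zeros;                // running (scanned) offsets per bin
        for (uint32_t k : keys) {
            if ((k >> bit) & 1u) out[next1++] = k;
            else                 out[next0++] = k;
        }
        return out;   // bit 0 maps the slide's keys to 1110 1010 1100 1000 0011 0111 0101 0001
    }
    // Applying split_on_bit for bits 0..31 in LSD order yields a full radix sort.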
  • 24.  
  • 25. Un-fused: the host program launches separate kernels (determine allocation size, CUDPP scan, distribute output), and each step round-trips through global device memory (Host | GPU)
  • 26. Un-fused vs. fused: the fused version folds the determine-allocation, scan, and distribute-output steps together, cutting the number of passes through global device memory
  • 27. Fused: 1. Heavy SMT (over-threading) yields usable "bubbles" of free computation 2. Propagate live data between steps in fast registers / smem 3. Use scan (or a variant) as a "runtime" for everything
  • 28. (same as slide 27)
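
One hedged reading of what the fusion in slides 25-28 buys, with invented kernel and array names: the "determine allocation" flag is derived from the key in registers inside the scan's own upsweep, so it never has to be materialized in global memory for a separate library scan to read back:

    // Assumes a 256-thread CTA (power-of-two blockDim) for the smem reduction.
    __global__ void fused_upsweep(const unsigned int *d_keys, int *d_block_counts,
                                  int bit, int n)
    {
        __shared__ int smem[256];
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        // Fused step: the 0/1 allocation flag never touches global memory.
        int flag = (gid < n) ? (int)((d_keys[gid] >> bit) & 1u) : 0;

        smem[tid] = flag;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) smem[tid] += smem[tid + stride];
            __syncthreads();
        }
        if (tid == 0) d_block_counts[blockIdx.x] = smem[0];   // one partial per CTA
    }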
  • 29. Device   Memory  Bandwidth   Compute  Throughput   Memory  wall   Memory  wall     (109  bytes/s)   (109  thread-­‐cycles/s)   (bytes/cycle)   (instrs/word)   GTX  480   169.0   672.0   0.251   15.9   GTX  285   159.0   354.2   0.449   8.9   GTX  280   141.7   311.0   0.456   8.8   Tesla  C1060   102.0   312.0   0.327   12.2   9800  GTX+   70.4   235.0   0.300   13.4   8800  GT   57.6   168.0   0.343   11.7   9800  GT   57.6   168.0   0.343   11.7   8800  GTX   86.4   172.8   0.500   8.0   Quadro  FX  5600   76.8   152.3   0.504   7.9  
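A small worked example of how the two derived columns appear to follow from the first two (my reading of the table, using the GTX 480 row): bytes the DRAM can supply per issued thread-cycle, and how many thread-instructions can be overlapped with moving one 32-bit word.

```cuda
// Memory-wall arithmetic for the GTX 480 row of the table above.
#include <cstdio>

int main() {
    double bw_gbs     = 169.0;   // memory bandwidth, 10^9 bytes/s
    double compute_gc = 672.0;   // compute throughput, 10^9 thread-cycles/s

    double bytes_per_cycle = bw_gbs / compute_gc;     // ~0.251 bytes/cycle
    double instrs_per_word = 4.0 / bytes_per_cycle;   // ~15.9 instructions per 32-bit word

    printf("bytes/cycle = %.3f, instrs/word = %.1f\n", bytes_per_cycle, instrs_per_word);
    return 0;
}
```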
• 31. [Chart: thread-instructions per 32-bit scan element vs. problem size (millions); the GTX 285 r+w memory wall sits at 17.8 instructions per input word, with the headroom below it labeled "Insert work here"]
• 32. [Chart: as above, adding a "Data Movement Skeleton" curve well below the GTX 285 r+w memory wall (17.8); the gap is labeled "Insert work here"]
• 33. [Chart: as above, adding an "Our Scan Kernel" curve between the data-movement skeleton and the GTX 285 r+w memory wall (17.8); the remaining headroom is labeled "Insert work here"]
• 34. [Chart: as above (GTX 285 r+w memory wall at 17.8, our scan kernel, data-movement skeleton)]
–  Increase granularity / redundant computation
   •  ghost cells
   •  radix bits
–  Orthogonal kernel fusion
• 35. [Chart: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the CUDPP scan kernel against our scan kernel]
• 36. [Chart: thread-instructions per 32-bit scan element vs. problem size (millions); the GTX 285 radix scatter kernel wall sits above the GTX 285 scan kernel wall, with our scan kernel below and the gap labeled "Insert work here"]
–  Partially-coalesced writes
–  2x write overhead
–  4 total concurrent scan operations (radix 16)
• 37. [Chart: thread-instructions per 32-bit word vs. problem size (millions); the GTX 480 radix scatter kernel wall sits above the GTX 285 radix scatter kernel wall]
–  Need kernels with tunable local (or redundant) work
   •  ghost cells
   •  radix bits
  • 38.  
• 39.
–  Virtual processors abstract a diversity of hardware configurations
–  Leads to a host of inefficiencies
–  E.g., the hardware can only keep several hundred CTAs resident at any one time
• 41. Two grid decompositions (see the sketch below):
–  Grid A: grid-size = (N / tilesize) CTAs, i.e., one threadblock per tile
–  Grid B: grid-size = 150 CTAs (or some other small constant), each threadblock iterating over many tiles
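A rough sketch of the two decompositions (assumed structure, not the talk's kernels): Grid A launches one CTA per tile, while Grid B launches a fixed 150-CTA grid whose threadblocks loop over the tiles they own, amortizing per-CTA setup.

```cuda
// Grid A (one CTA per tile) vs. Grid B (fixed small grid, tile loop per CTA).
#include <cstdio>
#include <cuda_runtime.h>

const int TILE = 1024;

// Grid A: one CTA per tile; grid size grows with N.
__global__ void sum_one_tile_per_cta(const int *in, int *partials, int n) {
    int i = blockIdx.x * TILE + threadIdx.x;
    int x = (i < n) ? in[i] : 0;
    atomicAdd(&partials[blockIdx.x], x);           // stand-in for per-tile work
}

// Grid B: fixed grid of C CTAs; each CTA strides through the tiles it owns.
__global__ void sum_fixed_grid(const int *in, int *partials, int n) {
    int sum = 0;
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int i = base + threadIdx.x;
        if (i < n) sum += in[i];                    // common offsets stay live in registers
    }
    atomicAdd(&partials[blockIdx.x], sum);
}

int main() {
    int n = 1 << 22;
    int *d_in, *d_partials;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));

    int gridA = (n + TILE - 1) / TILE;              // thousands of CTAs
    cudaMalloc(&d_partials, gridA * sizeof(int));
    cudaMemset(d_partials, 0, gridA * sizeof(int));
    sum_one_tile_per_cta<<<gridA, TILE>>>(d_in, d_partials, n);

    int gridB = 150;                                // small constant grid
    cudaMemset(d_partials, 0, gridB * sizeof(int));
    sum_fixed_grid<<<gridB, TILE>>>(d_in, d_partials, n);

    cudaDeviceSynchronize();
    printf("launched %d vs %d CTAs\n", gridA, gridB);
    cudaFree(d_in); cudaFree(d_partials);
    return 0;
}
```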
• 42.
–  Thread-dependent predicates
–  Setup and initialization code (notably for smem)
–  Offset calculations (notably for smem)
–  Common values are hoisted and kept live
• 44.
–  Thread-dependent predicates
–  Setup and initialization code (notably for smem)
–  Offset calculations (notably for smem)
–  Common values are hoisted and kept live
–  Spills are really bad
• 45. log_tilesize(N)-level tree vs. two-level tree
–  O(N / tilesize) gmem accesses
–  2-4 instructions per access (offset calcs, load, store)
–  GPU is least efficient here: get it over with as quickly as possible
• 47. [Chart: thread-instructions per element (compute load) vs. grid size (# of threadblocks, up to ~9000), plotted against the GTX 285 scan kernel wall]
• 48. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
–  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
–  conditional evaluation
–  singleton loads
• 50. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA
–  16.1M % (1024 * 150) = 136.4 extra tiles
• 51. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
–  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
–  109 + 1 = 110 tiles per CTA (136 CTAs)
–  16.1M % (1024 * 150) = 0.4 extra tiles
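A minimal host-side sketch of one way to implement the even-share split described above; the scheme and the example values of N, T, and C are illustrative and may differ from the talk's exact assignment.

```cuda
// Even-share tile assignment: some CTAs take B+1 tiles, the rest take B,
// with at most one partial tile overall.
#include <cstdio>

int main() {
    long long N = 1LL << 24;       // problem size (hypothetical example value)
    int T = 1024;                  // tile size
    int C = 150;                   // number of CTAs

    long long total_tiles = (N + T - 1) / T;     // last tile may be partial
    long long B           = total_tiles / C;     // base tiles per CTA
    long long big_ctas    = total_tiles % C;     // CTAs that take one extra tile

    printf("%lld CTAs process %lld tiles, %lld CTAs process %lld tiles\n",
           big_ctas, B + 1, C - big_ctas, B);
    return 0;
}
```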
  • 52.  
• 53.
–  If you breathe on your code, run it through the visual profiler
   •  Kernel runtimes
   •  Instruction counts
–  Indispensable for tuning
   •  Host-side timing requires too many iterations
   •  Only 1-2 cudaprof iterations are needed for consistent counter-based perf data
–  Write tools to parse the output
   •  "Dummy" kernels are useful for demarcation
• 54. [Chart: sorting rate (10^6 keys/sec) vs. problem size (millions, up to ~272) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+]
• 55. [Chart: sorting rate (millions of pairs/sec) vs. problem size (millions, up to ~240) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+]
• 56. [Chart: kernel bandwidth (GiBytes/sec) vs. problem size (millions) for the merrill_tree Reduce and merrill_rts Scan kernels]
• 57. [Chart: kernel bandwidth (bytes x10^9 /sec) vs. problem size (millions) for the merrill_linear Reduce and merrill_linear Scan kernels]
  • 58. –  Implement device “memcpy” for tile-processing •  Optimize for “full tiles” –  Specialize for different SM versions, input types, etc.
  • 60. –  Use templated code to generate various instances –  Run with cudaprof env vars to collect data
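A hedged sketch of what such a templated tile-processing "memcpy" might look like (not the talk's code): element type, CTA width, and items per thread are template parameters, full tiles take an unrolled, bounds-check-free path, and different instances can be generated and timed under the profiler.

```cuda
// Templated tile-copy kernel with a fast full-tile path and a guarded cleanup path.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T, int CTA_THREADS, int ITEMS_PER_THREAD>
__global__ void tile_copy(const T *in, T *out, int n) {
    const int TILE = CTA_THREADS * ITEMS_PER_THREAD;

    // Each CTA strides through the tiles it owns.
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        if (base + TILE <= n) {
            #pragma unroll                                  // full tile: no bounds checks
            for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
                int idx = base + i * CTA_THREADS + threadIdx.x;
                out[idx] = in[idx];
            }
        } else {
            for (int i = 0; i < ITEMS_PER_THREAD; ++i) {    // partial-tile cleanup
                int idx = base + i * CTA_THREADS + threadIdx.x;
                if (idx < n) out[idx] = in[idx];
            }
        }
    }
}

int main() {
    int n = 1 << 20;
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_in, 1, n * sizeof(int));

    // Two instances of the same kernel, specialized at compile time.
    tile_copy<int, 128, 4><<<150, 128>>>(d_in, d_out, n);
    tile_copy<int, 256, 2><<<150, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    printf("done\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```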
• 61. [Charts: achieved bandwidth (GiBytes/sec) vs. words copied (millions) for a 128-thread CTA; one-way copies (64B loads) and two-way copies (128B loads/stores) at single-, double-, and quad-word granularities, with and without overlap, compared against cudaMemcpy() and the intrinsic copy]
  • 62.  
• 63. [Diagram: scan dataflow through shared-memory cells m0..m11 over time steps t0..t5; Brent-Kung network on the left vs. Kogge-Stone network (with identity padding) on the right]
–  SIMD lanes are wasted on the O(n)-work Brent-Kung network (left), but it does less work when n > warp size
–  Kogge-Stone (right) is O(n log n) work, but faster when n ≤ warp size
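For reference, a minimal Kogge-Stone-style warp scan (inclusive prefix sum). This sketch uses warp shuffles for brevity and correctness on current GPUs; the Fermi-era kernels the talk describes instead rake over shared memory padded with identity elements, as in the right-hand diagram.

```cuda
// Kogge-Stone warp scan: each lane accumulates the value 1, 2, 4, 8, 16 lanes back.
#include <cstdio>
#include <cuda_runtime.h>

const int WARP = 32;

__global__ void warp_scan(const int *in, int *out) {
    int lane = threadIdx.x;            // launched with a single 32-thread warp
    int x = in[lane];

    for (int d = 1; d < WARP; d <<= 1) {
        int y = __shfl_up_sync(0xffffffff, x, d);
        if (lane >= d) x += y;
    }
    out[lane] = x;                     // inclusive prefix sums
}

int main() {
    int h[WARP], r[WARP];
    for (int i = 0; i < WARP; ++i) h[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(r));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);

    warp_scan<<<1, WARP>>>(d_in, d_out);
    cudaMemcpy(r, d_out, sizeof(r), cudaMemcpyDeviceToHost);

    for (int i = 0; i < WARP; ++i) printf("%d ", r[i]);   // 1 2 3 ... 32
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```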
• 65. [Diagram: tree-based CTA scan, where all T threads cooperate with a barrier after every level, vs. raking-based CTA scan, where a small group of raking threads works serially around a single barrier]
• 67. [Diagram: a few "worker" warps perform the raking scan while the remaining "DMA engine" threads only stage data]
–  Barriers make O(n) code O(n log n)
–  The rest are "DMA engine" threads
–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:
   •  2 worker warps per CTA
   •  6-7 CTAs
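A hedged sketch of the raking pattern: every thread stages data into shared memory, but only one "worker" warp does the serial raking and the warp scan, so only two barriers are needed. The parameters here (128 threads, strips of 4) are illustrative, not the talk's tuned configuration.

```cuda
// Raking-style exclusive CTA scan: stage tile, one warp rakes strips, scan partials, seed back.
#include <cstdio>
#include <cuda_runtime.h>

const int CTA_THREADS = 128;
const int WARP        = 32;
const int STRIP       = CTA_THREADS / WARP;     // 4 elements raked per worker thread

__global__ void raking_scan(const int *in, int *out) {
    __shared__ int tile[CTA_THREADS];

    int tid = threadIdx.x;
    tile[tid] = in[tid];                        // "DMA": every thread stages one element
    __syncthreads();

    if (tid < WARP) {                           // worker warp only
        int base = tid * STRIP;

        int partial = 0;                        // serial upsweep over this thread's strip
        for (int i = 0; i < STRIP; ++i) partial += tile[base + i];

        // Kogge-Stone scan of the 32 strip partials via warp shuffles.
        int inclusive = partial;
        for (int d = 1; d < WARP; d <<= 1) {
            int y = __shfl_up_sync(0xffffffff, inclusive, d);
            if (tid >= d) inclusive += y;
        }
        int exclusive = inclusive - partial;

        for (int i = 0; i < STRIP; ++i) {       // serial downsweep: seed prefixes back
            int v = tile[base + i];
            tile[base + i] = exclusive;
            exclusive += v;
        }
    }
    __syncthreads();

    out[tid] = tile[tid];
}

int main() {
    int h[CTA_THREADS], r[CTA_THREADS];
    for (int i = 0; i < CTA_THREADS; ++i) h[i] = 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h));  cudaMalloc(&d_out, sizeof(r));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);

    raking_scan<<<1, CTA_THREADS>>>(d_in, d_out);
    cudaMemcpy(r, d_out, sizeof(r), cudaMemcpyDeviceToHost);

    printf("%d %d ... %d\n", r[0], r[1], r[CTA_THREADS - 1]);   // 0 1 ... 127
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```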
  • 68.  
• 69.
–  Different SMs (varied local storage: registers / smem)
–  Different input types (e.g., sorting chars vs. ulongs)
–  # of steps for each algorithm phase is configuration-driven
–  Template expansion + constant propagation + static loop unrolling + preprocessor macros
–  The compiler produces target assembly that is well tuned for the specific hardware and problem
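A minimal sketch of configuration-driven specialization (assumed, not the talk's code): a policy template carries the tuning constants, and template expansion plus static unrolling lets the compiler specialize the kernel body per architecture and key type. The policy names and values below are hypothetical.

```cuda
// Policy-driven kernel specialization: tuning constants become compile-time parameters.
#include <cstdio>
#include <cuda_runtime.h>

template <int _CTA_THREADS, int _ITEMS_PER_THREAD, int _RADIX_BITS>
struct Policy {
    static const int CTA_THREADS      = _CTA_THREADS;
    static const int ITEMS_PER_THREAD = _ITEMS_PER_THREAD;
    static const int RADIX_BITS       = _RADIX_BITS;
    static const int TILE             = _CTA_THREADS * _ITEMS_PER_THREAD;
};

// Example policies one might pick per SM version / key type (illustrative values).
typedef Policy<128, 4, 4> PolicySM20;
typedef Policy<256, 2, 4> PolicySM13;

template <typename P, typename K>
__global__ void histogram_tile(const K *keys, int *bins) {
    __shared__ int s_bins[1 << P::RADIX_BITS];
    if (threadIdx.x < (1 << P::RADIX_BITS)) s_bins[threadIdx.x] = 0;
    __syncthreads();

    #pragma unroll                                   // unrolled at compile time
    for (int i = 0; i < P::ITEMS_PER_THREAD; ++i) {
        K k = keys[blockIdx.x * P::TILE + i * P::CTA_THREADS + threadIdx.x];
        atomicAdd(&s_bins[k & ((1 << P::RADIX_BITS) - 1)], 1);
    }
    __syncthreads();

    if (threadIdx.x < (1 << P::RADIX_BITS))
        atomicAdd(&bins[threadIdx.x], s_bins[threadIdx.x]);
}

int main() {
    const int N = PolicySM20::TILE * 64;             // 64 full tiles
    unsigned *d_keys;  int *d_bins;
    cudaMalloc(&d_keys, N * sizeof(unsigned));
    cudaMalloc(&d_bins, 16 * sizeof(int));
    cudaMemset(d_keys, 0, N * sizeof(unsigned));
    cudaMemset(d_bins, 0, 16 * sizeof(int));

    histogram_tile<PolicySM20, unsigned><<<64, PolicySM20::CTA_THREADS>>>(d_keys, d_bins);
    cudaDeviceSynchronize();

    int bin0;
    cudaMemcpy(&bin0, d_bins, sizeof(int), cudaMemcpyDeviceToHost);
    printf("bin 0 count: %d (expected %d)\n", bin0, N);
    cudaFree(d_keys); cudaFree(d_bins);
    return 0;
}
```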