Massively Parallel Computing
                        CS 264 / CSCI E-292
Lecture #3: GPU Programming with CUDA | February 8th, 2011




               Nicolas Pinto (MIT, Harvard)
                      pinto@mit.edu
Administrivia
•   New here? Welcome!
•   HW0: Forum, RSS, Survey
•   Lecture 1 & 2 slides posted
•   Project teams allowed (up to 2 students)
    • InnoCentive-like / challenge-driven?
•   HW1: out tonight/tomorrow, due Fri 2/18/11
•   New guest lecturers!
    •   Wen-mei Hwu (UIUC/NCSA), Cyrus Omar (CMU), Cliff Wooley
        (NVIDIA), Richard Lethin (Reservoir Labs), James Malcom
        (Accelereyes), David Cox (Harvard)
During this course, we’ll try to “…” and use existing material ;-)
(adapted for CS264)
Today
yey!!
Objectives
• Get you started with GPU programming
• Introduce CUDA
• “20,000 foot view”
• Get used to the jargon...
• ...with just enough details
• Point to relevant external resources
Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
Review:
Thinking Parallel
      (last week)
Getting your feet wet

• Common scenario: “I want to make algorithm X
  run faster, help me!”


• Q: How do you approach the problem?
How?
• Option 1: wait
• Option 2: gcc -O3 -msse4.2
• Option 3: xlc -O5
• Option 4: use parallel libraries (e.g. (cu)blas)
• Option 5: hand-optimize everything!
• Option 6: wait more
What else?
How about analysis?
Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 10x10x10

  function       time (s)   notes
  load_data()    29         sequential in nature
  foo()          10
  bar()          11
  yey()          50         100% parallelizable

Q: What is the maximum speed up ?
Getting your feet wet
Algorithm X v1.0 Profiling Analysis on Input 10x10x10
(same profile as above)

             A: 2X ! :-(
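
This is Amdahl's law at work. A quick worked check of the slide's numbers (total runtime 100 s, of which only the 50 s inside yey() is parallelizable, so p = 0.5):

    S(s) = \frac{1}{(1 - p) + p/s}, \qquad
    \lim_{s \to \infty} S = \frac{1}{1 - p} = \frac{1}{0.5} = 2

Even with infinitely many processors, the remaining 50 s of sequential work caps the speedup at 2x.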
You need to...
• ... understand the problem (duh!)
• ... study the current (sequential?) solutions and
  their constraints
• ... know the input domain
• ... profile accordingly
• ... “refactor” based on new constraints (hw/sw)
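
For the “profile accordingly” step, a minimal sketch with gprof (the program and input names here are made up; any profiler works):

    gcc -pg -O2 algorithm_x.c -o algorithm_x    # build with profiling instrumentation
    ./algorithm_x input_10x10x10                # the run writes gmon.out
    gprof ./algorithm_x gmon.out                # flat profile: time spent per function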
Some Perspective
The “problem tree” for scientific problem solving

  Technical Problem to be Analyzed
      │  (consultation with experts)
      ├── Scientific Model "A"  /  Model "B"
      ├── Discretization "A"  /  Discretization "B"  /  Theoretical analysis  /  Experiments
      ├── Iterative equation solver  /  Direct elimination equation solver
      └── Parallel implementation  /  Sequential implementation

  Figure: The “problem tree” for scientific problem solving.
  There are many options to try to achieve the same goal.

                                                                        from Scott et al. “Scientific Parallel Computing” (2005)
Computational Thinking

• translate/formulate domain problems into
  computational models that can be solved
  efficiently by available computing resources

• requires a deep understanding of the relationships
  between domain problems and computing resources

                                        adapted from Hwu & Kirk (PASI 2011)
Getting ready...

  APPLICATIONS
    rest on Parallel Thinking / Parallel Computing,
    which draws on: Programming Models, Architecture,
    Algorithms, Patterns, Languages, Compilers

                                                       adapted from Scott et al. “Scientific Parallel Computing” (2005)
You can do it!


• thinking parallel is not as hard as you may think
• many techniques have been thoroughly explained...
• ... and are now “accessible” to non-experts !
Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
Why GPUs?
Motivation

•   “The most economic number of components in an
    IC will double every year”

•   Historically, CPUs get faster

    – Hardware reaching frequency limitations

•   Now CPUs get wider

                               GPUs

                                             slide by Matthew Bolitho
Motivation

                                     Fact:
                       nobody cares about theoretical peak

                             Challenge:
          harness GPU power for real application performance
GFLOPS

  [Chart: peak GFLOPS over time, GPU vs. CPU; the GPU
   curve rises far faster than the CPU curve.]
Motivation

•   Rather than expecting CPUs to get twice as
    fast, expect to have twice as many!

•   Parallel processing for the masses

•   Unfortunately: parallel programming is hard!

    – Algorithms and Data Structures must be
      fundamentally redesigned

                                           slide by Matthew Bolitho
Task vs Data Parallelism
       CPUs vs GPUs
Task parallelism
• Distribute the tasks across processors based on
  dependency
• Coarse-grain parallelism

  [Diagram: a task dependency graph (Tasks 1–9) on the
   left; the same tasks scheduled across 3 processors
   (P1–P3) over time on the right.]
Data parallelism
• Run a single kernel over many elements
  – Each element is independently updated
  – Same operation is applied on each element
• Fine-grain parallelism
  – Many lightweight threads, easy to switch context
  – Maps well to ALU-heavy architectures: GPUs

  [Diagram: one kernel applied to data elements P1 … Pn in parallel.]
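
A minimal sketch of this idea in CUDA C (a hypothetical kernel; the deck's real SAXPY example appears later): every thread runs the same code and independently updates one element.

    __global__ void scale(int n, float a, float *x)
    {
        // one lightweight thread per element, same operation everywhere
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = a * x[i];
    }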
Task vs. Data parallelism
• Task parallel
  – Independent processes with little communication
  – Easy to use
     • “Free” on modern operating systems with SMP
• Data parallel
  – Lots of data on which the same computation is being
    executed
  – No dependencies between data elements in each
    step in the computation
  – Can saturate many ALUs
  – But often requires redesign of traditional algorithms
                                                 slide by Mike Houston
CPU vs. GPU
• CPU
  –   Really fast caches (great for data reuse)
  –   Fine branching granularity
  –   Lots of different processes/threads
  –   High performance on a single thread of execution
• GPU
  –   Lots of math units
  –   Fast access to onboard memory
  –   Run a program on each fragment/vertex
  –   High throughput on parallel tasks

• Design target for CPUs:
  • Make a single thread very fast
  • Take control away from the programmer
• GPU Computing takes a different approach:
  • Throughput matters; single threads do not
  • Give explicit control to the programmer

• CPUs are great for task parallelism
• GPUs are great for data parallelism                    slide by Mike Houston
GPUs ?
•   Designed for math-intensive parallel problems

•   More transistors dedicated to ALU than flow
    control and data cache

                                             slide by Matthew Bolitho
From CPUs to GPUs
  (how did we end up there?)
“CPU-style” Cores

  Fetch/Decode
  ALU (Execute)
  Execution Context

  plus: out-of-order control logic, a fancy branch
  predictor, a memory pre-fetcher, and a data cache
  (a big one)

   Credit: Kayvon Fatahalian (Stanford), SIGGRAPH 2009:
   Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
Slimming down

  Fetch/Decode
  ALU (Execute)
  Execution Context

  Idea #1:
  Remove components that help a single
  instruction stream run fast

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Two cores (two fragments in parallel)

  fragment 1                    fragment 2
  Fetch/Decode                  Fetch/Decode
  ALU (Execute)                 ALU (Execute)
  Execution Context             Execution Context

  [Each core runs the same fragment-shader
   instruction stream on a different fragment.]

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Four cores (four fragments in parallel)

  [Four copies of the slimmed-down core, each with its
   own Fetch/Decode, ALU (Execute), and Execution Context.]

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Sixteen cores (sixteen fragments in parallel)

  16 cores = 16 simultaneous instruction streams

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Sixteen cores (sixteen fragments in parallel)

  → 16 independent instruction streams

  Reality: the instruction streams are not actually
  very different/independent

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Saving Yet More Space

  Recall the simple processing core:
  Fetch/Decode
  ALU (Execute)
  Execution Context

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Saving Yet More Space

  Idea #2:
  Amortize cost/complexity of managing an
  instruction stream across many ALUs
  → SIMD

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Saving Yet More Space (add ALUs)

  Fetch/Decode
  ALU 1   ALU 2   ALU 3   ALU 4
  ALU 5   ALU 6   ALU 7   ALU 8
  Ctx     Ctx     Ctx     Ctx
  Ctx     Ctx     Ctx     Ctx
  Shared Ctx Data

  Idea #2:
  Amortize cost/complexity of managing an
  instruction stream across many ALUs
  → SIMD processing

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Gratuitous Amounts of Parallelism!
(128 fragments in parallel)
http://www.youtube.com/watch?v=1yH_j8-VVLo

  16 cores = 128 ALUs
           = 16 simultaneous instruction streams

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Gratuitous Amounts of Parallelism!
(128 fragments in parallel)

  Example:
  128 instruction streams in parallel
  16 independent groups of 8 synchronized streams

  16 cores = 128 ALUs
           = 16 simultaneous instruction streams

   Credit: Kayvon Fatahalian (Stanford)
                           slide by Andreas Klöckner
Remaining Problem: Slow Memory

  Problem
  Memory still has very high latency. . .
  . . . but we’ve removed most of the
  hardware that helps us deal with that.

  We’ve removed
      caches
      branch prediction
      out-of-order execution

  So what now?

  Idea #3
        Even more parallelism
    +   Some extra memory
    =   A solution!

                     slide by Andreas Klöckner
  [Diagram: the slimmed-down core again, its context
   storage now split into several smaller contexts
   (1–4), one per group of fragments.]

                     slide by Andreas Klöckner
Hiding Memory Latency
(hiding shader stalls)

  Time (clocks) →   Frag 1 … 8 | Frag 9 … 16 | Frag 17 … 24 | Frag 25 … 32
                     (group 1)    (group 2)     (group 3)      (group 4)

  [Diagram: one core storing execution contexts for
   four groups of fragments at once.]

Credit: Kayvon Fatahalian (Stanford)
Hiding Memory Latency
(hiding shader stalls)

  [Diagram: when group 1 stalls on memory, the core
   switches to group 2; when group 2 stalls, to group 3,
   and so on. By the time group 4 stalls, group 1 is
   runnable again, so the stalls are hidden behind
   useful work.]

Credit: Kayvon Fatahalian (Stanford)
GPU Architecture Summary

  Core Ideas:

    1   Many slimmed-down cores
        → lots of parallelism

    2   More ALUs, fewer control units

    3   Avoid memory stalls by interleaving
        execution of SIMD groups
        (“warps”)

   Credit: Kayvon Fatahalian (Stanford)
                       slide by Andreas Klöckner
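
A practical consequence of idea 3: launch many more threads than there are ALUs, so the scheduler always has runnable warps to swap in. A hedged launch sketch (my_kernel, d_x, and the sizes are assumptions, not from the slides):

    int n = 1 << 20;                       // e.g. one million elements (assumed)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // thousands of blocks on tens of cores: stalls get hidden by other warps
    my_kernel<<<blocks, threadsPerBlock>>>(n, d_x);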
Is it free?
•   What are the consequences?
•   Program must be more predictable:
    – Data access coherency
    – Program flow

                                        slide by Matthew Bolitho
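
Both constraints can be made concrete in CUDA C. A hedged sketch (hypothetical kernels, not from the slides; bounds checks omitted for brevity): threads in the same SIMD group should follow the same program flow and touch neighboring addresses.

    // Program flow: threads of one warp that take different branches
    // are serialized, so this data-dependent branch can cost up to 2x.
    __global__ void divergent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) x[i] = 2.0f * x[i];   // even threads here...
        else            x[i] = x[i] + 1.0f;   // ...odd threads here
    }

    // Data access coherency: adjacent threads reading adjacent
    // addresses coalesce into few wide memory transactions.
    __global__ void coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];   // thread i touches element i
    }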
Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
CUDA Overview
GPGPU...

 Trick the GPU into general-purpose computing by
 casting problem as graphics
    Turn data into images (“texture maps”)
    Turn algorithms into image synthesis (“rendering passes”)

 Promising results, but:
    Tough learning curve, particularly for non-graphics experts
    Potentially high overhead of graphics API
    Highly constrained memory layout & access model
    Need for many passes drives up bandwidth consumption
•   CUDA: Compute Unified Device Architecture
•   Created by NVIDIA

•   A way to perform computation on the GPU

•   Specification for:
    – A computer architecture
    – A language
    – An application interface (API)
                                       slide by Matthew Bolitho
CUDA Advantages over Legacy GPGPU
         Random access to memory
                   Thread can access any memory location
         Unlimited access to memory
                   Thread can read/write as many locations as needed
         User-managed cache (per block)
                   Threads can cooperatively load data into SMEM
                   Any thread can then access any SMEM location
         Low learning curve
                   Just a few extensions to C
                   No knowledge of graphics is required
         No graphics API overhead

© NVIDIA Corporation 2006
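
A minimal sketch of the “user-managed cache” point above (a hypothetical kernel, not NVIDIA's example): each thread of a block stages one element into shared memory, the block synchronizes, then any thread may read any staged value.

    __global__ void smem_demo(const float *in, float *out, int n)
    {
        __shared__ float tile[256];        // SMEM: user-managed, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = in[i];     // cooperative load, one element each
        __syncthreads();                   // wait until the whole tile is loaded

        // any thread can now read any SMEM location (assumes blockDim.x == 256
        // and n a multiple of 256 so the neighbor read stays valid)
        if (i < n)
            out[i] = tile[(threadIdx.x + 1) % blockDim.x];
    }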
CUDA Parallel Paradigm

         Scale to 100s of cores, 1000s of parallel threads
                      Transparently with one source and same binary



         Let programmers focus on parallel algorithms
                      Not mechanics of a parallel programming language



         Enable CPU+GPU Co-Processing
                      CPU & GPU are separate devices with separate memories

C with CUDA Extensions: C with a few keywords

           void saxpy_serial(int n, float a, float *x, float *y)
           {
               for (int i = 0; i < n; ++i)
                   y[i] = a*x[i] + y[i];
           }
                                                                   Standard C Code
           // Invoke serial SAXPY kernel
           saxpy_serial(n, 2.0, x, y);


           __global__ void saxpy_parallel(int n, float a, float *x, float *y)
           {
               int i = blockIdx.x*blockDim.x + threadIdx.x;
               if (i < n)  y[i] = a*x[i] + y[i];                   Parallel C Code
           }
           // Invoke parallel SAXPY kernel with 256 threads/block
           int nblocks = (n + 255) / 256;
           saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
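
The slide omits the host-side setup; a minimal sketch of what surrounds the parallel invocation (standard CUDA runtime calls; d_x/d_y are hypothetical device-pointer names):

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));   // allocate device memory
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);  cudaFree(d_y);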
Compiling C with CUDA Applications

  void other_function(int ...) {
    ...
  }
  void saxpy_serial(float ...) {          ← modify into
    for (int i = 0; i < n; ++i)             parallel
      y[i] = a*x[i] + y[i];                 CUDA code
  }
  void main() {
    float x;
    saxpy_serial(..);
    ...
  }

  C CUDA key kernels    → NVCC (Open64) → CUDA object files ─┐
  Rest of C application → CPU compiler  → CPU object files  ─┴→ Linker → CPU-GPU Executable
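
In practice the whole flow is one command (the file name is an assumption): nvcc compiles the device kernels itself and hands the rest of the C code to the host compiler before linking.

    nvcc -O2 -o saxpy saxpy.cu    # device + host compile, then link
    ./saxpy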
Compiling CUDA Code

  C/C++ CUDA Application
           │
         NVCC ───────────► CPU Code
           │
        PTX Code              (virtual)
           │
  PTX-to-Target Compiler      (physical)
           │
   G80  …  GPU   (target code)

                                          © 2008 NVIDIA Corporation.
CUDA Software Development

  CUDA Optimized Libraries           Integrated CPU + GPU
  (math.h, FFT, BLAS, …)               C Source Code
              └────────┬──────────────────┘
               NVIDIA C Compiler
              ┌────────┴──────────────────┐
  NVIDIA Assembly for Computing (PTX)   CPU Host Code
      │            │                       │
  CUDA Driver   Profiler         Standard C Compiler
      │                                    │
     GPU                                  CPU
CUDA Development Tools: cuda-gdb (CUDA-GDB)

  Integrated into gdb
  Supports CUDA C
  Seamless CPU+GPU development experience
  Enabled on all CUDA-supported 32/64-bit Linux distros
  Set breakpoints and single-step any source line
  Access and print all CUDA memory allocs, local,
  global, constant and shared vars

© NVIDIA Corporation 2009
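
A minimal, hedged session (the program and kernel names are assumptions); -g/-G build host and device code with debug info:

    nvcc -g -G -o saxpy saxpy.cu
    cuda-gdb ./saxpy
    (cuda-gdb) break saxpy_parallel    # breakpoint on a kernel
    (cuda-gdb) run
    (cuda-gdb) print i                 # inspect a variable in the focused thread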
Parallel Source Debugging

  [Screenshot: CUDA-GDB running inside emacs]

© NVIDIA Corporation 2009
Parallel Source Debugging

  [Screenshot: CUDA-GDB running inside DDD]

© NVIDIA Corporation 2009
CUDA Development Tools: cuda-memcheck (CUDA-MemCheck)

  Coming with the CUDA 3.0 release
  Tracks out-of-bounds and misaligned accesses
  Supports CUDA C
  Integrated into the CUDA-GDB debugger
  Available as a standalone tool on all OS platforms

© NVIDIA Corporation 2009
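
Standalone usage is a single command (the binary name is an assumption):

    cuda-memcheck ./saxpy    # reports out-of-bounds and misaligned accesses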
Parallel Source Memory Checker

  [Screenshot: CUDA-MemCheck output]

© NVIDIA Corporation 2009
CUDA Development Tools: (Visual) Profiler
CUDA Visual Profiler
Outline
• Thinking Parallel (review)
• Why GPUs ?
• CUDA Overview
• Programming Model
• Threading/Execution Hierarchy
• Memory/Communication Hierarchy
• CUDA Programming
Programming Model
GPU Architecture




CUDA Programming Model
Connection: Hardware ↔ Programming Model

  [Diagram: a grid of GPU cores; each core has its own
   Fetch/Decode unit, 32 kiB of private context
   (“registers”), and 16 kiB of shared context.]

                     slide by Andreas Klöckner
Connection: Hardware ↔ Programming Model

  Who cares how many cores?

  Idea:
    Program as if there were “infinitely” many cores
    Program as if there were “infinitely” many ALUs per core

                     slide by Andreas Klöckner
Connection: Hardware ↔ Programming Model

  Who cares how many cores?

  Idea:
    Program as if there were “infinitely” many cores
    Program as if there were “infinitely” many ALUs per core

  Consider: Which is easy to do automatically?
    Parallel program → sequential hardware
  or
    Sequential program → parallel hardware?

                     slide by Andreas Klöckner
Connection: Hardware ↔ Programming Model

  [Diagram: the software representation, a 2D index
   space spanning Axis 0 and Axis 1, drawn next to the
   hardware’s grid of cores.]

                     slide by Andreas Klöckner
Connection: Hardware ↔ Programming Model

  Software representation ↔ Hardware:

    Grid: the whole 2D index space (Axis 0 × Axis 1);
      a kernel is a function on a grid
    (Work) Group, or “Block”: one tile of the grid
    (Work) Item, or “Thread”: one element of a group

                     slide by Andreas Klöckner
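
In CUDA C the two axes of this picture map directly onto built-in index variables. A hedged sketch (hypothetical kernel and sizes) of how one work item finds its coordinates in the grid:

    __global__ void on_grid(float *data, int width, int height)
    {
        // item coordinates = block index * block size + thread index
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // Axis 0
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // Axis 1
        if (x < width && y < height)
            data[y * width + x] *= 2.0f;                 // one element per item
    }

    // launch: a 2D grid of 2D blocks (sizes are assumptions)
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // on_grid<<<grid, block>>>(d_data, width, height);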
Connection: Hardware ↔ Programming Model

[Diagram: the software side is a Grid spanning Axis 0 and Axis 1; a kernel is a function on that grid. The hardware side is an array of cores, each with a fetch/decode unit, 32 kiB private context ("registers"), and 16 kiB shared context]

Grid = (Kernel: Function on Grid)

Software representation ↔ Hardware
slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
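To make "kernel = function on grid" concrete, here is a minimal CUDA C sketch (illustrative only, not from the slides; all names are invented): the kernel body is written for a single grid point, and the launch configuration spans the whole 2D grid.

// Runs once per grid point (Axis 0 x Axis 1).
__global__ void scale(float *data, float alpha, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // index along Axis 0
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // index along Axis 1
    if (x < nx && y < ny)                           // grid may overshoot the data
        data[y * nx + x] *= alpha;
}

// Host side: one launch = the function applied over the entire grid.
dim3 block(16, 16);
dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
scale<<<grid, block>>>(d_data, 2.0f, nx, ny);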
Connection: Hardware ↔ Programming Model

[Diagram: the grid is tiled into (work) groups, or "blocks"; each group is scheduled onto one core and uses that core's 32 kiB private and 16 kiB shared contexts]

(Work) Group, or "Block"

Software representation ↔ Hardware
slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
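In CUDA C, the two context types in the diagram correspond to ordinary declarations. A minimal sketch, assuming a one-dimensional block of at most 256 threads (names are illustrative):

__global__ void contexts(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];              // local variable: private context ("registers")
    __shared__ float tile[256];   // one copy per block: shared context
    tile[threadIdx.x] = x;
    __syncthreads();              // cooperation is only possible within the block
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // block-local reversal
}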
Connection: Hardware ↔ Programming Model

[Diagram: the grid of (work) groups mapped onto the array of hardware cores]

Really: Group ("Block") provides a pool of parallelism to draw from.
X, Y, Z order within a group matters. (Not among groups, though.)

Software representation ↔ Hardware
slide by Andreas Klöckner (GPU-Python with PyOpenCL and PyCUDA)
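One practical consequence of the last point, sketched in CUDA C (illustrative; assumes blockDim.x is 256 and a power of two; buffer names are invented): threads within a block can be ordered with __syncthreads(), but blocks cannot be ordered within a launch, so per-block results must be combined by a second launch or by atomics.

__global__ void block_sum(const float *in, float *partial)
{
    __shared__ float s[256];                    // the block's shared pool
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = in[i];
    __syncthreads();                            // ordering within the block: fine

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                        // barrier at every tree level
    }

    if (threadIdx.x == 0)                       // no ordering among blocks:
        partial[blockIdx.x] = s[0];             // each writes its own partial sum
}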
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
Thinking in parallel ab tuladev
Thinking in parallel ab tuladevThinking in parallel ab tuladev
Thinking in parallel ab tuladev
 
Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)Deep Dive on Deep Learning (June 2018)
Deep Dive on Deep Learning (June 2018)
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
Sean Kandel - Data profiling: Assessing the overall content and quality of a ...
 
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group2013.09.10 Giraph at London Hadoop Users Group
2013.09.10 Giraph at London Hadoop Users Group
 
OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon Valley
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015
 

More from npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 

More from npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 

Recently uploaded

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 

Recently uploaded (20)

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 

[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics

  • 19. Some Perspective. The “problem tree” for scientific problem solving: Technical Problem to be Analyzed; Consultation with experts; Scientific Model "A" / Model "B"; Theoretical analysis; Discretization "A" / Discretization "B"; Experiments; Iterative equation solver / Direct elimination equation solver; Parallel implementation / Sequential implementation. Figure 11: The “problem tree” for scientific problem solving. There are many options to try to achieve the same goal. from Scott et al. “Scientific Parallel Computing” (2005)
  • 20. Computational Thinking • translate/formulate domain problems into computational models that can be solved efficiently by available computing resources • requires a deep understanding of their relationships adapted from Hwu & Kirk (PASI 2011)
  • 21. Getting ready... Programming Models Architecture Algorithms Languages Patterns Compilers Parallel Thinking Parallel Computing APPLICATIONS adapted from Scott et al. “Scientific Parallel Computing” (2005)
  • 22. You can do it! • thinking parallel is not as hard as you may think • many techniques have been thoroughly explained... • ... and are now “accessible” to non-experts !
  • 23. Outline • Thinking Parallel (review) • Why GPUs ? • CUDA Overview • Programming Model • Threading/Execution Hierarchy • Memory/Communication Hierarchy • CUDA Programming
  • 25. Motivation: “The most economic number of components in an IC will double every year.” Historically, CPUs get faster; hardware is reaching frequency limitations; now, CPUs get wider. GPUs. slide by Matthew Bolitho
  • 27. Motivation. GPU Fact: nobody cares about theoretical peak. Challenge: harness GPU power for real application performance. [chart: GFLOPS over time, GPU curve pulling away from CPU]
  • 28. Motivation: Rather than expecting CPUs to get twice as fast, expect to have twice as many! Parallel processing for the masses. Unfortunately, parallel programming is hard: algorithms and data structures must be fundamentally redesigned. slide by Matthew Bolitho
  • 29. Task vs Data Parallelism CPUs vs GPUs
  • 30. Task parallelism • Distribute the tasks across processors based on dependency • Coarse-grain parallelism [diagram: task dependency graph, and the assignment of Tasks 1–9 across 3 processors over time]
  • 31. Data parallelism • Run a single kernel over many elements – Each element is independently updated – Same operation is applied on each element • Fine-grain parallelism – Many lightweight threads, easy to switch context – Maps well to ALU-heavy architecture: GPU [diagram: one kernel applied to data elements across processors P1 … Pn]
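To make the contrast concrete, here is a minimal data-parallel sketch in CUDA (a hypothetical kernel, not from the slides; the per-element scaling stands in for any independent update): every thread runs the same kernel on its own element.

    // Hypothetical data-parallel kernel: one lightweight thread per element,
    // the same operation applied everywhere, no inter-element dependencies.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                // guard: the grid may be larger than n
            data[i] *= factor;    // each element is updated independently
    }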
  • 32. Task vs. Data parallelism • Task parallel – Independent processes with little communication – Easy to use • “Free” on modern operating systems with SMP • Data parallel – Lots of data on which the same computation is being executed – No dependencies between data elements in each step in the computation – Can saturate many ALUs – But often requires redesign of traditional algorithms slide by Mike Houston
  • 33. CPU vs. GPU • CPU – Really fast caches (great for data reuse) – Fine branching granularity – Lots of different processes/threads – High performance on a single thread of execution • GPU – Lots of math units – Fast access to onboard memory – Run a program on each fragment/vertex – High throughput on parallel tasks • Design target for CPUs: make a single thread very fast; take control away from the programmer • GPU Computing takes a different approach: throughput matters, single threads do not; give explicit control to the programmer • CPUs are great for task parallelism • GPUs are great for data parallelism slide by Mike Houston
  • 34. GPUs? • Designed for math-intensive parallel problems • More transistors dedicated to ALU than to flow control and data cache slide by Matthew Bolitho
  • 35. From CPUs to GPUs (how did we end up there?)
  • 36. “CPU-style” cores: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one). SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford)
  • 37. Slimming down. Idea #1: remove the components that help a single instruction stream run fast; keep only Fetch/Decode, ALU (Execute), and the Execution Context. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 38. More space: double the number of cores. Two cores run two fragments in parallel, each executing the same shader instruction stream on its own fragment. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 39. . . . and again: four cores, four fragments in parallel. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 40. . . . and again: sixteen cores, sixteen fragments in parallel. 16 cores = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 41. Sixteen cores → 16 independent instruction streams. Reality: the instruction streams are not actually very different/independent. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 42. Recall: the simple processing core, Fetch/Decode, ALU (Execute), Execution Context. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 43. Saving Yet More Space. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs → SIMD. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 44. Add ALUs: one Fetch/Decode unit feeds ALU 1–8, each with its own context (Ctx) plus shared Ctx data: SIMD processing. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
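One practical consequence of SIMD, sketched below in CUDA (a hypothetical kernel for illustration only): threads in the same SIMD group share one instruction stream, so when they take different branches the two paths execute one after the other, with inactive lanes masked off.

    // Hypothetical sketch of SIMD branch divergence: even and odd threads of
    // the same group take different paths, which the hardware serializes.
    __global__ void branchy(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            x[i] = 2.0f * x[i];   // first pass: even lanes active
        else
            x[i] = x[i] + 1.0f;   // second pass: odd lanes active
    }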
  • 46. Gratuitous Amounts of Parallelism! Fragments in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams. http://www.youtube.com/watch?v=1yH_j8-VVLo Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 47. Example: 128 instruction streams in parallel, as 16 independent groups of 8 synchronized streams. 16 cores = 128 ALUs = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 48. Remaining Problem: Slow Memory. Memory still has very high latency . . . but we’ve removed most of the hardware that helps us deal with that: caches, branch prediction, out-of-order execution. So what now? Idea #3: even more parallelism + some extra memory = a solution! [shown on the SIMD core: the single execution context is replaced by four context groups, 1–4] slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 51. Hiding Memory Latency (hiding shader stalls): time (clocks) versus four fragment groups on one core (Frag 1 … 8, 9 … 16, 17 … 24, 25 … 32). SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford)
  • 52. When a group stalls on memory, the core switches to a runnable group; with enough groups resident, each group’s stalls are covered by the others’ execution. Credit: Kayvon Fatahalian (Stanford)
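The programming-side consequence, sketched here as a hypothetical memory-bound CUDA kernel: per-thread work is tiny, so you launch far more threads than there are ALUs and let the hardware swap stalled groups for runnable ones.

    // Hypothetical memory-bound kernel: each global load stalls for hundreds
    // of cycles, which the hardware hides by interleaving resident groups.
    __global__ void copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // one load + one store per thread
    }
    // Oversubscribe on purpose, e.g.:
    //   copy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);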
  • 55. GPU Architecture Summary. Core Ideas: 1. Many slimmed-down cores → lots of parallelism. 2. More ALUs, fewer control units. 3. Avoid memory stalls by interleaving execution of SIMD groups (“warps”). Credit: Kayvon Fatahalian (Stanford) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 56. Is it free? What are the consequences? Programs must be more predictable: data access coherency, program flow. slide by Matthew Bolitho
  • 57. Outline • Thinking Parallel (review) • Why GPUs ? • CUDA Overview • Programming Model • Threading/Execution Hierarchy • Memory/Communication Hierarchy • CUDA Programming
  • 59. Problem: GPGPU. Trick the GPU into general-purpose computing by casting problems as graphics: turn data into images (“texture maps”), turn algorithms into image synthesis (“rendering passes”). Promising results, but: tough learning curve, particularly for non-graphics experts; potentially high overhead of the graphics API; highly constrained memory layout & access model; need for many passes drives up bandwidth consumption
  • 60. ! !"#$)'0,I=%$"'E+.K."-':"H.#"'F&#?.$"#$%&" ! 0&"1$"-'6B'LM*:*F ! F'A1B'$,'="&K,&I'#,I=%$1$.,+',+'$?"'>8E ! 7="#.K.#1$.,+'K,&) ! F'#,I=%$"&'1&#?.$"#$%&" ! F'31+N%1N" ! F+'1==3.#1$.,+'.+$"&K1#"'OF8*P slide by Matthew Bolitho
  • 61. CUDA Advantages over Legacy GPGPU Random access to memory Thread can access any memory location Unlimited access to memory Thread can read/write as many locations as needed User-managed cache (per block) Threads can cooperatively load data into SMEM Any thread can then access any SMEM location Low learning curve Just a few extensions to C No knowledge of graphics is required No graphics API overhead © NVIDIA Corporation 2006 9
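To make the “user-managed cache” point concrete, a minimal sketch (a hypothetical kernel; it assumes blockDim.x == 256 and an input length that is a multiple of 256): the block cooperatively stages data into SMEM, synchronizes, and then any thread reads a location loaded by another thread.

    // Hypothetical use of shared memory as a user-managed, per-block cache.
    __global__ void reverse_block(const float *in, float *out)
    {
        __shared__ float tile[256];                   // per-block SMEM
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];                    // cooperative load
        __syncthreads();                              // whole tile is now staged
        out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read another thread's slot
    }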
  • 62. CUDA Parallel Paradigm Scale to 100s of cores, 1000s of parallel threads Transparently with one source and same binary Let programmers focus on parallel algorithms Not mechanics of a parallel programming language Enable CPU+GPU Co-Processing CPU & GPU are separate devices with separate memories NVIDIA Confidential
  • 63. C with CUDA Extensions: C with a few keywords

    Standard C code:
        void saxpy_serial(int n, float a, float *x, float *y)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }
        // Invoke serial SAXPY kernel
        saxpy_serial(n, 2.0, x, y);

    Parallel CUDA code:
        __global__ void saxpy_parallel(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n) y[i] = a*x[i] + y[i];
        }
        // Invoke parallel SAXPY kernel with 256 threads/block
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

    NVIDIA Confidential
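For context (not on the slide): a minimal host-side harness around the slide's saxpy_parallel kernel, using standard CUDA runtime calls; the problem size here is made up, and the kernel is assumed to be defined in the same file.

    // Hypothetical host code driving the slide's saxpy_parallel kernel.
    #include <cuda_runtime.h>

    int main(void)
    {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *x, *y;                            // device pointers
        cudaMalloc((void **)&x, bytes);          // allocate on the GPU
        cudaMalloc((void **)&y, bytes);
        // ... initialize x and y, e.g. cudaMemcpy(..., cudaMemcpyHostToDevice) ...
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();                 // wait for the kernel to finish
        cudaFree(x);
        cudaFree(y);
        return 0;
    }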
  • 64. Compiling C with CUDA Applications: NVCC (built on Open64) splits the source; the key kernels (C with CUDA) are compiled into CUDA object files, the rest of the C application goes through the CPU compiler into CPU object files, and the linker combines both into a single CPU-GPU executable. NVIDIA Confidential
  • 65. Compiling CUDA Code: a C/C++ CUDA application passes through NVCC, which emits CPU code plus PTX code (a virtual ISA); the PTX-to-target compiler then produces physical code for the target GPU (G80, …). © 2008 NVIDIA Corporation.
  • 66. CUDA Software Development: CUDA optimized libraries (math.h, FFT, BLAS, …) and integrated CPU + GPU C source code feed the NVIDIA C compiler, which produces NVIDIA assembly for computing (PTX) for the GPU (via the CUDA profiler and driver) and CPU host code for the standard C compiler.
  • 67. CUDA Development Tools: cuda-gdb CUDA-gdb Integrated into gdb Supports CUDA C Seamless CPU+GPU development experience Enabled on all CUDA supported 32/64bit Linux distros Set breakpoint and single step any source line Access and print all CUDA memory allocs, local, global, constant and shared vars. © NVIDIA Corporation 2009
  • 68. Parallel Source Debugging CUDA-gdb in emacs CUDA-GDB in emacs © NVIDIA Corporation 2009
  • 69. Parallel Source Debugging CUDA-gdb in DDD © NVIDIA Corporation 2009
  • 70. CUDA Development Tools: cuda-memcheck CUDA-MemCheck Coming with CUDA 3.0 Release Track out of bounds and misaligned accesses Supports CUDA C Integrated into the CUDA-GDB debugger Available as standalone tool on all OS platforms. © NVIDIA Corporation 2009
  • 71. Parallel Source Memory Checker: CUDA-MemCheck © NVIDIA Corporation 2009
  • 72. CUDA Development Tools: (Visual) Profiler CUDA Visual Profiler
  • 73. Outline • Thinking Parallel (review) • Why GPUs ? • CUDA Overview • Programming Model • Threading/Execution Hierarchy • Memory/Communication Hierarchy • CUDA Programming
  • 76. Connection: Hardware ↔ Programming Model. [diagram: an array of cores, each with Fetch/Decode, 32 kiB private context (“registers”), and 16 kiB shared context] slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 77. Who cares how many cores? Idea: program as if there were “infinitely” many cores, and as if there were “infinitely” many ALUs per core. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 78. Consider: which is easy to do automatically? Parallel program → sequential hardware, or sequential program → parallel hardware? slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 79. The software representation, a grid indexed along Axis 0 and Axis 1, maps onto the hardware, the array of cores. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  • 80. Terminology: the grid is the domain of a kernel (a function on the grid); a (work) group or “block” maps onto a core; a (work) item or “thread” maps onto an ALU lane. Software representation ↔ Hardware. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
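In CUDA, each (work) item recovers its coordinates in this software grid from built-in variables; a minimal hypothetical 2D sketch:

    // Hypothetical kernel: compute this thread's position on Axis 0 / Axis 1
    // of the grid and touch the matching array element.
    __global__ void fill(float *a, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // Axis 0
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // Axis 1
        if (x < width && y < height)
            a[y * width + x] = 1.0f;   // one (work) item per element
    }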
  • 92. Connection: Hardware ↔ Programming Model. Really: a block (group) provides a pool of parallelism to draw from. X, Y, Z order within a group (block) matters. (Not among groups, though.) slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
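A short hypothetical sketch of how that pool is requested at launch time, reusing the fill kernel above (the block and grid shapes are made up; dim3 and the <<<...>>> launch syntax are standard CUDA):

    // Hypothetical host-side launch: a 2D grid of 2D blocks covering the data.
    void launch_fill(float *d_a, int width, int height)
    {
        dim3 block(16, 16);                        // 256 threads; x varies fastest
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        fill<<<grid, block>>>(d_a, width, height);
        // Within a block, threads are ordered x, then y, then z (this fixes
        // warp grouping); blocks of the grid may run in any order.
    }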