Introduction to Heterogeneous Computing for High Performance Computing

Presented by Supasit Kajkamhaeng

Definition [IDC, 2011] [1]

    High-performance computing (HPC) refers to all technical computing
    servers and clusters used to solve problems that are computationally
    intensive or data intensive efficiently, reliably and quickly.



Image sources:
    http://www.elseptimoarte.net/peliculas/kung-fu-panda-2-2285.html
    http://smu.edu/catco/research/drug-design-a35.html
    http://www.prweb.com/releases/cfd/simulation/prweb1891174.htm
    http://www.drroyspencer.com/2009/07/how-do-climate-models-work/

HPC Applications

HPC Infrastructure: Processors, Memory, Storage, Networks

    A form of computation in which many calculations are carried out
    simultaneously, operating on the principle that large problems can often
    be divided into smaller ones, which are then solved concurrently
    ("in parallel"). [Almasi and Gottlieb, 1989] [2]

    Sequential Computing: Problem -> Instructions -> one CPU
    Parallel Computing:   Problem -> Tasks -> Instructions -> one CPU per task

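The decomposition in the definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sum_of_squares`, `split`, `sequential`, `parallel`) and the sum-of-squares workload are hypothetical and not from the slides. A thread pool is used to keep the sketch portable; in standard CPython builds a process pool would typically be needed to see a real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """One task: the work done on a single piece of the problem."""
    return sum(x * x for x in chunk)


def split(data, n_tasks):
    """Divide the problem into at most n_tasks roughly equal chunks."""
    size = max(1, (len(data) + n_tasks - 1) // n_tasks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sequential(data):
    """Sequential computing: one worker processes the whole problem."""
    return sum_of_squares(data)


def parallel(data, n_tasks=4):
    """Parallel computing: split into tasks, run them concurrently, combine."""
    chunks = split(data, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1000))
    # Both approaches must agree; only the way the work is organized differs.
    print(sequential(data), parallel(data))
```

The combine step (the final `sum`) is the part that stays sequential, which is one reason the speedup from adding workers is limited in practice.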
Classes of parallel computers
    Multicore processor
        A processor that includes multiple execution units ("cores").
    Cluster [Webopedia computer dictionary, 2007] [3]
        A group of linked computers working together so closely that in
        many respects they form a single computer.
        Used to improve performance and/or availability over that provided
        by a single computer.
    etc.

Advantages
    Reduce computing time
        More processors
    Make large-scale jobs feasible
        More memory

Problems (Challenges)
    Complex programming models
        Difficult development
    Complex infrastructures
        Complicated architecture and deployment

Why do HPC applications need more and more computing power?
    Race against time
        Solve problems in the shortest time possible
    Precision improvement
        In the same amount of time, results can be computed with higher
        precision

The current limit of computing power can be gauged from the performance of
the most powerful computer systems in use today
    Top500 Supercomputing Sites (www.top500.org)

What is the Top500? [www.top500.org] [4]
    The Top500 lists the 500 fastest computer systems in use today.
    The list was started in 1993 and has been updated every 6 months since
    then.
    Computers are ranked by their best achieved Linpack benchmark
    performance.

#1 (Nov 2011): 10.51 PFLOPS
[Figure from www.top500.org]

One of the challenges is to improve the performance (measured in flops) of
HPC systems
    “The worldwide high-performance computing (HPC) market is already more
    than three years into the petascale era (June 2008-present) and is
    looking to make the thousandfold leap into the exascale era before the
    end of this decade.” [IDC, Nov 2011] [1]

Factors to improve as performance grows [IDC, Nov 2011] [1]
    System costs (flops/dollar)
    Space and compute density requirements (flops/square foot)
    Energy costs for computation (flops/watt)

Goal: more flops/dollar, flops/square foot, flops/watt

In many of the most powerful HPC systems, performance does not come from
CPUs alone

Present: Tianhe-1A [http://www.nscc-tj.gov.cn] [5]
    #2 on the Top500 list (Nov 2011)
    2.566 PFLOPS (Rmax)
    14,336 Xeon X5670 CPUs
    7,168 Tesla M2050 GPUs
    2,048 NUDT FT1000 heterogeneous processors

Future: Jaguar -> Titan (2013) [IDC, Nov 2011] [1]
    Jaguar
        #3 on the Top500 list (Nov 2011)
        1.759 PFLOPS (Rmax)
        36K AMD Opteron CPUs
    Titan
        20-30 PFLOPS (Rpeak)
        18,000 AMD Opteron CPUs
        18,000 Tesla GPUs

Definition [IDC, 2011] [1]

    “Heterogeneous computing refers to the use of multiple types of
    processors, typically CPUs in combination with GPUs or other
    accelerators, within the same HPC system.”

    Application code runs on both:
        CPU
        Accelerator (NVIDIA GPU, AMD GPU, Intel MIC)

Main characteristics of most HPC application codes
    Lots of floating-point calculations (operations)
        “A frequently used sequence of operations in computer graphics,
        linear algebra, and scientific applications is to multiply two
        numbers, adding the product to a third number, for example,
        D = A x B + C (multiply-add (MAD) instruction)” [NVIDIA, 2009] [6]
    Lots of parallelism
        Large data sets can be processed in parallel with a massively
        multithreaded SIMD (Single Instruction, Multiple Data) model
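As a rough illustration of the two points above, the following Python sketch applies the multiply-add D = A x B + C element by element to three arrays. The names `mad`, `mad_elementwise`, and `flop_count` are hypothetical and chosen only for this example. The code runs sequentially; the point is that every element is computed by the same instruction on different data and independently of the others, which is the pattern SIMD hardware can execute in parallel. Counting one multiply-add as two floating-point operations is a common convention and is assumed here.

```python
def mad(a, b, c):
    """Multiply-add on scalars: d = a * b + c."""
    return a * b + c


def mad_elementwise(A, B, C):
    """Apply the same multiply-add to every element of three equal-length arrays.

    Each output element depends only on the inputs at the same index, so the
    iterations are independent (data parallel).
    """
    if not (len(A) == len(B) == len(C)):
        raise ValueError("A, B and C must have the same length")
    return [mad(a, b, c) for a, b, c in zip(A, B, C)]


def flop_count(n):
    """Floating-point operations for n multiply-adds (1 multiply + 1 add each)."""
    return 2 * n


if __name__ == "__main__":
    A = [1.0, 2.0, 3.0]
    B = [4.0, 5.0, 6.0]
    C = [0.5, 0.5, 0.5]
    print(mad_elementwise(A, B, C))  # [4.5, 10.5, 18.5]
    print(flop_count(len(A)))        # 6
```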




CPUs are fundamentally designed for single-thread performance rather than
energy efficiency [Steve Scott, November 2011] [7]
    Fast clock rates with deep pipelines
    Data and instruction caches optimized for latency
    Superscalar issue with out-of-order execution
    Dynamic conflict detection
    Lots of prediction and speculative execution
    Lots of instruction overhead per operation

Less than 2% of chip power today goes to flops

[Two figures from Peter N. Glaskowsky, 2009] [8]

   Definition [S. Patel and W. Hwu, 2008] [9]




     “An accelerator is a separate architectural substructure
      (on the same chip, or on a different die) that is architected
      using a different set of objectives than the base processor,
      where these objectives are derived from the needs of a
      special class of applications.”
     “Through this manner of design, the accelerator is tuned to
      provide higher performance at lower cost, or at lower
      power, or with less development effort than with the
      general-purpose base hardware.”



Example: Intel x87 floating-point (math) coprocessors [10, 11, 12, 13]
    Used during the 1980s and the early 1990s
    A separate floating-point coprocessor (Intel 8087, 80187, 80287, 80387,
    80487) for the 80x86 line of microprocessors
    “Later Intel processors (introduced after the 486DX) did not use a
    separate floating point coprocessor (integrated the floating point
    hardware on the main processor chip)”
Image: http://en.wikipedia.org/wiki/File:80386with387.JPG

Example: Graphics Processing Unit (GPU) [14]




    “A GPU is a specialized circuit designed to rapidly manipulate
     and alter memory in such a way so as to accelerate the building
     of images in a frame buffer intended for output to a display.”
    “A GPU can be present on a video card, or it can be on the
     motherboard or on the CPU die.”
    “Modern GPUs are very efficient at manipulating computer
     graphics, and their highly parallel structure makes them more
     effective than general-purpose CPUs for algorithms where
     processing of large blocks of data is done in parallel.”

GPU Computing

   Definition [NVIDIA, 2011] [15]




       “GPU computing or GPGPU is the use of a GPU (graphics
        processing unit) to do general purpose scientific and engineering
        computing.”
       “The model for GPU computing is to use a CPU and GPU
        together in a heterogeneous co-processing computing model.”
       “The sequential part of the application runs on the CPU and the
        computationally-intensive part is accelerated by the GPU.”




Image: http://www.nvidia.com/docs/IO/65513/gpu-computing-feature.jpg

Stages in the typical pipeline of a modern graphics processing unit (GPU)
(illustration courtesy of NVIDIA Corporation):
    Some stages are more computationally demanding (especially the pixel
    shader stage)
    Lots of data parallelism (suited to parallel hardware)

A fixed-function graphics pipeline

Programmable parts (vertex and pixel) of the graphics pipeline
    (a programmable engine surrounded by supporting fixed-function units,
    programmed through graphics APIs and shading languages such as OpenGL,
    DirectX, and Cg)

GPU Computing: a unified graphics and compute architecture
    (all programmable units in the graphics pipeline share a single
    programmable hardware unit, with added support for high-level languages
    like C, C++, and Fortran)

[Owens et al., 2008] [16]

Compute Unified Device Architecture (CUDA) [NVIDIA, 2011] [17]

    “CUDA is NVIDIA’s parallel computing architecture. It enables dramatic
    increases in computing performance by harnessing the power of the GPU.”

Fermi
    Fermi’s 16 SMs (streaming multiprocessors) are positioned around a
    common L2 cache. Each SM is a vertical rectangular strip that contains
    an orange portion (scheduler and dispatch), a green portion (execution
    units), and light blue portions (register file and L1 cache).
    [NVIDIA, 2009] [6]

Fermi SM [Figure: NVIDIA, 2009] [6]
[Figure: Wikipedia, 2011] [18]
[Figure: NVIDIA]

Tianhe-1A

                             #2 rank of Top500 lists (November 2011)
                             2.566 PFLOPS (Rmax)
                               14,336 Xeon X5670 processors
                               7,168 Tesla M2050 GPUs
                               2,048 NUDT FT1000 heterogeneous processors


Processor             Double-Precision FLOPS [Peak]    Power Consumption
Intel Xeon X5670      70.392 GFLOPS                    95 W TDP
NVIDIA Tesla M2050    515 GFLOPS                       225 W TDP
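The flops/watt comparison behind this table can be worked out directly. The sketch below uses only the peak and TDP figures from the table and the Tianhe-1A processor counts from the slide; the dictionary layout and function names are hypothetical. Two caveats are assumed: TDP is a design rating rather than measured power draw, and the 2,048 FT1000 processors are left out because the slide gives no peak figure for them.

```python
# Peak double-precision figures and TDP taken from the table above.
PROCESSORS = {
    "Intel Xeon X5670": {"peak_gflops": 70.392, "tdp_watts": 95.0},
    "NVIDIA Tesla M2050": {"peak_gflops": 515.0, "tdp_watts": 225.0},
}

# Tianhe-1A processor counts from the slide (FT1000 omitted: no peak given).
TIANHE_1A = {"Intel Xeon X5670": 14336, "NVIDIA Tesla M2050": 7168}


def gflops_per_watt(name):
    """Peak double-precision GFLOPS per watt of TDP for one processor."""
    p = PROCESSORS[name]
    return p["peak_gflops"] / p["tdp_watts"]


def aggregate_peak_pflops(counts):
    """Sum of per-processor peaks over a system, in PFLOPS (1 PFLOPS = 1e6 GFLOPS)."""
    total_gflops = sum(PROCESSORS[name]["peak_gflops"] * n for name, n in counts.items())
    return total_gflops / 1e6


def gpu_share(counts, gpu_name="NVIDIA Tesla M2050"):
    """Fraction of the aggregate peak contributed by the GPUs."""
    gpu_pflops = PROCESSORS[gpu_name]["peak_gflops"] * counts[gpu_name] / 1e6
    return gpu_pflops / aggregate_peak_pflops(counts)


if __name__ == "__main__":
    for name in PROCESSORS:
        print(f"{name}: {gflops_per_watt(name):.2f} GFLOPS/W")
    print(f"Aggregate peak: {aggregate_peak_pflops(TIANHE_1A):.2f} PFLOPS")
    print(f"GPU share of peak: {gpu_share(TIANHE_1A):.1%}")
```

With these inputs the Xeon comes out at about 0.74 GFLOPS/W and the Tesla at about 2.29 GFLOPS/W, roughly a threefold difference. The aggregate peak is about 4.70 PFLOPS, of which the GPUs contribute close to 79%. This is a theoretical peak, so it is higher than the measured Linpack result of 2.566 PFLOPS (Rmax) quoted on the slide.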

HPC applications need more and more computing power to solve problems that
are compute and data intensive.

Heterogeneous computing (such as CPU+GPU) helps deliver more cost-effective
and energy-efficient performance (flops/dollar, flops/square foot,
flops/watt) for applications that need it than using CPUs alone.

1.   International Data Corporation (IDC). November 2011. IDC Executive Brief -
     Heterogeneous Computing: A New Paradigm for the Exascale Era.
2.   G. S. Almasi and A. Gottlieb. 1989. Highly Parallel Computing.
     Benjamin-Cummings Publishers, Redwood City, CA.
3.   What is clustering? Webopedia computer dictionary. Retrieved November 7,
     2007.
4.   Top500 Supercomputing Sites. www.top500.org. Retrieved December 2011.
5.   NSCC-TJ National Supercomputing Center in Tianjin. www.nscc-tj.gov.cn.
     Retrieved December 2011.
6.   NVIDIA. 2009. NVIDIA’s Next Generation CUDA™ Compute Architecture:
     Fermi V1.1.
7.   Steve Scott. November 15, 2011. Why the Future of HPC will be Green.
     SC’11.
8.   Peter N. Glaskowsky. September 2009. NVIDIA’s Fermi: The First Complete
     GPU Computing Architecture.
9.   S. Patel and W. Hwu. 2008. Guest Editors’ Introduction: Accelerator
     Architectures. IEEE Micro 28(4): 4-12.
10.  X87. en.wikipedia.org/wiki/X87. Retrieved December 2011.
11.  Coprocessor. en.wikipedia.org/wiki/Coprocessor. Retrieved December 2011.
12.  Intel 8087. en.wikipedia.org/wiki/Intel_8087. Retrieved December 2011.
13.  x87 info you need to know!
     http://coprocessor.cpu-info.com/index2.php?mainid=Copro&tabid=1&page=1.
     Retrieved December 2011.
14.  Graphics Processing Unit. en.wikipedia.org/wiki/Graphics_processing_unit.
     Retrieved December 2011.
15.  NVIDIA. 2011. What is GPU Computing?
     www.nvidia.com/object/GPU_Computing.html. Retrieved December 2011.
16.  J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips.
     2008. GPU Computing. Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
17.  NVIDIA. 2011. What is CUDA. developer.nvidia.com/what-cuda. Retrieved
     December 2011.
18.  CUDA. en.wikipedia.org/wiki/CUDA. Retrieved December 2011.

Introduction to heterogeneous_computing_for_hpc

  • 1. Introduction to Heterogeneous Computing for High Performance Computing Presented by Supasit Kajkamhaeng 1
  • 2. Definition [IDC, 2011] 1 “The term high-performance computing to refer to all technical computing servers and clusters used to solve problems that are computationally intensive or data intensive efficiently, reliably and quickly.” http://www.elseptimoarte.net/peliculas/kung-fu-panda-2-2285.html http://smu.edu/catco/research/drug-design-a35.html http://www.prweb.com/releases/cfd/simulation/prweb1891174.htm http://www.drroyspencer.com/2009/07/ how-do-climate-models-work/ 2
  • 3. HPC Applications Processors Memories Storages Networks HPC Infrastructure 3
  • 4. A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then 2solved concurrently ("in parallel"). [Almasi and Gottlieb, 1989] Problem Task Task Task Task Problem Instructions … CPU … … … … Instructions CPU CPU CPU CPU Sequential Computing Parallel Computing 4
  • 5. Classes of parallel computers  Multicore Processor  A processor that includes multiple execution units ("cores").  Cluster [Webopedia computer dictionary, 2007] 3  A group of linked computers, working together closely so that in many respects they from a single computer  To improve performance and/or availability over that provided by a single computer  etc. 5
  • 6.  Advantages  Reduce computing time  More Processors  Make large scale job doable  More Memories  Problems  Complex programming models  Difficult development Challenges  Complex infrastructures  Complicated architecture and deployment 6
  • 7.  Whydo HPC Applications need computing power more and more?  Race against time  Solve problems in the shortest time possible  Precision improvement  In the amount of time, results can be increased a precision  At this time the computing power limitation may be considered from performance of most powerful computer systems being used today  Top500 Supercomputing Sites (www.top500.org) 7
  • 8. 4  What is the Top500? [www.top500.org]  The Top500 list the 500 fastest computer system being used today  In 1993 the collection was started and has been updated every 6 months since then  The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers. 8
  • 9. #1 (Nov 2011) 10.51 PF 9
  • 11.  Oneof Challenges is to improve the performance (means “flops”) of HPC systems  “The worldwide high-performance computing (HPC) market is already more than three years into the petascale era (June 2008- present) and is looking to make the thousandfold leap into the 1 exascale era before the end of this decade.” [IDC, Nov 2011]  Concerned improvement factors of the performance development  System costs (flops/dollar)  Space and compute density requirements (flops/square foot)  Energy costs for computation (flops/watt) 1 [IDC, Nov 2011] Goal Want more flops/dollar, flops/square foot, flops/watt 11
  • 12. All performance of many powerful HPC systems aren’t only produced by CPUs Tianhe-1A  #2 rank of Top500 lists (Nov 2011)  2.566 PFLOPS (Rmax) Present  14,336 Xeon X5670 CPUs  7,168 Tesla M2050 GPUs  2,048 NUDT FT1000 heterogeneous processors 5 [http://www.nscc-tj.gov.cn] Jaguar Titan Future 2013  #3 rank of Top500 lists (Nov 2011)  20-30 PFLOPS (Rpeak)  1.759 PFLOPS (Rmax)  18,000 AMD Opteron CPUs  36K AMD Opteron CPUs  18,000 Tesla GPUs [IDC, Nov 2011] 1 12
  • 13. Definition [IDC, 2011] [1]: “Heterogeneous computing refers to the use of multiple types of processors, typically CPUs in combination with GPUs or other accelerators, within the same HPC system.” [Figure: application code running on a CPU together with an accelerator (NVIDIA GPU, AMD GPU, Intel MIC)]
  • 14. Main points of most HPC application codes. Lots of floating-point calculations (operations): “A frequently used sequence of operations in computer graphics, linear algebra, and scientific applications is to multiply two numbers, adding the product to a third number, for example, D = A x B + C (multiply-add (MAD) instruction)” [NVIDIA, 2009] [6]. Lots of parallelism: large data sets can be processed in parallel with a massively multithreaded SIMD (Single Instruction, Multiple Data) model.
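As a rough illustration of the two points on this slide (it is not taken from the presentation), the sketch below applies the multiply-add D = A x B + C to every element of three arrays. The function names `multiply_add` and `multiply_add_chunked` and the chunking scheme are assumptions made for this example. Plain Python runs the chunks one after another; on SIMD hardware many elements would be processed at once.

```python
def multiply_add(a, b, c):
    """Element-wise D[i] = A[i] * B[i] + C[i]; every i is independent."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("inputs must have the same length")
    return [x * y + z for x, y, z in zip(a, b, c)]


def multiply_add_chunked(a, b, c, chunk_size):
    """Same result as multiply_add, computed chunk by chunk.

    No element depends on any other, so each chunk could be handed to a
    separate thread or SIMD lane. Here the chunks simply run in sequence.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not (len(a) == len(b) == len(c)):
        raise ValueError("inputs must have the same length")
    d = []
    for start in range(0, len(a), chunk_size):
        end = start + chunk_size
        d.extend(multiply_add(a[start:end], b[start:end], c[start:end]))
    return d


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    c = [0.5, 0.5, 0.5, 0.5]
    print(multiply_add(a, b, c))
    print(multiply_add_chunked(a, b, c, chunk_size=2))
```

Both calls return the same list, because the order in which independent elements are computed does not affect the result. That independence is what makes this kind of workload a good fit for data-parallel hardware.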
  • 15. CPUs are fundamentally designed for single-thread performance rather than energy efficiency [Steve Scott, November 2011] [7]: fast clock rates with deep pipelines; data and instruction caches optimized for latency; superscalar issue with out-of-order execution; dynamic conflict detection; lots of predictions and speculative execution; lots of instruction overhead per operation. Less than 2% of chip power today goes to flops.
  • 18. Definition [S. Patel and W. Hwu, 2008] [9]: “An accelerator is a separate architectural substructure (on the same chip, or on a different die) that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications.” “Through this manner of design, the accelerator is tuned to provide higher performance at lower cost, or at lower power, or with less development effort than with the general-purpose base hardware.”
  • 19. Example: Intel x87 floating-point (math) coprocessors [10, 11, 12, 13]. During the 1980s and the early 1990s, a separate floating-point coprocessor (Intel 8087, 80187, 80287, 80387, 80487) was used with the 80x86 line of microprocessors. “Later Intel processors (introduced after the 486DX) did not use a separate floating point coprocessor (integrated the floating point hardware on the main processor chip)” Image: http://en.wikipedia.org/wiki/File:80386with387.JPG
  • 20. Example: Graphics Processing Unit (GPU) [14]. “A GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display.” “A GPU can be present on a video card, or it can be on the motherboard or on the CPU die.” “Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.” This last property is the basis of GPU computing.
  • 21. Definition [NVIDIA, 2011] [15]: “GPU computing or GPGPU is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.” “The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing computing model.” “The sequential part of the application runs on the CPU and the computationally-intensive part is accelerated by the GPU.” Image: http://www.nvidia.com/docs/IO/65513/gpu-computing-feature.jpg
  • 22. [Figure: the various stages in the typical pipeline of a modern graphics processing unit (GPU). Illustration courtesy of NVIDIA Corporation.] Annotations: more computationally demanding stages (especially the pixel shader stage); lots of data parallelism (suited to parallel hardware).
  • 23. (1) A fixed-function graphics pipeline. (2) Programmable parts (vertex and pixel) of the graphics pipeline: a programmable engine surrounded by supporting fixed-function units, programmed through graphics APIs and shading languages such as OpenGL, DirectX, and Cg. (3) GPU computing: a unified graphics and compute architecture, in which all programmable units in the graphics pipeline share a single programmable hardware unit, with added support for high-level languages such as C, C++, and Fortran [Owens et al., 2008] [16].
  • 24. Compute Unified Device Architecture (CUDA) [NVIDIA, 2011] [17]: “CUDA is NVIDIA’s parallel computing architecture. It enables dramatic increases in computing performance by harnessing the power of the GPU.” [Figure: Fermi die with one SM labeled] Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache) [NVIDIA, 2009] [6].
  • 25. [Figure: Fermi SM] [NVIDIA, 2009] [6]
  • 27. [NVIDIA] 27
  • 30. Tianhe-1A: ranked #2 on the Top500 list (November 2011); 2.566 PFLOPS (Rmax); 14,336 Xeon X5670 processors; 7,168 Tesla M2050 GPUs; 2,048 NUDT FT1000 heterogeneous processors.
      Processor             Double Precision FLOPS [Peak]   Power Consumption
      Intel Xeon X5670      70.392 GFLOPS                   95 W TDP
      NVIDIA Tesla M2050    515 GFLOPS                      225 W TDP
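The figures on this slide can be turned into the flops/watt metric with simple arithmetic. The sketch below is an illustration only: the names `PROCESSORS`, `gflops_per_watt`, and `aggregate_peak_pflops` are made up for this example, TDP is used as a stand-in for real power draw (an assumption), and the NUDT FT1000 processors are left out because the slide gives no per-chip figures for them.

```python
# Per-chip figures copied from the slide; everything else is derived.
PROCESSORS = {
    "Intel Xeon X5670": {"count": 14336, "peak_gflops": 70.392, "tdp_watts": 95.0},
    "NVIDIA Tesla M2050": {"count": 7168, "peak_gflops": 515.0, "tdp_watts": 225.0},
}


def gflops_per_watt(peak_gflops, tdp_watts):
    """Peak double-precision GFLOPS per watt of TDP for one chip."""
    return peak_gflops / tdp_watts


def aggregate_peak_pflops(count, peak_gflops):
    """Combined peak of `count` identical chips, in PFLOPS."""
    return count * peak_gflops / 1.0e6  # 1 PFLOPS = 1,000,000 GFLOPS


if __name__ == "__main__":
    total = 0.0
    for name, spec in PROCESSORS.items():
        efficiency = gflops_per_watt(spec["peak_gflops"], spec["tdp_watts"])
        peak = aggregate_peak_pflops(spec["count"], spec["peak_gflops"])
        total += peak
        print(f"{name}: {efficiency:.2f} GFLOPS/W, {peak:.2f} PFLOPS aggregate peak")
    print(f"CPUs + GPUs combined: {total:.2f} PFLOPS peak")
```

Under these assumptions the Xeon X5670 comes out at about 0.74 GFLOPS/W and the Tesla M2050 at about 2.29 GFLOPS/W, roughly three times better per watt. The aggregate peak is about 1.01 PFLOPS from the CPUs and 3.69 PFLOPS from the GPUs. The measured Linpack figure on the slide (2.566 PFLOPS Rmax) is lower than this combined peak, as is usual for a benchmark result.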
  • 31. HPC applications need more and more computing power to solve problems that are compute and data intensive. Heterogeneous computing (such as CPU+GPU) helps deliver more cost-effective and energy-efficient performance (flops/dollar, flops/square foot, flops/watt) for applications that need it than using CPUs alone.
  • 32.
      1. International Data Corporation (IDC). November, 2011. IDC Executive Brief - Heterogeneous Computing: A New Paradigm for the Exascale Era.
      2. G. S. Almasi and A. Gottlieb. 1989. Highly Parallel Computing. Benjamin-Cummings publishers, Redwood City, CA.
      3. What is clustering?. Webopedia computer dictionary. Retrieved on November 7, 2007.
      4. Top500 Supercomputing Sites. www.top500.org. Retrieved on December, 2011.
      5. NSCC-TJ National Supercomputing Center in Tianjin. www.nscc-tj.gov.cn. Retrieved on December, 2011.
      6. NVIDIA. 2009. NVIDIA’s Next Generation CUDA Compute Architecture: Fermi V1.1.
      7. Steve Scott. November 15, 2011. Why the Future of HPC will be Green. SC’11.
      8. Peter N. Glaskowsky. September, 2009. NVIDIA’s Fermi: The First Complete GPU Computing Architecture.
  • 33.
      9. S. Patel and W. Hwu. 2008. Guest Editors’ Introduction: Accelerator Architectures. IEEE Micro 28(4): 4-12 (2008).
      10. X87. en.wikipedia.org/wiki/X87. Retrieved on December, 2011.
      11. Coprocessor. en.wikipedia.org/wiki/Coprocessor. Retrieved on December, 2011.
      12. Intel 8087. en.wikipedia.org/wiki/Intel_8087. Retrieved on December, 2011.
      13. x87 info you need to know!. http://coprocessor.cpu-info.com/index2.php?mainid=Copro&tabid=1&page=1. Retrieved on December, 2011.
      14. Graphics Processing Unit. en.wikipedia.org/wiki/Graphics_processing_unit. Retrieved on December, 2011.
      15. NVIDIA. 2011. What is GPU Computing?. www.nvidia.com/object/GPU_Computing.html. Retrieved on December, 2011.
      16. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips. 2008. GPU Computing. Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
      17. NVIDIA. 2011. What is CUDA. developer.nvidia.com/what-cuda. Retrieved on December, 2011.
      18. CUDA. en.wikipedia.org/wiki/CUDA. Retrieved on December, 2011.