Introduction to heterogeneous_computing_for_hpc

Introduction to
Heterogeneous Computing for
High Performance Computing

Presented by
Supasit Kajkamhaeng
1

• Definition [IDC, 2011] [1]

“The term high-performance computing [is used] to
refer to all technical computing servers and
clusters used to solve problems that are
computationally intensive or data intensive
efficiently, reliably and quickly.”

http://www.elseptimoarte.net/peliculas/kung-fu-panda-2-2285.html
http://smu.edu/catco/research/drug-design-a35.html

http://www.prweb.com/releases/cfd/simulation/prweb1891174.htm
http://www.drroyspencer.com/2009/07/how-do-climate-models-work/
2

HPC Applications

HPC Infrastructure: Processors, Memory, Storage, Networks
3

• A form of computation in which many calculations are
carried out simultaneously, operating on the principle
that large problems can often be divided into smaller
ones, which are then solved concurrently ("in parallel").
[Almasi and Gottlieb, 1989] [2]

[Figure: Sequential Computing (one Problem, one stream of Instructions, one CPU) versus Parallel Computing (one Problem divided into four Tasks, each with its own Instructions and its own CPU)]
4
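The idea of dividing a large problem into smaller tasks that are solved concurrently can be sketched in Python as follows. This is a minimal illustration and not part of the original slides: the workload (a sum of squares), the helper names `split`, `solve_sequential` and `solve_parallel`, and the choice of four workers are all assumptions made for the example. In standard CPython, threads mainly show the structure of the decomposition; a process pool or native code would be needed for a real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """The work done for one task: sum x*x over a chunk of numbers."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Divide the problem into at most `parts` contiguous tasks (hypothetical helper)."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def solve_sequential(data):
    """Sequential computing: one worker handles the whole problem."""
    return sum_of_squares(data)


def solve_parallel(data, workers=4):
    """Parallel computing: each task goes to its own worker, then results are combined."""
    tasks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, tasks))
    return sum(partial_sums)


problem = list(range(1, 1001))
sequential_result = solve_sequential(problem)
parallel_result = solve_parallel(problem)
print(sequential_result, parallel_result)
```

Both functions return the same value; only the way the work is organized differs, which mirrors the two halves of the figure.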

• Classes of parallel computers
  - Multicore processor
    - A processor that includes multiple execution units ("cores").
  - Cluster [Webopedia computer dictionary, 2007] [3]
    - A group of linked computers, working together closely so
      that in many respects they form a single computer
    - Used to improve performance and/or availability over that
      provided by a single computer
  - etc.

5

• Advantages
  - Reduced computing time
    - More processors
  - Large-scale jobs become feasible
    - More memory

• Problems (challenges)
  - Complex programming models
    - Difficult development
  - Complex infrastructure
    - Complicated architecture and deployment
6

• Why do HPC applications need more and more computing power?
  - Race against time
    - Solve problems in the shortest time possible
  - Precision improvement
    - In the same amount of time, results can be computed to a
      higher precision

• At present, the limit on computing power can be gauged from
  the performance of the most powerful computer systems in use today
  - Top500 Supercomputing Sites (www.top500.org)
7

• What is the Top500? [www.top500.org] [4]
  - The Top500 lists the 500 fastest computer systems in use today
  - The list was started in 1993 and has been updated
    every 6 months since then
  - The best Linpack benchmark performance achieved is
    used as the performance measure for ranking the
    computers.

8

#1 (Nov 2011)
10.51 PF

9

• One of the challenges is to improve the performance
  (measured in flops) of HPC systems
  - “The worldwide high-performance computing (HPC) market is
    already more than three years into the petascale era (June 2008-present)
    and is looking to make the thousandfold leap into the
    exascale era before the end of this decade.” [IDC, Nov 2011] [1]

• Factors of concern in performance development [IDC, Nov 2011] [1]
  - System costs (flops/dollar)
  - Space and compute density requirements (flops/square foot)
  - Energy costs for computation (flops/watt)

Goal: more flops/dollar, flops/square foot, flops/watt
11

• In many of the most powerful HPC systems, performance
  does not come from CPUs alone

Present: Tianhe-1A [http://www.nscc-tj.gov.cn] [5]
  - #2 on the Top500 list (Nov 2011)
  - 2.566 PFLOPS (Rmax)
  - 14,336 Xeon X5670 CPUs
  - 7,168 Tesla M2050 GPUs
  - 2,048 NUDT FT1000 heterogeneous processors

Present: Jaguar [IDC, Nov 2011] [1]
  - #3 on the Top500 list (Nov 2011)
  - 1.759 PFLOPS (Rmax)
  - 36K AMD Opteron CPUs

Future (2013): Titan [IDC, Nov 2011] [1]
  - 20-30 PFLOPS (Rpeak)
  - 18,000 AMD Opteron CPUs
  - 18,000 Tesla GPUs
12

• Definition [IDC, 2011] [1]

“Heterogeneous computing refers to the use of
multiple types of processors, typically CPUs in
combination with GPUs or other accelerators,
within the same HPC system.”

[Figure: Application Code running on a CPU together with an Accelerator (NVIDIA GPU, AMD GPU, Intel MIC)]

13

• Key characteristics of most HPC application codes
  - Lots of floating-point calculations (operations)
    - “A frequently used sequence of operations in computer graphics,
      linear algebra, and scientific applications is to multiply two
      numbers, adding the product to a third number, for example,
      D = A × B + C (multiply-add (MAD) instruction)” [NVIDIA, 2009] [6]
  - Lots of parallelism
    - Large data sets can be processed in parallel with a massively
      multithreaded SIMD (Single Instruction, Multiple Data) model
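The two characteristics above combine naturally: the same multiply-add is applied independently to every element of a large data set. The Python sketch below illustrates only that pattern. It is not from the slides; the names `multiply_add` and `mad_over_data` and the sample values are assumptions, and plain Python evaluates the elements one after another, whereas SIMD hardware would apply the one instruction to many elements at once.

```python
def multiply_add(a, b, c):
    """One multiply-add: D = A x B + C (two floating-point operations)."""
    return a * b + c


def mad_over_data(A, B, C):
    """Apply the same multiply-add to every element (single instruction, multiple data)."""
    if not (len(A) == len(B) == len(C)):
        raise ValueError("A, B and C must have the same length")
    # Each element is independent of the others, so all of them
    # could in principle be computed at the same time.
    return [multiply_add(a, b, c) for a, b, c in zip(A, B, C)]


# Illustrative inputs (assumed values, chosen to be exact in binary floating point).
A = [1.0, 2.0, 3.0, 4.0]
B = [0.5, 0.5, 0.5, 0.5]
C = [10.0, 20.0, 30.0, 40.0]

D = mad_over_data(A, B, C)
flop_count = 2 * len(D)  # one multiply and one add per element
print(D, flop_count)
```

Because no element depends on any other, the amount of exploitable parallelism grows with the size of the data set, which is the property that massively multithreaded hardware relies on.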

14

• CPUs are fundamentally designed for single-thread
  performance rather than energy efficiency
  [Steve Scott, November 2011] [7]
  - Fast clock rates with deep pipelines
  - Data and instruction caches optimized for latency
  - Superscalar issue with out-of-order execution
  - Dynamic conflict detection
  - Lots of predictions and speculative execution
  - Lots of instruction overhead per operation

Less than 2% of chip power today goes to flops
15

[Peter N. Glaskowsky, 2009] [8]
16

[Peter N. Glaskowsky, 2009] [8]

17

• Definition [S. Patel and W. Hwu, 2008] [9]
  - “An accelerator is a separate architectural substructure
    (on the same chip, or on a different die) that is architected
    using a different set of objectives than the base processor,
    where these objectives are derived from the needs of a
    special class of applications.”
  - “Through this manner of design, the accelerator is tuned to
    provide higher performance at lower cost, or at lower
    power, or with less development effort than with the
    general-purpose base hardware.”

18

• Example
  - Intel x87 floating-point (math) coprocessors [10, 11, 12, 13]
    - During the 1980s and the early 1990s
    - A separate floating-point coprocessor (Intel 8087, 80187,
      80287, 80387, 80487) for the 80x86 line of microprocessors
    - “Later Intel processors (introduced after the 486DX) did not
      use a separate floating point coprocessor (integrated the
      floating point hardware on the main processor chip)”

19
http://en.wikipedia.org/wiki/File:80386with387.JPG

• Example
  - Graphics Processing Unit (GPU) [14]
    - “A GPU is a specialized circuit designed to rapidly manipulate
      and alter memory in such a way so as to accelerate the building
      of images in a frame buffer intended for output to a display.”
    - “A GPU can be present on a video card, or it can be on the
      motherboard or on the CPU die.”
    - “Modern GPUs are very efficient at manipulating computer
      graphics, and their highly parallel structure makes them more
      effective than general-purpose CPUs for algorithms where
      processing of large blocks of data is done in parallel.”

GPU Computing
20

• Definition [NVIDIA, 2011] [15]
  - “GPU computing or GPGPU is the use of a GPU (graphics
    processing unit) to do general purpose scientific and engineering
    computing.”
  - “The model for GPU computing is to use a CPU and GPU
    together in a heterogeneous co-processing computing model.”
  - “The sequential part of the application runs on the CPU and the
    computationally-intensive part is accelerated by the GPU.”

21
http://www.nvidia.com/docs/IO/65513/gpu-computing-feature.jpg

• Computationally demanding stages
  (especially the pixel shader stage)
• Lots of data parallelism
  (well suited to parallel hardware)

[Figure: the various stages in the typical pipeline of a modern graphics processing unit (GPU). Illustration courtesy of NVIDIA Corporation.]
22

A fixed-function graphics pipeline

Programmable parts (vertex and pixel) of the graphics pipeline
(a programmable engine surrounded by supporting fixed-function units, programmed
through graphics APIs and shading languages such as OpenGL, DirectX, and Cg)

GPU Computing

A unified graphics and compute architecture
(all programmable units in the graphics pipeline share a single programmable hardware
unit, with added support for high-level languages such as C, C++, and Fortran)
[Owens et al., 2008] [16]
23

• Compute Unified Device Architecture [NVIDIA, 2011] [17]
  - “CUDA is NVIDIA’s parallel computing architecture. It
    enables dramatic increases in computing performance by
    harnessing the power of the GPU.”

[Figure: Fermi chip layout with one SM labeled]
Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange
portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
[NVIDIA, 2009] [6]
24

Fermi SM

[NVIDIA, 2009] [6]
25

[Wikipedia, 2011] [18]
26

Tianhe-1A

• #2 on the Top500 list (November 2011)
• 2.566 PFLOPS (Rmax)
• 14,336 Xeon X5670 processors
• 7,168 Tesla M2050 GPUs
• 2,048 NUDT FT1000 heterogeneous processors

Processor            Double-Precision FLOPS [Peak]   Power Consumption
Intel Xeon X5670     70.392 GFLOPS                   95 W TDP
NVIDIA Tesla M2050   515 GFLOPS                      225 W TDP
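The figures on this slide can be turned into the flops/watt metric discussed earlier with a few lines of arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative assumptions. It also makes two simplifications that should be kept in mind: TDP is used as a stand-in for actual power draw, and the 2,048 FT1000 processors are left out because the slides give no per-chip figures for them.

```python
# Per-processor figures from the slide (peak double precision, TDP).
XEON_GFLOPS, XEON_TDP_WATTS = 70.392, 95.0
TESLA_GFLOPS, TESLA_TDP_WATTS = 515.0, 225.0

# Tianhe-1A processor counts and measured Linpack result from the slide.
XEON_COUNT, TESLA_COUNT = 14_336, 7_168
RMAX_PFLOPS = 2.566

GFLOPS_PER_PFLOPS = 1_000_000.0

# Energy efficiency, assuming TDP approximates power draw.
xeon_gflops_per_watt = XEON_GFLOPS / XEON_TDP_WATTS
tesla_gflops_per_watt = TESLA_GFLOPS / TESLA_TDP_WATTS

# Aggregate peak performance of the CPUs and GPUs (FT1000 chips excluded).
cpu_peak_pflops = XEON_COUNT * XEON_GFLOPS / GFLOPS_PER_PFLOPS
gpu_peak_pflops = TESLA_COUNT * TESLA_GFLOPS / GFLOPS_PER_PFLOPS
total_peak_pflops = cpu_peak_pflops + gpu_peak_pflops

gpu_share = gpu_peak_pflops / total_peak_pflops
rmax_over_peak = RMAX_PFLOPS / total_peak_pflops

print(f"Xeon X5670:  {xeon_gflops_per_watt:.3f} GFLOPS/W")
print(f"Tesla M2050: {tesla_gflops_per_watt:.3f} GFLOPS/W")
print(f"CPU+GPU peak: {total_peak_pflops:.2f} PFLOPS, GPU share {gpu_share:.1%}")
print(f"Rmax / peak:  {rmax_over_peak:.1%}")
```

Under these assumptions the GPU delivers about 2.29 GFLOPS per watt against about 0.74 for the CPU, roughly a threefold difference. The CPUs and GPUs together give a peak of about 4.70 PFLOPS, of which about 79% comes from the GPUs, and the measured Rmax is about 55% of that combined peak.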

30

• HPC applications need more and more computing power
  to solve problems that are compute and data intensive.

• Heterogeneous computing (such as CPU+GPU) helps
  deliver more cost-effective and energy-efficient performance
  (flops/dollar, flops/square foot, flops/watt) for
  applications that need it than using CPUs alone.

31

1. International Data Corporation (IDC). November 2011. IDC Executive Brief -
Heterogeneous Computing: A New Paradigm for the Exascale Era.
2. G. S. Almasi and A. Gottlieb. 1989. Highly Parallel Computing. Benjamin-Cummings
publishers, Redwood City, CA.
3. What is clustering? Webopedia computer dictionary. Retrieved November 7, 2007.
4. Top500 Supercomputing Sites. www.top500.org. Retrieved December 2011.
5. NSCC-TJ National Supercomputing Center in Tianjin. www.nscc-tj.gov.cn.
Retrieved December 2011.
6. NVIDIA. 2009. NVIDIA’s Next Generation CUDA™ Compute Architecture:
Fermi V1.1.
7. Steve Scott. November 15, 2011. Why the Future of HPC will be Green. SC’11.
8. Peter N. Glaskowsky. September 2009. NVIDIA’s Fermi: The First Complete
GPU Computing Architecture.

32

9. S. Patel and W. Hwu. 2008. Guest Editors’ Introduction: Accelerator
Architectures. IEEE Micro 28(4): 4-12.
10. X87. en.wikipedia.org/wiki/X87. Retrieved December 2011.
11. Coprocessor. en.wikipedia.org/wiki/Coprocessor. Retrieved December 2011.
12. Intel 8087. en.wikipedia.org/wiki/Intel_8087. Retrieved December 2011.
13. x87 info you need to know!
http://coprocessor.cpu-info.com/index2.php?mainid=Copro&tabid=1&page=1.
Retrieved December 2011.
14. Graphics Processing Unit. en.wikipedia.org/wiki/Graphics_processing_unit.
Retrieved December 2011.
15. NVIDIA. 2011. What is GPU Computing?
www.nvidia.com/object/GPU_Computing.html. Retrieved December 2011.
16. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone and J. C. Phillips.
2008. GPU Computing. Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
17. NVIDIA. 2011. What is CUDA. developer.nvidia.com/what-cuda. Retrieved
December 2011.
18. CUDA. en.wikipedia.org/wiki/CUDA. Retrieved December 2011.
33
