Giuseppe will present the differences between high-performance and high-throughput applications. High-throughput computing (HTC) refers to computations where individual tasks do not need to interact while running. It differs from high-performance computing (HPC), where frequent and rapid exchanges of intermediate results are required to perform the computation. HPC codes are based on tightly coupled MPI, OpenMP, GPGPU, and hybrid programs and require low-latency interconnected nodes. HTC can make use of unreliable components, distributing the work out to every node and collecting results at the end of all parallel tasks.
Visit: https://www.eudat.eu/eudat-summer-school
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe Fiameni, CINECA)
1. High Performance & High Throughput
Computing
Giuseppe Fiameni – g.Fiameni@cineca.it
2. Agenda
• Computational Problem
• System architecture
• Serial vs Parallel
• Levels of parallelism
• Models for parallel programming
• HPC and HTC examples
EUDAT Summer School 2017 - High Performance/Throughput Computing 2
3. What makes a problem relevant from a
computational point of view?
4. Size of computational applications
Computational dimension:
the number of operations needed to solve the problem;
in general it is a function of the size of the involved
data structures (n, n^2, n^3, n log n, etc.)
flop - floating point operation
indicates one arithmetic floating point operation.
flop/s - floating point operations per second
is a unit to measure the speed of a computer.
Computational problems today: 10^15 – 10^22 flop
One year has about 3 x 10^7 seconds!
The most powerful computers today reach a sustained
performance of the order of Tflop/s – Pflop/s (10^12 – 10^15 flop/s).
5. Example: Weather Prediction
Forecasts on a global scale:
3D grid to represent the Earth
Earth's circumference: 40,000 km
radius: 6,370 km
Earth's surface: 4πr^2 ≈ 5 x 10^8 km^2
6 variables:
temperature
pressure
humidity
wind speed in the 3 Cartesian directions
cells of 1 km on each side
100 slices to capture how the variables evolve at
the different levels of the atmosphere
a 30-second time step is required for a
simulation at this resolution
Each cell requires about 1000 operations per
time step (Navier-Stokes turbulence and various
phenomena)
In physics, the Navier–Stokes
equations /nævˈjeɪ stoʊks/,
named after Claude-Louis
Navier and George Gabriel
Stokes, describe the motion
of viscous fluid substances.
6. Example: Weather Prediction (cont.)
Grid: 5 x 10^8 x 100 = 5 x 10^10 cells
each cell value is represented with 8 bytes
Memory space:
(6 var) x (8 bytes) x (5 x 10^10 cells) ≈ 2 x 10^12 bytes = 2 TB
A 24-hour forecast needs:
• 24 x 60 x 2 ≈ 3 x 10^3 time steps
• (5 x 10^10 cells) x (10^3 oper.) x (3 x 10^3 time steps) = 1.5 x 10^17
operations!
A computer with a power of 1 Tflop/s will take 1.5 x 10^5 s.
The 24-hour forecast will need about 2 days to run ... but we shall obtain a very
accurate forecast!
7. John Von Neumann
https://en.wikipedia.org/wiki/Von_Neumann_architecture
8. What if you want to increase
performance?
• Increase the clock rate
– the number of clock cycles per second (the processor's pacemaker)
• Increase the instruction rate
– the number of basic instructions that can be executed
per clock cycle
– pipelining, vectorization
• Reduce access time to data
– keeping frequently used data in the processor
caches significantly improves computational
performance
9. Speed of Processors: Clock Cycle and Frequency
The clock cycle is defined as the time between two adjacent pulses
of the oscillator that sets the tempo of the processor.
The number of these pulses per second is known as the clock speed or
clock frequency, generally measured in GHz (gigahertz, or
billions of pulses per second).
The clock cycle controls the synchronization of operations in a
computer: all operations inside the processor last a multiple
of the clock cycle.

Processor        cycle (ns)   frequency
CDC 6600         100          10 MHz
Cyber 76         27.5         36.3 MHz
IBM ES 9000      9            111 MHz
Cray Y-MP C90    4.1          244 MHz
Intel i860       20           50 MHz
PC Pentium       < 0.5        > 2 GHz
Power PC         1.17         850 MHz
IBM Power 5      0.52         1.9 GHz
IBM Power 6      0.21         4.7 GHz

Increasing the clock frequency:
The speed of light sets an upper limit to the speed
with which electronic components can operate.
Propagation velocity of a signal in a vacuum:
300,000 km/s = 30 cm/ns
Heat dissipation problems arise inside the processor.
Quantum tunneling is also expected to become
important.
10. Memory hierarchies
Memory access time: the time required
by the processor to read data from or write
data to memory.
The hierarchy exists because:
fast memory is expensive and small
slow memory is cheap and big
Latency
how long do I have to wait for the
data?
(cannot do anything while waiting)
Throughput
how many bytes per second, but not
important if waiting
Total time = latency + (amount of data / throughput)
Time to run code = clock cycles running code + clock cycles waiting for
memory
11. Loop unrolling
for (i=0; i<N; i++)
    s += a[i]*b[i];

for (i=0; i<N; i += 2)
    s += a[i]*b[i] + a[i+1]*b[i+1];
12. Why Have Cache?
[Diagram: the CPU can consume 351 GB/s, but main memory delivers only 10.66 GB/s (3% of that demand), while cache delivers 253 GB/s (72%)]
13. Parallelism
Serial process:
a process whose sub-processes happen
sequentially in time.
Speed depends only on the rate at which each
sub-process occurs (e.g. processing unit clock
speed).
Parallel process:
a process in which multiple sub-processes can be
active simultaneously.
Speed depends on the execution rate of each sub-process
AND on how many sub-processes can be
made to occur simultaneously.
14. Parallel for Capacity
• Example: searching a database of web sites for some
text (e.g. Google)
• Searching sequentially through large volumes of text is
too time-consuming
– Multiple servers hold different pages
– Each server can report the result of its own
search
– The more servers you add, the quicker the search
is
15. Parallel for Capability
• Example: large-scale weather simulation
• A detailed description of the atmosphere is too large to run
on today's desktop
– Multiple servers are needed to hold all grid data in
memory
– Servers need to communicate quickly to
synchronise work over the entire grid
– Communication between servers can become a
bottleneck. Latency really matters!
16. Capability and capacity
• Parallel for Capability => High Performance Computing
– deals with one large problem
– problems that are hard to parallelise, not
embarrassingly parallel
– single-processor performance is important
• Parallel for Capacity => High Throughput Computing
– appropriate when the problem can be
decomposed into many (very many) smaller problems
that are essentially independent
– problems that are easy to parallelise, parameter
sweeping
– can be adapted to very large numbers of processors
17. Systems main architecture
• Single processor, single core
• Dual processor, single core
• Single processor, dual core
18. Multi-core architecture
• Modern architectures can go up to 34 cores
per processor and 2/4 processors per
node
19. Flynn Taxonomy
• A computer architecture is categorized by the
multiplicity of hardware used to manipulate streams of
instructions (sequences of instructions executed by the
computer) and streams of data (sequences of data used
to execute an instruction stream).
20. SIMD Systems
CU = Control Unit, PU = Processing Unit, MM = Memory Module,
DS = Data Stream, IS = Instruction Stream
[Diagram: a single CU sends one IS to PU1 ... PUn, each processing its own
data stream DS1 ... DSn from memory modules MM1 ... MMn]
Synchronous parallelism
SIMD systems present a
single control unit.
A single instruction
operates simultaneously
on multiple data.
Array processors and
vector systems fall into this
class.
21. MIMD Systems
CU = Control Unit, PU = Processing Unit, MM = Memory Module,
DS = Data Stream, IS = Instruction Stream
[Diagram: CU1 ... CUn each send their own IS1 ... ISn to PU1 ... PUn, each
processing a separate data stream DS1 ... DSn from memory modules MM1 ... MMn]
Asynchronous parallelism
Multiple processors
execute different
instructions operating on
different data.
Represents the
multiprocessor version
of the SISD class.
A wide class, ranging from
multi-core systems to
large MPP systems.
22. Moore's Law
• Empirical law stating that the complexity of
devices (number of transistors per square inch in
microprocessors) doubles every 18 months (Gordon
Moore, Intel co-founder, 1965)
• It is estimated that Moore's Law will still apply in the near
future, but applied to the number of cores per processor
23. The limits of parallelism - Amdahl's Law
s – time spent on a serial
processor on the serial parts
of the code
p – time spent on a serial
processor on the parts that
could be executed in
parallel
If we have N processors, taking s as the fraction of the
time spent in the sequential part
of the program (s + p = 1):

Speedup = (s + p) / (s + p/N) = 1 / (s + (1 − s)/N) → 1/s as N → ∞

les.robertson@cern.ch
Amdahl, G.M., "Validity of the single processor approach to achieving large scale computing capabilities",
Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
24. Amdahl's Law - maximum speedup
[Plot: maximum speedup versus the sequential part as a percentage of total
time; speedup falls from 100x at a 1% sequential fraction to 10x at 10%]
25. The importance of the network
• Latency – the round-trip time (RTT) to
communicate between two processors
Communications overhead:
• For fine-grained parallel programs the
problem is latency, not bandwidth
[Diagram: execution timeline divided into sequential, communications
overhead, and parallelisable portions]

Speedup = (s + p) / (s + c + p/N), where c = latency + data_transfer_time
26. Aspects of parallelism
It has been recognized for a long time that
constant performance improvements cannot be
obtained just by increasing factors such as
processor clock speed – parallelism is needed.
Parallelism can be present at many levels:
Functional parallelism within the CPU
Pipelining and vectorization
Multi-processor and multi-core
Accelerators
Parallel I/O
27. Pipelining
A technique whereby multiple
instructions, belonging to a stream of
sequential execution, overlap their
execution.
This technique improves the
performance of a single processor.
The concept of pipelining is similar to
that of an assembly line in a factory, where
elements are assembled in a continuous
flow along a line (pipe) of assembly stations.
All the assembly stations must operate
at the same processing speed, otherwise
the slowest station becomes the
bottleneck of the entire pipe.
28. Instruction Pipelining
The stages are:
1 Fetch
2 D1 (main decode)
3 D2 (secondary decode, also called translate)
4 EX (execute)
5 WB (write back to registers and memory)
http://www.gamedev.net/page/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
29. CPU Vector units
Vectorization is performed by
dedicated hardware on the chip.
The compiler generates vector
instructions, when it can, from the
programmer's code.
This is an important optimization
which can lead to 4x-8x speedups
according to the "size" of the vector
unit (e.g. 256 bit).
30. CPU vs GPU
• A CPU is designed to handle complex tasks while GPUs only do
one thing well - handle billions of repetitive low level tasks -
originally the rendering of triangles in 3D graphics
• Originally, this was called GPGPU (General Purpose GPU
programming), and it required mapping scientific code to the
matrix operations for manipulating triangles.
• This was insanely difficult to do and took a lot of dedication.
However, with the advent of CUDA and OpenCL, high-level
languages targeting the GPU, GPU programming is rapidly
becoming mainstream in the scientific community.
31. HPC vs HTC
High Performance
• Granularity largely defined by the algorithm and
limitations in the hardware
• Load balancing difficult
• Reliability is all important
– if one part fails the calculation stops (maybe even aborts!)
– check-pointing essential
– hard to dynamically reconfigure for a smaller number
of processors
High Throughput
• Granularity can be selected to fit the environment
• Load balancing easy
• Mixing workloads is easy
• Sustained throughput is the key goal
– the order in which the individual tasks execute is
(usually) not important
– if some equipment goes down the work can be re-run later
– easy to dynamically re-schedule the workload to
different configurations
32. Granularity
• fine-grained parallelism – the independent bits are
small, need to exchange information, synchronize
often
– a smaller number of more expensive processors,
expensively interconnected
• coarse-grained – the problem can be decomposed
into large chunks that can be processed
independently
– a large number of inexpensive processors,
inexpensively interconnected
33. Four parallel processing patterns
(1) Map Only (pleasingly parallel): input → map → output
examples: BLAST analysis, local machine learning
(2) Classic MapReduce: input → map → reduce
examples: High Energy Physics (HEP) histograms, distributed search,
recommender engines
(3) Iterative MapReduce or Map-Collective: input → map → reduce, with iterations
examples: expectation maximization, clustering (e.g. K-means),
linear algebra, PageRank
(4) Point to Point or Map-Communication: local, graph-structured communication
examples: classic MPI, PDE solvers and particle dynamics, graph problems
Patterns (1)-(3) map to MapReduce and iterative extensions (Spark, Twister);
pattern (4) to MPI and Giraph; integrated systems such as Hadoop + Harp
keep the compute and communication models separated.
Courtesy of Prof. Geoffrey Charles Fox – Indiana University
35. Matrix multiplication
• Basic matrix multiplication on a 2-D grid
• Matrix multiplication is an important application in
HPC and appears in many areas (linear algebra)
• C = A * B where A, B, and C are matrices (two-dimensional
arrays)
• A restricted case is when B has only one column -
the matrix-vector product - which appears in the
representation of linear equations and partial
differential equations
36. C = A x B
38. Monte Carlo Methods
• Monte Carlo simulation is a broad term for methods that use
random numbers and statistical sampling to
solve problems, rather than exact modeling.
• Due to the sampling, the result will have some uncertainty,
but the statistical 'law of large numbers' ensures that the
uncertainty goes down as the number of samples grows.
• The statistical law that underlies this is as follows:
– if N independent observations are made of a quantity with
standard deviation σ, then the standard deviation of the
mean is σ/√N. This means that more observations lead
to more accuracy.
• What makes Monte Carlo methods interesting is that the gain
in accuracy is not related to the dimensionality of the original
problem.
39. Agia Pelagia gulf area
[Figure: map of the gulf area inscribed in a unit square, axes from 0 to 1]
40. Hyper-Threading Technology
• Hyper-Threading Technology brings the simultaneous
multi-threading approach to the Intel architecture.
• It makes a single physical processor
appear as two or more logical processors.
• It provides thread-level parallelism (TLP)
on each processor, resulting in increased utilization of processor
execution resources.
• Each logical processor maintains its own copy of the architecture state.
• While HT improves some performance benchmarks by up to
30%, its benefits are strongly application dependent. For HPC
codes, the consequences of HT also depend on inter-node
communication topology and load balance.