Giuseppe will present the differences between high-performance and high-throughput applications. High-throughput computing (HTC) refers to computations where individual tasks do not need to interact while running. It differs from high-performance computing (HPC), where frequent and rapid exchanges of intermediate results are required to perform the computation. HPC codes are based on tightly coupled MPI, OpenMP, GPGPU, and hybrid programs and require low-latency interconnected nodes. HTC can make use of unreliable components, distributing the work out to every node and collecting results at the end of all parallel tasks.
Visit: https://www.eudat.eu/eudat-summer-school
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe Fiameni, CINECA)
1. High Performance & High Throughput
Computing
Giuseppe Fiameni – g.Fiameni@cineca.it
2. Agenda
• Computational Problem
• System architecture
• Serial vs Parallel
• Levels of parallelism
• Models for parallel programming
• HPC and HTC examples
EUDAT Summer School 2017 - High Performance/Throughput Computing 2
3. What makes a problem relevant from a
computational point of view?
4. Size of computational applications
Computational dimension:
the number of operations needed to solve the problem;
in general it is a function of the size of the involved
data structures (n, n^2, n^3, n log n, etc.)
flop - floating point operation
indicates one arithmetic floating point operation.
flop/s - floating point operations per second
is a unit to measure the speed of a computer.
Computational problems today: 10^15 – 10^22 flop
One year has about 3 x 10^7 seconds!
The most powerful computers today reach a sustained
performance of the order of Tflop/s – Pflop/s (10^12 – 10^15 flop/s).
5. Example: Weather Prediction
Forecasts on a global scale:
3D grid to represent the Earth
Earth's circumference: 40,000 km
radius: 6,370 km
Earth's surface: 4πr^2 ≈ 5 x 10^8 km^2
6 variables:
temperature
pressure
humidity
wind speed in the 3 Cartesian directions
cells of 1 km on each side
100 slices to capture how the variables evolve at
the different levels of the atmosphere
a 30-second time step is required for a
simulation at this resolution
Each cell requires about 1000 operations per
time step (Navier-Stokes turbulence and various
phenomena)
In physics, the Navier–Stokes
equations /nævˈjeɪ stoʊks/,
named after Claude-Louis
Navier and George Gabriel
Stokes, describe the motion
of viscous fluid substances.
6. Example: Weather Prediction (cont.)
Grid: 5 x 10^8 x 100 = 5 x 10^10 cells
each cell value is represented with 8 bytes
Memory space:
(6 var) x (8 bytes) x (5 x 10^10 cells) ≈ 2 x 10^12 bytes = 2 TB
A 24-hour forecast needs:
• 24 x 60 x 2 ≈ 3 x 10^3 time steps
• (5 x 10^10 cells) x (10^3 oper.) x (3 x 10^3 time steps) = 1.5 x 10^17
operations!
A computer with a power of 1 Tflop/s will take 1.5 x 10^5 s.
The 24-hour forecast will need about 2 days to run ... but we shall obtain a very
accurate forecast!
7. John Von Neumann
https://en.wikipedia.org/wiki/Von_Neumann_architecture
8. What if you want to increase
performance?
• Increase the clock rate
– the number of clock cycles per second (the processor's pacemaker)
• Increase the instruction rate
– the number of basic instructions that can be executed
per clock cycle
– pipelining, vectorization
• Reduce access time to data
– keeping frequently used data in the processor
caches significantly improves computational
performance
9. Speed of Processors: Clock Cycle and Frequency
The clock cycle is defined as the time between two adjacent pulses
of the oscillator that sets the tempo of the processor.
The number of these pulses per second is known as the clock speed or
clock frequency, generally measured in GHz (gigahertz, or
billions of pulses per second).
The clock cycle controls the synchronization of operations in a
computer: all operations inside the processor last a multiple
of the clock cycle.

Processor        cycle (ns)   frequency
CDC 6600         100          10 MHz
Cyber 76         27.5         36.3 MHz
IBM ES 9000      9            111 MHz
Cray Y-MP C90    4.1          244 MHz
Intel i860       20           50 MHz
PC Pentium       < 0.5        > 2 GHz
Power PC         1.17         850 MHz
IBM Power 5      0.52         1.9 GHz
IBM Power 6      0.21         4.7 GHz

Increasing the clock frequency:
The speed of light sets an upper limit to the speed
with which electronic components can operate.
Propagation velocity of a signal in a vacuum:
300,000 km/s = 30 cm/ns
Heat dissipation problems arise inside the processor.
Quantum tunneling is also expected to become
important.
10. Memory hierarchies
Memory access time: the time required
by the processor to read data from or write
data to memory.
The hierarchy exists because:
fast memory is expensive and small
slow memory is cheap and big
Latency
how long do I have to wait for the
data?
(cannot do anything while waiting)
Throughput
how many bytes per second, but not
important if waiting
Total time = latency + (amount of data / throughput)
Time to run code = clock cycles running code + clock cycles waiting for
memory
11. Loop unrolling
for (i=0; i<N; i++)
    s += a[i]*b[i];

for (i=0; i<N; i += 2)
    s += a[i]*b[i] + a[i+1]*b[i+1];
12. Why Have Cache?
[Diagram: the CPU can consume 351 GB/s, but main memory delivers only 10.66 GB/s (3% of that demand), while cache delivers 253 GB/s (72%)]
13. Parallelism
Serial process:
a process whose sub-processes happen
sequentially in time.
Speed depends only on the rate at which each
sub-process occurs (e.g. processing unit clock
speed).
Parallel process:
a process in which multiple sub-processes can be
active simultaneously.
Speed depends on the execution rate of each sub-process
AND on how many sub-processes can be
made to occur simultaneously.
14. Parallel for Capacity
• Example: searching a database of web sites for some
text (e.g. Google)
• Searching sequentially through large volumes of text is
too time-consuming
– Multiple servers hold different pages
– Each server can report the result of its own
search
– The more servers you add, the quicker the search
is
15. Parallel for Capability
• Example: large-scale weather simulation
• A detailed description of the atmosphere is too large to run
on today's desktop
– Multiple servers are needed to hold all grid data in
memory
– Servers need to communicate quickly to
synchronise work over the entire grid
– Communication between servers can become a
bottleneck. Latency really matters!
16. Capability and capacity
• Parallel for Capability => High Performance Computing
– deals with one large problem
– problems that are hard to parallelise, not
embarrassingly parallel
– single-processor performance is important
• Parallel for Capacity => High Throughput Computing
– appropriate when the problem can be
decomposed into many (very many) smaller problems
that are essentially independent
– problems that are easy to parallelise, parameter
sweeping
– can be adapted to very large numbers of processors
17. Systems main architecture
• Single processor, single core
• Dual processor, single core
• Single processor, dual core
18. Multi-core architecture
• Modern architectures can go up to 34 cores
per processor and 2/4 processors per
node
19. Flynn Taxonomy
• A computer architecture is categorized by the
multiplicity of hardware used to manipulate streams of
instructions (sequences of instructions executed by the
computer) and streams of data (sequences of data used
to execute an instruction stream).
20. SIMD Systems
CU = Control Unit, PU = Processing Unit, MM = Memory Module,
DS = Data Stream, IS = Instruction Stream
[Diagram: a single CU sends one IS to PU1 ... PUn, each processing its own
data stream DS1 ... DSn from memory modules MM1 ... MMn]
Synchronous parallelism
SIMD systems present a
single control unit.
A single instruction
operates simultaneously
on multiple data.
Array processors and
vector systems fall into this
class.
21. MIMD Systems
CU = Control Unit, PU = Processing Unit, MM = Memory Module,
DS = Data Stream, IS = Instruction Stream
[Diagram: CU1 ... CUn each send their own IS1 ... ISn to PU1 ... PUn, each
processing a separate data stream DS1 ... DSn from memory modules MM1 ... MMn]
Asynchronous parallelism
Multiple processors
execute different
instructions operating on
different data.
Represents the
multiprocessor version
of the SISD class.
A wide class, ranging from
multi-core systems to
large MPP systems.
22. Moore's Law
• Empirical law stating that the complexity of
devices (number of transistors per square inch in
microprocessors) doubles every 18 months (Gordon
Moore, Intel co-founder, 1965)
• It is estimated that Moore's Law will still apply in the near
future, but applied to the number of cores per processor
23. The limits of parallelism - Amdahl's Law
s – time spent on a serial
processor on the serial parts
of the code
p – time spent on a serial
processor on the parts that
could be executed in
parallel
If we have N processors, taking s as the fraction of the
time spent in the sequential part
of the program (s + p = 1):

Speedup = (s + p) / (s + p/N) = 1 / (s + (1 − s)/N) → 1/s as N → ∞

les.robertson@cern.ch
Amdahl, G.M., "Validity of the single processor approach to achieving large scale computing capabilities",
Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
24. Amdahl's Law - maximum speedup
[Plot: maximum speedup versus the sequential part as a percentage of total
time; speedup falls from 100x at a 1% sequential fraction to 10x at 10%]
25. The importance of the network
• Latency – the round-trip time (RTT) to
communicate between two processors
Communications overhead:
• For fine-grained parallel programs the
problem is latency, not bandwidth
[Diagram: execution timeline divided into sequential, communications
overhead, and parallelisable portions]

Speedup = (s + p) / (s + c + p/N), where c = latency + data_transfer_time
26. Aspects of parallelism
It has been recognized for a long time that
constant performance improvements cannot be
obtained just by increasing factors such as
processor clock speed – parallelism is needed.
Parallelism can be present at many levels:
Functional parallelism within the CPU
Pipelining and vectorization
Multi-processor and multi-core
Accelerators
Parallel I/O
27. Pipelining
A technique whereby multiple
instructions, belonging to a stream of
sequential execution, overlap their
execution.
This technique improves the
performance of a single processor.
The concept of pipelining is similar to
that of an assembly line in a factory, where
elements are assembled in a continuous
flow along a line (pipe) of assembly stations.
All the assembly stations must operate
at the same processing speed, otherwise
the slowest station becomes the
bottleneck of the entire pipe.
28. Instruction Pipelining
The stages are:
1 Fetch
2 D1 (main decode)
3 D2 (secondary decode, also called translate)
4 EX (execute)
5 WB (write back to registers and memory)
http://www.gamedev.net/page/resources/_/technical/general-programming/a-journey-through-the-cpu-pipeline-r3115
29. CPU Vector units
Vectorization is performed by
dedicated hardware on the chip.
The compiler generates vector
instructions, when it can, from the
programmer's code.
This is an important optimization
which can lead to 4x-8x speedups
according to the "size" of the vector
unit (e.g. 256 bit).
30. CPU vs GPU
• A CPU is designed to handle complex tasks while GPUs only do
one thing well - handle billions of repetitive low level tasks -
originally the rendering of triangles in 3D graphics
• Originally, this was called GPGPU (General Purpose GPU
programming), and it required mapping scientific code to the
matrix operations for manipulating triangles.
• This was insanely difficult to do and took a lot of dedication.
However, with the advent of CUDA and OpenCL, high-level
languages targeting the GPU, GPU programming is rapidly
becoming mainstream in the scientific community.
31. HPC vs HTC
High Performance
• Granularity largely defined by the algorithm and
limitations in the hardware
• Load balancing difficult
• Reliability is all important
– if one part fails the calculation stops (maybe even aborts!)
– check-pointing essential
– hard to dynamically reconfigure for a smaller number
of processors
High Throughput
• Granularity can be selected to fit the environment
• Load balancing easy
• Mixing workloads is easy
• Sustained throughput is the key goal
– the order in which the individual tasks execute is
(usually) not important
– if some equipment goes down the work can be re-run later
– easy to dynamically re-schedule the workload to
different configurations
32. Granularity
• fine-grained parallelism – the independent bits are
small, need to exchange information, synchronize
often
– a smaller number of more expensive processors,
expensively interconnected
• coarse-grained – the problem can be decomposed
into large chunks that can be processed
independently
– a large number of inexpensive processors,
inexpensively interconnected
33. Four parallel processing patterns
(1) Map Only (pleasingly parallel): input → map → output
examples: BLAST analysis, local machine learning
(2) Classic MapReduce: input → map → reduce
examples: High Energy Physics (HEP) histograms, distributed search,
recommender engines
(3) Iterative MapReduce or Map-Collective: input → map → reduce, with iterations
examples: expectation maximization, clustering (e.g. K-means),
linear algebra, PageRank
(4) Point to Point or Map-Communication: local, graph-structured communication
examples: classic MPI, PDE solvers and particle dynamics, graph problems
Patterns (1)-(3) map to MapReduce and iterative extensions (Spark, Twister);
pattern (4) to MPI and Giraph; integrated systems such as Hadoop + Harp
keep the compute and communication models separated.
Courtesy of Prof. Geoffrey Charles Fox – Indiana University
35. Matrix multiplication
• Basic matrix multiplication on a 2-D grid
• Matrix multiplication is an important application in
HPC and appears in many areas (linear algebra)
• C = A * B where A, B, and C are matrices (two-dimensional
arrays)
• A restricted case is when B has only one column -
the matrix-vector product - which appears in the
representation of linear equations and partial
differential equations
36. C = A x B
38. Monte Carlo Methods
• Monte Carlo simulation is a broad term for methods that use
random numbers and statistical sampling to
solve problems, rather than exact modeling.
• Due to the sampling, the result will have some uncertainty,
but the statistical 'law of large numbers' ensures that the
uncertainty goes down as the number of samples grows.
• The statistical law that underlies this is as follows:
– if N independent observations are made of a quantity with
standard deviation σ, then the standard deviation of the
mean is σ/√N. This means that more observations lead
to more accuracy.
• What makes Monte Carlo methods interesting is that the gain
in accuracy is not related to the dimensionality of the original
problem.
39. Agia Pelagia gulf area
[Figure: map of the gulf area inscribed in a unit square, axes from 0 to 1]
40. Hyper-Threading Technology
• Hyper-Threading Technology brings the simultaneous
multi-threading approach to the Intel architecture.
• It makes a single physical processor
appear as two or more logical processors.
• It provides thread-level parallelism (TLP)
on each processor, resulting in increased utilization of processor
execution resources.
• Each logical processor maintains its own copy of the architecture state.
• While HT improves some performance benchmarks by up to
30%, its benefits are strongly application dependent. For HPC
codes, the consequences of HT also depend on inter-node
communication topology and load balance.