1. Loop Parallelization & Pipelining
AND
Trends in Parallel Systems & Forms of Parallelism
By
Jagrat Gupta
M.Tech (CSE), 1st Year
(Madhav Institute of Technology and Science, Gwalior-467005)
2. Loop Parallelization & Pipelining
This section describes the theory and application of loop transformations for vectorization and parallelization.
Loop Transformation Theory:-
Parallelizing loop nests is one of the most fundamental program optimizations demanded of a vectorizing and parallelizing compiler.
The main goal is to maximize the degree of parallelism or data locality in a loop nest. It also supports efficient use of the memory hierarchy on a parallel machine.
3. Elementary Transformations:-
Permutation:- Simply interchange loops i and j.
Do i=1,N
Do j=1,N
A(j)=A(j)+C(i,j)
End Do
End Do
Before Transformation
Do j=1,N
Do i=1,N
A(j)=A(j)+C(i,j)
End Do
End Do
After Transformation
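The interchange is legal here because the update of A(j) merely accumulates over i, so the two loop orders read and write the same values. A minimal Python sketch (the slides use Fortran-style loops) confirms both nests produce the same result:

```python
import numpy as np

def before(A, C):
    # original nest: i outer, j inner
    N = len(A)
    for i in range(N):
        for j in range(N):
            A[j] = A[j] + C[i, j]
    return A

def after(A, C):
    # interchanged nest: j outer, i inner
    N = len(A)
    for j in range(N):
        for i in range(N):
            A[j] = A[j] + C[i, j]
    return A

N = 4
C = np.arange(N * N, dtype=float).reshape(N, N)
r1 = before(np.zeros(N), C)
r2 = after(np.zeros(N), C)
print(np.allclose(r1, r2))  # True: both orders give the same A
```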
Reversal:- Reversal of the ith loop is represented by the identity
matrix with the ith element on the diagonal equal to -1.
4. Do i=1,N
Do j=1,N
A(i,j)=A(i-1,j+1)
End Do
End Do
Before Transformation
Do i=1,N
Do j=-N,-1
A(i,-j)=A(i-1,-j+1)
End Do
End Do
After Transformation
The reversal of the inner loop is represented by the matrix:
T = | 1  0 |
    | 0 -1 |
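The reversal is legal because the dependence distance (1,-1) becomes (1,1) under T. A minimal Python sketch of the two nests (initial boundary values are an assumption for illustration):

```python
N = 4

def run_original():
    # extra row/column give the stencil A(i-1, j+1) initial values to read
    A = [[float(i + j) for j in range(N + 2)] for i in range(N + 2)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = A[i - 1][j + 1]
    return A

def run_reversed():
    A = [[float(i + j) for j in range(N + 2)] for i in range(N + 2)]
    for i in range(1, N + 1):
        for j in range(-N, 0):            # inner loop reversed: j = -N .. -1
            A[i][-j] = A[i - 1][-j + 1]   # indices negated to compensate
    return A

print(run_original() == run_reversed())  # True: reversal preserves the result
```

Within a fixed i, every write to row i reads only row i-1, so traversing j in the opposite order cannot change any value.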
5. Skewing:- Skewing loop Ij by an integer factor f with respect to loop Ii adds f times the Ii index to the Ij index. In the following loop nest, the transformation performed is a skew of the inner loop with respect to the outer loop by a factor of 1, represented by the matrix:
T = | 1 0 |
    | 1 1 |
Do i=1,N
Do j=1,N
A(i,j)=A(i,j-1)+A(i-1,j)
End Do
End Do
Before Transformation
Do i=1,N
Do j=i+1,i+N
A(i,j-i)=A(i,j-i-1)+A(i-1,j-i)
End Do
End Do
After Transformation
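After the skew, the inner bounds shift with i (j runs from i+1 to i+N) and the body indices subtract i back out, so the same iterations execute in the same order. A minimal Python sketch (boundary values of 1.0 are an assumption for illustration):

```python
N = 4

def init():
    # row 0 and column 0 act as a halo so the stencil has initial values
    return [[1.0] * (N + 1) for _ in range(N + 1)]

def original():
    A = init()
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = A[i][j - 1] + A[i - 1][j]
    return A

def skewed():
    A = init()
    for i in range(1, N + 1):
        for j in range(i + 1, i + N + 1):   # bounds skewed by i
            A[i][j - i] = A[i][j - i - 1] + A[i - 1][j - i]
    return A

print(original() == skewed())  # True: skewing preserves the result
```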
7. Transformation Matrices:-
Unimodular transformations are defined by transformation matrices.
A unimodular matrix has 3 important properties:-
1) It is square, i.e. it maps an n-dimensional iteration space into an n-dimensional iteration space.
2) It has all integer components, so it maps integer vectors to integer vectors.
3) The absolute value of its determinant is 1.
Wolf and Lam have stated the following conditions for unimodular transformations:-
1) Let D be the set of distance vectors of a loop nest. A unimodular transformation T is legal if and only if, for all d ∈ D, T·d ≥ 0 (lexicographically positive).
2) Loops i through j of a nested computation with dependence vectors D are fully permutable if, for every d ∈ D, either (d1,...,d(i-1)) is lexicographically positive or (di,...,dj) >= 0 in every component.
8.
Do i=1,N
Do j=1,N
A(i,j)=f(A(i,j),A(i+1,j-1))
End Do
End Do
This code has the dependence vector d=(1,-1). The loop interchange transformation is represented by the matrix:
T = | 0 1 |
    | 1 0 |
9. Here T·d = (-1,1), which is lexicographically negative, so the interchange alone is illegal.
Now compound the interchange with a reversal of the (new) outer loop, represented by the matrix:
R = | -1 0 |
    |  0 1 |
The compound transformation is:
T' = R·T = | 0 -1 |
           | 1  0 |
Now T'·d = (1,1), which is lexicographically positive, so the compound transformation is legal.
Parallelization and Wavefronting:-
The theory of loop transformations can be applied to execute loop iterations in parallel.
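The legality test (T·d lexicographically positive for every dependence) is mechanical. A minimal Python sketch, assuming the compound matrix is the reversal applied after the interchange:

```python
def lex_positive(v):
    # lexicographically positive: the first nonzero component is > 0
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not positive

def apply(T, d):
    # matrix-vector product over integer tuples
    return tuple(sum(T[r][c] * d[c] for c in range(len(d))) for r in range(len(T)))

d = (1, -1)                        # dependence of the example loop nest
interchange = [[0, 1], [1, 0]]
compound = [[0, -1], [1, 0]]       # reversal of the outer loop after interchange

print(lex_positive(apply(interchange, d)))  # False: interchange alone is illegal
print(lex_positive(apply(compound, d)))     # True: the compound is legal
```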
10. Parallelization Conditions:- The purpose of loop parallelization is to maximize the number of parallelizable loops. The algorithm for loop parallelization consists of two steps:-
1) It first transforms the original loop nest into canonical form, namely a fully permutable loop nest.
2) It then transforms the fully permutable loop nest to exploit coarse- and/or fine-grain parallelism according to the target architecture.
Fine Grain Wavefronting:-
• A nest of n fully permutable loops can be transformed into code containing at least (n-1) degrees of parallelism. These (n-1) parallel loops can be obtained by skewing the innermost loop in the fully permutable nest by each of the other loops and moving the innermost loop to the outermost position.
This transformation, called the wavefront transformation, is represented by the following matrix:-
11.
| 1 1 1 . . . 1 1 |
| 1 0 0 . . . 0 0 |
| 0 1 0 . . . 0 0 |
| . . .     . . . |
| 0 0 0 . . . 1 0 |
• Fine-grain parallelism is exploited on vector machines, superscalar processors, and systolic arrays.
• The wavefront transformation automatically places the maximum number of doall loops in the innermost positions, maximizing fine-grain parallelism.
12. Coarse Grain Parallelism:-
• A wavefront transformation produces the maximum degree of parallelism, but it makes the outermost loop sequential.
• A heuristic, though non-optimal, approach to making loops doall is simply to identify loops Ii such that all components di are zero. Those loops can be made outermost doalls. The remaining loops in the tile can then be wavefronted to obtain the remaining parallelism.
• The loop parallelization algorithm has a common step for fine-grain and coarse-grain parallelism: creating an n-deep fully permutable loop nest by skewing. The algorithm can be tailored to different machines based on the following guidelines:-
1) Move doall loops innermost for fine-grain machines. Apply a wavefront transformation to create up to (n-1) doall loops.
2) Create outermost doall loops for coarse-grain machines. Apply tiling to a fully permutable loop nest.
3) Use tiling to create loops for both fine- and coarse-grain machines.
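The wavefront idea can be sketched on the 2-deep stencil from the skewing example: the transformed outer loop t = i + j runs sequentially, while for a fixed t all iterations (i, j) = (p, t - p) are independent and form a doall. A minimal Python sketch (loop bounds derived here are an assumption; boundary values of 1.0 are for illustration):

```python
N = 4

def original():
    A = [[1.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            A[i][j] = A[i][j - 1] + A[i - 1][j]
    return A

def wavefront():
    A = [[1.0] * (N + 1) for _ in range(N + 1)]
    for t in range(2, 2 * N + 1):                          # sequential wavefront
        for p in range(max(1, t - N), min(N, t - 1) + 1):  # doall within a wavefront
            i, j = p, t - p
            A[i][j] = A[i][j - 1] + A[i - 1][j]            # deps lie on wavefront t-1
    return A

print(original() == wavefront())  # True: the wavefront order preserves the result
```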
13. Tiling & Localization:-
The purpose is to reduce synchronization overhead and to enhance multiprocessor efficiency when loops are distributed for parallel execution.
It is possible to reduce the synchronization cost and improve the data locality of parallelized loops via an optimization known as tiling.
In general, tiling maps an n-deep loop nest into a 2n-deep loop nest where the inner n loops include only a small fixed number of iterations. The outer n loops of the tiled code control the execution of the tiles. Tiled loops also satisfy the property of full permutability.
We can reduce the synchronization cost in the following way: first tile the loops, then apply the wavefront transformation to the controlling loops of the tiles. In this way, the synchronization cost is reduced by a factor of the tile size.
15. Tiling for Locality:-
• Tiling is a technique to improve the data locality of numerical algorithms.
• It can be used for different levels of memory (caches and registers); multiple levels of tiling can be used to achieve locality at multiple levels of the memory hierarchy simultaneously.
Do i=1,N
Do j=1,N
Do k=1,N
C(i,k)=C(i,k)+A(i,j)*B(j,k)
End Do
End Do
End Do
Before Tiling
Do l=1,N,s
Do m=1,N,s
Do i=1,N
Do j=l, min(l+s-1,N)
Do k=m, min(m+s-1,N)
C(i,k)=C(i,k)+A(i,j)*B(j,k)
End Do
End Do
End Do
End Do
End Do
After Tiling
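The tiled matrix-multiply above can be checked directly: the tiled order visits exactly the same (i, j, k) triples, only regrouped, so the product is unchanged. A minimal Python sketch of the tiled nest (tile size s and problem size N are arbitrary assumed values; N is deliberately not a multiple of s to exercise the min() bounds):

```python
import numpy as np

def matmul_tiled(A, B, s):
    N = A.shape[0]
    C = np.zeros((N, N))
    for l in range(0, N, s):            # tile-controlling loops
        for m in range(0, N, s):
            for i in range(N):
                for j in range(l, min(l + s, N)):       # inner loops stay within
                    for k in range(m, min(m + s, N)):   # one s-by-s tile
                        C[i, k] += A[i, j] * B[j, k]
    return C

N, s = 6, 4
A = np.random.rand(N, N)
B = np.random.rand(N, N)
print(np.allclose(matmul_tiled(A, B, s), A @ B))  # True
```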
16. • In the previous code, rows of B and C are reused in later iterations of the middle and outer loops. Tiling reorders the execution sequence so that iterations from loops of the outer dimensions are executed before all the iterations of the inner loops are completed.
• Tiling reduces the number of intervening iterations and the data fetched between data reuses. This allows reused data to still be in the cache or register file and hence reduces memory accesses.
17. Pipelining
Software Pipelining:-
Software pipelining overlaps successive iterations of a loop in the source program. Its advantage is reduced execution time with compact object code.
Pipelining of loop iterations:- (Lam's tutorial notes)
Do i=1,N
A(i)= A(i)*B+C
End Do
• In the above code the iterations are independent. It is assumed that each memory access (read or write) takes 1 cycle and each arithmetic operation (multiply or add) takes 2 cycles.
18. • Without Pipelining:-
One iteration requires 6 cycles to execute, so N iterations require 6N cycles, ignoring loop-control overhead.
Cycles | Instruction | Comment
1      | Read        | fetch A(i)
2-3    | Mul         | multiply by B
4-5    | Add         | add C
6      | Write       | store A(i)
• With Pipelining:-
The same code is now executed on an 8-deep instruction pipeline.
19.
Cycle | Iter 1 | Iter 2 | Iter 3 | Iter 4
  1   |   R    |        |        |
  2   |  Mul   |        |        |
  3   |        |   R    |        |
  4   |        |  Mul   |        |
  5   |  Add   |        |   R    |
  6   |        |        |  Mul   |
  7   |        |  Add   |        |   R
  8   |   W    |        |        |  Mul
  9   |        |        |  Add   |
 10   |        |   W    |        |
 11   |        |        |        |  Add
 12   |        |        |   W    |
 13   |        |        |        |
 14   |        |        |        |   W
20. Hence 4 iterations require 14 clock cycles.
Speedup factor = 24/14 ≈ 1.7
For N iterations, the speedup is 6N/(2N+6), which approaches 3 for large N.
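The timing model above reduces to two simple formulas: a new iteration is issued every 2 cycles, and the last one drains in 6 more. A minimal Python sketch:

```python
def cycles_unpipelined(n):
    # each iteration: 1 read + 2 mul + 2 add + 1 write = 6 cycles
    return 6 * n

def cycles_pipelined(n):
    # one iteration issued every 2 cycles, plus 6 cycles to drain the last
    return 2 * n + 6

for n in (4, 100, 10**6):
    # the speedup 6n / (2n + 6) approaches 3 as n grows
    print(n, cycles_unpipelined(n) / cycles_pipelined(n))
```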
21. Trends towards Parallel Systems
From an application point of view, mainstream computer usage is experiencing a trend of four ascending levels of sophistication:-
• Data processing.
• Information processing.
• Knowledge processing.
• Intelligence processing.
Computer usage started with data processing, which is still a major task of today's computers. With more and more data structures developed, many users are shifting their computer usage from pure data processing to information processing. As accumulated knowledge bases have expanded rapidly in recent years, a strong demand has grown to use computers for knowledge processing.
22. Intelligence is very difficult to create, and its processing even more so. Today's computers are very fast and obedient and have many reliable memory cells, qualifying them for data, information, and knowledge processing. Computers are, however, far from satisfactory at theorem proving, logical inference, and creative thinking.
23. Forms Of Parallelism
Parallelism in Hardware (Uniprocessor)
– Pipelining
– Superscalar, VLIW etc.
Parallelism in Hardware (SIMD, Vector processors, GPUs)
Parallelism in Hardware (Multiprocessor)
– Shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
Parallelism in Hardware (Multicomputers a.k.a. clusters)
Parallelism in Software
– Task parallelism
– Data parallelism
24. Instruction Level Parallelism:-
• Multiple instructions from the same instruction stream can be
executed concurrently. The potential overlap among instructions is
called instruction level parallelism.
• Generated and managed by hardware (superscalar) or by compiler
(VLIW).
• Limited in practice by data and control dependences.
• There are two approaches to instruction level parallelism:
-Hardware.
-Software.
• The hardware level works on dynamic parallelism, whereas the software level works on static parallelism.
• Consider the following program:
1. e = a + b
2. f = c + d
3. m = e * f
25. • Operation 3 depends on the results of operations 1 and 2, so it
cannot be calculated until both of them are completed. However,
operations 1 and 2 do not depend on any other operation, so
they can be calculated simultaneously. If we assume that each
operation can be completed in one unit of time then these three
instructions can be completed in a total of two units of time,
giving an ILP of 3/2.
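The ILP calculation above can be sketched as a tiny list-scheduling computation (a Python sketch; operation names follow the three-statement example, and each operation is assumed to take one unit of time):

```python
# each op runs as soon as its inputs are ready, with unlimited functional units
deps = {
    'e': [],          # e = a + b
    'f': [],          # f = c + d
    'm': ['e', 'f'],  # m = e * f  (must wait for e and f)
}

finish = {}
for op in ('e', 'f', 'm'):
    start = max((finish[d] for d in deps[op]), default=0)
    finish[op] = start + 1        # every op takes one unit of time

total_ops = len(deps)
critical_path = max(finish.values())
print(total_ops / critical_path)  # 1.5, i.e. an ILP of 3/2
```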
Thread-level or task-level parallelism (TLP):-
• Multiple threads or instruction sequences from the same
application can be executed concurrently.
• Generated by compiler/user and managed by compiler and
hardware.
• Limited in practice by communication/synchronization overheads
and by algorithm characteristics.
26. Data-level parallelism (DLP):-
• Instructions from a single stream operate concurrently on several data elements.
• Limited by non-regular data manipulation patterns and by
memory bandwidth.
Transaction-level parallelism:-
• Multiple threads/processes from different transactions can be
executed concurrently.
• Limited by access to metadata and by interconnection bandwidth.
27. Parallel Computing
• Use of multiple processors or computers working together on a
common task.
–Each processor works on its section of the problem.
–Processors can exchange information.
(Figure: a grid of the problem to be solved, divided into four areas, each worked on by one of CPUs #1-#4, with data exchanged across the boundaries between neighbouring CPUs.)
28. Why Do Parallel Computing?
Limits of single CPU computing
–performance
–available memory
Parallel computing allows one to:
–solve problems that don’t fit on a single CPU
–solve problems that can’t be solved in a reasonable time
We can solve…
–larger problems
–the same problem faster
–more cases
29. Brent's Theorem
Statement:- Given a parallel algorithm A with computation time t, if A performs m computational operations, then p processors can execute A in time:
t + (m - t)/p
Proof:- Let si be the number of computational operations performed by parallel algorithm A at step i (1 <= i <= t). Then
Σ (i=1 to t) si = m.
Since we have p processors, we can simulate step i in time ceil(si/p). So the entire computation of A can be performed with p processors in time:
30.
Σ (i=1 to t) ceil(si/p) <= Σ (i=1 to t) (si + p - 1)/p
(using the definition of the ceiling function)
= Σ (i=1 to t) 1 + Σ (i=1 to t) (si - 1)/p
= t + (m - t)/p
(Hence proved.)
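The bound can be sanity-checked numerically (a minimal Python sketch; the step sizes si are arbitrary assumed values):

```python
import math
import random

def simulated_time(steps, p):
    # p processors simulate step i (s_i operations) in ceil(s_i / p) time
    return sum(math.ceil(s / p) for s in steps)

random.seed(1)
steps = [random.randint(1, 20) for _ in range(8)]   # s_1 .. s_t
t, m, p = len(steps), sum(steps), 4
# Brent's bound: simulated time never exceeds t + (m - t)/p
print(simulated_time(steps, p) <= t + (m - t) / p)  # True
```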