Design and Development of a Floating-point Co-processor
for the Acceleration of Graphics functions
Sandip Jassar
Department of Electronics, The University of York
Academic Supervisors: Andy Tyrrell, Jonathan Dell
Xilinx Development Centre, Edinburgh
Industrial Supervisor: Richard Walke
15th June, 2005
Final Project Report for degree of MEng in Electronic and Computer
Engineering
Abstract
This report details the design and development of a processing architecture, complete
with a Controller and DMA unit. The reader is shown how the architecture was
optimised for executing Nested-Loop Programs, and in particular those found in the
geometric transformation stage of the OpenGL pipeline.
Contents
Section 1 : Introduction
  1.1 : Parallel Processing for Graphics Functions
  1.2 : OpenGL
    1.2.1 : The OpenGL Pipeline
    1.2.2 : The Geometric Transformation Process
      1.2.2.1 : The ModelView Transformation
      1.2.2.2 : The Projection Transformation
      1.2.2.3 : Perspective Division and the ViewPort Transformation
  1.3 : Overview of Transformations on Nested Loop Programs
Section 2 : FIR Filter Analysis
  2.1 : The Single MACC FIR Filter
  2.2 : The Transposed FIR Filter
  2.3 : The Systolic FIR Filter
  2.4 : The Semi-Parallel FIR Filter
Section 3 : Matrix-Vector Multiplication Analysis
  3.1 : The Sequential Model of Computation
  3.2 : Exploiting the Inherent Parallelism
    3.2.1 : Unrolling the Inner-Loop
    3.2.2 : Unrolling the Outer-Loop
  3.3 : Using Pipelined MACC Units
    3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit
Section 4 : The Floating Point Unit
  4.1 : Dealing with Hazards
    4.1.1 : Using the Scoreboard to Detect Data Hazards
    4.1.2 : Managing Data Hazards
    4.1.3 : Managing Structural Hazards
Section 5 : The Controller
  5.1 : Initial Look-Up Table Design
  5.2 : Optimising the Controller for Running Geometric Transformation Programs
    5.2.1 : The Use of Induction Variables
    5.2.2 : Performing Strength Reduction on Induction Variables
    5.2.3 : Optimising a Geometric Transformation for the Controller
    5.2.4 : Designing the Optimal Controller
      5.2.4.1 : The Data Address Generator Design
      5.2.4.2 : The Use of Data Address Generators as Part of the Controller
      5.2.4.3 : The Loop Counter Structure
    5.2.5 : The High-Level Controller
    5.2.6 : The DMA Unit and Context Switching
Section 6 : Testing and Integration
  6.1 : Testing of the FPU's Control Unit
  6.2 : Testing and Integrating the Register File
  6.3 : Testing and Integrating the Execution Pipeline
  6.4 : Testing and Integration of the Look-Up Table Controller
  6.5 : Testing of the Optimal Controller's Data Address Generator
  6.6 : Testing and Integration of the Optimal Controller's Loop Counter Structure, Program Memory and DAGs
  6.7 : Testing and Integration of the DMA Unit
  6.8 : Testing and Integration of the High-Level Controller
  6.9 : The OpenGL Demonstration
  6.10 : Progress in Developing a Hardware Based Demonstration
Section 7 : Conclusion
Section 8 : Appendices
Section 9 : Acknowledgments
Section 10 : References
1 : Introduction
This section begins by describing the 3-D graphics rendering application domain, and
the characteristics of the associated algorithms which allow for parallel execution.
The industry standard OpenGL 3-D graphics rendering pipeline is analysed, where
special focus is given to the geometric transformations stage, which consists of a
pipeline of matrix-vector manipulation operations that implement a key part of
OpenGL’s operation in defining what a scene looks like in terms of the position and
orientation of the objects in it. The section then closes by looking at how Nested
Loop Programs such as the FIR filter can be transformed to exploit their inherent
parallelism by making it more explicit.
Section 2 then follows on from this and examines the different combinations of
transformations that can be applied to the FIR filter and the resulting implementation
tradeoffs. Section 3 then carries out a similar analysis on another prominent NLP in
graphics processing, the matrix-vector multiplication algorithm, where similarities
and differences with the FIR filter are analysed, and tradeoffs between the different
ways to optimise the algorithm’s implementation are discussed. Section 4 then details
the design of a FPU which can exploit the characteristics inherent in the matrix-vector
multiplication algorithm if they are exposed by transforming the code as discussed in
the previous section, by employing temporal parallelism.
Section 5 goes through the design process for an Optimal Controller for the FPU
using a typical matrix-vector multiplication algorithm found in the geometric
transformations stage of the OpenGL pipeline, and goes on to show how the
processor architecture as a whole was optimised for this particular section of the
OpenGL pipeline. Section 6 explains how the complete processor architecture was
built up from its constituent blocks in a hierarchical fashion, after they were tested
against their specification and integrated together into sub-systems, and ends by
describing the demonstration of OpenGL’s geometric transformations based on the
processor architecture developed. Section 7 finishes by drawing conclusions from the
project and putting the results achieved into context.
1.1 : Parallel Processing for Graphics Functions
Currently, multiprocessing is the technique used to carry out the significant arithmetic
processing required to implement the realistic rendering techniques used in advanced
3-D graphics applications. Most often this takes the form of clusters of commodity
PCs [1].
As discussed previously, graphics arithmetic largely consists of matrix and vector
manipulation, and thus lends itself to parallel processing because of the independence
of the individual operations required within the high-level matrix/vector function.
Most high level functions encountered in a Graphics environment are also ‘order-
independent’, as the order they are done in has no effect on the final display. Such
functions can thus be executed in parallel to achieve higher throughput, which is one
of the fundamental strengths of FPGAs. However, some 'sequential' functions will be
encountered that must be executed after all preceding and before all subsequent
functions, so if a parallel processing architecture is employed it must deal with such
sequential functions with minimal degradation to the overall system performance.
Another benefit of having a number of identical processors operating in parallel is
that the programming of each processor can be the same, and so this method of
obtaining high performance can also greatly simplify the software development
process.
1.2 : OpenGL
The Open Graphics Library (OpenGL) is the industry’s most widely used and
supported 2D and 3D graphics API [2]. As such there are thousands of applications
based on OpenGL that are used to render compelling 2D and 3D graphics, in markets
ranging from broadcasting, CAD/CAM, entertainment and cinematics to medical imaging
and virtual reality, the most famous of all being Pixar's RenderMan (used by the
movie industry to create special effects). Individual function calls in the OpenGL
environment can be executed on dedicated tuned hardware, run as a software routine
on the generic system CPU or implemented as a combination of both. As a result of
this implementation flexibility, OpenGL hardware acceleration can range from that
which simply renders 2-D lines and polygons to the more advanced floating-point
processor capable of transforming and computing geometric data.
1.2.1 : The OpenGL Pipeline
The two types of data that are input to the OpenGL pipeline are pixel data and
geometric data. Pixel data is the RGBA data associated with pixels on the screen, and
comes in the form of individual pixel colour values, images and bitmaps. Geometric
data is used to model objects (ranging in complexity from simple 2-D shapes to
realistic 3-D objects), and comes in the form of points, lines and polygons, which are
OpenGL’s three geometric primitives. All geometric data is eventually described as
vertices. The data associated with each vertex is its 3-D positional coordinate vector,
normal vector and material properties (used in lighting calculations), pixel colour
value (RGBA value or an index to a colour-map), and its texture coordinates (used to
map a texture onto the vertex’s parent object). Figure 1 below shows an overview of
the OpenGL pipeline in terms of its various stages and the order in which operations
occur.
Figure 1 : showing an overview of the OpenGL pipeline
As can be seen in figure 1 above, the vertex and pixel data are initially processed
differently before both being used in the rasterization stage. All data can be input to
the pipeline from the application and processed immediately, or saved in a display list
which sends the data to the pipeline when the list is executed.
In the per-vertex operations stage, the three vectors associated with each vertex
(spatial coordinates, normal and texture coordinates) are transformed (multiplied) by
the current modelview matrix, its inverse transpose and the current texture matrix
respectively. These transformations are carried out in order to transform the position,
orientation and size of the vertices’ parent objects in the scene. Lighting calculations
are then performed on each vertex using their transformed spatial coordinate and
normal vectors, material properties and the current lighting model. These calculations
act so as to scale each vertex’s pixel colour value to reflect the location and
orientation of its parent polygon relative to the light source(s).
The viewing volume of the scene is defined by six clipping planes. In the primitive
assembly stage the spatial coordinate vector of all vertices is transformed (multiplied)
by the projection matrix, so as to clip the scene against the six planes of the viewing
volume. Depending on the type of clipping employed, primitives may have vertices
rejected, modified or added. The programmer may also define additional clipping
planes to further restrict the viewing volume, in order to create cut-away views of
objects, and other similar effects. The equations that represent such additional
clipping planes are used to transform the spatial coordinate vectors before the
projection matrix.
The spatial coordinate vectors of all vertices then go through perspective division,
where primitives are scaled in size to reflect their distance from the viewer. This is
followed by the viewport transformation where the spatial coordinate vectors of all
vertices are transformed so as to map the 3-D scene onto the 2-D viewport (viewing
window) of the computer screen.
In the rasterization processing stage, all primitives (points, lines and polygons) are
rasterized into fragments, where the squares of the viewport’s integer pixel-grid that
are occupied by each primitive are determined. If enabled, advanced features to make
the rendered scene more realistic are also implemented in this stage. The most
commonly used of these is anti-aliasing, which is used to smooth jagged edges that
result from having to map non-vertical and non-horizontal lines to the square pixel-
grid of the viewport. Anti-aliasing calculates the portion of each square within the
pixel-grid that would be occupied by a line, if the line were to be drawn as originally
defined (before the viewport transformation), and this value is known as a pixel’s
coverage value. A pixel’s coverage value is used to scale the alpha component of its
RGBA colour value. Figure 2 illustrates how pixel coverage values are evaluated.
Figure 2 : showing an example to illustrate how a pixel’s coverage value is evaluated
With reference to figure 2 above, the green and orange show how a diagonal line
looks before and after it’s subjected to the viewport transformation respectively, and
the coverage values for the pixels occupied by the line are given on the right.
In the pixel operations stage, pixel data that is input to the pipeline is scaled, biased
and processed using a colour-map, after which the colour values are clamped to a
certain range. The resulting pixel data is then either rasterized into fragments or
written to texture memory for use in texture mapping. Data from the framebuffer can
be read back and placed in the host processor memory. Data for texture mapping can
be taken from either the host processor memory or the framebuffer.
If texturing is enabled, in the per-fragment operations processing stage, the texture
coordinate vector of each vertex (of a fragment’s primitive) is used to map the
vertices to specific points on a two-dimensional texture image, so that the texture
(known as the texel for that particular fragment) can be mapped onto the primitive
appropriately after its position and orientation are transformed in the per-vertex
operations stage. If enabled, other advanced features such as blending (used to create
a photo-realism effect) are also implemented in the per-fragment operations stage.
It’s also in this stage that the coverage values calculated in the rasterization stage are
applied, if anti-aliasing is enabled.
The final pixel values are then drawn (written) into the framebuffer.
1.2.2 : The Geometric Transformation Process
The overall transformation process for producing a scene for viewing is analogous to
that carried out by a camera when it’s used to take a photograph. This transformation
process is carried out on the spatial coordinate vector of each vertex in the OpenGL
pipeline. This process is depicted below in figure 3.
Figure 3 : showing the stages of transformation for spatial coordinate vectors
As can be seen in figure 3 above, a vertex’s spatial coordinate vector consists of not
only the vertex’s three-dimensional x, y, z coordinates, but also a w component which
is used in the perspective division stage. All three vectors of a vertex in OpenGL
have four elements, thus all matrices (that they are multiplied by) are 4x4.
1.2.2.1 : The ModelView Transformation
A vertex’s spatial coordinates are first presented to the pipeline as object coordinates.
In this form the spatial coordinates specify a vertex’s location in 3-D space when its
parent primitive is centred on the origin and oriented in such a way that makes it easy
for the programmer to visualise where its vertices are in 3-D space (i.e. with the
primitive’s edges parallel with and perpendicular to the axes). The modelview
transformation is the combination of the modelling transformation and the viewing
transformation, represented by the modelling and viewing matrices respectively. The
modelling transformation is always carried out on an object before the viewing
transformation, as by default the modelview matrix is formed by the viewing matrix
pre-multiplying the modelling matrix.
The modelling transformation positions an object at a particular location in 3-D space
relative to the origin (by performing translation), rotates the object relative to the
axes (by performing rotation) and scales the object in size (by performing scaling).
These three transformations are represented by their own matrices, and these are
depicted in figure 4 below. The order in which the modelling transformation carries
out these transformations is determined by the order in which their respective matrices
are multiplied together to form the modelling matrix. The transformation represented
by a post-multiplying matrix is carried out before that represented by the pre-
multiplying matrix, and this principle holds true in all instances where transformation
matrices are combined together through multiplication.
$$T = \begin{pmatrix} 1 & 0 & 0 & x \\ 0 & 1 & 0 & y \\ 0 & 0 & 1 & z \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Matrix T translates the object from being centred at the origin to the location defined by x, y, z; the object's orientation and size are maintained.

$$S = \begin{pmatrix} x & 0 & 0 & 0 \\ 0 & y & 0 & 0 \\ 0 & 0 & z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Matrix S scales the object such that it is stretched by a factor of x, y, z in the direction of the corresponding axes; the object's orientation and position are maintained.

$$R_x = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\Phi & -\sin\Phi & 0 \\ 0 & \sin\Phi & \cos\Phi & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad
R_y = \begin{pmatrix} \cos\Phi & 0 & \sin\Phi & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\Phi & 0 & \cos\Phi & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad
R_z = \begin{pmatrix} \cos\Phi & -\sin\Phi & 0 & 0 \\ \sin\Phi & \cos\Phi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Matrices Rx, Ry and Rz rotate the object clockwise by Φ degrees about the x, y and z axes respectively; rotation about more than one axis is achieved by multiplying the relevant R matrices together; the object's position and size are maintained.
Figure 4 : showing the translation, scaling and rotation matrices that are multiplied
together to form the modelling matrix
With reference to figure 4 above, it can be seen that the translation transformation on
a vertex is achieved by adding, to each of its x, y and z components, the product of its
w component and the corresponding element of the translation vector held in column 4
of the T matrix. The scaling transformation on a vertex is achieved by
multiplying each of its components by the corresponding component of the scaling
vector along the leading diagonal of the S matrix. The action of the rotation matrix is
rather more complex, although it can be seen from the R matrices above that the
vertex component associated with the axis being rotated about remains unchanged
after the transformation.
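To make the translation case concrete, the following minimal C sketch (the function and variable names are illustrative, not taken from the report) applies a 4x4 matrix to a homogeneous vertex; with T from figure 4 and w = 1, the fourth column is added to the vertex's position exactly as described above:

#include <stdio.h>

/* Illustrative sketch: multiply a 4x4 transformation matrix by a
   homogeneous vertex (x, y, z, w). */
static void transformVertex( const float m[4][4], const float in[4], float out[4] )
{
    for( int r = 0; r < 4; r++ )
    {
        out[r] = 0.0f;
        for( int c = 0; c < 4; c++ )
            out[r] += m[r][c] * in[c];   /* row r of the matrix dotted with the vertex */
    }
}

int main( void )
{
    /* T from figure 4, translating by (2, 3, 4) */
    const float T[4][4] = { { 1, 0, 0, 2 },
                            { 0, 1, 0, 3 },
                            { 0, 0, 1, 4 },
                            { 0, 0, 0, 1 } };
    const float vertex[4] = { 1.0f, 1.0f, 1.0f, 1.0f };   /* w = 1 */
    float result[4];

    transformVertex( T, vertex, result );
    printf( "(%g, %g, %g, %g)\n", result[0], result[1], result[2], result[3] );   /* prints (3, 4, 5, 1) */
    return 0;
}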
The viewing transformation is analogous to adjusting the camera location and the
direction in which it points when taking a photograph of the scene. This
transformation is comprised of a combination of translation (adjusting the viewpoint’s
location) and rotation (adjusting the viewpoint’s direction), and the associated
matrices are the same as those used to produce the modelling matrix (shown above in
figure 4). These two transformations within the viewing transformation (on the
viewpoint) have the exact reverse effect on the appearance of the scene to the
corresponding transformations within the modelling transformation (on the scene’s
objects). The default viewpoint is at the origin and points in the negative z-direction.
As objects are most often initially defined as being centred on the origin, the
modelview transformation as a whole must perform a transformation simply for the
objects in the scene to be visible, although any of the elementary transformations can
be omitted by simply not including their respective matrices in the product forming
the modelview matrix.
1.2.2.2 : The Projection Transformation
The eye coordinates resulting from the modelview transformation then go through the
projection transformation where they are converted to clip coordinates. The
projection transformation defines the viewing volume. The shape of the viewing
volume determines how objects are projected onto the screen and which objects (or
portions of objects) are clipped out of the final scene. The most common type of
projection used is perspective projection which employs a frustum shaped viewing
volume, as illustrated below in figure 5.
Figure 5 : showing the frustum shaped viewing volume employed by perspective
projection
Perspective projection implements foreshortening, whereby the further away from the
viewpoint an object is, the smaller it appears in the scene, thus emulating the way the
human eye (or a camera) works. The projection matrix that represents the
(perspective) projection transformation is depicted below in figure 6.
$$P = \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & \frac{-(f+n)}{f-n} & \frac{-2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{pmatrix}$$

where n : near, f : far, l : left, r : right, b : bottom, t : top.

Matrix P clips the objects against the six planes of the viewing volume; the w component of each vertex is set to -z (the distance of the vertex from the origin, in a direction away from the viewpoint).
Figure 6 : showing the projection matrix for perspective projection
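As an illustration, the following C sketch (the function name and parameter ordering are assumptions for this document, not from the report) builds the P matrix of figure 6 from the six frustum parameters:

/* Sketch: constructing the perspective projection matrix P of figure 6
   (row-major), from the six frustum parameters. */
void frustumMatrix( float P[4][4], float l, float r, float b, float t, float n, float f )
{
    /* zero everything, then fill in the non-zero entries */
    for( int i = 0; i < 4; i++ )
        for( int j = 0; j < 4; j++ )
            P[i][j] = 0.0f;

    P[0][0] =  2.0f * n / ( r - l );
    P[0][2] =  ( r + l ) / ( r - l );
    P[1][1] =  2.0f * n / ( t - b );
    P[1][2] =  ( t + b ) / ( t - b );
    P[2][2] = -( f + n ) / ( f - n );
    P[2][3] = -2.0f * f * n / ( f - n );
    P[3][2] = -1.0f;                      /* sets each transformed w to -z */
}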
1.2.2.3 : Perspective Division and the ViewPort Transformation
The clip coordinates resulting from the projection transformation are then converted
to normalized device coordinates through the process of perspective division, where
the x, y, and z coordinates of each vertex are divided by its w component (which is set
to –z in the projection transformation as described previously in figure 6). This scales
the objects down in size to implement foreshortening as discussed previously in
regards to perspective projection. After the perspective division stage the four
element spatial coordinate vector of each vertex becomes a three element vector as
the w component is discarded.
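A minimal sketch of this step, assuming clip coordinates (xc, yc, zc, wc) as inputs (the names are illustrative):

/* Perspective division: clip coordinates -> normalized device coordinates.
   wc was set to -zc by the projection matrix, so dividing by it scales
   distant vertices down, implementing foreshortening. */
void perspectiveDivide( float xc, float yc, float zc, float wc,
                        float *xnd, float *ynd, float *znd )
{
    *xnd = xc / wc;
    *ynd = yc / wc;
    *znd = zc / wc;   /* the w component is discarded after this stage */
}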
The last stage of the process is the viewport transformation which maps the three-
dimensional normalized device coordinates to the two-dimensional viewport,
converting them to window coordinates. The equations that perform the viewport
transformation are shown below in figure 7.
$$x_w = (x_{nd} + 1) \times \frac{width}{2} + x_o$$
$$y_w = (y_{nd} + 1) \times \frac{height}{2} + y_o$$
Figure 7 : showing the equations that perform the viewport transformation
With reference to figure 7 above, the (xo, yo) coordinate is the position of the bottom
left hand corner of the viewport (viewing window) on the screen, relative to the
corresponding corner of the screen. The width and height parameters are the
dimensions of the viewport (in pixels).
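In code form, the mapping of figure 7 is simply the following (a sketch; the parameter names follow the figure and the description above):

/* Viewport transformation: normalized device coordinates -> window coordinates.
   (xo, yo) is the bottom-left corner of the viewport; width and height are its
   dimensions in pixels. */
void viewportTransform( float xnd, float ynd,
                        float xo, float yo, float width, float height,
                        float *xw, float *yw )
{
    *xw = ( xnd + 1.0f ) * ( width  / 2.0f ) + xo;
    *yw = ( ynd + 1.0f ) * ( height / 2.0f ) + yo;
}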
1.3 : Overview of Transformations on Nested Loop Programs
A key problem in parallel processing is the way in which the program for each
processor is generated such that the overall program executes efficiently.
Most graphics functions are described as a set of Nested for-loops [3].
The Nested-Loop Program (NLP) form of an algorithm represents its sequential
model of computation. The FIR filter operation is a simple example of a NLP and
this is shown below in figure 8 for calculating four output values.
for j = N:N+3
    for i = 0:N-1
        y[j] = y[j] + ( x[j-i] * h[i] )
    end
end
Figure 8 : showing the NLP form of the FIR filter operation
This sequential model of computation of an algorithm is representative of the way in
which the algorithm would be implemented as Software running on a standard single
thread processor.
The algorithmic transformation of unrolling transforms an NLP such that its task-level
parallelism is enhanced. As a result, tasks that are independent of each other are
made explicit, while the resulting representation of the algorithm remains functionally
equivalent to the original. The independent tasks can then be mapped to separate
processors.
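For instance, a sketch of the FIR filter NLP of figure 8 with its outer j-loop unrolled by a factor of two might look as follows (plain C, written here purely for illustration); the two accumulations inside each iteration are independent and could be mapped to separate processors:

for( j = N; j < N + 4; j += 2 )
{
    float y0 = 0.0f;                      /* task A: accumulates y[j]   */
    float y1 = 0.0f;                      /* task B: accumulates y[j+1] */

    for( i = 0; i < N; i++ )
    {
        y0 += x[ j       - i ] * h[i];
        y1 += x[ (j + 1) - i ] * h[i];
    }

    y[j]     = y0;
    y[j + 1] = y1;
}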
The algorithmic transformation of skewing makes the dependences of operations less
demanding, and thus allows for latency in the physical operators that will actually
execute them.
The dependency graph is a useful way of visualising the dependences between the
operations in a NLP, and represents an intermediate level of abstraction between the
NLP and its implementation. An example of a dependency graph is shown below in figure 9 for the
FIR filter NLP of figure 8, where the outer j loop has been unrolled by a factor of two,
and N = 4. Data dependences between tasks (represented as circles) are shown by
one task forwarding data to another. The two independent tasks are highlighted in
different colours.
Figure 9 : Showing the effect of unrolling the outer loop of the FIR filter NLP by a
factor of 2
When the outer loop is unrolled by a factor of two, this is essentially the same as
making two copies of it that can be executed in parallel. As there are two copies, in
the implementation of this modified NLP each copy of the loop will need its own set
of registers for storing coefficient and data values. The unrolling transformation thus
translates to spatial parallelism being employed in the implementation.
The skewing transformation, however, translates to temporal parallelism (pipelining)
being employed in the implementation; as such, a single set of registers is shared
by the different iterations of the inner loop, whose executions are overlapped.
These two transformations can be used to transform a sequential model of
computation into something that is closer to a data-flow model, thus making it more
suitable for efficient implementation (exploiting parallelism) on hardware.
Section 2 : FIR Filter Analysis
The FIR filter operation essentially carries out a vector dot-product in calculating each
value of y[n]. This is illustrated below in figure 10 for N = 4.
[Figure: the coefficient vector (h[0] … h[3]) multiplied element-wise with the sample vector (x[n] … x[n-3]), giving x[n]·h[0] + x[n-1]·h[1] + x[n-2]·h[2] + x[n-3]·h[3]]
Figure 10 : showing how the FIR filter operation is comprised of vector dot-products
Section 2.1 : The Single MACC FIR Filter
The Single MACC FIR filter (shown below in figure 11) is an implementation of the
FIR filter’s sequential model of computation and as its name suggests it is based on a
single MACC (multiply-accumulate) unit. As such, the algorithmic description of this
implementation is identical to that of the NLP description of the FIR filter (shown
below in figure 12) without applying any unrolling or skewing transformations
(which were discussed earlier in section 1.3). For simplicity it is assumed throughout
this section unless otherwise stated that all references to MACC units refer to non-
pipelined MACCs with a total latency of 1 clock cycle.
Figure 11 : showing an example H/W implementation of the Single MACC FIR filter
[4]
The primary trade-off between sequential and parallel implementations of the same
algorithm is the amount of hardware resources required versus the throughput
achieved. As the Single MACC FIR filter implements the FIR filter function in a
completely sequential manner, the required hardware resources are reduced by a
factor of N, although so too is the throughput as compared to a fully parallel
implementation that would use one MACC unit for each of the N coefficients (where
the N MACCs would be cascaded).
void singleMaccFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y )
{
    int i, j;             // 'j' is the outer-loop counter and 'i' is the inner-loop index
    float y_accum;        // output sample is accumulated into 'y_accum'
    const float *k;       // pointer to the required input sample

    for( j = 0; j < num_samples; j++ )
    {
        k = x++;          // x points to x[n+j] and is incremented (post assignment) to point to
                          // x[(n+j)+1]
        y_accum = 0.0f;

        for( i = 0; i < num_taps; i++ )
        {
            y_accum += h[i] * *(k--);   // y[n+j] += h[i] * x[(n+j) - i]
        }

        *y++ = y_accum;   // y points to the register address where y[n+j] is to be written and is
                          // incremented (post assignment) to point to the register address where the
                          // next output sample y[(n+j)+1] is to be written
    }
}
Figure 12 : A code description of the Single MACC FIR filter
With reference to the code description of the Single MACC FIR filter shown above in
figure 12, all of the required input samples are assumed to be stored in the register file
(with a stride of 1) of the processor (a single MACC unit in this case) executing the
code, with x initially pointing to the input sample x[n] corresponding to the first
output sample to be calculated y[n]. It is also assumed that all of the required
coefficients are stored in the same way in a group of registers used to store the h[ ]
array.
As can be seen from figure 12 above, and more clearly from the dependency graph of
the Single MACC FIR filter (shown below in figure 13), this implementation
evaluates (accumulates) only one output value at a time. Assuming that each
multiplication of x[(n+j)-i] and h[i] takes one clock cycle, then the performance of
this implementation is given by the following equation:
Throughput = Clock frequency ÷ Number of coefficients

For example, a filter with 4 coefficients clocked at 100 MHz would produce 25 million output samples per second.
Figure 13 : dependency graph showing the operation of the Single MACC FIR filter
If the coefficients of the filter possess symmetry (i.e. h[0] = h[N-1], h[1] = h[N-2], etc.)
a doubling of throughput can be achieved at the same clock frequency using a
variation of the Single MACC FIR filter called the Symmetric MACC FIR filter. This
implementation uses a single coefficient in place of each pair of equal coefficients,
and as such only one multiplication by that coefficient is required, although the other
multiplicand is now the sum of the two data values corresponding to those equal
coefficients. Thus, the cost of this performance enhancement is another adder at the
input of the multiplier, as well as another RAM block (or one dual-port block RAM),
as the two input samples corresponding to the single coefficient need to be fetched
simultaneously. As the number of coefficients is halved, so too is the amount of
storage required for them; if N is odd, the Symmetric MACC FIR filter reduces the
number of stored coefficients to (N + 1)/2. The Symmetric MACC FIR is derived from
the Single MACC FIR by unrolling its inner-loop by a factor of two and reversing the
order in which one of the loops processes its respective coefficient-data pairs. This is
a unique example of employing spatial parallelism.
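A sketch of the resulting inner loop for even N (assuming h[i] == h[N-1-i], with the variable names carried over from figure 12) is:

/* Symmetric MACC FIR inner loop (even num_taps): each stored coefficient
   multiplies the sum of the two input samples that share it, so the
   pre-adder halves the number of multiplications. */
y_accum = 0.0f;
for( i = 0; i < num_taps / 2; i++ )
{
    y_accum += h[i] * ( x[n - i] + x[n - (num_taps - 1 - i)] );
}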
Employing spatial parallelism is one way to enhance the performance of the Single
MACC FIR filter, and essentially uses more than one MACC unit to evaluate each
output sample. As a result each MACC unit evaluates an equal share of the
coefficient-data sample multiplications, and as such if M MACC units are employed,
the throughput is increased by a factor of M over the Single MACC FIR filter,
although so too are the required hardware resources.
Section 2.2 : The Transposed FIR Filter
The Transposed FIR filter (an example H/W implementation of which is shown below
in figure 14) is a fully parallel implementation, as one MACC unit is used for each of
the N coefficients. Unlike the Direct form type I fully parallel implementation (which
employs an adder tree structure), the Transposed FIR filter employs an adder chain;
as such, the MACC units are much easier to connect together and the implementation
can easily be scaled up or down in terms of N. With regard to targeting the design to
the Xilinx Virtex-4 FPGA, this adder-chain structure allows the Transposed
implementation to be entirely contained within the dedicated Xilinx DSP48 slices as
opposed to using generic FPGA fabric, which would yield a less efficient mapping.
Figure 14 : showing an example H/W implementation of the Transposed FIR filter [4]
The input data samples are broadcast to all MACC units simultaneously, and with this
implementation the coefficients are assigned (in ascending order) starting from the
right-most MACC unit (from which the final output is taken). As well as the spatial
parallelism through the use of N MACC units, temporal parallelism is also employed
as the evaluation of successive output samples is overlapped, although this doesn't
increase throughput or decrease the latency before the first output sample appears
relative to the Direct form type I implementation.
A code description for the Transposed FIR filter (with the number of taps N = 4) is
given in figure A1 of appendix A.
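Since appendix A is not reproduced here, the following behavioural C sketch (an assumed rendering of the structure, not the report's own figure A1 code) captures the N = 4 Transposed update: each new input sample is broadcast to every tap, and the partial sums shift one stage towards the output on each cycle:

/* Behavioural model of the N = 4 Transposed FIR: s[] are the adder-chain
   registers; the right-most MACC (coefficient h[0]) produces the output. */
float s[3] = { 0.0f, 0.0f, 0.0f };

for( int n = 0; n < num_samples; n++ )
{
    float xin = x[n];             /* input broadcast to all four MACC units */

    y[n] = s[0] + h[0] * xin;
    s[0] = s[1] + h[1] * xin;
    s[1] = s[2] + h[2] * xin;
    s[2] =        h[3] * xin;     /* left-most stage has no incoming partial sum */
}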
The Transposed FIR filter design is yielded by a complete unrolling of the inner-loop
(i-loop) of the original Single MACC FIR filter code description, which results in the
number of MACC units required increasing to N. A skewing of the outer-loop (j-
loop) by a factor of 1 is also performed, which results in the temporal overlapping of
successive output sample calculations. This skewing is required to schedule apart the
dependences that arise because the N MACC operations within any single iteration of
the outer-loop are dependent on the MACC in the previous iteration of the inner-loop
(for their third argument).
The dependency graph of the Transposed FIR filter’s operation is shown below (with
N = 4) in figure 15.
Figure 15 : dependency graph showing the operation of the Transposed FIR filter
(with N = 4)
As can be seen above in figure 15, the initial latency before the first output sample
emerges is the same as that seen with the fully-parallel FIR filter. Once this initial
latency (of the spin-up procedure, whereby the pipeline is filled over the first N
cycles) has been endured, the throughput yielded by the Transposed FIR filter
implementation is the same as that of the Direct form type I implementation (equal to
the clock frequency). The latency between when each input sample is applied and the
emergence of the corresponding output sample is also the same as that seen with the
Direct form type I implementation, and is equal to the latency of a single MACC unit.
Section 2.3 : The Systolic FIR Filter
As with the Transposed FIR filter, the Systolic FIR filter (an example H/W
implementation of which is shown below in figure 16) is also a fully parallel
implementation, and also uses an adder-chain to accumulate each value of y[n].
Figure 16 : showing an example H/W implementation of the Systolic FIR filter [4]
The Systolic FIR filter also employs temporal parallelism in addition to spatial
parallelism in the same way that the Transposed FIR filter does. However the
Systolic FIR filter’s coefficients are assigned (in ascending order) starting from the
left-most MACC unit (to which the input samples are applied), which is the opposite
way to how the coefficients are assigned to the Transposed FIR filter’s MACC units.
As such, the Systolic FIR filter evaluates the inner-products (of which each value of
y[n] consists) in the reverse order to the Transposed FIR filter.
The input data samples are fed into a cascade of registers which have the effect of a
data buffer. The Systolic FIR filter differs from the Direct form type I
implementation not only with its use of an adder-chain to accumulate each value of
y[n], but also with the additional register between each of the taps. A code
description of the Systolic FIR filter (with the number of taps N = 4) is given in figure
A2 of appendix A.
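Again as a stand-in for the appendix listing, here is a cycle-level behavioural C sketch (an assumption about the structure, not the report's figure A2; the array names are illustrative). In this simplified model the output emerges N - 1 cycles after the corresponding input, whereas the report counts N cycles for the hardware, where the multiplier inputs are also registered:

/* Behavioural model of the N = 4 Systolic FIR: xd[] is the tapped delay line
   (two registers between MACC units) and p[] holds the registered partial
   sums of the adder chain. */
float xd[8] = { 0 };   /* delay-line registers; xd[0] is nearest the input */
float p[4]  = { 0 };   /* partial-sum register after each MACC unit */

for( int n = 0; n < num_samples; n++ )
{
    /* evaluate this cycle's adder chain from the current register state */
    float p_next[4];
    p_next[0] = h[0] * x[n];
    for( int k = 1; k < 4; k++ )
        p_next[k] = p[k - 1] + h[k] * xd[2 * k - 1];

    /* clock edge: shift the delay line and latch the new partial sums */
    for( int i = 7; i > 0; i-- )
        xd[i] = xd[i - 1];
    xd[0] = x[n];
    for( int k = 0; k < 4; k++ )
        p[k] = p_next[k];

    y_delayed[n] = p[3];   /* equals y[n - 3] once the pipeline has filled */
}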
As with the Transposed FIR filter, the Systolic FIR filter is yielded by a complete
unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter, and a
skewing of its outer-loop by a factor of 1. However, in generating the Systolic FIR
filter from the Single MACC FIR, the outer-loop is skewed in the opposite direction to
how it is skewed in generating the Transposed FIR filter. This means that the inner-
products which are summed together to produce an output sample are evaluated in the
opposite order to that in which they’re evaluated by the Transposed FIR filter. This
difference is reflected in the two H/W implementations as the Transposed FIR
implementation employs a broadcast structure to feed the same input sample to each
of its MACC units on each clock cycle, whereas the Systolic implementation employs
a tapped delay line with a delay of two clock cycles between each MACC unit. The
dependency graph of the Systolic FIR filter’s operation is shown below in figure 17
(with N = 4).
Figure 17 : dependency graph showing the operation of the Systolic FIR filter
(with N = 4)
The green arrows represent the forwarding of a partial accumulation of an output
sample through the adder-chain, whilst the blue arrows represent the effect of the two
registers between each of the MACC units: the forwarding of input samples between
successive MACC units.
As with the Transposed FIR filter, the initial latency before the first output sample
emerges, and the throughput thereafter, are the same as those seen with the Direct
form type I implementation. However, because the Systolic FIR filter evaluates and
accumulates the inner-products in the opposite order to the Transposed FIR filter, the
latency between each input sample being applied to the filter and the corresponding
output sample emerging is N clock cycles (assuming the latency of each MACC unit is
1 cycle). This latency is thus N times that seen with both the Transposed and Direct
form type I implementations. However, the advantage that the Systolic FIR filter holds
over the Transposed implementation is that its input is only applied to one MACC
unit, unlike the Transposed FIR filter, whose input is broadcast to all of its MACC
units and thus has a high fan-out. The Systolic implementation is therefore more
suitable than the Transposed implementation for higher values of N.
Section 2.4 : The Semi-Parallel FIR Filter
The Semi-Parallel FIR filter (sometimes called the Hardware-folded implementation)
divides its N coefficients amongst M multiply-add units. An example implementation
of the Semi-Parallel FIR filter is shown below in figure 18 (with N = 16, M = 4).
Figure 18 : showing an example H/W implementation of the Semi-Parallel FIR filter
(with N = 16, M = 4) [4]
Each group of N/M coefficients is assigned to one of the MACC units and stored in
order within the associated coefficient-memory. The first group (coefficients 0 to
(N/M – 1)) is assigned to the left-most MACC unit (to which the input samples are
applied), with ascending coefficient groups being assigned to the MACC units from
left to right. If N is not exactly integer-divisible by M, then the higher-order
coefficient-memories are padded with zeros.
Like the Transposed and Systolic implementations, the Semi-Parallel FIR filter
employs both spatial and temporal parallelism, but the degree to which it does this
depends on the ratio of M:N, with a higher M:N ratio resulting in a higher degree of
both spatial parallelism (as more MACC units are used) and temporal parallelism (as
each output sample is evaluated more quickly and thus the evaluation of more output
samples can be overlapped in time). Thus the trade off is performance obtained
versus the resources required, as can be seen by the equation for the performance of a
Semi-parallel FIR filter implementation:
Throughput = ( Clock frequency ÷ N ) × M

For the N = 16, M = 4 example above, this gives one output sample every N ÷ M = 4 clock cycles.
The Semi-parallel implementation may be extrapolated either towards being a fully-
parallel implementation like the Transposed and Systolic implementations by using
more MACC units, or the other way towards being a Single MACC FIR filter by
using fewer MACC units. A code description of the Semi-parallel FIR filter (with N
= 16, M = 4) is given in figure A3 of appendix A. The dependency graph of the
Semi-parallel FIR filter’s operation (with N = 16, M = 4) is shown below in figure 19.
Figure 19 : dependency graph showing the operation of the Semi-Parallel FIR filter
The red circles represent MACC units being used to calculate the inner-products of
y[n], and the dark-red circles represent the output-accumulator being used to
accumulate the inner-products of y[n]. The blue and dark-blue circles represent the
same for y[n-1], whilst the yellow circles represent MACC units being used to
calculate the inner-products of y[n+1]. As can be seen above in figure 19, the address
(which lies in the range [0:((N/M) – 1)] for all MACC units) applied to the data-
buffer and coefficient memory of each MACC unit lags one behind the corresponding
address of the immediately preceding (to the immediate left) MACC unit, and all such
addresses are continuously and monotonically cycling from 0 to ((N/M) – 1). This is
necessary in order to employ temporal parallelism by overlapping (in time) the
evaluation of successive output samples in the way shown in figure 19 above. This
temporal parallelism is in turn necessary to achieve the Semi-Parallel
implementation’s maximum throughput of one output sample every N/M clock cycles,
because once an output sample has been retrieved from the accumulator by the
capture register, the accumulator must be reset (to either
zero or its input value). For its input value to be the first sum of M inner-products of
the next output sample to be evaluated, then the evaluation of this sum needs to have
finished in the previous clock cycle, otherwise the accumulator will have to be reset to
zero at the start of a new result-cycle (in which an output sample is accumulated). If
the accumulator was set to zero between result-cycles in this way, one extra clock
cycle would be required for the evaluation of each output sample, thus degrading
performance.
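To summarise the coefficient/unit assignment, here is a purely behavioural C sketch (written for illustration only; it models the arithmetic split, not the address skew or the accumulator-reset timing just described):

/* Semi-Parallel FIR, N = 16 coefficients over M = 4 units: unit u handles
   coefficients u*(N/M) .. u*(N/M) + N/M - 1, and one output sample is
   completed every N/M = 4 steps. */
float y_accum = 0.0f;

for( int step = 0; step < N / M; step++ )        /* one step per clock cycle */
{
    for( int unit = 0; unit < M; unit++ )        /* these M MACCs run in parallel */
    {
        int i = unit * ( N / M ) + step;         /* coefficient index for this unit */
        y_accum += h[i] * x[n - i];
    }
}
y[n] = y_accum;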
Section 3 : Matrix-Vector Multiplication Analysis
Matrix-vector multiplication is essentially a series of vector dot-products, as element
(r,1) of the resultant vector is the dot-product of row r of the matrix with the
multiplicand column vector. This is illustrated below in figure 20, which
shows how the (4x1) resultant column vector R is formed from the multiplication of
the (4x4) matrix M and the (4x1) column vector V.
$$\begin{pmatrix} R(1) \\ R(2) \\ R(3) \\ R(4) \end{pmatrix} =
\begin{pmatrix} M(1,1) & M(1,2) & M(1,3) & M(1,4) \\ M(2,1) & M(2,2) & M(2,3) & M(2,4) \\ M(3,1) & M(3,2) & M(3,3) & M(3,4) \\ M(4,1) & M(4,2) & M(4,3) & M(4,4) \end{pmatrix}
\times
\begin{pmatrix} V(1) \\ V(2) \\ V(3) \\ V(4) \end{pmatrix}$$

where $R(r) = M(r,1) \cdot V(1) + M(r,2) \cdot V(2) + M(r,3) \cdot V(3) + M(r,4) \cdot V(4)$.
Figure 20 : showing how matrix-vector multiplication is comprised of a series of
vector dot-products
Section 3.1 : The Sequential Model of Computation
Matrix-vector multiplication is related to the FIR filter, as both algorithms consist of a
series of vector dot-products. Considering both algorithms in their sequential form,
their outer for-loop is essentially 'for the number of vector dot-products required',
and their inner for-loop is essentially 'for the number of vector/matrix-row element
pairs'. Figures 21 and 22 show a code description and dependency graph of the
matrix-vector multiplication problem's sequential model of computation respectively.
For simplicity it is assumed throughout this section, unless otherwise stated, that all
references to MACC units refer to non-pipelined MACCs with a total latency of 1
clock cycle.
void sequentialMatrixVectorMultiply( int num_matrix_rows, int num_matrix_cols,
                                     const float *m, const float *v, float *r )
{
    int row, col;

    for( row = 0; row < num_matrix_rows; row++ )
    {
        r[row] = 0.0f;    // clear the accumulator for this element of the resultant vector

        for( col = 0; col < num_matrix_cols; col++ )
        {
            r[row] += m[ row * num_matrix_cols + col ] * v[col];    // matrix is processed in ROW-MAJOR order
        }
    }
}
Figure 21 : showing the code description of the sequential model of computation of
the matrix-vector multiplication algorithm
Figure 22 : showing the dependency graph of the matrix-vector multiplication
algorithm’s sequential model of computation (for a 4x4 matrix and 4x1 vectors)
However, as can be seen from figure 20 above, in matrix-vector multiplication each
element of the matrix is a multiplicand of strictly one inner-product in one vector dot-
product, and thus with reference to the program of figure 21, each element of the
matrix is strictly a multiplicand of one MACC operation in one specific iteration of
the inner-loop within one specific iteration of the outer-loop. This is in contrast to the
sequential (Single MACC) FIR filter algorithm where each input sample is a
multiplicand of one inner-product in N successive vector dot-products.
The multiplicand column-vector of a matrix-vector multiplication is analogous to the
coefficient vector used in the FIR filter algorithm, as all vector dot-products
performed by both algorithms multiply these vectors by another vector.
Section 3.2 : Exploiting the Inherent Parallelism
The Transposed and Systolic FIR filter implementations discussed previously in
sections 2.2 and 2.3 respectively, were formed by completely unrolling the inner-loop
and then skewing the outer-loop (by a factor of 1 in opposite directions) of the original
sequential FIR filter code. With reference to figure 21 above, if the inner-loop of the
matrix-vector multiplication algorithm is unrolled by any factor, then (as was
previously discussed in section 2.2 with regard to the FIR filter algorithm) the
outer-loop has to be skewed for the MACC operations scheduled for simultaneous
execution (by different MACC units) to be independent of one another. However, as
already discussed, each matrix element is used as a multiplicand only once throughout
the execution of the entire algorithm, and thus unlike with the FIR filter algorithm,
the direction in which the outer-loop is skewed essentially makes no difference, as
each MACC unit would still have to access a separate matrix element at the start of
each clock cycle. Thus, unlike the Transposed FIR filter, the access of each matrix
element (analogous to each input sample of the FIR filter) cannot be shared among all
MACC units. Similarly, unlike the Systolic FIR filter, there is no sense in feeding the
matrix elements through a tapped delay line in order to amortise the overhead of
accessing them.
Section 3.2.1 : Unrolling the Inner-Loop
Figure 23 below shows the dependency diagram of the matrix-vector multiplication
code (in its sequential form) of figure 21 after its inner-loop has been completely
unrolled (with each iteration executed on a separate MACC unit) and its outer-loop
has subsequently been skewed by a factor of +1 in a way analogous to that used in
creating the Systolic FIR filter. With this series of transformations, each MACC unit
employed effectively processes one column of the matrix.
Figure 23 : showing the dependency diagram of the matrix-vector multiplication
algorithm’s sequential model of computation after completely unrolling its inner-loop
and skewing its outer-loop by a factor of +1
With reference to figure 23 above, the magenta and purple circles represent MACC
units used during the spin-up and spin-down procedures respectively, whilst the red
circles represent MACC units used during the steady-state. As can be seen from
figure 23, execution is only in the steady-state for one clock cycle. In order to achieve
better utilisation of the MACC units employed, several such matrix-vector
multiplication problems could be carried out in succession to amortise the overhead of
the spin-up and spin-down procedures. By doing this, the total execution of all the
problems would be approximately four times faster than executing them on a single
MACC unit, as is done for the matrix-vector multiplication problem's sequential
model of computation detailed previously in figures 21 and 22.
Alternatively, the inner-loop could be unrolled by a factor of only two (thus employing
only two MACC units), where the sum of the first two inner-products of each vector
dot-product would need to be stored as an intermediate value in a register file. This
implementation of the matrix-vector multiplication algorithm would be approximately
twice as fast as the implementation of its sequential model of computation, and
without as much need to chain together several problems to amortise the spin-up and
spin-down latency.
An advantage of this implementation (regardless of the factor the inner-loop is
unrolled by) is that each MACC unit uses the same vector-element throughout
processing its respective column of the matrix, and thus the fetch operation for each
vector element is amortised over the execution of the entire problem.
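As a purely dataflow-level sketch (ignoring the skewed scheduling shown in figure 23, and using the flattened indexing of the corrected figure 21 code), fully unrolling the inner loop for the 4x4 case exposes the four multiplications of each dot-product explicitly:

/* Inner (col) loop of figure 21 fully unrolled for a 4x4 matrix: the four
   products of each row could be evaluated by four MACC units, one per
   matrix column, each reusing its own v[] element for the whole problem. */
for( row = 0; row < 4; row++ )
{
    r[row] = m[row * 4 + 0] * v[0]      /* MACC unit 1, column 1 */
           + m[row * 4 + 1] * v[1]      /* MACC unit 2, column 2 */
           + m[row * 4 + 2] * v[2]      /* MACC unit 3, column 3 */
           + m[row * 4 + 3] * v[3];     /* MACC unit 4, column 4 */
}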
Section 3.2.2 : Unrolling the Outer-Loop
Figure 24 below shows the dependency diagram of the implementation that results if
instead the outer-loop is completely unrolled, and thus each MACC unit employed
processes one row of the matrix.
Figure 24 : showing the dependency diagram of the matrix-vector multiplication
algorithm’s sequential model of computation after its outer-loop is completely
unrolled
As can be seen from figure 24 above, there are no dependencies across separate
iterations of the outer-loop, so after this unrolling there is no need to skew any
instance of the inner-loop. Execution is therefore always in the steady state (as only
spatial parallelism is employed), meaning that all MACC units are always utilised
during execution. This implementation of the matrix-vector multiplication algorithm
during execution. This implementation of the matrix-vector multiplication algorithm
would be four times faster than the implementation of its sequential model of
computation, and there is no need to amortise any spin-up and spin-down latency over
the execution of multiple problems as was the case for the implementation that results
from unrolling the inner-loop. Another advantage of this implementation is that each
vector-element is fetched only once (as is also the case when the inner-loop is
unrolled) where the MACC units employed always share the same vector-element
argument.
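Equivalently, in code form (again a sketch using the flattened indexing of figure 21), fully unrolling the outer loop gives four independent accumulation chains that share each fetched v[col]:

/* Outer (row) loop of figure 21 fully unrolled for a 4x4 matrix: each
   statement is an independent accumulation, so each can run on its own
   MACC unit; r[] is assumed to start at zero. */
for( col = 0; col < 4; col++ )
{
    r[0] += m[0 * 4 + col] * v[col];    /* MACC unit 1 processes row 1 */
    r[1] += m[1 * 4 + col] * v[col];    /* MACC unit 2 processes row 2 */
    r[2] += m[2 * 4 + col] * v[col];    /* MACC unit 3 processes row 3 */
    r[3] += m[3 * 4 + col] * v[col];    /* MACC unit 4 processes row 4 */
}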
3.3 : Using Pipelined MACC Units
Until now, the MACC units discussed have been non-pipelined, where new operands
are only issued to such a unit after it has finished processing its previous operands.
The Transposed, Systolic and Semi-Parallel FIR filter implementations discussed
previously in section 2 used a pipeline of non-pipelined MACC units. As was
discussed in section 2, pipelining temporally overlaps (skews) multiple execution
threads, and thus once the initial latency (whilst the pipeline is filled) has been
endured, the subsequent throughput achievable is n times greater (where n is the
degree of pipelining employed, and the pipeline is balanced). This results from the
non-pipelined execution unit being segmented into n stages, where each stage
contributes an nth of the overall latency, and is thus able to be clocked at n times the
rate of the non-pipelined equivalent. Thus if a single MACC unit was pipelined by a
degree of n (and clocked at n times the clock frequency), once the initial latency of
filling the pipeline had been endured, the subsequent throughput would be n times
greater than that possible with the non-pipelined version. For simplicity, every
instance of the word pipeline throughout this document will refer to a balanced
pipeline, unless otherwise stated. The benefit of a pipelined MACC unit over its non-
pipelined equivalent is depicted below in figure 25.
Figure 25 : showing an example to illustrate the benefit of a pipelined MACC over its
non-pipelined equivalent (with n = 4)
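To quantify this benefit (a worked example under the balanced-pipeline assumption stated above): executing k independent MACC operations on the non-pipelined unit takes k of its slow clock cycles, which is equivalent to n x k cycles of the faster clock, whereas the pipelined unit takes n + (k - 1) of those faster cycles. For the example of figure 25, where n = 4 and k = 9, this gives 36 cycles against the 4 + 8 = 12 cycles visible in the diagram, with the speed-up tending to n as k grows.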
3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit
As discussed previously in section 2.2, the series of MACC operations that a specific
vector dot-product consists of have data-dependencies. Thus if the matrix-vector
multiplication problem's sequential model of computation (shown previously in
figure 21) was executed on a pipelined MACC (consisting of a pipelined multiplier
followed by a pipelined adder), the achievable throughput would not be any higher
than that with the non-pipelined version of the MACC. This is illustrated below in
figure 26. When executing the sequential model of computation, the pipelined
MACC effectively skews the outer-loop.
[Timing diagram: over clock cycles 0 to 15, both the non-pipelined and the pipelined MACC complete the four dependent operations at the same rate, one every four cycles, because each operation must wait for its predecessor's result. In both cases MACC0 is R(1) += M(1, 1) * V(1), MACC1 is R(1) += M(1, 2) * V(2), MACC2 is R(1) += M(1, 3) * V(3) and MACC3 is R(1) += M(1, 4) * V(4), where R(1) begins as zero.]
Figure 26 : showing an example to illustrate the effect of issuing successive MACC
instructions (that are dependent) to a pipelined MACC unit
For simplicity, it is assumed that all three arguments are supplied as part of a MACC
instruction when it is issued to a MACC unit.
The code description of the matrix-vector multiplication algorithm shown below in
figure 27 is a re-write of the sequential code shown previously in figure 21. The outer
and inner loops have been swapped around, which thus requires the matrix to be
processed in column-major order (as opposed to row-major order as was the case for
the sequential code). The reason the two loops have been swapped around is so that
dependent MACC operations are scheduled as far apart as is possible, and this is
illustrated in the dependency diagram of this code which is shown below in figure 28.
void pipelinedMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols, const float **m, const float *v,
                                   float *r)
{
    int row, col;
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r[row] += m[row][col] * v[col]; // matrix is processed in COLUMN-MAJOR order
        }
    }
}
Figure 27 : showing the code description of the sequential matrix-vector
multiplication algorithm re-written for execution on a single pipelined MACC unit
[Dependency diagram: all sixteen MACC operations execute on the single pipelined unit MACC_1 in column-major order, so the four accumulations contributing to each R(i) are interleaved with those of the other three rows; starting from R(i) = 0, R(1) = V(1).M(1, 1) + V(2).M(1, 2) + V(3).M(1, 3) + V(4).M(1, 4), and likewise for R(2), R(3) and R(4). Dependent operations on the same R(i) are thus scheduled four issue slots apart.]
Figure 28 : Showing the dependency diagram of the code of figure 27 above, with the
degree of pipelining n = 4
As can be seen from figure 28, this re-written code description of the matrix-vector
multiplication algorithm essentially overlaps the execution of the vector dot-products
by interleaving their constituent MACC operations. In this way, the first inner-
product of all the vector dot-products in turn is calculated and accumulated, after
which the same is done for the second inner-product of all the vector dot-products,
and so on.
The number of vector dot-products a particular matrix-vector multiplication consists
of is equal to the number of elements in the resultant vector, which is equal to the
number of rows in the matrix. With regards to the matrix-vector multiplication
algorithm detailed in figures 27 and 28 above, as this number (represented by the
variable num_matrix_rows in figure 27) is increased, dependent MACC operations are
scheduled further apart in time. If this number is greater than or equal to the number
of pipeline stages within the MACC, then the optimum throughput of the MACC can
be achieved (n times that of its non-pipelined version), as each time a MACC
instruction is issued, all those it depends on will have completed, thus
allowing it to be executed immediately.
As has been demonstrated, if the code is written such that dependencies are scheduled
far enough apart, the use of a pipelined MACC can increase throughput by a factor of
n (where n is the degree of pipelining).
Section 4 : The Floating Point Unit
The implementations of the matrix-vector multiplication algorithms discussed
previously in section 3 are all based on what were termed MACC units, which in
concept have the capabilities of triple-word read, write and multiply-accumulate.
This section details the design and implementation of a Floating-Point Unit (FPU) that
acts as one of those MACC units, and is pipelined in accordance with the findings of
section 3. As was seen previously in section 1.1.2, multiply-accumulate is not the
only elemental operation required to implement high-level OpenGL functions (and
graphics rendering functions in general). With this in mind, the FPU has been
designed in such a way that its instruction set is easy to extend. At the core of the
FPU is a 5-stage pipelined multiplier and a 3-stage pipelined adder. These may be
used in immediate succession to execute a multiply-accumulate (MACC) instruction,
or individually to execute either a multiply or an add instruction. These particular
pipeline lengths were chosen by considering the type of OpenGL program it was
envisaged would be executed on the FPU during its developmental phase, and FPU
has been designed such that changing these pipeline lengths can easily be done.
Figure 29 below depicts this initial FPU architecture.
Figure 29 : showing the initial design of the FPU
The control unit of the FPU is modelled as a program written in M-code, which is
encapsulated within the Embedded Matlab Function block labelled
FPU_Control_Unit on the diagram shown in figure B1 of appendix B. M-code is
Matlab's equivalent to C-code, and the inputs to the FPU_Control_Unit block are
passed through to and processed by the embedded program. This program is
executed and assigns values to its outputs once per step of Simulink's simulation time.
One of these simulation time steps is analogous to one clock cycle.
The instruction format of the FPU has been devised so that it is compliant with the IBM
PowerPC interface standard, and is shown below in figure 30. This has been done to
allow for the future possibility of employing the FPU alongside a PowerPC RISC as
the latter's Fabric Co-Processor Module (FCM), as the PowerPC is known to run
OpenGL code in this configuration.
[Instruction word layout: MNEMONIC in bits 26 to 21, RT/S in bits 20 to 14, RA in bits 13 to 7, RB in bits 6 to 0.]
Figure 30 : showing the format of the FPU’s instruction word
With reference to figure 30 above, the mnemonic field of an arithmetic instruction
tells the FPU_Control_Unit program which of the three types of arithmetic
instruction it is. The RT/S field is the register number of the instruction’s destination
register (and third source register of a MACC), and the RA and RB fields are the
instruction’s source register numbers.
The word length of all data processed by the FPU is 32-bits (in accordance with the
standard binary representation of floating-point numbers). Also part of the FPU is a
100-word register file, which all data is read from and written to. Throughout this
section it is assumed that all input data has already been loaded into this register file,
with section 5.2.6 later describing a DMA unit that was designed and developed to
transfer data in and out of the register file without holding the FPU up. The register
file facilitates three simultaneous reads and one write per clock cycle. Throughout
section 3 it was assumed that when a MACC instruction was issued to a pipelined
MACC unit, all three arguments were also supplied at once. However, when the FPU
begins the execution of a MACC instruction, only the two multiplicand arguments are
fetched immediately, and the third argument is fetched at the start of the accumulate
stage. Thus three reads on the register file per clock cycle must be provided so that
the FPU has the capability to begin executing a new MACC or multiply instruction in
the same clock cycle that it starts executing the accumulate stage of a down-stream
MACC instruction.
4.1 : Dealing with Hazards
The ability to issue the FPU any instruction that it supports in any clock cycle
abstracts the programmer from the architecture. This allows them to get working
code earlier in the design cycle (before optimisation), as opposed to code only
working when it is exactly optimised for this FPU. In the future, if a compiler is
developed, this capability will allow the FPU to execute code written for other
architectures. The FPU has this capability as a result of being designed to prevent
structural and data hazards from manifesting into errors.
The FPU has been designed to deal with the structural hazard that occurs when the
new instruction issued is an add, and the accumulate part of a MACC instruction is
due to begin in that clock cycle. The conflict here is that the accumulate part of the
MACC instruction also requires its associated input arguments to be entered into the
adder unit. In this event, priority is given to the accumulate, and the FPU is stalled,
whereby it does not allow new instructions to be issued until the adder becomes
available and execution of the pending add instruction has subsequently begun.
The FPU has also been designed to deal with the data hazards that arise when a newly
issued instruction has a variable (represented by a specific register) that is the output
variable of an instruction in the Execution_pipeline at that time. These data hazards
are prevented from manifesting into an error by the FPU stalling whenever such an
instruction is issued to it. This prevents any instruction executed from fetching one of
its source registers before the register’s contents have been updated by a down-stream
instruction, and similarly writing to a register before its contents have been fetched by
the accumulate stage of a down-stream MACC instruction.
4.1.1 : Using the Scoreboard to Detect Data Hazards
The Scoreboard is essentially a table that maintains which registers in the register file
are a destination register of an instruction currently in the pipeline and when they will
be written to (updated). The FPU_Control_Unit program maintains the FPU’s
Scoreboard as a binary vector, with one 6 bit field for each of the 100 registers of the
register file. Each of these fields are split into two sub-fields as illustrated below in
figure 31.
[Diagram: the Scoreboard shown conceptually as a table with a COUNT DOWN and an EXECUTION UNIT entry for each register R1 to R100, alongside its actual representation within the FPU_Control_Unit program as a binary vector of one hundred 6-bit fields; in each field, bits 5 to 2 hold the count down sub-field and bits 1 to 0 hold the exec_unit sub-field, where 01 denotes the adder and 10 denotes the multiplier.]
Figure 31 : Illustrating the concept of the Scoreboard and how it is represented
As can be seen above in figure 31, the 2 least significant bits of a field (its exec_unit
subfield) represent the actual execution unit that will produce the result to be written
into the field’s respective register, and the 4 most significant bits of a field (its count
down subfield) represent the number of clock cycles before that write-back operation
will occur. After consideration of the type of NLPs to be executed on the FPU (as
detailed previously in section 3 and even section 2), for simplicity the FPU was not
designed to have the capability of arbitrating the execution of multiple write-back
operations. Thus only programs that produce strictly no more than one write-back
operation per clock cycle are supported.
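The packing and unpacking of one such field can be sketched as follows (a minimal illustration in C, assuming the sub-field layout of figure 31; the names are not taken from the actual M-code):

    /* One 6-bit Scoreboard field: bits 1-0 = exec_unit (01 = adder,
       10 = multiplier), bits 5-2 = count down. */
    unsigned pack_field(unsigned count_down, unsigned exec_unit)
    {
        return ((count_down & 0xFu) << 2) | (exec_unit & 0x3u);
    }

    unsigned field_count_down(unsigned field) { return (field >> 2) & 0xFu; }
    unsigned field_exec_unit(unsigned field)  { return field & 0x3u; }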
The position of the field within the Scoreboard vector as a whole is representative of
the actual register it represents, where the least significant field represents register R1,
and successive fields represent the registers in ascending order. In each simulation
time-step the Scoreboard is updated to decrement any non-zero count down sub-fields
and add details of a new instruction if one is submitted to the Execution_pipeline by
the FPU_Control_Unit program.
As stated previously in section 4, there are 100 registers in the register file and
thus the Scoreboard binary vector consists of 6 x 100 = 600 bits. In Simulink,
unsigned integers are represented using 32 bits, and thus to model this vector it was
broken up into twenty 30-bit vectors. This is illustrated below in figure 32.
[Diagram: the 600-bit Scoreboard vector split into twenty 30-bit segments, Scoreboard_1 holding the 6-bit fields of R1 to R5 (bits 0 to 29) through to Scoreboard_20 holding those of R96 to R100 (bits 570 to 599), with the lowest-numbered register in the least significant field of each segment.]
Figure 32 : Illustrating how the Scoreboard binary vector is split into 20 segments in
order to represent it in Simulink
With reference to figure 32 above, Scoreboard_1 holds the status of registers R1 to
R5, Scoreboard_2 holds the status of registers R6 to R10, and so on for successive
Scoreboards up to Scoreboard_20. Each of the twenty Scoreboard segments are
represented within the FPU_Control_Unit program as persistent variables.
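Locating the field of a given register within these segments then amounts to a divide and a modulo, as in the following sketch (C, with illustrative names; reg runs from 1 to 100):

    /* Extract register reg's 6-bit field from the twenty 30-bit segments
       of figure 32 (five registers per segment, lowest-numbered register
       in the least significant field). */
    unsigned get_field(const unsigned scoreboard[20], int reg)
    {
        int idx = reg - 1;       /* 0-based register index    */
        int seg = idx / 5;       /* which 30-bit segment      */
        int pos = (idx % 5) * 6; /* bit offset within segment */
        return (scoreboard[seg] >> pos) & 0x3Fu;
    }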
Figure 33 below shows how the Scoreboard is updated each time the FPU submits a
new instruction to its Execution_pipeline.
[Diagram: three Scoreboard snapshots for registers R1 to R5 over successive clock cycles. The instruction issued in clock cycle 0 is R5 += R1 * R2 (a MACC), giving R5 a count down of 9 and exec_unit 1 (adder); the instruction issued in clock cycle 1 is R1 = R3 * R4 (a multiply), giving R1 a count down of 6 and exec_unit 2 (multiplier), while R5's count down has decremented to 8; the instruction issued in clock cycle 2 is R4 = R2 + R3 (an add), giving R4 a count down of 4 and exec_unit 1, while R1's and R5's count downs have decremented to 5 and 7 respectively.]
Figure 33 : showing an example to illustrate how the FPU updates the Scoreboard
each time a new instruction is submitted to the Execution_Pipeline
Figure 33 above shows how the state of the Scoreboard (for the registers of concern)
changes over three successive clock cycles, in which MACC, multiply and add
instructions are issued to the FPU and subsequently submitted to the
Execution_pipeline where their execution begins. As can be seen in figure 33, each
time an instruction is submitted to the Execution_pipeline the count down entry in the
Scoreboard for that instruction's destination register is set to the latency of the
instruction (in clock cycles), and this value is subsequently decremented in each clock
cycle thereafter. With reference to figure 33, registers R4, R1 and R5 will be written
to in clock cycles 6, 7 and 9 respectively. The exec_unit entry for the instruction's
destination register is set to the code representing the particular execution unit within
the Execution_pipeline that will produce its result, as detailed previously in figure 31.
A synopsis of the FPU_Control_Unit sub-function updateScore() that is responsible
for updating the Scoreboard (in the way shown in figure 33 above) each time a new
instruction is submitted to the Execution_pipeline is shown in figure B2, followed by a
description in appendix B.
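In outline, updateScore() behaves as in the following C sketch (a hedged approximation of the M-code in appendix B; the instruction latencies of 9, 6 and 4 cycles for MACC, multiply and add are read off figure 33 and are assumptions about the modelled pipeline):

    /* On submitting an instruction to the Execution_pipeline, set its
       destination register's count down to the instruction latency and
       its exec_unit code to the unit that will produce the result. */
    void update_score(unsigned scoreboard[20], int dest_reg,
                      unsigned latency, unsigned exec_unit)
    {
        int idx = dest_reg - 1;
        int seg = idx / 5;
        int pos = (idx % 5) * 6;
        scoreboard[seg] &= ~(0x3Fu << pos); /* clear the old field */
        scoreboard[seg] |= (((latency & 0xFu) << 2) | (exec_unit & 0x3u)) << pos;
    }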
The code of the sboardCycle() sub-function of FPU_Control_Unit is shown in figure
B3, followed by a description in appendix B. This sub-function is executed every
clock cycle (once per Scoreboard segment) to update the Scoreboard so as to reflect
the passing of one clock cycle. This is done by decrementing all non-zero count down
sub-fields throughout the entire Scoreboard. A count down value of 1 indicates that
its parent field represents the register on which a write back operation is
scheduled to occur in the current clock cycle (as the count down value will become
zero when it is decremented in that same execution of sboardCycle()), so in this
event sboardCycle() asserts the write back operation. After asserting a write back
operation, sboardCycle() sets the field's exec_unit sub-field to zero. Thus all
Scoreboard fields whose respective registers are not destination registers of an
instruction currently in the Execution_pipeline are maintained with a zero value in
both sub-fields.
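The per-cycle behaviour just described can be sketched as follows (C, an approximation of the M-code of figure B3; assert_write_back() is a hypothetical placeholder for the write back mechanism):

    /* Once per clock cycle: decrement every non-zero count down; a count
       down of 1 means the write back falls in this cycle, so assert it
       and zero both sub-fields of that register's entry. */
    void sboard_cycle(unsigned scoreboard[20])
    {
        int seg, slot;
        for( seg = 0; seg < 20; seg++ )
        {
            for( slot = 0; slot < 5; slot++ )
            {
                int pos = slot * 6;
                unsigned count_down = (scoreboard[seg] >> (pos + 2)) & 0xFu;
                if( count_down == 1 )
                {
                    assert_write_back(seg * 5 + slot + 1); /* register number      */
                    scoreboard[seg] &= ~(0x3Fu << pos);    /* zero both sub-fields */
                }
                else if( count_down > 1 )
                {
                    scoreboard[seg] -= 1u << (pos + 2);    /* count down minus 1   */
                }
            }
        }
    }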
4.1.2 : Managing Data Hazards
As discussed at the beginning of this section, when the Controller
(discussed later in section 5) issues the FPU with a new instruction,
FPU_Control_Unit checks the Scoreboard to decide whether or not submitting the
instruction to the Execution_pipeline may cause an error to occur due to a data
hazard. This is illustrated below in figure 34.
[Diagram: three Scoreboard snapshots for registers R1 to R5, as in figure 33. The instruction issued in clock cycle 0 is R5 += R1 * R2, that issued in clock cycle 1 is R1 = R3 * R4, and that issued in clock cycle 2 is R5 = R1 + R2; the snapshots show R5 with count downs of 9, 8 and 7 and R1 acquiring a count down of 6 then 5, so the add instruction of cycle 2 finds non-zero count downs on both its source register R1 and its destination register R5.]
Figure 34 : showing an example to illustrate how the Scoreboard is checked to
prevent data hazards
With reference to figure 34 above, the multiply instruction issued to the FPU in clock
cycle 1 (represented by the blue bar) passes the Scoreboard check (assuming there are
no instructions in the Execution_pipeline before clock cycle 0) because in clock cycle
1 neither of its source registers nor its destination register is the target of a pending
write back operation. However, the add instruction issued to the FPU in clock cycle 2
(represented by the green bar) fails the Scoreboard check for two reasons. Firstly,
one of its source registers (R1) is the target of the write back operation scheduled in
clock cycle 7 after the multiply instruction. Thus if this add instruction was submitted
to the Execution_pipeline in clock cycle 2, it would fetch and use the contents of R1
in that same clock cycle, before they had been updated in clock cycle 7 by the multiply
instruction.
In order to abstract the programmer from the architecture it must be assumed that they
don’t consider the latency of the instructions and that their intention in this event
would be for the result of the multiply instruction to be added to R2 by the add
instruction. Thus if this was the only cause for the Scoreboard check failure, the add
instruction could be submitted to the Execution_pipeline (without causing a data
hazard) in clock cycle 8 or thereafter.
However, a second cause of the add instruction's Scoreboard check failure is that its
destination register (R5) is the target of a write back operation scheduled in clock
cycle 9 after the MACC instruction (represented by the red bar). Thus a data hazard
exists because the destination register of a MACC instruction is first used as its third
source register. As such, if the add instruction was submitted to the
Execution_pipeline whilst the MACC instruction was still being executed, it could
write to R5 before this register had been fetched for the accumulate stage of the
MACC instruction. For simplicity, this event always results in a Scoreboard check
failure, regardless of whether or not the instruction's write back would occur after the
register fetch for the add stage of the MACC instruction.
The sub-function of FPU_Control_Unit that is employed to check the Scoreboard for
a particular register is checkScore(), a synopsis of which is shown in figure B4,
followed by a description in appendix B.
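In outline, the check reduces to testing the count down sub-field of each register the new instruction touches, as in this C sketch (an approximation with illustrative names; the real checkScore() is given in figure B4):

    /* Returns 1 if reg is safe to use this cycle, 0 if a pending write
       back means the instruction must be stalled. */
    int check_score(const unsigned scoreboard[20], int reg)
    {
        int idx = reg - 1;
        unsigned field = (scoreboard[idx / 5] >> ((idx % 5) * 6)) & 0x3Fu;
        return ((field >> 2) & 0xFu) == 0; /* non-zero count down => hazard */
    }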
4.1.3 : Managing Structural Hazards
As well as the data hazards discussed previously, a potential
structural hazard also exists, as the Execution_pipeline's adder is used in executing
both the add instruction and the accumulate stage of the MACC instruction. Figure
35 below shows an example to illustrate this structural hazard, and how it is dealt with.
[Diagram: a MACC instruction R5 += R1 * R2 issued in clock cycle 0, shown split into its multiply stage (red bar) and accumulate stage (purple bar, MACC_add), with the final multiply cycle and the accumulate's register fetch overlapped. Two alternative instructions are then issued in clock cycle 5: the multiply R1 = R3 * R4, which is accepted, and the add R4 = R2 + R3, which is not, because the accumulate stage of the MACC claims the adder in that cycle.]
Figure 35 : showing an example to illustrate the structural hazard concerning the use
of the Execution_pipeline’s adder by both MACC and add instructions
Figure 35 above shows a MACC instruction (issued in clock cycle 0) split into its two
stages, where the red bar represents the multiply-stage and the purple bar represents
the accumulate-stage. The dark red section of the combined bar represents the clock
cycle in which the last stage of the multiply and the first stage (register fetch) of the
add are conducted in parallel. This is done so as to hide the latency of the add-stage’s
register fetch. Figure 35 shows the outcome of issuing two alternative instructions to
the FPU in that clock cycle (5). As is the case with both the multiply and MACC
instructions, if the instruction issued in this clock cycle does not require immediate
use of the adder then it is eligible for submission to the Execution_pipeline.
However, as can be seen in figure 35 this is not the case with the add instruction, and
as such FPU_Control_Unit would not submit an add to the Execution_pipeline in this
situation, regardless of whether or not it passed its Scoreboard checks. Priority is
always given to the accumulate-stage of a MACC in this way for simplicity. In the
situation depicted by figure 35, the earliest time after this clock cycle that an add
instruction could be submitted to the Execution_pipeline would be clock cycle 6.
Details of how the FPU program implements the execution of the different
instructions and the management of hazards are contained in appendix B.
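Taken together, the issue decision made each clock cycle can be summarised by the following C sketch, combining the structural check with the Scoreboard checks of section 4.1.2 (illustrative names, building on the decode_instr() and check_score() sketches given earlier; the authoritative logic is the M-code of appendix B):

    /* An instruction may be submitted to the Execution_pipeline only if
       it does not contend for the adder with a MACC's accumulate stage
       and none of its three register fields has a pending write back. */
    int can_issue(const unsigned scoreboard[20], const fpu_instr_t *i,
                  int adder_claimed_by_macc_add)
    {
        if( i->mnemonic == MNEMONIC_ADD && adder_claimed_by_macc_add )
            return 0; /* structural hazard: adder already claimed */
        return check_score(scoreboard, i->ra)    /* data hazards on sources */
            && check_score(scoreboard, i->rb)
            && check_score(scoreboard, i->rt_s); /* and on the destination  */
    }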
Section 5 : The Controller
The Controller is responsible for issuing the FPU with the next instruction to be
executed. If the FPU stalls in the event of detecting a structural or data hazard, it
asserts a ‘1’ on its stall output and the Controller must re-issue the stalled instruction
in subsequent clock cycles until the FPU does submit it to its Execution_pipeline and
assert a ‘0’ back on its stall output. In the clock cycle after this submission occurs,
the Controller must issue the FPU with the next instruction to be executed.
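This handshake can be sketched in C as follows (an abstraction in which one loop iteration stands for one clock cycle; issue_to_fpu(), clock_tick() and fpu_stall() are hypothetical stand-ins for the Simulink signals):

    /* Re-present the current instruction while stall is '1'; advance to
       the next instruction in the cycle after the FPU submits it. */
    void run_program(const unsigned *program_memory, int program_length)
    {
        int pc = 0;
        while( pc < program_length )
        {
            issue_to_fpu(program_memory[pc]); /* (re-)issue this cycle */
            clock_tick();
            if( fpu_stall() == 0 )            /* submitted: move on    */
                pc++;
        }
    }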
5.1 : Initial Look-Up Table Design
The initial design of the Controller was essentially a look-up table, where every
instruction of the program to be run was stored sequentially in program memory.
This look-up table design of the Controller is shown below in figure 36.
Figure 36 : showing the initial look-up table design of the Controller
For simplicity, during this developmental phase the PROG_COUNTER (counter) was
initialised with its count_from and count_to block parameters before running the
simulation. Similarly, the PROG_MEMORY (single-port RAM) was initialised with
the sequence of program instructions through its initial_value_vector block parameter.
The output of PROG_MEMORY is separated out into its constituent fields
(mnemonic, RT/S, RA and RB) as bit-slicing is not supported in Simulink’s Embedded
Matlab function block, thus preventing this from being carried out within
FPU_Control_Unit. With reference to figure 36 above, when the stall input is ‘0’,
both PROG_COUNTER and PROG_MEMORY are enabled, thus allowing the
program counter to progress by 1 and the output register of the RAM block to be
written to. As such, when the stall signal is ‘0’ successive instructions of the program
are output by PROG_MEMORY at a rate of one per clock cycle. Figure 37 below
uses an example to show how the Controller deals with an FPU stall.
[Timing diagram: over clock cycles 0 to 8, the program counter advances through addresses a0 to a6 and instruction words i0 to i5 are issued; the stall signal is asserted in cycles 3 and 4 while i2 is held, i2 is submitted in cycle 5, and i3 is issued in cycle 6.]
Figure 37 : showing an example to illustrate how an FPU stall is dealt with by the
Controller
In the example shown in figure 37 above, the FPU is issued with instruction i2 in
clock cycle 3 but cannot submit it to the Execution_pipeline for two successive clock
cycles (3 and 4). As can be seen in figure 37 above, when the FPU stalls and asserts a
‘1’ on the stall signal, PROG_COUNTER has advanced to the address of the next
instruction (a3) by the time the count has been stopped. However, as the output
register of PROG_MEMORY is disabled, PROG_MEMORY's output value remains
as the word-value of the stalled instruction. In the clock cycle that the FPU does
submit i2 to the Execution_pipeline, it asserts a ‘0’ back on the stall signal, thus
enabling PROG_MEMORY’s output register, which is then written to with the word-
value of the next instruction to be executed. This can be seen in figure 37, where the
stalled instruction (i2) is submitted for execution in clock cycle 5, and in the
subsequent clock cycle the FPU is issued with the next instruction (i3).
5.2 : Optimising the Controller for Running Geometric Transformation
Programs
As the problem size (the number of instructions the program consists of) increases,
storing every single instruction requires the program memory capacity to be bigger
than is practical. For example, the matrix-vector multiplication program shown in
figure 21 and discussed previously in section 3.1 (where the matrix is 4x4 and the
vector is 4x1) is executed using 16 separate MACC instructions (ignoring any
load and store operations required to get data in and out of the register file). Thus the
size of the program memory required to store this program when completely unrolled
is 16 instruction words.
Figure 38 below illustrates a similar program to that shown in figure 21 which
multiplies eight 4x1 Vectors by the same 4x4 matrix. This is a typical example of the
program run in carrying out both the modelview and projection transformations in the
per-vertex operations stage of the OpenGL pipeline, as discussed previously in
section 1.2.2.
[Diagram: the 4x4 matrix M multiplied by the eight 4x1 vectors V1 to V8 to produce the eight 4x1 resultant vectors R1 to R8.]
Figure 38 : illustrating an example of the matrix-vector multiplication program carried
out by OpenGL’s modelview and projection transformations.
With reference to figure 38 above, the M matrix represents either the modelview or
projection matrix depending on which transformation is to be performed, and vectors
V1 to V8 represent the object or eye coordinate vectors of an object in the scene. The
object being transformed in this particular program has eight vertices, the simplest
example of which would be a cube. Figure 39 below shows a code description of this
program.
void pipelined1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols, const float **m,
    const float *v1, const float *v2, const float *v3, const float *v4, const float *v5, const float *v6, const float *v7,
    const float *v8, float *r1, float *r2, float *r3, float *r4, float *r5,
    float *r6, float *r7, float *r8)
{
    int row, col;
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r1[row] += m[row][col] * v1[col]; // matrix m is processed in COLUMN-MAJOR order
            r2[row] += m[row][col] * v2[col];
            r3[row] += m[row][col] * v3[col];
            r4[row] += m[row][col] * v4[col];
            r5[row] += m[row][col] * v5[col];
            r6[row] += m[row][col] * v6[col];
            r7[row] += m[row][col] * v7[col];
            r8[row] += m[row][col] * v8[col];
        }
    }
}
Figure 39 : showing a code description of the example geometric transformation
program depicted previously in figure 38
Previously in section 3.3.1, an analysis of how to optimise the matrix-vector
multiplication algorithm for execution on a pipelined MACC unit was detailed. In
conjunction with the findings of this analysis, the two loops in the code of figure 39
above are arranged such that the matrix is processed in column-major order, so as to
schedule dependent MACC instructions further apart in time and thus avoid periods of
latency due to FPU stalls. Furthermore, this program has eight vectors, and the eight
separate matrix-vector multiplication problems are interleaved so as to schedule the
dependent MACC instructions (within each individual problem) even further apart in
time, allowing for an even greater overall pipeline depth and thus a higher throughput
to be achieved.
Completely unrolling both loops will yield the fastest execution speed, as it entirely
removes the overhead of having to test loop conditions and execute branch operations
(to set the program counter back to the beginning of a loop). These overheads can
also be eliminated without unrolling any loops, by implementing loop
tests and branch operations within the Controller, although this would be at the
expense of added hardware resources. However, as discussed previously, the
disadvantage of completely unrolling both loops is that the size of program memory
required is increased by a factor equal to the degree of unrolling.
As can be seen from figure 39 above, the inner-loop of the program contains 8 MACC
instructions, and so if this program was represented with both loops completely
unrolled, the size of the required program memory would be 8x4x4 = 128 instruction-
words. With both loops completely unrolled, transformations of larger sizes (i.e.
where there are more vertices in the scene overall) could be solved by running the
program on small sets of vectors (vertices) at a time, thus reducing the number of
instructions that need to be stored at any one time, and likewise the size of program
memory required, although this approach would introduce the overhead of switching
between these smaller programs.
5.2.1 : The Use of Induction Variables
As can be seen from the code of figure 39 above, all of the instructions are MACC
instructions. Thus when the program is represented with both loops completely
unrolled, the instructions stored in program memory would all have the same
mnemonic, and differ only in their three source/destination register fields (RT/S, RA
and RB), whose values always address the same arrays.
To simplify the addressing of arrays in NLPs, the majority of DSP compilers
introduce induction variables, which by definition are derived from the loop index
values. Considering the geometric transformation program of figure
39 for just one vector, a code description illustrating how induction variables
would be used to address the output and two input arrays is shown below in figure 40.
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column major order
        p = (col * num_matrix_rows) + row; // p is an induction variable
        *(r1 + row) += *(m + p) * *(v1 + col);
    }
}
Figure 40 : showing an example to illustrate the use of induction variables for
addressing arrays
As can be seen in figure 40 above, the m array is indexed by adding its respective
induction variable p to its base pointer. Although p is the only new variable
introduced, the r1 and v1 arrays are also addressed in this way, where their respective
induction variables are exactly the values of the loop indices. With reference to figure
40 above, it can be seen how the use of induction variables allows for the successive
program instructions to be generated, with only a single generic instruction-word
stored in program memory of the form shown below in figure 41.
[Generic instruction word: MNEMONIC = MACC in bits 26 to 21, RT/S = R1_base_pointer in bits 20 to 14, RA = M_base_pointer in bits 13 to 7, RB = V1_base_pointer in bits 6 to 0.]
Figure 41 : showing the single generic instruction-word from which all program
instructions could be derived for the program of figure 40
With reference to figure 41 above, the Controller would pass the mnemonic field
straight on to the FPU, but between issuing successive instructions it would have to
evaluate the values of the RT/S, RA and RB fields, which would require additional
hardware resources. If this penalty was migrated into software by issuing the FPU
with instructions to calculate the induction variable values, this would eliminate the
need for extra resources (apart from the extra program memory required) at the cost of
increased execution time. However, this is not an option, as the FPU does not have a
register-address internal data-format. With reference to the code of figure 40 above,
for this program these extra hardware resources would amount to two adders for
evaluating the R1 array addresses, likewise another two adders for evaluating the V1
array addresses, and an adder and a multiply-add unit for evaluating the M array
address.
5.2.2 : Performing Strength Reduction on Induction Variables
In applying strength reduction to all three induction variables of the program of figure
40, the overhead cost of each inner-loop iteration is reduced. Figure 42 below shows
the code description after all three induction variables have undergone strength
reduction.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        *(r1 + r) += *(m + p) * *(v1 + c); // matrix m is processed in column major order
        r++;
        p++;
    }
    r = 0;
    c++;
}
Figure 42 : showing the code description of the matrix-vector multiplication algorithm
after all three induction variables have undergone strength reduction
As can be seen from figure 42 above, there is no longer the need for a multiplication
in evaluating successive values of p, thus the additional hardware required to evaluate
the p induction variable is now down to two adders. To facilitate this strength
reduction on p, the m matrix must be stored in column major order. Strength
reduction also removes any dependencies of induction variables on the loop indices
(as is the case for those associated with the R1 and V1 arrays), which provides more
flexibility, as not all programs will have induction variables that directly correspond
to the loop indices.
5.2.3 : Optimising a Geometric Transformation for the Controller
Considering the code shown in figure 39 above, where eight separate matrix-vector
multiplication problems are interleaved, this could be executed using separate base
pointers for the arrays of the separate problems, whilst using the same induction
variables across all problems. A code description of this solution is shown below in
figure 43.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in column major order
        *(r1 + r) += *(m + p) * *(v1 + c);
        *(r2 + r) += *(m + p) * *(v2 + c);
        *(r3 + r) += *(m + p) * *(v3 + c);
        *(r4 + r) += *(m + p) * *(v4 + c);
        *(r5 + r) += *(m + p) * *(v5 + c);
        *(r6 + r) += *(m + p) * *(v6 + c);
        *(r7 + r) += *(m + p) * *(v7 + c);
        *(r8 + r) += *(m + p) * *(v8 + c);
        r++;
        p++;
    }
    r = 0;
    c++;
}
Figure 43 : showing a code description of the program with eight matrix-vector
multiplication problems interleaved, with all induction variables having undergone
strength reduction.
With reference to figure 43 above, to run this code the Controller would have to store
one generic instruction word of the form shown previously in figure 41 for each of the
eight problems (i.e. one instruction word per vertex). This disadvantage arises from the
corresponding array base pointers having different values across the eight different
problems. The number of vertices an object has, or the number that are in a scene, can
be huge (reaching the tens of thousands in very detailed scenes), thus it is desired that
only one generic instruction word be stored in program memory, from which all
successive program instructions are derived.
In order to achieve this, the corresponding arrays across the different problems need
to be combined. Considering the way the geometric transformation interleaves the
execution of the problems, in order to keep the expressions for evaluating the
induction variable values as simple as possible, the best way to combine the arrays is
to interleave them. As such, the register numbers of successive array elements
accessed can be evaluated largely by simple increment operations. This is illustrated
below in figure 44, which shows an example register file arrangement for the three
arrays.
[Diagram: an example register file arrangement. The matrix M is stored in column-major order from R1 (R1 = M(1, 1), R2 = M(2, 1), R3 = M(3, 1), R4 = M(4, 1), R5 = M(1, 2), and so on up to R16); the source vectors are interleaved from R17 (R17 to R24 hold V1(1) to V8(1), R25 to R28 begin V1(2) to V4(2), and so on); and the resultant vectors are interleaved from R49 (R49 to R56 hold R1(1) to R8(1), R57 to R60 begin R1(2) to R4(2), and so on).]
Figure 44 : showing an example arrangement within the FPU’s register file, where the
three arrays of eight matrix-vector multiplication problems are interleaved together
As is illustrated in figure 44 above, the eight source and corresponding eight resultant
vectors are stored in the register file such that the first elements of all eight vectors are
stored adjacent to each other, and in the order the vector pairs are processed by the
program (1 to 8), followed by the second elements of all 8 vectors, and so on. The M
matrix is stored in column-major order, as was discussed earlier in this section. The
arrangement of the source and destination registers means that the complexity of
interleaving and de-interleaving the arrays is handled outside the Controller, by
whatever loads the data into and stores it out of the FPU's register file. Subsequently,
the register number sequences that need to be generated by the Controller are simpler
(than if the arrays were simply concatenated), and this will ease compiler
development if it is undertaken in the future. A code description of the program using
this register file arrangement, and requiring only one generic instruction word to be
stored in program memory, is shown below in figure 45.
res = 0;
res_base = 0;
p = 0;
vec_base = 0;
vec = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        for( vec_no = 0; vec_no < num_vectors; vec_no++ )
        {
            *(r0 + res) += *(m + p) * *(v0 + vec);
            res++;
            vec++;
        }
        p++;
        vec = vec_base;
    }
    res = res_base;
    vec = vec_base = vec_base + num_vectors;
}
Figure 45 : showing a code description of the program executing eight interleaved
matrix-vector multiplication problems with only one generic instruction word stored
in program memory
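As a worked example (an illustration assuming the register arrangement of figure 44, so that the m, v0 and r0 base pointers correspond to registers R1, R17 and R49 respectively), the first pass of the inner-most loop of figure 45 generates R49 += R1 * R17, R50 += R1 * R18, and so on up to R56 += R1 * R24; the next row iteration continues with R57 += R2 * R17, and when the col loop advances, res returns to its base value so that R49 accumulates its second term as R49 += R5 * R25.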
5.2.4 : Designing the Optimal Controller
Considering the code of figure 45 above, it can be seen that, to evaluate the three
induction variables between issuing successive instructions, the range of operations
the Controller must be able to carry out on an induction variable to produce its
subsequent value comprises incrementing it, setting it to a base value, and adding a
constant to that base value.
As was discussed previously in section 1.2.1, as well as matrix-vector multiplication,
the two other prominent NLPs executed within the OpenGL pipeline (and the graphics
rendering domain in general) are the FIR filter and matrix-matrix multiplication. For
flexibility and thus the ability to efficiently support a wider range of NLPs, the
Controller needs to have the ability to perform any combination of these operations on
any of the three induction variables in any one evaluation cycle between the issuing of
successive instructions.
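These three operations can be captured by a small per-variable structure, sketched below in C (illustrative names; this anticipates the Data Address Generator design detailed in section 5.2.4.1):

    /* One induction variable as maintained by the Controller: it can be
       incremented, reset to its base, or reset to the base after a
       constant has been added to that base. */
    typedef struct {
        unsigned value; /* current value (e.g. a register number)      */
        unsigned base;  /* base value to reset to                      */
        unsigned step;  /* constant added to the base on a base update */
    } induction_var_t;

    unsigned iv_increment(induction_var_t *iv)    { return ++iv->value; }
    unsigned iv_reset(induction_var_t *iv)        { return iv->value = iv->base; }
    unsigned iv_advance_base(induction_var_t *iv) { iv->base += iv->step;
                                                    return iv->value = iv->base; }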
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration
Parallel Processor for Graphics Acceleration

More Related Content

What's hot

Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
IJRAT
 
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdlIaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
Iaetsd Iaetsd
 

What's hot (18)

Real-time traffic sign detection and recognition using Raspberry Pi
Real-time traffic sign detection and recognition using Raspberry Pi Real-time traffic sign detection and recognition using Raspberry Pi
Real-time traffic sign detection and recognition using Raspberry Pi
 
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
Design of a Novel Multiplier and Accumulator using Modified Booth Algorithm w...
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management framework
 
A BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SET
A BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SETA BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SET
A BINARY TO RESIDUE CONVERSION USING NEW PROPOSED NON-COPRIME MODULI SET
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows set
 
Fpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revisedFpga sotcore architecture for lifting scheme revised
Fpga sotcore architecture for lifting scheme revised
 
Gpu based image segmentation using
Gpu based image segmentation usingGpu based image segmentation using
Gpu based image segmentation using
 
Kv3419501953
Kv3419501953Kv3419501953
Kv3419501953
 
Floor planning
Floor planningFloor planning
Floor planning
 
FPGA configuration of an alloyed correlated branch predictor used with RISC p...
FPGA configuration of an alloyed correlated branch predictor used with RISC p...FPGA configuration of an alloyed correlated branch predictor used with RISC p...
FPGA configuration of an alloyed correlated branch predictor used with RISC p...
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...
 
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdlIaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
Iaetsd vlsi architecture for exploiting carry save arithmetic using verilog hdl
 
Simulated annealing for location area planning in cellular networks
Simulated annealing for location area planning in cellular networksSimulated annealing for location area planning in cellular networks
Simulated annealing for location area planning in cellular networks
 
FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...
FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...
FPGA Implementation of 2-D DCT & DWT Engines for Vision Based Tracking of Dyn...
 
Braun’s Multiplier Implementation using FPGA with Bypassing Techniques.
Braun’s Multiplier Implementation using FPGA with Bypassing Techniques.Braun’s Multiplier Implementation using FPGA with Bypassing Techniques.
Braun’s Multiplier Implementation using FPGA with Bypassing Techniques.
 
2-DOF BLOCK POLE PLACEMENT CONTROL APPLICATION TO:HAVE-DASH-IIBTT MISSILE
2-DOF BLOCK POLE PLACEMENT CONTROL APPLICATION TO:HAVE-DASH-IIBTT MISSILE2-DOF BLOCK POLE PLACEMENT CONTROL APPLICATION TO:HAVE-DASH-IIBTT MISSILE
2-DOF BLOCK POLE PLACEMENT CONTROL APPLICATION TO:HAVE-DASH-IIBTT MISSILE
 
Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...
 

Similar to Parallel Processor for Graphics Acceleration

Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpga
VLSICS Design
 
The Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging SystemThe Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging System
Melissa Luster
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MP
IJSRED
 
Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
Orthogonal Matching Pursuit in 2D for Java with GPGPU ProspectivesOrthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
Matt Simons
 

Similar to Parallel Processor for Graphics Acceleration (20)

Automatic generation of power system network diagram(Mimic diagram) from a CI...
Automatic generation of power system network diagram(Mimic diagram) from a CI...Automatic generation of power system network diagram(Mimic diagram) from a CI...
Automatic generation of power system network diagram(Mimic diagram) from a CI...
 
IMQA Paper
IMQA PaperIMQA Paper
IMQA Paper
 
Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpga
 
E0364025031
E0364025031E0364025031
E0364025031
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming Paradigms
 
Digital scaling
Digital scaling Digital scaling
Digital scaling
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
 
Dynamic sorting algorithm vizualizer.pdf
Dynamic sorting algorithm vizualizer.pdfDynamic sorting algorithm vizualizer.pdf
Dynamic sorting algorithm vizualizer.pdf
 
HARDWARE/SOFTWARE CO-DESIGN OF A 2D GRAPHICS SYSTEM ON FPGA
HARDWARE/SOFTWARE CO-DESIGN OF A 2D GRAPHICS SYSTEM ON FPGAHARDWARE/SOFTWARE CO-DESIGN OF A 2D GRAPHICS SYSTEM ON FPGA
HARDWARE/SOFTWARE CO-DESIGN OF A 2D GRAPHICS SYSTEM ON FPGA
 
Auto conversion of serial C code to CUDA code
Auto conversion of serial C code to CUDA codeAuto conversion of serial C code to CUDA code
Auto conversion of serial C code to CUDA code
 
CAD STANDARDS - SMART MANUFACTURING MECH
CAD STANDARDS - SMART MANUFACTURING MECHCAD STANDARDS - SMART MANUFACTURING MECH
CAD STANDARDS - SMART MANUFACTURING MECH
 
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HPEfficient Data Center Virtualization with QLogic 10GbE Solutions from HP
Efficient Data Center Virtualization with QLogic 10GbE Solutions from HP
 
PID2143641
PID2143641PID2143641
PID2143641
 
The Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging SystemThe Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging System
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MP
 
Feature detection & extraction
Feature detection & extractionFeature detection & extraction
Feature detection & extraction
 
Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
Orthogonal Matching Pursuit in 2D for Java with GPGPU ProspectivesOrthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
Orthogonal Matching Pursuit in 2D for Java with GPGPU Prospectives
 

Recently uploaded

Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Parallel Processor for Graphics Acceleration
  • 3. Contents (continued)
6.8 Testing and Integration of the High-Level Controller ……...pg62
6.9 The OpenGL Demonstration ………………………………pg63-64
6.10 Progress in Developing a Hardware Based Demonstration .pg64-65
Section 7 : Conclusion …………………………………………………………pg65
Section 8 : Appendices ………………………………………………………...pg67-80
Section 9 : Acknowledgments …………………………………………………pg81
Section 10 : References ………………………………………………………..pg82

1 : Introduction

This section begins by describing the 3-D graphics rendering application domain, and the characteristics of the associated algorithms which allow for parallel execution. The industry standard OpenGL 3-D graphics rendering pipeline is analysed, with special focus given to the geometric transformations stage, which consists of a pipeline of matrix-vector manipulation operations that implement a key part of OpenGL's operation in defining what a scene looks like in terms of the position and orientation of the objects in it.
  • 4. The section then closes by looking at how Nested Loop Programs such as the FIR filter can be transformed to exploit their inherent parallelism by making it more explicit.

Section 2 follows on from this and examines the different combinations of transformations that can be applied to the FIR filter, and the resulting implementation trade-offs. Section 3 then carries out a similar analysis on another prominent NLP in graphics processing, the matrix-vector multiplication algorithm, where similarities and differences with the FIR filter are analysed, and trade-offs between the different ways to optimise the algorithm's implementation are discussed. Section 4 then details the design of an FPU which can exploit the characteristics inherent in the matrix-vector multiplication algorithm, if they are exposed by transforming the code as discussed in the previous section, by employing temporal parallelism. Section 5 goes through the design process for an Optimal Controller for the FPU using a typical matrix-vector multiplication algorithm found in the geometric transformations stage of the OpenGL pipeline, and goes on to show how the processor architecture as a whole was optimised for this particular section of the OpenGL pipeline. Section 6 explains how the complete processor architecture was built up from its constituent blocks in a hierarchical fashion, after they were tested against their specification and integrated together into sub-systems, and ends by describing the demonstration of OpenGL's geometric transformations based on the processor architecture developed. Section 7 finishes by drawing conclusions from the project and putting the results achieved into context.

1.1 : Parallel Processing for Graphics Functions

Currently, multiprocessing is the technique used to carry out the significant arithmetic processing required to implement the realistic rendering techniques used in advanced 3-D graphics applications, most often in the form of clusters of commodity PCs [1]. As discussed previously, graphics arithmetic largely consists of matrix and vector manipulation, and thus lends itself to parallel processing because of the independence of the individual operations required within a high-level matrix/vector function.

Most high-level functions encountered in a graphics environment are also 'order-independent', as the order they are performed in has no effect on the final display. Such functions can thus be executed in parallel to achieve higher throughput, which is one of the fundamental strengths of FPGAs. However, some 'sequential' functions will also be encountered, which must be executed after all preceding and before all subsequent functions; a parallel processing architecture must therefore deal with sequential functions with minimal degradation to the overall system performance. Another benefit of having a number of identical processors operating in parallel is that the programming of each processor can be the same, and so this method of obtaining high performance can also greatly simplify the software development process.
  • 5. 1.2 : OpenGL

The Open Graphics Library (OpenGL) is the industry's most widely used and supported 2-D and 3-D graphics API [2]. As such, there are thousands of applications based on OpenGL that are used to render compelling 2-D and 3-D graphics, in markets ranging from broadcasting, CAD/CAM, entertainment, cinematics and medical imaging to virtual reality, the most famous of all being Pixar's RenderMan (used by the movie industry to create special effects). Individual function calls in the OpenGL environment can be executed on dedicated tuned hardware, run as a software routine on the generic system CPU, or implemented as a combination of both. As a result of this implementation flexibility, OpenGL hardware acceleration can range from that which simply renders 2-D lines and polygons to the more advanced floating-point processor capable of transforming and computing geometric data.

1.2.1 : The OpenGL Pipeline

The two types of data that are input to the OpenGL pipeline are pixel data and geometric data. Pixel data is the RGBA data associated with pixels on the screen, and comes in the form of individual pixel colour values, images and bitmaps. Geometric data is used to model objects (ranging in complexity from simple 2-D shapes to realistic 3-D objects), and comes in the form of points, lines and polygons, which are OpenGL's three geometric primitives. All geometric data is eventually described as vertices. The data associated with each vertex is its 3-D positional coordinate vector, normal vector and material properties (used in lighting calculations), pixel colour value (an RGBA value or an index to a colour-map), and its texture coordinates (used to map a texture onto the vertex's parent object). Figure 1 below shows an overview of the OpenGL pipeline in terms of its various stages and the order in which operations occur.

Figure 1 : showing an overview of the OpenGL pipeline

As can be seen in figure 1 above, the vertex and pixel data are initially processed differently before both being used in the rasterization stage. All data can be input to the pipeline from the application and processed immediately, or saved in a display list which sends the data to the pipeline when the list is executed.
  • 6. In the per-vertex operations stage, the three vectors associated with each vertex (spatial coordinates, normal and texture coordinates) are transformed (multiplied) by the current modelview matrix, its inverse transpose and the current texture matrix respectively. These transformations are carried out in order to transform the position, orientation and size of the vertices' parent objects in the scene. Lighting calculations are then performed on each vertex using their transformed spatial coordinate and normal vectors, material properties and the current lighting model. These calculations act so as to scale each vertex's pixel colour value to reflect the location and orientation of its parent polygon relative to the light source(s).

The viewing volume of the scene is defined by six clipping planes. In the primitive assembly stage the spatial coordinate vector of all vertices is transformed (multiplied) by the projection matrix, so as to clip the scene against the six planes of the viewing volume. Depending on the type of clipping employed, primitives may have vertices rejected, modified or added. The programmer may also define additional clipping planes to further restrict the viewing volume, in order to create cut-away views of objects and other similar effects. The equations that represent such additional clipping planes are used to transform the spatial coordinate vectors before the projection matrix is applied.

The spatial coordinate vectors of all vertices then go through perspective division, where primitives are scaled in size to reflect their distance from the viewer. This is followed by the viewport transformation, where the spatial coordinate vectors of all vertices are transformed so as to map the 3-D scene onto the 2-D viewport (viewing window) of the computer screen.

In the rasterization processing stage, all primitives (points, lines and polygons) are rasterized into fragments, where the squares of the viewport's integer pixel-grid that are occupied by each primitive are determined. If enabled, advanced features to make the rendered scene more realistic are also implemented in this stage. The most commonly used of these is anti-aliasing, which is used to smooth the jagged edges that result from having to map non-vertical and non-horizontal lines to the square pixel-grid of the viewport. Anti-aliasing calculates the portion of each square within the pixel-grid that would be occupied by a line if the line were to be drawn as originally defined (before the viewport transformation), and this value is known as a pixel's coverage value. A pixel's coverage value is used to scale the alpha component of its RGBA colour value. Figure 2 illustrates how pixel coverage values are evaluated.
  • 7. Figure 2 : showing an example to illustrate how a pixel's coverage value is evaluated

With reference to figure 2 above, the green and orange show how a diagonal line looks before and after it is subjected to the viewport transformation respectively, and the coverage values for the pixels occupied by the line are given on the right.

In the pixel operations stage, pixel data that is input to the pipeline is scaled, biased and processed using a colour-map, after which the colour values are clamped to a certain range. The resulting pixel data is then either rasterized into fragments or written to texture memory for use in texture mapping. Data from the framebuffer can be read back and placed in the host processor memory. Data for texture mapping can be taken from either the host processor memory or the framebuffer.

If texturing is enabled, in the per-fragment operations processing stage the texture coordinate vector of each vertex (of a fragment's primitive) is used to map the vertices to specific points on a two-dimensional texture image, so that the texture (known as the texel for that particular fragment) can be mapped onto the primitive appropriately after its position and orientation are transformed in the per-vertex operations stage. If enabled, other advanced features such as blending (used to create a photo-realism effect) are also implemented in the per-fragment operations stage. It is also in this stage that the coverage values calculated in the rasterization stage are applied, if anti-aliasing is enabled. The final pixel values are then drawn (written) into the framebuffer.

1.2.2 : The Geometric Transformation Process

The overall transformation process for producing a scene for viewing is analogous to that carried out by a camera when it is used to take a photograph. This transformation process is carried out on the spatial coordinate vector of each vertex in the OpenGL pipeline, and is depicted below in figure 3.
  • 8. Figure 3 : showing the stages of transformation for spatial coordinate vectors

As can be seen in figure 3 above, a vertex's spatial coordinate vector consists of not only the vertex's three-dimensional x, y, z coordinates, but also a w component which is used in the perspective division stage. All three vectors of a vertex in OpenGL have four elements, thus all matrices (that they are multiplied by) are 4x4.

1.2.2.1 : The ModelView Transformation

A vertex's spatial coordinates are first presented to the pipeline as object coordinates. In this form the spatial coordinates specify a vertex's location in 3-D space when its parent primitive is centred on the origin and oriented in such a way that makes it easy for the programmer to visualise where its vertices are in 3-D space (i.e. with the primitive's edges parallel with and perpendicular to the axes).

The modelview transformation is the combination of the modelling transformation and the viewing transformation, represented by the modelling and viewing matrices respectively. The modelling transformation is always carried out on an object before the viewing transformation, as by default the modelview matrix is formed by the viewing matrix pre-multiplying the modelling matrix.

The modelling transformation positions an object at a particular location in 3-D space relative to the origin (by performing translation), rotates the object relative to the axes (by performing rotation) and scales the object in size (by performing scaling). These three transformations are represented by their own matrices, which are depicted in figure 4 below. The order in which the modelling transformation carries out these transformations is determined by the order in which their respective matrices are multiplied together to form the modelling matrix. The transformation represented by a post-multiplying matrix is carried out before that represented by the pre-multiplying matrix, and this principle holds true in all instances where transformation matrices are combined together through multiplication.
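To make the multiplication-order rule concrete, here is a minimal C sketch (an illustration, not taken from the report; mat4, mat4_mul and mat4_apply are assumed helper names). Composing mat4_mul(T, R) and applying the result to a vertex v computes T.(R.v): the post-multiplying matrix R takes effect first, exactly as stated above.

/* Minimal sketch: 4x4 row-major matrices and the order rule for combining
 * transformations. mat4_mul(a, b) returns the matrix product a x b. */
typedef struct { float m[4][4]; } mat4;

mat4 mat4_mul( mat4 a, mat4 b )
{
    mat4 c = { { { 0.0f } } };
    for( int i = 0; i < 4; i++ )
        for( int j = 0; j < 4; j++ )
            for( int k = 0; k < 4; k++ )
                c.m[i][j] += a.m[i][k] * b.m[k][j];
    return c;
}

/* Applies a 4x4 matrix to a 4-element spatial coordinate vector. Applying
 * mat4_mul(T, R) to v computes T.(R.v): rotation first, then translation. */
void mat4_apply( mat4 a, const float v[4], float out[4] )
{
    for( int i = 0; i < 4; i++ )
    {
        out[i] = 0.0f;
        for( int k = 0; k < 4; k++ )
            out[i] += a.m[i][k] * v[k];
    }
}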
  • 9.
      | 1  0  0  x |
  T = | 0  1  0  y |
      | 0  0  1  z |
      | 0  0  0  1 |

Matrix T translates the object from being centred at the origin to the location defined by x, y, z; the object's orientation and size are maintained.

      | x  0  0  0 |
  S = | 0  y  0  0 |
      | 0  0  z  0 |
      | 0  0  0  1 |

Matrix S scales the object such that it is stretched by a factor of x, y, z in the direction of the corresponding axes; the object's orientation and position are maintained.

       | 1    0      0     0 |        | cosΦ   0  sinΦ  0 |        | cosΦ  -sinΦ  0  0 |
  Rx = | 0  cosΦ  -sinΦ  0 |   Ry = |  0     1   0    0 |   Rz = | sinΦ   cosΦ  0  0 |
       | 0  sinΦ   cosΦ  0 |        | -sinΦ  0  cosΦ  0 |        |  0      0    1  0 |
       | 0    0      0     1 |        |  0     0   0    1 |        |  0      0    0  1 |

Matrices Rx, Ry and Rz rotate the object clockwise by Φ degrees about the x, y and z axes respectively; rotation about more than one axis is achieved by multiplying the relevant R matrices together; the object's position and size are maintained.

Figure 4 : showing the translation, scaling and rotation matrices that are multiplied together to form the modelling matrix

With reference to figure 4 above, it can be seen that the translation transformation on a vertex is achieved by adding to each of its x, y and z components the multiplication of the corresponding component of the translation vector (down column 4 of the T matrix) and its w component. The scaling transformation on a vertex is achieved by multiplying each of its components by the corresponding component of the scaling vector along the leading diagonal of the S matrix. The action of the rotation matrix is rather more complex, although it can be seen from the R matrices above that the vertex component associated with the axis being rotated about remains unchanged after the transformation.

The viewing transformation is analogous to adjusting the camera location and the direction in which it points when taking a photograph of the scene. This transformation is comprised of a combination of translation (adjusting the viewpoint's location) and rotation (adjusting the viewpoint's direction), and the associated matrices are the same as those used to produce the modelling matrix (shown above in figure 4). These two transformations within the viewing transformation (on the viewpoint) have the exact reverse effect on the appearance of the scene to the corresponding transformations within the modelling transformation (on the scene's objects). The default viewpoint is at the origin and points in the negative z-direction. As objects are most often initially defined as being centred on the origin, the modelview transformation as a whole must perform a transformation simply for the objects in the scene to be visible, although any of the elementary transformations can be omitted by simply not including their respective matrices in the product forming the modelview matrix.

  • 10. 1.2.2.2 : The Projection Transformation

The eye coordinates resulting from the modelview transformation then go through the projection transformation, where they are converted to clip coordinates. The projection transformation defines the viewing volume. The shape of the viewing volume determines how objects are projected onto the screen and which objects (or portions of objects) are clipped out of the final scene. The most common type of projection used is perspective projection, which employs a frustum-shaped viewing volume, as illustrated below in figure 5.

Figure 5 : showing the frustum shaped viewing volume employed by perspective projection

Perspective projection implements foreshortening, whereby the further away from the viewpoint an object is, the smaller it appears in the scene, thus emulating the way the human eye (or a camera) works. The projection matrix that represents the (perspective) projection transformation is depicted below in figure 6, where n, f, r, l, t and b are the near, far, right, left, top and bottom clipping planes of the frustum respectively.

      | 2n/(r-l)     0       (r+l)/(r-l)      0      |
  P = |    0      2n/(t-b)   (t+b)/(t-b)      0      |
      |    0         0      -(f+n)/(f-n)  -2fn/(f-n) |
      |    0         0          -1            0      |

Matrix P clips the objects against the six planes of the viewing volume; the w component of each vertex is set to -z (the distance of the vertex from the origin, in a direction away from the viewpoint).

Figure 6 : showing the projection matrix for perspective projection
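As a concrete illustration, the following minimal C sketch (not from the report; the function name frustum_matrix is illustrative) fills in the P matrix of figure 6 from the six clipping-plane parameters, in the same way that OpenGL's glFrustum call does.

/* Minimal sketch, assuming a row-major 4x4 array: builds the perspective
 * projection matrix of figure 6 from the six frustum parameters. */
void frustum_matrix( float p[4][4],
                     float l, float r, float b, float t, float n, float f )
{
    for( int i = 0; i < 4; i++ )
        for( int j = 0; j < 4; j++ )
            p[i][j] = 0.0f;

    p[0][0] = ( 2.0f * n ) / ( r - l );
    p[0][2] = ( r + l ) / ( r - l );
    p[1][1] = ( 2.0f * n ) / ( t - b );
    p[1][2] = ( t + b ) / ( t - b );
    p[2][2] = -( f + n ) / ( f - n );
    p[2][3] = ( -2.0f * f * n ) / ( f - n );
    p[3][2] = -1.0f;   /* sets each vertex's clip w to -z, used in perspective division */
}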
  • 11. 1.2.2.3 : Perspective Division and the ViewPort Transformation

The clip coordinates resulting from the projection transformation are then converted to normalized device coordinates through the process of perspective division, where the x, y and z coordinates of each vertex are divided by its w component (which is set to -z in the projection transformation, as described previously in figure 6). This scales the objects down in size to implement foreshortening, as discussed previously in regard to perspective projection. After the perspective division stage the four-element spatial coordinate vector of each vertex becomes a three-element vector, as the w component is discarded.

The last stage of the process is the viewport transformation, which maps the three-dimensional normalized device coordinates to the two-dimensional viewport, converting them to window coordinates. The equations that perform the viewport transformation are shown below in figure 7.

xw = ( ( xnd + 1 ) × ( width ÷ 2 ) ) + xo
yw = ( ( ynd + 1 ) × ( height ÷ 2 ) ) + yo

Figure 7 : showing the equations that perform the viewport transformation

With reference to figure 7 above, the (xo, yo) coordinate is the position of the bottom left-hand corner of the viewport (viewing window) on the screen, relative to the corresponding corner of the screen. The width and height parameters are the dimensions of the viewport (in pixels).
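These two stages can be summarised in a few lines of C; this is a minimal sketch (not from the report), assuming a clip-space vertex {x, y, z, w} and the viewport parameters of figure 7.

/* Minimal sketch: perspective division followed by the viewport
 * transformation of figure 7. clip[] holds {x, y, z, w}; win[] receives
 * the 2-D window coordinates. */
void viewport_transform( const float clip[4],
                         float xo, float yo, float width, float height,
                         float win[2] )
{
    /* perspective division: divide x and y by w; w is then discarded */
    float xnd = clip[0] / clip[3];
    float ynd = clip[1] / clip[3];

    /* viewport transformation (figure 7) */
    win[0] = ( ( xnd + 1.0f ) * ( width  / 2.0f ) ) + xo;
    win[1] = ( ( ynd + 1.0f ) * ( height / 2.0f ) ) + yo;
}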
  • 12. 1.3 : Overview of Transformations on Nested Loop Programs

A key problem in parallel processing is the way in which the program for each processor is generated, such that the overall program is executed efficiently. Most graphics functions are described as a set of nested for-loops [3]. The Nested-Loop Program (NLP) form of an algorithm represents its sequential model of computation. The FIR filter operation is a simple example of an NLP, and is shown below in figure 8 for calculating four output values.

for j = N:N+3
    for i = 0:N-1
        y[j] = y[j] + ( x[j - i] * h[i] )
    end
end

Figure 8 : showing the NLP form of the FIR filter operation

This sequential model of computation of an algorithm is representative of the way in which the algorithm would be implemented as software running on a standard single-thread processor.

The algorithmic transformation of unrolling transforms an NLP such that its task-level parallelism is enhanced. As a result, tasks that are independent of each other are explicitly shown to be mutually exclusive, although the resulting representation of the algorithm is still functionally equivalent to the original. The independent tasks can then be mapped to separate processors. The algorithmic transformation of skewing makes the dependences of operations less demanding, and thus allows for latency in the physical operators that will actually execute them.

The dependency graph is a useful way of visualising the dependences between the operations in an NLP, and represents an intermediate level of abstraction between the NLP and its implementation. An example of a dependency graph is shown in figure 9 for the FIR filter NLP of figure 8, where the outer j loop has been unrolled by a factor of two, and N = 4. Data dependences between tasks (represented as circles) are shown by one task forwarding data to another, and the two independent tasks are highlighted in different colours.

Figure 9 : showing the effect of unrolling the outer loop of the FIR filter NLP by a factor of 2

When the outer loop is unrolled by a factor of two, this is essentially the same as making two copies of it that can be executed in parallel. As there are two copies, in the implementation of this modified NLP each copy of the loop will need its own set of registers for storing coefficient and data values. The unrolling transformation thus translates to spatial parallelism being employed in the implementation.
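A minimal C sketch (not from the report) of what this unrolling looks like at the code level: the two copies of the j loop body accumulate two output samples independently, so the two chains of MACC operations could be mapped onto two separate processors.

/* Figure 8's j loop unrolled by a factor of two (N = 4 taps assumed).
 * y0 and y1 are independent accumulator chains: no data flows between
 * them, so each could run on its own MACC unit with its own registers. */
void fir_outer_unrolled_by_2( const float *x, const float h[4], float y[4], int j_start )
{
    for( int j = j_start; j < j_start + 4; j += 2 )
    {
        float y0 = 0.0f;   /* accumulates y[j]     */
        float y1 = 0.0f;   /* accumulates y[j + 1] */

        for( int i = 0; i < 4; i++ )
        {
            y0 += x[j - i]         * h[i];
            y1 += x[( j + 1 ) - i] * h[i];
        }
        y[j - j_start]     = y0;
        y[j - j_start + 1] = y1;
    }
}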
  • 13. The skewing transformation, however, translates to temporal parallelism (pipelining) being employed in the implementation, and as such a single set of registers is shared by the different iterations of the internal loop, whose executions are overlapped. These two transformations can be used to transform a sequential model of computation into something that is closer to a data-flow model, thus making it more suitable for efficient implementation (exploiting parallelism) in hardware.

Section 2 : FIR Filter Analysis

The FIR filter operation essentially carries out a vector dot-product in calculating each value of y[n]. This is illustrated below in figure 10 for N = 4, where the output is x[n].h[0] + x[n-1].h[1] + x[n-2].h[2] + x[n-3].h[3].

Figure 10 : showing how the FIR filter operation is comprised of vector dot-products

Section 2.1 : The Single MACC FIR Filter

The Single MACC FIR filter (shown below in figure 11) is an implementation of the FIR filter's sequential model of computation, and as its name suggests it is based on a single MACC (multiply-accumulate) unit. As such, the algorithmic description of this implementation is identical to the NLP description of the FIR filter (shown below in figure 12) without applying any unrolling or skewing transformations (which were discussed earlier in section 1.3). For simplicity it is assumed throughout this section, unless otherwise stated, that all references to MACC units refer to non-pipelined MACCs with a total latency of 1 clock cycle.
  • 14. Figure 11 : showing an example H/W implementation of the Single MACC FIR filter [4]

The primary trade-off between sequential and parallel implementations of the same algorithm is the amount of hardware resources required versus the throughput achieved. As the Single MACC FIR filter implements the FIR filter function in a completely sequential manner, the required hardware resources are reduced by a factor of N, although so too is the throughput, as compared to a fully parallel implementation that would use one MACC unit for each of the N coefficients (where the N MACCs would be cascaded).

void singleMaccFirFilter( int num_taps, int num_samples,
                          const float *x, const float h[ ], float *y )
{
    int i, j;        // 'j' is the outer-loop counter and 'i' is the inner-loop index
    float y_accum;   // output sample is accumulated into 'y_accum'
    const float *k;  // pointer to the required input sample

    for( j = 0; j < num_samples; j++ )
    {
        k = x++;     // x points to x[n+j] and is incremented (post assignment)
                     // to point to x[(n+j)+1]
        y_accum = 0.0;

        for( i = 0; i < num_taps; i++ )
        {
            y_accum += h[i] * *(k--);   // y[n+j] += h[i] * x[(n+j) - i]
        }

        *y++ = y_accum;  // y points to the register address where y[n+j] is to be
                         // written and is incremented (post assignment) to point to
                         // the register address where the next output sample
                         // y[(n+j)+1] is to be written
    }
}

Figure 12 : A code description of the Single MACC FIR filter
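One subtlety worth noting: because k walks backwards from x[n+j], the x pointer passed in must sit at least num_taps - 1 elements into a buffer that already holds the older input samples. A hypothetical driver (not from the report) would therefore look something like this:

/* Hypothetical usage sketch for figure 12's code: 4 taps, 4 output samples.
 * The input buffer holds x[n-3]..x[n+3]; &input[3] is x[n], the sample
 * aligned with the first output y[n], with the three older samples below it. */
int main( void )
{
    const float h[4]     = { 0.25f, 0.25f, 0.25f, 0.25f };
    const float input[7] = { 1, 2, 3, 4, 5, 6, 7 };   /* x[n-3]..x[n+3] */
    float y[4];

    singleMaccFirFilter( 4, 4, &input[3], h, y );     /* y = {2.5, 3.5, 4.5, 5.5} */
    return 0;
}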
  • 15. With reference to the code description of the Single MACC FIR filter shown above in figure 12, all of the required input samples are assumed to be stored in the register file (with a stride of 1) of the processor (a single MACC unit in this case) executing the code, with x initially pointing to the input sample x[n] corresponding to the first output sample to be calculated, y[n]. It is also assumed that all of the required coefficients are stored in the same way in a group of registers used to store the h[ ] array.

As can be seen from figure 12 above, and more clearly from the dependency graph of the Single MACC FIR filter (shown below in figure 13), this implementation evaluates (accumulates) only one output value at a time. Assuming that each multiplication of x[(n+j)-i] and h[i] takes one clock cycle, the performance of this implementation is given by the following equation:

Throughput = Clock frequency ÷ Number of coefficients

Figure 13 : dependency graph showing the operation of the Single MACC FIR filter

If the coefficients of the filter possess symmetry (i.e. h[0] = h[N-1], h[1] = h[N-2], etc.) a doubling of throughput can be achieved at the same clock frequency using a variation of the Single MACC FIR filter called the Symmetric MACC FIR filter. This implementation uses a single coefficient in place of those that are equal, and as such only one multiplication by that single coefficient is required, although the other multiplicand is now the sum of the data values corresponding to those equal coefficients. Thus, the cost of this performance enhancement is another adder at the input of the multiplier, as well as another RAM block (or the use of one dual-port block RAM), as the two input samples corresponding to the single coefficient need to be fetched simultaneously. However, as the number of coefficients is halved, so too is the amount of storage required for them. If N is an odd number, the Symmetric MACC FIR filter reduces the number of coefficients to N/2 + 1.

The Symmetric MACC FIR is derived from the Single MACC FIR by unrolling its inner-loop by a factor of two and reversing the order in which one of the loops processes its respective coefficient-data pairs; this is a special case of employing spatial parallelism.

Employing spatial parallelism is one way to enhance the performance of the Single MACC FIR filter, and essentially uses more than one MACC unit to evaluate each output sample. As a result each MACC unit evaluates an equal share of the coefficient-data sample multiplications, and as such if M MACC units are employed the throughput is increased by a factor of M over the Single MACC FIR filter, although so too are the required hardware resources.
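Returning to the Symmetric MACC FIR filter: as a minimal sketch (not from the report), its inner loop for even N might look as follows; the addition of the two samples that share a coefficient models the pre-adder in front of the single multiplier referred to above.

/* Symmetric FIR inner loop for even num_taps: h_half holds the num_taps/2
 * unique coefficients. x_j points at x[n+j], with the older samples stored
 * immediately below it (as assumed for figure 12's code). */
float symmetricMaccFirSample( int num_taps, const float *x_j, const float h_half[ ] )
{
    float y_accum = 0.0f;
    int   N = num_taps;

    for( int i = 0; i < N / 2; i++ )
        y_accum += h_half[i] * ( x_j[-i] + x_j[-( N - 1 - i )] );  /* pre-adder */

    return y_accum;
}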
  • 16. Section 2.2 : The Transposed FIR Filter

The Transposed FIR filter (an example H/W implementation of which is shown below in figure 14) is a fully parallel implementation, as one MACC unit is used for each of the N coefficients. Unlike the Direct form type I fully parallel implementation (which employs an adder-tree structure), the Transposed FIR filter employs an adder chain, and as such the MACC units are much easier to connect together and the implementation can easily be scaled up or down in terms of N. With regard to targeting the design to the Xilinx Virtex-4 FPGA, because of this adder-chain structure the Transposed implementation can be entirely contained within the dedicated Xilinx DSP48 slices, as opposed to using generic FPGA fabric, which would yield a less efficient mapping.

Figure 14 : showing an example H/W implementation of the Transposed FIR filter [4]

The input data samples are broadcast to all MACC units simultaneously, and with this implementation the coefficients are assigned (in ascending order) starting from the right-most MACC unit (from which the final output is taken). As well as the spatial parallelism through the use of N MACC units, temporal parallelism is also employed, as the evaluation of successive output samples is overlapped, although this doesn't serve to increase throughput or decrease the latency before the first output sample appears over the Direct form type I implementation.
  • 17. A code description for the Transposed FIR filter (with the number of taps N = 4) is given in figure A1 of appendix A.

The Transposed FIR filter design is yielded by a complete unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter code description, which results in the number of MACC units required increasing to N. A skewing of the outer-loop (j-loop) by a factor of 1 is also performed, which results in the temporal overlapping of successive output sample calculations. This skewing is required to schedule apart the dependences that arise because the N MACC operations within any single iteration of the outer-loop are dependent on the MACC in the previous iteration of the inner-loop (for their third argument). The dependency graph of the Transposed FIR filter's operation is shown below (with N = 4) in figure 15.

Figure 15 : dependency graph showing the operation of the Transposed FIR filter (with N = 4)

As can be seen above in figure 15, the initial latency before the first output sample emerges is the same as that seen with the fully-parallel FIR filter. Once this initial latency (of the spin-up procedure, whereby the pipeline is filled over the first N cycles) has been endured, the throughput yielded by the Transposed FIR filter implementation is the same as that of the Direct form type I implementation (equal to the clock frequency). The latency period between when each input sample is applied and the emergence of the corresponding output sample is also the same as that seen with the Direct form type I implementation, and is equal to the latency of a single MACC unit.
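Since the appendix is not reproduced here, the following is a minimal per-sample sketch (an assumption on my part, not the report's figure A1) of the transposed structure: the new input is broadcast to all taps, and the adder-chain state registers z[] carry the partial accumulations towards the output.

/* One clock cycle of an N-tap Transposed FIR filter (N >= 2). x_in is
 * broadcast to every tap; z[0..N-2] model the adder-chain registers.
 * Coefficients are assigned in ascending order from the output end. */
float transposedFirStep( float x_in, const float h[ ], float z[ ], int N )
{
    float y = z[0] + h[0] * x_in;            /* output tap */

    for( int i = 0; i < N - 2; i++ )
        z[i] = z[i + 1] + h[i + 1] * x_in;   /* shift partial sums down the chain */

    z[N - 2] = h[N - 1] * x_in;              /* last tap starts a new accumulation */
    return y;
}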
  • 18. Section 2.3 : The Systolic FIR Filter

As with the Transposed FIR filter, the Systolic FIR filter (an example H/W implementation of which is shown below in figure 16) is also a fully parallel implementation, and also uses an adder-chain to accumulate each value of y[n].

Figure 16 : showing an example H/W implementation of the Systolic FIR filter [4]

The Systolic FIR filter also employs temporal parallelism in addition to spatial parallelism, in the same way that the Transposed FIR filter does. However, the Systolic FIR filter's coefficients are assigned (in ascending order) starting from the left-most MACC unit (to which the input samples are applied), which is the opposite way to how the coefficients are assigned to the Transposed FIR filter's MACC units. As such, the Systolic FIR filter evaluates the inner-products (of which each value of y[n] consists) in the reverse order to the Transposed FIR filter. The input data samples are fed into a cascade of registers which have the effect of a data buffer. The Systolic FIR filter differs from the Direct form type I implementation not only in its use of an adder-chain to accumulate each value of y[n], but also in the additional register between each of the taps. A code description of the Systolic FIR filter (with the number of taps N = 4) is given in figure A2 of appendix A.

As with the Transposed FIR filter, the Systolic FIR filter is yielded by a complete unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter, and a skewing of its outer-loop by a factor of 1. However, in generating the Systolic FIR filter from the Single MACC FIR, the outer-loop is skewed in the opposite direction to how it is skewed in generating the Transposed FIR filter. This means that the inner-products which are summed together to produce an output sample are evaluated in the opposite order to that in which they're evaluated by the Transposed FIR filter. This difference is reflected in the two H/W implementations: the Transposed implementation employs a broadcast structure to feed the same input sample to each of its MACC units on each clock cycle, whereas the Systolic implementation employs a tapped delay line with a delay of two clock cycles between each MACC unit. The dependency graph of the Systolic FIR filter's operation is shown below in figure 17 (with N = 4).
  • 19. Figure 17 : dependency graph showing the operation of the Systolic FIR filter (with N = 4)

In figure 17, the green arrows represent the forwarding of a partial accumulation of an output sample through the adder-chain, whilst the effect of the two registers between each of the MACC units is represented by the blue arrows, as they represent the forwarding of input samples between successive MACC units.

As with the Transposed FIR filter, the initial latency before the first output sample emerges from the output, and the throughput thereafter, is the same as that seen with the Direct form type I implementation. However, because the Systolic FIR filter evaluates and accumulates the inner-products in the opposite order to the Transposed FIR filter, the latency between each input sample being applied to the filter and the corresponding output sample emerging from the output is N clock cycles (assuming the latency of each MACC unit is 1 cycle). Thus this latency increases by a factor of N compared to that seen with both the Transposed and Direct form type I implementations. However, the advantage that the Systolic FIR filter holds over the Transposed implementation is that its input is only applied to one MACC unit, unlike that of the Transposed FIR filter, whose input is broadcast to all of its MACC units and thus has a high fan-out. Thus the Systolic implementation is more suitable than the Transposed implementation for higher values of N.
  • 20. Section 2.4 : The Semi-Parallel FIR Filter

The Semi-Parallel FIR filter (sometimes called the hardware-folded implementation) divides its N coefficients amongst M multiply-add units. An example implementation of the Semi-Parallel FIR filter is shown below in figure 18 (with N = 16, M = 4).

Figure 18 : showing an example H/W implementation of the Semi-Parallel FIR filter (with N = 16, M = 4) [4]

Each group of N/M coefficients is assigned to one of the MACC units and stored in order within the associated coefficient-memory. The first group (coefficients 0 to (N/M - 1)) is assigned to the left-most MACC unit (to which the input samples are applied), with ascending coefficient groups being assigned to the MACC units from left to right. If N is not exactly integer-divisible by M, then the higher-order coefficient-memories are padded with zeros.

Like the Transposed and Systolic implementations, the Semi-Parallel FIR filter employs both spatial and temporal parallelism, but the degree to which it does this depends on the ratio of M:N, with a higher M:N ratio resulting in a higher degree of both spatial parallelism (as more MACC units are used) and temporal parallelism (as each output sample is evaluated more quickly, and thus the evaluation of more output samples can be overlapped in time). Thus the trade-off is the performance obtained versus the resources required, as can be seen from the equation for the performance of a Semi-Parallel FIR filter implementation:

Throughput = ( Clock frequency ÷ N ) × M

For example, with N = 16 and M = 4, one output sample is produced every N/M = 4 clock cycles. The Semi-Parallel implementation may be extrapolated either towards being a fully-parallel implementation like the Transposed and Systolic implementations by using more MACC units, or the other way towards being a Single MACC FIR filter by using fewer MACC units. A code description of the Semi-Parallel FIR filter (with N = 16, M = 4) is given in figure A3 of appendix A. The dependency graph of the Semi-Parallel FIR filter's operation (with N = 16, M = 4) is shown below in figure 19.
  • 21. Figure 19 : dependency graph showing the operation of the Semi-Parallel FIR filter

In figure 19, the red circles represent MACC units being used to calculate the inner-products of y[n], and the dark-red circles represent the output-accumulator being used to accumulate the inner-products of y[n]. The blue and dark-blue circles represent the same for y[n-1], whilst the yellow circles represent MACC units being used to calculate the inner-products of y[n+1].

As can be seen in figure 19, the address (which lies in the range [0 : ((N/M) - 1)] for all MACC units) applied to the data-buffer and coefficient memory of each MACC unit lags one behind the corresponding address of the immediately preceding (to the immediate left) MACC unit, and all such addresses continuously and monotonically cycle from 0 to ((N/M) - 1). This is necessary in order to employ temporal parallelism by overlapping (in time) the evaluation of successive output samples in the way shown in figure 19 above. This temporal parallelism is in turn necessary to achieve the Semi-Parallel implementation's maximum throughput of one output sample every N/M clock cycles, because once an output sample has been retrieved from the accumulator by the capture register, the accumulator must be reset (to either zero or its input value). For its input value to be the first sum of M inner-products of the next output sample to be evaluated, the evaluation of this sum needs to have finished in the previous clock cycle; otherwise the accumulator will have to be reset to zero at the start of a new result-cycle (in which an output sample is accumulated).
If the accumulator were set to zero between result-cycles in this way, one extra clock cycle would be required for the evaluation of each output sample, thus degrading performance.
  • 22. Section 3 : Matrix-Vector Multiplication Analysis

Matrix-vector multiplication is essentially a series of vector dot-products, as element (r,1) of the resultant vector is comprised of the multiplication of row r of the matrix and the multiplicand column vector. This is illustrated below in figure 20, which shows how the (4x1) resultant column vector R is formed from the multiplication of the (4x4) matrix M and the (4x1) column vector V, i.e. R(r) = M(r,1).V(1) + M(r,2).V(2) + M(r,3).V(3) + M(r,4).V(4), for r = 1 to 4.

Figure 20 : showing how matrix-vector multiplication is comprised of a series of vector dot-products

Section 3.1 : The Sequential Model of Computation

Matrix-vector multiplication is related to the FIR filter, as both algorithms consist of a series of vector dot-products. Considering both algorithms in their sequential form, their outer for-loop is essentially "for the number of vector dot-products required", and their inner for-loop is essentially "for the number of vector/matrix-row element pairs". Figures 21 and 22 show a code description and dependency graph of the matrix-vector multiplication problem's sequential model of computation respectively. For simplicity it is assumed throughout this section, unless otherwise stated, that all references to MACC units refer to non-pipelined MACCs with a total latency of 1 clock cycle.
  • 23.
void sequentialMatrixVectorMultiply( int num_matrix_rows, int num_matrix_cols,
                                     const float *m[ ], const float v[ ], float r[ ] )
{
    int row, col;

    for( row = 0; row < num_matrix_rows; row++ )
    {
        for( col = 0; col < num_matrix_cols; col++ )
        {
            r[row] += m[row][col] * v[col];   // matrix is processed in ROW-MAJOR order
        }
    }
}

Figure 21 : showing the code description of the sequential model of computation of the matrix-vector multiplication algorithm

Figure 22 : showing the dependency graph of the matrix-vector multiplication algorithm's sequential model of computation (for a 4x4 matrix and 4x1 vectors)

However, as can be seen from figure 20 above, in matrix-vector multiplication each element of the matrix is a multiplicand of strictly one inner-product in one vector dot-product, and thus, with reference to the program of figure 21, each element of the matrix is strictly a multiplicand of one MACC operation in one specific iteration of the inner-loop within one specific iteration of the outer-loop. This is in contrast to the sequential (Single MACC) FIR filter algorithm, where each input sample is a multiplicand of one inner-product in N successive vector dot-products.
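Because m is declared here as an array of row pointers, a caller has to set those pointers up before invoking the function; a hypothetical driver (not from the report) might look like this:

/* Hypothetical usage sketch for figure 21's code: a 4x4 matrix stored as
 * four rows, addressed through an array of row pointers. r[] must be
 * zeroed first, since the function accumulates into it. */
int main( void )
{
    float matrix[4][4] = {
        { 1, 0, 0, 2 },
        { 0, 1, 0, 3 },
        { 0, 0, 1, 4 },
        { 0, 0, 0, 1 },
    };
    const float *m[4] = { matrix[0], matrix[1], matrix[2], matrix[3] };
    const float v[4]  = { 1.0f, 2.0f, 3.0f, 1.0f };   /* x, y, z, w */
    float r[4]        = { 0, 0, 0, 0 };

    sequentialMatrixVectorMultiply( 4, 4, m, v, r );  /* r = {3, 5, 7, 1} */
    return 0;
}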
  • 24. The multiplicand column-vector of a matrix-vector multiplication is analogous to the coefficient vector used in the FIR filter algorithm, as all vector dot-products performed by both algorithms multiply these vectors by another.

Section 3.2 : Exploiting the Inherent Parallelism

The Transposed and Systolic FIR filter implementations, discussed previously in sections 2.2 and 2.3 respectively, were formed by completely unrolling the inner-loop and then skewing the outer-loop (by a factor of 1, in opposite directions) of the original sequential FIR filter code. With reference to figure 21 above, if the inner-loop of the matrix-vector multiplication algorithm is unrolled by any factor, then, as was previously discussed in section 2.2 (with regards to the FIR filter algorithm), the outer-loop has to be skewed for the MACC operations scheduled for simultaneous execution (by different MACC units) to be independent of one another. However, as already discussed, each matrix element is used as a multiplicand only once throughout the execution of the entire algorithm, and thus, unlike with the FIR filter algorithm, the direction in which the outer-loop is skewed essentially makes no difference, as each MACC unit would still have to access a separate matrix element at the start of each clock cycle. Thus, unlike with the Transposed FIR filter, the access of each matrix element (analogous to each input sample of the FIR filter) cannot be shared among all MACC units. Similarly, unlike with the Systolic FIR filter, there is no sense in feeding the matrix elements through a tapped delay line in order to amortise the overhead of accessing them.

Section 3.2.1 : Unrolling the Inner-Loop

Figure 23 below shows the dependency diagram of the matrix-vector multiplication code (in its sequential form) of figure 21 after its inner-loop has been completely unrolled (with each iteration executed on a separate MACC unit) and its outer-loop has subsequently been skewed by a factor of +1, in a way analogous to that used in creating the Systolic FIR filter. With this series of transformations, each MACC unit employed effectively processes one column of the matrix.
  • 25. Figure 23 : showing the dependency diagram of the matrix-vector multiplication algorithm's sequential model of computation after completely unrolling its inner-loop and skewing its outer-loop by a factor of +1

With reference to figure 23 above, the magenta and purple circles represent MACC units used during the spin-up and spin-down procedures respectively, whilst the red circles represent MACC units used during the steady-state. As can be seen from figure 23, execution is only in the steady-state for one clock cycle. In order to achieve better utilisation of the MACC units employed, several such matrix-vector multiplication problems could be carried out in succession to amortise the overhead of the spin-up and spin-down procedures. By doing this, the total execution time of all the problems would be approximately four times faster than executing them on a single MACC unit, as is done for the matrix-vector multiplication problem's sequential model of computation detailed previously in figures 21 and 22.

Alternatively, the inner-loop could be unrolled only by a factor of two (thus employing only two MACC units), where the sum of the first two inner-products of each vector dot-product would need to be stored as an intermediate value in a register file. This implementation of the matrix-vector multiplication algorithm would be approximately twice as fast as the implementation of its sequential model of computation, and without as much need to chain together several problems to amortise the spin-up and spin-down latency. An advantage of this implementation (regardless of the factor the inner-loop is unrolled by) is that each MACC unit uses the same vector-element throughout processing its respective column of the matrix, and thus the fetch operation for each vector element is amortised over the execution of the entire problem.
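A minimal C sketch (not from the report) of the unroll-by-two variant just described: the work of each dot-product is split between two notional MACC units, with the first unit's partial sum held as the intermediate register-file value mentioned above.

/* Inner-loop unrolled by a factor of two for a 4x4 matrix: the two halves
 * of each dot-product are independent, so they could execute on two MACC
 * units in parallel; 'partial' models the intermediate register-file value. */
void innerUnrolledByTwo( const float *m[ ], const float v[4], float r[4] )
{
    for( int row = 0; row < 4; row++ )
    {
        float partial = m[row][0] * v[0] + m[row][1] * v[1];   /* MACC unit 1 */
        float rest    = m[row][2] * v[2] + m[row][3] * v[3];   /* MACC unit 2 */
        r[row] = partial + rest;
    }
}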
  • 26. Section 3.2.2 : Unrolling the Outer-Loop

Figure 24 below shows the dependency diagram of the implementation that results if instead the outer-loop is completely unrolled, such that each MACC unit employed processes one row of the matrix.

Figure 24 : showing the dependency diagram of the matrix-vector multiplication algorithm's sequential model of computation after its outer-loop is completely unrolled

As can be seen from figure 24 above, there are no dependences across separate iterations of the outer-loop, and thus after this unrolling there is no need to skew any instance of the inner-loop. Therefore execution is always in the steady state (as only spatial parallelism is employed), meaning that all MACC units are always utilised during execution. This implementation of the matrix-vector multiplication algorithm would be four times faster than the implementation of its sequential model of computation, and there is no need to amortise any spin-up and spin-down latency over the execution of multiple problems, as was the case for the implementation that results from unrolling the inner-loop. Another advantage of this implementation is that each vector-element is fetched only once (as is also the case when the inner-loop is unrolled), as the MACC units employed always share the same vector-element argument.
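A minimal sketch (not from the report) of the fully unrolled outer loop for the 4x4 case: the four row computations share each v[col] but have no data dependences between them, so each line could be assigned to its own MACC unit.

/* Outer-loop completely unrolled for a 4x4 matrix: the four accumulations
 * are mutually independent, so MACC units 1-4 can each take one row and
 * share the broadcast vector element v[col] on every cycle. */
void outerFullyUnrolled( const float *m[ ], const float v[4], float r[4] )
{
    r[0] = r[1] = r[2] = r[3] = 0.0f;

    for( int col = 0; col < 4; col++ )
    {
        r[0] += m[0][col] * v[col];   /* MACC unit 1 */
        r[1] += m[1][col] * v[col];   /* MACC unit 2 */
        r[2] += m[2][col] * v[col];   /* MACC unit 3 */
        r[3] += m[3][col] * v[col];   /* MACC unit 4 */
    }
}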
3.3 : Using Pipelined MACC Units

Until now, the MACC units discussed have been non-pipelined, where new operands are only issued to such a unit after it has finished processing its previous operands. The Transposed, Systolic and Semi-Parallel FIR filter implementations discussed previously in section 2 used a pipeline of non-pipelined MACC units. As was discussed in section 2, pipelining temporally overlaps (skews) multiple execution threads, and thus once the initial latency (whilst the pipeline is filled) has been endured, the subsequent throughput achievable is n times greater (where n is the degree of pipelining employed and the pipeline is balanced). This results from the non-pipelined execution unit being segmented into n stages, where each stage contributes an nth of the overall latency, allowing the unit to be clocked at n times the rate of the non-pipelined equivalent. Thus if a single MACC unit was pipelined by a degree of n (and clocked at n times the clock frequency), once the initial latency of filling the pipeline had been endured, the subsequent throughput would be n times greater than that possible with the non-pipelined version. For simplicity, every instance of the word pipeline throughout this document will refer to a balanced pipeline, unless otherwise stated. The benefit of a pipelined MACC unit over its non-pipelined equivalent is depicted below in figure 25.

[Figure 25 contrasts a non-pipelined MACC completing one operation every four clock cycles with a 4-stage pipelined MACC completing one operation per clock cycle once the pipeline is filled.]

Figure 25 : showing an example to illustrate the benefit of a pipelined MACC over its non-pipelined equivalent (with n = 4)

3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit

As discussed previously in section 2.2, the series of MACC operations that a specific vector dot-product is comprised of have data-dependencies. Thus if the matrix-vector multiplication problem's sequential model of computation (shown previously in figure 21) was executed on a pipelined MACC (consisting of a pipelined multiplier followed by a pipelined adder), the achievable throughput would not be any higher than with the non-pipelined version of the MACC. This is illustrated below in figure 26. When executing the sequential model of computation, the pipelined MACC effectively skews the outer-loop.
[Figure 26 compares the non-pipelined and pipelined (n = 4) execution of four dependent MACC operations. In both cases: MACC0 is R(1) += M(1, 1) * V(1); MACC1 is R(1) += M(1, 2) * V(2); MACC2 is R(1) += M(1, 3) * V(3); MACC3 is R(1) += M(1, 4) * V(4); where R(1) begins as zero.]

Figure 26 : showing an example to illustrate the effect of issuing successive MACC instructions (that are dependent) to a pipelined MACC unit

For simplicity, it is assumed that all three arguments are supplied as part of a MACC instruction when it is issued to a MACC unit.

The code description of the matrix-vector multiplication algorithm shown below in figure 27 is a re-write of the sequential code shown previously in figure 21. The outer and inner loops have been swapped around, which requires the matrix to be processed in column-major order (as opposed to row-major order, as was the case for the sequential code). The loops have been swapped so that dependent MACC operations are scheduled as far apart as possible, and this is illustrated in the dependency diagram of this code, which is shown below in figure 28.
void pipelinedMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                   const float m[num_matrix_rows][num_matrix_cols],
                                   const float v[num_matrix_cols],
                                   float r[num_matrix_rows])
{
    int row, col;

    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r[row] += m[row][col] * v[col];   // matrix is processed in COLUMN-MAJOR order
        }
    }
}

Figure 27 : showing the code description of the sequential matrix-vector multiplication algorithm re-written for execution on a single pipelined MACC unit

Figure 28 : showing the dependency diagram of the code of figure 27 above, with the degree of pipelining n = 4

As can be seen from figure 28, this re-written code description of the matrix-vector multiplication algorithm essentially overlaps the execution of the vector dot-products by interleaving their constituent MACC operations. In this way, the first inner-product of all the vector dot-products in turn is calculated and accumulated, after which the same is done for the second inner-product of all the vector dot-products, and so on.

The number of vector dot-products a particular matrix-vector multiplication consists of is equal to the number of elements in the resultant vector, which is equal to the number of rows in the matrix.
With regards to the matrix-vector multiplication algorithm detailed in figures 27 and 28 above, as this number (represented by the variable num_matrix_rows in figure 27) is increased, dependent MACC operations are scheduled further apart in time. If this number is greater than or equal to the number of pipeline stages within the MACC, then the optimum throughput of the MACC can be achieved (n times that of its non-pipelined version), as each time a MACC instruction is issued, all those it depends on will have completed, allowing it to be executed immediately.

As has been demonstrated, if the code is written such that dependent operations are scheduled far enough apart, the use of a pipelined MACC can increase throughput by a factor of n (where n is the degree of pipelining).
Section 4 : The Floating Point Unit

The implementations of the matrix-vector multiplication algorithms discussed previously in section 3 are all based on what were termed MACC units, which in concept have the capabilities of triple-word read, write and multiply-accumulate. This section details the design and implementation of a Floating-Point Unit (FPU) that acts as one of those MACC units, and is pipelined in accordance with the findings of section 3.

As was seen previously in section 1.2.1, multiply-accumulate is not the only elemental operation required to implement high-level OpenGL functions (and graphics rendering functions in general). With this in mind, the FPU has been designed in such a way that its instruction set is easy to extend. At the core of the FPU is a 5-stage pipelined multiplier and a 3-stage pipelined adder. These may be used in immediate succession to execute a multiply-accumulate (MACC) instruction, or individually to execute either a multiply or an add instruction. These particular pipeline lengths were chosen by considering the type of OpenGL program it was envisaged would be executed on the FPU during its developmental phase, and the FPU has been designed such that these pipeline lengths can easily be changed. Figure 29 below depicts this initial FPU architecture.

Figure 29 : showing the initial design of the FPU

The control unit of the FPU is modelled as a program written in M-code, which is encapsulated within the Embedded Matlab Function block labelled FPU_Control_Unit on the diagram shown in figure B1 of appendix B. M-code is Matlab's equivalent to C-code, and the inputs to the FPU_Control_Unit block are passed through to and processed by the embedded program. This program is executed and assigns values to its outputs once per step of Simulink's simulation time. One of these simulation time steps is analogous to one clock cycle.
The instruction format of the FPU has been devised so that it is compliant with the IBM PowerPC interface standard, and is shown below in figure 30. This has been done to allow for the future possibility of employing the FPU alongside a PowerPC RISC as the latter's Fabric Co-Processor Module (FCM), as the PowerPC is known to run OpenGL code in this configuration.

[Figure 30 shows the FPU's instruction word: MNEMONIC in bits 26 to 21, RT/S in bits 20 to 14, RA in bits 13 to 7, and RB in bits 6 to 0.]

Figure 30 : showing the format of the FPU's instruction word

With reference to figure 30 above, the mnemonic field of an arithmetic instruction tells the FPU_Control_Unit program which of the three types of arithmetic instruction it is. The RT/S field is the register number of the instruction's destination register (and third source register of a MACC), and the RA and RB fields are the instruction's source register numbers.
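As an illustration, the C sketch below unpacks an instruction word according to the bit positions of figure 30 (the function and variable names are illustrative, not those of the actual M-code); the 7-bit register fields are wide enough to address the 100-word register file described below.

    // decode the fields of an instruction word (bit positions per figure 30)
    void decode_instruction(unsigned iw,
                            unsigned *mnemonic, unsigned *rt_s,
                            unsigned *ra, unsigned *rb)
    {
        *mnemonic = (iw >> 21) & 0x3F;  // bits 26..21
        *rt_s     = (iw >> 14) & 0x7F;  // bits 20..14
        *ra       = (iw >>  7) & 0x7F;  // bits 13..7
        *rb       =  iw        & 0x7F;  // bits 6..0
    }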
The word length of all data processed by the FPU is 32 bits (in accordance with the IEEE 754 single-precision representation of floating-point numbers). Also part of the FPU is a 100-word register file, from which all data is read and to which all data is written. Throughout this section it is assumed that all input data has already been loaded into this register file, with section 5.2.6 later describing a DMA unit that was designed and developed to transfer data in and out of the register file without holding the FPU up. The register file facilitates three simultaneous reads and one write per clock cycle. Throughout section 3 it was assumed that when a MACC instruction was issued to a pipelined MACC unit, all three arguments were supplied at once. However, when the FPU begins the execution of a MACC instruction, only the two multiplicand arguments are fetched immediately, and the third argument is fetched at the start of the accumulate stage. Three reads on the register file per clock cycle must therefore be provided so that the FPU is able to begin executing a new MACC or multiply instruction in the same clock cycle that it starts executing the accumulate stage of a down-stream MACC instruction.

4.1 : Dealing with Hazards

The ability to issue the FPU any instruction that it supports in any clock cycle abstracts the programmer from the architecture. This allows them to get working code earlier in the design cycle (before optimisation), as opposed to code only working once it is exactly optimised for this FPU. In the future, if a compiler is developed, this capability will allow the FPU to execute code written for other architectures. The FPU has this capability as a result of being designed to prevent structural and data hazards from manifesting into errors.

The FPU has been designed to deal with the structural hazard that occurs when the new instruction issued is an add, and the accumulate part of a MACC instruction is due to begin in that clock cycle. The conflict here is that the accumulate part of a MACC instruction also requires its associated input arguments to be entered into the adder unit. In this event, priority is given to the accumulate, and the FPU is stalled, whereby it does not allow new instructions to be issued until the adder becomes available and execution of the pending add instruction has subsequently begun.

The FPU has also been designed to deal with the data hazards that arise when a newly issued instruction has a variable (represented by a specific register) that is the output variable of an instruction in the Execution_pipeline at that time. These data hazards are prevented from manifesting into errors by the FPU stalling whenever such an instruction is issued to it. This prevents any instruction executed from fetching one of its source registers before the register's contents have been updated by a down-stream instruction, and similarly from writing to a register before its contents have been fetched by the accumulate stage of a down-stream MACC instruction.

4.1.1 : Using the Scoreboard to Detect Data Hazards

The Scoreboard is essentially a table that records which registers in the register file are the destination register of an instruction currently in the pipeline, and when they will be written to (updated). The FPU_Control_Unit program maintains the FPU's Scoreboard as a binary vector, with one 6-bit field for each of the 100 registers of the register file. Each of these fields is split into two sub-fields, as illustrated below in figure 31.
[Figure 31 depicts the Scoreboard conceptually as a table with a count down entry and an execution unit entry for each register of the register file, and as the 600-bit binary vector that actually represents it within the FPU_Control_Unit program, with the exec_unit encodings 01 = adder and 10 = multiplier.]

Figure 31 : illustrating the concept of the Scoreboard and how it is represented

As can be seen above in figure 31, the 2 least significant bits of a field (its exec_unit sub-field) identify the execution unit that will produce the result to be written into the field's respective register, and the 4 most significant bits of a field (its count down sub-field) give the number of clock cycles before that write-back operation will occur. After consideration of the type of NLPs to be executed on the FPU (as detailed previously in sections 2 and 3), for simplicity the FPU was not designed with the capability of arbitrating between multiple write-back operations. Thus only programs that produce strictly no more than one write-back operation per clock cycle are supported.

The position of a field within the Scoreboard vector as a whole identifies the register it represents, where the least significant field represents register R1, and successive fields represent the registers in ascending order. In each simulation time-step the Scoreboard is updated to decrement any non-zero count down sub-fields, and to add the details of a new instruction if one is submitted to the Execution_pipeline by the FPU_Control_Unit program.
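One way such a 6-bit field could be packed and unpacked is sketched below in C (the helper names are illustrative; the actual Scoreboard segments are manipulated as integer arithmetic within the M-code):

    // count down occupies bits 5..2 of a field, exec_unit bits 1..0
    // (exec_unit encodings per figure 31: 01 = adder, 10 = multiplier)
    unsigned make_field(unsigned count_down, unsigned exec_unit)
    {
        return ((count_down & 0xF) << 2) | (exec_unit & 0x3);
    }
    unsigned field_count_down(unsigned field) { return (field >> 2) & 0xF; }
    unsigned field_exec_unit(unsigned field)  { return field & 0x3; }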
As stated previously in section 4, there are 100 registers in the register file, and thus the Scoreboard binary vector consists of 6 x 100 = 600 bits. In Simulink, unsigned integers are represented using 32 bits, and so to model this 600-bit vector it was broken up into twenty 30-bit segments (five 6-bit fields per segment), each of which fits within one 32-bit unsigned integer. This is illustrated below in figure 32.

Figure 32 : illustrating how the Scoreboard binary vector is split into 20 segments in order to represent it in Simulink

With reference to figure 32 above, Scoreboard_1 holds the status of registers R1 to R5, Scoreboard_2 holds the status of registers R6 to R10, and so on for successive segments up to Scoreboard_20. Each of the twenty Scoreboard segments is represented within the FPU_Control_Unit program as a persistent variable. Figure 33 below shows how the Scoreboard is updated each time the FPU submits a new instruction to its Execution_pipeline.
[Figure 33 steps through the Scoreboard state for registers R1 to R5 over three successive clock cycles as the following instructions are issued: R5 += R1 * R2 in clock cycle 0, R1 = R3 * R4 in clock cycle 1, and R4 = R2 + R3 in clock cycle 2.]

Figure 33 : showing an example to illustrate how the FPU updates the Scoreboard each time a new instruction is submitted to the Execution_pipeline

Figure 33 above shows how the state of the Scoreboard (for the registers of concern) changes over three successive clock cycles, in which a MACC, a multiply and an add instruction are issued to the FPU and subsequently submitted to the Execution_pipeline, where their execution begins. As can be seen in figure 33, each time an instruction is submitted to the Execution_pipeline, the count down entry in the Scoreboard for that instruction's destination register is set to the latency of the instruction (in clock cycles), and this value is subsequently decremented in each clock cycle thereafter. With reference to figure 33, registers R4, R1 and R5 will be written to in clock cycles 6, 7 and 9 respectively. The exec_unit entry for the instruction's destination register is set to the code representing the particular execution unit within the Execution_pipeline that will produce its result, as detailed previously in figure 31.

A synopsis of the FPU_Control_Unit sub-function updateScore(), which is responsible for updating the Scoreboard (in the way shown in figure 33 above) each time a new instruction is submitted to the Execution_pipeline, is shown in figure B2, followed by a description, in appendix B.
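In outline, updateScore() amounts to writing the two sub-fields of the destination register's Scoreboard field. A C sketch of the idea is shown below; the unpacked arrays are an illustrative simplification of the packed segments, and the example latencies are those visible in figure 33.

    static unsigned count_down[101], exec_unit[101];  // indexed by register number

    // called when an instruction is submitted to the Execution_pipeline;
    // latency is the count down value of figure 33 (e.g. 9 for a MACC,
    // 6 for a multiply, 4 for an add)
    void update_score(unsigned dest_reg, unsigned latency, unsigned unit_code)
    {
        count_down[dest_reg] = latency;    // clock cycles until write-back
        exec_unit[dest_reg]  = unit_code;  // 01 = adder, 10 = multiplier
    }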
The code of the sboardCycle() sub-function of FPU_Control_Unit is shown in figure B3, followed by a description, in appendix B. This sub-function is executed every clock cycle (once per Scoreboard segment) to update the Scoreboard so as to reflect the passing of one clock cycle. This is done by decrementing all non-zero count down sub-fields throughout the entire Scoreboard. A count down value of 1 is encountered when its parent field represents a register on which a write-back operation is scheduled to occur in the current clock cycle (as the count down value will become zero when it is decremented in that same execution of sboardCycle()), and in this event sboardCycle() asserts the write-back operation. After asserting a write-back operation, sboardCycle() sets the field's exec_unit sub-field to zero. Thus all Scoreboard fields whose respective registers are not destination registers of an instruction currently in the Execution_pipeline are maintained with a zero value in both sub-fields.
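A C sketch of this per-cycle behaviour, continuing with the illustrative unpacked arrays of the previous sketch (assert_write_back() is a stand-in for the actual write-back mechanism, not a function from the design):

    extern void assert_write_back(unsigned reg, unsigned unit_code);

    // executed once per clock cycle over the whole Scoreboard
    void sboard_cycle(void)
    {
        for (int reg = 1; reg <= 100; reg++) {
            if (count_down[reg] == 1) {
                assert_write_back(reg, exec_unit[reg]);  // result written back this cycle
                exec_unit[reg] = 0;                      // field returns to the idle state
            }
            if (count_down[reg] > 0)
                count_down[reg]--;                       // one clock cycle has passed
        }
    }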
4.1.2 : Managing Data Hazards

As discussed at the beginning of this section, when the Controller (discussed later in section 5) issues the FPU with a new instruction, FPU_Control_Unit checks the Scoreboard to decide whether or not submitting the instruction to the Execution_pipeline may cause an error to occur due to a data hazard. This is illustrated below in figure 34.

[Figure 34 repeats the format of figure 33 for the instruction sequence: R5 += R1 * R2 issued in clock cycle 0, R1 = R3 * R4 issued in clock cycle 1, and R5 = R1 + R2 issued in clock cycle 2.]

Figure 34 : showing an example to illustrate how the Scoreboard is checked to prevent data hazards

With reference to figure 34 above, the multiply instruction issued to the FPU in clock cycle 1 (represented by the blue bar) passes the Scoreboard check (assuming there are no instructions in the Execution_pipeline before clock cycle 0), because in clock cycle 1 neither of its source registers nor its destination register is the target of a pending write-back operation. However, the add instruction issued to the FPU in clock cycle 2 (represented by the green bar) fails the Scoreboard check for two reasons. Firstly, one of its source registers (R1) is the target of the write-back operation scheduled in clock cycle 7 by the multiply instruction. If this add instruction was submitted to the Execution_pipeline in clock cycle 2, it would fetch and use the contents of R1 in that same clock cycle, before they had been updated in clock cycle 7 by the multiply instruction. In order to abstract the programmer from the architecture, it must be assumed that they do not consider the latency of the instructions, and that their intention in this event would be for the result of the multiply instruction to be added to R2 by the add instruction. Thus, if this was the only cause of the Scoreboard check failure, the add instruction could be submitted to the Execution_pipeline (without causing a data hazard) in clock cycle 8 or thereafter.

However, a second cause of the add instruction's Scoreboard check failure is that its destination register (R5) is the target of a write-back operation scheduled in clock cycle 9 by the MACC instruction (represented by the red bar). A data hazard exists here because the destination register of a MACC instruction is first used as its third source register. As such, if the add instruction was submitted to the Execution_pipeline whilst the MACC instruction was still being executed, it could write to R5 before this register had been fetched for the accumulate stage of the MACC instruction. For simplicity, this event always results in a Scoreboard check failure, regardless of whether or not the instruction's write-back would occur after the register fetch for the accumulate stage of the MACC instruction.

The sub-function of FPU_Control_Unit that is employed to check the Scoreboard for a particular register is checkScore(), a synopsis of which is shown in figure B4, followed by a description, in appendix B.
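At its core the check asks, for each register number the new instruction carries, whether a write-back is still pending. A C sketch in the same illustrative style (checkScore() itself operates on one register at a time within the packed segments):

    // returns 1 if the instruction may be submitted without a data hazard:
    // neither source register, nor the destination register, may be the
    // target of a pending write-back
    int passes_scoreboard_check(unsigned ra, unsigned rb, unsigned rt_s)
    {
        return count_down[ra]   == 0 &&
               count_down[rb]   == 0 &&
               count_down[rt_s] == 0;
    }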
4.1.3 : Managing Structural Hazards

As well as the data hazards discussed previously in section 4.1.1, a potential structural hazard also exists, as the Execution_pipeline's adder is used both in executing the add instruction and in the accumulate stage of the MACC instruction. Figure 35 below shows an example to illustrate this structural hazard, and how it is dealt with.

[Figure 35 shows R5 += R1 * R2 issued in clock cycle 0, and the outcome of issuing either a multiply (R1 = R3 * R4) or an add (R4 = R2 + R3) in clock cycle 5.]

Figure 35 : showing an example to illustrate the structural hazard concerning the use of the Execution_pipeline's adder by both MACC and add instructions

Figure 35 above shows a MACC instruction (issued in clock cycle 0) split into its two stages, where the red bar represents the multiply-stage and the purple bar represents the accumulate-stage. The dark red section of the combined bar represents the clock cycle in which the last stage of the multiply and the first stage (register fetch) of the accumulate are conducted in parallel. This is done so as to hide the latency of the accumulate-stage's register fetch. Figure 35 shows the outcome of issuing two alternative instructions to the FPU in that clock cycle (5). As is the case with both the multiply and MACC instructions, if the instruction issued in this clock cycle does not require immediate use of the adder, then it is eligible for submission to the Execution_pipeline. However, as can be seen in figure 35, this is not the case with the add instruction, and as such FPU_Control_Unit would not submit an add to the Execution_pipeline in this situation, regardless of whether or not it passed its Scoreboard checks. Priority is always given to the accumulate-stage of a MACC in this way for simplicity. In the situation depicted by figure 35, the earliest time after this clock cycle that an add instruction could be submitted to the Execution_pipeline would be clock cycle 6.

Details of how the FPU program implements the execution of the different instructions and the management of hazards are contained in appendix B.
Section 5 : The Controller

The Controller is responsible for issuing the FPU with the next instruction to be executed. If the FPU stalls in the event of detecting a structural or data hazard, it asserts a '1' on its stall output, and the Controller must re-issue the stalled instruction in subsequent clock cycles until the FPU submits it to its Execution_pipeline and asserts a '0' back on its stall output. In the clock cycle after this submission occurs, the Controller must issue the FPU with the next instruction to be executed.

5.1 : Initial Look-Up Table Design

The initial design of the Controller was essentially a look-up table, where every instruction of the program to be run was stored sequentially in program memory. This look-up table design of the Controller is shown below in figure 36.

Figure 36 : showing the initial look-up table design of the Controller

For simplicity, during this developmental phase the PROG_COUNTER (counter) was initialised with its count_from and count_to block parameters before running the simulation. Similarly, the PROG_MEMORY (single-port RAM) was initialised with the sequence of program instructions through its initial_value_vector block parameter. The output of PROG_MEMORY is separated out into its constituent fields (mnemonic, RT/S, RA and RB) outside of FPU_Control_Unit, as bit-slicing is not supported in Simulink's Embedded Matlab Function block.

With reference to figure 36 above, when the stall input is '0', both PROG_COUNTER and PROG_MEMORY are enabled, allowing the program counter to advance by 1 and the output register of the RAM block to be written to. As such, when the stall signal is '0', successive instructions of the program are output by PROG_MEMORY at a rate of one per clock cycle. Figure 37 below uses an example to show how the Controller deals with an FPU stall.
[Figure 37 plots, against clock cycles 0 to 8, the stall signal, the program counter values a0 to a6 and the instruction words issued, i0 to i5.]

Figure 37 : showing an example to illustrate how an FPU stall is dealt with by the Controller

In the example shown in figure 37 above, the FPU is issued with instruction i2 in clock cycle 3 but cannot submit it to the Execution_pipeline for two successive clock cycles (3 and 4). As can be seen in figure 37, by the time the FPU stalls and asserts a '1' on the stall signal, PROG_COUNTER has already advanced to the address of the next instruction (a3) before the count is stopped. However, as the output register of PROG_MEMORY is disabled, PROG_MEMORY's output remains the word-value of the stalled instruction. In the clock cycle in which the FPU does submit i2 to the Execution_pipeline, it asserts a '0' back on the stall signal, thus enabling PROG_MEMORY's output register, which is then written with the word-value of the next instruction to be executed. This can be seen in figure 37, where the stalled instruction (i2) is submitted for execution in clock cycle 5, and in the subsequent clock cycle the FPU is issued with the next instruction (i3). A sketch of this issue logic is shown below.
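The per-cycle behaviour of this look-up table Controller can be summarised by the C sketch below (the names are illustrative; in the actual design this is simply the counter and RAM blocks gated by the stall signal, so the sketch glosses over the one-cycle timing detail noted above):

    // one call per clock cycle; pc and the issued word persist between calls
    void controller_cycle(unsigned stall, const unsigned prog_memory[],
                          unsigned *pc, unsigned *issued_word)
    {
        if (stall == 0) {
            *issued_word = prog_memory[*pc];  // RAM output register is written
            *pc += 1;                         // PROG_COUNTER advances
        }
        // while stall == 1 both are held, so *issued_word keeps presenting
        // the stalled instruction to the FPU
    }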
5.2 : Optimising the Controller for Running Geometric Transformation Programs

As the problem size (the number of instructions the program consists of) increases, storing every single instruction requires a program memory capacity bigger than is practical. For example, the matrix-vector multiplication program shown in figure 21 and discussed previously in section 3.1 (where the matrix is 4x4 and the vector is 4x1) is executed using 16 separate MACC instructions (ignoring any load and store operations required to get data in and out of the register file). Thus the size of the program memory required to store this program when completely unrolled is 16 instruction words. Figure 38 below illustrates a similar program to that shown in figure 21, which multiplies eight 4x1 vectors by the same 4x4 matrix. This is a typical example of the program run in carrying out both the modelview and projection transformations in the per-vertex operations stage of the OpenGL pipeline, as discussed previously in section 1.2.2.

[Figure 38 shows the 4x4 matrix M multiplying the eight 4x1 vectors V1 to V8 to produce the eight 4x1 result vectors R1 to R8.]

Figure 38 : illustrating an example of the matrix-vector multiplication program carried out by OpenGL's modelview and projection transformations

With reference to figure 38 above, the M matrix represents either the modelview or projection matrix depending on which transformation is to be performed, and vectors V1 to V8 represent the object or eye coordinate vectors of an object in the scene. The object being transformed in this particular program has eight vertices, the simplest example of which would be a cube. Figure 39 below shows a code description of this program.

void pipelined1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                     const float m[num_matrix_rows][num_matrix_cols],
                                     const float *v1, const float *v2, const float *v3,
                                     const float *v4, const float *v5, const float *v6,
                                     const float *v7, const float *v8,
                                     float *r1, float *r2, float *r3, float *r4,
                                     float *r5, float *r6, float *r7, float *r8)
{
    int row, col;

    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r1[row] += m[row][col] * v1[col];   // matrix m is processed in COLUMN-MAJOR order
            r2[row] += m[row][col] * v2[col];
            r3[row] += m[row][col] * v3[col];
            r4[row] += m[row][col] * v4[col];
            r5[row] += m[row][col] * v5[col];
            r6[row] += m[row][col] * v6[col];
            r7[row] += m[row][col] * v7[col];
            r8[row] += m[row][col] * v8[col];
        }
    }
}

Figure 39 : showing a code description of the example geometric transformation program depicted previously in figure 38
Previously in section 3.3.1, an analysis of how to optimise the matrix-vector multiplication algorithm for execution on a pipelined MACC unit was detailed. In line with the findings of that analysis, the two loops in the code of figure 39 above are arranged such that the matrix is processed in column-major order, so as to schedule dependent MACC instructions further apart in time and thus avoid periods of latency due to FPU stalls. Moreover, this program has eight vectors, and the eight separate matrix-vector multiplication problems are interleaved so as to schedule the dependent MACC instructions (within each individual problem) even further apart in time, allowing for an even greater overall pipeline depth and thus a higher throughput to be achieved.

Completely unrolling both loops will yield the fastest execution speed, as it entirely removes the overhead of having to test loop conditions and execute branch operations (to set the program counter back to the beginning of a loop). These overheads can also be eliminated without unrolling any loops by implementing the loop tests and branch operations within the Controller, although this would be at the expense of added hardware resources. As discussed previously, however, the disadvantage of completely unrolling both loops is that the size of the program memory required is increased by a factor equal to the degree of unrolling. As can be seen from figure 39 above, the inner-loop of the program contains 8 MACC instructions, and so if this program was represented with both loops completely unrolled, the size of the required program memory would be 8x4x4 = 128 instruction-words.

With both loops completely unrolled, transformations of larger sizes (i.e. where there are more vertices in the scene overall) could be handled by running the program on small sets of vectors (vertices) at a time, thus reducing the number of instructions that need to be stored at any one time, and likewise the size of the program memory required. This approach, however, would introduce the overhead of switching between these smaller programs.

5.2.1 : The Use of Induction Variables

As can be seen from the code of figure 39 above, all of the instructions are MACC instructions. Thus, when the program is represented with both loops completely unrolled, the instructions stored in program memory would all have the same mnemonic, and differ only in their three source/destination register fields (RT/S, RA and RB), which always address the same arrays. To simplify addressing arrays in NLPs, the majority of DSP compilers introduce induction variables, which by definition are derived from the loop index values. Considering the geometric transformation program of figure 39 for just one vector, a code description illustrating how induction variables would be used to address the output array and the two input arrays is shown below in figure 40.
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in COLUMN-MAJOR order
        p = (col * num_matrix_rows) + row;    // p is an induction variable
        *(r1 + row) += *(m + p) * *(v1 + col);
    }
}

Figure 40 : showing an example to illustrate the use of induction variables for addressing arrays

As can be seen in figure 40 above, the m array is indexed by adding its respective induction variable p to its base pointer. Although p is the only new variable introduced, the r1 and v1 arrays are also addressed in this way, where their respective induction variables are exactly the values of the loop indices. With reference to figure 40 above, it can be seen how the use of induction variables allows the successive program instructions to be generated from only a single generic instruction-word stored in program memory, of the form shown below in figure 41.

[Figure 41 shows the generic instruction-word: MACC in the MNEMONIC field (bits 26 to 21), R1_base_pointer in the RT/S field (bits 20 to 14), M_base_pointer in the RA field (bits 13 to 7) and V1_base_pointer in the RB field (bits 6 to 0).]

Figure 41 : showing the single generic instruction-word from which all program instructions could be derived for the program of figure 40

With reference to figure 41 above, the Controller would pass the mnemonic field straight on to the FPU, but between issuing successive instructions it would have to evaluate the values of the RT/S, RA and RB fields, which would require additional hardware resources. If this penalty was migrated into software by issuing the FPU with instructions to calculate the induction variable values, the need for extra resources (apart from the extra program memory required) would be eliminated at the cost of increased execution time. This is not an option, however, as the FPU does not have a register-address internal data-format. With reference to the code of figure 40 above, for this program the extra hardware resources would amount to two adders for evaluating the R1 array addresses, likewise another two adders for evaluating the V1 array addresses, and an adder and a multiply-add unit for evaluating the M array address.

5.2.2 : Performing Strength Reduction on Induction Variables

In applying strength reduction to all three induction variables of the program of figure 40, the overhead cost of each inner-loop iteration is reduced. Figure 42 below shows the code description after all three induction variables have undergone strength reduction.
p = 0;
r = 0;
c = 0;

for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        *(r1 + r) += *(m + p) * *(v1 + c);    // matrix m is processed in COLUMN-MAJOR order
        r++;
        p++;
    }
    r = 0;
    c++;
}

Figure 42 : showing the code description of the matrix-vector multiplication algorithm after all three induction variables have undergone strength reduction

As can be seen from figure 42 above, there is no longer any need for a multiplication in evaluating successive values of p, and thus the additional hardware required to evaluate the p induction variable is now down to two adders. To facilitate this strength reduction on p, the m matrix must be stored in column-major order. Strength reduction also removes any dependencies of induction variables on the loop indices (as is the case for those associated with the R1 and V1 arrays), which provides more flexibility, as not all programs will have induction variables that directly correspond to the loop indices.

5.2.3 : Optimising a Geometric Transformation for the Controller

Considering the code shown in figure 39 above, where eight separate matrix-vector multiplication problems are interleaved, this could be executed using separate base pointers for the arrays of the separate problems, whilst using the same induction variables across all problems. A code description of this solution is shown below in figure 43.
p = 0;
r = 0;
c = 0;

for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        // matrix m is processed in COLUMN-MAJOR order
        *(r1 + r) += *(m + p) * *(v1 + c);
        *(r2 + r) += *(m + p) * *(v2 + c);
        *(r3 + r) += *(m + p) * *(v3 + c);
        *(r4 + r) += *(m + p) * *(v4 + c);
        *(r5 + r) += *(m + p) * *(v5 + c);
        *(r6 + r) += *(m + p) * *(v6 + c);
        *(r7 + r) += *(m + p) * *(v7 + c);
        *(r8 + r) += *(m + p) * *(v8 + c);
        r++;
        p++;
    }
    r = 0;
    c++;
}

Figure 43 : showing a code description of the program with eight matrix-vector multiplication problems interleaved, with all induction variables having undergone strength reduction

With reference to figure 43 above, to run this code the Controller would have to store one generic instruction word of the form shown previously in figure 41 for each of the eight problems (i.e. one instruction word per vertex), because the corresponding array base pointers have different values across the eight problems. The number of vertices an object has, or the number in a scene overall, can be huge (reaching the tens of thousands in very detailed scenes), so it is desirable that only one generic instruction word be stored in program memory, from which all successive program instructions are derived. In order to achieve this, the corresponding arrays across the different problems need to be combined. Considering the way the geometric transformation interleaves the execution of the problems, the best way to combine the arrays (in order to keep the expressions for evaluating the induction variable values as simple as possible) is to interleave them. In this way, the register numbers of successive array elements accessed can be evaluated largely by simple increment operations. This is shown below in figure 44, which illustrates an example register file arrangement for the three arrays.
[Figure 44 shows the M matrix elements M(1,1), M(2,1), ... occupying registers R1 upwards in column-major order, the source-vector elements V1(1) to V8(1), V1(2) to V8(2), ... occupying registers R17 upwards, and the result-vector elements R1(1) to R8(1), R1(2) to R8(2), ... occupying registers R49 upwards.]

Figure 44 : showing an example arrangement within the FPU's register file, where the three arrays of eight matrix-vector multiplication problems are interleaved together

As illustrated in figure 44 above, the eight source and corresponding eight resultant vectors are stored in the register file such that the first elements of all eight vectors are stored adjacent to each other, in the order the vector pairs are processed by the program (1 to 8), followed by the second elements of all eight vectors, and so on. The M matrix is stored in column-major order, as was discussed earlier in this section. This arrangement of the source and destination registers means that the complexity of interleaving and de-interleaving the arrays is handled outside the Controller, by whatever loads the data into and stores it out of the FPU's register file. Consequently, the register number sequences that need to be generated by the Controller are simpler than if the arrays were simply concatenated, and this will ease compiler development if it is undertaken in the future. The register numbers implied by this arrangement are sketched below.
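As a check on the arrangement, the following C sketch gives the register number of each array element, using the base registers visible in figure 44 (R1 for M, R17 for the source vectors, R49 for the results); the helper names are illustrative:

    // row, col, vec and elem are 1-based, matching the notation of figure 44
    int m_reg(int row, int col)  { return 1  + (col - 1) * 4 + (row - 1); }  // column-major
    int v_reg(int vec, int elem) { return 17 + (elem - 1) * 8 + (vec - 1); } // interleaved
    int r_reg(int vec, int elem) { return 49 + (elem - 1) * 8 + (vec - 1); } // interleaved

For example, v_reg(8, 1) gives register R24 and v_reg(1, 2) gives register R25, matching the layout of the figure.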
A code description of the program using this register file arrangement, and requiring only one generic instruction word to be stored in program memory, is shown below in figure 45.

res = 0;
res_base = 0;
p = 0;
vec_base = 0;
vec = 0;

for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        for( vec_no = 0; vec_no < num_vectors; vec_no++ )
        {
            *(r0 + res) += *(m + p) * *(v0 + vec);
            res++;
            vec++;
        }
        p++;
        vec = vec_base;
    }
    res = res_base;
    vec = vec_base = vec_base + num_vectors;
}

Figure 45 : showing a code description of the program executing eight interleaved matrix-vector multiplication problems with only one generic instruction word stored in program memory

5.2.4 : Designing the Optimal Controller

Considering the code of figure 45 above, it can be seen that, to evaluate the three induction variables between issuing successive instructions, the operations the Controller must be able to carry out on an induction variable to produce its subsequent value are: incrementing it, resetting it to a base value, and adding a constant to that base value. As was discussed previously in section 1.2.1, as well as matrix-vector multiplication, the two other prominent NLPs executed within the OpenGL pipeline (and the graphics rendering domain in general) are the FIR filter and matrix-matrix multiplication. For flexibility, and thus the ability to efficiently support a wider range of NLPs, the Controller needs the ability to perform any combination of these operations on any of the three induction variables in any one evaluation cycle between the issuing of successive instructions. These update operations are sketched below.
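A C sketch of the three update operations on a single induction variable (the operation names are illustrative, not taken from the design):

    enum iv_op { INCREMENT, RESET_TO_BASE, ADD_CONST_TO_BASE };

    // one evaluation cycle for one induction variable
    void update_iv(enum iv_op op, int *var, int *base, int k)
    {
        switch (op) {
        case INCREMENT:         *var += 1;                break;  // e.g. res++, vec++, p++
        case RESET_TO_BASE:     *var = *base;             break;  // e.g. vec = vec_base
        case ADD_CONST_TO_BASE: *var = *base = *base + k; break;  // e.g. vec = vec_base += num_vectors
        }
    }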