Design and Development of a Floating-point Co-processor
for the Acceleration of Graphics Functions
Sandip Jassar
Department of Electronics, The University of York
Academic Supervisors: Andy Tyrrell, Jonathan Dell
Xilinx Development Centre, Edinburgh
Industrial Supervisor: Richard Walke
15th June, 2005
Final Project Report for degree of MEng in Electronic and Computer
Engineering
Abstract
This report details the design and development of a processing architecture, complete
with a Controller and DMA unit. The reader is shown how the architecture was
optimised for executing Nested-Loop Programs, and in particular those found in the
geometric transformation processes stage of the OpenGL pipeline.
Contents
Section 1 : Introduction …………………………………………………………pg4
1.1 Parallel Processing for Graphics Functions ………………...pg4-5
1.2 OpenGL ………….………………………………………….pg5
1.2.1 The OpenGL Pipeline ………………………………....pg5-8
1.2.2 The Geometric Transformation Process ……………....pg8
1.2.2.1 The ModelView Transformation ……………….pg8-10
1.2.2.2 The Projection Transformation ………………….pg10
1.2.2.3 Perspective Division and the
ViewPort Transformation ………………………..pg11
1.3 Overview of Transformations on Nested Loop Programs ..pg11-13
Section 2 : FIR Filter Analysis …………………………………………………..pg13
2.1 The Single MACC FIR Filter ……………………………pg13-16
2.2 The Transposed FIR Filter ………………………………..pg16-17
2.3 The Systolic FIR Filter ……………………………………pg18-19
2.4 The Semi-Parallel FIR filter ………………………………pg20-21
Section 3 : Matrix-Vector Multiplication Analysis ……………………………pg22
3.1 The Sequential Model of Computation ..…………………..pg22-24
3.2 Exploiting the Inherent Parallelism ...……………………...pg24
3.2.1 : Unrolling the Inner-Loop ………………………pg24-25
3.2.2 : Unrolling the Outer-Loop ……………………...pg26
3.3 Using Pipelined MACC Units ....………………………….pg26-27
3.3.1 : Optimising the Code for Execution on a
Pipelined MACC Unit ………………………….pg27-30
Section 4 : The Floating Point Unit ……………...……………………………pg31
4.1 Dealing with Hazards ……………………………………..pg32-33
4.1.1 : Using the Scoreboard to Detect Data Hazards…pg33-37
4.1.2 : Managing Data Hazards ..……………………...pg37-38
4.1.3 : Managing Structural Hazards ..………………...pg38-39
Section 5 : The Controller ………………………..……………………………pg40
5.1 Initial Look-Up Table Design ………....…………………..pg40-41
5.2 Optimising the Controller for Running Geometric
Transformation Programs ……….....……………………...pg41-43
5.2.1 : The Use of Induction Variables ..………………pg43-44
5.2.2 : Performing Strength Reduction on
Induction Variables ……..……………………...pg44-45
5.2.3 : Optimising a Geometric Transformation for
the Controller …….....………………………….pg45-48
5.2.4 : Designing the Optimal Controller ……………..pg48
5.2.4.1 : The Data Address Generator Design ...pg49-50
5.2.4.2 : The Use of Data Address Generators as
Part of the Controller …...…………...pg51-53
5.2.4.3 : The Loop Counter Structure …………pg53-55
5.2.5 : The High-Level Controller …………………….pg55
5.2.6 : The DMA Unit and Context Switching ………..pg55-57
Section 6 : Testing and Integration ..……………..…………………………..pg58
6.1 Testing of the FPU’s Control Unit .…....…………………pg58-59
6.2 Testing and Integrating the Register File ………………...pg59
6.3 Testing and Integrating the Execution Pipeline ………….pg60
6.4 Testing and Integration of the Look-Up Table Controller .pg60
6.5 Testing of the Optimal Controller’s Data Address Generator..pg60
6.6 Testing and Integration of the Optimal Controller’s
Loop Counter Structure, Program Memory and DAGs …...pg61
6.7 Testing and Integration of the DMA Unit …………………pg61-62
6.8 Testing and Integration of the High-Level Controller ….....pg62
6.9 The OpenGL Demonstration ………………………………pg63-64
6.10 Progress in Developing a Hardware Based Demonstration .pg64-65
Section 7 : Conclusion …………………………………………………………pg65
Section 8 : Appendices ………………………………………………………...pg67-80
Section 9 : Acknowledgments …………………………………………………pg81
Section 10 : References ………………………………………………………..pg82
1 : Introduction
This section begins by describing the 3-D graphics rendering application domain, and
the characteristics of the associated algorithms which allow for parallel execution.
The industry standard OpenGL 3-D graphics rendering pipeline is analysed, where
special focus is given to the geometric transformations stage, which consists of a
pipeline of matrix-vector manipulation operations that implement a key part of
OpenGL’s operation in defining what a scene looks like in terms of the position and
orientation of the objects in it. The section then closes by looking at how Nested
Loop Programs such as the FIR filter can be transformed to make their inherent
parallelism more explicit so that it can be exploited.
Section 2 then follows on from this and examines the different combinations of
transformations that can be applied to the FIR filter and the resulting implementation
tradeoffs. Section 3 then carries out a similar analysis on another prominent NLP in
graphics processing, the matrix-vector multiplication algorithm, where similarities
and differences with the FIR filter are analysed, and tradeoffs between the different
ways to optimise the algorithm’s implementation are discussed. Section 4 then details
the design of a FPU which can exploit the characteristics inherent in the matrix-vector
multiplication algorithm if they are exposed by transforming the code as discussed in
the previous section, by employing temporal parallelism.
Section 5 goes through the design process for an Optimal Controller for the FPU
using a typical matrix-vector multiplication algorithm found in the geometric
transformations stage of the OpenGL pipeline, and goes on to show how the
processor architecture as a whole was optimised for this particular section of the
OpenGL pipeline. Section 6 explains how the complete processor architecture was
built up from its constituent blocks in a hierarchical fashion, after they were tested
against their specification and integrated together into sub-systems, and ends by
describing the demonstration of OpenGL’s geometric transformations based on the
processor architecture developed. Section 7 finishes by drawing conclusions from the
project and putting the results achieved into context.
1.1 : Parallel Processing for Graphics Functions
Currently multiprocessing is the technique used to carry out the significant arithmetic
processing required to implement the realistic rendering techniques used in advanced
3-D graphics applications. Most often this is in the form of clusters of commodity
PC’s [1].
As discussed previously, graphics arithmetic consists largely of matrix and vector
manipulation, and thus lends itself to parallel processing because of the independence
of the individual operations required within the high-level matrix/vector function.
Most high level functions encountered in a Graphics environment are also ‘order-
independent’, as the order they are done in has no effect on the final display. Such
functions can thus be executed in parallel to achieve higher throughput, which is one
of the fundamental strengths of FPGAs. Some functions, however, are
‘sequential’: they must be executed after all preceding and before all subsequent
functions. A parallel processing architecture must therefore handle sequential
functions with minimal degradation to the overall system performance.
Another benefit of having a number of identical processors operating in parallel is
that the programming of each processor can be the same, and so this method of
obtaining high performance can also greatly simplify the software development
process.
1.2 : OpenGL
The Open Graphics Library (OpenGL) is the industry’s most widely used and
supported 2D and 3D graphics API [2]. As such there are thousands of applications
based on OpenGL that are used to render compelling 2D and 3D graphics, in markets
ranging from broadcasting, CAD/CAM, entertainment, cinematics, medical imaging
and virtual reality, the most famous of all being Pixar’s Renderman (used by the
movie industry to create special effects). Individual function calls in the OpenGL
environment can be executed on dedicated tuned hardware, run as a software routine
on the generic system CPU or implemented as a combination of both. As a result of
this implementation flexibility, OpenGL hardware acceleration can range from that
which simply renders 2-D lines and polygons to the more advanced floating-point
processor capable of transforming and computing geometric data.
1.2.1 : The OpenGL Pipeline
The two types of data that are input to the OpenGL pipeline are pixel data and
geometric data. Pixel data is the RGBA data associated with pixels on the screen, and
comes in the form of individual pixel colour values, images and bitmaps. Geometric
data is used to model objects (ranging in complexity from simple 2-D shapes to
realistic 3-D objects), and comes in the form of points, lines and polygons, which are
OpenGL’s three geometric primitives. All geometric data is eventually described as
vertices. The data associated with each vertex is its 3-D positional coordinate vector,
normal vector and material properties (used in lighting calculations), pixel colour
value (RGBA value or an index to a colour-map), and its texture coordinates (used to
map a texture onto the vertex’s parent object). Figure 1 below shows an overview of
the OpenGL pipeline in terms of its various stages and the order in which operations
occur.
Figure 1 : showing an overview of the OpenGL pipeline
As can be seen in figure 1 above the vertex and pixel data are initially processed
differently before both being used in the rasterization stage. All data can be input to
the pipeline from the application and processed immediately, or saved in a display list
which sends the data to the pipeline when the list is executed.
In the per-vertex operations stage, the three vectors associated with each vertex
(spatial coordinates, normal and texture coordinates) are transformed (multiplied) by
the current modelview matrix, its inverse transpose and the current texture matrix
respectively. These transformations are carried out in order to transform the position,
orientation and size of the vertices’ parent objects in the scene. Lighting calculations
are then performed on each vertex using their transformed spatial coordinate and
normal vectors, material properties and the current lighting model. These calculations
act so as to scale each vertex’s pixel colour value to reflect the location and
orientation of its parent polygon relative to the light source/s.
The viewing volume of the scene is defined by six clipping planes. In the primitive
assembly stage the spatial coordinate vector of all vertices is transformed (multiplied)
by the projection matrix, so as to clip the scene against the six planes of the viewing
volume. Depending on the type of clipping employed primitives may have vertices
rejected, modified or added. The programmer may also define additional clipping
planes to further restrict the viewing volume, in order to create cut-away views of
objects, and other similar effects. The equations that represent such additional
clipping planes are used to transform the spatial coordinate vectors before the
projection matrix.
The spatial coordinate vectors of all vertices then go through perspective division,
where primitives are scaled in size to reflect their distance from the viewer. This is
followed by the viewport transformation where the spatial coordinate vectors of all
vertices are transformed so as to map the 3-D scene onto the 2-D viewport (viewing
window) of the computer screen.
In the rasterization processing stage, all primitives (points, lines and polygons) are
rasterized into fragments, where the squares of the viewport’s integer pixel-grid that
are occupied by each primitive are determined. If enabled, advanced features to make
the rendered scene more realistic are also implemented in this stage. The most
commonly used of these is anti-aliasing, which is used to smooth jagged edges that
result from having to map non-vertical and non-horizontal lines to the square pixel-
grid of the viewport. Anti-aliasing calculates the portion of each square within the
pixel-grid that would be occupied by a line, if the line were to be drawn as originally
defined (before the viewport transformation), and this value is known as a pixel’s
coverage value. A pixel’s coverage value is used to scale the alpha component of its
RGBA colour value. Figure 2 illustrates how pixel coverage values are evaluated.
Figure 2 : showing an example to illustrate how a pixel’s coverage value is evaluated
With reference to figure 2 above, the green and orange show how a diagonal line
looks before and after it’s subjected to the viewport transformation respectively, and
the coverage values for the pixels occupied by the line are given on the right.
In the pixel operations stage, pixel data that is input to the pipeline is scaled, biased
and processed using a colour-map, after which the colour values are clamped to a
certain range. The resulting pixel data is then either rasterized into fragments or
written to texture memory for use in texture mapping. Data from the framebuffer can
be read back and placed in the host processor memory. Data for texture mapping can
be taken from either the host processor memory or the framebuffer.
If texturing is enabled, in the per-fragment operations processing stage, the texture
coordinate vector of each vertex (of a fragment’s primitive) is used to map the
vertices to specific points on a two-dimensional texture image, so that the texture
(known as the texel for that particular fragment) can be mapped onto the primitive
appropriately after its position and orientation are transformed in the per-vertex
operations stage. If enabled, other advanced features such as blending (used to create
a photo-realism effect) are also implemented in the per-fragment operations stage.
It’s also in this stage that the coverage values calculated in the rasterization stage are
applied, if anti-aliasing is enabled.
The final pixel values are then drawn (written) into the framebuffer.
1.2.2 : The Geometric Transformation Process
The overall transformation process for producing a scene for viewing is analogous to
that carried out by a camera when it’s used to take a photograph. This transformation
process is carried out on the spatial coordinate vector of each vertex in the OpenGL
pipeline. This process is depicted below in figure 3.
Figure 3 : showing the stages of transformation for spatial coordinate vectors
As can be seen in figure 3 above, a vertex’s spatial coordinate vector consists of not
only the vertex’s three-dimensional x, y, z coordinates, but also a w component which
is used in the perspective division stage. All three vectors of a vertex in OpenGL
have four elements, thus all matrices (that they are multiplied by) are 4x4.
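Every transformation in this stage therefore reduces to a 4x4 matrix applied to a four-element vector. As a minimal sketch (the function name is illustrative, not from the report), each output element is one dot-product, i.e. four multiply-accumulates:

```c
#include <assert.h>

/* Multiply a 4x4 matrix (row-major) by a four-element column vector:
 * sixteen multiply-accumulate operations per transformed vertex. */
static void mat4_mul_vec4(const float m[4][4], const float v[4], float out[4])
{
    for (int row = 0; row < 4; row++) {
        float acc = 0.0f;
        for (int col = 0; col < 4; col++)
            acc += m[row][col] * v[col];   /* one MACC per element */
        out[row] = acc;
    }
}
```

This nested two-level loop is exactly the matrix-vector NLP structure analysed later in section 3.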
1.2.2.1 : The ModelView Transformation
A vertex’s spatial coordinates are first presented to the pipeline as object coordinates.
In this form the spatial coordinates specify a vertex’s location in 3-D space when its
parent primitive is centred on the origin and oriented in such a way that makes it easy
for the programmer to visualise where its vertices are in 3-D space (i.e. with the
primitive’s edges parallel with and perpendicular to the axes). The modelview
transformation is the combination of the modelling transformation and the viewing
transformation, represented by the modelling and viewing matrices respectively. The
modelling transformation is always carried out on an object before the viewing
transformation, as by default the modelview matrix is formed by the viewing matrix
pre-multiplying the modelling matrix.
The modelling transformation positions an object at a particular location in 3-D space
relative to the origin (by performing translation), rotates the object relative to the
axes (by performing rotation) and scales the object in size (by performing scaling).
These three transformations are represented by their own matrices, and these are
depicted in figure 4 below. The order in which the modelling transformation carries
out these transformations is determined by the order in which their respective matrices
are multiplied together to form the modelling matrix. The transformation represented
by a post-multiplying matrix is carried out before that represented by the pre-
multiplying matrix, and this principle holds true in all instances where transformation
matrices are combined together through multiplication.
T = | 1  0  0  x |
    | 0  1  0  y |
    | 0  0  1  z |
    | 0  0  0  1 |

Matrix T translates the object from being centred at the origin to the location
defined by x, y, z; the object’s orientation and size are maintained.

S = | x  0  0  0 |
    | 0  y  0  0 |
    | 0  0  z  0 |
    | 0  0  0  1 |

Matrix S scales the object such that it is stretched by a factor of x, y, z in the
direction of the corresponding axes; the object’s orientation and position are
maintained.

Rx = | 1    0      0     0 |
     | 0  cosΦ  -sinΦ    0 |
     | 0  sinΦ   cosΦ    0 |
     | 0    0      0     1 |

Ry = |  cosΦ  0  sinΦ  0 |
     |   0    1    0   0 |
     | -sinΦ  0  cosΦ  0 |
     |   0    0    0   1 |

Rz = | cosΦ  -sinΦ  0  0 |
     | sinΦ   cosΦ  0  0 |
     |  0      0    1  0 |
     |  0      0    0  1 |

Matrices Rx, Ry and Rz rotate the object clockwise by Φ degrees about the x, y and
z axes respectively; rotation about more than one axis is achieved by multiplying
the relevant R matrices together; the object’s position and size are maintained.

Figure 4 : showing the translation, scaling and rotation matrices that are multiplied
together to form the modelling matrix
With reference to figure 4 above, it can be seen that the translation transformation on
a vertex is achieved by adding to each of its x, y and z components the multiplication
of the corresponding component of the translation vector down column 4 of the T
matrix and its w component. The scaling transformation on a vertex is achieved by
multiplying each of its components by the corresponding component of the scaling
vector along the leading diagonal of the S matrix. The action of the rotation matrix is
rather more complex, although it can be seen from the R matrices above that the
vertex component associated with the axis being rotated about remains unchanged
after the transformation.
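The behaviour of the T matrix described above can be checked with a small sketch (helper names are illustrative, not from the report): applying T to a vertex with w = 1 adds the translation vector directly to the x, y, z components:

```c
#include <assert.h>

/* Apply a 4x4 row-major matrix to a homogeneous vertex. */
static void xform(const float m[4][4], const float v[4], float out[4])
{
    for (int r = 0; r < 4; r++) {
        out[r] = 0.0f;
        for (int c = 0; c < 4; c++)
            out[r] += m[r][c] * v[c];
    }
}

/* Build the translation matrix T of figure 4: column 4 holds (x, y, z). */
static void make_translation(float tx, float ty, float tz, float m[4][4])
{
    const float t[4][4] = {
        { 1, 0, 0, tx },
        { 0, 1, 0, ty },
        { 0, 0, 1, tz },
        { 0, 0, 0, 1  },
    };
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            m[r][c] = t[r][c];
}
```

Because column 4 of T is multiplied by the vertex's w component, translation only behaves as a straight addition when w = 1, which is how object coordinates are presented to the pipeline.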
The viewing transformation is analogous with adjusting the camera location and the
direction in which it points when taking a photograph of the scene. This
transformation is comprised of a combination of translation (adjusting the viewpoint’s
location) and rotation (adjusting the viewpoint’s direction), and the associated
matrices are the same as those used to produce the modelling matrix (shown above in
figure 4). These two transformations within the viewing transformation (on the
viewpoint) have the exact reverse effect on the appearance of the scene to the
corresponding transformations within the modelling transformation (on the scene’s
objects). The default viewpoint is at the origin and points in the negative z-direction.
As objects are most often initially defined as being centred on the origin, the
modelview transformation as a whole must perform a transformation simply for the
objects in the scene to be visible, although any of the elementary transformations can
be omitted by simply not including their respective matrices in the product forming
the modelview matrix.
1.2.2.2 : The Projection Transformation
The eye coordinates resulting from the modelview transformation then go through the
projection transformation where they are converted to clip coordinates. The
projection transformation defines the viewing volume. The shape of the viewing
volume determines how objects are projected onto the screen and which objects (or
portions of objects) are clipped out of the final scene. The most common type of
projection used is perspective projection which employs a frustum shaped viewing
volume, as illustrated below in figure 5.
Figure 5 : showing the frustum shaped viewing volume employed by perspective
projection
Perspective projection implements foreshortening, whereby the further away from the
viewpoint an object is, the smaller it appears in the scene, thus emulating the way the
human eye (or a camera) works. The projection matrix that represents the
(perspective) projection transformation is depicted below in figure 6.
P = | 2n/(r-l)     0         (r+l)/(r-l)     0        |
    |    0      2n/(t-b)     (t+b)/(t-b)     0        |
    |    0         0        -(f+n)/(f-n)  -2fn/(f-n)  |
    |    0         0            -1            0       |

where n : near, f : far, l : left, r : right, b : bottom, t : top.

Matrix P clips the objects against the six planes of the viewing volume; the w
component of each vertex is set to -z (the distance of the vertex from the origin, in
a direction away from the viewpoint).

Figure 6 : showing the projection matrix for perspective projection
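The projection matrix above can be assembled directly from the six frustum parameters. The following is a minimal sketch (the function name is illustrative, not from the report) that fills in the non-zero entries of P exactly as laid out in figure 6:

```c
#include <assert.h>

/* Fill in the perspective projection matrix P of figure 6 (row-major).
 * n and f are the near and far clip distances; l, r, b, t bound the
 * near clipping plane. */
static void make_frustum(float l, float r, float b, float t,
                         float n, float f, float P[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            P[i][j] = 0.0f;
    P[0][0] = (2.0f * n) / (r - l);
    P[0][2] = (r + l) / (r - l);
    P[1][1] = (2.0f * n) / (t - b);
    P[1][2] = (t + b) / (t - b);
    P[2][2] = -(f + n) / (f - n);
    P[2][3] = (-2.0f * f * n) / (f - n);
    P[3][2] = -1.0f;   /* this row sets the output w component to -z */
}
```

The bottom row (0 0 -1 0) is what makes the transformed w component equal to -z, which the perspective division stage then divides by.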
1.2.2.3 : Perspective Division and the ViewPort Transformation
The clip coordinates resulting from the projection transformation are then converted
to normalized device coordinates through the process of perspective division, where
the x, y, and z coordinates of each vertex are divided by its w component (which is set
to –z in the projection transformation as described previously in figure 6). This scales
the objects down in size to implement foreshortening as discussed previously in
regards to perspective projection. After the perspective division stage the four
element spatial coordinate vector of each vertex becomes a three element vector as
the w component is discarded.
The last stage of the process is the viewport transformation which maps the three-
dimensional normalized device coordinates to the two-dimensional viewport, in
converting them to window coordinates. The equations that perform the viewport
transformation are shown below in figure 7.
xw = ( ( xnd + 1 ) × ( width ÷ 2 ) ) + xo
yw = ( ( ynd + 1 ) × ( height ÷ 2 ) ) + yo
Figure 7 : showing the equations that perform the viewport transformation
With reference to figure 7 above, the (xo, yo) coordinate is the position of the bottom
left hand corner of the viewport (viewing window) on the screen, relative to the
corresponding corner of the screen. The width and height parameters are the
dimensions of the viewport (in pixels).
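The two stages described above can be sketched as a single small function. This is an illustrative sketch, not the report's implementation; the parameter names follow the viewport conventions just described (xo, yo, width, height):

```c
#include <assert.h>

/* Perspective division followed by the viewport mapping of figure 7.
 * clip holds the clip coordinates (x, y, z, w); the resulting window
 * coordinate is returned through xw and yw. */
static void viewport_map(const float clip[4],
                         float xo, float yo, float width, float height,
                         float *xw, float *yw)
{
    /* perspective division: normalized device coordinates in [-1, 1] */
    float xnd = clip[0] / clip[3];
    float ynd = clip[1] / clip[3];

    /* viewport transformation (figure 7) */
    *xw = (xnd + 1.0f) * (width  / 2.0f) + xo;
    *yw = (ynd + 1.0f) * (height / 2.0f) + yo;
}
```

For example, a vertex on the viewing axis (xnd = ynd = 0) maps to the centre of the viewport, as expected from the equations.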
1.3 : Overview of Transformations on Nested Loop Programs
A key problem in parallel processing is the way in which the program for each
processor is generated, in a way such that the overall program is executed efficiently.
Most graphics functions are described as a set of Nested for-loops [3].
The Nested-Loop Program (NLP) form of an algorithm represents its sequential
model of computation. The FIR filter operation is a simple example of a NLP and
this is shown below in figure 8 for calculating four output values.
for j = N:N+3
    for i = 0:N-1
        y[j] = y[j] + ( x[j-i] * h[i] )
    end
end

Figure 8 : showing the NLP form of the FIR filter operation
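The NLP of figure 8 translates directly into C. The following is a minimal sketch for N = 4 (the function name is illustrative); it assumes the caller zero-initialises y[]:

```c
#include <assert.h>

#define N 4  /* number of taps */

/* Direct C rendering of the NLP in figure 8: computes the four output
 * samples y[N] .. y[N+3].  y[] must be zero-initialised by the caller,
 * and x[] must hold at least N+4 input samples. */
static void fir_nlp(const float x[], const float h[], float y[])
{
    for (int j = N; j <= N + 3; j++)          /* outer loop: output index */
        for (int i = 0; i <= N - 1; i++)      /* inner loop: tap index    */
            y[j] = y[j] + x[j - i] * h[i];
}
```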
This sequential model of computation of an algorithm is representative of the way in
which the algorithm would be implemented as Software running on a standard single
thread processor.
The algorithmic transformation of unrolling transforms a NLP such that its task-level
parallelism is made more explicit. As a result, tasks that are independent of each
other are exposed, although the resulting representation of the
algorithm is still functionally equivalent to the original. The independent tasks can
then be mapped to separate processors.
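The unrolling transformation can be sketched on the FIR NLP of figure 8 (names are illustrative): unrolling the outer j-loop by a factor of two produces two independent accumulations per iteration, which could be mapped onto two separate MACC units:

```c
#include <assert.h>

#define N 4  /* number of taps */

/* The j-loop of figure 8 unrolled by a factor of two: each iteration now
 * carries two independent accumulations (acc0 and acc1), making the
 * task-level parallelism explicit.  Functionally equivalent to the
 * original NLP. */
static void fir_unrolled2(const float x[], const float h[], float y[])
{
    for (int j = N; j <= N + 3; j += 2) {
        float acc0 = 0.0f, acc1 = 0.0f;
        for (int i = 0; i < N; i++) {
            acc0 += x[j - i]     * h[i];   /* independent task: y[j]   */
            acc1 += x[j + 1 - i] * h[i];   /* independent task: y[j+1] */
        }
        y[j]     = acc0;
        y[j + 1] = acc1;
    }
}
```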
The algorithmic transformation of skewing makes the dependences of operations less
demanding, and thus allows for latency in the physical operators that will actually
execute them.
The dependency graph is a useful way of visualising the dependences between the
operations in a NLP, and represents an intermediate level of abstraction between the
NLP and its implementation. An example of a dependency graph is shown for the
FIR filter NLP of figure 8, where the outer j loop has been unrolled by a factor of two,
and N = 4. Data dependences between tasks (represented as circles) are shown by
one task forwarding data to another. The two independent tasks are highlighted in
different colours.
[Dependency graph omitted: four dot-product tasks, one per output j = 4..7, each
accumulating h[0].x[j] + h[1].x[j-1] + h[2].x[j-2] + h[3].x[j-3] through a chain of
adders starting from 0; the independent tasks produced by the unrolling are
highlighted in different colours.]

Figure 9 : Showing the effect of unrolling the outer loop of the FIR filter NLP by a
factor of 2
When the outer loop is unrolled by a factor of two, this is essentially the same as
making two copies of it that can be executed in parallel. As there are two copies, in
the implementation of this modified NLP each copy of the loop will need its own set
of registers for storing coefficient and data values. The unrolling transformation thus
translates to spatial parallelism being employed in the implementation.
The skewing transformation however translates to temporal parallelism (pipelining)
being employed in the implementation, and as such, a single set of registers are shared
by the different iterations of the internal loop, whose executions are overlapped.
These two transformations can be used to transform a sequential model of
computation into something that it is closer to a data-flow model, thus making it more
suitable for efficient implementation (exploiting parallelism) on hardware.
Section 2 : FIR Filter Analysis
The FIR filter operation essentially carries out a vector dot-product in calculating each
value of y[n]. This is illustrated below in figure 10 for N = 4.
[Diagram omitted: the samples x[n], x[n-1], x[n-2], x[n-3] are paired with the
coefficients h[0], h[1], h[2], h[3] to form the dot-product
x[n].h[0] + x[n-1].h[1] + x[n-2].h[2] + x[n-3].h[3].]

Figure 10 : showing how the FIR filter operation is comprised of vector dot-products
Section 2.1 : The Single MACC FIR Filter
The Single MACC FIR filter (shown below in figure 11) is an implementation of the
FIR filter’s sequential model of computation and as its name suggests it is based on a
single MACC (multiply-accumulate) unit. As such, the algorithmic description of this
implementation is identical to that of the NLP description of the FIR filter (shown
below in figure 12) without applying any unfolding or skewing transformations
(which were discussed earlier in section 1.3). For simplicity it is assumed throughout
this section unless otherwise stated that all references to MACC units refer to non-
pipelined MACCs with a total latency of 1 clock cycle.
Figure 11 : showing an example H/W implementation of the Single MACC FIR filter
[4]
The primary trade-off between sequential and parallel implementations of the same
algorithm is the amount of hardware resources required versus the throughput
achieved. As the Single MACC FIR filter implements the FIR filter function in a
completely sequential manner, the required hardware resources are reduced by a
factor of N, although so too is the throughput as compared to a fully parallel
implementation that would use one MACC unit for each of the N coefficients (where
the N MACCs would be cascaded).
void singleMaccFirFilter( int num_taps, int num_samples, const float *x, const float *h, float *y )
{
    int i, j;        // 'j' is the outer-loop counter and 'i' is the inner-loop index
    float y_accum;   // output sample is accumulated into 'y_accum'
    const float *k;  // pointer to the required input sample

    for( j = 0; j < num_samples; j++ )
    {
        k = x++;     // x points to x[n+j] and is incremented (post assignment) to point to
                     // x[(n+j)+1]
        y_accum = 0.0f;
        for( i = 0; i < num_taps; i++ )
        {
            y_accum += h[i] * *(k--);   // y[n+j] += h[i] * x[(n+j) - i]
        }
        *y++ = y_accum;  // y points to the register address where y[n+j] is to be written and is
                         // incremented (post assignment) to point to the register address where the
                         // next output sample y[(n+j)+1] is to be written
    }
}
Figure 12 : A code description of the Single MACC FIR filter
With reference to the code description of the Single MACC FIR filter shown above in
figure 12, all of the required input samples are assumed to be stored in the register file
(with a stride of 1) of the processor (a single MACC unit in this case) executing the
code, with x initially pointing to the input sample x[n] corresponding to the first
output sample to be calculated y[n]. It is also assumed that all of the required
coefficients are stored in the same way in a group of registers used to store the h[ ]
array.
As can be seen from figure 12 above, and more clearly from the dependency graph of
the Single MACC FIR filter (shown below in figure 13), this implementation
evaluates (accumulates) only one output value at a time. Assuming that each
multiplication of x[(n+j)-i] and h[i] takes one clock cycle, then the performance of
this implementation is given by the following equation:
Throughput = Clock frequency ÷ Number of coefficients
[Dependency graph omitted: for each of the four outputs (j = 0..3) the single MACC
unit (MACC_1) performs the four accumulations i = 0..3 in sequence into the shared
y_accum register, producing, for example,
y[n] = h(0).x[n] + h(1).x[n-1] + h(2).x[n-2] + h(3).x[n-3]; all sixteen MACC
operations are executed one after another on the same unit.]

Figure 13 : dependency graph showing the operation of the Single MACC FIR filter
If the coefficients of the filter possess symmetry (i.e. h[0]=h[N-1], h[1]=h[N-2], etc)
a doubling of throughput can be achieved at the same clock frequency using a
variation of the Single MACC FIR filter called the Symmetric MACC FIR filter. This
implementation uses a single coefficient in place of those that are equal, and as such
only one multiplication by that single coefficient is required, although the other
multiplicand is now the sum of the data values corresponding to those equal
coefficients. Thus, the cost of this performance enhancement is another adder at the
input of the multiplier, as well as another RAM block (or using one Dual-port block
RAM), as the two input samples corresponding to the single coefficient need to be
fetched simultaneously. As the number of coefficients is halved, however, so too is
the amount of storage required for them. If N is an odd number, the Symmetric
MACC FIR filter reduces the number of coefficients to N/2 + 1. The Symmetric
MACC FIR is derived from the Single MACC FIR by unrolling its inner-loop by a
factor of two and reversing the order in which one of the loops processes its
respective coefficient-data pairs. This is a unique example of employing spatial
parallelism.
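The coefficient-sharing scheme described above can be sketched for a 4-tap filter with even symmetry (h[0] = h[3], h[1] = h[2]). This is an illustrative sketch, not the report's hardware; the pre-adder sums the two samples that share a coefficient, halving the multiplication count:

```c
#include <assert.h>

/* One output of a 4-tap symmetric FIR (h[0] == h[3], h[1] == h[2]).
 * h[] holds only the N/2 = 2 distinct coefficients; the pre-adder sums
 * the two input samples sharing each coefficient, so only 2 multiplies
 * are needed instead of 4. */
static float symmetric_fir4(const float x[], int n, const float h[])
{
    float y = 0.0f;
    for (int i = 0; i < 2; i++)                   /* N/2 iterations */
        y += h[i] * (x[n - i] + x[n - 3 + i]);    /* pre-add, then one MACC */
    return y;
}
```

Note the two samples fed to the pre-adder (x[n-i] and x[n-3+i]) come from opposite ends of the delay line, which is why they must be fetched simultaneously from two ports.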
Employing spatial parallelism is one way to enhance the performance of the Single
MACC FIR filter, and essentially uses more than one MACC unit to evaluate each
output sample. As a result each MACC unit evaluates an equal share of the
coefficient-data sample multiplications, and as such if M MACC units are employed,
the throughput is increased by a factor of M over the Single MACC FIR filter,
although so too are the required hardware resources.
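The M-unit scheme just described can be sketched as follows (an illustrative sketch under the stated assumption that N divides evenly by M, not the report's hardware): each of the M conceptual MACC units accumulates its own share of N/M coefficient-data products, and the partial sums are then combined:

```c
#include <assert.h>

#define N 8   /* number of taps */
#define M 2   /* number of MACC units; N must be divisible by M */

/* One output sample computed with M MACC units, each accumulating an
 * equal share (N/M) of the coefficient-data products.  The m-loop is
 * conceptually parallel: in hardware the M partial accumulations run
 * simultaneously, giving an M-fold throughput increase. */
static float fir_m_macc(const float x[], int n, const float h[])
{
    float partial[M] = { 0.0f };
    for (int m = 0; m < M; m++)                       /* parallel in H/W   */
        for (int i = m * (N / M); i < (m + 1) * (N / M); i++)
            partial[m] += h[i] * x[n - i];            /* this unit's share */
    float y = 0.0f;
    for (int m = 0; m < M; m++)                       /* combine partials  */
        y += partial[m];
    return y;
}
```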
Section 2.2 : The Transposed FIR Filter
The Transposed FIR filter (an example H/W implementation of which is shown below
in figure 14) is a fully parallel implementation, as one MACC unit is used for each of
the N coefficients. Unlike the Direct-form type I fully parallel implementation (which
employs an adder-tree structure), the Transposed FIR filter employs an adder chain,
and as such the MACC units are much easier to connect together and the
implementation can easily be scaled up or down in terms of N. With regards to
targeting the design to the Xilinx Virtex-4 FPGA, because of this adder-chain
structure the Transposed implementation can be entirely contained within the
dedicated Xilinx DSP48 slices as opposed to using generic FPGA fabric, which would
yield a less efficient mapping.
Figure 14 : showing an example H/W implementation of the Transposed FIR filter [4]
The input data samples are broadcast to all MACC units simultaneously, and with this
implementation the coefficients are assigned (in ascending order) starting from the
right-most MACC unit (from which the final output is taken). As well as the spatial
parallelism through the use of N MACC units, temporal parallelism is also employed
as the evaluation of successive output samples is overlapped, although this does not
increase throughput or decrease the latency before the first output sample appears,
relative to the Direct-form type I implementation.
A code description for the Transposed FIR filter (with the number of taps N = 4) is
given in figure A1 of appendix A.
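One clock cycle of the N = 4 Transposed structure can also be modelled behaviourally in C (a sketch under assumed names, not the appendix-A listing itself); the z registers stand for the adder-chain registers between the MACC units:

```c
#include <assert.h>

#define NTAPS 4

/* Adder-chain registers between the MACC units; z[0] is unused, the
 * output adder reads z[1]. All registers start at zero. */
static float z[NTAPS];

/* One clock cycle: x_in is broadcast to every MACC unit, and each unit
 * adds its product to the partial sum registered by the unit to its
 * right on the previous cycle. */
float transposed_fir_step(const float h[NTAPS], float x_in)
{
    float y = h[0] * x_in + z[1];       /* right-most unit produces the output */
    for (int i = 1; i < NTAPS - 1; i++)
        z[i] = h[i] * x_in + z[i + 1];  /* reads z[i+1] before it is updated */
    z[NTAPS - 1] = h[NTAPS - 1] * x_in; /* left-most unit has no upstream sum */
    return y;
}
```

Feeding a unit impulse produces the coefficient sequence h(0)..h(3) on successive cycles, confirming that only a single MACC unit's latency separates each input from its output.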
The Transposed FIR filter design is yielded by a complete unrolling of the inner-loop
(i-loop) of the original Single MACC FIR filter code description, which results in the
number of MACC units required increasing to N. A skewing of the outer-loop (j-
loop) by a factor of 1 is also performed, which results in the temporal overlapping of
successive output sample calculations. This skewing is required to schedule apart the
dependences that arise because the N MACC operations within any single iteration of
the outer-loop are dependent on the MACC in the previous iteration of the inner-loop
(for their third argument).
The dependency graph of the Transposed FIR filter’s operation is shown below (with
N = 4) in figure 15.
[Figure content: MACC_1 to MACC_4 overlap the evaluation of successive output
samples, e.g. y[n] = h(0).x[n] + h(1).x[n-1] + h(2).x[n-2] + h(3).x[n-3] and
y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-1] + h(3).x[n-2], across outer-loop
iterations j = 0, 1]
Figure 15 : dependency graph showing the operation of the Transposed FIR filter
(with N = 4)
As can be seen above in figure 15, the initial latency before the first output sample
emerges is the same as that seen with the fully-parallel FIR filter. Once this initial
latency (of the spin-up procedure, whereby the pipeline is filled over the first N
cycles) has been endured, the throughput yielded by the Transposed FIR filter
implementation is the same as that of the Direct-form type I implementation (equal to
the clock frequency). The latency between when each input sample is applied and the
emergence of the corresponding output sample is also the same as that seen with the
Direct-form type I implementation, and is equal to the latency of a single MACC unit.
Section 2.3 : The Systolic FIR Filter
As with the Transposed FIR filter, the Systolic FIR filter (an example H/W
implementation of which is shown below in figure 16) is also a fully parallel
implementation, and also uses an adder-chain to accumulate each value of y[n].
Figure 16 : showing an example H/W implementation of the Systolic FIR filter [4]
The Systolic FIR filter also employs temporal parallelism in addition to spatial
parallelism in the same way that the Transposed FIR filter does. However the
Systolic FIR filter’s coefficients are assigned (in ascending order) starting from the
left-most MACC unit (to which the input samples are applied), which is the opposite
way to how the coefficients are assigned to the Transposed FIR filter’s MACC units.
As such, the Systolic FIR filter evaluates the inner-products (of which each value of
y[n] consists) in the reverse order to the Transposed FIR filter.
The input data samples are fed into a cascade of registers which act as a data buffer.
The Systolic FIR filter differs from the Direct-form type I implementation not only in
its use of an adder chain to accumulate each value of y[n], but also in the additional
register between each of the taps. A code description of the Systolic FIR filter (with
the number of taps N = 4) is given in figure A2 of appendix A.
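As with the Transposed filter, the per-cycle behaviour can be sketched in C (an illustrative model, not the appendix-A listing; note that this sketch also registers the input, so its first output appears after N + 1 cycles rather than N):

```c
#include <assert.h>

#define NT 4

/* Per-stage state: two input-delay registers (xa, then xb) and one
 * registered partial sum p. The multiplier of stage i taps xa[i]. */
static float xa[NT], xb[NT], p[NT];

/* One clock cycle of the N = 4 Systolic structure: partial sums advance
 * one stage per cycle whilst input samples advance one stage every two
 * cycles through the tapped delay line. */
float systolic_fir_step(const float h[NT], float x_in)
{
    float y = p[NT - 1];                    /* last stage's registered sum */
    for (int i = NT - 1; i >= 0; i--) {     /* back-to-front: read pre-clock values */
        p[i] = (i ? p[i - 1] : 0.0f) + h[i] * xa[i];
        xb[i] = xa[i];                      /* second register of the pair */
        xa[i] = (i ? xb[i - 1] : x_in);     /* first register of the pair */
    }
    return y;
}
```

A unit impulse emerges as the coefficient sequence h(0)..h(3) after the longer, N-proportional latency discussed above.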
As with the Transposed FIR filter, the Systolic FIR filter is yielded by a complete
unrolling of the inner-loop (i-loop) of the original Single MACC FIR filter, and a
skewing of its outer-loop by a factor of 1. However, in generating the Systolic FIR
filter from the Single MACC FIR, the outer-loop is skewed in the opposite direction to
how it is skewed in generating the Transposed FIR filter. This means that the inner-
products which are summed together to produce an output sample are evaluated in the
opposite order to that in which they’re evaluated by the Transposed FIR filter. This
difference is reflected in the two H/W implementations as the Transposed FIR
implementation employs a broadcast structure to feed the same input sample to each
of its MACC units on each clock cycle, whereas the Systolic implementation employs
a tapped delay line with a delay of two clock cycles between each MACC unit. The
dependency graph of the Systolic FIR filter’s operation is shown below in figure 17
(with N = 4).
[Figure content: MACC_1 to MACC_4 evaluate y[n] = h(0).x[n] + h(1).x[n-1] +
h(2).x[n-2] + h(3).x[n-3] and y[n+1] = h(0).x[n+1] + h(1).x[n] + h(2).x[n-1] +
h(3).x[n-2], with input samples forwarded down the tapped delay line, across
outer-loop iterations j = 0, 1]
Figure 17 : dependency graph showing the operation of the Systolic FIR filter
(with N = 4)
The green arrows represent the forwarding of a partial accumulation of an output
sample through the adder-chain, whilst the effect of the two registers between each of
the MACC units is represented by the blue arrows as they represent the forwarding of
input samples between successive MACC units.
As with the Transposed FIR filter, the initial latency before the first output sample
emerges, and the throughput thereafter, are the same as those seen with the Direct-form
type I implementation. However, because the Systolic FIR filter evaluates and
accumulates the inner-products in the opposite order to the Transposed FIR filter, the
latency between each input sample being applied to the filter and the corresponding
output sample emerging is N clock cycles (assuming the latency of each MACC unit is
1 cycle). Thus this latency increases by a factor of N compared to that seen with both
the Transposed and Direct-form type I implementations. However, the advantage that
the Systolic FIR filter holds over the Transposed implementation is that its input is
only applied to one MACC unit, unlike the Transposed FIR filter, whose input is
broadcast to all of its MACC units and thus has a high fan-out. The Systolic
implementation is therefore more suitable than the Transposed implementation for
higher values of N.
Section 2.4 : The Semi-Parallel FIR Filter
The Semi-Parallel FIR filter (sometimes called the Hardware-folded implementation)
divides its N coefficients amongst M multiply-add units. An example implementation
of the Semi-Parallel FIR filter is shown below in figure 18 (with N = 16, M = 4).
Figure 18 : showing an example H/W implementation of the Semi-Parallel FIR filter
(with N = 16, M = 4) [4]
Each group of N/M coefficients is assigned to one of the MACC units and stored in
order within the associated coefficient-memory. The first group (coefficients 0 to
(N/M – 1)) is assigned to the left-most MACC unit (to which the input samples are
applied), with ascending coefficient groups being assigned to the MACC units from
left to right. If N is not exactly integer-divisible by M, then the higher-order
coefficient-memories are padded with zeros.
Like the Transposed and Systolic implementations, the Semi-Parallel FIR filter
employs both spatial and temporal parallelism, but the degree to which it does this
depends on the ratio of M:N, with a higher M:N ratio resulting in a higher degree of
both spatial parallelism (as more MACC units are used) and temporal parallelism (as
each output sample is evaluated more quickly and thus the evaluation of more output
samples can be overlapped in time). Thus the trade-off is the performance obtained
versus the resources required, as can be seen from the equation for the throughput of a
Semi-Parallel FIR filter implementation:
Throughput = ( Clock frequency ÷ N ) × M
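As a worked example (with illustrative numbers, not measured figures), the equation can be evaluated directly:

```c
#include <assert.h>

/* Semi-Parallel FIR throughput: Throughput = (f_clk / N) * M, i.e. one
 * output sample every N/M clock cycles. */
double semi_parallel_throughput(double f_clk_hz, int n_taps, int m_units)
{
    return f_clk_hz * m_units / n_taps;
}
```

For N = 16, M = 4 and a 100 MHz clock this gives 25 Msamples/s: a quarter of the fully-parallel rate, for a quarter of the MACC units.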
The Semi-parallel implementation may be extrapolated either towards being a fully-
parallel implementation like the Transposed and Systolic implementations by using
more MACC units, or the other way towards being a Single MACC FIR filter by
using fewer MACC units. A code description of the Semi-parallel FIR filter (with N
= 16, M = 4) is given in figure A3 of appendix A. The dependency graph of the
Semi-parallel FIR filter’s operation (with N = 16, M = 4) is shown below in figure 19.
[Figure content: MACC_1 to MACC_4 each accumulate four inner-products of
y[n] = h(0).x[n] + h(1).x[n-1] + … + h(15).x[n-15], with the output accumulator
ACC_1 summing the four partial results; the evaluation of y[n-1], y[n] and y[n+1]
is overlapped across outer index j = 0, 1 and inner index i = 0..3]
Figure 19 : dependency graph showing the operation of the Semi-Parallel FIR filter
The red circles represent MACC units being used to calculate the inner-products of
y[n], and the dark-red circles represent the output-accumulator being used to
accumulate the inner-products of y[n]. The blue and dark-blue circles represent the
same for y[n-1], whilst the yellow circles represent MACC units being used to
calculate the inner-products of y[n+1]. As can be seen above in figure 19, the address
(which lies in the range [0:((N/M) – 1)] for all MACC units) applied to the data-
buffer and coefficient memory of each MACC unit lags one behind the corresponding
address of the immediately preceding (to the immediate left) MACC unit, and all such
addresses are continuously and monotonically cycling from 0 to ((N/M) – 1). This is
necessary in order to employ temporal parallelism by overlapping (in time) the
evaluation of successive output samples in the way shown in figure 19 above. This
temporal parallelism is in turn necessary to achieve the Semi-Parallel
implementation’s maximum throughput of one output sample every N/M clock cycles,
because once an output sample has been retrieved from the accumulator by the capture
register, the accumulator must be reset (to either zero or its input value). For its input
value to be the first sum of M inner-products of the next output sample, the evaluation
of this sum needs to have finished in the previous clock cycle; otherwise the
accumulator has to be reset to zero at the start of a new result-cycle (in which an
output sample is accumulated). If the accumulator were set to zero between
result-cycles in this way, one extra clock cycle would be required for the evaluation of
each output sample, thus degrading performance.
Section 3 : Matrix-Vector Multiplication Analysis
Matrix-vector multiplication is essentially a series of vector dot-products, as element
r of the resultant vector is the dot-product of row r of the matrix with the multiplicand
column vector. This is illustrated below in figure 20, which shows how the (4x1)
resultant column vector R is formed from the multiplication of the (4x4) matrix M
and the (4x1) column vector V.
[ R(1) ]   [ M(1,1) M(1,2) M(1,3) M(1,4) ]   [ V(1) ]   [ M(1,1).V(1) + M(1,2).V(2) + M(1,3).V(3) + M(1,4).V(4) ]
[ R(2) ] = [ M(2,1) M(2,2) M(2,3) M(2,4) ] X [ V(2) ] = [ M(2,1).V(1) + M(2,2).V(2) + M(2,3).V(3) + M(2,4).V(4) ]
[ R(3) ]   [ M(3,1) M(3,2) M(3,3) M(3,4) ]   [ V(3) ]   [ M(3,1).V(1) + M(3,2).V(2) + M(3,3).V(3) + M(3,4).V(4) ]
[ R(4) ]   [ M(4,1) M(4,2) M(4,3) M(4,4) ]   [ V(4) ]   [ M(4,1).V(1) + M(4,2).V(2) + M(4,3).V(3) + M(4,4).V(4) ]
Figure 20 : showing how matrix-vector multiplication is comprised of a series of
vector dot-products
Section 3.1 : The Sequential Model of Computation
Matrix-vector multiplication is related to the FIR filter, as both algorithms consist of
a series of vector dot-products. Considering both algorithms in their sequential form,
their outer for-loop essentially iterates over the number of vector dot-products
required, and their inner for-loop iterates over the vector/matrix-row element pairs.
Figures 21 and 22 show a code description and dependency graph of the matrix-vector
multiplication problem's sequential model of computation respectively. For
simplicity it is assumed throughout this section, unless otherwise stated, that all
references to MACC units refer to non-pipelined MACCs with a total latency of 1
clock cycle.
void sequentialMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                    const float **m, const float *v, float *r)
{
    int row, col;
    for( row = 0; row < num_matrix_rows; row++ )
    {
        r[row] = 0.0f;
        for( col = 0; col < num_matrix_cols; col++ )
        {
            r[row] += m[row][col] * v[col]; // matrix is processed in ROW-MAJOR order
        }
    }
}
Figure 21 : showing the code description of the sequential model of computation of
the matrix-vector multiplication algorithm
[Figure content: a single MACC_1 sequentially accumulates R[row] = V(1).M(row,1)
+ V(2).M(row,2) + V(3).M(row,3) + V(4).M(row,4) for each row in turn, one
inner-product per clock cycle, over indices row = 0..3 and col = 0..3]
Figure 22 : showing the dependency graph of the matrix-vector multiplication
algorithm’s sequential model of computation (for a 4x4 matrix and 4x1 vectors)
However, as can be seen from figure 20 above, in matrix-vector multiplication each
element of the matrix is a multiplicand of strictly one inner-product in one vector dot-
product, and thus with reference to the program of figure 21, each element of the
matrix is strictly a multiplicand of one MACC operation in one specific iteration of
the inner-loop within one specific iteration of the outer-loop. This is in contrast to the
sequential (Single MACC) FIR filter algorithm where each input sample is a
multiplicand of one inner-product in N successive vector dot-products.
The multiplicand column vector of a matrix-vector multiplication is analogous to the
coefficient vector used in the FIR filter algorithm, as every vector dot-product
performed by either algorithm multiplies this vector by another.
Section 3.2 : Exploiting the Inherent Parallelism
The Transposed and Systolic FIR filter implementations discussed previously in
sections 2.2 and 2.3 respectively, were formed by completely unrolling the inner-loop
and then skewing the outer-loop (by a factor of 1 in opposite directions) of the original
sequential FIR filter code. With reference to figure 21 above, if the inner-loop of the
matrix-vector multiplication algorithm is unrolled by any factor, as was previously
discussed in section 2.2 (with regards to the FIR filter algorithm) the outer-loop then
has to be skewed for the MACC operations scheduled for simultaneous execution (by
different MACC units) to be independent of one another. However, as already
discussed, each matrix element is used as a multiplicand only once throughout the
execution of the entire algorithm, and thus, unlike with the FIR filter algorithm, the
direction in which the outer-loop is skewed essentially makes no difference, as each
MACC unit would still have to access a separate matrix element at the start of each
clock cycle. Thus, unlike the Transposed FIR filter, the access of each matrix element
(analogous to each input sample of the FIR filter) cannot be shared among all MACC
units. Similarly, unlike the Systolic FIR filter, there is no sense in feeding the matrix
elements through a tapped delay line in order to amortise the overhead of accessing
them.
Section 3.2.1 : Unrolling the Inner-Loop
Figure 23 below shows the dependency diagram of the matrix-vector multiplication
code (in its sequential form) of figure 21 after its inner-loop has been completely
unrolled (with each iteration executed on a separate MACC unit) and its outer-loop
has subsequently been skewed by a factor of +1 in a way analogous to that used in
creating the Systolic FIR filter. With this series of transformations, each MACC unit
employed effectively processes one column of the matrix.
[Figure content: MACC_1 to MACC_4 evaluate R(1)..R(4), where R(r) = V(1).M(r,1)
+ V(2).M(r,2) + V(3).M(r,3) + V(4).M(r,4); the skewed schedule spans col = 0..3 and
row = 0..3, with spin-up and spin-down phases at either end]
Figure 23 : showing the dependency diagram of the matrix-vector multiplication
algorithm's sequential model of computation after completely unrolling its inner-loop
and skewing its outer-loop by a factor of +1
With reference to figure 23 above, the magenta and purple circles represent MACC
units used during the spin-up and spin-down procedures respectively, whilst the red
circles represent MACC units used during the steady-state. As can be seen from
figure 23, execution is only in the steady-state for one clock cycle. In order to achieve
better utilisation of the MACC units employed, several such matrix-vector
multiplication problems could be carried out in succession to amortise the overhead of
the spin-up and spin-down procedures. By doing this, the total execution time of all
the problems would be approximately four times faster than executing them on a
single MACC unit as is done for the matrix-vector multiplication problem’s
sequential model of computation detailed previously in figures 21 and 22.
Alternatively, the inner-loop could be unrolled by a factor of only two (thus
employing only two MACC units), where the sum of the first two inner-products of
each vector dot-product would need to be stored as an intermediate value in a register
file. This implementation of the matrix-vector multiplication algorithm would be
approximately twice as fast as the implementation of its sequential model of
computation, and without as much need to chain together several problems to
amortise the spin-up and spin-down latency.
An advantage of this implementation (regardless of the factor the inner-loop is
unrolled by) is that each MACC unit uses the same vector-element throughout
processing its respective column of the matrix, and thus the fetch operation for each
vector element is amortised over the execution of the entire problem.
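Serialised into C for illustration (the column passes below would actually execute concurrently on the four MACC units), the arrangement looks as follows; the single fetch of v[c] per column is the amortised vector-element access described above:

```c
#include <assert.h>

/* Each "MACC unit" owns one matrix column: unit c fetches v[c] once and
 * reuses it for every row, adding its products into the running partial
 * sums that the units pass along the chain. Serialised sketch only. */
void matvec_column_per_unit(const float m[4][4], const float v[4], float r[4])
{
    for (int row = 0; row < 4; row++)
        r[row] = 0.0f;
    for (int c = 0; c < 4; c++) {          /* one pass per MACC unit */
        const float vc = v[c];             /* vector element fetched once */
        for (int row = 0; row < 4; row++)
            r[row] += m[row][c] * vc;      /* unit c's contribution to R(row) */
    }
}
```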
Section 3.2.2 : Unrolling the Outer-Loop
Figure 24 below shows the dependency diagram of the implementation that results if
instead the outer-loop is completely unrolled, and thus each MACC unit employed
processes one row of the matrix.
[Figure content: MACC_1 to MACC_4 each accumulate one row's result R(r) =
V(1).M(r,1) + V(2).M(r,2) + V(3).M(r,3) + V(4).M(r,4) in parallel, one inner-product
per col iteration (col = 0..3), with no dependencies between rows]
Figure 24 : showing the dependency diagram of the matrix-vector multiplication
algorithm’s sequential model of computation after its outer-loop is completely
unrolled
As can be seen from figure 24 above, there are no dependencies across separate
iterations of the outer-loop, so after this unrolling there is no need to skew any
instance of the inner-loop. Therefore execution is always in the steady state (as only
spatial parallelism is employed), meaning that all MACC units are always utilised
during execution. This implementation of the matrix-vector multiplication algorithm
would be four times faster than the implementation of its sequential model of
computation, and there is no need to amortise any spin-up and spin-down latency over
the execution of multiple problems as was the case for the implementation that results
from unrolling the inner-loop. Another advantage of this implementation is that each
vector-element is fetched only once (as is also the case when the inner-loop is
unrolled) where the MACC units employed always share the same vector-element
argument.
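Serialised into C for illustration, the row-per-unit arrangement is simply four independent dot-products (each outer-loop iteration below stands for one MACC unit, and all four would run concurrently):

```c
#include <assert.h>

/* Each "MACC unit" evaluates one complete row dot-product; there are no
 * dependencies between rows, so no skewing is needed and every unit
 * stays busy for the whole computation. Serialised sketch only. */
void matvec_row_per_unit(const float m[4][4], const float v[4], float r[4])
{
    for (int row = 0; row < 4; row++) {    /* one iteration per MACC unit */
        float acc = 0.0f;
        for (int col = 0; col < 4; col++)
            acc += m[row][col] * v[col];
        r[row] = acc;
    }
}
```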
Section 3.3 : Using Pipelined MACC Units
Until now, the MACC units discussed have been non-pipelined, where new operands
are only issued to such a unit after it has finished processing its previous operands.
The Transposed, Systolic and Semi-Parallel FIR filter implementations discussed
previously in section 2, used a pipeline of non-pipelined MACC units. As was
discussed in section 2, pipelining temporally overlaps (skews) multiple execution
threads, and thus once the initial latency (whilst the pipeline is filled) has been
endured, the subsequent throughput achievable is n times greater (where n is the
degree of pipelining employed, and the pipeline is balanced). This results from the
non-pipelined execution unit being segmented into n stages, where each stage
contributes an nth of the overall latency, and is thus able to be clocked at n times the
rate of the non-pipelined equivalent. Thus if a single MACC unit was pipelined by a
degree of n (and clocked at n times the clock frequency), once the initial latency of
filling the pipeline had been endured, the subsequent throughput would be n times
greater than that possible with the non-pipelined version. For simplicity, every
instance of the word pipeline throughout this document will refer to a balanced
pipeline, unless otherwise stated. The benefit of a pipelined MACC unit over its non-
pipelined equivalent is depicted below in figure 25.
[Figure content: over clock cycles 0..11, the non-pipelined MACC completes only
MACC0..MACC2 (one every 4 cycles), whereas the 4-stage pipelined MACC overlaps
MACC0..MACC8 across pipeline stages 1..4, completing one MACC per cycle once
the pipeline is filled]
Figure 25 : showing an example to illustrate the benefit of a pipelined MACC over its
non-pipelined equivalent (with n = 4)
Section 3.3.1 : Optimising the Code for Execution on a Pipelined MACC Unit
As discussed previously in section 2.2 the series of MACC operations that a specific
vector dot-product is comprised of have data-dependencies. Thus if the matrix-vector
multiplication problem’s sequential model of computation (shown previously in
figure 21) was executed on a pipelined MACC (consisting of a pipelined multiplier
followed by a pipelined adder), the achievable throughput would not be any higher
than that with the non-pipelined version of the MACC. This is illustrated below in
figure 26. When executing the sequential model of computation, the pipelined
MACC effectively skews the outer-loop.
[Figure content: the dependent instructions MACC0..MACC3 (where MACC0 performs
R(1) += M(1, 1) * V(1), MACC1 performs R(1) += M(1, 2) * V(2), and so on, with
R(1) beginning as zero) must each wait for the previous result to leave the 4-stage
pipeline, so over clock cycles 0..15 the pipelined MACC completes no more
instructions than the non-pipelined MACC]
Figure 26 : showing an example to illustrate the effect of issuing successive MACC
instructions (that are dependent) to a pipelined MACC unit
For simplicity, it is assumed that all three arguments are supplied as part of a MACC
instruction when it is issued to a MACC unit.
The code description of the matrix-vector multiplication algorithm shown below in
figure 27 is a re-write of the sequential code shown previously in figure 21. The outer
and inner loops have been swapped around, which thus requires the matrix to be
processed in column-major order (as opposed to the row-major order of the sequential
code). The reason the two loops have been swapped around is so that dependent
MACC operations are scheduled as far apart as possible, and this is illustrated in the
dependency diagram of this code, which is shown below in figure 28.
void pipelinedMatrixVectorMultiply(int num_matrix_rows, int num_matrix_cols,
                                   const float **m, const float *v, float *r)
{
    int row, col;  // r is assumed to be zero-initialised
    for( col = 0; col < num_matrix_cols; col++ )
    {
        for( row = 0; row < num_matrix_rows; row++ )
        {
            r[row] += m[row][col] * v[col]; // matrix is processed in COLUMN-MAJOR order
        }
    }
}
Figure 27 : showing the code description of the sequential matrix-vector
multiplication algorithm re-written for execution on a single pipelined MACC unit
[Figure content: a single pipelined MACC_1 interleaves the four dot-products,
accumulating R(1)..R(4), where R(r) = V(1).M(r,1) + V(2).M(r,2) + V(3).M(r,3) +
V(4).M(r,4); consecutive MACC operations on the same R(r) are now scheduled n = 4
issue slots apart (col = 0..3, row = 0..3)]
Figure 28 : Showing the dependency diagram of the code of figure 27 above, with the
degree of pipelining n = 4
As can be seen from figure 28, this re-written code description of the matrix-vector
multiplication algorithm essentially overlaps the execution of the vector dot-products
by interleaving their constituent MACC operations. In this way, the first inner-
product of all the vector dot-products in turn is calculated and accumulated, after
which the same is done for the second inner-product of all the vector dot-products,
and so on.
The number of vector-dot products a particular matrix-vector multiplication consists
of is equal to the number of elements in the resultant vector, which is equal to the
number of rows in the matrix. With regards to the matrix-vector multiplication
algorithm detailed in figures 27 and 28 above, as this number (represented by the
variable num_matrix_rows in figure 27) is increased, dependent MACC operations are
scheduled further apart in time. If this number is greater than or equal to the number
of pipeline stages within the MACC, then the optimum throughput of the MACC can
be achieved (n times that of its non-pipelined version), as each time a MACC
instruction is issued, all those it depends on will have completed, thus
allowing it to be executed immediately.
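This scheduling condition can be checked with a toy issue-time model (an illustrative sketch, not the FPU's actual scheduler): a MACC may only issue once the previous MACC targeting the same result register has cleared the n-stage pipeline.

```c
#include <assert.h>

/* Toy model: count the issue slots needed for rows*cols dependent MACCs
 * on an n-stage pipelined MACC, interleaving the rows as in figure 27.
 * ready[row] records when that row's accumulator is next available. */
int matvec_cycles(int rows, int cols, int n_stages)
{
    int ready[64] = {0};               /* assumes rows <= 64 */
    int t = 0;                         /* current issue cycle */
    for (int col = 0; col < cols; col++)
        for (int row = 0; row < rows; row++) {
            if (t < ready[row])
                t = ready[row];        /* stall until the dependency clears */
            ready[row] = t + n_stages; /* result emerges n cycles later */
            t++;                       /* one issue slot per cycle */
        }
    return t;
}
```

With 4 rows and a 4-stage pipeline the 16 MACCs issue in 16 cycles with no stalls, whereas a single row of 4 dependent MACCs occupies 13 issue slots on the same pipeline.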
As has been demonstrated, if the code is written such that dependent operations are
scheduled far enough apart, the use of a pipelined MACC can increase throughput by
a factor of n (where n is the degree of pipelining).
Section 4 : The Floating Point Unit
The implementations of the matrix-vector multiplication algorithms discussed
previously in section 3 are all based on what were termed MACC units, which in
concept have the capabilities of triple-word read, write and multiply-accumulate.
This section details the design and implementation of a Floating-Point Unit (FPU) that
acts as one of those MACC units, and is pipelined in accordance with the findings of
section 3. As was seen previously in section 1.1.2, multiply-accumulate is not the
only elemental operation required to implement high-level OpenGL functions (and
graphics rendering functions in general). With this in mind, the FPU has been
designed in such a way that its instruction set is easy to extend. At the core of the
FPU is a 5-stage pipelined multiplier and a 3-stage pipelined adder. These may be
used in immediate succession to execute a multiply-accumulate (MACC) instruction,
or individually to execute either a multiply or an add instruction. These particular
pipeline lengths were chosen by considering the type of OpenGL program it was
envisaged would be executed on the FPU during its development, and the FPU has
been designed such that these pipeline lengths can easily be changed.
Figure 29 below depicts this initial FPU architecture.
Figure 29 : showing the initial design of the FPU
The control unit of the FPU is modelled as a program written in M-code, which is
encapsulated within the Embedded Matlab Function block labelled
FPU_Control_Unit on the diagram shown in figure B1 of appendix B. M-code is
Matlab’s equivalent to C-code, and the inputs to the FPU_Control_Unit block are
passed through to and processed by the embedded program. This program is
executed and assigns values to its outputs once per step of Simulink’s simulation time.
One of these simulation time steps is analogous to one clock cycle.
The instruction format of the FPU has been devised to be compliant with the IBM
PowerPC interface standard, and is shown below in figure 30. This has been done to
allow for the future possibility of employing the FPU alongside a PowerPC RISC as
the latter’s Fabric Co-Processor Module (FCM), as the PowerPC is known to run
OpenGL code in this configuration.
MNEMONIC : bits 21-26
RT/S : bits 14-20
RA : bits 7-13
RB : bits 0-6
Figure 30 : showing the format of the FPU’s instruction word
With reference to figure 30 above, the mnemonic field of an arithmetic instruction
tells the FPU_Control_Unit program which of the three types of arithmetic
instruction it is. The RT/S field is the register number of the instruction’s destination
register (and third source register of a MACC), and the RA and RB fields are the
instruction’s source register numbers.
The word length of all data processed by the FPU is 32 bits (in accordance with the
standard single-precision binary representation of floating-point numbers). Also part of the FPU is a
100-word register file, which all data is read from and written to. Throughout this
section it is assumed that all input data has already been loaded into this register file,
with section 5.2.6 later describing a DMA unit that was designed and developed to
transfer data in and out of the register file without holding the FPU up. The register
file facilitates three simultaneous reads and one write per clock cycle. Throughout
section 3 it was assumed that when a MACC instruction was issued to a pipelined
MACC unit, all three arguments were also supplied at once. However, when the FPU
begins the execution of a MACC instruction, only the two multiplicand arguments are
fetched immediately, and the third argument is fetched at the start of the accumulate
stage. Thus three reads on the register file per clock cycle must be provided so that
the FPU has the capability to begin executing a new MACC or multiply instruction in
the same clock cycle that it starts executing the accumulate stage of a down-stream
MACC instruction.
4.1 : Dealing with Hazards
The ability to issue the FPU any instruction that it supports in any clock cycle
abstracts the programmer from the architecture. This allows them to get working
code earlier in the design cycle (before optimisation), as opposed to code only
working once it has been optimised exactly for this FPU. In the future, if a compiler is
developed, this capability will allow the FPU to execute code written for other
architectures. The FPU has this capability as a result of being designed to prevent
structural and data hazards from manifesting into errors.
The FPU has been designed to deal with the structural hazard that occurs when the
new instruction issued is an add, and the accumulate part of a MACC instruction is
due to begin in that clock cycle. The conflict here is that the accumulate part of a
MACC instruction also requires its associated input arguments to be entered into the
adder unit. In this event, priority is given to the accumulate, and the FPU is stalled,
whereby it does not allow new instructions to be issued until the adder becomes
available and execution of the pending add instruction has subsequently begun.
The FPU has also been designed to deal with the data hazards that arise when a newly
issued instruction has a variable (represented by a specific register) that is the output
variable of an instruction in the Execution_pipeline at that time. These data hazards
are prevented from manifesting into an error by the FPU stalling whenever such an
instruction is issued to it. This prevents any instruction executed from fetching one of
its source registers before the register’s contents have been updated by a down-stream
instruction, and similarly writing to a register before its contents have been fetched by
the accumulate stage of a down-stream MACC instruction.
4.1.1 : Using the Scoreboard to Detect Data Hazards
The Scoreboard is essentially a table that records which registers in the register file
are the destination register of an instruction currently in the pipeline, and when they
will be written to (updated). The FPU_Control_Unit program maintains the FPU’s
Scoreboard as a binary vector, with one 6-bit field for each of the 100 registers of the
register file. Each of these fields is split into two sub-fields, as illustrated below in
figure 31.
[Figure 31 drawing : the Scoreboard is conceptually a table with a count down entry
and an execution unit entry for each register (R1 to R100) of the register file,
although it is actually represented within the FPU_Control block as a binary vector of
one 6-bit field per register. Bits 2-5 of each field hold the count down value and bits
0-1 hold the exec_unit code (01 = adder, 10 = multiplier). The field for R1 occupies
bits 0-5 of the vector, that for R2 bits 6-11, and so on up to R100 at bits 594-599.]
Figure 31 : Illustrating the concept of the Scoreboard and how it is represented
As can be seen above in figure 31, the 2 least significant bits of a field (its exec_unit
subfield) represent the actual execution unit that will produce the result to be written
into the field’s respective register, and the 4 most significant bits of a field (its count
down subfield) represent the number of clock cycles before that write-back operation
will occur. After consideration of the type of NLPs to be executed on the FPU (as
detailed previously in sections 2 and 3), for simplicity the FPU was not designed
with the capability of arbitrating between multiple write-back operations. Thus only
programs that produce strictly no more than one write-back
operation per clock cycle are supported.
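As a sketch of this encoding (assuming the bit positions shown in figure 31), each 6-bit Scoreboard field can be packed and unpacked as follows:

```c
#include <stdint.h>

/* One 6-bit Scoreboard field: bits 2-5 hold the count down to the
   write-back (0 = no write pending) and bits 0-1 hold the exec_unit
   code (01 = adder, 10 = multiplier). */
enum { EXEC_NONE = 0, EXEC_ADDER = 1, EXEC_MULT = 2 };

static uint32_t field_pack(uint32_t countdown, uint32_t exec_unit)
{
    return ((countdown & 0xFu) << 2) | (exec_unit & 0x3u);
}

static uint32_t field_countdown(uint32_t f) { return f >> 2;   }
static uint32_t field_exec(uint32_t f)      { return f & 0x3u; }
```

The 4-bit count down sub-field accommodates values up to 15, which covers the longest instruction latency in the design.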
The position of the field within the Scoreboard vector as a whole is representative of
the actual register it represents, where the least significant field represents register R1,
and successive fields represent the registers in ascending order. In each simulation
time-step the Scoreboard is updated to decrement any non-zero count down sub-fields
and add details of a new instruction if one is submitted to the Execution_pipeline by
the FPU_Control_Unit program.
As stated previously, there are 100 registers in the register file and thus the
Scoreboard binary vector consists of 6 x 100 = 600 bits. In Simulink
unsigned integers are represented using 32 bits, and thus to model this vector it was
broken up into twenty 30-bit vectors. This is illustrated below in figure 32.
[Figure 32 drawing : the 600-bit Scoreboard vector is split into twenty 30-bit
segments, each holding the 6-bit fields of five registers. Scoreboard_1 holds R1 (bits
0-5) to R5 (bits 24-29), Scoreboard_2 holds R6 to R10, and so on up to
Scoreboard_20, which holds R96 to R100.]
Figure 32 : Illustrating how the Scoreboard binary vector is split into 20 segments in
order to represent it in Simulink
With reference to figure 32 above, Scoreboard_1 holds the status of registers R1 to
R5, Scoreboard_2 holds the status of registers R6 to R10, and so on for successive
Scoreboards up to Scoreboard_20. Each of the twenty Scoreboard segments are
represented within the FPU_Control_Unit program as persistent variables.
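The mapping from a register number to its segment and bit position, implied by this split, can be sketched as:

```c
/* Locate the 6-bit Scoreboard field of register Rn (n = 1..100) within
   the twenty 30-bit segments: Scoreboard_1 holds R1-R5, Scoreboard_2
   holds R6-R10, and so on; within a segment the lowest-numbered
   register occupies the least significant bits. */
static int field_segment(int reg) { return (reg - 1) / 5 + 1; }   /* 1..20 */
static int field_offset(int reg)  { return ((reg - 1) % 5) * 6; } /* 0,6,12,18,24 */
```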
Figure 33 below shows how the Scoreboard is updated each time the FPU submits a
new instruction to its Execution_pipeline.
The instruction issued in clock cycle 0 is : R5 += R1 * R2 (MACC)
The instruction issued in clock cycle 1 is : R1 = R3 * R4 (multiply)
The instruction issued in clock cycle 2 is : R4 = R2 + R3 (add)

Scoreboard state (count down, execution unit) for registers R1 to R5 :
end of clock cycle 0 : R5 = (9, 1); all other registers (0, 0)
end of clock cycle 1 : R1 = (6, 2), R5 = (8, 1); all others (0, 0)
end of clock cycle 2 : R1 = (5, 2), R4 = (4, 1), R5 = (7, 1); all others (0, 0)

Figure 33 : showing an example to illustrate how the FPU updates the Scoreboard
each time a new instruction is submitted to the Execution_Pipeline
Figure 33 above shows how the state of the Scoreboard (for the registers of concern)
changes over three successive clock cycles, in which a MACC, a multiply and an add
instruction are issued to the FPU and subsequently submitted to the
Execution_pipeline, where their execution begins. As can be seen in figure 33, each
time an instruction is submitted to the Execution_pipeline the count down entry in the
Scoreboard for that instruction’s destination register is set to the latency of the
instruction (in clock cycles) and this value is subsequently decremented in each clock
cycle thereafter. With reference to figure 33, registers R4, R1 and R5 will be written
to in clock cycles 6, 7, and 9 respectively. The exec_unit entry for the instruction’s
destination register is set to the code representing the particular execution_unit within
the Execution_pipeline that will produce its result as detailed previously in figure 31.
A synopsis of updateScore(), the FPU_Control_Unit sub-function responsible for
updating the Scoreboard (in the way shown in figure 33 above) each time a new
instruction is submitted to the Execution_pipeline, is shown in figure B2 of appendix
B, followed by a description.
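A behavioural sketch of updateScore() is given below. The instruction latencies used (9 cycles for a MACC, 6 for a multiply, 4 for an add) are those implied by the count down values in the figure 33 example; the mnemonic codes are hypothetical.

```c
/* Scoreboard model: one count down and one exec_unit entry per register. */
enum { MNEM_MACC, MNEM_MULT, MNEM_ADD };
enum { EXEC_NONE = 0, EXEC_ADDER = 1, EXEC_MULT = 2 };

static int countdown[101];   /* indices 1..100 mirror R1..R100 */
static int exec_unit[101];

/* On submission of an instruction to the Execution_pipeline, mark its
   destination register with the instruction latency and the execution
   unit that will produce the result (the adder produces the final
   result of both the add and the MACC). */
static void update_score(int mnemonic, int dest_reg)
{
    switch (mnemonic) {
    case MNEM_MACC: countdown[dest_reg] = 9; exec_unit[dest_reg] = EXEC_ADDER; break;
    case MNEM_MULT: countdown[dest_reg] = 6; exec_unit[dest_reg] = EXEC_MULT;  break;
    case MNEM_ADD:  countdown[dest_reg] = 4; exec_unit[dest_reg] = EXEC_ADDER; break;
    }
}
```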
The code of the sboardCycle() sub-function of FPU_Control_Unit is shown in figure
B3, followed by a description in appendix B. This sub-function is executed every
clock cycle (once per Scoreboard segment) to update the Scoreboard so as to reflect
the passing of one clock cycle. This is done by decrementing all non-zero countdown
sub-fields throughout the entire Scoreboard. A countdown value of 1 indicates that a
write back operation is scheduled to occur on the field’s register in the current clock
cycle (as the countdown value will become zero when it is decremented in that same
execution of sboardCycle()), so in this event sboardCycle() asserts the write back
operation. After asserting a write back
operation, sboardCycle() sets the field’s exec_unit sub-field to zero. Thus all
Scoreboard fields whose respective registers are not destination registers of an
instruction currently in the Execution_pipeline are maintained with a zero value in
both sub-fields.
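The per-cycle behaviour just described can be sketched as follows (a simplified model, assuming at most one write-back per cycle as the FPU requires):

```c
#define NUM_REGS 100

static int countdown[NUM_REGS + 1];   /* indices 1..100 mirror R1..R100 */
static int exec_unit[NUM_REGS + 1];
static int writeback_reg;             /* register written this cycle, 0 if none */

/* One call per clock cycle: decrement every non-zero count down; a
   value of 1 means the write-back falls due this cycle, so assert it
   and return the field to the all-zero state. */
static void sboard_cycle(void)
{
    writeback_reg = 0;
    for (int r = 1; r <= NUM_REGS; r++) {
        if (countdown[r] == 0)
            continue;
        if (countdown[r] == 1) {
            writeback_reg = r;        /* assert the write back operation */
            exec_unit[r] = 0;
        }
        countdown[r]--;
    }
}
```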
4.1.2 : Managing Data Hazards
As discussed previously at the beginning of section 4.1, when the Controller
(discussed later in section 5) issues the FPU with a new instruction,
FPU_Control_Unit checks the Scoreboard to decide whether or not submitting the
instruction to the Execution_pipeline may cause an error to occur due to a data
hazard. This is illustrated below in figure 34.
The instruction issued in clock cycle 0 is : R5 += R1 * R2 (MACC)
The instruction issued in clock cycle 1 is : R1 = R3 * R4 (multiply)
The instruction issued in clock cycle 2 is : R5 = R1 + R2 (add)

Scoreboard state (count down, execution unit) for registers R1 to R5 :
end of clock cycle 0 : R5 = (9, 1); all other registers (0, 0)
end of clock cycle 1 : R1 = (6, 2), R5 = (8, 1); all others (0, 0)
end of clock cycle 2 : R1 = (5, 2), R5 = (7, 1); all others (0, 0) (the add
instruction fails the Scoreboard check and is not submitted)
Figure 34 : showing an example to illustrate how the Scoreboard is checked to
prevent data hazards
With reference to figure 34 above, the multiply instruction issued to the FPU in clock
cycle 1 passes the Scoreboard check (assuming there are no instructions in the
Execution_pipeline before clock cycle 0), because in clock cycle 1 neither of its
source registers nor its destination register is the target of a pending write back
operation. However, the add instruction issued to the FPU in clock cycle 2 fails the
Scoreboard check for two reasons. Firstly,
one of its source registers (R1) is the target of the write back operation scheduled in
clock cycle 7 after the multiply instruction. Thus if this add instruction was submitted
to the Execution_pipeline in clock cycle 2, it would fetch and use the contents of R1
in the same clock cycle, before they had been updated in clock cycle 7 by the multiply
instruction.
In order to abstract the programmer from the architecture it must be assumed that they
don’t consider the latency of the instructions and that their intention in this event
would be for the result of the multiply instruction to be added to R2 by the add
instruction. Thus if this was the only cause for the Scoreboard check failure, the add
instruction could be submitted to the Execution_pipeline (without causing a data
hazard) in clock cycle 8 or thereafter.
However, a second cause of the add instruction’s Scoreboard check failure is that its
destination register (R5) is the target of a write back operation scheduled in clock
cycle 9 after the MACC instruction. Thus a data hazard
exists because the destination register of a MACC instruction is first used as its third
source register. As such, if the add instruction was submitted to the
Execution_pipeline whilst the MACC instruction was still being executed, it could
write to R5 before this register had been fetched for the accumulate stage of the
MACC instruction. For simplicity, this event always results in a Scoreboard check
failure, regardless of whether or not the instruction’s write back would occur after the
register fetch for the add stage of the MACC instruction.
The sub-function of FPU_Control_Unit that is employed to check the Scoreboard for
a particular register is checkScore(), a synopsis of which is shown in figure B4,
followed by a description in appendix B.
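In outline (and simplifying the real checkScore(), which the appendix describes in full), the check amounts to refusing any instruction that names a register with a pending write-back:

```c
static int countdown[101];   /* non-zero = write-back pending on R1..R100 */

/* A single register passes the check only if no write-back is pending. */
static int check_score(int reg)
{
    return countdown[reg] == 0;
}

/* An instruction may be submitted only if its destination (RT/S) and
   both sources (RA, RB) all pass the check; otherwise the FPU stalls
   and the instruction must be re-issued in a later clock cycle. */
static int can_submit(int rts, int ra, int rb)
{
    return check_score(rts) && check_score(ra) && check_score(rb);
}
```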
4.1.3 : Managing Structural Hazards
As well as the data hazards discussed previously in section 4.1.2, a potential
structural hazard also exists, as the Execution_pipeline’s adder is used in executing
both the add instruction and the accumulate stage of the MACC instruction. Figure
35 below shows an example to illustrate this structural hazard, and how it is dealt with.
The instruction issued in clock cycle 0 is : R5 += R1 * R2 (MACC)
Two alternative instructions issued in clock cycle 5 are considered :
R1 = R3 * R4 (multiply) : does not require immediate use of the adder, and so may
be submitted to the Execution_pipeline (R1 becomes (6, 2), with R5 at (4, 1))
R4 = R2 + R3 (add) : requires the adder in the same clock cycle as the MACC’s
accumulate stage, and so is not submitted (the Scoreboard is unchanged, with R5 at
(4, 1))
Figure 35 : showing an example to illustrate the structural hazard concerning the use
of the Execution_pipeline’s adder by both MACC and add instructions
Figure 35 above shows a MACC instruction (issued in clock cycle 0) split into its
multiply-stage and accumulate-stage. In one clock cycle the last stage of the multiply
and the first stage (register fetch) of the accumulate are conducted in parallel; this is
done so as to hide the latency of the accumulate-stage’s register fetch. Figure 35
shows the outcome of issuing two alternative instructions to the FPU in that clock
cycle (5). As is the case with both the multiply and MACC instructions, if the
instruction issued in this clock cycle does not require immediate use of the adder then
it is eligible for submission to the Execution_pipeline.
However, as can be seen in figure 35 this is not the case with the add instruction, and
as such FPU_Control_Unit would not submit an add to the Execution_pipeline in this
situation, regardless of whether or not it passed its Scoreboard checks. Priority is
always given to the accumulate-stage of a MACC in this way for simplicity. In the
situation depicted by figure 35, the earliest time after this clock cycle that an add
instruction could be submitted to the Execution_pipeline would be clock cycle 6.
Details of how the FPU program implements the execution of the different
instructions and the management of hazards are contained in appendix B.
Section 5 : The Controller
The Controller is responsible for issuing the FPU with the next instruction to be
executed. If the FPU stalls in the event of detecting a structural or data hazard, it
asserts a ‘1’ on its stall output and the Controller must re-issue the stalled instruction
in subsequent clock cycles until the FPU submits it to its Execution_pipeline and
asserts a ‘0’ back on its stall output. In the clock cycle after this submission occurs,
the Controller must issue the FPU with the next instruction to be executed.
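This handshake can be sketched behaviourally as follows; the issue function standing in for the FPU is a hypothetical test double, not part of the design.

```c
#include <stdint.h>

typedef int (*issue_fn)(uint32_t instr);   /* returns the FPU's stall output */

/* Issue each instruction in turn, re-issuing whenever the FPU asserts
   '1' on stall; returns the total number of clock cycles taken. */
static int run_program(const uint32_t *prog, int len, issue_fn issue)
{
    int pc = 0, cycles = 0;
    while (pc < len) {
        int stall = issue(prog[pc]);   /* issue (or re-issue) this word */
        if (!stall)
            pc++;                      /* submitted: advance next cycle */
        cycles++;
    }
    return cycles;
}

/* FPU stand-in that stalls the first instruction for two cycles. */
static int stalls_left = 2;
static int fake_issue(uint32_t instr)
{
    (void)instr;
    if (stalls_left > 0) { stalls_left--; return 1; }
    return 0;
}
```

Running a three-instruction program against this stand-in takes five cycles: one per instruction plus the two stall cycles.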
5.1 : Initial Look-Up Table Design
The initial design of the Controller was essentially a look-up table, where every
instruction of the program to be run was stored sequentially in program memory.
This look-up table design of the Controller is shown below in figure 36.
Figure 36 : showing the initial look-up table design of the Controller
For simplicity, during this developmental phase the PROG_COUNTER (counter) was
initialised with its count_from and count_to block parameters before running the
simulation. Similarly, the PROG_MEMORY (single-port RAM) was initialised with
the sequence of program instructions through its initial_value_vector block parameter.
The output of PROG_MEMORY is separated out into its constituent fields
(mnemonic, RT/S, RA and RB) as bit-slicing is not supported in Simulink’s Embedded
Matlab function block, thus preventing this from being carried out within
FPU_Control_Unit. With reference to figure 36 above, when the stall input is ‘0’,
both PROG_COUNTER and PROG_MEMORY are enabled, thus allowing the
program counter to progress by 1 and the output register of the RAM block to be
written to. As such, when the stall signal is ‘0’ successive instructions of the program
are output by PROG_MEMORY at a rate of one per clock cycle. Figure 37 below
uses an example to show how the Controller deals with an FPU stall.
[Figure 37 waveform : over clock cycles 0 to 8 the program counter steps through
addresses a0 to a6 while instruction words i0 to i5 are issued in turn; the stall signal
is ‘1’ during clock cycles 3 and 4, over which the program counter is held and the
stalled instruction word remains on the output of PROG_MEMORY.]
Figure 37 : showing an example to illustrate how an FPU stall is dealt with by the
Controller
In the example shown in figure 37 above, the FPU is issued with instruction i2 in
clock cycle 3 but cannot submit it to the Execution_pipeline for two successive clock
cycles (3 and 4). As can be seen in figure 37 above, when the FPU stalls and asserts a
‘1’ on the stall signal, PROG_COUNTER has advanced to the address of the next
instruction (a3) by the time the count has been stopped. However, as the output
register of PROG_MEMORY is disabled, PROG_MEMORY’s output value remains
as the word-value of the stalled instruction. In the clock cycle that the FPU does
submit i2 to the Execution_pipeline, it asserts a ‘0’ back on the stall signal, thus
enabling PROG_MEMORY’s output register, which is then written to with the word-
value of the next instruction to be executed. This can be seen in figure 37, where the
stalled instruction (i2) is submitted for execution in clock cycle 5, and in the
subsequent clock cycle the FPU is issued with the next instruction (i3).
5.2 : Optimising the Controller for Running Geometric Transformation
Programs
As the problem size (number of instructions the program consists of) increases,
storing every single instruction requires the program memory capacity to be bigger
than is practical. For example, the matrix-vector multiplication program shown in
figure 21 (and discussed previously in section 3.1, where the matrix is 4x4 and the
vector is 4x1) is executed using 16 separate MACC instructions (ignoring any
load and store operations required to get data in and out of the register file). Thus the
size of the program memory required to store this program when completely unrolled
is 16 instruction words.
Figure 38 below illustrates a similar program to that shown in figure 21 which
multiplies eight 4x1 Vectors by the same 4x4 matrix. This is a typical example of the
program run in carrying out both the modelview and projection transformations in the
per-vertex operations stage of the OpenGL pipeline, as discussed previously in
section 1.2.2.
[Figure 38 drawing : the 4x4 matrix M (elements M(1,1) to M(4,4)) is multiplied by
each of the eight 4x1 vectors V1 to V8 in turn, producing the eight 4x1 result vectors
R1 to R8, i.e. Ri = M x Vi for i = 1 to 8.]
Figure 38 : illustrating an example of the matrix-vector multiplication program carried
out by OpenGL’s modelview and projection transformations.
With reference to figure 38 above, the M matrix represents either the modelview or
projection matrix depending on which transformation is to be performed, and vectors
V1 to V8 represent the object or eye coordinate vectors of an object in the scene. The
object being transformed in this particular program has eight vertices, the simplest
example of which would be a cube. Figure 39 below shows a code description of this
program.
void pipelined1Matrix8VectorMultiply(int num_matrix_rows, int num_matrix_cols,
    const float **m, const float *v1, const float *v2, const float *v3,
    const float *v4, const float *v5, const float *v6, const float *v7,
    const float *v8, float *r1, float *r2, float *r3, float *r4,
    float *r5, float *r6, float *r7, float *r8)
{
int row, col;
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
r1[row] += m[row][col] * v1[col]; // matrix m is processed in COLUMN-MAJOR order
r2[row] += m[row][col] * v2[col];
r3[row] += m[row][col] * v3[col];
r4[row] += m[row][col] * v4[col];
r5[row] += m[row][col] * v5[col];
r6[row] += m[row][col] * v6[col];
r7[row] += m[row][col] * v7[col];
r8[row] += m[row][col] * v8[col];
}
}
}
Figure 39 : showing a code description of the example geometric transformation
program depicted previously in figure 38
Previously in section 3.3.1 an analysis of how to optimise the matrix-vector
multiplication algorithm for execution on a pipelined MACC unit was detailed. In
conjunction with the findings of this analysis, the two loops in the code of figure 39
above are arranged such that the matrix is processed in column-major order, so as to
schedule dependent MACC instructions further apart in time and thus avoid periods of
latency due to FPU stalls. However, this program has eight vectors, and the eight
separate matrix-vector multiplication problems are interleaved so as to schedule the
dependent MACC instructions (within each individual problem) even further apart in
time, allowing for an even greater overall pipeline depth and thus a higher throughput
to be achieved.
Completely unrolling both loops will yield the fastest execution speed as it entirely
removes the overhead of having to test loop conditions and execute branch operations
(to set the program counter back to the beginning of a loop). These overheads can
also be eliminated without unrolling any loops, by implementing the loop tests and
branch operations within the Controller, although this would be at the expense of
added hardware resources. As discussed previously, the disadvantage of completely
unrolling both loops is that the size of program memory required is increased by a
factor equal to the degree of unrolling.
As can be seen from figure 39 above, the inner-loop of the program contains 8 MACC
instructions, and so if this program was represented with both loops completely
unrolled, the size of the required program memory would be 8x4x4=128 instruction-
words. With both loops completely unrolled, transformations of larger sizes (i.e.
where there are more vertices in the scene overall) could be solved by running the
program on small sets of vectors (vertices) at a time, thus reducing the number of
instructions that need to be stored at any one time, and likewise the size of program
memory required. This approach would, however, introduce the overhead of switching
between these smaller programs.
5.2.1 : The Use of Induction Variables
As can be seen from the code of figure 39 above, all of the instructions are MACC
instructions. Thus when the program is represented with both loops completely
unrolled, the instructions stored in program memory would all have the same
mnemonic, and differ only in their three source/destination register fields (RT/S, RA
and RB). Moreover, the RT/S, RA and RB fields of successive instruction-words
always address elements of the same three arrays.
To simplify the addressing of arrays in NLPs, the majority of DSP compilers
introduce induction variables, which by definition are derived from the loop index
values. Considering the geometric transformation program of figure
39 for just one vector, a code description of this illustrating how induction variables
would be used to address the output and two input arrays is shown below in figure 40.
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
// matrix m is processed in column major order
p = (col * num_matrix_rows) + row; // p is an induction variable
*(r1 + row) += *(m + p) * *(v1 + col);
}
}
Figure 40 : showing an example to illustrate the use of induction variables for
addressing arrays
As can be seen in figure 40 above, the m array is indexed by adding its respective
induction variable p to its base pointer. Although p is the only new variable
introduced, the r1 and v1 arrays are also addressed in this way, where their respective
induction variables are exactly the values of the loop indices. With reference to figure
40 above, it can be seen how the use of induction variables allows for the successive
program instructions to be generated, with only a single generic instruction-word
stored in program memory of the form shown below in figure 41.
MACC (MNEMONIC) : bits 21-26
R1_base_pointer (RT/S) : bits 14-20
M_base_pointer (RA) : bits 7-13
V1_base_pointer (RB) : bits 0-6
Figure 41 : showing the single generic instruction-word from which all program
instructions could be derived for the program of figure 40
With reference to figure 41 above, the Controller would pass the mnemonic field
straight on to the FPU, but between issuing successive instructions it would have to
evaluate the values of the RT/S, RA and RB fields, which would require additional
hardware resources. If this penalty were migrated into software by issuing the FPU
with instructions to calculate the induction variable values, the need for extra
resources would be eliminated (apart from the extra program memory required) at the
cost of increased execution time. However, this is not an option, as the FPU does not
have an internal data-format for register addresses. With reference to the code of figure 40 above,
for this program these extra hardware resources would amount to two adders for
evaluating the R1 array addresses, likewise another two adders for evaluating the V1
array addresses and an adder and a multiply-add unit for evaluating the M array
address.
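A sketch of the address evaluation the Controller would perform between issues, under the assumption of figure 40's induction variables (the base-pointer values in the test are hypothetical):

```c
/* Derive the register fields of one issued instruction from a single
   generic instruction word: each base pointer is offset by its
   induction variable, with p = (col * num_matrix_rows) + row because
   the matrix is addressed in column-major order. */
static void derive_fields(int rt_base, int ra_base, int rb_base,
                          int row, int col, int num_matrix_rows,
                          int *rt, int *ra, int *rb)
{
    int p = col * num_matrix_rows + row;   /* induction variable for m */
    *rt = rt_base + row;                   /* r1 indexed by the row    */
    *ra = ra_base + p;                     /* m indexed by p           */
    *rb = rb_base + col;                   /* v1 indexed by the column */
}
```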
5.2.2 : Performing Strength Reduction on Induction Variables
In applying strength reduction to all three induction variables of the program of figure
40, the overhead cost of each inner-loop iteration is reduced. Figure 42 below shows
the code description after all three induction variables have undergone strength
reduction.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
*(r1 + r) += *(m + p) * *(v1 + c); // matrix m is processed in column major order
r++;
p++;
}
r = 0;
c++;
}
Figure 42 : showing the code description of the matrix-vector multiplication algorithm
after all three induction variables have undergone strength reduction
As can be seen from figure 42 above, there is no longer the need for a multiplication
in evaluating successive values of p, thus the additional hardware required to evaluate
the p induction variable is now down to two adders. To facilitate this strength
reduction on p, the m matrix must be stored in column major order. Strength
reduction also removes any dependencies of induction variables on the loop indices
(as is the case for those associated with the R1 and V1 arrays), which provides more
flexibility, as not all programs will have induction variables that directly correspond
to the loop indices.
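That the strength-reduced increments of figure 42 generate exactly the same sequence of m offsets as the multiplied form p = (col * num_matrix_rows) + row of figure 40 can be checked with a short sketch:

```c
/* Walk the loop nest of figure 42 and compare the incrementally
   maintained p against the multiplied form it replaces; returns 1 if
   every inner iteration agrees. */
static int strength_reduction_matches(int rows, int cols)
{
    int p = 0;                             /* strength-reduced variable */
    for (int col = 0; col < cols; col++)
        for (int row = 0; row < rows; row++) {
            if (p != col * rows + row)     /* column-major offset */
                return 0;
            p++;                           /* one add per iteration */
        }
    return 1;
}
```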
5.2.3 : Optimising a Geometric Transformation for the Controller
Considering the code shown in figure 39 above, where eight separate matrix-vector
multiplication problems are interleaved, this could be executed using separate base
pointers for the arrays of the separate problems, whilst using the same induction
variables across all problems. A code description of this solution is shown below in
figure 43.
p = 0;
r = 0;
c = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
for( row = 0; row < num_matrix_rows; row++ )
{
// matrix m is processed in column major order
*(r1 + r) += *(m + p) * *(v1 + c);
*(r2 + r) += *(m + p) * *(v2 + c);
*(r3 + r) += *(m + p) * *(v3 + c);
*(r4 + r) += *(m + p) * *(v4 + c);
*(r5 + r) += *(m + p) * *(v5 + c);
*(r6 + r) += *(m + p) * *(v6 + c);
*(r7 + r) += *(m + p) * *(v7 + c);
*(r8 + r) += *(m + p) * *(v8 + c);
r++;
p++;
}
r = 0;
c++;
}
Figure 43 : showing a code description of the program with eight matrix-vector
multiplication problems interleaved, with all induction variables having undergone
strength reduction.
With reference to figure 43 above, to run this code the Controller would have to store
one generic instruction word of the form shown previously in figure 41 for each of the
eight problems (i.e. one instruction word per vertex). This disadvantage arises from the
corresponding array base pointers having different values across the eight different
problems. The number of vertices an object has, or the number in a scene overall, can
be huge (reaching tens of thousands in very detailed scenes), so it is desirable that
only one generic instruction word be stored in program memory, from which all
successive program instructions are derived.
In order to achieve this, the corresponding arrays across the different problems need
to be combined. Considering the way the geometric transformation interleaves the
execution of the problems, in order to keep the expressions for evaluating the
induction variable values as simple as possible, the best way to combine the arrays is
to interleave them. As such, the register numbers of successive array elements
accessed can be evaluated largely by simple increment operations. This is shown
below in figure 44, which illustrates an example register file arrangement for the three
arrays.
Matrix array m (column-major order):
    R1  = M(1,1)   R2  = M(2,1)   R3  = M(3,1)   R4  = M(4,1)
    R5  = M(1,2)   R6  = M(2,2)   R7  = M(3,2)   R8  = M(4,2)
    R9  = M(1,3)   R10 = M(2,3)   R11 = M(3,3)   R12 = M(4,3)

Interleaved source vector array (v1 to v8):
    R17 = V1(1)    R18 = V2(1)    R19 = V3(1)    R20 = V4(1)
    R21 = V5(1)    R22 = V6(1)    R23 = V7(1)    R24 = V8(1)
    R25 = V1(2)    R26 = V2(2)    R27 = V3(2)    R28 = V4(2)

Interleaved result vector array (r1 to r8):
    R49 = R1(1)    R50 = R2(1)    R51 = R3(1)    R52 = R4(1)
    R53 = R5(1)    R54 = R6(1)    R55 = R7(1)    R56 = R8(1)
    R57 = R1(2)    R58 = R2(2)    R59 = R3(2)    R60 = R4(2)
Figure 44 : showing an example arrangement within the FPU’s register file, where the
three arrays of eight matrix-vector multiplication problems are interleaved together
As is illustrated in figure 44 above, the eight source and corresponding eight resultant
vectors are stored in the register file such that the first element of all eight vectors is
stored adjacently, in the order the vector pairs are processed by the
program (1 to 8), followed by the second element of all eight vectors, and so on. The M
matrix is stored in column-major order, as was discussed earlier in this section. The
arrangement of the source and destination registers means that the complexity of
interleaving and de-interleaving the arrays is handled outside the Controller, by
whatever logic loads data into and stores it out of the FPU’s register file. Consequently,
the register number sequences the Controller needs to generate are simpler
than if the arrays were simply concatenated, which will ease compiler
development if it is undertaken in the future. A code description of the program using
this register file arrangement, requiring only one generic instruction word to be
stored in program memory, is shown below in figure 45.
res = 0;
res_base = 0;
p = 0;
vec_base = 0;
vec = 0;
for( col = 0; col < num_matrix_cols; col++ )
{
    for( row = 0; row < num_matrix_rows; row++ )
    {
        for( vec_no = 0; vec_no < num_vectors; vec_no++ )
        {
            *(r0 + res) += *(m + p) * *(v0 + vec);
            res++;
            vec++;
        }
        p++;
        vec = vec_base;
    }
    res = res_base;
    vec = vec_base = vec_base + num_vectors;
}
Figure 45 : showing a code description of the program executing eight interleaved
matrix-vector multiplication problems with only one generic instruction word stored
in program memory
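The loop of figure 45 can be exercised against the register file layout of figure 44. The sketch below is a runnable model, assuming a flat register file in which the matrix occupies registers from R1, the interleaved source vectors start at R17, and the interleaved results start at R49 (register numbers modelled as list indices); the base register values and test data are illustrative.

```python
# A runnable model of figure 45 over the register arrangement of figure 44.
num_matrix_rows, num_matrix_cols, num_vectors = 4, 4, 8

regfile = [0.0] * 81
m, v0, r0 = 1, 17, 49           # array base pointers (register numbers)

# Load the matrix in column-major order and the interleaved source vectors
# (element c of vector i lands at register v0 + c*num_vectors + i).
for j in range(16):
    regfile[m + j] = float(j + 1)
for c in range(num_matrix_cols):
    for i in range(num_vectors):
        regfile[v0 + c * num_vectors + i] = float(i + c)

# The loop of figure 45: every MACC is the same generic instruction; only the
# three induction variables (res, p, vec) change between successive issues.
res = res_base = p = vec_base = vec = 0
for col in range(num_matrix_cols):
    for row in range(num_matrix_rows):
        for vec_no in range(num_vectors):
            regfile[r0 + res] += regfile[m + p] * regfile[v0 + vec]
            res += 1
            vec += 1
        p += 1
        vec = vec_base
    res = res_base
    vec_base = vec_base + num_vectors
    vec = vec_base
```

After the loop, element row of result vector i sits at register r0 + row*num_vectors + i, i.e. the results come out already interleaved in the figure 44 arrangement.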
5.2.4 : Designing the Optimal Controller
Considering the code of figure 45 above, it can be seen that, to evaluate the three
induction variables between issuing successive instructions, the operations
the Controller must be able to carry out on an induction variable to produce its
subsequent value are: incrementing it, setting it to a base value, and adding a constant
to that base value.
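These three operations can be stated compactly. The sketch below models one evaluation cycle for a single induction variable; the operation names are illustrative, not the Controller's actual instruction encoding.

```python
# Illustrative model of one induction-variable evaluation cycle. Each variable
# carries a current value and a stored base; the Controller applies one of
# three operations (plus an implicit hold) between successive instruction issues.
def update(value, base, op, const=0):
    """Return (new_value, new_base) after one evaluation cycle."""
    if op == "inc":            # value + 1, e.g. res++, vec++, p++ in figure 45
        return value + 1, base
    if op == "set_base":       # e.g. vec = vec_base, res = res_base
        return base, base
    if op == "add_to_base":    # e.g. vec = vec_base = vec_base + num_vectors
        return base + const, base + const
    return value, base         # hold: variable unchanged this cycle
```

For instance, at the end of each column in figure 45, vec undergoes add_to_base with constant num_vectors while res undergoes set_base, which is why the Controller must support different operations on different induction variables in the same cycle.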
As was discussed previously in section 1.2.1, as well as matrix-vector multiplication,
the two other prominent NLPs executed within the OpenGL pipeline (and the graphics
rendering domain in general) are the FIR filter and matrix-matrix multiplication. For
flexibility, and thus the ability to support a wider range of NLPs efficiently, the
Controller needs to be able to perform any combination of these operations on
any of the three induction variables in any one evaluation cycle between the issuing of
successive instructions.