igorFreire_UCI_real-time-dsp_reports

Experiment 1.5.1
Igor Antonio Auad Freire
Conclusions
From this experiment, it is possible to observe the importance of proper configuration in the linker
command file. CCS already provides a standard linker command file for the chosen chip architecture, but solely
with the purpose of presenting all the possible memory sections in the architecture. It is the designer's
responsibility to organize the memory sections for the linker in a way that, for example, is sufficient to store
the program and provides enough heap and stack space. In the case of the TMS320C5515, the standard linker
command file provided by CCS easily yields the following error if the stdio library is included:
"../C5515.cmd", line 74: error #10099-D: program will not fit into available
memory. placement with alignment/blocking fails for section ".text" size
When the linker command file is properly configured, the resume command runs the program until the end
and prints the program output to the console (in this case simply “Hello World”). In contrast, step over
executes the entire current line and stops at the next line. If a function call is present on a line, for example,
step over executes the entire function and stops after it.

Experiment 1.5.2
Conclusions
This experiment verifies basic skills with I/O functions for binary and text files. In summary, three main
operations are executed:
1. Read from .pcm file and write to .wav file.
2. Read from .wav file and write to .xls file.
3. Read from .xls file and write to .wav file.
The first operation requires reading a binary data file (.pcm) and writing binary data to a sound file in
.wav format. For this, it is necessary to prepend the 44-byte WAVE format header, which
essentially provides some metadata regarding sampling frequency, data size, quantization (number of bits
per sample), data rate, etc.
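As a rough illustration of the 44-byte header, the sketch below fills a canonical PCM WAVE header into a byte array. The function name build_wav_header and its parameters are hypothetical (they are not taken from the experiment code), and uncompressed PCM data with little-endian header fields is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Store a 16-bit or 32-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xFFFF));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Hypothetical helper: fill the canonical 44-byte PCM WAVE header. */
void build_wav_header(uint8_t hdr[44], uint32_t sample_rate,
                      uint16_t channels, uint16_t bits_per_sample,
                      uint32_t data_bytes)
{
    uint16_t block_align = (uint16_t)(channels * (bits_per_sample / 8));

    memcpy(hdr, "RIFF", 4);
    put_le32(hdr + 4, 36 + data_bytes);            /* file size minus 8 bytes   */
    memcpy(hdr + 8, "WAVE", 4);
    memcpy(hdr + 12, "fmt ", 4);
    put_le32(hdr + 16, 16);                        /* size of the fmt chunk     */
    put_le16(hdr + 20, 1);                         /* audio format 1 = PCM      */
    put_le16(hdr + 22, channels);
    put_le32(hdr + 24, sample_rate);
    put_le32(hdr + 28, sample_rate * block_align); /* data rate in bytes/second */
    put_le16(hdr + 32, block_align);
    put_le16(hdr + 34, bits_per_sample);
    memcpy(hdr + 36, "data", 4);
    put_le32(hdr + 40, data_bytes);                /* size of the sample data   */
}
```

In such a scheme the 44 bytes would be written with fwrite() before the samples read from the .pcm file.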
The second operation requires the reading of a binary data file and the text-formatted writing to a .xls
file. In contrast to the first, this writing is better performed by fprintf(), instead of fwrite().
Finally, the third operation requires the reading of a text-formatted file (.xls) and the writing to a
binary data file (sound file .wav). The procedure adopted to perform this operation is to read the .xls file
line by line, since it was written (in operation 2) using line breaks for each 16-bit word. Then, for each
line, the read string is converted to a 16-bit word and written (using fwrite()) byte by byte into the
output file.

Experiment 1.5.3
Conclusions
This experiment explores some of the debugging tools provided by Code
Composer Studio. By using the variables tool, for example, it is possible to easily inspect the contents
of each variable in memory by simply right-clicking and choosing “View on Memory”. Similarly, it is
possible to graphically visualize the data contained in the variable by right-clicking on it and choosing
“Graph”.
Figures 1 and 2 show, for illustration, the sinusoid contained in the array table from UITest.c for
a sampling frequency of 8 kHz and of 48 kHz, obtained using the graph tool. Furthermore, in order
to explore this tool further, a better formatted version of the same plot is presented in Figure 3, which
has modified axes, grid and plot ticks.
Figure 1: Sinusoid for 8000 Hz sampling frequency

Experiment 1.5.4
General Comments
Hum after program execution
The AIC3204 continuously outputs a hum after the program finishes its execution
(after the tone is played for the specified time). The program provided by the book already includes
code to disable the AIC3204 after execution (below). After enabling this snippet, the hum is no longer
present, since the AIC3204 audio codec chip is disabled after the tone is played.
#if 1
USBSTK5505_I2S_close(); // Disable I2S
AIC3204_rset( 1, 1 ); // Reset codec
USBSTK5505_GPIO_setOutput( GPIO26, 0 ); // Disable AIC3204
#endif
Tone generation
First of all, the program plays the same tone with frequency of 1 kHz for all allowed sampling frequencies.
Additionally, it always stores the tone to be played in a buffer of 48 signed 16-bit int samples. Hence, if a
sampling frequency of 48 kHz is adopted, for example, the buffer of 48 samples corresponds to a buffering
of 1 ms, which is enough to contain a complete period of the tone (1 ms, since the tone frequency is 1 kHz).
In the same manner, if a sampling frequency of 16 kHz is adopted, the buffer of 48 samples corresponds
to a buffering of 3 ms of the signal. Thus, since the signal period is 1 ms, 3 complete periods are contained
in the buffer (dataTable). Similarly, 6 complete periods are buffered for fs = 8 kHz, 4 periods for
fs = 12 kHz and 2 for fs = 24 kHz.
This is exactly what is done in the following loop:
for (n=k=0, i=0; i<m; i++) // Fill in the data table
{
for (j=k; j<SIZE; j+=m)
{
dataTable[n++] = table[j];
}
k++;
}
In this code, m is the number of periods contained in the 48-sample buffer. For fs = 8 kHz, for example,
m = 6. In this case, the loop fills the dataTable buffer as follows:
Iteration #1: dataTable[0] = table[0]
. . .
and so on, until the buffer is full.
This corresponds to filling dataTable with 6 periods from the waveform stored in buffer table. Note,
however, that the 6 periods are not exactly equal to each other. The first period, for example, is composed of
samples 0, 6, 12, ..., 42 from table, while the second period is composed of samples 1, 7, 13, ..., 43.
Figures 1, 2 and 3 present the waveform stored in the 48-sample buffer for frequencies of 8 kHz, 16 kHz
and 48 kHz, respectively.
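To make the reordering concrete, the sketch below wraps the same loop in a function so that it can be run on a host. The function name fill_data_table and the use of int16_t from stdint.h are assumptions made for illustration; the loop body is the one quoted from the experiment.

```c
#include <stdint.h>

#define SIZE 48  /* length of both buffers, as in the experiment */

/* Fill dataTable with m interleaved periods taken from table
 * (same loop as in the experiment, wrapped in a hypothetical function). */
void fill_data_table(const int16_t table[SIZE], int16_t dataTable[SIZE], int m)
{
    int i, j, k, n;

    for (n = k = 0, i = 0; i < m; i++) {
        for (j = k; j < SIZE; j += m) {
            dataTable[n++] = table[j];  /* period i uses samples i, i+m, i+2m, ... */
        }
        k++;
    }
}
```

If table[j] simply holds j, calling the function with m = 6 yields dataTable = 0, 6, 12, ..., 42, 1, 7, ..., 43, and so on up to 47, which matches the sample ordering described in the text.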
Figure 1: Waveform stored in the 48-sample buffer for a sampling frequency of 8 kHz

Experiment 1.5.5
General Comments
Assignment: Modify the audio loopback experiment such that it runs at 8000 Hz or
other sampling frequencies
The sampling frequency can be modified by changing the sampling frequency parameter passed to the function
AIC3204_init(), which initializes the stereo audio codec chip AIC3204. Thus,
AIC3204_init(SF48KHz, DAC_GAIN, (Uint16)ADC_GAIN);
is replaced by:
AIC3204_init(SF8KHz, DAC_GAIN, (Uint16)ADC_GAIN);
The consequence is audible, since the quality of the audio becomes poor. This is because input
frequencies above half the sampling frequency (4 kHz for fs = 8 kHz) cannot be reproduced. This translates
into a sensation of lower quality, since a significant part of the quality in the audio is formed by the
high-frequency harmonics.
Assignment: Modify the experiment such that the left audio output channel will
output the sum of input signals from both left and right channels, while the right
audio output channel will output the difference of input signals from the left and
right channels.
For this assignment, the two functions below were included:
Int16 dsp_process_sum(Int16 *inputl, Int16 *inputr, Int16 *output, Int16 size)
{
Int16 i;
for(i=0; i<size; i++)
{
*(output + i) = *(inputl + i) + (*(inputr + i));
}
return 1;
}
Int16 dsp_process_diff(Int16 *inputl, Int16 *inputr, Int16 *output, Int16 size)
{
Int16 i;
for(i=0; i<size; i++)
{
*(output + i) = *(inputl + i) - (*(inputr + i));
}
return 1;
}
The first function takes the left and right input buffers and outputs the sum of them to the output
buffer. The second takes the left and right input buffers and outputs the difference of them to the output
buffer.
Using those functions, the loop for looping back the audio at the input of AIC3204 becomes:
while (status) // Forever loop for the demo if status is set
{
if((leftChannel == 1)||(rightChannel == 1))
{
leftChannel = 0;
rightChannel= 0;
if ((CurrentRxL_DMAChannel == 2)||(CurrentRxR_DMAChannel == 2))
{
status = dsp_process_sum(RcvL1, RcvR1, XmitL1, XMIT_BUFF_SIZE);
status = dsp_process_diff(RcvL1, RcvR1, XmitR1, XMIT_BUFF_SIZE);
}
else
{
status = dsp_process_sum(RcvL2, RcvR2, XmitL2, XMIT_BUFF_SIZE);
status = dsp_process_diff(RcvL2, RcvR2, XmitR2, XMIT_BUFF_SIZE);
}
}
}
The consequence of this processing is also audible and easy to validate. When using a recording as
input to the AIC3204, the signals that are common to both left and right channels (namely recorded with
a pan of 0) are not present in the difference signal, which is the output to the right channel. This is easy to
note when looping back a music recording with vocals, for example. Since lead vocals are usually recorded
with a pan of 0, the signal at the output of the right channel from the AIC3204 effectively eliminates the
vocals.

Appendix C Experiments
1 Experiment C.1
This experiment tests the examples C.1 to C.11.
1.1 Example C.1
Example C.1 uses direct addressing mode to move a 23-bit constant (0x30100) to the extended data-page
pointer register (XDP). This operation is done separately by moving the upper 7 bits (0x3) to DPH and
the lower 16 bits (0x0100) to DP. Figures 1 and 2 present the data stored in the XDP register before and
after the operation.
Figure 1: Example C.1 - Before Operation
Figure 2: Example C.1 - After Operation
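The split of the 23-bit constant into its DPH and DP parts can be checked with a small host-side sketch. The helper names xdp_high and xdp_low are hypothetical and only mirror the bit fields described in the text.

```c
#include <stdint.h>

/* Hypothetical helpers: split a 23-bit data address into the two
 * fields of XDP, DPH (upper 7 bits) and DP (lower 16 bits). */
uint16_t xdp_high(uint32_t addr)
{
    return (uint16_t)((addr >> 16) & 0x7F);
}

uint16_t xdp_low(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFF);
}
```

For the constant 0x30100 used in the example, these helpers give 0x3 for DPH and 0x0100 for DP.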
1.2 Example C.2
Example C.2 uses indirect addressing mode to move the data in the address pointed by the auxiliary
register AR0 to the 40-bit accumulator register AC0. Figure 3 presents the address the auxiliary register
AR0 points to (0x0100) and the value at AC0 before the operation (0x000FAB8678), while Figure 4 presents
the accumulator value after the operation (0x12AB).
1.3 Example C.3
Example C.3 uses dual AR-indirect mode to load the 40-bit accumulator AC0 with the data at the addresses
pointed to by the auxiliary registers AR2 and AR3. The 16-bit word at the address pointed to by AR3 is extended
to 24 bits and stored in the most-significant 24 bits of AC0, while the 16-bit word at the address pointed
to by AR2 is stored in the 16 least-significant bits of AC0. Additionally, the operation
mov *AR2+, *AR3-, AC0
post-increments AR2 and post-decrements AR3, which is useful for sequential
indirect memory access. The value in AC0 before and after the operation is presented in Figures 5 and 6,
together with the values of the auxiliary registers AR2 and AR3.
1.4 Example C.4
Example C.4 uses indirect addressing through the 23-bit Coefficient Data Pointer register (CDP). It pre-increments
the CDP register value by 2, and then moves the coefficient at the address pointed to by the incremented
CDP value to the 40-bit accumulator AC3. Figures 7 and 8 show the values of CDP and AC3 before and
after the operation.
1.5 Example C.5
Example C.5 demonstrates the use of absolute addressing mode, which is expressed as *(address) in the
instruction. It moves the value at address 0x011234 (1st page, offset 0x1234) to the temporary 16-bit
register T2. Figures 9 and 10 present the value of T2 before and after the operation.
1.6 Example C.6
Example C.6 demonstrates how to use direct-addressing mode on an MMR (memory-mapped register).
The following operation is executed:
mov mmap(@AC0L), T2
It moves the value held in the least-significant 16 bits of AC0 (namely AC0L, accessed as a memory-mapped
register since @ is used) to the temporary register T2. Figures 11 and 12 present the values of T2 and AC0
before and after the operation.
1.7 Example C.7
Example C.7 uses the btstp mnemonic, which takes two operands: the bit offset (starting at 0 from the
LSB) and the source register. It tests two consecutive bits (at offset and offset+1) in the source register
and copies their status to the TC1 and TC2 bits in the status register ST0, which occupy bits 13 and 12,
respectively. In the example, since bits 28 and 29 in AC0 are 1, bits 12 and 13 in ST0 are set.
1.8 Example C.8
Example C.8 demonstrates the use of the arithmetic instruction mpym, which multiplies the operands and
stores the result in an accumulator. The instruction is as follows:
mpym *AR0+, *CDP-, AC0
Thus, the content at the address pointed to by AR0 (0x12AB or 4,779) is multiplied by the content at the address
pointed to by CDP (0x5631 or 22,065). The result (0x64904BB or 105,448,635) is stored in the 40-bit
accumulator AC0. After this, the AR0 and CDP values are incremented and decremented, respectively.
Figures 13 and 14 present the values of AR0, CDP and AC0 before and after the operation.
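The product quoted for this example can be cross-checked on a host with a few lines of C. The helper name mpy16 is hypothetical; it only widens both operands to 32 bits before multiplying and does not model the C55x pipeline or the 40-bit accumulator.

```c
#include <stdint.h>

/* Signed 16 x 16 -> 32-bit product, widened before multiplying so
 * that no bits are lost (hypothetical helper, plain C). */
int32_t mpy16(int16_t x, int16_t y)
{
    return (int32_t)x * (int32_t)y;
}
```

Calling mpy16(0x12AB, 0x5631) returns 105,448,635, that is 0x64904BB, in agreement with the value observed in AC0.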
1.9 Example C.9
Example C.9 demonstrates the use of a multiply and accumulate (MAC) instruction:
macmr40 T3=*AR1+, *AR2+, AC3
The content at the address pointed to by AR1 is multiplied by the content at the address pointed to by AR2 and
added to the content of accumulator AC3. Additionally, the temporary register T3 is loaded with the value pointed
to by AR1, and 40 bits are used as the overflow limit. If an overflow occurred, AC3 would be saturated.
Finally, the AR1 and AR2 values are incremented. Figure 15 presents the values of AR1, AR2, T3 and
AC3 after the operation.
1.10 Example C.10
Example C.10 demonstrates the use of the bit manipulation instruction bit clear:
bclr #11, ST0_55
This instruction clears bit 11 of the status register ST0. Figures 16 and 17 present the value of ST0 before
and after the instruction.
1.11 Example C.11
Example C.11 demonstrates the use of a conditional execution:
xcc label,TC1
mov *AR1+,AC0
label
If the TC1 bit is 1, the value at the address to which AR1 points is loaded into AC0 and AR1 is incremented.
Otherwise, neither the load nor the increment takes place.
1.12 Example C.12
Example C.12 demonstrates the use of a partial conditional execution:
xccpart label,TC1
mov *AR1+,AC0
label
The difference with respect to xcc is that the read operation (indirect-addressing in the operand) is
executed, independently of the condition in bit TC1 from the status register. Thus, since the read also
post-increments AR1, AR1 is incremented unconditionally. The mov operation, however, is only executed
when the condition is satisfied.
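The difference between the two forms can be pictured with a simplified C model. This is only a sketch: the function names xcc_mov and xccpart_mov are hypothetical, AR1 is modeled as a C pointer, and the 40-bit accumulator is reduced to a 32-bit integer.

```c
#include <stdint.h>

/* Model of "xcc label,TC1 / mov *AR1+,AC0": the load and the
 * post-increment happen only when the condition holds. */
void xcc_mov(int tc1, const int16_t **ar1, int32_t *ac0)
{
    if (tc1) {
        *ac0 = **ar1;   /* load the (sign-extended) word */
        (*ar1)++;       /* post-increment AR1            */
    }
}

/* Model of "xccpart label,TC1 / mov *AR1+,AC0": the read and the
 * pointer update always happen; only the write to AC0 is conditional. */
void xccpart_mov(int tc1, const int16_t **ar1, int32_t *ac0)
{
    int16_t value = **ar1;  /* memory read, unconditional  */
    (*ar1)++;               /* AR1 update, unconditional   */
    if (tc1) {
        *ac0 = value;       /* register write, conditional */
    }
}
```

With the condition false, xcc_mov leaves both the pointer and the accumulator untouched, whereas xccpart_mov still advances the pointer.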
2 Experiment C.2
Experiment C.2 demonstrates the use of assembly code for C55xx.
Initially, in the main assembly code "assembly.asm", two assembly constants are defined:
N .set 128
stk_size .set 0x100
Then, three arrays (Xin, Xout and Spectrum) are allocated in two different sections: one for input data
(.in_data) and another for output data (.out_data).
_Xin .usect ".in_data",(2*N) ; Input data array
_Xout .usect ".out_data",(2*N) ; Output data array
_Spectrum .usect ".out_data",N ; Data spectrum array
Next, the current section from the assembler perspective is changed to the .data section
and an external include file (input.inc) is read through the directive .copy.
.sect .data
input .copy input.inc ; Copy input.inc into program
Since the input.inc file is as follows:
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
.word 0,20543,16781,6203,9830,11961,-2879,-13019
.word 0,13019,2879,-11961,-9830,-6203,-16781,-20543
the .word directive places the consecutive 16-bit words into consecutive memory addresses. Thus, 128
words are stored through this directive. The address of the first word is associated with the label "input".
Next, the start function is defined as global (through the .def directive). Similarly, the arrays Xin, Xout
and Spectrum are made global. Additionally, dft_128 and mag_128, defined in the external assembly files
dft_128.asm and mag_128.asm, are referenced using the .ref directive.
.def _start ; Define this program entry point
.def _Xin,_Xout,_Spectrum ; Make these data global data
.ref _dft_128,_mag_128 ; Reference external functions
After this initialization, the current section from the assembler perspective is switched to
.text. Then, the global function "start" is defined:
.sect .text
_start
bset SATD ; Set up saturation for D unit
bset SATA ; Set up saturation for A unit
bset SXMD ; Set up sign extension mode
mov #N-1,BRC0 ; Init counter for loop N times
amov #input,XAR0 ; Input data array pointer
amov #_Xin,XAR1 ; Xin array pointer
rptblocal complex_data_loop-1 ; Form complex data
mov *AR0+,*AR1+
mov #0,*AR1+
complex_data_loop
amov #_Xin,XAR0 ; Xin array pointer
amov #_Xout,XAR1 ; Xout array pointer
call _dft_128 ; Perform 128-point DFT
amov #_Xout,XAR0 ; Xout pointer
amov #_Spectrum,XAR1 ; Spectrum array pointer
call _mag_128 ; Compute squared-magnitude response
ret
The address of the input array (with 128 16-bit words) is loaded into the extended auxiliary register
XAR0. Similarly, the address of Xin is loaded into XAR1. Then, the loop with two instructions is placed
in the instruction buffer queue (IBQ) of the Instruction Unit (IU). This loop sequentially moves the data
from the input buffer to the array Xin. After each 16-bit word loaded into Xin, a 0 is stored. This way, the
256 16-bit words in Xin are filled alternately with the 16-bit words loaded from "input.inc" and zeros.
Since the block-repeat counter BRC0 is loaded with the value N - 1 (127), the loop is executed 128 times.
Figure 18 presents the input buffer and Figure 19 presents the Xin buffer.
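For readers more comfortable with C, the block-repeat loop can be pictured as the loop below. This is only an equivalent sketch, not code from the experiment; the function name form_complex_data is hypothetical.

```c
#include <stdint.h>

#define N 128  /* number of input samples, as in assembly.asm */

/* Hypothetical C equivalent of complex_data_loop: copy each real
 * sample and append a zero imaginary part. */
void form_complex_data(const int16_t input[N], int16_t Xin[2 * N])
{
    int i;

    for (i = 0; i < N; i++) {
        Xin[2 * i]     = input[i];  /* real part      */
        Xin[2 * i + 1] = 0;         /* imaginary part */
    }
}
```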
After the loop is executed, the auxiliary registers AR0 and AR1 are reloaded with the addresses of Xin
and Xout, such that the function _dft_128 can calculate the DFT of the input and store the result in the
output buffer. Similarly, after this, AR0 and AR1 are loaded with the addresses of Xout (which contains
the DFT) and Spectrum. Then, the function _mag_128 is called and the squared magnitude of the spectrum
(the DFT) is stored in the Spectrum buffer.
Figure 18: Experiment C.2: Input Buffer
Figure 19: Experiment C.2: Xin buffer

2.1 What is the difference between the CCS Step Into and Step Over operations?
How does the CCS Step Return operation work?
Step over executes the line completely and jumps to the next line. In case the line contains a call to a
function or subroutine, step over executes the entire function/subroutine and jumps to the next line. In
contrast, step into executes the line but goes into the code called by this line. Hence, in case there is a
call to a function or subroutine, step into executes the current line and jumps into the next line inside the
function or subroutine code.
Step return executes the rest of the current function (from the point where the debugger currently is),
until it returns to the caller. It works as if the rest of the function was one single step.
2.2 How can a data section be defined in the assembly program and linker command
file?
The linker command file must determine where in physical memory each section is
going to be located. So, in the case of:
_Xin .usect ".in_data",(2*N)
the sections .in_data and .out_data can be assigned to SARAM (single-access RAM) by including the following
snippet in the linker command file:
.in_data >> SARAM3 /* User vars */
.out_data >> SARAM3 /* User vars */
After the memory section is defined, it can be used in the assembly code, for example by the allocation
provided by:
_Xin .usect ".in_data",(2*N) ; Input data array
_Xout .usect ".out_data",(2*N) ; Output data array
_Spectrum .usect ".out_data",N ; Data spectrum array
3 Experiment C.3
Experiment C.3 demonstrates the correct way of multiplying integers in C. Since the accumulators in the
Arithmetic Unit (AU) contain 40 bits, it is important to ensure the compiler uses those 40 bits to store
the result of a multiplication.
On the C55x, where int is 16 bits wide, one operand of a 16-bit multiplication must be cast to 32 bits in
order for the compiler to understand that the 32-bit result is desired.
The disassembly for the first method of multiplication is the following:
23 c1 = a * b;
0138f5: a531002abd MOV *(#02abdh),T1
0138fa: d33105002abc MPYM *(#02abch),T1,AC0
013900: a010_98 MOV mmap(@AC0L),AC0
013903: eb3108002abe MOV AC0,dbl(*(#02abeh))
Note that the MPYM assembly instruction multiplies two 16-bit words (a and b) and stores the result in AC0,
which is sign-extended to 40 bits. However, only the 16 least significant bits are stored at c1. This is
because the multiplication, without proper casting, was interpreted by the compiler as returning a 16-bit
result.
The second method of multiplication has the following assembly implementation:
25 c2 = (Int32)(a * b);
013909: d33105002abc MPYM *(#02abch),T1,AC0
01390f: a010_98 MOV mmap(@AC0L),AC0
013912: eb3108002ac0 MOV AC0,dbl(*(#02ac0h))
Again, only the 16 least significant bits of the product were stored at variable c2. This occurred
because the parentheses around the multiplication gave it precedence and the compiler interpreted the
multiplication without considering the type cast. Thus, it interpreted the multiplication as one that
produces a 16-bit result.
Finally, the third method of multiplication has the following assembly:
27 c3 = (Int32)a * b;
013918: d33105002abc MPYM *(#02abch),T1,AC0
01391e: eb3108002ac2 MOV AC0,dbl(*(#02ac2h))
This is the correct way of multiplying two ints for C55xx. Note that the generated code moves the 32 bits in AC0
to the address corresponding to variable c3.
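The three variants can be imitated on a host compiler with the sketch below. On most hosts int is 32 bits wide, so a * b would not be truncated there; the sketch therefore emulates the 16-bit behaviour explicitly. The helper names and the sample operands (200 and 300) are hypothetical and are not the values used in the experiment.

```c
#include <stdint.h>

/* Reinterpret the low 16 bits of v as a signed 16-bit value,
 * mimicking a product that is kept in a 16-bit int. */
static int32_t low16_signed(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (u & 0x8000u) ? (int32_t)u - 65536 : (int32_t)u;
}

/* Models c1 = a * b and c2 = (Int32)(a * b): the product is
 * truncated to 16 bits before it is widened. */
int32_t mul_truncated(int16_t a, int16_t b)
{
    return low16_signed((int32_t)a * b);
}

/* Models c3 = (Int32)a * b: one operand is widened first, so the
 * full 32-bit product is kept. */
int32_t mul_widened(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}
```

For a = 200 and b = 300, mul_widened returns 60000, while mul_truncated returns -5536, the value obtained by keeping only the low 16 bits (0xEA60) of the product.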
4 Experiment C.4
Experiment C.4 demonstrates the possibility of writing significantly more efficient loops using assembly
code. It also demonstrates the use of the clock counting tool in CCS. Figures 20 and 21 present the
count of clock cycles for an inefficient loop in C code and the total clock cycles for the same loop written
efficiently in assembly code, respectively.
Figure 20: Experiment C.4: Inefficient loop in C code
5 Experiment C.5
Similar to Experiment C.4, Experiment C.5 demonstrates the efficient implementation of a modulo
operation for C55xx in C code. The number of clock cycles required by each of the three simulated ways of
computing the modulo of a number is evaluated. The first method uses the support library operator %
for the modulo operation. The second method uses the same operator, but with a number that is a power of
2. Finally, the third method uses the bitwise AND operator, for which there is an assembly instruction that
executes in 1 cycle. Figures 22, 23 and 24 present the count of clock cycles for each of the three methods.
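The equivalence exploited by the third method can be sketched as follows. The function names are hypothetical, and unsigned operands are assumed so that % and AND agree for every input.

```c
#include <stdint.h>

/* Modulo through the % operator (works for any nonzero divisor). */
uint16_t mod_generic(uint16_t x, uint16_t n)
{
    return (uint16_t)(x % n);
}

/* Modulo through bitwise AND: valid only when n is a power of 2. */
uint16_t mod_pow2(uint16_t x, uint16_t n)
{
    return (uint16_t)(x & (n - 1u));
}
```

For example, 13 modulo 8 is 5 with either function, since 13 & 7 keeps only the three low bits.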
Figure 21: Experiment C.4: Efficient loop written in assembly
Figure 22: Experiment C.5: Modulo operation method 1, using support library
Figure 23: Experiment C.5: Modulo operation method 2, using support library on a power of 2 number
Figure 24: Experiment C.5: Modulo operation method 3, using AND
6 Experiment C.6
Experiment C.6 demonstrates the use of mixed C and Assembly programming. Two assembly functions
are used: “findMax” and “arraySort”. The first is implemented as follows:
; Function prototype:
; short findMax(short *p, short n);
;
; Entry: AR0 is the pointer of p and T0 contains length n
; Exit: T0 contains the maximum data value found
;
.def _findMax ; Using "_" prefix for C-callable
.text
_findMax:
sub #2,T0
mov T0,BRC0 ; Set up loop counter
mov *AR0+,T0 ; Place the first data in T0
|| rptblocal loop-1 ; Loop through entire array
mov *AR0+,AR1 ; Place the next data in AR1
cmp T0<AR1,TC1 ; Check to see if new data is greater
|| nop ; than the maximum?
xccpart TC1 ; If find new maximum, place it in T0
|| mov AR1,T0
loop
ret ; Return with maximum data in T0
.end
Since findMax is going to be global (used externally by the C code), the .def directive precedes its
definition. The function takes two parameters, a pointer to an Int16 and an Int16 value. Thus, from the
register ordering for the C55x family, it is known that the pointer is passed in AR0 and the
value is passed in T0. Given this, the routine first subtracts 2 from the array length in T0 and stores the
result in the block-repeat counter BRC0, since the first element is read before the loop and a block repeat
executes BRC0 + 1 times. It then stores the first value in the array (pointed to by the function argument) in
T0, iterates over the remaining values in the array and checks if a higher value was found. If so, the value
stored in T0 is updated. In the end, the value in T0 is returned.
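A plain C reference with the same behaviour is sketched below, which can be handy for checking the assembly routine against known inputs. The name findMax_ref is hypothetical. The assembly version assumes an array of at least two elements; the C version also handles a single element.

```c
/* Plain C reference for the assembly routine (hypothetical name). */
short findMax_ref(const short *p, short n)
{
    short max = *p++;          /* first element, read before the loop */
    short i;

    for (i = 0; i < n - 1; i++) {
        short next = *p++;     /* next element, like mov *AR0+,AR1    */
        if (max < next) {      /* same test as cmp T0<AR1,TC1         */
            max = next;
        }
    }
    return max;
}
```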
Function findMax is used by the C program. In contrast, function arraySort, is both used by the C
program and uses functions defined in the C program. Its implementation is as follows:
;
;
.def _arraySort ; Using "_" prefix for C-callable
.ref _sort
.ref _a, _b
.text
_arraySort:
amov #_a,XAR0 ; AR0 is the pointer to a[]
amov #_b,XAR1 ; AR1 is the pointer to b[]
mov #8,T0 ; T0 is the length
call _sort ; Sort a[] and reorder in b[]
ret
.end
The function is hard coded to use the addresses of the arrays “a” and “b”, which are externally defined
(since the .ref directive is used). The addresses of “a” and “b” and the array length of 8 are loaded
into XAR0, XAR1 and T0, respectively. After this, the externally defined function “sort”
(in C code) is called. Since the standard order for a C program to load function arguments mandates
that a function with two Int16 pointer arguments and one Int16 value argument receives them in AR0,
AR1 and T0, the C function "sort" finds its arguments and proceeds with the array sorting. In the
end, array “b” is a sorted version of array “a”. Finally, since the memory content itself is altered by these
functions, it is not necessary to have any return value.
7 Experiment C.7
Experiment C.7 demonstrates the use of the audio codec chip AIC3204. In the example program, the
same tone with frequency of 1 kHz is played for different sampling frequencies. The approach is to store
a period of the tone to be played in a buffer of 48 signed 16-bit int samples. Then, a new buffer with 48
samples is formed from this period depending on the sampling frequency. If a sampling frequency of 48
kHz is adopted, for example, the buffer of 48 samples corresponds to a buffering of 1 ms, which is enough
to contain a complete period of the tone (1 ms, since the tone frequency is 1 kHz). If a sampling frequency
of 16 kHz is adopted, the buffer of 48 samples corresponds to a buffering of 3 ms of the signal. Thus,
since the signal period is 1 ms, 3 complete periods are contained in the buffer (dataTable). Similarly, 6
complete periods are buffered for fs = 8 kHz, 4 periods for fs = 12 kHz and 2 for fs = 24 kHz.
This is exactly what is done in the following loop:
for (n=k=0, i=0; i<m; i++) // Fill in the data table
{
for (j=k; j<SIZE; j+=m)
{
dataTable[n++] = table[j];
}
k++;
}
In this code, m is the number of periods contained in the 48-sample buffer. For fs = 8 kHz, for example,
m = 6. In this case, the loop fills the dataTable buffer as follows:
Iteration #1: dataTable[0] = table[0]
. . .
and so on, until the buffer is full.
This corresponds to filling dataTable with 6 periods from the waveform stored in buffer table. Note,
however, that the 6 periods are not exactly equal to each other. The first period, for example, is composed of
samples 0, 6, 12, ..., 42 from table, while the second period is composed of samples 1, 7, 13, ..., 43.
Figures 25, 26 and 27 present the waveform stored in the 48-sample buffer for frequencies of 8 kHz,
16 kHz and 48 kHz, respectively.

Experiment 2.6.1
1 Experiment 2.6.1
Experiment 2.6.1 demonstrates the occurrence of overflow and the use of overflow protection (saturation
mode), which is activated by the status bit SATD. If SATD = 1, when an overflow is detected, the
accumulator holding the result of the operation is set to the saturated value of 00 7FFF FFFFh (positive
overflow) or FF 8000 0000h (negative overflow).
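The saturation rule can be written down as a small host-side model. This is a sketch under the assumption that the 40-bit accumulator is held in a 64-bit integer; the function name saturate32 is hypothetical.

```c
#include <stdint.h>

/* Saturate a 40-bit accumulator value (held in an int64_t) to the
 * 32-bit range used when SATD = 1. Hypothetical helper. */
int64_t saturate32(int64_t acc)
{
    if (acc > INT64_C(0x7FFFFFFF)) {
        return INT64_C(0x7FFFFFFF);     /* 00 7FFF FFFFh            */
    }
    if (acc < -INT64_C(0x80000000)) {
        return -INT64_C(0x80000000);    /* FF 8000 0000h in 40 bits */
    }
    return acc;
}
```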
In this experiment, an assembly program (ovf_sat.asm) is defined to generate and verify overflow.
This program is called by a test function written in C (overflowTest.c).
The assembly program starts with the following block of code, which declares the function “ovftest”
(called externally by the C program) and two buffers with a size of 256 (0x100) words. The function
tests the occurrence of overflow. The first buffer is used to store values progressively
accumulated (positively and negatively) in an iterative manner, as discussed below, while
the second buffer is used to store a sinusoidal sequence.
.def _ovftest
.def _buff,_buff1
.bss _buff,(0x100)
.bss _buff1,(0x100)
At the beginning of the program, the overflow flag passed to the “ovftest” function is tested. If it is
nonzero, the status bit SATD is set. This is what is done in the following block of code. Note that,
since the flag is 16-bit data, it is passed to the assembly function in register T0, according to the register
precedence rules for C55xx.
;
; Code start
;
_ovftest
bclr SATD ; Clear saturation bit if set
xcc start,T0!=#0 ; If T0!=0, set saturation bit
bset SATD
After this, the two buffers (buff and buff1) are entirely cleared (set to 0), as implemented by the code
below. Note that, before proceeding, the content of the extended auxiliary register XAR5 is pushed to the
stack in order to save it for later restoration.
start
pshboth XAR5 ; Save XAR5
nop
nop
mov #0,AC0
amov #_buff,XAR5 ; Set buffer pointer
rpt #0x100-1 ; Clear buff
mov AC0,*AR5+
amov #_buff1,XAR5 ; Set buffer pointer
rpt #0x100-1 ; Clear buff1
mov AC0,*AR5+
Next, the buffer “buff” is filled with values that increase by 0x140 (320 in decimal) in each of
the 128 iterations, starting at an offset of 128 (with respect to the 256 words in the buffer). These values
are generated in the upper 16 bits of accumulator AC0 (bits 16 to 31). Thus, since a signed 16-bit int has a
range from -32,768 to 32,767, iteration 103 is already sufficient to surpass the
upper limit (since 103 × 320 = 32,960). In this case, if saturation mode is not active, bit 31 of AC0 becomes 1
while the 8 guard bits (bits 32 to 39) remain 0, so the 16 bits in positions 16 to 31 read as a negative value.
Otherwise, AC0 is saturated to 00 7FFF FFFFh, so its upper 16 bits hold 7FFFh.
mov #0x80-1,BRC0 ; Initialize loop counts for addition
amov #_buff+0x80,XAR5 ; Initialize buffer pointer
rptblocal add_loop_end-1
add #0x140<<#16,AC0 ; Use upper AC0 as a ramp up counter
mov hi(AC0),*AR5+ ; Save the counter to buffer
add_loop_end
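The iteration at which the ramp leaves the signed 16-bit range, and the value seen when it wraps, can be checked with a short host-side sketch. Both helper names are hypothetical, and the ramp is modeled as a plain integer rather than as the upper half of AC0.

```c
#include <stdint.h>

/* Return the first iteration (counting from 1) at which a ramp that
 * grows by `step` per iteration no longer fits in a signed 16-bit
 * word. Hypothetical helper; step must be positive. */
int first_overflow_iteration(int32_t step)
{
    int32_t value = 0;
    int iteration = 0;

    do {
        value += step;
        iteration++;
    } while (value <= INT16_MAX);

    return iteration;
}

/* Value seen in a 16-bit word when the ramp wraps instead of saturating. */
int32_t wrapped16(int32_t value)
{
    uint32_t u = (uint32_t)value & 0xFFFFu;
    return (u & 0x8000u) ? (int32_t)u - 65536 : (int32_t)u;
}
```

For a step of 0x140 the first function returns 103, and wrapped16(103 * 320) gives -32576, the negative value stored when saturation mode is off.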
Similarly to the 128 iterations previously described, the code below implements another 128 iterations.
This time, starting from an offset of 127, the remaining 128 words in buff are filled backwards (from offset
127 to offset 0) with progressively accumulated negative values (0x140 is subtracted from the accumulator
in each iteration and the result is stored to the current memory position). Again, if SATD is enabled, the values
from around offset 24 (127 - 103 = 24) down to offset 0 are negatively saturated (AC0 = FF 8000 0000h).
Otherwise, they wrap to positive values.
mov #0x80-1,BRC0 ; Initialize loop counts for subtraction
mov #0,AC0
amov #_buff+0x7f,XAR5 ; Initialize buffer pointer
rptblocal sub_loop_end-1
sub #0x140<<#16,AC0 ; Use upper AC0 as a ramp down counter
mov hi(AC0),*AR5- ; Save the counter to buffer
sub_loop_end
At this point, the work with “buff” is done and the buffer “buff1” is used. In the code below, circular
addressing for a buffer of 40 words is configured and a loop with 256 iterations is prepared by setting the
block-repeat counter BRC0 to 255. The circular buffer is located at the address given by a pointer passed as an
argument to the function “ovftest” in the C code. This pointer is passed in register AR0, according to the
register rules for C55.
mov #0x100-1,BRC0 ; Initialize loop counts for sinewave
amov #_buff1,XAR5 ; Initialize buffer pointer
mov mmap(@AR0),BSA01 ; Initialize base register
mov #40,BK03 ; Set buffer to size 40
mov #20,AR0 ; Start with offset of 20 samples
bset AR0LC ; Activate circular buffer
In each iteration of this loop, a sine value from a predefined buffer of length 40 is circularly fetched,
scaled and copied to the current position (among the 256) in buffer buff1.
rptblocal sine_loop_end-1
mov *ar0+<<#16,AC0 ; Get sine value into high AC0
sfts AC0,#9 ; Scale the sine value
mov hi(AC0),*AR5+ ; Save scaled value
sine_loop_end
In the end, the occurrence of overflow is tested and the result is returned via register T0, as mandated
by the C55x register conventions for assembly functions called from C.
mov #0,T0 ; Return 0 if no overflow
xcc set_ovf_flag,overflow(AC0)
mov #1,T0 ; Return 1 if overflow detected
set_ovf_flag
Note, before finishing the function execution, the context previously existent in AR5 is restored by
popping it from the stack with command popboth.
bclr AR0LC ; Reset circular buffer bit
bclr SATD ; Reset saturation bit
popboth XAR5 ; Restore AR5
ret
.end
Figures 1 and 2 show the results for the progressively accumulated buffer buff without and with
overflow protection (saturation mode), respectively. Similarly, Figures 3 and 4 present the results for the
sine wave stored in buffer buff1 without and with overflow protection, respectively.
Figure 1: Overflow in buff
Figure 2: Overflow protection in buff
Figure 3: Overflow in buff1
Figure 4: Overflow protection in buff1
Experiment 2.6.2
1 Experiment 2.6.2
1.1 Implementation using Floating-Point C
In this part of the experiment, two different functions for computing the cosine of a number are measured
in terms of clock cycles. The first (fcos1) uses 12 multiplications, while the second (fcos2) uses 4
multiplications to implement the computation. The measurements are presented in Figures 1, 2, 3, and 4.
Note fcos1 takes 946 clock cycles for computing cos(0) and 3,011 cycles for the computation of a cosine
with non-zero argument (cos(x)). In contrast, fcos2 takes 436 clock cycles for computing cos(0) and 1,108
clock cycles for a non-zero argument.
Figure 1: fcos1 cos(0) computation
Figure 2: fcos1 cos(x) computation
Both computations lead to very close results. The only difference in value is noted for y[4], although
nearly insignificant. The values computed by fcos1 and fcos2 are shown in Figures 5 and 6, respectively.
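The multiplication counts quoted above are consistent with a sixth-order Taylor series of the cosine written in two ways. The sketch below is an assumption about how such functions could look, not the code measured in the experiment; the names cosDirect and cosNested are hypothetical. The direct form spends 2 + 4 + 6 = 12 multiplications, while the nested form in x*x needs only 4.

```c
/* Direct form: 2 + 4 + 6 = 12 multiplications. */
float cosDirect(float x)
{
    return 1.0f - 0.5f*x*x + 0.04166667f*x*x*x*x
                - 0.001388889f*x*x*x*x*x*x;
}

/* Nested (Horner) form in x^2: 1 multiplication for x2, 3 for the nesting. */
float cosNested(float x)
{
    float x2 = x * x;
    return 1.0f + x2*(-0.5f + x2*(0.04166667f - 0.001388889f*x2));
}
```

Both functions evaluate the same polynomial, so any difference between their outputs comes only from floating-point rounding, which matches the near-identical values reported for fcos1 and fcos2.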
1.2 Implementation using Fixed-Point C
Part B of the experiment demonstrates the fixed-point implementation iCos1 for the computation of the
cosine. Using this implementation, the measured number of clock cycles is significantly lower than the
one demanded by the floating-point implementation. For both the zero and non-zero cosine argument, the
fixed-point implementation demands 71 clock cycles, as illustrated in Figures 7 and 8.
Figure 3: fcos2 cos(0) computation
Figure 4: fcos2 cos(x) computation
Figure 5: fcos1 values
Figure 6: fcos2 values
Figure 7: iCos1 cos(0) computation
The values obtained by the fixed-point implementation are presented in Figure 9.
Figure 8: iCos1 cos(x) computation
Figure 9: iCos1 values
Another function for calculating the cosine (iCos) in a manner that simulates the assembly implemen-
tation is also measured in this part of the experiment. This function takes 82 clock cycles, as illustrated
in Figure 10. The values obtained are presented in Figure 11.
Figure 10: iCos cos(x) computation
Figure 11: iCos values
1.3 Implementation Using C55xx Assembly Program
Part C of the experiment evaluates the number of clock cycles for the same cosine function in an assembly
implementation (function cosine). As Figure 12 shows, the function takes 32 clock cycles to compute the
value, which is a significant improvement with respect to the original floating-point implementation. The
values obtained through this method are shown in Figure 13.
Figure 12: cosine cos(x) computation
Figure 13: cosine values
Table 1.3 summarizes the values and number of clock cycles obtained for each of the five different
implementations (fcos1, fcos2, icos1, icos and cosine).
Function        fcos1          fcos2          icos1          icos           cosine
Clock cycles    3,011          1,108          71             82             32
Return values   1.0            1.0            1.0            1.0            1.0
                0.8660253      0.8660253      0.8660542      0.8660542      0.8660542
                0.7071033      0.7071033      0.7071444      0.7071444      0.7071444
                0.4999647      0.4999647      0.5000153      0.5000153      0.5000153
                0.0008943696   0.0008944273   0.0007324442   0.0007324442   0.0007324442
1.4 Practical Applications
Part D of the experiment demonstrates some function approximations that can be useful in practical
DSP applications. A procedure to map the angles to the first quadrant (where both the cosine and
sine are positive) is adopted to reduce the algorithm complexity for the cosine and sine computations.
Additionally, three other function approximations are presented: $\sqrt{x}$, $1/\sqrt{x}$ and
$\tan^{-1}(x)$. Except for $\sqrt{x}$, the approximations presented in the book are not very close to the
actual function value in the range between 0.5 and 1.
For $1/\sqrt{x}$, a better Taylor series expansion is obtained in MATLAB with:
syms x
inv_sqrt_approx = taylor(1/sqrt(x), x, 'ExpansionPoint', 1, 'Order', 6);
inv_sqrt_coeffs = sym2poly(inv_sqrt_approx);
This leads to the following equation:
$$\frac{1}{\sqrt{x}} \approx -0.2461x^5 + 1.5039x^4 - 3.8672x^3 + 5.4141x^2 - 4.5117x + 2.7070 \tag{1}$$
For $\tan^{-1}(x)$, a better Taylor series expansion is obtained in MATLAB with:
syms x
atan_approx = taylor(atan(x), x, 'ExpansionPoint', 1, 'Order', 6);
atan_coeffs = sym2poly(atan_approx);
$$\tan^{-1}(x) \approx -0.0250x^5 + 0.1250x^4 - 0.1667x^3 - 0.2500x^2 + 1.1250x - 0.0229 \tag{2}$$
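Equations (1) and (2) can be checked numerically with a short C sketch that evaluates both polynomials by Horner's rule. The function names are hypothetical, and the coefficient signs are the ones that follow from expanding the Taylor series around x = 1; they are assumed here to match the MATLAB output.

```c
/* Evaluate c[0]*x^n + c[1]*x^(n-1) + ... + c[n] with Horner's rule. */
double polyval(const double *c, int n, double x)
{
    double y = c[0];
    for (int i = 1; i <= n; i++)
        y = y * x + c[i];
    return y;
}

/* Coefficients of Equations (1) and (2), highest power first. */
static const double invSqrtCoeffs[6] =
    { -0.2461, 1.5039, -3.8672, 5.4141, -4.5117, 2.7070 };
static const double atanCoeffs[6] =
    { -0.0250, 0.1250, -0.1667, -0.2500, 1.1250, -0.0229 };

double invSqrtApprox(double x) { return polyval(invSqrtCoeffs, 5, x); }
double atanApprox(double x)    { return polyval(atanCoeffs, 5, x); }
```

At the expansion point x = 1 both approximations reproduce the exact values (1 and π/4) up to the rounding of the printed coefficients; the error grows toward x = 0.5, where it is largest within the range of interest.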
Experiment 2.6.3
1 Experiment 2.6.3
1.1 Part A
Part A of the experiment explores the generation of a tone, a noise signal and the combination of both
into a noisy tone. The tone is configured by defining the DAC sampling frequency and the tone frequency
itself, inside the initialization function presented below:
void initFTone(Uint16 f, Uint16 Fs)
{
n = 0;
twoPI_f_Fs = 2.0*PI*(float)f/(float)Fs; // Define frequency
}
Meanwhile, the noise signal is initialized by setting an initial seed for the random number generator
through the initialization function below:
void initRand(Uint16 seed) // Random number initialization
{
srand(seed);
}
Then, two buffers are used alternately to store the signal values. While one buffer is being filled by the
generator (tone, noise or noisy tone), the other buffer is being read out via direct memory access (DMA)
for output by the DAC (AIC3204). Each buffer is filled by writing directly to its memory positions
the values returned by the tone function and the noise function, which are presented below:
Int16 fTone(Uint16 Fs) // Cosine generation
{
n++;
if (n >= Fs)
n=0;
return( (Int16)(cos(twoPI_f_Fs*(float)n)*UINTQ14));
}
Int16 randNoise(void) // Random number generation
{
return((rand()-RAND_MAX/2)>> 1);
}
The program allows configuration of the tone frequency in initFTone(800, SF) and the signal-to-noise
ratio by scaling the signal and noise values with a right or left shift (e.g. randNoise() >> 1).
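The scaling by shifts mentioned above can be sketched in C as follows. The helper mixSample is hypothetical and only illustrates the idea: each component is attenuated by a right shift (one shift step is roughly 6 dB) before the two are summed, and the sum is clamped so that it cannot wrap around.

```c
#include <stdint.h>

/* Mix one tone sample with one noise sample (hypothetical helper).
 * Each component is attenuated by a right shift before the sum;
 * >> on negative values is assumed to be an arithmetic shift. */
int16_t mixSample(int16_t tone, int16_t noise, int toneShift, int noiseShift)
{
    int32_t sum = (int32_t)(tone >> toneShift) + (int32_t)(noise >> noiseShift);
    if (sum > INT16_MAX) sum = INT16_MAX;   /* clamp instead of wrapping */
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}
```

For instance, mixSample(fTone(SF), randNoise(), 1, 2) would keep the tone about 6 dB above the setting obtained with equal shifts, under the assumption that both generators use the full 16-bit range.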
1.2 Part B
Part B of the experiment demonstrates the generation of a tone using fixed-point arithmetic. Similar to
part A, part B generates the sample for a tone with given frequency f through the function below:
Int16 tone(void)
{
Int16 theta;
theta = (Int16)(twoPI_f_Fs * n++);
return (cosine(theta));
}
The difference with respect to Part A is that the cosine function called here is implemented in assembly.
This implementation is similar to the one used in Experiment 2.6.2, but slightly more efficient. It
relies on the multiply-accumulate (mac) instruction, one of the advantages of the C55xx and other
DSPs over general-purpose microcontrollers.
For example, given the argument of the cosine is passed to the assembly function in register T0, the
following block of code computes one portion of the whole cosine Chebyshev expansion:
and #0x7fff,T0 ; Mask out sign bit
mov *AR0-<<#16,AC0 ; AC0 = d5
mov *AR0-<<#16,AC1 ; AC1 = d4
mac AC0,T0,AC1 ; AC1 = (d5*x+d4)
Note T0 (holding the argument x) is multiplied by AC0 (holding coefficient d5) and the product is
accumulated into AC1 (holding coefficient d4).
The tone frequency is easily configured in this experiment, just like Part A. It suffices to pass the
target frequency as the first argument in function initTone().
1.3 Part C
Part C demonstrates the generation of a random number without using the standard library ("stdlib.h")
used in part A of the experiment. The rand() command from stdlib demands a high number of instructions,
so that a lightweight random number generator is desirable. In this experiment, the random number is
generated from the recursive equation below:
$$x[n] = (a\,x[n-1] + b) \bmod M \tag{1}$$
where $M - 1$ is the maximum value that can be obtained.
The seed (x[0]) is defined in the function "initRand(Uint16 seed)" as 12357. Additionally, b is set
to zero, $M = 2^{20}$ and a is set to three different values. All three values yield the same noisy sound,
since the random nature of the signal is not altered.
In the experiment, two different implementations of Equation (1) are provided: randNumber1() and
randNumber2(). The former is more complex, taking around 4,240 clock cycles to generate one random
number, while the latter is much more efficient, taking around 500 clock cycles. randNumber1() is more
costly because it uses 2 multiplications and 2 divisions, while randNumber2() uses only 1 multiplication
and does the rest of the calculation using logical and bit-shift operations, which are less demanding.
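The reason a division-free version is possible can be sketched in C: when M is a power of two, the modulo in Equation (1) reduces to a bit mask. The code below is not the randNumber1()/randNumber2() code from the experiment; the function names are hypothetical and the multiplier a = 2045 is an assumed value, since the three multipliers used are not listed here.

```c
#include <stdint.h>

#define LCG_A 2045u            /* assumed multiplier, not taken from the report */
#define LCG_M (1u << 20)       /* M = 2^20, with b = 0 */

/* Next state using the modulo operator (implies a division). */
uint32_t lcgNextMod(uint32_t x)
{
    return (uint32_t)(LCG_A * x) % LCG_M;
}

/* Next state using a bit mask, valid because M is a power of two. */
uint32_t lcgNextMask(uint32_t x)
{
    return (uint32_t)(LCG_A * x) & (LCG_M - 1u);
}
```

Starting from the seed 12357, both functions generate the same sequence, and every value stays below M. The product a*x fits in 32 bits because x is always below 2^20 and a is below 2^11.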
1.4 Part D
Part D demonstrates the generation of a random signal (noise signal) using assembly code that implements
Equation (1). The signal sounds similar to the one generated in Part C.
1.5 Part E
Part E of the experiment demonstrates the combination of the previous parts into one single file: the
tone is generated using the C function that calls the assembly cosine implementation, while the random
noise is generated using the assembly random number generator. In comparison to the program in part A,
this is a much more efficient program. Moreover, similar to part A, in this program it is also possible to
adjust the tone frequency - passed as argument to initTone() - and the signal-to-noise ratio, by scaling
the tone or noise signal components.
Experiment 3.1
1 Experiment 3.1
This experiment implements FIR filtering using block processing and fixed-point arithmetic. The filter
design step is carefully analyzed and the performance of the real-time filtering in C55x is evaluated.
1.1 Equiripple linear-phase bandpass FIR filter
1.1.1 Filter Design
Initially, an equiripple linear-phase bandpass FIR filter is designed using the Parks-McClellan algorithm
provided by the MATLAB function firpm. The goal is to design a filter with the specifications summarized
in Table 1.
                  Frequencies (Hz)   Magnitude
Passband edges    [1600, 2000]       [1, 1]
Stopband edges    [1200, 2400]       [0, 0]
Table 1: Bandpass Filter Specifications
Considering the sampling frequency is 8000 Hz, the following MATLAB code implements the filter
design:
%% Filter design:
fs = 8000; % Sampling Frequency
F = [0 0.3 0.4 0.5 0.6 1]; % Frequency band edges
% Note: In the normalized frequency vector, 1 corresponds to the Nyquist
% frequency or half the sampling frequency
m = [0 0 1 1 0 0]; % Desired magnitude response
f = F*(fs/2) % Print Analog Frequencies
b = firpm(47, F, m); % Filter coefficients
%% Filter analysis
fvtool(b, 1)
The last line of the code opens a graphical user interface (GUI) for filter analysis, in which it is possible
to view the filter magnitude response (Fig. 1) and the phase response (Fig. 2). Note the phase response
is linear in the frequency interval corresponding to the passband, with some margin. Outside this interval
the plotted phase is discontinuous: a symmetric FIR filter has linear phase at all frequencies except for
jumps of π at the zeros of its amplitude response, and these zeros lie in the stopbands, where the signal is
highly attenuated anyway.
Figure 1: Equiripple Bandpass FIR Filter: Magnitude response (magnitude in dB versus frequency in kHz)
Figure 2: Equiripple Bandpass FIR Filter: Phase response (phase in radians versus frequency in kHz)
1.1.2 Filter real-time fixed-point implementation
The fixed-point implementation of the block FIR processing takes 7 arguments: a pointer to the input
samples, the block length (number of samples in a block), a pointer to the FIR coefficients, the FIR order,
a pointer to the output samples, the input circular buffer w and a pointer to the index inside the circular
buffer that points to the most recent sample (index). The function is presented below:
void fixedPointBlockFir(Int16 *x, Int16 blkSize,
Int16 *h, Int16 order,
Int16 *y,
Int16 *w, Int16 *index)
{
Int16 i,j,k;
Int32 sum;
Int16 *c;
k = *index;
for (j=0; j<blkSize; j++) // Block processing
{
w[k] = *x++; // Get the current data to delay line
c = h;
for (sum=0, i=0; i<order; i++) // FIR filter processing
{
sum += *c++ * (Int32)w[k++];
if (k == NUM_TAPS) // Simulate circular buffer
{
k = 0;
}
}
sum += 0x4000; // Rounding
*y++ = (Int16)(sum>>15); // Save filter output
if (k-- <=0) // Update index for next time
{
k = NUM_TAPS-1;
}
}
*index = k; // Update circular buffer index
}
Note, for example, the first call made by the program to this function passes index = 0 (set in the
initialization), meaning the most recent input sample inside the circular buffer is at position 0. The loop
then computes the convolution in terms of point-wise multiplication between the circular buffer and the
filter coefficients. Note, however, that while the filter coefficients are always at the same address, the
input samples are located in a circular buffer whose initial address moves counterclockwise for each sample
processed from the block of input samples. So, for example, the first output sample y[0] is the result of
the point-wise product between the coefficients and the samples in the buffer w starting from index k = 0
(w[0]). The second output sample y[1] is obtained by the point-wise product between the coefficients and
the samples in buffer w starting from k = 47 (buffer length is 48). At this point, w[47] has the most
recent sample, w[0] has the second most recent sample (same sample from the previous iteration) and the
remaining values are 0. For the third output sample (y[2]), w[46] has the most recent sample, w[47] has
the second most recent, w[0] has the third most recent and the remaining values in the circular buffer
are zero. This way, it is not necessary to shift all the samples in the buffer, but simply to store the new
upcoming input sample in a circular addressing manner.
The block processing is called for each block of 80 samples. Since the index of the starting position
inside the circular buffer and the circular buffer itself are preserved between calls (they are global memory
positions), each call of the block processing maintains the correct value of the convolution. For example,
after the first block of 80 samples is processed, the first iteration in the second call computes the
point-wise product between the filter coefficients and an array whose first element is the first element of
the new block and whose remaining elements come from the previous block (the first block).
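The state-preservation argument can be checked with a scaled-down C sketch of the same scheme. The function blockFirSketch and the 4-tap size are hypothetical choices made for illustration; the structure (delay line w, index kept between calls, rounding by 0x4000 and shift by 15) follows the function presented above.

```c
#include <stdint.h>

#define NUM_TAPS 4   /* hypothetical small filter, for illustration only */

void blockFirSketch(const int16_t *x, int blkSize, const int16_t *h,
                    int16_t *y, int16_t *w, int *index)
{
    int k = *index;
    for (int j = 0; j < blkSize; j++) {
        w[k] = x[j];                          /* newest sample into delay line */
        int32_t sum = 0;
        for (int i = 0; i < NUM_TAPS; i++) {
            sum += (int32_t)h[i] * w[k];
            if (++k == NUM_TAPS)              /* simulate circular addressing */
                k = 0;
        }
        sum += 0x4000;                        /* rounding */
        y[j] = (int16_t)(sum >> 15);          /* Q30 product back to Q15 */
        k = (k == 0) ? NUM_TAPS - 1 : k - 1;  /* next sample goes one slot back */
    }
    *index = k;                               /* state kept for the next call */
}
```

Feeding an impulse in a first block of 2 samples and zeros in a second block of 3 samples returns the scaled coefficients across both blocks, which only works because w and index survive between the two calls.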
1.1.3 Results
By listening to the input and output audio files, it is possible to note the tone with the intermediate
frequency (1800 Hz) is preserved, while the other two tones (800 and 3300 Hz) are filtered out, as expected.
This is better illustrated in the spectrograms of the input and output files, presented in Fig. 3
and Fig. 4, respectively. Note the attenuation of approximately 40 dB in the tones at 800 and 3300 Hz,
which conforms to the designed filter.
Figure 3: Spectrogram of the input signal (frequency in Hz versus time)
Figure 4: Bandpass filter output spectrogram (frequency in Hz versus time)
1.1.4 Complexity
The implementation of the FIR filtering using fixed-point arithmetic has a low complexity relative to a
floating-point counterpart. The main function is "fixedPointBlockFir", which filters a block of samples.
Inside this function, most statements take between 1 and 10 clock cycles. For example, two of the most
complex expressions in the loop are:
"sum += *c++ * (Int32)w[k++];"
"*y++ = (Int16)(sum>>15);"
The first takes 10 clock cycles to be evaluated, while the second takes 7 clock cycles. The entire block
processing takes 147,089 clock cycles, since it repeats these and many other operations 80 times. The loop
over the elements of the block accounts for 147,050 of these cycles.
1.2 Equiripple FIR Bandstop filter
In this part of the experiment, a bandstop linear-phase equiripple FIR filter is designed using FDAtool on
MATLAB. The filter specifications are summarized in Table 2.
                  Frequencies (Hz)   Attenuation (dB)
Passband edges    [1400, 2200]       [1, 1]
Stopband edges    [1700, 1900]       [50, 50]
Table 2: Bandstop Filter Specifications
In Fig. 5, the magnitude (blue line) and phase response (green line) of the filter designed in FDAtool
are overplotted. Similar to the bandpass filter, the phase response of this filter is linear in the passbands;
the phase jumps occur only at the amplitude-response zeros inside the stopband, where the signal is highly
attenuated.
Figure 5: Equiripple Bandstop FIR Filter: Magnitude and Phase response (magnitude in dB and phase in radians versus frequency in kHz)
In FDAtool, the filter coefficients can be quantized to Q15 format and exported to a C header:
const int16_T firCoefFixedPoint[49] = {
130, -115, 518, -304, -736, -206, 386, 132, -306,
-146, -56, -309, 189, 1001, -6, -1844, -717, 2448,
1866, -2524, -3167, 1904, 4191, -710, 28187, -710, 4191,
1904, -3167, -2524, 1866, 2448, -717, -1844, -6, 1001,
189, -309, -56, -146, -306, 132, 386, -206, -736,
-304, 518, -115, 130
};
Using this header in the previous C program, it is possible to evaluate the filter performance, as
described in Section 1.1.2.
1.2.1 Results
By listening to the input and the bandstop filter output, it is possible to note the tone with the intermediate
frequency (1800 Hz) is correctly filtered out. The other two tones (800 and 3300 Hz), on the other hand, are
preserved with no attenuation, as expected. This is better illustrated in the spectrogram of the output
file, presented in Fig. 6.
Figure 6: Bandstop filter output spectrogram (frequency in Hz versus time)
Experiment 3.2
1 Experiment 3.2
Experiment 3.2 demonstrates the block FIR filtering implemented in an assembly function. The
implementation is as follows:
;----------------------------------------------------------------------
; void blockFir(Int16 *x, => AR0
; Int16 blkSize, => T0
; Int16 *h, => AR1
; Int16 order, => T1
; Int16 *y, => AR2
; Int16 *w, => AR3
; Int16 *index) => AR4
;----------------------------------------------------------------------
_blockFir:
pshm ST1_55 ; Save ST1, ST2, and ST3
pshm ST2_55
pshm ST3_55
or #0x340,mmap(ST1_55); Set FRCT,SXMD,SATD
bset SMUL ; Set SMUL
mov mmap(AR1),BSA01 ; AR1=base address for coeff
mov mmap(T1),BK03 ; Set coefficient array size (order)
mov mmap(AR3),BSA23 ; AR3=base address for signal buffer
or #0xA,mmap(ST2_55) ; AR1 & AR3 as circular pointers
mov #0,AR1 ; Coefficient start from h[0]
mov *AR4,AR3 ; Signal buffer start from w[index]
|| sub #1,T0 ; T0=blkSize-1
mov T0,BRC0 ; Initialize outer loop to blkSize-1
sub #3,T1,T0 ; T0=order-3
mov T0,CSR ; Initialize inner loop order-2 times
|| rptblocal sample_loop-1 ; Start the outer loop
mov *AR0+,*AR3 ; Put the new sample to signal buffer
mpym *AR3+,*AR1+,AC0 ; Do the 1st operation
|| rpt CSR ; Start the inner loop
macm *AR3+,*AR1+,AC0
macmr *AR3,*AR1+,AC0 ; Do the last operation with rounding
mov hi(AC0),*AR2+ ; Save Q15 filtered value
sample_loop
popm ST3_55 ; Restore ST1, ST2, and ST3
popm ST2_55
popm ST1_55
mov AR3,*AR4 ; Update signal buffer index
ret
.end
Initially, the context in the status register is saved to the stack:
pshm ST1_55 ; Save ST1, ST2, and ST3
pshm ST2_55
pshm ST3_55
Next, fractional mode is activated by setting FRCT, which essentially configures the multiplier to
shift the result of each multiplication left by one bit. This is useful because, for example, the product of
two Q15 numbers is in Q30 format. If only the 16 most significant bits are kept and the desired output is
a signed 16-bit Q15 value, the product must first be shifted left by one bit. Fractional mode performs this
shift in hardware, avoiding the overhead of an extra instruction. Additionally,
SXMD is set, so that input operands are sign extended (required for multiplying negative numbers in
two’s complement). Finally, SATD and SMUL are set, in order to activate saturation mode and guarantee
that saturation is applied before the addition in a MAC operation. This initial configuration is implemented
by the following two lines:
or #0x340,mmap(ST1_55); Set FRCT,SXMD,SATD
bset SMUL ; Set SMUL
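The effect of fractional mode on a single product can be modelled in C. The sketch below is an illustration under assumptions, not a bit-exact model of the C55x datapath: the function name q15MulFrct is hypothetical, and the special case for -1 times -1 stands in for the saturation that SMUL provides.

```c
#include <stdint.h>

/* Multiply two Q15 values the way a MAC unit in fractional mode would:
 * Q15 x Q15 = Q30, shifted left by one bit to Q31, upper 16 bits kept.
 * >> on negative values is assumed to be an arithmetic shift. */
int16_t q15MulFrct(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;    /* Q30 product */
    if (p == 0x40000000)                    /* only for (-1.0) * (-1.0) */
        return 0x7FFF;                      /* saturate, as SMUL would */
    p *= 2;                                 /* the extra shift done by FRCT */
    return (int16_t)(p >> 16);              /* equivalent of hi(AC0) */
}
```

For example, 0.5 times 0.5 (16384 times 16384) gives 8192, which is 0.25 in Q15. Without the extra shift the upper word would hold 4096, i.e. the result in Q14.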
Next, the circular buffer addressing mode is configured:
mov mmap(AR1),BSA01 ; AR1=base address for coeff
mov mmap(T1),BK03 ; Set coefficient array size (order)
mov mmap(AR3),BSA23 ; AR3=base address for signal buffer
or #0xA,mmap(ST2_55) ; AR1 & AR3 as circular pointers
mov #0,AR1 ; Coefficient start from h[0]
mov *AR4,AR3 ; Signal buffer start from w[index]
Note auxiliary register AR1 points to the starting index of the coefficients circular buffer, which is always
zero, while auxiliary register AR3 points to the starting index of the samples circular buffer, which is
passed as argument to the block processing function (argument index).
After this, the following code sets up the loop counters for the point-wise multiplication between the
filter coefficients and the sample block:
|| sub #1,T0 ; T0=blkSize-1
mov T0,BRC0 ; Initialize outer loop to blkSize-1
sub #3,T1,T0 ; T0=order-3
mov T0,CSR ; Initialize inner loop order-2 times
|| rptblocal sample_loop-1 ; Start the outer loop
Note the first multiplication (mpym *AR3+,*AR1+,AC0) does not need to be a MAC, since it is the
first. The remaining operations are MAC and the last one performs rounding, as determined by the r
added to the mnemonic (macmr).
Finally, the context in ST1, ST2 and ST3 is restored and the index in the circular buffering corre-
sponding to the most recent sample is updated for the next function call.
popm ST3_55 ; Restore ST1, ST2, and ST3
popm ST2_55
popm ST1_55
mov AR3,*AR4 ; Update signal buffer index
ret
.end
1.1 Results
By listening to the output waveform, it is possible to note both the bandpass and bandstop filters work
as expected, just like in Experiment 3.1.
1.2 Complexity
The assembly implementation reduces the complexity of the FIR block processing to a total of 4,132 clock
cycles, which is significantly lower than the complexity of the fixed-point C implementation (147,089 clock
cycles). The block that produces an output sample (y[i]), i.e. the body of the outer loop in the listing
above, takes 51 clock cycles to be computed.
Experiment 3.3
1 Experiment 3.3
This experiment demonstrates the performance improvement provided by the specific C55x functions for
filtering with symmetric FIRs.
1.1 Implementation Analysis
The program for evaluating the performance of the symmetric FIR is composed of the C main program
and the assembly function for block filtering. The C program is nearly the same as in Experiment 3.2. The
assembly function is detailed below.
Initially, the circular buffer length is defined to be equal to the number of filter taps. Additionally, a
few configurations are made: fractional mode, sign extension and saturation mode in Data Unit (D-Unit)
are activated.
mov mmap(T1),BK03 ; Set signal buffer size = order
or #0x340,mmap(ST1_55) ; Set FRCT,SXMD,SATD
Next, the starting address of the coefficients (passed in auxiliary register XAR1) is copied to the
extended coefficient data pointer, in order to prepare it to be used. Then, the CDP is configured to point
to a circular buffer whose starting address is the address of the first coefficient and length is half the filter
length (half the number of taps in the filter).
mov XAR1,XCDP ; CDP as coefficient pointer
mov mmap(AR1),BSAC ; Set up base address for CDP
sfts T1,#-1 ; T1 = order/2
|| mov #0,CDP ; Start from the 1st coefficient
mov mmap(T1),BKC ; Set the coefficient array size
In addition to the coefficient circular buffer, two other circular buffers are configured, both with length
equal to the number of taps. These buffers are referred to henceforth as the upper input
buffer, pointed to by AR3, and the lower input buffer, pointed to by AR1. The initial position of the upper
buffer is an index that is globally maintained between calls of the block FIR filter function (variable index),
while the initial position of the lower buffer is the address immediately before that of the upper
buffer.
mov XAR3,XAR1 ; AR1 & AR3 are signal buffer pointers
mov mmap(AR3),BSA01 ; Set base address of AR1 for signal buffer
mov mmap(AR3),BSA23 ; Set base address of AR3 for signal buffer
or #0x10A,mmap(ST2_55) ; CDP, AR1, AR3 are circular pointers
mov *AR4,AR3 ; AR3 is the Head of signal buffer
mov *AR4,AR1 ; AR1 is the Tail of signal buffer
|| sub #1,T0
amar *AR1- ; Adjust tail starting point
|| mov T0,BRC0 ; Outer loop counter blkSize-1
Next, the loop counter is prepared for L/2 - 2 iterations. The first input sample is copied to AC1,
converted to Q14 and copied to the upper circular buffer. At this point, assuming this is the first call to
the block FIR filtering function, the index pointed in the upper circular buffer is the first index, which
holds the most recent input sample x[0], while the index pointed in the lower circular buffer is the last
index, which holds the oldest input sample x[n - L + 1]. These are the samples that would be multiplied
by the same coefficient, since the FIR filter is symmetric. Then, to reduce the computational cost, these
two samples are added and stored in AC1 for a single multiplication by the coefficient, provided by:
|| rpt CSR ; Do order/2-2 iterations
firsadd *AR3+,*AR1-,*CDP+,AC1,AC0
This instruction takes the content of AC1, multiplies it by the coefficient pointed by CDP (coefficient
data pointer) and accumulates the result to AC0. In parallel, it takes the value in the address pointed
by AR3, adds it to the value in the address pointed by AR1 and stores the result at AC1. This is useful
because in a single instruction (also in a single clock cycle), the product due to two filter taps (symmetric
taps) is computed and the circular buffer address is updated for both addresses. It is important to note
that firsadd accesses three memory values in a single cycle. For this to occur, the value referenced by
CDP must be located in a memory bank different from the one containing the AR1 and AR3 values.
Note the pointer in the upper circular buffer increments (rotates clockwise), while the pointer in the lower
circular buffer decrements (rotates counter-clockwise). Note also only L/2 - 2 iterations are executed,
such that the coefficients fetched are never repeated. Now, since the lower and upper input circular buffers
share the same base address, a "new" input sample is visible through both the upper and lower circular
buffers.
After the L/2 - 2 iterations are executed, one more multiplication is performed, but now, after
multiplication, accumulation of the result and addition between the next pair of input samples, the circular
pointers are restored to their original positions. This is implemented by the following line:
firsadd *(AR3-T0),*(AR1+T1),*CDP+,AC1,AC0
Finally, the last multiplication (multiplication L/2) is executed by the lines below. Note the result in
AC0 is rounded, converted to Q15 in the upper 16 bits and finally stored in the output buffer.
macm *CDP+,AC1,AC0 ; Finish the last macm instruction
mov rnd(hi(AC0<<#1)),*AR2+; Store the rounded & scaled result
|| mov *AR0+,AC1 ; Get next sample
This process repeats for each of the 80 samples in the input block.
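The saving obtained from the symmetry can be illustrated in C for an even number of taps. This is a simplified sketch: the function names are hypothetical, the delay line is a plain array ordered from newest to oldest instead of a circular buffer, and a 32-bit sum replaces the Q14 scaling used in the assembly code to keep the pair addition from overflowing.

```c
#include <stdint.h>

/* Reference: one output sample with L multiplications. */
int32_t firDirect(const int16_t *w, const int16_t *h, int L)
{
    int32_t acc = 0;
    for (int i = 0; i < L; i++)
        acc += (int32_t)h[i] * w[i];
    return acc;
}

/* Symmetric version for even L: samples sharing a coefficient are added
 * first, so only L/2 multiplications are needed (the idea behind firsadd). */
int32_t firSymEven(const int16_t *w, const int16_t *h, int L)
{
    int32_t acc = 0;
    for (int i = 0; i < L / 2; i++) {
        int32_t pair = (int32_t)w[i] + w[L - 1 - i];   /* upper + lower sample */
        acc += (int32_t)h[i] * pair;
    }
    return acc;
}
```

Both functions return the same value whenever h[i] equals h[L-1-i], which is the property a linear-phase FIR filter guarantees.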
1.2 Results
Fig. 1 and Fig. 2 present the magnitude spectrograms of the input and output signals. Note how the filter
effectively attenuates the tones outside its passband.
With respect to the output obtained with the filter in Experiment 3.2, this output differs slightly,
mainly because the input samples are truncated to Q14. The error between the output
obtained in Experiment 3.2 and the current experiment is on the order of $10^{-5}$, as presented in Fig. 3.
Figure 1: Input signal spectrogram (frequency in Hz versus time)
Figure 2: Output signal spectrogram (frequency in Hz versus time)
Figure 3: Error between the output of Experiment 3.2 and Experiment 3.3 (error on a 10^-5 scale versus time in ms)
1.3 Computational Cost
One iteration of the loop repeated below demands 28 clock cycles. In other words, one output sample
takes 28 clock cycles to be computed, if the configuration instructions are ignored. Since this loop is
repeated 80 times (for 80 input samples), the block processing takes 2,240 clock cycles. From the
C test function, a single call to the block filtering function symFir takes 2,309 clock cycles (this count
includes the configuration instructions). Compared to the assembly implementation that does not
take advantage of the filter symmetry (which demands 4,132 clock cycles), this is a significant improvement.
mov #0,AC0 ; input is scaled to Q14 format
|| mov AC1<<#-1,*AR3 ; Put input to signal buffer in Q14
add *AR3+,*AR1-,AC1 ; AC1=[x(n)+x(n-L+1)]<<16
|| rpt CSR ; Do order/2-2 iterations
firsadd *AR3+,*AR1-,*CDP+,AC1,AC0
mov rnd(hi(AC0<<#1)),*AR2+; Store the rounded & scaled result
|| mov *AR0+,AC1 ; Get next sample
1.4 Bandpass Filter Design
An equiripple linear-phase bandpass FIR filter is designed using the Parks-McClellan algorithm, through
the MATLAB function firpm. The passband extends from a normalized frequency of 0.4 up to 0.5, which
translates to a frequency range from 1.6 kHz to 2 kHz for a sampling frequency of 8 kHz (a normalized
frequency of 1 corresponds to the Nyquist frequency of 4 kHz).
The following MATLAB script implements the filter design, analysis, conversion to Q15 format and
export of the FIR taps in Q15 format to a C header.
f = [0 0.3 0.4 0.5 0.6 1]; % Frequency band edges
m = [0 0 1 1 0 0]; % Desired magnitude response
b = firpm(48, f, m); % Filter taps
% Analyze filter:
fvtool(b, 1)
% Define quantizer object
q = quantizer([16 15], 'RoundMode', 'round');
% Quantize FIR coefficients:
b_quantized = quantize(q, b);
% Int16 Q15 format:
q15_multiplier = hex2dec('7FFF');
b_q15 = cast(b_quantized * q15_multiplier, 'int16');
%% Export filter coefficients to a C Header
fd = fopen('firCoef.c', 'w');
fprintf(fd, 'Int16 firCoefFixedPoint[NUM_TAPS]={\n');
for i=1:length(b_q15)
if (mod(i, 10) == 0)
fprintf( fd, '\n');
end
fprintf( fd, '%s, ', num2str(b_q15(i)) );
end
fprintf(fd, '};');
fclose(fd);
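The conversion step of the script can also be sketched in C, which may be convenient when coefficients are generated on the host. The function floatToQ15 is a hypothetical helper; it uses the same 0x7FFF multiplier as the script, rounds to the nearest integer and saturates values outside the representable range.

```c
#include <stdint.h>

/* Convert a coefficient in [-1, 1] to a 16-bit Q15 integer (hypothetical
 * helper), scaling by 0x7FFF as in the MATLAB script. */
int16_t floatToQ15(double v)
{
    double scaled = v * 32767.0;
    if (scaled > 32767.0)  scaled = 32767.0;    /* saturate high */
    if (scaled < -32768.0) scaled = -32768.0;   /* saturate low */
    /* Round half away from zero, then truncate toward zero. */
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}
```

With this scaling a coefficient of 1.0 maps to 32767 and 0.5 maps to 16384, so the largest positive value never overflows the 16-bit word.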
The filter magnitude and phase responses are presented in Fig. 4 and Fig. 5, respectively. Note the
phase response is linear in the frequency interval corresponding to the passband, as desired.
Figure 4: Magnitude response (dB) versus normalized frequency (×π rad/sample)
Figure 5: Phase response (radians) versus normalized frequency (×π rad/sample)
The following FIR taps were obtained:
const Int16 firCoefFixedPoint[NUM_TAPS]={
-103, 101, 148, 3, -18, 21, -212, -186, 462,
525, -480, -756, 221, 477, 0, 499, 388, -1837,
-1761, 2748, 3873, -2504, -5833, 1013, 6634, 1013, -5833,
-2504, 3873, 2748, -1761, -1837, 388, 499, 0, 477,
221, -756, -480, 525, 462, -186, -212, 21, -18,
3, 148, 101, -103
};
1.5 Filtering with odd length FIR filter
In order to effectively filter using an FIR filter with an odd number of taps, the assembly function must
be modified. Initially, a flag indicating an odd length must be set, as provided by the following bit test
instruction:
btst @#0, T1, TC1 ; Test if order is odd
Then, throughout the program, this flag is checked by conditional-execution instructions specific to
odd-length filters. In the case of the designed filter with order 48 (49 taps), the main filtering loop must
be repeated 23 times, instead of 22. Thus, the following modification must be added for the value of T1:
sfts T1,#-1 ; T1 = order/2 - If T1 is odd, T1 = order/2 - 1
xcc TC1 ; Test if odd
add #1, T1 ; If odd, make T1 = (order - 1)/2 + 1
At the end of the 23 iterations in the main loop, both index pointers (upper and lower input buffer)
point to the sample at index 24. Thus, the approach is to continue executing
but followed by the following two lines, which divide AC1 by two, since it holds 2x[25] (twice the 25-th
input sample).
xcc TC1 ; If number of taps is odd
sfts AC1, #-1 ; Divide AC1 by two, because the input sample was
; added twice to it
Then, the final multiplication can be executed normally:
Note that, at the end, AR3 is updated to 48 (instead of 47), while AR1 is updated to 47 (instead of 46).
This follows from the fact that, before the loop, T0 is loaded with 25 ((L-1)/2 + 1) and T1 is loaded with 23.
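The handling of the centre tap can be illustrated in C. The sketch below is a reference model written for this explanation, not a translation of the assembly routine: the function name sym_fir_odd and the buffer layout (x[0] is the newest sample, x[k] the sample delayed by k) are assumptions. It pairs the samples that share a coefficient and uses the centre tap exactly once, which is the same effect the assembly code obtains by halving the doubled centre sample.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference routine: one output of a symmetric FIR filter
 * with an odd number of taps. h[] holds all num_taps Q15 coefficients,
 * x[0] is the newest input sample and x[k] the sample delayed by k. */
int16_t sym_fir_odd(const int16_t *x, const int16_t *h, int num_taps)
{
    assert(num_taps > 0 && num_taps % 2 == 1);

    int half = (num_taps - 1) / 2; /* number of mirrored pairs */
    int64_t acc = 0;

    for (int k = 0; k < half; k++) {
        /* h[k] equals h[num_taps-1-k], so the two samples are added first */
        int32_t pair = (int32_t)x[k] + (int32_t)x[num_taps - 1 - k];
        acc += (int64_t)h[k] * pair;
    }
    /* The centre tap has no mirror partner and is used exactly once */
    acc += (int64_t)h[half] * x[half];

    /* Q15 x Q15 products are Q30; shift back to Q15 (no saturation here,
     * and an arithmetic right shift is assumed for negative values) */
    return (int16_t)(acc >> 15);
}
```

For the 49-tap filter above, this gives 24 mirrored pairs plus the centre tap, so roughly half as many multiplications as the direct form.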
1.5.1 Computational Cost
In terms of computational cost, these modifications make the entire block filtering function symFir
demand 2,554 clock cycles (including configuration instructions).
Experiment 3.4
1 Experiment 3.4
This experiment demonstrates the use of Dual-MAC instructions on FIR filtering.
1.1 Implementation Analysis
At the beginning of the C test function, two different sections are used for the filter coefficients
(".const:fir") and for the input buffer (".bss:fir"), in order to avoid bus contention. Additionally, the
output buffer is aligned to 32 bits (2 words), in order to allow dual-memory store instructions.
#pragma DATA_SECTION(dualMacFirCoef, ".const:fir");
#pragma DATA_SECTION(w, ".bss:fir");
#pragma DATA_SECTION(y, ".bss:fir");
#pragma DATA_ALIGN(y,2); // Alignment is needed for dual accumulator store
The rest of the C test function is nearly the same as in the previous experiment. The main difference
lies in the assembly function for filtering a block of input samples.
Unlike the assembly function in the previous experiment (3.5.3), the input buffers now have
size equal to the number of filter taps plus one, as given by:
|| add #1,T1
mov mmap(T1),BK03 ; Set signal buffer x[] size as order+1
Additionally, the coefficient buffer is of length equal to the number of filter taps (the symmetry in
coefficients is not exploited).
In this implementation, the two input circular buffers are not used for fetching input samples that are
going to be multiplied by symmetric coefficients. Instead, they are used to fetch consecutive input samples,
such that Dual-MAC instructions can be executed. Note the following instruction, which prepares one of
the input circular buffers to point to the index immediately following the address pointed to by the
other circular buffer:
amar *AR1+ ; AR1 delayline index+1
Given that two consecutive input samples are fetched separately through the two input circular buffers,
two MAC operations are executed in parallel inside the main loop, as given by the lines below. Note the
results are stored in AC0 and AC1:
|| rpt CSR
mac *AR1+,*CDP+,AC0 ; The rest MAC iterations
:: mac *AR3+,*CDP+,AC1
In the end, the "pair" mnemonic on ACx determines that the data in a pair of accumulators (ACx
and ACx+1) is moved to the pointed address. In the line below, the upper 16 bits from AC0 and AC1
are moved to the output buffer index pointed by AR2.
mov pair(hi(AC0)),dbl(*AR2+); Store two output data (must be aligned in 32-bit!)
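The idea behind the Dual-MAC loop can be modelled in C as computing two consecutive outputs in one pass over the coefficients, so that each coefficient is fetched once and used twice. The sketch below is an illustration under stated assumptions, not the routine of the experiment: the name fir_two_outputs is hypothetical, a linear (non-circular) input buffer is assumed, and the caller must guarantee that the samples x[n-(num_taps-1)] to x[n+1] exist around the pointer xn.

```c
#include <stdint.h>

/* Hypothetical reference routine: compute y[n] and y[n+1] together.
 * xn points at x[n] inside a linear buffer; h[] holds Q15 taps. */
void fir_two_outputs(const int16_t *xn, const int16_t *h, int num_taps,
                     int16_t *y0, int16_t *y1)
{
    int64_t ac0 = 0; /* plays the role of AC0 */
    int64_t ac1 = 0; /* plays the role of AC1 */

    for (int k = 0; k < num_taps; k++) {
        int16_t coef = h[k];                /* fetched once per iteration */
        ac0 += (int64_t)coef * xn[-k];      /* contribution to y[n]   */
        ac1 += (int64_t)coef * xn[1 - k];   /* contribution to y[n+1] */
    }
    /* Q30 accumulators back to Q15 (no saturation in this sketch) */
    *y0 = (int16_t)(ac0 >> 15);
    *y1 = (int16_t)(ac1 >> 15);
}
```

Because both accumulators consume the same coefficient, the coefficient memory is read half as often per output, which is where the gain over the single-MAC loop comes from.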
1.2 Results
Fig. 1 and 2 present the input and output spectrograms. Note the tones at 800 Hz and 3300 Hz are
effectively filtered.
[Figure 1: Input signal spectrogram. Axes: Time versus Frequency (Hz)]
[Figure 2: Output signal spectrogram. Axes: Time versus Frequency (Hz)]
1.3 Computational Cost
The FIR filtering using Dual-MAC instructions demands 2,181 clock cycles. Again, this is an
improvement in computational cost with respect to the previous experiments.
Table 1 presents a comparison between the complexity obtained with dual-MAC FIR filtering and the
computational complexity of filtering with the C fixed-point implementation, assembly implementation
and symmetric FIR implementation.
Implementation      Number of Clock Cycles
C fixed-point       147,089
Regular Assembly    4,132
Symmetric FIR       2,309
Dual-MAC            2,181
Table 1: Computational cost comparison between the different implementations of FIR block processing
1.4 Part 5 - Filter Design
An equiripple lowpass FIR filter with 24 coefficients is designed using FDATool, with the goal of meeting
the specifications summarized in Table 2.
                 Frequency    Magnitude
Passband edge    1 kHz        1 dB
Stopband edge    1.2 kHz      50 dB
Table 2: Lowpass Filter Specifications
The filter magnitude and phase response are presented in Fig. 3 and Fig. 4, respectively. Note the
phase response is linear in the frequency interval corresponding to the passband, as desired.
[Figure 3: Magnitude response. Axes: Frequency (kHz) versus Magnitude (dB)]
[Figure 4: Phase Response. Axes: Frequency (kHz) versus Phase (radians)]
The following FIR taps in Q15 format were obtained:
-1585, 1685, 1709, 1418, 478, -841, -1787, -1516, 361,
3458, 6685, 8740, 8740, 6685, 3458, 361, -1516, -1787,
-841, 478, 1418, 1709, 1685, -1585
};
1.4.1 Result
Fig. 5 presents the spectrogram of the output signal. By observing the filter magnitude response in Fig. 3,
it is possible to conclude the filter implementation is yielding the expected result: tone at 800 Hz is not
attenuated, tone at 1800 Hz is attenuated by approximately 20 dB, while the tone at 3300 Hz is attenuated
significantly.
Figure 5: Lowpass Filter Output Spectrogram
1.5 Part 6 - Filter redesign
The filter from the previous section is redesigned. This time, the filter order is determined by the
algorithm run by FDATool, which returns an order of 68.
The following FIR taps in Q15 format were obtained:
121, 142, 124, 12, -171, -345, -413, -320, -99,
130, 226, 120, -128, -353, -381, -155, 210, 473,
428, 52, -442, -712, -512, 114, 801, 1056, 567,
-520, -1587, -1809, -608, 1969, 5163, 7795, 8813, 7795,
5163, 1969, -608, -1809, -1587, -520, 567, 1056, 801,
114, -512, -712, -442, 52, 428, 473, 210, -155,
-381, -353, -128, 120, 226, 130, -99, -320, -413,
-345, -171, 12, 124, 142, 121
};
1.5.1 Result
Fig. 6 presents the spectrogram of the signal at the output of the lowpass filter with 69 taps. Note that
now, using the order returned by FDATool (and not forcing a given filter order), the specifications in
Table 2 are met.
Figure 6: Lowpass Filter Output Spectrogram
Experiment 4.1
1 Experiment 4.1
This experiment demonstrates the implementation of IIR filtering using a C function. The input signal is
a combination of three sinusoids with frequency 800 Hz, 1800 Hz and 3300 Hz.
1.1 Bandpass Elliptic IIR Filter
An Elliptic bandpass filter is designed with the goal of filtering out the tones at 800 Hz and
3300 Hz. The filter magnitude and phase response are presented in Fig. 1 and Fig. 2, respectively. Note
the phase response is not linear in the passband.
Figure 1: Elliptic Bandpass IIR Filter: Magnitude response
Figure 2: Elliptic Bandpass IIR Filter: Phase response
1.2 Output Validation
Fig. 3 and 4 present the spectrograms for the input and output signals. Note the tones at 800 Hz and
3300 Hz are effectively filtered.
[Figure 3: Input signal spectrogram. Axes: Time versus Frequency (Hz)]
[Figure 4: Output signal spectrogram. Axes: Time versus Frequency (Hz)]
1.3 Bandstop filter design
In this part of the experiment, an IIR filter is designed to have two passbands centered at 800 and 3300 Hz.
The filter is designed by the following Matlab script, which yields a filter with the magnitude response
presented in Fig. 5.
Figure 5: Bandstop IIR Filter Magnitude response
fs = 8000;
Rp = 0.1;
Rs = 60;
[N,Wn] = ellipord(800/(fs/2), 1200/(fs/2), Rp, Rs);
[b, a] = ellip(N, Rp, Rs, Wn);
% Passband centered at 800 Hz
% NOTE: bandwidth should be 400 Hz
[num1, den1] = iirlp2bs(b, a, 0.25, [600/(fs/2) 1000/(fs/2)]);
% Passband centered at 3300 Hz
[num2, den2] = iirlp2bs(b, a, 0.25, [3100/(fs/2) 3500/(fs/2)]);
% Cascaded combination:
num = conv(num1, num2);
den = conv(den1, den2);
% Analyze filter:
fvtool(num, den)
The numerator and denominator of the filter are given by:
num =
Columns 1 through 9
0.0016 0.0007 -0.0037 -0.0011 0.0055 0.0012 -0.0049 -0.0004 0.0031
Columns 10 through 18
0.0003 0.0005 0.0001 -0.0020 0.0001 0.0005 0.0003 0.0031 -0.0004
Columns 19 through 25
-0.0049 0.0012 0.0055 -0.0011 -0.0037 0.0007 0.0016
den =
Columns 1 through 9
1.0000 0.5074 -4.3234 -1.4132 12.9245 3.1651 -26.0632 -3.8530 42.0055
Columns 10 through 18
3.6751 -52.7665 -1.1319 54.9906 -1.0758 -45.9154 2.9973 31.8037 -2.7216
Columns 19 through 25
-17.1539 1.9438 7.3969 -0.7509 -2.1463 0.2356 0.4332
1.3.1 Result
When adjusting the C function to use the designed bandpass filter, the spectrogram of the output signal
is the one presented in Fig. 6. Note, as expected, only the tones at 800 Hz and 3300 Hz are maintained.
Figure 6: Spectrogram of the bandpass IIR filter output signal
Experiment 4.2
1 Experiment 4.2
This experiment demonstrates the implementation of IIR filtering using a fixed-point C function. The
input signal is assumed to be in Q15 and consists of a combination of three sinusoids with frequencies
800 Hz, 1800 Hz and 3300 Hz. The filter coefficients from Experiment 4.1 are converted to Q11 format.
1.1 Bandpass Elliptic IIR Filter
Recall that the filter in Experiment 4.1 has the magnitude response presented in Fig. 1.
Figure 1: Elliptic Bandpass IIR Filter: Magnitude response
Fig. 2 presents the spectrogram of the output signal. Note the tones at 800 Hz and 3300 Hz are effectively
filtered. For comparison, Fig. 3 presents the spectrogram obtained with the floating-point filter
implementation. Additionally, the time domain error between the fixed-point output and floating-point
output is presented in Fig. 4. Note the order of the error is significant (within $10^{-2}$), but the tones
are still filtered with the desired attenuation of nearly 60 dB.
Figure 2: Fixed-point filter output spectrogram
Figure 3: Floating-point filter output spectrogram
Figure 4: Absolute Error between the output of the floating-point IIR filter and fixed-point IIR filter
1.3 Noise signal
This part of the experiment evaluates the filtering of the white noise signal shown in Fig. 5, whose power
spectral density is around −50 dB/Hz, as shown in Fig. 6.
Figure 5: Noise signal
The spectrogram of the noise signal is presented in Fig. 7. After passing it through the IIR filter, the
output signal has the spectrogram presented in Fig. 8, in which it is possible to observe that only the
noise in the passband of the IIR filter is maintained, as expected.
Figure 6: Noise signal PSD
Figure 7: Noise signal spectrogram
Figure 8: Spectrogram of the output signal for a white noise input
1.4 Lowpass filter design
In this part of the experiment, an Elliptic lowpass filter is designed to attenuate the components at 1800 Hz
and 3300 Hz by 60 dB and retain the tone at 800 Hz. The designed filter magnitude response is presented
in Fig. 9.
Figure 9: Elliptic Lowpass IIR Filter: Magnitude response
The numerator and denominator contain the following coefficients:
b =
0.0041 -0.0073 0.0140 -0.0126 0.0161 -0.0126 0.0140 -0.0073 0.0041
a =
1.0000 -5.4320 14.0418 -22.1450 23.1314 -16.3134 7.5707 -2.1126 0.2718
1.4.1 C Implementation
In order to adjust the C test program to use the coefficients of the designed lowpass IIR filter, a few
modifications must be made such that Q10 is used, instead of Q11. This is because, for some coefficients in
the denominator, an overflow would occur when converting to Q11 format. The following coefficient arrays
were used in the C program:
Int16 num[NL] = {
(Int16)(0.0041*Q10+RND), (Int16)(-0.0073*Q10+RND), (Int16)(0.0140*Q10+RND),
(Int16)(-0.0126*Q10+RND), (Int16)(0.0161*Q10+RND), (Int16)(-0.0126*Q10+RND),
(Int16)(0.0140*Q10+RND), (Int16)(-0.0073*Q10+RND), (Int16)(0.0041*Q10+RND)
};
Int16 den[DL] = {
(Int16)(1.0000*Q10+RND), (Int16)(-5.4320*Q10+RND), (Int16)(14.0418*Q10+RND),
(Int16)(-22.1450*Q10+RND), (Int16)(23.1314*Q10+RND), (Int16)(-16.3134*Q10+RND),
(Int16)(7.5707*Q10+RND), (Int16)(-2.1126*Q10+RND), (Int16)(0.2718*Q10+RND)
};
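The choice of Q10 can be checked with simple arithmetic: the largest denominator coefficient, 23.1314, scales to about 47,373 in Q11 (23.1314 × 2^11), which exceeds the 16-bit limit of 32,767, whereas in Q10 it scales to about 23,687 and fits. The sketch below performs the same check in C. The helper name fits_in_q is hypothetical and rounding to nearest is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if value * 2^frac_bits, rounded to the
 * nearest integer, is representable as a signed 16-bit integer. */
int fits_in_q(double value, int frac_bits)
{
    double scaled = value * (double)(1L << frac_bits);

    /* Round to nearest, ties away from zero */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;

    /* After truncation toward zero the result must lie in [-32768, 32767] */
    return scaled < 32768.0 && scaled > -32769.0;
}
```

Running such a check over every entry of the numerator and denominator arrays before choosing the Q format avoids silent wrap-around in the (Int16) casts.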
1.4.2 Result
When adjusting the C function to use the given lowpass filter, the spectrogram of the output signal is the
one presented in Fig. 10. Note, as expected, only the tone at 800 Hz is maintained.
Figure 10: Spectrogram of the lowpass IIR filter output signal
1.5 Highpass filter design
In this part of the experiment, an Elliptic highpass filter is designed to attenuate the components at 800 Hz
and 1800 Hz and retain the tone at 3300 Hz. The designed filter magnitude response is presented
in Fig. 11.
Figure 11: Elliptic Highpass IIR Filter: Magnitude response
b =
0.0466 -0.1027 0.2048 -0.2606 0.2606 -0.2048 0.1027 -0.0466
a =
1.0000 1.5281 2.9075 2.6302 2.2776 1.1759 0.4723 0.0939
1.5.1 Result
When adjusting the C function to use the designed highpass filter, the spectrogram of the output signal
is the one presented in Fig. 12. Note, as expected, only the tone at 3300 Hz is maintained.
Figure 12: Spectrogram of the highpass IIR filter output signal
Experiment 4.3
1 Experiment 4.3
This experiment demonstrates the implementation of IIR filtering using a cascaded combination of second
order IIR filters realized in the Direct-II form. The input signal is a combination of three sinusoids with
frequency 800 Hz, 1500 Hz and 3300 Hz.
1.1 Implementation
The fixed-point filtering implementation using the IIR sections requires, for each second-order section, 5
coefficients and a buffer of 2 signal samples. The coefficients are $a_1$, $a_2$, $b_0$, $b_1$ and $b_2$
(see footnote 1). The two samples are the memories for $w[n-1]$ and $w[n-2]$. The algorithm is as follows:
1. For the i-th second order section, update the value for $w_i[n]$ using the following expression:
$w_i[n] = x_i[n] - a_{1,i} w_i[n-1] - a_{2,i} w_i[n-2]$
2. Update the second-order section output using the following expression:
$y_i[n] = b_{0,i} w_i[n] + b_{1,i} w_i[n-1] + b_{2,i} w_i[n-2]$
3. Update the input sample for the succeeding second-order section using the following expression:
$x_{i+1}[n] = y_i[n]$
The test C program obtains the coefficients from a header file, which, for both numerator and denom-
inator, declares an array with 2 rows for each second order section, the first with an overall gain and the
second with the actual coefficients.
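The three-step algorithm described in this section can be sketched in floating point as follows. This is a minimal reference model, not the fixed-point code of the experiment: the function name, the flat coefficient layout {a1, a2, b0, b1, b2} per section and the state layout {w[n-1], w[n-2]} per section are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical reference model: process one input sample through a
 * cascade of Direct-Form II second-order sections.
 * coef  : 5 values per section, ordered { a1, a2, b0, b1, b2 } (a0 = 1)
 * state : 2 values per section, ordered { w[n-1], w[n-2] } */
double sos_cascade_sample(double x, const double *coef, double *state,
                          size_t num_sections)
{
    for (size_t i = 0; i < num_sections; i++) {
        const double *c = &coef[5 * i];
        double *s = &state[2 * i];

        /* Step 1: w[n] = x[n] - a1*w[n-1] - a2*w[n-2] */
        double w = x - c[0] * s[0] - c[1] * s[1];
        /* Step 2: y[n] = b0*w[n] + b1*w[n-1] + b2*w[n-2] */
        double y = c[2] * w + c[3] * s[0] + c[4] * s[1];

        /* Shift the delay line of this section */
        s[1] = s[0];
        s[0] = w;

        /* Step 3: the output feeds the next section */
        x = y;
    }
    return x;
}
```

A floating-point model of this kind is convenient as a golden reference when validating the fixed-point output sample by sample.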
Fig. 1 and Fig. 2 present the spectrogram of the input and output signals, respectively. Note the tones at
800 Hz and 3300 Hz are effectively filtered.
Footnote 1: The coefficient $a_0$ is assumed to be unitary, as usual. Thus, it is not required for computation.
Figure 1: Spectrogram of the input signal
Figure 2: Spectrogram of the output signal
1.3 Highpass filter design
In this part of the experiment, an Elliptic highpass filter is designed to attenuate the components at 800 Hz
and 1500 Hz and retain the tone at 3300 Hz. The designed filter magnitude response is presented
in Fig. 3.
b =
0.0232 -0.0720 0.1459 -0.1990 0.1990 -0.1459 0.0720 -0.0232
Figure 3: Elliptic Highpass IIR Filter: Magnitude response
a =
1.0000 1.7121 3.0092 2.8976 2.3582 1.2601 0.4773 0.0948
The filter can be converted to a cascaded combination of four second order sections:
[sos,g] = tf2sos(b,a)
sos =
1.0000 -1.0000 0 1.0000 0.4189 0
1.0000 -1.3069 1.0000 1.0000 0.6627 0.3654
1.0000 -0.5494 1.0000 1.0000 0.3866 0.6817
1.0000 -0.2528 1.0000 1.0000 0.2439 0.9083
g =
0.0232
where, in each row, the first three values are the coefficients of the numerator and the last three are the
coefficients of the denominator. The gain is such that $H(z) = g \cdot H_1(z) \cdot H_2(z) \cdot H_3(z) \cdot H_4(z)$.
All the sections above have poles inside the unit circle, which implies that the second-order sections are
stable.
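The stability of each section can be verified directly from its denominator coefficients, without computing the poles. For a denominator $1 + a_1 z^{-1} + a_2 z^{-2}$ with real coefficients, both poles lie inside the unit circle if and only if $|a_2| < 1$ and $|a_1| < 1 + a_2$ (the so-called stability triangle). The following sketch applies this test; the helper name sos_is_stable is hypothetical.

```c
/* Hypothetical helper: returns 1 if the poles of
 * 1 + a1*z^-1 + a2*z^-2 lie strictly inside the unit circle,
 * using the stability-triangle conditions |a2| < 1 and |a1| < 1 + a2. */
int sos_is_stable(double a1, double a2)
{
    double abs_a1 = (a1 < 0.0) ? -a1 : a1;
    double abs_a2 = (a2 < 0.0) ? -a2 : a2;

    return abs_a2 < 1.0 && abs_a1 < 1.0 + a2;
}
```

Applied to the four rows of the sos matrix above, i.e. the pairs (0.4189, 0), (0.6627, 0.3654), (0.3866, 0.6817) and (0.2439, 0.9083), both conditions hold for every section. The last section has $a_2 = 0.9083$, so its poles are the closest to the unit circle and the most sensitive to coefficient quantization.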
1.3.1 Results
Using the designed highpass filter, it is possible to obtain the output signal whose spectrogram is presented
in Fig. 4. Note the tones at 800 and 1500 Hz are effectively attenuated by nearly 60 dB and only the tone
at 3300 Hz is preserved, as desired.
Figure 4: Spectrogram of the highpass filter output signal
Experiment 4.4
1 Experiment 4.4
This experiment demonstrates the implementation with intrinsics of IIR filtering using a cascaded
combination of second order IIR filters realized in the Direct-II form. The input signal is a combination
of three sinusoids with frequency 800 Hz, 1500 Hz and 3300 Hz.
1.1 Implementation
Similar to Experiment 4.3, the implementation in Experiment 4.4 requires, for each second-order section,
5 coefficients and a buffer of 2 signal samples. The coefficients are $a_1$, $a_2$, $b_0$, $b_1$ and $b_2$
(see footnote 1). The two samples are the memories for $w[n-1]$ and $w[n-2]$. The algorithm is as follows:
1. For the i-th second order section, update the value for $w_i[n]$ using the following expression:
$w_i[n] = x_i[n] - a_{1,i} w_i[n-1] - a_{2,i} w_i[n-2]$
2. Update the second-order section output using the following expression:
$y_i[n] = b_{0,i} w_i[n] + b_{1,i} w_i[n-1] + b_{2,i} w_i[n-2]$
3. Update the input sample for the succeeding second-order section using the following expression:
$x_{i+1}[n] = y_i[n]$
With the use of intrinsics, this algorithm is implemented mainly by the following operations:
1. First operation, executed once in the beginning for each input sample: $w_i[n] = x[n]$
w_0 = (Int32)x[n]<<12; // Scale input to prevent overflow
2. Multiply and subtract operation: $w_i[n] = w_i[n] - a_{1,i} w_i[n-1]$
w_0 = _smas(w_0,*(w+l),*(coef+j)); j++; l=(l+Ns)&s;
3. Multiply and subtract operation: $w_i[n] = w_i[n] - a_{2,i} w_i[n-2]$
w_0 = _smas(w_0,*(w+l),*(coef+j)); j++;
4. Update $w_i[n-2]$ or $w_i[n-1]$ (depending on iteration) with the value of $w[n]$:
temp16 = *(w+l); // temporarily store the old value for w[n-2]
*(w+l) = (Int16)(w_0>>15); // Save in Q15
5. Multiply operation: $y_i[n] = b_{2,i} w_i[n-2]$
w_0 = _lsmpy( temp16,*(coef+j)); j++;
6. Multiply and accumulate operation: $y_i[n] = y_i[n] + b_{0,i} w_i[n]$
w_0 = _smac(w_0,*(w+l),*(coef+j)); j++; l=(l+Ns)&s;
7. Multiply and accumulate operation: $y_i[n] = y_i[n] + b_{1,i} w_i[n-1]$
w_0 = _smac(w_0,*(w+l),*(coef+j)); j=(j+1)%m; l=(l+1)&s;
Note the same variable name is used while computing $w_i[n]$ and $y_i[n]$. This is because $y_i[n]$ is used
as the input sample to the succeeding second-order section.
Footnote 1: The coefficient $a_0$ is assumed to be unitary.
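To trace the arithmetic of these operations on a host PC, portable stand-ins for the intrinsics can be useful. The sketch below is such a model and not TI code: the names lsmpy_ref, smac_ref and smas_ref are hypothetical, and the modelled behaviour (16×16-bit product shifted left by one, with 32-bit saturation of the product and of the sum) is an assumption that should be checked against the C55x compiler documentation.

```c
#include <stdint.h>

/* Saturate a 64-bit intermediate value to the 32-bit range */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) {
        return INT32_MAX;
    }
    if (v < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)v;
}

/* Assumed model of _lsmpy: fractional multiply, (a * b) << 1, saturated */
int32_t lsmpy_ref(int16_t a, int16_t b)
{
    return sat32(((int64_t)a * b) * 2);
}

/* Assumed model of _smac: acc + ((a * b) << 1), saturated */
int32_t smac_ref(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + lsmpy_ref(a, b));
}

/* Assumed model of _smas: acc - ((a * b) << 1), saturated */
int32_t smas_ref(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc - lsmpy_ref(a, b));
}
```

Under this model, multiplying two Q15 values yields a Q31 result, which is consistent with the `w_0>>15` shift used above to bring the accumulator back to a 16-bit value.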
800 Hz and 3300 Hz are effectively filtered.
[Figure 1: Spectrogram. Axes: Time versus Frequency (Hz)]
[Figure 2: Spectrogram. Axes: Time versus Frequency (Hz)]
1.3 Clock cycles
Using intrinsics, one call to intrinsics_IIR() (to process a block of samples) takes 100,521 clock cycles.
This is a significant computational cost reduction with respect to the implementation in Experiment 4.3,
which requires 188,840 clock cycles for a block processing call to cascadeIIR().
1.4 Notch filter design
In this part of the experiment, a second-order notch IIR filter is designed to attenuate the tone at 800 Hz
by 60 dB and retain the tones at 1500 and 3300 Hz. The designed filter magnitude response is presented
in Fig. 3, for both the floating-point and fixed-point (quantized) versions.
Figure 3: Second-order notch IIR Filter: Magnitude response (curves: Filter #1: Quantized, Filter #1: Reference)
This filter was designed using the cascaded combination of two second-order notch IIR filters, whose
numerator and denominator contain the following coefficients:
b =
0.9941 -1.6086 0.9941
a =
1.0000 -1.6086 0.9883
When converting the filter to fixed-point, the filter coefficients are scaled considering the data that will
be filtered, in order to prevent overflow. The following script implements the filter design:
clear all
clc
[data, fs] = wavread('in.wav');
% Notch filter design
fs = 8000;
Wo = 800/(fs/2);
BW = 5/(fs/2);
[b, a] = iirnotch(Wo, BW, 10)
% Cascade two of those second order sections
Hd = dfilt.df2tsos(b, a, b, a);
set(Hd, 'arithmetic', 'fixed')
set(Hd, 'coeffAutoScale', true)
set(Hd, 'overflowMode', 'saturate')
Hd = autoscale(Hd, data);
info(Hd)
fvtool(Hd)
fipref('LoggingMode', 'on', 'DataTypeOverride', 'ForceOff');
y = filter(Hd, data);
R = qreport(Hd)
y_double = cast(y, 'double');
%sound(y_double, fs)
spectrogram(y_double,512,256,512,fs,'yaxis')
1.4.1 Results
Using the designed notch filter, it is possible to obtain the output signal whose spectrogram is presented
in Fig. 4. Note the tone at 800 Hz is effectively attenuated by nearly 60 dB and the other two tones (at
1500 Hz and 3300 Hz) are preserved, as desired.
Figure 4: Spectrogram of the notch filter output signal
1.5 Use of Intrinsics in Experiment 4.2
If fixPoint_IIR() from Experiment 4.2 is modified to use intrinsics, one call to it (to process a block of
samples) takes 1,116 clock cycles, in contrast to 1,098 clock cycles for the original (not using intrinsics).
In this case, the modifications to use intrinsics did not provide an improvement: the cycle count is
slightly higher.
Experiment 4.5
1 Experiment 4.5
This experiment demonstrates the implementation in assembly of IIR filtering using a cascaded combina-
tion of second order IIR filters realized in the Direct-II form. The input signal is the same as the one in
Experiment 4.4, namely a combination of three sinusoids with frequency 800 Hz, 1500 Hz and 3300 Hz.
800 Hz and 3300 Hz are effectively attenuated by approximately 60 dB.
[Figure 1: Spectrogram. Axes: Time versus Frequency (Hz)]
[Figure 2: Spectrogram. Axes: Time versus Frequency (Hz)]
1.2 Clock cycles comparison
In the third part of the experiment, a comparison between the implementations from Experiment 4.1 to
Experiment 4.5 is presented in terms of computational cost. Since the implementations in Exp. 4.1 and 4.2
use the coefficients in regular single-section form, the coefficients from the cascaded second-order sections
in Exp. 4.3 to Exp. 4.5 must be converted. The following MATLAB script was used to accomplish this
conversion:
% It is necessary to convert the digital filter in SOS format to a normal
% format.
clear all
clc
% Numerator scales:
num_scales = [1491, 1491, 6214, 6214, 16384]; % in Q14
% Convert to double
num_scales = num_scales / 16384;
% Numerator coefficients:
num = [ 16384, 23636, 16384 ;
16384, 3306, 16384 ;
16384, 18247, 16384 ;
16384, 5864, 16384 ]; % in Q14
% Convert to double
num = num / 16384;
% Denominator scales:
den_scales = [16384 16384 16384 16384 16384]; % in Q14
% Convert to double
den_scales = den_scales / 16384;
den = [ 16384, 13231, 15528 ;
16384, 11232, 15504 ;
16384, 14706, 16069 ;
16384, 10054, 16049 ];
den = den / 16384;
%% Convert SOS matrix to a single section of higher order
sos_mtx = [num den];
gains = [num_scales den_scales]
[b, a] = sos2tf(sos_mtx, gains)
fvtool(b, a)
% exportCoefs('num.h', b, 11)
% exportCoefs('den.h', a, 11)
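At its core, converting cascaded second-order sections to a single transfer function amounts to multiplying the section numerators together and the section denominators together, that is, convolving their coefficient vectors. The sketch below shows this operation in C as an illustration of the idea. The helper name poly_mul is hypothetical, and it is not a reimplementation of sos2tf (gain handling is left out).

```c
#include <stddef.h>

/* Hypothetical helper: multiply two polynomials given by their
 * coefficient vectors (coefficient convolution).
 * out must have room for na + nb - 1 elements. */
void poly_mul(const double *a, size_t na,
              const double *b, size_t nb, double *out)
{
    if (na == 0 || nb == 0) {
        return; /* nothing to multiply */
    }
    for (size_t k = 0; k < na + nb - 1; k++) {
        out[k] = 0.0;
    }
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            out[i + j] += a[i] * b[j];
        }
    }
}
```

Applying the helper repeatedly, first to the numerators of the four sections and then to their denominators, yields two 9-coefficient vectors, since four second-order factors multiply to an eighth-order polynomial.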
The magnitude response of the given filter is presented in Fig. 3. Clearly, the filter is designed to pass
solely the tone at 1500 Hz.
Figure 3: Bandpass Filter Magnitude response
After the aforementioned conversion, it is possible to compare the computational cost of the five
different implementations using the same filter (same order). In order to make the comparison fair, given
that the implementations in Exp 4.1 and Exp 4.2 are on a per-sample basis while the others are on a per-
block basis, the computational cost for processing one sample in 4.1 and 4.2 is multiplied by the number
of samples processed at once by the functions used in 4.3, 4.4 and 4.5. Table 1 summarizes the results.
As expected, the assembly implementation is the most efficient. Additionally, it can be observed that
the intrinsics implementation does not necessarily outperform the fixed-point implementation, as in this
case.
Experiment    Number of Clock Cycles
4.1           454,080
4.2           122,560
4.3           41,873
4.4           100,521
4.5           4,862
Table 1: Computational cost comparison between implementations from Exp. 4.1 to Exp. 4.5
Experiment 5
This experiment demonstrates the computation of a DFT using several different implementations:
floating-point, fixed-point, hardware-accelerator etc.
The input signal has the DFT whose magnitude spectrum is presented in Fig. 1. The purpose of the
experiment is to verify this magnitude spectrum generated in Matlab against the one obtained through
C5515.
Figure 1: Magnitude Spectrum
1 Experiment 5.1
This experiment demonstrates the computation of a DFT using floating-point arithmetic.
Fig. 2 presents the magnitude spectrum obtained with the program provided in the experiment. Note,
however, these magnitudes are ultimately converted to fixed-point arithmetic in Q15 format. Hence, the
magnitudes are limited to a maximum of 32767, as observed in the figure. Although the scale is different,
both DFTs show impulses at the same normalized frequencies, as expected.