digital signal processing
Computer Architectures for signal processing
Harvard Architecture, Pipelining, Multiplier
Accumulator, Special Instructions for DSP, extended
Parallelism, General Purpose DSP Processors,
Implementation of DSP Algorithms for various
operations, Special purpose DSP
Hardware, Hardware Digital filters and FFT processors,
Case study and overview of TMS320
series processor, ADSP 21XX processor
2. 2
Module IV
3. 3
Outline/objectives
• Identify the most important DSP processor
architecture features and how they relate to
DSP applications
• Understand the types of code appropriate for
DSP implementation
4. 4
What is a DSP?
• A specialized microprocessor for real-time
DSP applications
– Digital filtering (FIR and IIR)
– FFT
– Convolution, matrix multiplication, etc.
5. 5
What is Digital Signal Processing?
Application of mathematical operations to
digitally represented signals
Signals represented digitally as sequences of
samples
1. Digital signals are obtained from physical signals
via transducers (e.g., microphones) and
analog-to-digital converters (ADC)
2. Digital signals are converted back to physical
signals via digital-to-analog converters (DAC)
3. Digital Signal Processor (DSP): electronic
system that processes digital signals
7. 7
The measurand can be temperature, pressure, or a speech
signal, picked up by a sensor (e.g., a thermocouple,
microphone, or load cell).
The conditioner filters, demodulates, and amplifies
the signal. The analog processor is generally a low-pass filter
used for anti-aliasing.
The ADC block converts the analog signal into digital form.
The DSP block represents the signal processor.
The DAC (digital-to-analog converter) converts the
digital signal back into analog form. The analog low-pass filter
that follows smooths the DAC output, removing the high-frequency
components introduced by reconstruction.
8. 8
What Makes a DSP a DSP?
• Single-Cycle MAC
• Multiple Execution Units
• High Bandwidth (Flat) Memory Sub-Systems
• Efficient Zero-Overhead Looping
• Short Pipeline
• High Bandwidth I/O
• Specialized Instruction Sets
• Sophisticated DMA
• Little to No Speculation
9. 9
Advantages of DSP systems
1. DSP systems are less sensitive to
environmental conditions than analog
systems.
2. Insensitivity to component
tolerances, which gives
predictable, repeatable behaviour.
3. Reprogrammability.
4. Size (portability)
13. 13
Single Cycle MAC
• MACs Typically Determine DSP
Performance and Pipeline Length (EX)
• Most DSPs Have 2-8 MAC Units
• MACs Typically Operate in Both Scalar
and Vector Modes
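To make the role of the MAC concrete, here is a minimal sketch of a direct-form FIR filter written so that each inner-loop step is one multiply-accumulate. The function name `fir_mac` and the sample values are illustrative assumptions, not taken from the slides; a real DSP would run each inner-loop step in a single cycle on a hardware MAC unit.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter: one multiply-accumulate per tap per output."""
    history = [0] * len(coeffs)          # delay line, newest sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]     # shift the new sample into the delay line
        acc = 0                          # accumulator, cleared for each output
        for h, s in zip(coeffs, history):
            acc += h * s                 # the MAC operation: acc = acc + h * s
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # A 3-tap moving sum applied to a short ramp (hypothetical data).
    print(fir_mac([1, 2, 3, 4], [1, 1, 1]))
```

An N-tap filter needs N MACs per output sample, which is why MAC throughput tends to set the achievable sample rate.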
15. 15
Hardware used in DSP
                    ASIC        FPGA     DSP
Performance         Very high   High     Medium-high
Flexibility         Very low    High     High
Power consumption   Very low    Low      Low-medium
Development time    Long        Medium   Short
16. 16
Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate
(MAC) instruction (hardware MAC units)
• Single-Instruction Multiple Data (SIMD) and Very
Long Instruction Word (VLIW) architectures
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA
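As an illustration of saturation arithmetic from the list above, the sketch below contrasts ordinary two's-complement wraparound with a saturating add on 16-bit values. The helper names `wrap16` and `sat16` and the operand values are assumptions made for this example only.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def wrap16(value):
    """Two's-complement wraparound to 16 bits (what a plain adder does)."""
    return ((value + 32768) & 0xFFFF) - 32768


def sat16(value):
    """Clamp to the 16-bit range (what a saturating adder does)."""
    return max(INT16_MIN, min(INT16_MAX, value))


if __name__ == "__main__":
    a, b = 30000, 10000                  # hypothetical sample values
    print("wrapped:  ", wrap16(a + b))   # overflow flips the sign
    print("saturated:", sat16(a + b))    # overflow sticks at full scale
```

For audio or control signals, clipping at full scale is usually far less harmful than a sign flip, which is why DSPs offer saturation in hardware.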
54. 54
Single Instruction - Multiple Data
(SIMD)
• A technique for data-level parallelism by
employing a number of processing
elements working in parallel
55. 55
Very Long Instruction Word (VLIW)
• A technique for
instruction-level
parallelism by executing
instructions without
dependencies (known at
compile-time) in parallel
• Example of a single
VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
57. 57
Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages
with a number of phases (cycles):
– Fetch
• Program Address Generate (PG)
• Program Address Send (PS)
• Program ready wait (PW)
• Program receive (PR)
– Decode
• Dispatch (DP)
• Decode (DC)
– Execute
• 6 to 10 phases
58. 58
Direct Memory Access (DMA)
• The feature that allows peripherals to access
main memory without the intervention of the
CPU
• Typically, the CPU initiates DMA transfer, does
other operations while the transfer is in progress,
and receives an interrupt from the DMA
controller once the operation is complete.
• Can create cache coherency problems (the data
in the cache may be different from the data in
the external memory after DMA)
• Requires a DMA controller
59. 59
DSP vs. Microcontroller
• DSP
– Harvard Architecture
– VLIW/SIMD (parallel
execution units)
– No bit level operations
– Hardware MACs
– DSP applications
• Microcontroller
– Mostly von Neumann
Architecture
– Single execution unit
– Flexible bit-level
operations
– No hardware MACs
– Control applications
61. 61
General Comparison
                    DSP                        µC
Raw DSP bandwidth   Excellent                  Poor
Address space       Small to medium            Small to medium
Cost                Medium to high             Low to medium
MAC                 Yes                        No
Fast shifter        Yes                        No
Architecture        Harvard/modified Harvard   Von Neumann
62. 62
General Comparison, cont.
                        DSP        µC
Memory busses           2-3        1
Circular addressing     Yes        No
Saturation/overflow     Yes        ?
Zero-overhead looping   Yes        No
Stack                   Hardware   Memory
FFT addressing          Yes        No
Digital I/O             Minimal    Excellent
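The circular addressing row can be illustrated in software. The sketch below is a minimal circular buffer in which the write index wraps with a modulo operation. A DSP with hardware circular addressing performs that wrap in the address generator at no extra cost. The class name `CircularBuffer` and its methods are hypothetical.

```python
class CircularBuffer:
    """Fixed-size delay line whose write index wraps around (modulo addressing)."""

    def __init__(self, size):
        self.data = [0] * size
        self.head = 0                    # index of the next write

    def push(self, sample):
        self.data[self.head] = sample
        # The wrap that hardware circular addressing does for free:
        self.head = (self.head + 1) % len(self.data)

    def newest_first(self):
        """Return the stored samples ordered from newest to oldest."""
        n = len(self.data)
        return [self.data[(self.head - 1 - k) % n] for k in range(n)]


if __name__ == "__main__":
    buf = CircularBuffer(3)
    for x in [1, 2, 3, 4, 5]:            # hypothetical input samples
        buf.push(x)
    print(buf.newest_first())
```

Without hardware support, the wrap costs a compare-and-branch or a modulo on every access, which is significant inside a tight filter loop.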
63. 63
TMS320C31 (C3x) Specs
• Introduced by TI in July of 1999
• Third-gen floating point processor
• 32-bit processor
• 40ns instruction cycle time
– 50 million fp ops/sec (MFLOPS)
– 25 million instructions/sec (MIPS)
• Two 1K x 32-bit blocks of internal mem (RAM)
• 24-bit address bus
– 2^24 or 16 million words (32-bit) of mem
• Only one serial port, but very fast execution speed
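The figures above follow from simple arithmetic, sketched below. The assumption that the MFLOPS figure counts two floating-point operations (one multiply and one add) per instruction cycle is ours; it matches the 50/25 ratio on the slide but is not stated there.

```python
CYCLE_NS = 40                 # instruction cycle time from the slide
ADDRESS_BITS = 24             # address bus width from the slide
BYTES_PER_WORD = 4            # 32-bit words

# One instruction per 40 ns cycle.
mips = 1000 / CYCLE_NS

# Assumption: a multiply and an add can complete in the same cycle.
mflops = 2 * mips

addressable_words = 2 ** ADDRESS_BITS
addressable_bytes = addressable_words * BYTES_PER_WORD

if __name__ == "__main__":
    print(f"{mips:.0f} MIPS, {mflops:.0f} MFLOPS")
    print(f"{addressable_words} words = {addressable_bytes // 2**20} MiB")
```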
64. 64
Applications of TMS320C31
• Targeted at digital audio, data comm, and industrial
automation
• Consists of a multiplier, barrel shifter, ALU, and a register
file containing eight 40-bit fp registers
• No support for rounding when converting fp to integer
– Lower 8 bits are chopped off
• Shifter can shift up to 32 bits left or right
• All operations performed in a single clock cycle; some in
parallel
65. 65
Why Floating Point?
• Only a little more expensive
• Much more “real estate”
• Easier to program
• FP support tools easier to use
• C compiler is more efficient
– Has a multiplier and accumulator
66. 66
Modified Harvard Arch
• Independent mem banks
• Separate busses for program, data, and direct mem
access (DMA)
– Performs concurrent program fetches, data reads and
writes, and DMA ops
• Allows for 4 levels of pipelining
– While 1 instruction is being executed, the next 3
are being read, decoded, and fetched
– Fewer gates per pipeline stage
– Increased clock rate and performance
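The overlap described above can be sketched with an idealized 4-stage pipeline model. The stage names follow the slide (fetch, decode, read, execute); the model assumes one cycle per stage and no stalls or branches, which real code will not always satisfy.

```python
STAGES = ["fetch", "decode", "read", "execute"]


def occupancy(cycle, n_instr):
    """Which instruction (0-based) sits in each stage at a given 0-based cycle."""
    active = {}
    for depth, name in enumerate(STAGES):
        instr = cycle - depth            # instruction i enters stage d at cycle i + d
        if 0 <= instr < n_instr:
            active[name] = instr
    return active


def pipelined_cycles(n_instr):
    """Ideal total cycles: fill the pipeline once, then retire one per cycle."""
    return len(STAGES) + n_instr - 1 if n_instr > 0 else 0


def unpipelined_cycles(n_instr):
    """Total cycles if each instruction ran all stages before the next began."""
    return len(STAGES) * n_instr


if __name__ == "__main__":
    print(occupancy(3, 10))              # pipeline is full from cycle 3 onward
    print(pipelined_cycles(100), "vs", unpipelined_cycles(100))
```

Once the pipeline is full, one instruction completes every cycle, so the throughput approaches four times that of the unpipelined case for long instruction sequences.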
67. 67
Direct Comparison
Processor          MHz   MIPS   Latency   Power    Price
TMS320C62 TYPE-1   120   960    0.09 us   1.14 W   $25
TMS320C62 TYPE-2   200   1600   0.09 us   1.9 W    $96
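A few derived ratios make the comparison easier to read. The sketch below uses only the numbers in the table; the dictionary layout and the metric names are assumptions made for this example.

```python
# Values copied from the comparison table above.
PROCESSORS = {
    "TMS320C62 TYPE-1": {"mhz": 120, "mips": 960, "watts": 1.14, "price": 25},
    "TMS320C62 TYPE-2": {"mhz": 200, "mips": 1600, "watts": 1.9, "price": 96},
}


def derived_metrics(spec):
    """Instructions per clock cycle, MIPS per watt, and MIPS per dollar."""
    return {
        "instr_per_cycle": spec["mips"] / spec["mhz"],
        "mips_per_watt": spec["mips"] / spec["watts"],
        "mips_per_dollar": spec["mips"] / spec["price"],
    }


if __name__ == "__main__":
    for name, spec in PROCESSORS.items():
        m = derived_metrics(spec)
        print(name, {k: round(v, 1) for k, v in m.items()})
```

Both parts issue the same number of instructions per clock, so the MIPS difference comes entirely from clock rate; the price per MIPS is what differs.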
68. 68
TMS320C25 DSP
History of the TMS320 family
This family currently includes five generations of DSPs: TMS320C1x,
TMS320C2x, TMS320C3x, TMS320C4x, and TMS320C5x.
The TMS320C25 is a CMOS 40-MHz digital signal processor
capable of twice the performance of the TMS320C1x
devices. It can execute 10 million instructions per second.
Enhanced features:
24 additional instructions (133 total)
eight auxiliary registers
an eight-level hardware stack
4K words of on-chip program ROM
low power dissipation inherent to CMOS
71. 71
Memory Organization
TMS320C25 DSP
A total of 544 16-bit words of on-chip data RAM:
288 words are always data memory, and the remaining
256 words may be configured as either
program or data memory.
The TMS320C2x can address a total of 64K
words of data memory.
Program and Data Memory
72. 72
TMS320C25 DSP
Memory Organization (Cntd.)
Three separate address spaces for program memory,
data memory, and I/O.
These spaces are distinguished externally by
means of the PS, DS, and IS signals.
The on-chip program ROM can be mapped
into the lower 4K words of program
memory. This ROM is enabled when
MP/MC is set to a logic low.
73. 73
TMS320C25 DSP
Memory Organization (Auxiliary Registers)
A register file contains eight auxiliary registers (AR0–AR7).
The auxiliary register arithmetic unit (ARAU) is useful
for address manipulation;
it may also serve as an additional
general-purpose arithmetic unit.
74. 74
TMS320C25 DSP
Memory Organization (Memory Addressing Modes)
In the direct addressing
mode, the 9-bit data memory
page pointer (DP) points to
one of 512 pages, each page
consisting of 128 words.
In the indirect addressing mode,
the currently selected 16-bit
auxiliary register AR(ARP)
addresses the data memory through
the auxiliary register file bus (AFB).
When an immediate operand is used, it is contained
either within the instruction word itself or in the word
following the instruction opcode.
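The direct addressing arithmetic can be sketched as follows: 512 pages of 128 words each cover the full 64K-word data space, so a 16-bit address is the 9-bit page pointer joined with a 7-bit offset within the page. The function name `direct_address` and the parameter name `offset` are assumptions; the slide only states the page count and page size.

```python
PAGE_BITS = 9                 # width of the data page pointer (DP)
OFFSET_BITS = 7               # 128 words per page needs 7 bits


def direct_address(dp, offset):
    """Form a 16-bit data address from a page number and an in-page offset."""
    if not 0 <= dp < (1 << PAGE_BITS):
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset must fit in 7 bits")
    return (dp << OFFSET_BITS) | offset


if __name__ == "__main__":
    print(direct_address(4, 0))          # first word of page 4
    print(direct_address(511, 127))      # last word of the last page
    print((1 << PAGE_BITS) * (1 << OFFSET_BITS), "words addressable")
```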
75. 75
TMS320C25 DSP
CALU
A typical ALU instruction:
1) Data is fetched from the RAM on the data bus,
2) Data is passed through the scaling shifter and the ALU
3) The result is moved into the accumulator.
Scaling Shifter
ALU and accumulator
Multiplier; T and P registers
77. 77
TMS320C25 DSP
System Control (pipeline operations)
Registers involved in pipeline operation:
the prefetch counter (PFC),
the 16-bit microcall stack (MCS) register,
the instruction register (IR), and
the queue instruction register (QIR).
Two status registers, ST0
and ST1, contain the status
of various conditions
and modes.
78. 78
TMS320C25 DSP
System Control (Timer Operation+Repeat Counter)
The TMS320C2x provides a memory-mapped
16-bit timer (TIM) register and
a 16-bit period (PRD) register.
The on-chip timer is a down counter that is
continuously clocked by CLKOUT1.
The repeat counter (RPTC) is an 8-bit
counter. It can be loaded with a
number from 0 to 255.
RPTC is cleared by reset.
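The two counters can be modelled in a few lines. This is a rough sketch under stated assumptions, not a description of the actual silicon: it assumes a repeat count of n causes the following instruction to execute n + 1 times, and that the timer reloads from PRD on the cycle after it reaches zero, giving one interrupt every PRD + 1 clock cycles. The function names are hypothetical.

```python
def repeat_executions(rptc):
    """Assumed behaviour: a repeat count of n runs the next instruction n + 1 times."""
    if not 0 <= rptc <= 255:
        raise ValueError("RPTC is an 8-bit counter (0 to 255)")
    return rptc + 1


def timer_interrupt_cycles(prd, total_cycles):
    """Simulate a down counter reloaded from PRD; return the cycles that raise TINT.

    Assumed model: the counter decrements once per clock and reloads from PRD
    on the cycle after it reaches zero.
    """
    tim = prd
    hits = []
    for cycle in range(1, total_cycles + 1):
        if tim == 0:
            hits.append(cycle)           # counter expired: interrupt and reload
            tim = prd
        else:
            tim -= 1
    return hits


if __name__ == "__main__":
    print(repeat_executions(255))
    print(timer_interrupt_cycles(prd=3, total_cycles=12))
```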
79. 79
TMS320C25 DSP
External Memory and IO Interface
A 16-bit parallel data bus (D15–D0),
A 16-bit address bus (A15–A0),
Data, program, and I/O space select (DS, PS, and IS) signals, and
Various system control signals.
Program/data memory configurations:
1) Program Internal RAM/Data Internal (PI/DI)
2) Program Internal RAM/Data External (PI/DE)
3) Program External/Data Internal (PE/DI)
4) Program External/Data External (PE/DE)
5) Program Internal ROM/Data Internal (PR/DI)
6) Program Internal ROM/Data External (PR/DE)
80. 80
TMS320C25 DSP
Interrupts
Three external maskable user interrupts (INT2–INT0).
Internal interrupts are generated by the serial port (RINT and XINT), by
the timer (TINT), and by the software interrupt (TRAP) instruction.
The TMS320C2x has a built-in mechanism for
protecting multicycle instructions
from interrupts.
81. 81
TMS320C25 DSP
Serial Ports
A full-duplex on-chip serial port provides direct communication
with serial devices such as codecs, serial A/D converters, and
other serial systems.
If the serial port is not being
used, the DXR and DRR
registers can be used
as general-purpose registers.
82. 82
TMS320C25 DSP
Direct Memory Access
The flexibility of the TMS320C2x allows configurations to satisfy
a wide range of system requirements:
A standalone system (single processor),
A multiprocessor with devices in parallel,
A host/slave multiprocessor with shared global data memory space
A peripheral processor
In a multiprocessor environment, the SYNC input can be
used to greatly ease interface between processors.
For multiprocessing applications, the TMS320C2x allocates
global data memory space and communicates with that space
via the BR (bus request) and READY control signals.
Editor's Notes
The first bullet, single-cycle multiply-accumulate capability, simply means having multiple (two to four) multipliers that allow several multiply-accumulate operations per cycle.
The TMS320C25 is capable of executing 10 million
instructions per second. Enhanced features such as 24 additional instructions
(133 total), eight auxiliary registers, an eight-level hardware stack, 4K
words of on-chip program ROM, a bit-reversed indexed addressing mode, and
the low power dissipation inherent to the CMOS process contribute to the high
performance.
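The bit-reversed addressing mode mentioned in the notes exists to support FFTs, whose radix-2 forms read or write data in bit-reversed index order. The sketch below computes that order in software; the function name `bit_reverse` is an assumption, and a DSP with this addressing mode generates the same sequence in its address unit without extra instructions.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(n_points):
    """Index order for an n-point radix-2 FFT (n must be a power of two)."""
    if n_points < 1 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    bits = n_points.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_points)]


if __name__ == "__main__":
    print(bit_reversed_order(8))
```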