Memory Based Hardware Efficient Implementation of FIR Filters

International Review on Computers and Software (I.RE.CO.S.), Vol. 8, N. 7
ISSN 1828-6003 July 2013
Manuscript received and revised June 2013, accepted July 2013 Copyright © 2013 Praise Worthy Prize S.r.l. - All rights reserved
K. G. Shanthi, N. Nagarajan
Abstract – Finite impulse response (FIR) digital filters are key components used in many digital
signal processing (DSP) systems because of their linear phase, stability, fewer finite precision
errors and regular structure. The real-time realization of FIR filters with low hardware
requirements and low latency has become critical with increasing developments in very large
scale integration (VLSI) technology. The objective of this paper is to explore the current trends in the
development of algorithms and architectures for memory based realization of FIR filters that are
mainly concerned with reducing the overall area-delay-power complexities. The purpose of this
study is to compare these architectures based on ROM size, delay and throughput. The results
presented here should assist researchers in the field of digital signal processing in selecting the best
architecture for an application based on its requirements. New algorithms and architectures need to
be developed to design area-delay-power-efficient FIR filters for various demanding DSP
applications.
Keywords: Finite Impulse Response Filter, Field Programmable Gate Arrays (FPGA), Application
Specific Integrated Circuit (ASIC), Distributed Arithmetic (DA), Lookup Table (LUT)
Nomenclature
y[n] The FIR Filter Output
N Order of the Filter
Ci Constant coefficients
Xi Input data
B Input Word length
I. Introduction
Digital signal processing (DSP) is playing a vital role
in the significant advancements of digital technology
taking place currently around the world. Digital
communication, speech and image data compression,
speech recognition, spectral estimation and analysis,
adaptive filtering applications, wired and wireless
communication, multimedia systems, biomedical
instrumentation, satellite and aerospace control, remote
sensing are the major areas where DSP has created a
major impact [1].
The increased daily use of digital technology has led
to the development of improved algorithms and
architectures to design the DSP systems with less power
dissipation, higher speed performance and less area
complexity. Several architectural solutions have been
made to minimize the arithmetic complexities of the
algorithms in order to reduce the overall area-delay-power
complexities [2]. The finite impulse response (FIR)
filter is used as a basic tool in many DSP applications.
Digital filters are used to modify signal characteristics
in the time or frequency domain and are used in many DSP
systems to perform signal preconditioning, anti-aliasing,
band selection, interpolation, low-pass filtering, etc. [1].
Traditionally, the design methods were mainly
focused on multiplier-based architectures to implement
the Multiply-and-Accumulate (MAC) blocks that
constitute the central piece in FIR filters and several DSP
functions. These multipliers consume most of the
resources of the system and also involve most of the
computation-time. The number of multiply and
accumulate operations required per filter output increases
with the filter order, which makes real-time
implementation of these filters a challenging task.
A discrete-time linear finite impulse response (FIR)
filter generates the output y[n] as a sum of delayed and
scaled input samples x[n]. An N-tap FIR digital filter is
represented as:
$$y[n] = \sum_{i=0}^{N-1} c[i]\, x[n-i] \qquad (1)$$
where y[n] is the FIR filter output, c[i] represents the
filter coefficients, x[n-i] is the input data and n is the time
index starting from 0. A direct implementation of Eq. (1)
requires N Multiply-and-Accumulate blocks, which is
expensive in terms of area and speed.
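As a point of reference for the architectures reviewed later, Eq. (1) can be evaluated directly in software. The following is a minimal sketch, not taken from the paper; the function name `fir_direct` and the assumption that x[n] = 0 for n < 0 are illustrative choices.

```python
def fir_direct(coeffs, samples):
    """Evaluate y[n] = sum_{i=0}^{N-1} c[i] * x[n-i] for each n.

    Assumption: samples before the start of the sequence are zero.
    """
    n_taps = len(coeffs)
    out = []
    for n in range(len(samples)):
        acc = 0
        for i in range(n_taps):
            # One multiply-accumulate (MAC) operation per tap.
            if n - i >= 0:
                acc += coeffs[i] * samples[n - i]
        out.append(acc)
    return out


y = fir_direct([1, 2, 3], [1, 0, 0, 4])
print(y)
```

Each output sample costs N multiplications here, which is the cost that the multiplier-less approaches below try to avoid.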
To resolve this problem, many multiplier-less
architectures have been proposed in recent years, which are
broadly classified into two basic categories according to
how they manipulate the filter coefficients for the
multiply operation. The first type of multiplier-less
technique is the conversion-based approach and the
second type is the memory-based implementation approach.
Over the past decade, there has been a growing
trend to implement DSP functions in Field
Programmable Gate Arrays (FPGAs) rather than on
application-specific integrated circuits (ASICs) and DSP
chips.
The implementation on ASICs is not preferred due to
high development costs and time-to-market factors.
The sequential-execution architecture of programmable DSP
processors prevents them from achieving the desired
performance. In this context, the FPGA platform provides a
very attractive solution that balances high flexibility,
including the option to reconfigure, with time-to-market, cost and
performance [3].
This paper is organized as follows: In Section 2, a
brief overview of the conversion-based multiplier-less
FIR filters is presented. Section 3 explores the
algorithmic aspects and architectural approach of
memory based FIR filters and an in-depth review of FIR
filters based on DA. Finally the Conclusion is presented
in Section 4.
II. Conversion-Based Multiplier-Less
Implementation of FIR Filters
In this approach the coefficients are transformed to
other numeric representations so that the multiplications
are implemented with adder/subtractors and shifters. A
coefficient in "n-bit" signed-digit representation can be
written as:
$$C = \sum_{i=0}^{n-1} b_i\, 2^{i} \qquad (2)$$
where $b_i$ is taken from the set {-1, 0, 1}.
The representation that has the minimum number of non-zero digits
and no consecutive non-zero digits is known as the
canonic signed-digit (CSD) representation [2]. Since non-zero
digits represent additions (or subtractions) in shift-and-add
multiplication, CSD requires significantly fewer adders
than binary representations.
Multipliers in a filter whose coefficients are
expressed in canonic signed-digit code are realized with
wired shifters, adders and subtractors [4].
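The saving can be illustrated with a short software sketch that converts an integer coefficient to signed digits with no two adjacent non-zero digits. This is a standard conversion written for illustration only; the names `to_csd` and `from_digits` are hypothetical and do not come from the cited works.

```python
def to_csd(value):
    """Return signed digits in {-1, 0, 1}, least-significant digit first,
    with no two adjacent non-zero digits."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1; value mod 4 == 3 gives digit -1.
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def from_digits(digits):
    """Rebuild the integer: sum of b_i * 2^i as in Eq. (2)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


digits = to_csd(7)               # 7 = 8 - 1
print(digits, from_digits(digits))
```

For the coefficient 7, plain binary (111) has three non-zero digits, whereas the signed-digit form 8 - 1 has two, so one fewer adder/subtractor is needed in a shift-and-add multiplier.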
Common subexpression elimination (CSE) is a
numerical transformation of the constant multiplications
that can lead to efficient hardware implementations in
terms of area, power and speed [5]-[8]. Subexpression
elimination can only be performed on constant
multiplications that operate on a common variable. It is
the process of examining the shift and add
implementations of constant multiplications and finding
the redundant operations.
Once the redundancies are found, these operations can
be performed once and can be shared among the constant
multiplications so that number of adders and shifters for
implementation are minimized. Common subexpression
(CSE) techniques attempt to minimize the number of
additions in the multiplier block by reusing terms. These
terms can be canonic signed digit (CSD) [5], minimal
signed digit (MSD), or all signed digit (ASD) [7].
Macleod and Dempster introduced a new CSE algorithm
that searches a bounded number of minimal signed digit (MSD)
representations [8]. Maskell, Leiwo and Patra [9] reduced
both the coefficient word length and the number of non-zero bits
in the filter coefficients so that the adder step is
minimized, which reduces the hardware
complexity of linear-phase FIR digital filters.
III. Algorithms and Architectures
for Memory Based FIR Filters
The memory-based approach involves the use of
memories (RAMs, ROMs) or look-up tables (LUTs)
that store pre-computed values, which can be read out in place of
a multiplication operation. With the advancements in
VLSI technology, semiconductor memory has
become cheaper, faster and more efficient in terms of power
dissipation.
Memory-based FIR filters consequently are gaining
substantial popularity in the DSP environment.
These filters result in high-throughput and reduced-
latency since the memory-access time is usually very
much shorter compared with multiplication time. They
have much less dynamic power consumption due to
minimal switching activities associated in obtaining the
output product/inner product values by memory read
operations. There are two types of memory based FIR
filters. One of the techniques is the direct memory-based
implementation of FIR filters [10], while the other is
based on distributed arithmetic (DA).
III.1. Direct-Memory-Based FIR Filters
In the direct-memory-based implementations [10], the
multiplications of input values with the fixed coefficients
can be replaced by a ROM or look-up-table (LUT) which
contains the pre-computed product values for all possible
values of input samples. Let X be an input word to be
multiplied with a W-bit fixed coefficient C. If X is
assumed to be an unsigned binary number of word-length
N, there are $2^N$ possible values of X, and hence there are
$2^N$ possible values of the product Y = C*X. Therefore, direct
memory-based implementation of the multiplication
requires a memory unit of $2^N$ words to be used as an LUT
consisting of pre-computed product values corresponding
to all possible values of X, as shown in Fig. 1. The
product C*Xi is stored at the memory location whose
address is the same as the binary value of Xi, for $0 \le X_i \le 2^N - 1$,
such that if the N-bit binary value of Xi is used as the address for
the memory unit, then the corresponding product value is
read out from the memory. However, the size of the ROM
increases exponentially with the input word-length.
Fig. 1. Structure of Direct-memory-based multiplier (a ROM with $2^N$ words, addressed by the N-bit input X, outputs the (N+W)-bit product Y = C*X)
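The structure of Fig. 1 can be mimicked in software with a list used as the ROM. This is a minimal sketch for illustration; the names `build_product_lut` and `lut_multiply` are hypothetical, and the input is assumed to be an unsigned N-bit integer as in the text.

```python
def build_product_lut(coeff, n_bits):
    """Pre-compute C * X for every unsigned n_bits-wide input X (2^N words)."""
    return [coeff * x for x in range(1 << n_bits)]


def lut_multiply(lut, x):
    """A 'multiplication' is a single memory read at address X."""
    return lut[x]


lut = build_product_lut(13, 4)    # coefficient C = 13, word-length N = 4
print(len(lut), lut_multiply(lut, 9))
```

Doubling the input word-length from 4 to 8 bits grows the table from 16 to 256 words, which is the exponential growth noted above.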

A direct implementation of equation (1) requires N
number of multiplications where N represents the tap
length. Each of the multipliers which involve the
multiplications of input values with the fixed coefficients
can be replaced by a ROM or LUT, where each of the
LUTs contains the pre-computed product values for all
possible values of input samples.
A systolic system consists of a set of interconnected
cells, each capable of performing some simple operation
[2], [11].
Systolic designs are very efficient for hardware
implementation of computation-intensive DSP
applications because of the features like simplicity,
regularity and modularity of structure.
They also achieve a high throughput rate by using
pipelining or parallel processing or both. The systolic
array for an FIR filter of order N is shown in Fig. 2. It
consists of N processing elements (PEs), where each PE
performs one MAC operation during a cycle period.
Several algorithms and architectures have been suggested
for systolization of FIR filters [12], [13].
Fig. 2. Structure of a linear systolic array for an N-tap FIR filter
The average computation time and the latency of
direct-memory-based implementation are high for large
transform-lengths, and therefore several novel algorithms
have been proposed in the last few years to decompose
the sinusoidal transforms into multiple
circular convolution or convolution-like structures of
smaller convolution-lengths [14]–[18].
These decompositions have resulted in improvement
of throughput performance with substantial reduction of
hardware and computational latency. A concurrent
recursive algorithm is derived for the computation of FIR
filter, and is ported further to a two-dimensional systolic
structure for reduced-latency direct-ROM-based
realization of large order filters [19].
A new approach to LUT design, referred to as the
odd-multiple-storage (OMS) scheme, is presented in [20], where only
the odd multiples of the fixed coefficient need to
be stored, so that the memory size is reduced to half at
the cost of some increase in combinational circuit
complexity. With the antisymmetric product coding
(APC) approach, the LUT size can also be reduced to
half, where the product words are recoded as
antisymmetric pairs [21]. Two new approaches are
suggested for designing the LUT for
LUT-multiplier-based implementation, where the memory size is reduced
to nearly half of that of the conventional approach [22].
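The idea behind odd-multiple storage can be sketched in software: every non-zero input is an odd number times a power of two, so an even multiple of the coefficient is a left-shifted odd multiple. The sketch below is a simplified software analogue written for illustration; the names `build_oms_lut` and `oms_multiply` are hypothetical, and the address-encoding and shifting hardware of the cited design is not modeled.

```python
def build_oms_lut(coeff, n_bits):
    """Store only the odd multiples C*1, C*3, C*5, ... (2^(N-1) words)."""
    return [coeff * (2 * k + 1) for k in range(1 << (n_bits - 1))]


def oms_multiply(lut, x):
    """Compute C * x from the odd-multiple table plus a left shift."""
    if x == 0:
        return 0                  # zero is handled separately in this sketch
    shift = 0
    while (x & 1) == 0:           # strip trailing zeros: x = odd * 2^shift
        x >>= 1
        shift += 1
    return lut[x >> 1] << shift   # odd value 2k+1 is stored at address k


lut = build_oms_lut(13, 4)
print(len(lut), oms_multiply(lut, 12))   # 12 = 3 * 2^2
```

For a 4-bit input the table holds 8 words instead of 16, at the cost of the extra shift logic.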
III.2. FIR Filters Based on Distributed Arithmetic (DA)
The main operations required for DA-based
computation of inner product are a sequence of lookup
table accesses followed by shift-accumulation operations
of the LUT output to obtain the desired result. DA-based
computation is well suited for FPGA realization, because
the LUT as well as the shift-add operations, can be
efficiently mapped to the LUT-based FPGA logic
structures.
DA is a bit-serial operation that implements a series of
fixed-point MAC operations in a fixed number of steps,
regardless of the number of terms to be calculated. DA is
often preferred since it eliminates the need for hardware
multipliers and is capable of implementing large filters
with very high throughput. Croisier et al. proposed
the DA algorithm for digital filter implementations in
1973 [23]. The first detailed discussion of DA was given
by Abraham Peled and Bede Liu in 1974 at the Arden
House Workshop on Digital Signal Processing [24].
S. A. White [25] discussed an organization to form the
inner product of a pair of data vectors, gave a
criterion for minimizing the ROM size, and made
modifications to increase the speed by employing
techniques such as bit pairing or partitioning the input
words into the most significant half and least significant
half, thereby introducing parallelism in the computation.
III.2.1. Conventional DA approach
Consider the inner product of two N-point vectors C
and X given by:
$$y[n] = \sum_{i=0}^{N-1} c_i\, x_i \qquad (3)$$
where $c_i$ represents the constant coefficients and $x_i$ is the
input data, which may change from time to time. Let the
input sample be coded as a B-bit 2's
complement binary number such that $|x_i| < 1$. The input
sample is given by:
$$x_i = -x_{i,0} + \sum_{j=1}^{B-1} x_{i,j}\, 2^{-j} \qquad (4)$$
where $x_{i,j} \in \{0, 1\}$, $x_{i,0}$ is the sign bit and $x_{i,B-1}$ is the least
significant bit (LSB). Substituting (4) in (3), the
output can be expressed as:
$$y[n] = \sum_{i=0}^{N-1} c_i \left[ -x_{i,0} + \sum_{j=1}^{B-1} x_{i,j}\, 2^{-j} \right] \qquad (5)$$
$$y[n] = -\sum_{i=0}^{N-1} c_i\, x_{i,0} + \sum_{j=1}^{B-1} \left[ \sum_{i=0}^{N-1} c_i\, x_{i,j} \right] 2^{-j} \qquad (6)$$
For a given set of $c_i$ (i = 0, 1, 2, …, N − 1), the terms in
the brackets may take one of $2^N$ possible values, which can
be precomputed and stored in an LUT. Each of these $2^N$
combination sums of the $c_i$ can be read out from the ROM using the
N-bit sequence {$x_{i,j}$ for 0 ≤ i ≤ N − 1} as address bits.
These intermediate results are accumulated over B clock
cycles to produce one filter output y[n].

Fig. 3. LUT-based DA implementation of a 4-tap (N =4) FIR filter
The original LUT-based DA implementation of a 4-tap (N
=4) FIR filter consists of three units: the shift register
unit, the DA base unit, and the adder/shifter unit.
The LUT contains all 16 possible combination sums
of the filter weights C0, C1, C2, C3. The bank of shift
registers in Fig. 3 stores four consecutive input
samples (x[n-i], i=0, 1, 2, 3). The concatenation of the
rightmost bits of the shift registers becomes the address
of the LUT. The shift registers are shifted right at every
clock cycle. The corresponding LUT entries are also
shifted and accumulated over B consecutive cycles to
generate the output y[n]. The sign bits {xi0} are the last
bits to arrive. The clock period in which the sign bits all
simultaneously arrive is called the "sign-bit time".
During the sign-bit time the control signal S = 1,
otherwise S = 0.
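The bit-serial operation described above can be checked with a small software model. This is a minimal sketch under stated assumptions, not the paper's hardware: inputs are B-bit two's-complement integers (the fractional samples of Eq. (4) scaled by $2^{B-1}$), so bit plane j is weighted by $2^{j}$ instead of being right-shifted, and the names `build_da_lut` and `da_inner_product` are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT[a] = sum of coeffs[i] over every bit i that is set in address a."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(1 << n)]


def da_inner_product(coeffs, samples, B):
    """Compute sum(c_i * x_i) with one LUT read per bit plane (B reads in total)."""
    lut = build_da_lut(coeffs)
    acc = 0
    for j in range(B):                       # j = 0 is the LSB plane
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> j) & 1) << i      # bit j of sample i forms address bit i
        term = lut[addr] << j
        if j == B - 1:
            acc -= term                      # sign-bit time (S = 1): subtract
        else:
            acc += term
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, -7, 0, -8]                     # 4-bit two's-complement values
print(da_inner_product(coeffs, samples, B=4))
```

The loop runs B times whatever the number of taps, which mirrors the observation that the time complexity of DA depends on the word length rather than on the filter order.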
The time-complexity of FIR filters based on
Distributed Arithmetic is independent of the transform-
size or the number of filter-taps and depends only on the
word-length whereas time-complexity of Direct-memory-
based FIR filters is independent of word-length but
increases linearly with the transform size.
III.2.2. Distributed Arithmetic with Offset Binary Coding
The memory requirement ($2^N$ words) of DA-based
implementation of an FIR filter increases exponentially
with the filter order N. With the use of offset binary
coding (OBC), the memory size can be reduced by half to
$2^{N-1}$ words [2], [25]. The input data are interpreted as
-1 for 0 and +1 for 1 in offset binary coding. Let the
input sample $x_i$ in offset binary coding be represented as:
$$x_i = \frac{1}{2}\left[ x_i - (-x_i) \right] \qquad (7)$$
In 2's-complement notation the negative of Eq. (4) is
written as:
$$-x_i = -\bar{x}_{i,0} + \sum_{j=1}^{B-1} \bar{x}_{i,j}\, 2^{-j} + 2^{-(B-1)} \qquad (8)$$
where the overbar indicates the complement of
a bit. From Eqs. (4) and (8), Eq. (7) can be rewritten
as:
$$x_i = \frac{1}{2}\left[ -\left(x_{i,0} - \bar{x}_{i,0}\right) + \sum_{j=1}^{B-1} \left(x_{i,j} - \bar{x}_{i,j}\right) 2^{-j} - 2^{-(B-1)} \right] \qquad (9)$$
Define $d_{i,j}$:
$$d_{i,j} = \begin{cases} x_{i,j} - \bar{x}_{i,j}, & j \neq 0 \\ -\left(x_{i,0} - \bar{x}_{i,0}\right), & j = 0 \end{cases} \qquad (10)$$
where $d_{i,j} \in \{-1, 1\}$. Eq. (9) can be rewritten as:
$$x_i = \frac{1}{2}\left[ \sum_{j=0}^{B-1} d_{i,j}\, 2^{-j} - 2^{-(B-1)} \right] \qquad (11)$$
Using Eq. (11) in Eq. (3):
$$y[n] = \sum_{i=0}^{N-1} c_i\, \frac{1}{2}\left[ \sum_{j=0}^{B-1} d_{i,j}\, 2^{-j} - 2^{-(B-1)} \right] \qquad (12)$$
$$y[n] = \sum_{j=0}^{B-1} \left[ \sum_{i=0}^{N-1} \frac{1}{2}\, c_i\, d_{i,j} \right] 2^{-j} - \left[ \frac{1}{2} \sum_{i=0}^{N-1} c_i \right] 2^{-(B-1)} \qquad (13)$$
$$y[n] = \sum_{j=0}^{B-1} D_j\, 2^{-j} + D_{initial}\, 2^{-(B-1)} \qquad (14)$$
where
$$D_j = \sum_{i=0}^{N-1} \frac{1}{2}\, c_i\, d_{i,j}, \qquad D_{initial} = -\frac{1}{2} \sum_{i=0}^{N-1} c_i .$$
The OBC scheme is characterized by Eq. (14).
Table I shows the content of the ROM for N=4. From
Table I, notice that the upper-half and the lower-half
ROM values are mirrored with the sign reversed. Therefore
it is possible to reduce the ROM size by a factor of 2, as
shown in Table II. Fig. 4 shows a typical architecture for
DA-OBC based implementation of a 4-tap (N =4) FIR
filter. The XOR gates are used for address decoding; the
MUX with the constant Dinitial provides the initial value
to the shift accumulator. In Fig. 4, two control signals S1
and S2 are required, where S1 is 1 when j = 0 and 0
otherwise, and S2 is 1 when j = B-1 and 0 otherwise.
TABLE I
CONTENT OF THE ROM WITH DA-OBC

| b3 | b2 | b1 | b0 | Contents of ROM |
|----|----|----|----|-----------------|
| 0 | 0 | 0 | 0 | -(C3 + C2 + C1 + C0)/2 |
| 0 | 0 | 0 | 1 | -(C3 + C2 + C1 - C0)/2 |
| 0 | 0 | 1 | 0 | -(C3 + C2 - C1 + C0)/2 |
| 0 | 0 | 1 | 1 | -(C3 + C2 - C1 - C0)/2 |
| 0 | 1 | 0 | 0 | -(C3 - C2 + C1 + C0)/2 |
| 0 | 1 | 0 | 1 | -(C3 - C2 + C1 - C0)/2 |
| 0 | 1 | 1 | 0 | -(C3 - C2 - C1 + C0)/2 |
| 0 | 1 | 1 | 1 | -(C3 - C2 - C1 - C0)/2 |
| 1 | 0 | 0 | 0 | (C3 - C2 - C1 - C0)/2 |
| 1 | 0 | 0 | 1 | (C3 - C2 - C1 + C0)/2 |
| 1 | 0 | 1 | 0 | (C3 - C2 + C1 - C0)/2 |
| 1 | 0 | 1 | 1 | (C3 - C2 + C1 + C0)/2 |
| 1 | 1 | 0 | 0 | (C3 + C2 - C1 - C0)/2 |
| 1 | 1 | 0 | 1 | (C3 + C2 - C1 + C0)/2 |
| 1 | 1 | 1 | 0 | (C3 + C2 + C1 - C0)/2 |
| 1 | 1 | 1 | 1 | (C3 + C2 + C1 + C0)/2 |

TABLE II
REDUCED SIZE ROM ($2^{N-1}$) WITH DA-OBC CODING
FOR 4-TAP (N =4) FIR FILTER

| b2 | b1 | b0 | Contents of ROM |
|----|----|----|-----------------|
| 0 | 0 | 0 | -(C3 + C2 + C1 + C0)/2 |
| 0 | 0 | 1 | -(C3 + C2 + C1 - C0)/2 |
| 0 | 1 | 0 | -(C3 + C2 - C1 + C0)/2 |
| 0 | 1 | 1 | -(C3 + C2 - C1 - C0)/2 |
| 1 | 0 | 0 | -(C3 - C2 + C1 + C0)/2 |
| 1 | 0 | 1 | -(C3 - C2 + C1 - C0)/2 |
| 1 | 1 | 0 | -(C3 - C2 - C1 + C0)/2 |
| 1 | 1 | 1 | -(C3 - C2 - C1 - C0)/2 |
Fig. 4. DA-OBC based implementation of a 4-tap (N =4) FIR filter
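The mirror property of Table I and the half-size ROM of Table II can be checked with a small software model. This is a minimal sketch under stated assumptions: inputs are B-bit two's-complement integers (so bit plane j is weighted by $2^{j}$), the ROM words are stored doubled to stay in integer arithmetic, and all function names are hypothetical rather than taken from the cited works.

```python
def build_obc_rom_doubled(coeffs):
    """Entry a = sum_i (+c_i if bit i of a is 1 else -c_i): twice the Table I value."""
    n = len(coeffs)
    return [sum(c if (a >> i) & 1 else -c for i, c in enumerate(coeffs))
            for a in range(1 << n)]


def obc_lookup(half_rom, addr, n):
    """Read any full-ROM entry from the half-size ROM (Table II)."""
    mask = (1 << (n - 1)) - 1
    if (addr >> (n - 1)) & 1:
        # Top address bit set: complement the other bits (the XOR gates)
        # and reverse the sign of the word that is read.
        return -half_rom[~addr & mask]
    return half_rom[addr & mask]


def da_obc_inner_product(coeffs, samples, B):
    n = len(coeffs)
    half_rom = build_obc_rom_doubled(coeffs)[: 1 << (n - 1)]
    acc = -sum(coeffs)                       # doubled initial term (D_initial)
    for j in range(B):
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> j) & 1) << i
        word = obc_lookup(half_rom, addr, n)
        acc += -(word << j) if j == B - 1 else (word << j)   # sign plane negated
    return acc // 2                          # undo the doubling


print(da_obc_inner_product([3, -1, 4, 2], [5, -7, 0, -8], B=4))
```

Only $2^{N-1}$ words are stored; the sign reversal and address complement stand in for the XOR gates and add/subtract control of Fig. 4.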
III.2.3. Distributed Arithmetic with Modified Offset
Binary Coding (DA-MOBC)
The DA-MOBC can reduce the LUT size to a value ranging from $2^{N-2}$ words
down to as few as 2 words by exploiting the observation that if a single
term inside the LUT is relocated outside the LUT,
then the lower half of the LUT is a mirrored version of the
upper half of the LUT with only the signs reversed [26].
From Table II, it can be observed that the ROM values,
apart from the C3 term, are mirrored about the line between the
4th and the 5th rows. Apart from the C3 term, the LUT in Table II
has only $2^{N-2}$ possible values depending on the input
values. Table III illustrates the new ROM table.
LUT size reduction is achieved with the overhead of
control circuits such as XOR gates, MUX (multiplexers),
and full adders (FA). While the increase in the number of
XOR gates is proportional to the input vector length B,
the complexities of other control circuits (MUX, FA)
increase in proportion to the coefficient word-length as
shown in Fig. 5.
III.2.4. Distributed Arithmetic Based LUT-Less
Architecture Proposed by Yoo and Anderson
A recursive LUT reduction applied to the original DA
decreases the LUT size by half at every iteration, and
eventually a LUT-less DA architecture can be achieved
[27]. From Fig. 3, it can be observed that the lower half
of the LUT (locations whose addresses have a 1 in the MSB)
is the same as the sum of the upper half of the LUT
(locations whose addresses have a 0 in the MSB) and the C3
term.
Thus, the LUT size can be reduced by a factor of 2 with
an additional 2x1 MUX and a full adder. After several
iterations of the LUT reduction, the final LUT-less DA
architecture for a 4-tap FIR filter is achieved, as shown in
Fig. 6.
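The reduction step and its fully recursive limit can be illustrated in software. The following is a minimal sketch written for illustration, with hypothetical function names; it models only the value produced for a given LUT address, not the timing or bit-level structure of the cited architecture.

```python
def full_lut(coeffs):
    """Conventional DA LUT: all 2^N combination sums of the coefficients."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(1 << n)]


def reduced_lookup(coeffs, addr):
    """One reduction step: half-size LUT plus a 2x1 MUX and an adder."""
    n = len(coeffs)
    half = full_lut(coeffs[:-1])                        # addresses with MSB = 0
    top = coeffs[-1] if (addr >> (n - 1)) & 1 else 0    # MUX selects C_{N-1} or 0
    return half[addr & ((1 << (n - 1)) - 1)] + top


def lutless_lookup(coeffs, addr):
    """Reduction applied N times: no LUT, one MUX and one addition per tap."""
    acc = 0
    for i, c in enumerate(coeffs):
        acc += c if (addr >> i) & 1 else 0
    return acc


coeffs = [3, -1, 4, 2]
print(full_lut(coeffs)[0b1010], reduced_lookup(coeffs, 0b1010), lutless_lookup(coeffs, 0b1010))
```

All three functions return the same word for every address, so memory is traded for multiplexers and adders as summarized in Table IV.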
Fig. 5. Block diagram of the LUT-less DA-OBC (DA-MOBC)
for a 4-tap FIR filter
TABLE III
REDUCED SIZE ROM ($2^{N-2}$) WITH DA-MOBC CODING
FOR 4-TAP (N =4) FIR FILTER

| b2 | b1 | b0 | Contents of ROM |
|----|----|----|-----------------|
| 0 | 0 | 0 | -(C2 + C1 + C0)/2 |
| 0 | 0 | 1 | -(C2 + C1 - C0)/2 |
| 0 | 1 | 0 | -(C2 - C1 + C0)/2 |
| 0 | 1 | 1 | -(C2 - C1 - C0)/2 |
Fig. 6. LUT-less Architecture for a 4-tap FIR filter proposed
by Yoo and Anderson
III.2.5. On-Line DA-LUT Architecture for FIR Filters
proposed by Eshtawie, Othman
The tri-state buffer and the carry look-ahead adder
(CLA) are the basic digital logic units used to
construct the on-line DA-LUT architecture [28], as
shown in Fig. 7.
A filter coefficient passes to the CLA only if its
buffer enable signal is 1.
Only the needed location contents are calculated,
whereas in the conventional DA technique the contents of locations
that may never be used when processing the input signal
are also computed.
Fig. 7. LUT-less Architecture for a 4-tap FIR filter
with tri-state buffers and CLA adders

TABLE IV
COMPARISON OF VARIOUS ARCHITECTURES FOR A 4 TAP FILTER (N=4). THE SHIFT REGISTER AND THE ADDER/SHIFTER UNITS ARE NOT
CONSIDERED SINCE THEY ARE COMMON FOR ALL STRUCTURES. BC REPRESENTS THE COEFFICIENT WORD LENGTH.

| Logic Functions | LUT-based DA (conventional DA) | DA-OBC | DA-MOBC | LUT-less Architecture of Yoo & Anderson | On-Line DA-LUT Architecture |
|-----------------|--------------------------------|--------|---------|------------------------------------------|-----------------------------|
| ROM Size | 2^N x BC | 2^(N-1) x BC | (2^(N-2) to 2) x BC | 0 | 0 |
| XOR gates | 0 | N | N-1 | 0 | 0 |
| 2x1 MUX | 0 | BC | BC | N x BC | 0 |
| Adders | 0 | 0 | 0 | (N-1) x BC | N-1 CLAs |
| Tristate Buffer | 0 | 0 | 0 | 0 | N |
| Adder/Sub | 0 | 0 | N x BC | 0 | 0 |
In the DA technique, a location's content is fetched and
added to the partial sum even if it is zero, whereas in the
on-line DA-LUT no addition occurs when the
calculated content is zero. Hence the execution time for
obtaining the filter output is very short.
III.2.6. Memory Partitioning and Multiple Memory
Bank Algorithms
The main drawback of the DA-based FIR filter is that as
the filter size increases, the memory size requirements of
the implementation grow exponentially. Memory access
time can be a bottleneck for the speed of the entire system
when the ROM size is very large. A larger LUT can be
avoided by partitioning the circuit into smaller LUTs
and combining their outputs with adders.
Several Memory-partitioning and multiple memory
bank approaches along with flexible multi-bit data access
mechanisms are presented for FIR filtering and inner-
product computation in order to reduce the memory-size
of DA-based filters [10], [25], [29]-[32].
The N-tap filter is divided into m smaller filters, each
having k input lines, such that N = m × k; it is assumed
that N is not prime. The total number of clock cycles
required for this implementation is B + log2(m); the
additional second term is the number of clock cycles
required by the adder tree that calculates the sum
of the outputs from the m LUTs. The resulting decrease in throughput
is very small relative to the single
large LUT that a high-order filter would otherwise require.
Hence Eq. (6) is rewritten as:
$$y[n] = -\sum_{z=0}^{m-1} \sum_{i=zk}^{(z+1)k-1} c_i\, x_{i,0} + \sum_{j=1}^{B-1} \left[ \sum_{z=0}^{m-1} \sum_{i=zk}^{(z+1)k-1} c_i\, x_{i,j} \right] 2^{-j} \qquad (15)$$
For example, a 32-tap DA FIR filter would require a
large LUT with $2^{32}$ entries. This problem can be
overcome by breaking up the LUT into 8 smaller LUT
units, each having 4 input lines.
Hence a single large LUT with $2^{32}$ memory elements
is replaced by 8 LUTs, each having only $2^4 = 16$ memory
elements.
Fig. 8 shows the implementation of a 4-tap FIR filter
based on Eq. (15) for m=2 and k=2.
Fig. 8. Implementation of a 4-tap FIR filter
using memory partitioning with m=k=2
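The partitioning of Eq. (15) can be modeled in software by giving each group of k taps its own small LUT and summing the bank outputs for every bit plane. This is a minimal sketch under stated assumptions: inputs are B-bit two's-complement integers, N is a multiple of k, the adder tree is modeled as a plain sum, and the function names are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT[a] = sum of coeffs[i] over every bit i that is set in address a."""
    return [sum(c for i, c in enumerate(coeffs) if (a >> i) & 1)
            for a in range(1 << len(coeffs))]


def partitioned_da(coeffs, samples, B, k):
    """DA inner product using m = N/k banks of 2^k words each."""
    banks = [build_da_lut(coeffs[s:s + k]) for s in range(0, len(coeffs), k)]
    acc = 0
    for j in range(B):
        plane = 0
        for z, lut in enumerate(banks):
            addr = 0
            for i in range(k):
                addr |= ((samples[z * k + i] >> j) & 1) << i
            plane += lut[addr]               # adder tree over the m bank outputs
        acc += -(plane << j) if j == B - 1 else (plane << j)
    return acc


def memory_words(n_taps, k):
    """Total LUT words with partitioning: m * 2^k, with m = N / k."""
    return (n_taps // k) * (1 << k)


print(partitioned_da([3, -1, 4, 2], [5, -7, 0, -8], B=4, k=2))
print(memory_words(32, 4), memory_words(4, 2))
```

For the 32-tap example the total storage is 8 × 16 = 128 words instead of $2^{32}$, and for the 4-tap case of Fig. 8 it is 2 × 4 = 8 words instead of 16.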
TABLE VI
COMPARISON OF VARIOUS REQUIREMENTS WITH AND WITHOUT
MEMORY-PARTITIONING

| Memory Variants | No. of Address bits | Memory size | Clock cycles required |
|-----------------|---------------------|-------------|-----------------------|
| Without memory partitioning (Full LUT implementation) | N | 2^N | B |
| With Memory-partitioning (ROM decomposition) | k = N/m | m x 2^(N/m) or m x 2^k | B + log2(m) |
Fig. 9. Comparison of a 4-tap FIR filter (N=4) with and without
memory partitioning with m=k=2 with the input word length B=8
(bar chart of LUT size and clock cycles for the full LUT and the partitioned LUT)
III.2.7. Systolic Architectures for DA-Based
Implementation of FIR Filters
Systolic architectures can result in cost effective, high
performance system by exploiting high-level of
concurrency using pipelining or parallel processing or
both [11]. Novel one- and two-dimensional systolic
structures were designed for computation of circular
convolution using distributed arithmetic (DA) that
resulted in less memory and less area-delay complexity
compared with the other DA-based structures for circular
convolution [33].
One- and two-dimensional fully pipelined computing
structures are presented for area-delay-power-efficient
implementation of FIR filters by systolic decomposition
of distributed arithmetic based inner-product
computation [34].
A linear array consisting of a number of processing
elements (PEs) and an output cell is shown in Fig. 10.
Each PE consists of a ROM of $2^M$ words. During a cycle period, each PE
reads the content of its ROM at the location specified by
the input bit vector. The value read
from the ROM is then added to the input available to the
PE from its left. During every cycle period, the sum is
then transferred as output to its right, as shown in Figs.
11. Each output cell contains a shift register and an
adder. During every cycle period, it shifts the content of its register left by one
position and then adds the available input to the
shifted content of its register.
For high-throughput implementation of FIR filters, a
two-dimensional systolic array is used, as shown in Figs. 12.
FPGA realizations of FIR filters for high-speed and
medium-speed operation using modified distributed arithmetic
architectures were suggested by Jiafeng Xie et al.; these
make use of pipelined registers and a pipelined shift-adder
tree [35].
III.2.8. DA Based Architectures for Adaptive FIR
Filtering
Adaptive filtering DSP algorithms are employed in
several hand-held mobile devices for applications such as
echo cancellation, signal de-noising, and channel
equalization. A new hardware architecture
for very high throughput LMS adaptive filters using
distributed arithmetic (DA) has been suggested;
building adaptive DA filters requires recalculating the
contents of the LUTs for each adaptation.
By using an auxiliary LUT with special addressing,
the efficiency and throughput of DA adaptive filters can
be of the same order as fixed DA filters [36], [37].
A new hardware architecture using conjugate
distributed arithmetic (CDA) for high throughput
hardware implementations of LMS adaptive filters is
presented where all possible combination sums of the
input signal samples are stored in the LUT and updated at
the arrival of every sample using an efficient update
procedure [36], [38].
Fig. 10. Linear 1-D systolic array for DA-based implementation
of FIR filter
Figs. 11. (a) Function of PE, (b) Function of output cell
of 1-D systolic array
Figs. 12. (a) 2-D systolic array for FIR filter; (b) function of PE; and (c) function of Shift Adder (SA) cell

IV. Conclusion
Recent significant research concerned
with reducing the overall area-delay-power complexities
of memory-based realization of FIR filters is presented
in this paper. A detailed survey of memory-based
implementation of FIR filters using distributed
arithmetic is also presented, stating its merits over direct
memory-based implementation of FIR filters.
The main goal behind this review is to assist the
researchers in the field of Digital signal processing to
understand the available methods and adopt the same in
various application environments.
Many algorithms and architectures have been
suggested in the literature to reduce the area and time-
complexities of memory-based implementation of FIR
filters but many more efficient algorithms and
architectures need to be developed to design flexible
area-delay-power efficient memory based FIR filters to
meet the growing requirements of DSP applications.
References
[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing:
Principles, Algorithms and Applications., NJ: Prentice-Hall, 1996.
[2] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation. New York: Wiley, 1999.
[3] G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays
(FPGAs) for Application-Specific Digital Signal Processing
Performance”, XILINX, 1995.
[4] M. Yamada, and A. Nishihara, “High-Speed FIR Digital Filter
with CSD Coefficients Implemented on FPGA”, in Proc. IEEE
Design Automation Conference, 2001, pp. 7-8.
[5] R. I. Hartley, “Subexpression sharing in filters using canonic
signed-digit multipliers,” IEEE Trans. Circuits Syst. II, vol. 43,
no. 10, pp. 677–688, Oct. 1996.
[6] M. Potkonjak, M. B. Srivastava, and A. Chandrakasan, “Multiple
constant multiplications: Efficient and versatile framework and
algorithms for exploring common subexpression elimination,”
IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol.
15, no. 2, pp. 151–165, Feb. 1996.
[7] A. G. Dempster and M. D. Macleod, “Generation of signed-digit
representations for integer multiplication,” IEEE Signal Process.
Lett., vol.11, no. 8, pp. 663–665, Aug. 2004.
[8] M. D. Macleod and A. G. Dempster, “Multiplierless FIR filter
design algorithms,” IEEE Signal Processing Letters, vol. 12, no.
3, pp. 186–189,Mar. 2005.
[9] Douglas L. Maskell, Jussipekka Leiwo and Jagdish C. Patra, “The
Design of Multiplierless FIR Filters with a Minimum Adder Step
and Reduced Hardware Complexity,” in Proc. 2006 IEEE
International Symposium on Circuits and Systems, p. 4, May
2006.
[10] H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the design automation
of the memory-based VLSI architectures for FIR filters,” IEEE
Trans. Consumer. Electronics, vol. 39, no. 3, pp. 619–629, Aug.
1993.
[11] H. T. Kung, “Why systolic architectures?,” IEEE Computer, vol.
15,no. 1, pp. 37–45, Jan. 1982.
[12] R.Wyrzykowski and S. Ovramenko, “Flexible systolic
architecture for VLSI FIR filters,” Proc. Inst. Elect. Eng.—
Comput. Digit. Techniques,vol. 139, no. 2, pp. 170–172, Mar.
1992.
[13] B. K. Mohanty and P. K. Meher, “Cost-effective novel flexible
celllevel systolic architecture for high throughput implementation
of 2-D FIR filters,” Proc. Inst. Elect. Eng.—Comput. Digit.
Techniques, vol.143, no. 5, pp. 436–439, Nov. 1996.
[14] D. F. Chiper, “A new systolic array algorithm for memory-based
VLSI array implementation of DCT,” in Proc. Second IEEE
Symp. on Computers and Communications, pp. 297–301,July
1997.
[15] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis,
“Systolic algorithms and a memory-based design approach for a
unified architecture for the computation of
DCT/DST/IDCT/IDST,”IEEE Trans. Circuits Syst-I: Regular
Papers, vol. 52, no. 6, pp. 1125–1137, June 2005.
[16] C. Cheng and K. K. Parhi, “A novel systolic array structure for
DCT,”IEEE Trans. Circuits Syst-II: Express Briefs, vol. 52, no. 7,
pp. 366–369,July 2005.
[17] P. K. Meher, J. C. Patra, and M. N. S. Swamy, “New systolic
algorithm and array architecture for prime-length discrete sine
transform,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54,
no. 3, pp. 262–266,Mar. 2007.
[18] P. K. Meher and M. N. S. Swamy, “High-throughput memory-based
architecture for DHT using a new convolutional
formulation,” IEEE Trans. Circuits Syst. II: Express Briefs, vol.
54, no. 7, pp. 606–610, July 2007.
[19] P. K. Meher, “Low-latency hardware-efficient memory-based
design for large-order FIR digital filters,” Sixth International
Conference on Information, Communications and Signal
Processing (ICICS 2007), Dec. 2007.
[20] P. K. Meher, “New approach to LUT implementation and
accumulation for memory-based multiplication,” in Proc. 2009
IEEE Int. Symp. Circuits Syst., ISCAS’09, May 2009,
pp. 453–456.
[21] P. K. Meher, “New look-up-table optimizations for
memory-based multiplication,” in Proc. Int. Symp. Integr. Circuits
(ISIC’09), Dec. 2009.
[22] P. K. Meher, “New approach to lookup table design and memory
based realization of FIR digital filter,” IEEE Transactions on
Circuits and Systems-I, vol. 57, no. 3, Mar. 2010.
[23] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, “Digital
filter for PCM encoded signals,” U.S. Patent 3 777 130, Dec. 4,
1973.
[24] A. Peled and B. Liu, “A new hardware realization of digital
filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. 22,
no. 6, pp. 456–462, Dec. 1974.
[25] S. A. White, “Applications of the distributed arithmetic to digital
signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6,
no. 3, pp. 5–19, Jul. 1989.
[26] P. Choi, S.-C. Shin, and J.-G. Chung, “Efficient ROM size
reduction for distributed arithmetic,” in Proc. IEEE Int. Symp.
Circuits Syst. (ISCAS), May 2000, vol. 2, pp. 61–64.
[27] H. Yoo and D. V. Anderson, “Hardware-efficient distributed
arithmetic architecture for high-order digital filters,” in Proc.
IEEE Int. Conf. on Acoustics, Speech, Signal Processing
(ICASSP), Mar. 2005, vol. 5, pp. v/125–v/128.
[28] Mohamed A. Eshtawie and Masuri Othman, “On-Line DA-LUT
Architecture for High-Speed High-Order Digital FIR Filters,” in
Tenth IEEE International Conference on Communication
Systems, Nov. 2006, Singapore.
[29] C.-F. Chen, “Implementing FIR filters with distributed
arithmetic,” IEEE Trans. Acoust., Speech, Signal Process., vol.
33, no. 5, pp. 1318–1321, Oct. 1985.
[30] K. Nourji and N. Demassieux, “Optimal VLSI architecture for
distributed arithmetic-based algorithms,” in IEEE International
Conference on Acoustics, Speech, and Signal Processing, vol. 2,
Apr. 1994, pp. II/509–II/512.
[31] S.-S. Jeng, H.-C. Lin, and S.-M. Chang, “FPGA implementation
of FIR filter using M-bit parallel distributed arithmetic,” in
Proc. 2006 IEEE Int. Symp. Circuits Systems (ISCAS), May 2006,
p. 4.
[32] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, “Area-delay
trade-off in distributed arithmetic based implementation of FIR
filters,” in Proc. 10th Int. Conf. VLSI Design, Jan. 1997,
pp. 124–129.
[33] P. K. Meher, “Hardware-efficient systolization of DA-based
calculation of finite digital convolution,” IEEE Trans. Circuits
Syst. II, Exp. Briefs, vol. 53, no. 8, pp. 707–711, Aug. 2006.
[34] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA
realization of FIR filters by efficient and flexible systolization
using distributed arithmetic,” IEEE Trans. Signal Process., vol. 56,
no. 7, pp. 3009–3017, July 2008.

[35] Jiafeng Xie, Jianjun He, and Guanzheng Tan, “FPGA realization of
FIR filters for high-speed and medium-speed by using modified
distributed arithmetic architectures,” Microelectronics Journal, vol. 41,
Apr. 2010, pp. 365–370.
[36] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle
River, NJ, 2002.
[37] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V.
Anderson, “LMS adaptive filters using distributed arithmetic for
high throughput,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
52, no. 7, pp. 1327–1337, July 2005.
[38] Walter Huang, Venkatesh Krishnan, and David V. Anderson,
“Conjugate Distributed Arithmetic Adaptive FIR Filters and their
Hardware Implementation,” MWSCAS '06, Circuits
and Systems, vol. 2, pp. 295–299, 2006.
Authors’ information
K. G. Shanthi (Corresponding author)
completed her B.E. in 1996 at Madras
University, Chennai, and obtained her M.E. in
2005 from the Government College of
Technology, Coimbatore, where her major
was VLSI Design. Her fields of interest include
the design of FPGA-based VLSI architectures and VLSI
signal processing. She is currently working as
Associate Professor at R.M.K Engineering College, Chennai, and is
pursuing her research in the field of VLSI Design.
Address: Associate Professor, Department of Electronics &
Communication Engg, R.M.K Engineering College, Chennai,
Tamil Nadu, India. Pin code: 601 206.
E-mail: kgs.ece@rmkec.ac.in
N. Nagarajan received his B.Tech and M.E. degrees in Electronics
Engineering from M.I.T, Chennai. He received his PhD in the Faculty of I.C.E.
from Anna University, Chennai. He is currently working as Principal of
C.I.E.T, Coimbatore. His specializations include optical, wireless
ad hoc, and sensor networks.
