An Efficient DSP Based Implementation of a Fast Convolution Approach with non Uniform Partitioning

Fast Convolution
Proposed Algorithm
Efficient DSP Implementation
Results
Conclusion
Andrea Primavera¹, Stefania Cecchi¹, Laura Romoli¹, Francesco Piazza¹ and Marco Moschetti²
¹ A3lab - DII - Università Politecnica delle Marche - Ancona - ITALY
² Korg Italy - Osimo (AN) - ITALY
5th European DSP in Education and Research Conference, 13th and 14th September, 2012, Amsterdam, Netherlands.

1 Fast Convolution
   Introduction
   State of the art
2 Proposed Algorithm
3 Efficient DSP Implementation
   Target
   UPOLS implementation
   Memory management
   FFT/IFFT operations
   Psychoacoustic expedients
   Final remarks
4 Results
   Case study: artificial reverberator
   UPOLS performance
   NUPOLS performance
5 Conclusion
   Conclusion
   Questions

Problem
FIR filtering is probably one of the most recurrent operations in DSP. It is an expensive task, especially for long impulse responses (IRs) and low I/O latency. The goal is low latency convolution combined with computational cost minimization.

State of the Art
In the last 30 years, fast convolution algorithms have been deeply investigated:
• OverLap and Save (OLS), OverLap and Add (OLA).
• Uniform Partitioned OverLap and Save (UPOLS).
• Non Uniform Partitioned OverLap and Save (NUPOLS).

Proposed Solution
We propose an efficient DSP-based real-time implementation of a fast convolution approach with non uniform partitioning (NUPOLS), taking into account:
• The OMAP L137 target platform.
• Efficient partitioning.
• Usage of smart DSP expedients.
• Psychoacoustic improvement.

Time Domain Convolution
Assuming a linear time-invariant system, the linear convolution between the input signal x and the system impulse response h is defined as follows:
$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(t - \tau)\, h(\tau)\, d\tau \tag{1}$$
For discrete-time signals and an impulse response of finite length N, this becomes:
$$y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n - m]\, h[m] \tag{2}$$
The convolution is performed directly using equation (2).
LATENCY: Theoretically zero.
COMPUTATIONAL COST: N − 1 additions and N multiplications per output sample.
CONSIDERATIONS: Too expensive for long IRs.
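As an illustration of the direct form of equation (2), the following minimal Python/NumPy sketch evaluates the sum sample by sample. The function name `direct_convolution` and the random test signals are assumptions made for this example; they are not part of the original implementation, which runs on the DSP.

```python
import numpy as np

def direct_convolution(x, h):
    """Direct-form linear convolution: y[n] = sum_m h[m] * x[n - m]."""
    n_out = len(x) + len(h) - 1
    y = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(h)):
            # Samples of x outside [0, len(x)) are treated as zero.
            if 0 <= n - m < len(x):
                y[n] += h[m] * x[n - m]
    return y

# Hypothetical test data: a short input and a short impulse response.
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
h = rng.standard_normal(8)
y = direct_convolution(x, h)
```

Each output sample costs up to N multiplications, which is why this form becomes impractical for impulse responses lasting several seconds.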

Frequency Domain Convolution
Considering the circular convolution and the DFT property:
$$y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n - m)_N]\, h[m] \tag{3}$$
$$x[n] \circledast_N h[n] \leftrightarrow X[k]\, H[k] \tag{4}$$
it follows that the convolution can be computed in the frequency domain.

OverLap and Save (OLS)
OLS makes it possible to convert a circular convolution into a linear convolution.
LATENCY: Equal to K samples, with K > N.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{L}{K}$ complex multiplications (with K a power of 2 and L = 2K for 50% overlap).
CONSIDERATIONS: I/O latency is too high for long IRs.
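The overlap-and-save idea can be sketched offline in Python/NumPy as follows. This is only an illustrative sketch: the function name `overlap_save`, the zero initial state and the test signals are assumptions, and NumPy's real FFT stands in for the DSP library routines.

```python
import numpy as np

def overlap_save(x, h, K):
    """Overlap-save with hop size K and FFT size L = 2K (50% overlap).

    Requires len(h) <= K + 1 so that the last K samples of each
    circular convolution are free of time-domain aliasing.
    """
    L = 2 * K
    assert len(h) <= K + 1
    H = np.fft.rfft(h, L)                    # zero-padded filter spectrum
    n_blocks = -(-len(x) // K)               # ceil(len(x) / K)
    # Prepend K zeros (initial "previous frame") and pad to a whole block.
    x_pad = np.concatenate([np.zeros(K), x, np.zeros(n_blocks * K - len(x))])
    y = np.empty(n_blocks * K)
    for b in range(n_blocks):
        frame = x_pad[b * K : b * K + L]     # previous K samples + new K samples
        y_circ = np.fft.irfft(np.fft.rfft(frame) * H, L)
        y[b * K : (b + 1) * K] = y_circ[K:]  # keep only the last K samples
    return y[: len(x)]

# Hypothetical test data.
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
h = rng.standard_normal(48)
y_ols = overlap_save(x, h, K=64)
```

The first K samples of every inverse transform are discarded because they contain the wrap-around of the circular convolution.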

Uniform Partitioned OverLap and Save (UPOLS)
The IR is partitioned into sections of equal size, and an OLS is then applied to each sub-filter.
LATENCY: Equal to K samples, with K arbitrarily chosen.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{LP}{K}$ complex multiplications and $\frac{L(P-1)}{K}$ additions (with K a power of 2, P the number of partitions and L = 2K for 50% overlap).
CONSIDERATIONS: Computational cost higher than OLS.

Non Uniform Partitioned OverLap and Save (NUPOLS)
The IR is partitioned into sections of increasing size, reducing the computational cost with respect to the UPOLS algorithm.
LATENCY: Theoretically zero.
COMPUTATIONAL COST: It depends on the adopted partitioning.
CONSIDERATIONS: It is difficult to find the optimal partitioning.

An efficient DSP-based implementation of a low latency fast convolution is proposed considering the NUPOLS algorithm.
[Figure: Block diagram of the non uniform partitioned overlap and save algorithm. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]

[Figure: Block diagram of the proposed approach. x(t): input signal.]
• First UPOLS: characterized by a small block size (i.e., 64 samples), which sets the desired input/output latency.
• Second UPOLS: uses a larger frame size to minimize the computational cost required to perform the convolution operation.
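Splitting the impulse response between two convolvers relies on the linearity of convolution: the output is the sum of the head contribution and the tail contribution delayed by the head length. The Python/NumPy sketch below checks this identity offline. The split point `T`, the function name `split_convolution` and the test signals are hypothetical; `np.convolve` stands in for each of the two UPOLS branches, and real-time scheduling is not modelled.

```python
import numpy as np

def split_convolution(x, h, T):
    """Convolve x with h by splitting h into a head (first T taps) and a tail.

    In the two-UPOLS scheme each branch would be a partitioned convolver with
    its own block size; here np.convolve stands in for both branches.
    """
    assert 0 < T < len(h)
    head, tail = h[:T], h[T:]
    y = np.zeros(len(x) + len(h) - 1)
    y_head = np.convolve(x, head)
    y_tail = np.convolve(x, tail)
    y[: len(y_head)] += y_head
    y[T : T + len(y_tail)] += y_tail   # tail output is delayed by T samples
    return y

# Hypothetical test data and split point.
rng = np.random.default_rng(2)
x = rng.standard_normal(500)
h = rng.standard_normal(300)
y_split = split_convolution(x, h, T=128)
```

Because the tail contribution is not needed until T samples after the corresponding input, the branch that handles it can work on larger blocks without increasing the overall input/output latency.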


The real time implementation of the proposed approach has been carried out on the Texas Instruments OMAP L137 evaluation board.
Hardware features
Dual-Core System-On-Chip
300 MHz ARM926EJ-S RISC MPU
300 MHz C674x VLIW Floating Point DSP
128 KByte RAM Shared Memory
64 MByte SDRAM
Enhanced Direct-Memory-Access Controller 3 (EDMA3)
2 I/O audio channels
32 KByte L1P Program RAM/Cache (DSP side)
32 KByte L1D Data RAM/Cache (DSP side)
256 KByte L2 Unified Mapped RAM/Cache (DSP side)
• Design constraints: sample frequency 48 kHz, latency 64 samples, stereo implementation, floating point implementation.
• ARM: used to manage the control parameters.
• DSP: used to perform the DSP operations, exploiting its own libraries (e.g., DSPLib) and the DMA engine.

The UPOLS algorithm implementation can be summarized considering
three main phases:
• Impulse response partitioning
• Input signal partitioning
• Filtering
[Figure: UPOLS block diagram. The impulse response h(t) of length N is split into blocks of K samples. The input blocks x0, x1, x2, ..., xn of K samples pass through an L-point FFT, are multiplied by the filters H1, H2, H3, ... and summed. The last K points of each L-point IFFT form the output blocks y0, y1, y2, ..., yn.]

Impulse response partitioning:
- The impulse response h is partitioned into P blocks hn of length K.
- The filter set Hn is obtained by computing an L-point FFT of each block hn (with L = 2K, overlap 50%).
- The set of P filters is then stored in a delay line held in the external memory.
- The operation is performed offline using a Matlab script.

Input signal partitioning:
- The input signal x is partitioned into blocks of length K.
- The frequency domain block Xn is obtained by applying an L-point FFT to the input vector composed of the new frame xn and the previous frame xn−1 (overlap 50%).
- This vector Xn is stored in a delay line held in the external memory together with the P − 1 previous blocks.

Filtering:
- The output block Yn is obtained through filtering operations:
$$Y_n = \sum_{i=0}^{P-1} X_{n-P+1+i}\, H_{P-1-i} \tag{5}$$
- The time-domain output signal yn is composed of the last K samples of the L-point IFFT of Yn.
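The three UPOLS phases (impulse response partitioning, input signal partitioning and filtering with equation (5)) can be sketched offline in Python/NumPy. This is a sketch under stated assumptions, not the DSP code: the function name `upols`, the in-memory array `fdl` used as frequency-domain delay line, the zero initial state and the test signals are all introduced for illustration.

```python
import numpy as np

def upols(x, h, K):
    """Uniformly partitioned overlap-save with block size K and L = 2K."""
    L = 2 * K
    P = -(-len(h) // K)                        # number of partitions
    h_pad = np.concatenate([h, np.zeros(P * K - len(h))])
    # Impulse response partitioning: one L-point FFT per block of K taps.
    H = np.array([np.fft.rfft(h_pad[p * K : (p + 1) * K], L) for p in range(P)])
    n_blocks = -(-len(x) // K)
    x_pad = np.concatenate([np.zeros(K), x, np.zeros(n_blocks * K - len(x))])
    # Frequency-domain delay line: fdl[p] holds X_{n-p}, K + 1 bins each.
    fdl = np.zeros((P, K + 1), dtype=complex)
    y = np.empty(n_blocks * K)
    for n in range(n_blocks):
        # Input signal partitioning: FFT of previous frame + new frame.
        fdl = np.roll(fdl, 1, axis=0)
        fdl[0] = np.fft.rfft(x_pad[n * K : n * K + L])
        # Filtering: Y_n = sum_p X_{n-p} H_p, equivalent to equation (5).
        Y = np.sum(fdl * H, axis=0)
        y[n * K : (n + 1) * K] = np.fft.irfft(Y, L)[K:]   # last K samples
    return y[: len(x)]

# Hypothetical test data: impulse response longer than one block.
rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
h = rng.standard_normal(300)
y_upols = upols(x, h, K=64)
```

Only one forward and one inverse FFT are needed per block; the cost that grows with the impulse response length is the P spectral multiply-accumulate operations.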

Complex multiplications and accesses to external memory data are the main bottlenecks in fast convolution implementations.
HOW TO SOLVE THESE PROBLEMS?
Adopted Solution
• The NUPOLS algorithm allows one to minimize both the number of complex multiplications and the memory accesses compared to the UPOLS approach.
• The DMA engine allows one to parallelize transfers from/into external memory and processing operations.

Transfers from/into external memory (executed by the DMA engine) are parallelized with the processing operations.
[Figure: Kernel used for UPOLS algorithms.
(i) Basic approach: Read Hn (blocking), Read Xn (blocking), Compute Yn.
(ii) Improved approach: Read Hn (blocking), Read Xn+1 (non blocking), Compute Yn, Read Xn (blocking).]

FFT Optimization
The workload required for FFT/IFFT computation can be reduced by taking advantage of the stereo implementation and of the real nature of the audio signal.
• Two L-point FFTs/IFFTs of real sequences can be calculated through one FFT/IFFT of a complex sequence.
• The symmetry property of the FFT has been exploited. This decreases the number of accesses to the external memory and the number of frequency multiplications from L to (K + 1) for each of the P processed frequency blocks.
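The first point can be illustrated with a short Python/NumPy sketch: two real frames (for instance the left and right channels) are packed into one complex sequence, and their spectra are separated afterwards using the conjugate symmetry of real-signal FFTs. The function name `two_real_ffts` and the test frames are assumptions made for this example.

```python
import numpy as np

def two_real_ffts(a, b):
    """Spectra of two real sequences from a single complex FFT.

    Returns the first L/2 + 1 bins of each spectrum, which are sufficient
    for real signals because of conjugate symmetry.
    """
    L = len(a)
    Z = np.fft.fft(a + 1j * b)                 # one complex L-point FFT
    Z_rev = np.conj(np.roll(Z[::-1], 1))       # conj(Z[(L - k) mod L])
    A = 0.5 * (Z + Z_rev)                      # spectrum of a
    B = -0.5j * (Z - Z_rev)                    # spectrum of b
    return A[: L // 2 + 1], B[: L // 2 + 1]

# Hypothetical stereo frames of L = 128 samples (K = 64).
rng = np.random.default_rng(4)
left = rng.standard_normal(128)
right = rng.standard_normal(128)
spec_left, spec_right = two_real_ffts(left, right)
```

With L = 2K, each returned spectrum holds K + 1 bins, which matches the reduced number of frequency multiplications per block mentioned above.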

Psychoacoustic Optimization
Psychoacoustics allows one to reduce the number of complex multiplications and memory accesses.
All the components (frequency bins) above a certain cut-off frequency fc (e.g., 18 kHz) are discarded.
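As a rough worked example, the Python snippet below counts how many bins survive the cut-off for the small UPOLS, using the design constraints stated earlier (48 kHz sample frequency, K = 64, L = 2K) and the example value fc = 18 kHz. Keeping the bin that falls exactly on fc is an assumption of this sketch; the variable names are illustrative.

```python
fs = 48_000          # sample frequency [Hz]
K = 64               # block size of the small UPOLS
L = 2 * K            # FFT size
fc = 18_000          # example cut-off frequency [Hz]

bins_full = K + 1                    # non-redundant bins of a real L-point FFT
bin_spacing = fs / L                 # frequency resolution [Hz]
bins_kept = fc * L // fs + 1         # bins 0 .. floor(fc / bin_spacing)
saving = 1 - bins_kept / bins_full   # fraction of multiplications avoided
```

Under these assumptions 49 of the 65 bins are kept, so roughly a quarter of the complex multiplications and of the related memory transfers per frequency block are avoided.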

HOW TO PARALLELIZE THE 2 UPOLS?
In a low latency context, a multithreaded approach does not guarantee high performance on the DSP board.
Adopted Solution
The code has been manually partitioned in order to uniformly distribute the FFT/IFFT operations and the complex multiplications of both UPOLSs throughout the processing.

The manual partitioning aims to uniformly distribute the FFT/IFFT operations and the complex multiplications related to the larger UPOLS over the K2/K1 iterations necessary to respect the processing constraint.

Iteration  Operation            Iteration  Operation
1          Large FFT 3/3        17         MAC Left Channel
2          MUL Left Channel     18         MAC Left Channel
3          MUL Right Channel    19         MAC Right Channel
4          Large IFFT 1/3       20         MAC Right Channel
7          MAC Left Channel     23         MAC Right Channel
15         MAC Left Channel     31         Large FFT 1/3
16         MAC Left Channel     32         Large FFT 2/3

Distribution of the UPOLS operations in a NUPOLS implementation with K1 = 64 and K2 = 2048.
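For the partitioning in the table and the 48 kHz sample frequency given in the design constraints, the number of iterations and the corresponding time budgets work out as follows (a plain arithmetic check, with f_s denoting the sample frequency):

```latex
\frac{K_2}{K_1} = \frac{2048}{64} = 32 \ \text{iterations}, \qquad
\frac{K_1}{f_s} = \frac{64}{48000\ \mathrm{Hz}} \approx 1.33\ \mathrm{ms}, \qquad
\frac{K_2}{f_s} = \frac{2048}{48000\ \mathrm{Hz}} \approx 42.7\ \mathrm{ms}
```

Each small-block iteration therefore has about 1.33 ms of processing time available, and the work of one large block is spread over 32 such iterations.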

Case Study: Artificial Reverberator
Fast convolution can be employed in many different real time audio applications.
Digital artificial reverberation is the application that best exposes the limits of real time FIR filtering.
• Convolutions with long IRs can be performed to simulate large environments.
• Low input/output latencies are required in musical instruments.

Tests
Several tests have been carried out to evaluate the effectiveness of the proposed approach by comparing the workload required by the UPOLS and NUPOLS implementations.

UPOLS PERFORMANCE
[Figure: Workload (0 to 100) as a function of the impulse response length [s] (0.1 to 0.5), curves (a) and (b).]
Workload of the Uniform Partitioned Overlap and Save algorithm (K = 64). (a) Classic implementation. (b) Psychoacoustic approach.
Considerations
• The maximum impulse response length that still guarantees real time performance is about 0.55 s.
• The approach is not suitable for the simulation of large reverberating environments in musical instruments.

NUPOLS PERFORMANCE
[Figure: Four panels (i) to (iv), each plotting workload (0 to 100) over a horizontal axis from 0 to 5, with curves (a), (b), (c) and (d); panel (iv) is annotated with K2 = 512, K2 = 1024 and K2 = 2048.]
Workload of the NUPOLS algorithm with 4 different partitionings ((i) K1 = 64, K2 = 2048; (ii) K1 = 64, K2 = 1024; (iii) K1 = 64, K2 = 512; and (iv) optimal partitioning). Mean (a) and max (b) workload for the classic implementation. Mean (c) and max (d) workload using the psychoacoustic approach.

[Figure: Workload (0 to 50) as a function of the processing iteration (5 to 30), curves (a), (b) and (c).]
NUPOLS workload as a function of the processing cycle (IR length = 3.164 s). (a) Workload NUPOLS, (b) workload small UPOLS (K1 = 64), (c) workload large UPOLS (K2 = 2048).

Partitioning           Internal Memory Usage
K1 = 64, K2 = 2048     100 kB
K1 = 64, K2 = 1024     50 kB
K1 = 64, K2 = 512      30 kB

Considerations
• Evident improvement in terms of performance with respect to the uniform partitioning based approach.
• It is possible to perform a stereo convolution with an impulse response of length 6 s using about 50% of the DSP resources.

In conclusion:
• A novel approach for fast convolution computation has been proposed, based on non uniform partitioning of the impulse response.
• Two UPOLSs with uniform partitioning are introduced, considering two different frame sizes: the desired input/output latency is obtained through the UPOLS with the smaller frame size, while the other UPOLS is exploited to decrease the number of memory accesses and complex multiplications.
• A DSP-based real time implementation has been developed, and several experiments have been carried out considering digital reverberation as a particular case study.

QUESTIONS?
