"Finite impulse response convolution is one of the most widely used operations in digital signal processing for filtering. Computationally inexpensive techniques are therefore essential for calculating convolutions with low input/output latency in real scenarios, since the real-time requirements are strictly related to the impulse response length. In this context, an efficient DSP implementation of a fast convolution approach is presented, with the aim of lowering the workload required in applications such as reverberation. It is based on a non uniform partitioning of the impulse response and on a psychoacoustic technique derived from the sensitivity of the human ear. Several results are reported to prove the effectiveness of the proposed approach, including comparisons with existing state-of-the-art techniques."
An Efficient DSP Based Implementation of a Fast Convolution Approach with non Uniform Partitioning
1. Fast Convolution
Proposed Algorithm
Efficient DSP Implementation
Results
Conclusion
An Efficient DSP-Based Implementation of a Fast Convolution Approach with non Uniform Partitioning
Andrea Primavera¹, Stefania Cecchi¹, Laura Romoli¹, Francesco Piazza¹ and Marco Moschetti²
¹ A3lab - DII - Università Politecnica delle Marche - Ancona - ITALY
² Korg Italy - Osimo (AN) - ITALY
5th European DSP in Education and Research Conference, 13th and 14th September, 2012, Amsterdam, Netherlands.
Andrea Primavera An Efficient DSP-Based Implementation of a Fast Convolution Approach with non Uniform Partitioning 1/28
1 Fast Convolution
Introduction
State of the art
2 Proposed Algorithm
3 Efficient DSP Implementation
Target
UPOLS implementation
Memory management
FFT/IFFT operations
Psychoacoustic expedients
Final remarks
4 Results
Case study: artificial reverberator
UPOLS performance
NUPOLS performance
5 Conclusion
Conclusion
Questions
FIR filtering is probably one of the most recurrent operations in DSP. It is an expensive task, especially for long impulse responses (IRs) and low I/O latency.
Problem
Low-latency convolution with computational cost minimization.
State of the Art
In the last 30 years, fast convolution algorithms have been deeply investigated:
• OverLap and Save (OLS), OverLap and Add (OLA).
• Uniform Partitioned OverLap and Save (UPOLS).
• Non Uniform Partitioned OverLap and Save (NUPOLS).
Proposed Solution
We propose an efficient DSP-based real-time implementation of a fast convolution approach with non uniform partitioning (NUPOLS), taking into account:
• The OMAP L137 target platform.
• Efficient partitioning.
• Usage of smart DSP expedients.
• Psychoacoustic improvement.
Assuming a linear time-invariant system, the linear convolution between the input signal x and the system impulse response h is defined as follows:
$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(t - \tau)\, h(\tau)\, d\tau \qquad (1)$$
For discrete-time signals and an impulse response of finite length N, this becomes:
$$y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n - m]\, h[m] \qquad (2)$$
Time Domain Convolution
The convolution is performed using equation (2).
LATENCY: Theoretically zero.
COMPUTATIONAL COST: N − 1 additions and N multiplications per output sample.
CONSIDERATIONS: Too expensive for long IRs.
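As a minimal illustration of equation (2), the direct-form sum can be written out sample by sample. The function name `fir_direct` is hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_direct(x, h):
    """Direct-form FIR filtering: y[n] = sum over m of x[n - m] * h[m]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(len(h)):
            if n - m >= 0:              # input before n = 0 is taken as zero
                acc += x[n - m] * h[m]  # one multiplication and one addition per tap
        y.append(acc)
    return y
```

Each output sample visits all N taps, which is why the cost grows linearly with the impulse response length while the latency stays at zero.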
Frequency Domain Convolution
Considering the circular convolution and the DFT property:
$$y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n - m)_N]\, h[m] \qquad (3)$$
$$x[n] \circledast_N h[n] \leftrightarrow X[k]\, H[k] \qquad (4)$$
it follows that the convolution can be computed in the frequency domain.
OverLap and Save (OLS)
It converts a circular convolution into a linear convolution.
LATENCY: Equal to K samples, with K > N.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{L}{K}$ complex multiplications (with K a power of 2 and L = 2K for 50% overlap).
CONSIDERATIONS: I/O latency is too high for long IRs.
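The overlap-and-save idea can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' DSP code: the function name `overlap_save` is hypothetical, the impulse response is assumed to fit in one block of K samples, and NumPy's real FFT stands in for the DSP library routines.

```python
import numpy as np

def overlap_save(x, h, K):
    """Overlap-save FIR filtering with hop size K and FFT size L = 2K."""
    assert len(h) <= K, "this sketch needs the IR to fit in one block"
    L = 2 * K
    H = np.fft.rfft(h, L)                       # spectrum of the zero-padded IR
    n_blocks = -(-len(x) // K)                  # ceil(len(x) / K)
    # One block of leading zeros plays the role of the "previous" frame.
    x_pad = np.concatenate([np.zeros(K), x, np.zeros(n_blocks * K - len(x))])
    y = np.empty(n_blocks * K)
    for b in range(n_blocks):
        frame = x_pad[b * K : b * K + L]        # previous block followed by new block
        circ = np.fft.irfft(np.fft.rfft(frame) * H, L)
        y[b * K : (b + 1) * K] = circ[K:]       # last K samples are free of wrap-around
    return y[: len(x)]
```

A new output block can only be produced once K new input samples have arrived, which is the source of the K-sample latency noted above.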
Uniform Partitioned OverLap and Save (UPOLS)
The IR is partitioned into sections of equal size; an OLS is then applied to each sub-filter.
LATENCY: Equal to K samples, with K arbitrarily chosen.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{LP}{K}$ complex multiplications and $\frac{L(P-1)}{K}$ additions (with K a power of 2, P the number of partitions and L = 2K for 50% overlap).
CONSIDERATIONS: Computational cost higher than OLS.
Non Uniform Partitioned OverLap and Save (NUPOLS)
The IR is partitioned into sections of increasing size, reducing the computational cost with respect to the UPOLS algorithm.
LATENCY: Theoretically zero.
COMPUTATIONAL COST: It depends on the adopted partitioning.
CONSIDERATIONS: It is difficult to find the optimal partitioning.
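To get a feel for the trade-off between the OLS and UPOLS cost expressions, they can be evaluated numerically. The sketch below assumes a base-2 logarithm (the slides do not state the base) and reads the expressions, which are divided by K, as counts per output sample; the function names are hypothetical.

```python
import math

def ols_cost(K):
    """Complex multiplications for OLS with block size K and L = 2K."""
    L = 2 * K
    return (2 * L * math.log2(L) + L) / K

def upols_cost(K, P):
    """(complex multiplications, additions) for UPOLS with P partitions of K samples."""
    L = 2 * K
    mults = (2 * L * math.log2(L) + L * P) / K
    adds = L * (P - 1) / K
    return mults, adds
```

For an impulse response of 32768 samples, `ols_cost(32768)` gives 66 with a latency of 32768 samples, while `upols_cost(64, 512)` gives 1052 multiplications and 1022 additions with a latency of 64 samples: the lower latency is paid for with a higher cost, as stated above.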
An efficient DSP-based implementation of a low-latency fast convolution is proposed, based on the NUPOLS algorithm.
[Figure: Block diagram of the non uniform partitioned overlap and save algorithm. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]
[Figure: Block diagram of the proposed approach. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]
• First UPOLS: characterized by a small block size (e.g., 64 samples), which sets the desired input/output latency.
• Second UPOLS: uses a larger frame size, which minimizes the computational cost required to perform the convolution.
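The two-stage structure relies on the linearity of convolution: filtering with the head of the impulse response and with its tail separately, then adding the tail output delayed by the head length, gives the same result as filtering with the whole response. The sketch below checks only this property; `np.convolve` stands in for each UPOLS stage, and `split_convolve` and `head_len` are hypothetical names (the slides do not state where the split is placed).

```python
import numpy as np

def split_convolve(x, g, head_len):
    """Convolve x with g as head part plus delayed tail part (0 < head_len <= len(g))."""
    head, tail = g[:head_len], g[head_len:]
    y = np.zeros(len(x) + len(g) - 1)
    y_head = np.convolve(x, head)               # stage 1: early part of the IR
    y[: len(y_head)] += y_head
    if len(tail):
        y_tail = np.convolve(x, tail)           # stage 2: late part of the IR
        y[head_len : head_len + len(y_tail)] += y_tail   # delayed by head_len samples
    return y
```

Because the tail contribution is only needed `head_len` samples later, the second stage can work on larger frames without increasing the overall input/output latency.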
The real-time implementation of the proposed approach has been carried out on the Texas Instruments OMAP-L137 evaluation board.
Hardware features
Dual-Core System-On-Chip
300 MHz ARM926EJ-S RISC MPU
300 MHz C674x VLIW Floating Point DSP
128 KByte RAM Shared Memory
64 MByte SDRAM
Enhanced Direct-Memory-Access Controller 3 (EDMA3)
2 I/O audio channels
32 KByte L1P Program RAM/Cache (DSP side)
32 KByte L1D Data RAM/Cache (DSP side)
256 KByte L2 Unified Mapped RAM/Cache (DSP side)
• Design constraints: sample frequency 48 kHz, latency 64 samples, stereo implementation, floating point implementation.
• ARM: used to manage the control parameters.
• DSP: used to perform the DSP operations, exploiting its own libraries (i.e., DSPLib) and DMA engine.
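The design constraints translate into a fixed time and cycle budget per audio block. The arithmetic below is a back-of-the-envelope sketch using only the figures listed above; the constant names are illustrative.

```python
FS_HZ = 48_000               # sample frequency
BLOCK = 64                   # latency / block size in samples
DSP_CLOCK_HZ = 300_000_000   # C674x clock

block_period_s = BLOCK / FS_HZ                    # time available per block
cycles_per_block = DSP_CLOCK_HZ * BLOCK // FS_HZ  # DSP clock cycles per block
```

With these values each block must be processed within about 1.33 ms, that is 400000 DSP clock cycles, and that budget is shared by both stereo channels.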
The UPOLS algorithm implementation can be summarized considering
three main phases:
• Impulse response partitioning
• Input signal partitioning
• Filtering
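[Figure: UPOLS block diagram. The impulse response h(t), of length N, is split into blocks of K samples. Each input block x0, x1, x2, ..., xn of x(t) goes through an L-point FFT, is multiplied by the filter spectra H1, H2, H3, ..., and the products are accumulated. The last K points of each L-point IFFT form the output blocks y0, y1, y2, ..., yn of y(t).]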
• Impulse response partitioning
- The impulse response h is partitioned into P blocks h_n of length K.
- The filter set H_n is obtained by taking an L-point FFT of each block h_n (with L = 2K, overlap 50%).
- The set of P filters is then stored in a delay line held in the external memory.
- The operation is performed offline using a Matlab script.
• Input signal partitioning
- The input signal x is partitioned into blocks of length K.
- The frequency-domain block X_n is obtained by applying an L-point FFT to the input vector composed of the previous frame x_{n−1} and the new frame x_n (overlap 50%).
- This vector X_n is stored in a delay line held in the external memory, together with the P − 1 previous blocks.
• Filtering
- The output block Y_n is obtained through filtering operations:
$$Y_n = \sum_{i=0}^{P-1} X_{n-P+1+i}\, H_{P-1-i} \qquad (5)$$
- The time-domain output signal y_n is composed of the last K samples of the L-point IFFT of Y_n.
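The three phases can be put together in a compact NumPy sketch. It is an illustration under stated assumptions rather than the DSP implementation: `upols` is a hypothetical name, the filter partitions are indexed from 0 (so `H[j]` multiplies the input block that is j frames old), the delay lines live in ordinary arrays instead of external memory, and NumPy's real FFT keeps only the K + 1 non-redundant bins.

```python
import numpy as np

def upols(x, h, K):
    """Uniformly partitioned overlap-save; returns the first len(x) output samples."""
    L = 2 * K
    P = -(-len(h) // K)                          # number of partitions (ceil)
    h_pad = np.zeros(P * K)
    h_pad[: len(h)] = h
    # Phase 1 (offline on the slides): L-point FFT of each K-sample partition.
    H = np.array([np.fft.rfft(h_pad[p * K : (p + 1) * K], L) for p in range(P)])

    n_blocks = -(-len(x) // K)
    x_pad = np.zeros(n_blocks * K)
    x_pad[: len(x)] = x
    X = np.zeros((P, K + 1), dtype=complex)      # delay line: X[j] holds block n - j
    prev = np.zeros(K)
    y = np.empty(n_blocks * K)
    for n in range(n_blocks):
        cur = x_pad[n * K : (n + 1) * K]
        # Phase 2: FFT of [previous frame, new frame], pushed into the delay line.
        X = np.roll(X, 1, axis=0)
        X[0] = np.fft.rfft(np.concatenate([prev, cur]))
        # Phase 3: sum of products as in equation (5), then keep the last K samples.
        Y = (X * H).sum(axis=0)
        y[n * K : (n + 1) * K] = np.fft.irfft(Y, L)[K:]
        prev = cur
    return y[: len(x)]
```

Only one FFT and one IFFT are needed per block regardless of P; the cost that grows with the impulse response length is the sum of spectral products.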
Complex multiplications and accesses to data in external memory are the main bottlenecks in a fast convolution implementation.
HOW TO SOLVE THESE PROBLEMS?
Adopted Solution
• The NUPOLS algorithm allows one to minimize both the number of complex multiplications and the memory accesses compared to the UPOLS approach.
• The DMA engine allows one to parallelize transfers from/into external memory and processing operations.
Parallelization of the transfers from/into external memory (executed by the DMA engine) and the processing operations.
[Figure: Kernel used for UPOLS algorithms. (i) Basic approach: Read Hn (blocking), Read Xn (blocking), Compute Yn. (ii) Improved approach: Read Hn (blocking), Read Xn+1 (non blocking), Compute Yn, Read Xn (blocking).]
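The improved kernel follows a prefetch (double-buffering) pattern: the transfer of the next block is started before the current block is processed, and the code blocks only when that data is actually needed. The sketch below mimics the pattern on a host machine with a single worker thread standing in for the DMA engine; `fetch` and `process_prefetched` are hypothetical names and no TI EDMA3 API is used.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(memory, i):
    """Stand-in for a DMA transfer of block i from external memory."""
    return list(memory[i])

def process_prefetched(memory, compute):
    """Process every block while the next one is being transferred."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, memory, 0)              # start the first transfer
        for i in range(len(memory)):
            block = pending.result()                        # blocking: wait for block i
            if i + 1 < len(memory):
                pending = dma.submit(fetch, memory, i + 1)  # non blocking: prefetch i + 1
            results.append(compute(block))                  # overlaps with the transfer
    return results
```

For example, `process_prefetched([[1, 2], [3, 4], [5, 6]], sum)` returns `[3, 7, 11]`; the results are the same as in the basic kernel, only the waiting time is hidden behind the computation.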
FFT Optimization
The workload required for FFT/IFFT computation can be reduced by taking advantage of the stereo implementation and of the real nature of the audio signal.
• Two L-point FFTs/IFFTs of real sequences may be calculated through one FFT/IFFT of a complex sequence.
• The symmetry property of the FFT has been exploited. This decreases the number of accesses to the external memory and the number of frequency multiplications from L to (K + 1) for each of the P processed frequency blocks.
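The first expedient is a standard DFT identity: packing two real sequences a and b as z = a + jb, the individual spectra follow from Z[k] and the conjugate of Z[L − k]. The NumPy sketch below illustrates it for the forward transform; `fft_two_real` is a hypothetical name and the left and right channels are assumed to be NumPy arrays of equal length.

```python
import numpy as np

def fft_two_real(a, b):
    """Spectra of two real sequences of equal length from one complex FFT."""
    z = np.fft.fft(a + 1j * b)                 # single complex FFT of the packed signal
    z_rev = np.conj(np.roll(z[::-1], 1))       # conj(Z[(L - k) mod L])
    A = 0.5 * (z + z_rev)                      # spectrum of a
    B = -0.5j * (z - z_rev)                    # spectrum of b
    return A, B
```

Since A and B are spectra of real signals, bins K + 1 to L − 1 are the complex conjugates of bins K − 1 down to 1, which is why only K + 1 bins need to be stored and multiplied.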
Psychoacoustic Optimization
Psychoacoustics allows one to reduce the number of complex multiplications and memory accesses.
All the components (frequency bins) above a certain cut-off frequency fc (e.g., 18 kHz) are left out.
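The saving can be estimated by counting how many of the K + 1 non-redundant bins lie at or below the cut-off. The sketch assumes the 48 kHz sample frequency and the 18 kHz example cut-off mentioned in the slides, and that the bin exactly at fc is kept; `retained_bins` is a hypothetical helper.

```python
def retained_bins(L, fs_hz=48_000, fc_hz=18_000):
    """Number of bins k = 0..k_max of an L-point FFT whose frequency k * fs / L is <= fc."""
    k_max = fc_hz * L // fs_hz
    return k_max + 1
```

For L = 128 this keeps 49 of the 65 bins, and for L = 4096 it keeps 1537 of 2049, so roughly a quarter of the complex multiplications and memory accesses are skipped in both cases.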
HOW TO PARALLELIZE THE 2 UPOLS?
In a low-latency context, a multithreaded approach does not guarantee high performance on the DSP board.
Adopted Solution
A manual partitioning of the code has been realized, aiming to uniformly distribute the FFT/IFFT operations and the complex multiplications of both UPOLSs throughout the processing.
The manual partitioning aims to uniformly distribute the FFT/IFFT operations and the complex multiplications related to the larger UPOLS over the K2/K1 iterations necessary to respect the processing constraint.
Iteration  Operation            Iteration  Operation
1          Large FFT 3/3        17         MAC Left Channel
2          MUL Left Channel     18         MAC Left Channel
3          MUL Right Channel    19         MAC Right Channel
4          Large IFFT 1/3       20         MAC Right Channel
5          Large IFFT 2/3       21         MAC Right Channel
6          Large IFFT 3/3       22         MAC Right Channel
7          MAC Left Channel     23         MAC Right Channel
8          MAC Left Channel     24         MAC Right Channel
9          MAC Left Channel     25         MAC Right Channel
10         MAC Left Channel     26         MAC Right Channel
11         MAC Left Channel     27         MAC Right Channel
12         MAC Left Channel     28         MAC Right Channel
13         MAC Left Channel     29         MAC Right Channel
14         MAC Left Channel     30         MAC Right Channel
15         MAC Left Channel     31         Large FFT 1/3
16         MAC Left Channel     32         Large FFT 2/3
Distribution of the UPOLS operations in a NUPOLS implementation with K1 = 64 and K2 = 2048.
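One way to picture this distribution is as a lookup table consulted once per small block: every K1-sample iteration runs the small UPOLS plus one slice of the large UPOLS work. The sketch below only encodes the table for K1 = 64 and K2 = 2048; the names `SCHEDULE` and `large_task` are hypothetical and the actual dispatch code on the DSP is not shown in the slides.

```python
K1, K2 = 64, 2048
SLOTS = K2 // K1    # 32 small-block iterations per large frame

# Slice of large-UPOLS work assigned to iterations 1..32 (stored 0-based).
SCHEDULE = (
    ["Large FFT 3/3", "MUL Left Channel", "MUL Right Channel",
     "Large IFFT 1/3", "Large IFFT 2/3", "Large IFFT 3/3"]
    + ["MAC Left Channel"] * 12
    + ["MAC Right Channel"] * 12
    + ["Large FFT 1/3", "Large FFT 2/3"]
)

def large_task(block_index):
    """Large-UPOLS slice to run alongside small block number block_index (0-based)."""
    return SCHEDULE[block_index % SLOTS]
```

Spreading the large FFT and IFFT over three iterations each, and the multiply-accumulate work over 24 iterations, keeps the per-iteration load close to uniform instead of concentrating it in one block out of 32.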
Case Study: Artificial Reverberator
Fast convolution can be employed in many different real-time audio applications.
Digital artificial reverberation is the application that most clearly exposes the limits of real-time FIR filtering.
• Convolutions with long IRs are needed to simulate large environments.
• Low input/output latencies are required in musical instruments.
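A rough count shows why reverberation stresses direct FIR filtering. The sketch assumes the 48 kHz stereo setup of the design constraints and a 3 s impulse response chosen purely as an example; `direct_macs_per_second` is a hypothetical helper.

```python
FS_HZ = 48_000

def direct_macs_per_second(ir_seconds, channels=2):
    """Multiply-accumulates per second needed by direct-form FIR filtering."""
    taps = round(ir_seconds * FS_HZ)       # impulse response length in samples
    return taps * FS_HZ * channels         # one MAC per tap, per sample, per channel
```

`direct_macs_per_second(3.0)` gives 13824000000, which would require about 46 multiply-accumulates per clock cycle on a 300 MHz core, hence the need for fast convolution.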
Tests
Several tests have been carried out to evaluate the effectiveness of the proposed approach, comparing the workload required by the UPOLS and NUPOLS implementations.
UPOLS PERFORMANCE
[Figure: Workload of the Uniform Partitioned Overlap and Save algorithm (K = 64) as a function of the impulse response length [s]. (a) Classic implementation. (b) Psychoacoustic approach.]
Considerations
• The maximum impulse response length is about 0.55 s (guaranteeing real-time performance).
• The approach is not suitable for the simulation of large reverberating environments in musical instruments.
NUPOLS PERFORMANCE
[Figure: Workload of the NUPOLS algorithm as a function of the impulse response length [s] with 4 different partitionings: (i) K1 = 64, K2 = 2048; (ii) K1 = 64, K2 = 1024; (iii) K1 = 64, K2 = 512; (iv) optimal partitioning, with regions labelled K2 = 512, K2 = 1024 and K2 = 2048. Mean (a) and max (b) workload for the classic implementation. Mean (c) and max (d) workload using the psychoacoustic approach.]
NUPOLS PERFORMANCE
[Figure: NUPOLS workload as a function of the processing iteration (IR length = 3.164 s). (a) Workload of the NUPOLS. (b) Workload of the small UPOLS (K1 = 64). (c) Workload of the large UPOLS (K2 = 2048).]
Partitioning            Internal Memory Usage
K1 = 64, K2 = 2048      100 kB
K1 = 64, K2 = 1024      50 kB
K1 = 64, K2 = 512       30 kB
Considerations
• Evident improvement in terms of performance with respect to the uniform partitioning based approach.
• It is possible to perform a stereo convolution with an impulse response of length 6 s using about 50% of the DSP resources.
In conclusion:
• A novel approach for fast convolution computation has been proposed, based on non uniform partitioning of the impulse response.
• Two UPOLSs with different frame sizes are used: the desired input/output latency is obtained through the UPOLS with the smaller frame size, while the other UPOLS is exploited to decrease the number of memory accesses and complex multiplications.
• A DSP-based real-time implementation has been developed and several experiments have been carried out, considering digital reverberation as a case study.