GSFAP Adaptive Filtering Using Log Arithmetic for Resource-Constrained Embedded Systems
MILAN TICHY and JAN SCHIER
Academy of Sciences of the Czech Republic
and
DAVID GREGG
Trinity College Dublin
Adaptive filters are widely used in many applications of digital signal processing; digital communications and digital video broadcasting are just two examples. Traditionally, small embedded systems have employed the least computationally intensive adaptive filter algorithms, such as normalized least mean squares (NLMS). This article shows that FPGA devices are a highly suitable platform for more computationally intensive adaptive algorithms. We present an optimized core which implements GSFAP, an algorithm with far superior adaptation properties to NLMS and only slightly higher computational complexity. To further reduce resource requirements we use logarithmic arithmetic, rather than conventional floating point, within the custom core. Our design makes effective use of the pipelined logarithmic addition units and takes advantage of the very low cost of logarithmic multiplication and division. The resulting GSFAP core can be clocked at more than 80MHz on a one-million-gate Xilinx XC2V1000-4 device and can be used to implement adaptive filters of orders 20 to 1000 performing echo cancellation on speech signals at sampling rates exceeding 50kHz. For comparison, we implemented a similar NLMS core; although it is slightly smaller than the GSFAP core and allows a higher signal sampling rate for corresponding filter orders, the GSFAP core has much better adaptation properties, and our core can provide very sophisticated adaptive filtering capabilities for resource-constrained embedded systems.
This work was supported and funded by the Czech Ministry of Education, Project CAK No. IM0567,
and also by the European Commission, Project AETHER No. FP6-2004-IST-4-027611.
The paper reflects only the authors’ view and neither the Czech Ministry of Education nor the
European Commission is liable for any use that may be made of the information contained herein.
Authors’ addresses: M. Tichy (contact author), and J. Schier, Department of Signal Processing,
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod
Vodarenskou vezi 4, 182 08 Prague 8, Czech Republic; email: {tichy, schier}@utia.cas.cz; D. Gregg,
Department of Computer Science, O’Reilly Institute, Trinity College Dublin, Dublin 2, Ireland;
email: [email protected].
Categories and Subject Descriptors: B.2.4 [Arithmetic and Logic Structures]: High-Speed Arithmetic—Cost/Performance; C.3 [Special-Purpose And Application-Based Systems]: Signal
Processing Systems
General Terms: Algorithms, Design, Performance, Experimentation
Additional Key Words and Phrases: FPGA, DSP, logarithmic arithmetic, affine projection
ACM Reference Format:
Tichy, M., Schier, J., and Gregg, D. 2010. GSFAP adaptive filtering using log arithmetic for resource-constrained embedded systems. ACM Trans. Embedd. Comput. Syst. 9, 3, Article 29 (February 2010), 31 pages. DOI = 10.1145/1698772.1698787 http://doi.acm.org/10.1145/1698772.1698787
1. MOTIVATION
Adaptive filters are widely used in digital signal processing (DSP) for such
applications as noise and echo cancellation, and in areas such as data communication systems. A filter applies some function to its input signal to perform
the required transformation. An adaptive filter monitors its environment and
adapts this function accordingly.
A wide variety of adaptive filtering algorithms have been proposed in the literature. The most important categories of algorithms are perhaps those based
on least mean squares (LMS) [Widrow and Stearns 1985] and recursive least
squares (RLS) [Kalouptsidis and Theodoridis 1993; Haykin 2002]. Algorithms
based on LMS are fast and simple to implement, but suffer from slow convergence. The RLS-based algorithms converge much faster, but are usually considered too computationally expensive, particularly in applications like echo cancellation where filters with up to several hundred taps are required. Currently,
the most widely used adaptive filtering algorithm in resource-constrained embedded systems is a variant of LMS called normalized LMS (NLMS) [Haykin
2002]. Faster-converging algorithms are usually considered too expensive for
small embedded systems.
More recently, a category of algorithms based on affine projection (AP) [Ozeki and Umeda 1984], sometimes referred to as generalized NLMS, has been developed; these algorithms provide a compromise between the slow convergence of LMS and the computational complexity of RLS.
In this article we show that a highly optimized core for a faster-converging,
more computationally expensive algorithm can be comfortably implemented
on a small (one-million-gate) FPGA. The algorithm is the Gauss-Seidel fast
affine projection (GSFAP) [Albu et al. 2002a], which is perhaps one of the most
efficient of the fast AP algorithms. To reduce the resource requirements we
represent numbers using the logarithmic number system (LNS) [Swartzlander and Alexopoulos 1975; Yu and Lewis 1991; Coleman et al. 2000; Coleman
et al. 2008]. Logarithmic arithmetic allows resource-efficient, low-latency multiplication and division at the cost of slightly more expensive addition and
subtraction. The resulting GSFAP core can be clocked at over 80MHz on a
one-million-gate Xilinx XC2V1000-4 device. To demonstrate the effectiveness
of our design, we used the core to implement an adaptive filter of order 1000
performing echo cancellation on speech signals at a sampling rate of more than
50kHz.
The rest of this article is organized as follows. Section 2 provides further
background on adaptive filtering. In Section 3 the GSFAP algorithm is described
in more detail. The logarithmic number system is introduced in Section 4.
Section 5 details the architecture of our GSFAP core. Finally, in Section 6 we
evaluate the performance of this architecture.
2. ADAPTIVE FILTERING
The classical approach to digital adaptive filter design is to use the least mean squares (LMS) [Widrow and Stearns 1985] algorithm or one of its modifications, because these algorithms are simple and extensively described in the literature, and thus their behavior is very well known. Such digital filters are relatively
easy to implement even in very small electronic devices, and consequently are
widely used in various applications. The disadvantage of the classical LMS
filter design is its slow convergence. In contrast, the recursive least squares (RLS) [Kalouptsidis and Theodoridis 1993; Haykin 2002] algorithm is known for its very good convergence properties; however, it has high computational complexity and memory requirements.
In order to reduce the computational complexity of the affine projection algorithm (APA) [Ozeki and Umeda 1984], a fast version of the affine projection
algorithm (FAP) [Gay and Tavathia 1995] has been developed. Although the
FAP algorithm converges quickly and has low computational requirements, it
is actually a rather unpromising algorithm because it is numerically unstable,
especially for nonstationary signals. This is due to the use of fast RLS (FRLS)
[Cioffi and Kailath 1983; Slock and Kailath 1991] in the algorithm. However, a
number of variants have been proposed that solve the numerical stability problems, while maintaining the advantages of FAP. The “modified” FAP (MFAP)
[Liu et al. 1996; Kaneda et al. 1995] uses the matrix inversion lemma employed
in the classical RLS algorithm, thus avoiding the problems with fast RLS, but at
the cost of greater computational requirements. Conjugate gradient (CG) FAP
[Ding 2000] uses results of the modified FAP, and uses the conjugate gradient
method [Luenberger 1984] to deal with the matrix inversion. The FAP-based algorithm that we believe to be the most suitable for hardware implementation is
Gauss-Seidel (GS) FAP [Albu et al. 2002a], which replaces the CG method with
the Gauss-Seidel method [Hageman and Young 1981]. This algorithm has all
the advantages of modified FAP and CGFAP, but has lower computational complexity, allowing an efficient implementation with fewer hardware resources.
Let us now summarize the computational complexities of algorithms mentioned in previous paragraphs (see Table I). The complexity of the LMS-based
algorithms is O(L), typically 2L + 1 multiply-accumulate (MACC) operations
per iteration, where L is the filter order. The complexity of NLMS is similar,
2L + 3 MACC operations and 1 division per iteration. In contrast, the complexity of the RLS-based algorithms is O(L²). The memory requirements of RLS
are also much higher than those of (N)LMS. Fast versions of the RLS algorithm
with complexity O(L) exist, which partly solve the complexity issues. The RLS
lattice algorithm requires 18L operations and the fast transversal filter (FTF)
requires 8L operations, or 9L in the stabilized form. This, however, comes at the expense of problems with numerical stability.
Table I. Summary of the Complexities of Adaptive Filtering Algorithms

Algorithm          Complexity [MACC ops]
LMS                2L + 1
NLMS               2L + 3
RLS                O(L²)
RLS Lattice        18L
FRLS               8L
FRLS Stabilized    9L
APA                2L + O(N³)
FAP                2L + 20N
FAP Stabilized     2L + 24N
Modified FAP       2L + 3N² + 12N
CGFAP              2L + 2N² + 9N + 1
GSFAP              2L + N² + 4N − 1
For large filtering problems
(several hundred taps) and real-time implementations, the differences in complexity between the LMS (2L)- and the FRLS (9L or 18L)-based algorithms can
be significant. The complexity of FAP is 2L + O(N), where N is the projection order. All other FAP variants mentioned above have complexity 2L + O(N²), where GSFAP is the least complex, with its complexity being 2L + N² + 4N − 1 MACC operations and N divisions per sample period. In most applications, especially those involving speech, the projection order is almost always very much smaller than the filter order, that is, N ≪ L, so the time complexity is usually dominated by L rather than N². Echo and noise cancellation problems are good examples of such applications. In fact, the GSFAP algorithm has
complexity very similar to the original FAP, while being numerically stable.
Using the fast affine projection algorithm brings new qualitative properties
to the adaptive filter design. Its main attributes include RLS-like convergence
and tracking abilities with LMS-like computational complexity. In other words,
it is possible to use very high order digital adaptive filters with improved convergence properties, without a substantial increase in computational complexity.
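For reference, the NLMS update mentioned above can be sketched in a few lines. The following NumPy sketch is illustrative only: the variable names, the regularization constant delta, and the step size mu are our own choices, not values taken from this article. It shows where the 2L + 3 MACC operations and the single division of Table I come from, under the assumption that the input-energy term is maintained recursively.

```python
import numpy as np

def nlms_iteration(u, d_k, w, mu=0.5, delta=1e-6):
    """One NLMS update: y = w^T u, e = d - y, w += mu * e * u / (delta + u^T u).

    u : vector of the L most recent input samples (newest first)
    w : current filter coefficients (updated in place and returned)
    The 2L + 3 MACC count of Table I assumes u^T u is maintained recursively
    (add the newest squared sample, subtract the oldest) rather than recomputed
    as the full dot product done here.
    """
    y_k = w @ u                      # L MACCs: filter output
    e_k = d_k - y_k                  # estimation error
    energy = delta + u @ u           # normalization term (recursive in practice)
    w += (mu * e_k / energy) * u     # L MACCs + 1 division: coefficient update
    return w, y_k, e_k
```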
3. THE GSFAP ALGORITHM
First, a brief review of the GSFAP algorithm is given. The GSFAP algorithm is
summarized in Figure 1.
The filter order, that is, the number of filter coefficients, is denoted as L.
Individual coefficients (weights) constitute the coefficient vector wk. None of the FAP variants mentioned in this text uses the actual (true) filter coefficients wk in the adaptation process; instead, they use the more efficient alternate coefficient vector update introduced in Gay [1993]. This alternate coefficient vector
is denoted ŵk and has the same number of elements. The number N is referred
to as the projection order.
The parameters μ and δ are the relaxation factor and the regularization
parameter, respectively. The former, μ, represents the algorithm’s step-size
parameter, which tells us how quickly we move towards the optimal solution
(weights). The algorithm is stable for 0<μ<2. The latter, δ, is the regularization
Fig. 1. The Gauss-Seidel Fast Affine Projection (GSFAP) algorithm.
parameter for the autocorrelation matrix inverse. It prevents the autocorrelation matrix R from becoming singular.
Step 0 initializes the key variables. Steps 1 through 4 represent one iteration
of the GSFAP algorithm. The index ·k is the “discrete time” (iteration) index;
if more indices are given, it is always the last one which denotes the iteration
index; for example, pi,k denotes the i-th element of vector pk.
In Step 1, new data—input and desired signal samples uk and d k —are acquired; and the excitation signal vector uk , vector ξ k (Step 1a) and the correlation matrix Rk (Step 1b) are updated.
The excitation signal vector uk is a vector consisting of the delayed sequence
of L input signal samples, that is,
uk = [ uk  uk−1  …  uk−L+1 ]T.    (1)
The vector ξ k consists of the delayed sequence of N input signal samples, which
means that the vector ξ k overlaps the first N elements of vector uk . As we will
see in Section 5, this property is important for saving hardware resources, on-chip Block RAMs in particular. The vector ξk−L, required for the update of Rk,
has the same structure as ξ k but it represents the state at time k − L.
The matrix Rk is an N by N matrix, which is the auto-correlation matrix of
the excitation signal and it is symmetric and positive definite. Since the matrix
Rk is symmetric, its update involves N(N + 1) multiply-accumulate operations, and N(N + 5)/2 read and N² write memory accesses (provided that the whole matrix is held in memory—not only an upper or lower triangle).
In step 2, the vector pk is calculated. This vector is in fact the first column of the inverse of matrix Rk . This problem is equivalent to solving a set
of linear equations Rk pk = b, where b is a vector of length N , in which all
elements have the value zero, except for the first element which has the value
one. As shown in Albu et al. [2002a], one iteration of the Gauss-Seidel (GS)
method for solving a set of N linear equations provides a good estimate of the
actual vector pk using the vector pk−1 as initial value for each GS iteration.
The symbol Rij,k denotes the ij-th element of the matrix Rk and the symbol
pi,k denotes the i-th element of the vector pk . One full GS iteration requires N 2
additions/subtractions, N (N − 1) multiplications, N divisions, and 2N (N − 1)
read and N write memory accesses.
Step 3 represents an efficient method for calculation of the filter output y k
(Step 3a) and of the estimation error ek (Step 3b) using alternate coefficient
vector ŵk rather than the original weight vector wk. Vector εk is called the normalized estimation error vector and is of dimension N. The symbol ε̄k−1 denotes an N − 1 vector consisting of the N − 1 uppermost elements of vector εk−1 and the
symbol R̃0,k represents an N − 1 vector that consists of the N − 1 lowermost
elements of the first (left) column of the matrix Rk . The calculation of the filter
output y k consists of two dot-product operations, one of length L and the other
of length N − 1, and of one additional multiply-accumulate to multiply the dot
product ε̄Tk−1 R̃0,k by μ and to add the result to the term uTk ŵk−1. It is evident
that the calculation of y k requires L + N multiply-accumulate operations
and 2(L + N − 1) read memory accesses. Then, for calculation of ek just one
additional subtraction is required.
The normalized estimation error vector εk and consequently the alternate
coefficient vector ŵk are updated in Step 4. Both manipulations are based on a
simple multiply-accumulate operation. After the normalized estimation error
εk (Step 4a) has been updated, the excitation signal vector u at time k − N + 1 and the lowermost element of the newly updated vector εk, denoted as εN−1,k, are
used to update the alternate coefficient vector ŵk (Step 4b). The first operation
requires N multiplications, N − 1 additions and 2N − 1 read and N write
memory accesses; the latter L + 1 multiplications, L additions and 2L read and
L write memory accesses. Finishing this step, one full iteration of the GSFAP
is completed.
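To make the data flow of Steps 1 through 4 concrete, the following NumPy sketch follows one GSFAP iteration as walked through above. It is a behavioral model only, not the hardware design of Section 5: the buffer layout, the Step 0 initialization suggested in the docstring, and the choice of μ are assumptions on our part.

```python
import numpy as np

def gsfap_iteration(u_buf, d_k, R, p, eps, w_hat, mu, N):
    """One GSFAP iteration (Steps 1-4 of Figure 1); behavioral sketch only.

    u_buf : the last L + N input samples, newest first (u_buf[i] = u_{k-i}),
            already updated with the newest sample u_k (Step 1a)
    R, p, eps, w_hat : state carried between iterations (Step 0 could set,
            for example, R = delta*I, p = e1/delta, eps = 0, w_hat = 0)
    """
    L = w_hat.size
    u_k    = u_buf[:L]               # excitation vector u_k
    xi_k   = u_buf[:N]               # xi_k overlaps the first N elements of u_k
    xi_kL  = u_buf[L:L + N]          # xi_{k-L}
    u_kNp1 = u_buf[N - 1:L + N - 1]  # u_{k-N+1}, delayed excitation vector

    # Step 1b: update of the autocorrelation matrix
    R = R + np.outer(xi_k, xi_k) - np.outer(xi_kL, xi_kL)

    # Step 2: one Gauss-Seidel sweep on R p = b, with b = [1, 0, ..., 0]^T
    for i in range(N):
        s = R[i, :i] @ p[:i] + R[i, i + 1:] @ p[i + 1:]
        p[i] = ((1.0 if i == 0 else 0.0) - s) / R[i, i]

    # Step 3: filter output and estimation error
    y_k = u_k @ w_hat + mu * (eps[:N - 1] @ R[1:, 0])   # R[1:, 0] is R~_{0,k}
    e_k = d_k - y_k

    # Step 4a: normalized estimation error vector
    eps = np.concatenate(([0.0], eps[:N - 1])) + e_k * p

    # Step 4b: alternate coefficient vector update
    w_hat = w_hat + mu * eps[N - 1] * u_kNp1

    return R, p, eps, w_hat, y_k, e_k
```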
4. LOGARITHMIC ARITHMETIC
In order to maintain accuracy of the algorithm in the FPGA implementation, we
decided to implement the computations using a floating-point-like arithmetic.
Fig. 2. Format of the Logarithmic Number System representation.
Floating-point arithmetic units are often large and power-hungry, and usually have long pipeline latencies. Traditionally, multiplication, division, and
square root were particularly expensive to implement. Recent Xilinx Virtex-2
and Virtex-4 FPGAs have included hard-wired logic, which implements 18-bit
integer multiplication. These hardware multipliers can be used to implement
the mantissa multiplication step in floating-point multiplication, greatly reducing the cost of floating-point multiplication units. However, even with hardware multipliers, such units are still large, and often have long latencies. Furthermore, floating-point division and square root remain expensive, and even
floating-point addition units are far from cheap.
An alternative to floating-point is to represent real (non-integer) numbers in
logarithmic form. The logarithmic number system (LNS) has been proposed many times, but has been little used since the advent of hardware floating-point units in general-purpose processors. However, on an FPGA, multiplication, division, and square root of logarithmic numbers require only very simple logic
to compute, making LNS ideal for algorithms that contain a large number of
such operations. Addition and subtraction are more complex in LNS, but recent
advances have made them practical even on small FPGAs.
4.1 LNS Data Representation
The logarithmic number system data representation is depicted in Figure 2. A logarithmic number consists of two parts: a sign digit S and a logarithm E
that includes an integer part of length e and a fractional part of length f . The
sign bit S is set to 0 if the number is positive and to 1 if the number is negative;
E is the logarithm of the absolute value of a real number X to be represented.
For our representation in LNS, base-2 logarithm has been chosen. The logarithm E can thus be represented as a two’s complement fixed-point value equal
to log2 |X |, where X is the value to be represented and |·| is the absolute value
operator. Then, the value of X can be expressed as
X = (−1)^S · 2^E.    (2)
It should be noted that the LNS representation can be considered as an extreme
case of the floating-point number system with the significand always equal to
1 and the exponent represented as a fixed-point number (with integer and
fractional parts) rather than an integer.
For the 32-bit LNS precision, the integer part is of length e = 8 and the
fraction part is of length f = 23. There are two special values—zero and NaN—
which have to be represented by a specific sequence of bits, where E has its
leftmost bit set to 1 and all remaining bits to 0; if S = 0, the LNS number
represents zero; if S = 1, the LNS number represents NaN.
The largest and the smallest values of the LNS fixed-point part, denoted Emax and Emin, respectively, can be defined as

Emax = 2^(e−1) − 2^(−f)    (3)
Emin = −2^(e−1) + 2^(−f).    (4)

Then, according to (2), the largest and the smallest positive real numbers representable by the LNS are X⁺max = 2^Emax and X⁺min = 2^Emin, respectively.
The standard IEEE single-precision floating-point representation [Institute
of Electrical and Electronics Engineers, Inc. 1985] uses a sign bit, 8-bit biased
exponent, and (23 + 1)-bit significand. This format is able to represent signed
values within the range ≈ 1.17·10−38 to 3.4·1038 . In the equivalent 32-bit precision LNS representation, the integer and fractional parts are kept as coherent
two’s complement fixed-point value within the range ≈ − 128 to 128, according
to (4) and (3). Then, the real numbers representable by LNS are signed and
within the range ≈ 2.9 · 10−39 to 3.4 · 1038 .
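As a quick illustration of this representation (not of the hardware implementation), the following Python sketch encodes a real number into the (S, E) pair described above and decodes it again. The packing into a single 32-bit word and the exact rounding behavior of the real cores are not modeled.

```python
import math

E_BITS, F_BITS = 8, 23                      # 32-bit LNS: 1 sign + 8 integer + 23 fraction bits
SCALE = 1 << F_BITS                         # fixed-point scaling of the logarithm E
ZERO_CODE = -(1 << (E_BITS + F_BITS - 1))   # E with its leftmost bit 1 and all other bits 0

def to_lns(x):
    """Return (S, E) with E an integer holding the fixed-point value log2|x|."""
    if x == 0.0:
        return 0, ZERO_CODE                 # S = 0 with the special code represents zero
    s = 0 if x > 0 else 1
    return s, int(round(math.log2(abs(x)) * SCALE))

def from_lns(s, e):
    """Decode (S, E) according to X = (-1)^S * 2^E."""
    if e == ZERO_CODE:
        return 0.0 if s == 0 else float("nan")   # S = 1 with the code represents NaN
    return (-1.0) ** s * 2.0 ** (e / SCALE)
```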
4.2 LNS Operations
In the LNS, a value X is represented as the fixed-point quantity i = log2|X|, with
an extra bit to indicate the sign of X and a special arrangement to accommodate
zero and NaN. Base-2 logarithms are used, though in principle any base could
be used.
Given two LNS values i = log2|X| and j = log2|Y|, the LNS addition, subtraction, multiplication, division, and square root can be defined by the following set of equations:

log2(X + Y) = i + log2(1 + 2^(j−i))    (5)
log2(X − Y) = i + log2(1 − 2^(j−i))    (6)
log2(X · Y) = i + j    (7)
log2(X / Y) = i − j    (8)
log2(√X) = i / 2,    (9)
where in (5) and (6), without loss of generality, we choose j ≤ i. In all these
cases the sign bits are handled separately, using the same rules as in the case
of floating-point operations.
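Working on magnitudes only (signs are handled separately, as noted above), the following sketch shows how Equations (5)-(9) reduce to fixed-point adds, subtracts, and shifts on the scaled logarithms. It is illustrative only: the nonlinear term of (5) is evaluated directly here, whereas the hardware cores use the table-based mechanism described below.

```python
import math

F_BITS = 23
SCALE = 1 << F_BITS                    # fixed-point scaling of the logarithms

def lns_mul(i, j):
    """Eq. (7): log2(X*Y) = i + j -- just a fixed-point addition."""
    return i + j

def lns_div(i, j):
    """Eq. (8): log2(X/Y) = i - j -- just a fixed-point subtraction."""
    return i - j

def lns_sqrt(i):
    """Eq. (9): log2(sqrt(X)) = i / 2 -- an arithmetic right shift."""
    return i >> 1

def lns_add(i, j):
    """Eq. (5): log2(X+Y) = i + log2(1 + 2^(j-i)), choosing j <= i.
    Here F(r) is computed directly; a hardware core evaluates it by table
    lookup with first-order interpolation instead."""
    if j > i:
        i, j = j, i
    r = (j - i) / SCALE                # r <= 0
    return i + int(round(math.log2(1.0 + 2.0 ** r) * SCALE))
```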
It is evident that Equations (7), (8), and (9) can be implemented simply as fixed-point addition, subtraction, and shift operations, respectively. Unfortunately, as mentioned above, Equations (5) and (6) require evaluation of a nonlinear function

F(r) = log2(1 ± 2^r),  where r = j − i,    (10)
which can be seen in Figure 3. To evaluate these functions look-up tables are
used containing values only at intervals through the functions. Intervening
values are obtained by interpolation using a first-order Taylor series approximation. The look-up tables are kept reasonably small using the error-correction mechanism and, for the case of logarithmic subtraction, the range-shift algorithm for values within the range −0.5 < r < 0. Both techniques, as well as the structure of the approximation tables, are described in Coleman [1995] and Coleman et al. [2000].

Fig. 3. The nonlinear functions log2(1 + 2^(j−i)) and log2(1 − 2^(j−i)) that have to be evaluated using approximation when performing the LNS addition and subtraction.
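The following sketch illustrates the table-plus-first-order-interpolation idea for F(r) = log2(1 + 2^r). The table range, spacing, and the use of a stored slope are our own simplifications; they do not reproduce the error-correction or range-shift mechanisms described in Coleman [1995] and Coleman et al. [2000].

```python
import numpy as np

R_MIN, STEP = -16.0, 1.0 / 64                     # assumed table range and spacing
grid = np.arange(R_MIN, 0.0 + STEP, STEP)
F_tab  = np.log2(1.0 + np.exp2(grid))             # stored samples of F(r) = log2(1 + 2^r)
dF_tab = np.exp2(grid) / (1.0 + np.exp2(grid))    # stored slope for the first-order term

def f_add(r):
    """Approximate F(r) = log2(1 + 2^r) for r <= 0 by table lookup plus a
    first-order (Taylor) correction."""
    if r <= R_MIN:
        return 0.0                                # 2^r is negligible below the table range
    k = int((r - R_MIN) / STEP)                   # nearest stored point at or below r
    return float(F_tab[k] + dF_tab[k] * (r - grid[k]))
```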
4.3 Hardware LNS Cores
The LNS arithmetic cores [Matousek et al. 2002] used for our implementation
were originally developed within the project High Speed Logarithmic Arithmetic (HSLA). Since the completion of this project we have created an updated
version with additional features, and higher clock speeds.
We use 32-bit and 19-bit LNS precisions, where the integer part is always of
length e = 8 and the fraction part is of length f = 23 and f = 10, respectively.
As presented in Section 4.1, the 32-bit LNS is comparable to the IEEE single
precision floating point, in terms of the range and precision. The 19-bit precision
LNS format maintains the same range as 32-bit but has precision reduced to
10 fractional bits. It is comparable to the 16-bit floating-point formats used in
commercial DSP devices. All units are available both in 32-bit and in 19-bit
versions.
All hardware macros (cores) contain logic for handling exceptions, that is, detecting invalid operations and performing overflow or underflow checking. The multiplication, division, and square root (MUL/DIV/SQRT) modules are implemented as macros with a latency of 1 cycle. The LNS addition/subtraction (ADD/SUB) module is designed as an 8-stage, fully pipelined arithmetic unit with two separate adder/subtracter pipelines. The look-up tables are stored in dual-
ported on-chip Block RAMs. The main reason the adder/subtracter units have
two separate pipelines is the presence of two ports on the Block RAMs; a dual
pipeline adder/subtracter is not much more expensive than one with a single
pipeline. The architecture of individual LNS arithmetic units is described in
detail in Tichy [2006].
Table II. Comparison of Underwood's Highly Optimized IEEE Single Precision Floating-Point Units with the 32-bit LNS Units in a Xilinx Virtex-II XC2V6000 (6-million gate) FPGA

                       ADD            2-pipe ADD         MUL              DIV
                   FLP     LNS      FLP      LNS      FLP     LNS      FLP     LNS
Slice Flip Flops   696      —     1,392    1,702      821      35    2,476      35
4 input LUTs       611      —     1,222    2,135      722     139    2,220     145
Occupied Slices    496      —       992    1,648      598      83    1,929      82
Block RAMs           0      —         0       28        0       0        0       0
MULT18X18s           0      —         0        8        4       0        0       0
Latency [cc]        13      —        13        8       16       1       37       1
Clock rate* [MHz]  165      —       165       80      124     200      100     200

* Figures for clock rate are informative only and can vary according to the context in which the units are used. FLP data are related to XC2V6000-5 (speed grade 5); LNS data are related to XC2V6000-4 (speed grade 4).
4.4 Comparing LNS with Floating-Point
To demonstrate advantages and disadvantages of our LNS arithmetic units
more clearly, we compare these arithmetic units with Underwood’s IEEE
floating-point (FLP) units [Underwood 2004]. Underwood developed highly optimized IEEE single (32-bit) and double (64-bit) precision floating-point units.
Since our 32-bit LNS arithmetic corresponds to the IEEE single precision arithmetic [Institute of Electrical and Electronics Engineers, Inc. 1985], we exclude
double precision floating-point and 19-bit LNS arithmetics from our comparisons. Table II shows the resource requirements and parameters of our 32-bit
LNS units compared to Underwood’s IEEE single precision floating-point units.
4.4.1 Chip Area. The major disadvantage of the LNS arithmetic is the
number of on-chip Block RAMs occupied by the adder/subtracter (ADD/SUB)
unit for storing look-up tables. LNS ADD/SUB units are always instantiated in
pairs. The main reason the adder/subtracter unit has two separate pipes is the
presence of two ports on Block RAMs, which means a dual-pipe ADD/SUB unit
is not much more expensive than the one with a single pipe. It is evident that
resource requirements for a pair of LNS ADD/SUB pipes are significantly higher
than for a pair of standard 32-bit floating-point ADD/SUB units. With regard
to area (slices), a pair of FLP ADD/SUB units occupies around 60% of a pair
of LNS ADD/SUB units. This apparent drawback in size of units is, however,
compensated for by other parameters and the advantages of other LNS units.
The LNS multiplier (MUL) unit occupies only a small fraction (around 14%)
of the size of the floating-point multiplier unit—even when embedded (on-chip)
multipliers are used. Perhaps the most common operations in many DSP and matrix algorithms are multiplication and addition. When we sum the resources required by a single multiply-add pipe, we find that using floating-point
units is still preferable. The situation is different when using two multiply-add
pipes: we can see that the LNS units require fewer resources (except for Block
RAMs). Notwithstanding these facts, many DSP algorithms require division
and/or square root operations. While figures for a floating-point square root
unit are not available, data for a divider (DIV) unit (an LNS DIV unit occupies
less than 5% of an FLP DIV unit) show that using the LNS architecture can
result in a considerable advantage. One can argue that the number of Block
RAMs occupied by the LNS ADD/SUB unit is an insuperable obstacle to a practical
application of the LNS architecture. It is apparent that the decision about what
kind of architecture or arithmetic would be the most suitable platform for a
particular problem always depends on more than one factor. As we will see
later in this section, the number of Block RAMs used is not always a limiting
factor, even in considerations related to chip area.
4.4.2 Clock Speed and Latencies. Another important issue is the clock
speed and latencies of functional units. Underwood’s adder/subtracter unit can
be clocked at up to 165MHz on a Virtex-II XC2V6000-5 FPGA (six-million gate,
speed grade 5 device), with a latency of up to 13 clock cycles. The FLP multiplier unit can be clocked at about 125MHz on the same device with a latency
of 16 clock cycles, whereas the FLP divider unit can be clocked at 100MHz, but
with a latency of 37 clock cycles. Underwood’s units are highly configurable,
and the latency can be reduced at the cost of a corresponding reduction in clock
speed. In contrast, latencies of the LNS adder/subtracter, multiplier, and divider
units are 8, 1, and 1 clock cycles, respectively. The LNS adder/subtracter can be
clocked at up to 80MHz; the LNS multiplier and divider at 200MHz. These figures were obtained for a Virtex-II XC2V6000-4 FPGA (six-million gate, speed
grade 4 device) which is approximately 10%–15% slower than the Virtex-II
XC2V6000-5 FPGA [Xilinx, Inc. 2005], so we can reasonably expect the LNS
adder/subtracter unit to reach 90MHz on a speed grade 5 device. While the LNS
adder/subtracter unit is about 40% slower than a corresponding floating-point
unit, the multiplier and divider units are dramatically faster. When comparing the clock speed of LNS and FLP units, we have to use the figures of the slowest units:
FLP DIV and LNS ADD/SUB.
In addition to the clock speed, the latencies of arithmetic units are also important when comparing LNS with floating point. This is particularly true if
an arithmetic module is on the critical path. As an example, we use the implementation of our GSFAP core, whose architecture is presented in Section 5. By
inspection of Step 2 of the GSFAP algorithm in Figure 1, we can see that the
divider is required in each iteration of the Gauss-Seidel procedure. The FLP
divider has a latency of 37 clock cycles whereas the latency of the LNS divider
is 1 clock cycle. Using floating point would increase the latency of
the GSFAP core by 36N clock cycles. This is a significant difference in performance. Thus, LNS arithmetic units can result in a substantially faster design
when compared with using floating point, depending on the mix of operations
in the algorithm. The difference in performance may be particularly large when
division or square root operations are on the critical path of the algorithm.
4.4.3 Precision and Accuracy. A final important issue is the precision and
accuracy of operations. All floating-point operations are liable to a maximum
half-bit rounding error [Koren 2002; Institute of Electrical and Electronics Engineers, Inc. 1985]. The LNS add and subtract operations show the same or
smaller rounding error [Coleman and Chester 1999; Coleman et al. 2000], except for subtraction where operands are closely matched. Given that LNS multiplication and division are implemented as fixed-point addition and subtraction,
Fig. 4. Resource utilization of Underwood’s highly optimized IEEE single precision floating-point
units and of the 32-bit LNS units when used for implementation of various algorithms in a Xilinx
Virtex-II XC2V1000 (1-million gate) FPGA.
respectively, no rounding takes place. In order to confirm the precision of LNS,
we compared the signal-to-noise (SNR) ratios for the 32-bit floating-point and
the 32-bit LNS implementations of several algorithms (using various parameters). The double precision floating-point implementations were used as a baseline. We obtained comparable results for both FLP and LNS in most cases. More
details on these measurements can be found in Section 6.1. These results are
not unexpected for algorithms with a roughly equal balance of additions and
multiplications, because the additions still have a half-bit rounding error, although the multiplications introduce no rounding error. Clearly the benefits
to be gained by using LNS will vary depending on the ratio of add/subtract
to multiply/divide operations and the sign and range of the operands. A more
detailed discussion on the precision and accuracy of the LNS arithmetic used
in our implementation including the corresponding signal-to-noise ratios can
be found in Coleman et al. [2008].
4.4.4 Resource Utilization for Different Applications. To demonstrate and
compare FLP/LNS area requirements for various combinations of arithmetic
units, we use a smaller Xilinx Virtex-II XC2V1000 FPGA as a normalized area.
Figure 4 shows resource utilization of Underwood's single precision floating-point units and the 32-bit LNS units when used for implementation of various algorithms in a one-million-gate Virtex-II XC2V1000 chip. For a single multiply-accumulate (MACC) module, we can see that using floating-point units would
be more effective. In the case of two MACC modules, using LNS units requires slightly fewer resources (excluding on-chip Block RAMs). A typical example where
two MACCs can be effectively employed is the LMS algorithm. The most appropriate arithmetic for this kind of application is not always clear. Fixed point
has some attractions because of the simplicity of such algorithms, provided that the application can accept the precision and range of values offered by fixed
point. However, the situation changes dramatically when there is a need for
a divider. Typical examples of such applications are adaptive filters based on
NLMS or GSFAP algorithms. It can clearly be seen that the area required by
LNS units is less than 50% of the area needed by FLP units for the NLMS algorithm; using floating-point units for the implementation of the GSFAP algorithm
would not be feasible, given that four multipliers are required for an efficient
implementation of the algorithm.
4.4.5 Conversions. In order to use the LNS arithmetic, input numbers may
have to be converted to the LNS format and the output may have to be converted
back to the required format. Conversions will be required if the whole system
does not use the LNS to represent real numbers, and instead fixed or floating
point numbers are used in other parts of the system.
At first sight, such conversions appear to be a major disadvantage of LNS
arithmetic. However, the kind of applications that we target usually take their
input from an analog to digital (A/D) converter and direct their output to a digital to analog (D/A) converter. Thus, conversion modules from fixed-point (FXP)
to LNS and vice versa are required. Such units supporting FXP ⇔ LNS conversions can be implemented in a relatively inexpensive way. Our conversion
unit requires that the “FXP” numbers are two's complement fractions, that is, fixed-point values within the range [−1, 1).¹ The conversion of a fixed-point value to the LNS format is based on decomposing the FXP number, using the look-up tables and logarithmic addition/subtraction operations.
Since conversion modules employ an LNS ADD/SUB unit, they require only
one supplementary Block RAM and a little additional logic beyond what is already required for ADD/SUB. In particular, the 24-bit fixed-point ⇔ 32-bit LNS
conversion unit occupies 246 slices (296 Flip-Flops and 250 4-input LUTs) of a
Virtex-II device. The latency of the FXP ⇒ LNS is 19 and of the LNS ⇒ FXP
is 46 clock cycles. Both operations are partially pipelined, so FXP ⇒ LNS can
process new data every second clock cycle, while LNS ⇒ FXP every ninth clock
cycle. If necessary this unit can also be used for FLP ⇔ LNS conversions. The
principles of conversions are described in Pohl et al. [2003].
In most DSP applications hundreds or thousands of log operations must
typically be performed for every data sample that needs to be converted. The
only major exception to this that we are aware of is image processing, an area
that we do not focus on. In view of this fact and considering figures presented
in the previous paragraph, conversions involved in a system that uses the LNS
are not a limiting factor.
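As an illustration of the input side only, the sketch below interprets a two's complement A/D word as a fraction in [−1, 1) and encodes it as an LNS pair. The word width is a placeholder, and the real converter core works by decomposition and table lookup rather than by calling a logarithm function.

```python
import math

def fxp_to_lns(sample, bits=24, f_bits=23):
    """Convert a two's complement integer sample (interpreted as a fraction in
    [-1, 1)) to an (S, E) LNS pair with E scaled by 2**f_bits; illustrative only."""
    x = sample / float(1 << (bits - 1))          # scale to the fractional range
    if x == 0.0:
        return 0, None                           # zero has a reserved bit pattern in hardware
    s = 0 if x > 0 else 1
    return s, int(round(math.log2(abs(x)) * (1 << f_bits)))
```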
¹ Note that many A/D converters simply return an N-bit (often 8-, 16-, 20- or 24-bit) two's complement number as their output, and it is up to the designer of the system to interpret this value as a fixed-point number and scale it if necessary. All our implementations have used a linear A/D converter, and thus require an FXP ⇔ LNS converter. However, some other A/D converters return a value that is non-linearly related to the phenomenon it is measuring.
Also, it should be noted that in communications and audio processing, so-called companding A/D and D/A converters [Kikkert 1974; Tsividis et al. 1990; Whitehouse 2006] with logarithmic characteristics are used. These converters, employing the A-law or μ-law algorithm, are usually implemented either in hardware or in a DSP directly at the point of data conversion. Using the output of
one of these algorithms would probably result in savings in terms of FPGA resource utilization, since no special LNS conversion would be needed. However, we have not studied this option yet in terms of SNR and resulting
precision, and we feel that this topic goes somewhat beyond the scope of this
article.
4.4.6 Summary. Taking all the facts stated above into consideration, we
can conclude that 32-bit LNS arithmetic can deliver substantially better area usage, speed, and accuracy than floating-point arithmetic, particularly
when more multipliers or dividers are required. A further advantage of using LNS arithmetic is that the designer must focus only on efficiently using
ADD/SUB units, because they are the only substantial hardware units. The
other units are so small and fast that they could be replicated to simplify the
design or reduce routing delays. In contrast, designing a system using FLP
arithmetic may be more complicated because the adders, multipliers, dividers,
and square root units are all substantial pieces of hardware that must be used
efficiently to get good overall performance.
Lastly, it should be noted that Haselman et al. [2005] also discuss the advantages and disadvantages of the LNS arithmetic. Although those authors come
to similar conclusions, their work completely omits information on the accuracy
of their implementation of LNS add/subtract operations in relation to the size
of the approximation tables.
5. GSFAP ARCHITECTURE
In this section we present the architecture of our GSFAP core and describe the
structure of its individual functional units. The algorithm employs one LNS
addition/subtraction dual-pipe unit (denoted ADD/SUB A and B—two separate,
parallel pipelines), four multiplication units (denoted MUL A, B, C, and D) and
one division unit (denoted DIV A). Nonscalar data structures, that is, vectors and a
matrix, are stored in on-chip dual-port Block RAMs.
The top-level architecture of the GSFAP unit is depicted in Figure 5. Blocks in
the diagram roughly correspond to individual steps of the algorithm presented
in Figure 1. The blocks and operations depicted on the left-hand side of the
diagram employ the first ADD/SUB pipeline (pipeline A) while the operations on
the right-hand side employ the second pipeline of the ADD/SUB unit (pipeline
B). The block denoted “UX update” does not use the ADD/SUB unit at all and the
block denoted “WW[0,1] update” utilizes both pipelines. The time line on the
far right side of the diagram indicates the clock cycles during which different
parts of the design are active in the GSFAP-based filter with the parameters
L = 1000 (filter order) and N = 9 (projection order). For example, the righthand dot-product unit (which computes uTk ŵk−1 ) is active from clock cycle 2 to
cycle 1043, and uses pipeline B of the ADD/SUB unit.
Fig. 5. Block diagram and data dependencies of the top-level architecture of GSFAP unit.
5.1 Excitation Signal Vector (Block RAM UX) Update
In the following paragraphs, a description of individual components and data
structures is given. The first task is to acquire new input signal samples uk and
d k , and to update the excitation signal vector uk . It is also necessary to update
two vectors ξ k and ξ k−L which are needed for the update of correlation matrix
Rk . Another part of the algorithm where input signal samples are needed is the
update of the alternate coefficient vector ŵk : it is necessary to keep a delayed
excitation signal vector uk−N +1 (see Step 4b in Figure 1). As mentioned in
Section 3, the vector ξ k overlaps the first N elements of vector uk . With respect
to the structure of uk it is apparent that the first L − N + 1 elements of uk−N +1
overlap the last L − N + 1 elements of uk . Similarly, the last N − 1 elements
Fig. 6. Contents of the UX Block RAM consisting of a delayed sequence of the most recent L + N
input signal samples.
of uk−N +1 are overlapped by the first N − 1 elements of ξ k−L . Thus all vectors
consisting of a sequence of input signal samples, uk , uk−N +1 , ξ k and ξ k−L , can be
stored in a single memory block of length L + N that we refer to as UX memory
(or vector). The contents of UX memory at time k is shown in Figure 6.
The module for updating this UX storage is denoted “UX update” in Figure 5.
The UX storage is implemented as a circular buffer, to avoid shifting each time
a new input signal sample is acquired. The state of the buffer is kept in the
register UX state that is of the same width as the number of UX address lines.
To update UX, we only need to take two steps: (1) decrement the register UX state
and (2) write a new data sample uk to UX to the position given by the value in that
register, that is, to UX[UX state]. Since this operation consists of storing one
value into a Block RAM and decrementing a single register, it takes, together
with acquiring new data, only 2 clock cycles regardless of L or N .
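A software model of this circular-buffer scheme may help; the class below mimics the UX Block RAM and the UX_state register. The names and the modulo wrap-around arithmetic are ours; the hardware simply decrements an address register.

```python
class UXBuffer:
    """Model of the UX storage of Section 5.1: the last L + N input samples
    held in a circular buffer, with logical index i mapping to u_{k-i}."""

    def __init__(self, L, N):
        self.size = L + N
        self.mem = [0.0] * self.size      # models the UX Block RAM
        self.state = 0                    # models the UX_state register

    def push(self, u_k):
        # Step (1): decrement UX_state; Step (2): write u_k at UX[UX_state].
        self.state = (self.state - 1) % self.size
        self.mem[self.state] = u_k

    def sample(self, i):
        """Return u_{k-i} for i = 0 .. L+N-1 (no data are ever shifted)."""
        return self.mem[(self.state + i) % self.size]
```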
5.2 Update of the Correlation Matrix Rk
The next module of the GSFAP unit is denoted “R update”; its job is to update
the autocorrelation matrix of the excitation signal Rk . To update the correlation
matrix, we need to calculate
Rk = Rk−1 + ξk ξTk − ξk−L ξTk−L.    (11)
Since this matrix is symmetric, its update requires N(N + 1) multiply-accumulate operations, and N(N + 5)/2 read and N² write memory accesses (provided that the whole matrix is held in memory—not just an upper or lower
triangle). However, it can be shown [Tichy 2006] that we can get the same
result at a much lower cost. Let us define the correlation vector
rk = [ r0,k  r1,k  …  rN−1,k ]T,    (12)

which is actually the first column (or row) of the matrix Rk. This correlation vector can be updated in each iteration as follows:

rk = rk−1 + ξ0,k ξk − ξ0,k−L ξk−L,    (13)
where ξ0,k and ξ0,k−L are the first (uppermost) elements of the vectors ξ k and
ξ k−L , respectively. The matrix Rk can be constructed so that each element of
the original matrix Rk−1 is shifted diagonally (along the main diagonal) by 1
and the vector rk is used as the first column/row of the matrix Rk , where rk
has been updated according to (13). The operation is schematically depicted in
Fig. 7. Schematic representation of the update of matrix Rk by shifting the original matrix Rk−1
diagonally and replacing its first column/row with the updated correlation vector rk .
Figure 7. On the assumption that we can implement shifting of the left-upper
submatrix diagonally without excessive memory access (see “reindexing” in the
following), this procedure requires only 2N multiply-accumulate operations,
and N read and 2N − 1 write memory accesses, which represents dramatic
savings of operations compared to the updating of Rk according to (11).
A fundamental question is how to store the matrix Rk in memory. This matrix
is symmetric, which suggests that resource usage could be minimized by storing
only its upper or lower triangle. However, we decided to store the whole matrix
in a Block RAM rather than its upper/lower triangle, even though it almost
doubles memory requirements. The reasons for this decision are as follows.
(1) The dimensions of R are the same as the projection order, which is reasonably small. A matrix R with dimensions of up to N = 22 will fit in a single
on-chip (Virtex-II) Block RAM, provided that a 32-bit arithmetic is used.
Experiments show that using a projection order higher than N = 20 does
not improve algorithm performance in practice, that is, its convergence and
tracking properties.
(2) The update of the matrix R can be implemented very efficiently using “reindexing” (see the following).
(3) Control logic necessary to calculate addresses of individual elements of the
matrix is simpler and faster.
The memory storage for the matrix R is referred to as R Block RAM. Individual
elements of R are stored in R in a column-wise manner.
Before we describe the actual hardware structure that can be used to update
R storage, it is essential to answer the following question: “How can we shift
the N − 1 by N − 1 left-upper submatrix of Rk−1 along the main diagonal to get
the N − 1 by N − 1 right-lower submatrix of Rk without superfluous accesses
to a memory?” The answer is obvious—a mechanism for rearranging indices
of appropriate elements of submatrix Rk−1 so that they become elements of
submatrix Rk . Our implementation of such procedure is based on a technique
similar to using a circular buffer, which we call “reindexing.” The principle of
updating the storage R is depicted in Figure 8. The actual state of the buffer is
kept in the register R state, which is decremented by the value N + 1; and the
Fig. 8. The principle of “reindexing” using a circular buffer: updating of the matrix Rk by shifting
the original matrix Rk−1 diagonally and replacing its first row/column with the updated correlation
vector rk in the Block RAM R.
Fig. 9. (a) Architecture schematic of the update of correlation matrix Rk stored in the Block RAM
R; (b) Architecture schematic of the one step—calculation of the i-th element of vector pk —of the
Gauss-Seidel (GS) iteration procedure.
elements of updated correlation vector rk are stored into positions corresponding to elements of the first row/column of matrix Rk .
Then, the “R update” module can be implemented very effectively in hardware, the architecture of which is schematically depicted in Figure 9(a). The
values of vector ξ k are read from the Block RAM UX and the multiplication
ξ0,k ξ k using the MUL B unit is invoked. Concurrently, another “process” controls
reading values from the Block RAM R to get the values of vector rk−1 (the first
column/row of Rk−1 ). The results of multiplication are successively added to the
values read from R using the ADD/SUB A unit, which results in implementation
of rk−1 + ξ0,k ξ k . After all values of rk−1 have been read from the Block RAM R,
the value of the register R state is updated—decremented by N +1—to prepare
the buffer R for updating. When the pipeline MUL B → ADD/SUB A is freed, the
values of ξ k−L are read from the Block RAM UX and the multiplication ξ0,k−L ξ k−L
is started. The results of this multiplication are successively subtracted from
values rk−1 + ξ0,k ξ k and propagated through the feedback path implemented
using the ADD/SUB pipe and the FIFO. The results of subtraction, forming the
vector rk , are written simultaneously to both Block RAM ports of R to positions
representing the first row/column of the matrix Rk .
The “R update” operation is fully pipelined without any interim pipeline stall,
with the result that the ADD/SUB A is fully utilized. Due to the use of ADD/SUB
pipe as an intermediate storage for partial results, no additional storage (except
the FIFO) is needed. These factors in particular make the “R update” module very
efficient. The length of the FIFO is unambiguously determined by the dimensions of the
matrix R, that is, by the projection order N . In order to use the feedback path
as a temporary storage for the full set of intermediate results rk−1 + ξ0,k ξ k , the
length of ADD/SUB pipe plus length of FIFO must correspond to N . Then, the
length of the FIFO can be expressed as Nfifo = N − NA, where NA is the length of
the ADD/SUB pipeline.
Considering the latencies of the LNS adder and multiplier, this operation
takes only 30 clock cycles for a matrix of order N = 9 . It is apparent that
our solution is dramatically faster than the simple approach according to (11),
which involves N (N + 1) multiply-accumulate operations. In addition there are
also savings in the number of memory accesses.
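In software terms, the reduced-cost update of Section 5.2 can be modeled as below. The diagonal shift is written as a copy here, whereas the hardware achieves the same effect purely by reindexing the circular R buffer; the function name and argument layout are ours.

```python
import numpy as np

def update_R_cheap(R_prev, xi_k, xi_kL):
    """Update the N x N autocorrelation matrix using Eq. (13) plus a diagonal
    shift (Figure 7), instead of the full update of Eq. (11).

    Only the first column/row is recomputed (2N MACCs); every other element of
    R_k equals the element of R_{k-1} one step up and to the left.
    """
    r_k = R_prev[:, 0] + xi_k[0] * xi_k - xi_kL[0] * xi_kL   # Eq. (13)
    R_k = np.empty_like(R_prev)
    R_k[1:, 1:] = R_prev[:-1, :-1]    # shift the upper-left submatrix diagonally
    R_k[:, 0] = r_k                   # new first column ...
    R_k[0, :] = r_k                   # ... and first row (R is symmetric)
    return R_k
```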
5.3 GS Solver: Update of pk
One of the key modules of the algorithm is the Gauss-Seidel (GS) solver, which
is used to calculate the vector pk (Step 2 of the algorithm in Figure 1). Unfortunately, the GS procedure is actually a sequential algorithm. Each element
pi,k of the vector pk depends on previously computed elements, p0...i−1,k and
pi+1...N −1,k−1 . Hence, the individual elements of pk cannot be updated simultaneously, so parallelization is not feasible. Instead we use a pipelined architecture, in which we try to minimize the latency of one step of the algorithm, that
is, the latency of the pi,k calculation. The hardware architecture for the calculation of one step, that is, of one element of vector pk , is depicted in Figure 9(b).
This operation has to be performed N times to update the whole vector p. The
computation of the next element can start after the previous value has been
written to the Block RAM referred to as PEPS. We use this Block RAM to store
both vectors p and ε.
In order to minimize the cycle count, we use both ports of Block RAMs R and
PEPS and two MUL (A and B) units in parallel. The multiplication results are
summed and the result of summation forms the dot product of two N − 1 length
vectors. Now, recall that the vector b has the value 1 in its first element, and
all other elements have the value 0. Because of this, we need to subtract the
value from 1 only in the first step, but in all other steps we are subtracting
from zero, which can be implemented by simply negating the sign bit—
this manipulation is depicted in Figure 9(b) as the “branch path” consisting of
subtracting the SUM result from ’1’, the component NegSGN (sign negation) and
the multiplexer. Considering the ADD/SUB unit latency and that the negation
of the sign bit costs virtually nothing, we can save 9(N − 1) clock cycles for
Fig. 10. (a) Organization of the Block RAMs UX and WW[0,1] connections to the dot-product unit, which calculates uTk ŵk−1; (b) Architecture schematic of the update of alternate coefficient vector ŵk stored in Block RAMs WW[0,1].
a complete “P update.” The result of this operation (subtraction or negation)
is then divided by the corresponding diagonal element Rii,k of the correlation matrix
and the result pi,k —the i-th element of vector pk —is finally written to the Block
RAM PEPS. It is also important to recall that although the GS procedure is a
naturally sequential algorithm, the LNS multiplication and division are fast
and cheap, so the resulting hardware is highly efficient.
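One step of this GS procedure, including the "subtract from 1 only for i = 0, otherwise just negate the sign" trick, can be written as follows. This is a scalar model of Figure 9(b); in hardware the two partial dot products run on MUL A/B and both Block RAM ports in parallel.

```python
def gs_step(R, p, i):
    """Compute p[i] in one Gauss-Seidel step for R p = e1 (b = [1, 0, ..., 0]^T).

    The dot product skips the diagonal term; for i = 0 the partial sum is
    subtracted from 1, for i > 0 it is merely negated (b_i = 0), and the
    result is divided by the diagonal element R[i][i].
    """
    s = sum(R[i][j] * p[j] for j in range(len(p)) if j != i)
    p[i] = ((1.0 - s) if i == 0 else -s) / R[i][i]
    return p[i]
```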
5.4 Filter Output and Estimation Error
The next step of the algorithm is to compute the filter output y k and the estimation error ek as shown in Step 3 in Figure 1. The right-hand term of Step 3a—
dot product ε̄Tk−1 R̃0,k —is computed just after the vector pk has been updated. In
Section 5.3, we mentioned that vectors pk and εk are both stored in the single
Block RAM denoted PEPS. We do this partly because these vectors are of length
N and they can easily fit into a single Block RAM.2 But this arrangement also
allows us to use the pipeline R, PEPS → MUL A, B → SUM (used in the update of
pk —see Figure 9(b)) for calculation of the dot product ε̄Tk−1 R̃0,k, without creating any extra hardware. Both dot-product operations are performed on vectors
of length N − 1, so the only necessary modification is to change the addressing
of Block RAMs R and PEPS. The resulting value is then multiplied, using the
MUL D unit, by the value of μ in order to get the result με̄Tk−1 R̃0,k.
As depicted in Figure 5, the “long” dot product—of two vectors of length given
by the filter order L—uTk ŵk−1 is calculated in parallel with previously described
blocks: “R update,” “P update,” and the dot product με̄Tk−1 R̃0,k. The organization
² Considering that N is usually less than 20, all three variables Rk, pk and εk could easily fit into a single Block RAM, but this arrangement would be inconvenient with respect to the insufficient number of memory access ports when updating pk and for calculation of ε̄Tk−1 R̃0,k. This is clearly demonstrated in Figure 9(b).
of Block RAMs connected to the dot-product unit, which employs the MUL C
unit and the ADD/SUB B unit, is depicted in Figure 10(a).
Results of the two dot-product operations, με̄Tk−1 R̃0,k and uTk ŵk−1, discussed in the previous paragraphs are used to calculate the filter output yk and the estimation error ek, but to save a few more clock cycles they are processed in the following way:

ek = dk − με̄Tk−1 R̃0,k − uTk ŵk−1
yk = uTk ŵk−1 + με̄Tk−1 R̃0,k

The two subtractions, dk − με̄Tk−1 R̃0,k and (·) − uTk ŵk−1, are performed using the ADD/SUB A unit, whereas the addition uTk ŵk−1 + με̄Tk−1 R̃0,k is performed on the ADD/SUB B unit. This order of operations ensures that the filter output yk and the estimation error ek are available in the same clock cycle, which is schematically depicted in Figure 5.
5.5 Alternate Coefficient Vector Update
The last two modules of the GSFAP unit are the “EPS update,” which performs
the update of the normalized estimation error vector εk, and the “WW[0,1] update,”
which performs the update of alternate coefficient vector ŵk . The two modules
have a similar structure, with the former simpler than the latter (see Step 4 of
the algorithm in Figure 1).
The “EPS update” module performs a simple pipelined multiply-add operation. It reads its operands from the Block RAM PEPS. Since both vectors p_k and ε_{k−1} are stored in this Block RAM, data are read from both ports. After all operands have been retrieved from the Block RAM PEPS, the updated elements of vector ε_k are successively stored back to the Block RAM using its port A. Before port A of PEPS is freed, the ADD/SUB pipeline and an auxiliary FIFO of length N − N_A are used as a buffer for the results to be stored into the Block RAM.
After the vector ε_k has been updated, its last element ε_{N−1,k} is multiplied by μ using the MUL D unit. The result of the multiplication is stored in the register μEPSreg, the contents of which are then used for the update of ŵ_k.
The last task to complete one iteration of GSFAP is to update the alternate
coefficient vector ŵk , which is performed by the module “WW[0,1] update.” Its
hardware architecture is schematically shown in Figure 10(b). We decided to
split the coefficient vector wk into two parts, which are stored in separate Block
RAMs denoted WW0 and WW1. The reason is that the data bandwidth to Block
RAMs is limited by the number of ports. Splitting the vector w allows us to use
two independent pipelines to update both halves of w in parallel, as depicted in the figure, in order to further reduce the execution time (the number of clock cycles) of a GSFAP iteration. This stage fully utilizes both pipelines of the
ADD/SUB unit and both MUL A and B units.
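
For reference, the arithmetic behind these two update modules can be sketched in a few lines of Python. The sketch assumes the conventional FAP-style Step 4, ε_k = e_k p_k + [0, ε̄_{k−1}^T]^T followed by ŵ_k = ŵ_{k−1} + μ ε_{N−1,k} u_{k−N+1}, which matches the multiply-add structure and the use of μ ε_{N−1,k} described above; the exact Step 4 in Figure 1 may differ in detail, and the WW0/WW1 split is a hardware-only optimization that is not reflected here. All names are ours.

import numpy as np

def step4_updates(e_k, p_k, eps_prev, w_hat, u_delayed, mu):
    """Sketch of the "EPS update" and "WW[0,1] update" arithmetic.

    eps_prev  : eps_{k-1}, the length-N normalized error vector
    u_delayed : input vector u_{k-N+1} of length L (delayed by N-1 samples)
    Returns the updated eps_k and w_hat_k.
    """
    # EPS update: pipelined multiply-add over the vectors held in PEPS.
    eps_k = e_k * p_k
    eps_k[1:] += eps_prev[:-1]           # shifted accumulation of eps_{k-1}

    # WW update: scale the delayed input by mu * eps_{N-1,k} (the value
    # latched in muEPSreg) and accumulate it into the alternate weights.
    w_hat_k = w_hat + mu * eps_k[-1] * u_delayed
    return eps_k, w_hat_k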
6. EXPERIMENTAL RESULTS
6.1 Convergence Properties and Stability
In order to investigate the behavior and performance of adaptive filtering algorithms, we implemented several algorithms in a simulation environment. These
were LMS, NLMS, RLS, APA, MFAP, CGFAP, and GSFAP. The algorithms were
implemented using IEEE 64-bit (double) and 32-bit (single) floating point, and using 32-bit and 19-bit LNS arithmetic.
In our simulations, the unknown system is modelled by a finite impulse
response (FIR) filter of order L and its transfer function is given by the filter
coefficients. We refer to these coefficients as the optimal weights or the optimal
weight vector, w_opt. In most experiments, the individual weights were chosen as random values. The weight vector was, however, normalized so that ‖w_opt‖ = 1, to avoid amplification or attenuation of the input signal. In the rest of the
experiments, an impulse response measured in an acoustic studio was used as
a model.
We decided to test the algorithms with three different filter lengths, with
50, 250, and 1000 filter coefficients. This provided a realistic simulation environment for applications where a very high order adaptive filter is required,
particularly for acoustic applications like adaptive noise cancellation.
Two different measures to compare the convergence of individual algorithms
were used, the system error norm and the mean squared error. The system error
norm (SEN) can be expressed as
    SEN_k = 10 log10( ‖w_k − w_opt‖² )   [dB],                     (14)
which is the squared norm of the difference between the adaptive filter weights, w_k, and the unknown system model coefficients represented by w_opt. The other
measure is the mean squared error (MSE) and can be calculated as
    MSE_k = (1/K) Σ_{i=0}^{K−1} e²_{i,k},                          (15)
which is the average of the squared estimation error over K independent runs of an experiment. The indices i and k denote the experiment number and the iteration, respectively, so e_{i,k} is the estimation error in the k-th iteration of the i-th experiment. The MSE can also be expressed in decibels (dB), that is, as 10 log10(MSE_k).
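
Both measures are straightforward to compute in a simulation harness. A small Python helper mirroring Eqs. (14) and (15) (array and function names are ours):

import numpy as np

def system_error_norm_db(w_k, w_opt):
    """SEN_k = 10*log10(||w_k - w_opt||^2), Eq. (14)."""
    return 10.0 * np.log10(np.sum((w_k - w_opt) ** 2))

def mean_squared_error(errors):
    """MSE_k of Eq. (15): e_{i,k}^2 averaged over K independent runs.

    errors: array of shape (K, num_iterations).  Returns one MSE value per
    iteration k; it can be converted to decibels as 10*log10(MSE_k).
    """
    return np.mean(errors ** 2, axis=0)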
To simulate a more realistic environment, additive white Gaussian noise was
added to the system output. The noise was a zero-mean Gaussian process with variance σ_n² = 10⁻⁴. This noise determines the theoretical value of the minimum MSE that can be reached if the coefficient vector of the adaptive filter, w_k, is identical to its optimal value, that is, to the coefficient vector of the unknown system, w_opt.
For the filter input, we use two types of signal: white and colored noise. The white noise used is a zero-mean Gaussian process with variance σ_u² = 1. To generate the colored noise, we use a first-order Markov process, which is generated by applying white Gaussian noise, with variance σ_inp², to a first-order auto-regressive filter with transfer function

    H(z) = 1 / (1 − a z⁻¹),                                        (16)

where a is a fixed parameter. The colored noise, that is, the adaptive filter input, is chosen to have the variance σ_u² = 1. In order to generate the output of an auto-regressive filter with this variance, the variance of the input white Gaussian noise is given as σ_inp² = 1 − a².

[Figure 11: System Error Norm [dB] versus Samples for an adaptive filter with L = 1000; NLMS and GSFAP: μ = 0.7; RLS: λ = 0.999; colored excitation; curves: NLMS (lns32), GSFAP10 (lns32), RLS (lns32).]
Fig. 11. Convergence rates of the NLMS, GSFAP, and RLS algorithms using the 32-bit LNS arithmetic.
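
The colored excitation used for Figure 11 is easy to reproduce; the following Python sketch follows Eq. (16) and the variance scaling σ_inp² = 1 − a² (function and variable names are ours):

import numpy as np

def colored_noise(num_samples, a, seed=None):
    """First-order Markov (AR(1)) excitation with (asymptotically) unit variance.

    White Gaussian noise with variance 1 - a**2 is passed through the
    filter H(z) = 1 / (1 - a z^-1) of Eq. (16).
    """
    rng = np.random.default_rng(seed)
    white = np.sqrt(1.0 - a ** 2) * rng.standard_normal(num_samples)
    u = np.empty(num_samples)
    prev = 0.0
    for n in range(num_samples):
        prev = a * prev + white[n]      # y[n] = a*y[n-1] + x[n]
        u[n] = prev
    return u

u = colored_noise(30_000, a=0.9)        # highly colored excitation (a = 0.9) as in Figure 11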
In Figure 11, we present the convergence rate of three algorithms: NLMS,
GSFAP with the projection order N = 10 denoted as GSFAP10, and RLS with
parameters depicted in the title of the figure. In this experiment we used highly
colored noise (with a = 0.9) as the excitation signal u. It can clearly be seen
that even though the GSFAP algorithm does not match the RLS algorithm, its convergence rate is far superior to that of NLMS with the corresponding algorithm parameters. In addition, the complexity of GSFAP is significantly lower than that of RLS.
Besides the SEN and MSE, we also measured the signal-to-noise ratios (SNR) for the 32-bit floating-point and the 32-bit LNS implementations of these algorithms (using various parameters), in order to compare their numerical stability. As mentioned in Section 4.4.3, we obtained comparable results for both the FLP and LNS implementations. We also found that all 32-bit implementations were numerically stable for both highly colored and nonstationary input signals. For example, the GSFAP implementation using FLP showed an SNR of 108.4 dB for the error signal, whereas the corresponding LNS figure was 108.6 dB. There were two deviations from this rule.
(1) The level of precision of GSFAP implemented using the 19-bit LNS arithmetic was not sufficient for some filtering applications. In particular, we
found that the algorithm became numerically unstable for some nonstationary excitation signals such as speech. However, the precision was found
to be sufficient for most other filtering applications, for example canceling
random noise in speech.
(2) The results of experiments with the RLS algorithm when excited with
both white and colored noise showed that for the 32-bit FLP implementation the RLS algorithm becomes unstable very quickly—after about 3500
and 3000 input samples for white and colored noise excitation, respectively. By contrast, RLS run under exactly the same conditions, but implemented using the 32-bit LNS of the same word length, remained stable.
To conclude this discussion, we can state that both the 32-bit floating-point and the 32-bit LNS implementations of the GSFAP algorithm are numerically stable and show very similar results. Another conclusion is that an implementation using the 32-bit LNS can give a considerable advantage over 32-bit FLP arithmetic under some specific conditions, while further reducing the precision (and thus saving hardware resources) by using the 19-bit LNS arithmetic can lead to stability problems in applications where nonstationary signals are involved.
6.2 Hardware Implementation
We developed separate GSFAP cores for the LNS 32-bit and 19-bit precisions.
The parameters of both cores are also fully configurable. It is possible to vary
the filter order L for values 20 ≤ L ≤ 1000, the projection order N for values 2 ≤ N ≤ 20, the step-size parameter μ for values 0 < μ < 2, and the regularization parameter δ for values 10⁻³ ≤ δ ≤ 1. However, modifying the value of N requires minor
architectural changes.
For the cores presented in this section, we fixed the filter length at L = 1000 and the projection order at N = 9. In this configuration, a full iteration of the GSFAP algorithm takes 1597 cycles and performs 4227 logarithmic operations, which represents 2.64 operations per cycle.
Table III shows the area and performance parameters of the 32-bit and 19-bit LNS implementations of the GSFAP algorithm on two Xilinx FPGA devices: the one-million gate Virtex-II XC2V1000-4 and the six-million gate Virtex-II XC2V6000-4. On both devices, the designs can be clocked at a little over 80MHz. This is very close to the maximum clock rate of the LNS cores, indicating that our architecture is not the limiting factor on clock speed. At this clock speed the design performs over 210 million logarithmic operations per second, which is equivalent to 210MFLOPS. The cores can filter signals at a sampling rate of
more than 50kHz.
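
These throughput figures follow directly from the clock rate and the cycle count per iteration; for example (all numbers taken from the text above):

f_clk = 80.0e6                  # clock rate in Hz (a little over 80 MHz)
cycles_per_iteration = 1597     # one GSFAP iteration for L = 1000, N = 9
ops_per_iteration = 4227        # logarithmic operations per iteration

sampling_rate = f_clk / cycles_per_iteration          # ~50.1 kHz ("more than 50kHz")
ops_per_second = ops_per_iteration * sampling_rate    # ~2.12e8  ("over 210 million")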
The 32-bit LNS version of the GSFAP core occupies only a small fraction of the six-million gate Xilinx XC2V6000-4. On the smaller XC2V1000-4 device, it uses
a very large percentage of available resources. In particular, 99% of slices on
the FPGA contain some logic. However, our design demonstrates that FPGAs
are very suitable for implementing complex adaptive filters. It is possible to
implement such adaptive filters with 32-bit precision arithmetic on small FPGAs, suitable for resource-constrained embedded systems.

Table III. Resource Utilization and Parameters of the 32-bit and 19-bit LNS GSFAP Units When Used in Xilinx Virtex-II XC2V1000-4 and XC2V6000-4 FPGAs

                    |         LNS 32-bit        |         LNS 19-bit
                    | XC2V1000-4 | XC2V6000-4   | XC2V1000-4 | XC2V6000-4
Slice Flip Flops    | 4,835  47% | 4,496    6%  | 3,414  33% | 3,234   4%
4 input LUTs        | 6,049  59% | 6,058    8%  | 4,245  41% | 4,294   6%
Occupied Slices     | 5,118  99% | 4,833   14%  | 3,538  69% | 3,353   9%
Tbufs               | 1,280  50% | 1,280    7%  |   192   7% |   192   1%
Block RAMs          |    34  85% |    34   23%  |    12  30% |    12   8%
MULT18X18s          |     8  20% |     8    5%  |     8  20% |     8   5%
Clock rate          | 80.006 MHz | 80.051 MHz   | 80.502 MHz | 80.160 MHz
XPower estimate:    |            |              |            |
  Vccint Dynamic    |   401 mW   |   455 mW     |   188 mW   |   219 mW
  Vccint Quiescent  |    18 mW   |    68 mW     |    18 mW   |    68 mW
  Vccaux Dynamic    |     0 mW   |     0 mW     |     0 mW   |     0 mW
  Vccaux Quiescent  |   330 mW   |   330 mW     |   330 mW   |   330 mW
  Vcco33 Dynamic    |     0 mW   |     0 mW     |     0 mW   |     0 mW
  Vcco33 Quiescent  |     3 mW   |     3 mW     |     3 mW   |     3 mW
  Total Power       |   752 mW   |   860 mW     |   539 mW   |   624 mW
The 19-bit LNS version of the GSFAP core can operate at a clock speed
similar to the 32-bit version, but resource requirements are substantially lower.
In particular, on the XC2V1000-4 only 30% of Block RAMs and 69% of slices
are used, compared to 85% and 99% for the 32-bit version, respectively. There
is clearly potential for either implementing other logic on the same chip using
free resources, or placing the 19-bit GSFAP core on a smaller device.
For comparison, we created similar cores that implement the NLMS algorithm in the LNS 32-bit and 19-bit precisions. The parameters of all cores are
also fully configurable: 20 ≤ L ≤ 1022; 0 < μ < 2; and 0 ≤ δ ≤ 1. The corresponding NLMS filter of order L = 1000 is used to compare performance. In this
configuration, a full iteration of the NLMS algorithm takes 1088 clock cycles,
and performs 4008 logarithmic operations, which represents 3.68 operations
per cycle.
Table IV shows the corresponding figures for our NLMS cores. The clock
rates are again around 80MHz. The design performs 295 million logarithmic operations per second (equivalent to 295MFLOPS), and can operate on signals at
a sampling rate of more than 73kHz.
The 32-bit LNS NLMS core occupies only a small fraction (12%) of the six-million gate XC2V6000-4 device. On the one-million gate XC2V1000-4 FPGA,
it uses quite a large percentage of available chip resources; in particular, 87% of
slices and 80% of Block RAMs. For the 19-bit LNS implementation, the figures
show that the core occupies a little over a half of the available slices and 25% of
Block RAMs on the one-million gate chip. Although the NLMS cores are smaller
than the GSFAP cores, the NLMS ones allow higher sampling rates.
The FPGAs we used to test the implementations are speed-grade four devices. More expensive speed-grade six devices would allow even faster clock
speeds, in the region of 100MHz. Virtex-4 devices would be even faster. However, our cores are designed for resource-constrained embedded systems, where cost is an important factor, and they achieve more than adequate results on these slower devices. But even on our slower devices, we tested the cores at 100MHz and found them to be fully functional at room temperature. For non-safety-critical applications, it may be possible to run the cores above the clock speeds reported by the design tools.

Table IV. Resource Utilization and Parameters of the 32-bit and 19-bit LNS NLMS Units When Used in Xilinx Virtex-II XC2V1000-4 and XC2V6000-4 FPGAs

                    |         LNS 32-bit        |         LNS 19-bit
                    | XC2V1000-4 | XC2V6000-4   | XC2V1000-4 | XC2V6000-4
Slice Flip Flops    | 4,408  43% | 4,069    6%  | 3,026  29% | 2,846   4%
4 input LUTs        | 4,834  47% | 4,831    7%  | 3,301  32% | 3,369   4%
Occupied Slices     | 4,473  87% | 4,160   12%  | 2,973  58% | 2,820   8%
Tbufs               | 1,280  50% | 1,280    7%  |   192   7% |   192   1%
Block RAMs          |    32  80% |    32   22%  |    10  25% |    10   6%
MULT18X18s          |     8  20% |     8    5%  |     8  20% |     8   5%
Clock rate          | 80.051 MHz | 80.058 MHz   | 80.652 MHz | 80.483 MHz
XPower estimate:    |            |              |            |
  Vccint Dynamic    |   381 mW   |   434 mW     |   161 mW   |   189 mW
  Vccint Quiescent  |    18 mW   |    68 mW     |    18 mW   |    68 mW
  Vccaux Dynamic    |     0 mW   |     0 mW     |     0 mW   |     0 mW
  Vccaux Quiescent  |   330 mW   |   330 mW     |   330 mW   |   330 mW
  Vcco33 Dynamic    |     0 mW   |     0 mW     |     0 mW   |     0 mW
  Vcco33 Quiescent  |     3 mW   |     3 mW     |     3 mW   |     3 mW
  Total Power       |   732 mW   |   839 mW     |   512 mW   |   594 mW
6.3 Practical Applications of GSFAP Core
To demonstrate the performance of the GSFAP algorithm in a practical application, we developed an adaptive noise cancellation example. In this case, the adaptive
filter is used to cancel an unknown interference contained in a primary signal.
The primary signal serves as the desired response for the adaptive filter. A
reference signal, which is correlated with the interference, is used as the input
to the adaptive filter.
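
The wiring of the canceller is simple to picture in software. The sketch below shows only the configuration of the signals; a plain NLMS update is used as a stand-in for the GSFAP core, since the point here is the arrangement of the primary and reference signals rather than the adaptation rule (all names and the value of δ are ours):

import numpy as np

def noise_canceller(primary, reference, L=250, mu=0.7, delta=1e-2):
    """Adaptive noise cancellation.

    primary   : desired response d[n] = wanted signal + interference
    reference : filter input u[n], correlated with the interference only
    Returns the error signal e[n], which is the cleaned-up signal.
    NLMS is used here purely as a stand-in adaptation rule.
    """
    w = np.zeros(L)
    u_vec = np.zeros(L)                        # L most recent reference samples
    e = np.zeros(len(primary))
    for n in range(len(primary)):
        u_vec = np.concatenate(([reference[n]], u_vec[:-1]))
        y = w @ u_vec                          # estimate of the interference
        e[n] = primary[n] - y                  # residual = signal + residual noise
        w += (mu / (delta + u_vec @ u_vec)) * e[n] * u_vec   # NLMS weight update
    return e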
For our demonstration we used the LNS 32-bit implementation of a digital
adaptive filter of length L = 250 based on the GSFAP algorithm with the projection order N = 9 and the step size parameter μ = 0.7. The result of the
noise cancellation process is depicted in Figure 12. The figure shows the clear
(original) signal, the corrupted signal, and the filtered signal, which consists of the original signal and uncancelled (residual) noise. The impact of the filtering process can easily be recognized from the figure. The best demonstration of the results is, however, subjective listening, which unfortunately cannot be presented
in the text.
[Figure 12 plot “Speech Test”: three panels (Original, Corrupted, Reconstructed), signal amplitude versus Time [s] from 0 to 3 s.]
Fig. 12. Canceling random noise in a speech signal using the 32-bit LNS GSFAP core.

To demonstrate another practical application of adaptive filters, we developed an echo cancellation example. In this case the adaptive filter is used to suppress the echo generated by an unknown system, typically a room or a car cabin. Real-world room impulse responses and speech signals were used in our
experiments. We used input and echo signals sampled at a rate of 16kHz, adaptive filters of length L = 500, and the GSFAP projection order N = 9.
The step-size parameter was chosen as μ = 1 for both NLMS and GSFAP.
The results of using the LNS 32-bit implementations of the NLMS and GSFAP algorithms for this task are shown in Figure 13. The upper-left picture of the figure shows the input (far-end) U and echo D signals. The upper-right picture shows the echo signal to be suppressed and the convergence rates of the NLMS and GSFAP adaptive filters. It can clearly be seen that GSFAP converges much faster than the NLMS algorithm. The lower-left picture represents the residual echo for both algorithms. It should be noted that the variance (var) of the residual echo E—which can also be used as a measure of quality of the adaptive algorithm—“left” by the NLMS adaptive filter is 5.79 × 10⁻⁴, while for GSFAP it is 6.32 × 10⁻⁵. The lower-right picture shows the squared value of E.
7. RELATED WORK
Chew and Farhang-Boroujeny [1999] present a fixed-point implementation of an adaptive filter using the LMS-Newton algorithm and use it for acoustic echo cancellation. They report a sampling rate of 29.4kHz for a 578-tap adaptive filter on a Xilinx XC4042XL chip.
Jang et al. [2002] describe an FPGA implementation of NLMS using fixed-point arithmetic within the context of acoustic echo cancellation. They use the FPGA only as a prototyping platform for an ASIC implementation and report a sampling rate of 8kHz for a 256-tap filter on an Altera FLEX 10K50RC240
FPGA.
The GSFAP algorithm was developed by Albu et al. [2002a] to reduce the time complexity of FAP-based adaptive algorithms. The original description of the algorithm contained a proposal that the algorithm be implemented on a general-purpose processor using LNS arithmetic. Simulations suggested that
a sampling rate of 16kHz could be achieved on a 200MHz processor.
[Figure 13 plots: upper left, “Input signals” (input U, var 0.038016; echo D, var 0.14922); upper right, “Weight misadjustment [dB]” versus Samples for the echo, NLMS, and GSFAP; lower left, “Filter error, E” (NLMS, var 0.00057892; GSFAP, var 6.3209e−005); lower right, “Squared error, E”.]
Fig. 13. Comparison of the NLMS and GSFAP within the echo cancellation application using
speech signals and real impulse response.
An RLS lattice core based on the LNS architecture for embedded systems
is described by Kadlec et al. [2002]. Significant speed-ups using single-, dual-,
and quad-pipeline LNS architectures are reported. The lattice core designs achieve clock speeds of 35 to 50MHz and a peak performance of 168MFLOPS for the 20-bit LNS implementation on a Virtex XCV2000E-6 FPGA device.
Similar work was published in Albu et al. [2001], where an LNS implementation of the normalized RLS lattice algorithm in a Virtex FPGA is presented.
They implemented an 8th-order filter capable of processing signals at a sampling rate of 47kHz for the standard and 36.7kHz for the normalized version of the RLS lattice. Their designs can be clocked at 45MHz and the latency of one iteration
of the algorithm is 950 and 1224 clock cycles for standard and normalized RLS
lattice, respectively.
The same group continued their research on implementing RLS using FPGAs. They developed a core based on the modified a priori error-feedback least-squares lattice (EF-LSL) algorithm, which was described in Albu et al. [2002b].
The authors report a maximum performance of 8kHz when using a 175-tap
filter implemented with 32-bit LNS on a Virtex device.
Pohl et al. [2003] implemented a similar RLS lattice core using LNS arithmetic on a Virtex-II XC2V6000-6 (six-million gate, speed grade 6) device. The
design achieves sampling rates of 31.6kHz, 15.8kHz, and 7.9kHz for filters of
order 84, 172, and 252, respectively. In contrast, our GSFAP cores achieve a
sampling rate of more than 50 kHz for filters of order 1000. Although the RLS
algorithm may converge more quickly than GSFAP, it is at the cost of a greatly
reduced sampling rate.
Sucha et al. [2004] present an implementation of a 129-tap QR-RLS-based adaptive filter capable of operating on signals sampled at a rate of 44kHz. This is
a significant achievement on a resource-constrained Xilinx XC2V1000-4 device.
They use the 19-bit (rather than 32-bit) versions of the LNS arithmetic units
that we use in this article, in order to reduce resource usage.
Boppana et al. [2004] also implement a QR-RLS adaptive filter. They use
CORDIC operators and implement the algorithm using both Altera’s Nios and
custom logic. They present filter orders 4 to 32, where for a 32-tap filter they
need > 100000 (120μs) or > 1000 (3μs) clock cycles when Nios or custom logic
is used, respectively.
Most recently, the dichotomous coordinate descent (DCD) FAP [Zakharov and
Albu 2005] algorithm has been proposed. In this algorithm, the DCD method
is employed to solve the problem of matrix inversion. The DCD algorithm
[Zakharov et al. 2004; Liu et al. 2006] is a numerically stable algorithm for
solving systems of linear equations with relatively low computational complexity (lower than the Gauss-Seidel algorithm). Since the DCD algorithm is free of
multiplication and division, the DCD-FAP algorithm could also be a convenient
candidate for FPGA implementation, but we leave it for future work.
8. CONCLUSIONS
Adaptive filters are widely used in digital signal processing (DSP) for countless applications in telecommunications, digital broadcasting, etc. Traditionally, small resource-constrained embedded systems have used the least computationally intensive filter adaptive algorithms, such as NLMS. In this article
we have shown that FPGAs are a highly suitable platform for more complex
algorithms with better adaptation properties.
We have designed a core that implements the GSFAP adaptive filtering algorithm. GSFAP combines relatively low computational complexity with excellent
adaptation properties. In order to reduce resource requirements, we have used logarithmic arithmetic, rather than traditional floating point.
Although it is efficient, GSFAP is a complicated algorithm which presented
several implementation challenges. Efficient design of the data structures using
our reindexing mechanism allowed matrices to be shifted diagonally without
copying. The GS stage is particularly difficult to parallelize efficiently because
of cyclic data dependencies between iterations of the inner loop. We have presented efficient solutions to these problems, making the best use of the pipelined
log arithmic addition units, and taking advantage of the very low cost of log
arithmic multiplication and division.
The resulting GSFAP core can be clocked at more than 80MHz on a one
million-gate Xilinx XC2V1000-4 device. At this clock speed the design performs over 210 million logarithmic operations per second (equivalent to 210MFLOPS).
We used it to implement adaptive filters of orders 20 to 1000 performing echo
cancellation on speech signals at a sampling rate exceeding 50kHz. A similar
NLMS core is around 15% smaller and around 50% faster, allowing it to filter signals at a sampling rate of around 73kHz. However, experiments show
that GSFAP has adaptation properties that are much superior to NLMS, and
that our core can provide very sophisticated adaptive filtering capabilities for
resource-constrained embedded systems.
REFERENCES
ALBU, F., FAGAN, A., KADLEC, J., AND COLEMAN, N. 2002a. The Gauss-Seidel fast affine projection
algorithm. In Proceedings of the Workshop on Signal Processing Systems (SIPS‘02). IEEE, 109–
114.
ALBU, F., KADLEC, J., COLEMAN, N., AND FAGAN, A. 2002b. Pipelined implementations of the a priori
error-feedback LSL algorithm using logarithmic arithmetic. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing (ICASSP‘02). IEEE, III–2681–III–2684.
ALBU, F., KADLEC, J., SOFTLEY, C., MATOUSEK, R., HERMANEK, A., FAGAN, A., AND COLEMAN, N. 2001.
Implementation of (normalized) RLS lattice on Virtex. In Field-Programmable Logic and Applications. G. J. Brebner and R. Woods, Eds. Lecture Notes in Computer Science. vol. 2147,
Springer-Verlag, Berlin.
BOPPANA, D., DHANOA, K., AND KEMPA, J. 2004. FPGA based embedded processing architecture for
the QRD-RLS algorithm. In Proceedings of the 12th Annual Symposium on Field-Programmable
Custom Computing Machines (FCCM‘04). IEEE, 330–331.
CHEW, W. C. AND FARHANG-BOROUJENY, B. 1999. FPGA implementation of acoustic echo cancelling.
In Proceedings of the Region 10 Conference TENCON99. Vol. 1. IEEE, 263–266.
CIOFFI, J. AND KAILATH, T. 1983. Fast, fixed-order, least-squares algorithms for adaptive filtering. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing,
(ICASSP‘83). Vol. 8. IEEE, 679–682.
COLEMAN, J. N. 1995. Simplification of table structure in logarithmic arithmetic. Electron. Lett. 31, 22, 1905–1906.
COLEMAN, J. N. AND CHESTER, E. I. 1999. A 32-bit logarithmic arithmetic unit and its performance
compared to floating-point. In Proceedings of the 14th IEEE Symposium on Computer Arithmetic.
IEEE, 142–151.
COLEMAN, J. N., CHESTER, E. I., SOFTLEY, C. I., AND KADLEC, J. 2000. Arithmetic on the European
Logarithmic Microprocessor. IEEE Trans. Comput. 49, 7, 702–715.
COLEMAN, J. N., SOFTLEY, C. I., KADLEC, J., MATOUSEK, R., TICHY, M., POHL, Z., HERMANEK, A., AND
BENSCHOP, N. F. 2008. The European logarithmic microprocessor. IEEE Trans. Comput. 57, 4,
532.
DING, H. 2000. A stable fast affine projection adaptation algorithm suitable for low-cost processors. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing
(ICASSP‘00). Vol. 1. IEEE, 360–363.
GAY, S. L. 1993. A fast converging, low complexity adaptive filtering algorithm. In Proceedings
of the Workshop on Applications of Signal Processing to Audio and Acoustics. IEEE, 4–7.
GAY, S. L. AND TAVATHIA, S. 1995. The fast affine projection algorithm. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP‘95). Vol. 5. IEEE,
3023–3026.
HAGEMAN, L. A. AND YOUNG, D. M. 1981. Applied Iterative Methods. Academic Press, New York.
HASELMAN, M., BEAUCHAMP, M., WOOD, A., HAUCK, S., UNDERWOOD, K. D., AND HEMMERT, K. S. 2005.
A comparison of floating point and logarithmic number systems for FPGAs. In Proceedings of
the 13th Annual Symposium on Field-Programmable Custom Computing Machines (FCCM‘05).
IEEE, 181–190.
HAYKIN, S. 2002. Adaptive Filter Theory 4th Ed. Prentice Hall, Upper Saddle River, NJ.
Institute of Electrical and Electronics Engineers, Inc. 1985. An American National Standard:
IEEE Standard for Binary Floating-Point Arithmetic. Institute of Electrical and Electronics Engineers, Inc., New York. ANSI/IEEE Std 754-1985.
JANG, S. A., LEE, Y. J., AND MOON, D. T. 2002. Design and implementation of an acoustic echo
canceller. In Proceedings of the Asia-Pacific Conference on ASIC. IEEE, 299–302.
KADLEC, J., MATOUSEK, R., HERMANEK, A., LICKO, M., AND TICHY, M. 2002. Lattice for FPGAs using
logarithmic arithmetic. Electro. Engin. Des. 74, 906, 53–56.
KALOUPTSIDIS, N. AND THEODORIDIS, S. 1993. Adaptive System Identification and Signal Processing Algorithms. Prentice Hall, Englewood Cliffs, NJ.
KANEDA, Y., TANAKA, M., AND KOJIMA, J. 1995. An adaptive algorithm with fast convergence for
multi-point sound control. In Proceedings of the Active‘95. 993–1004.
KIKKERT, C. 1974. Digital companding techniques. IEEE Trans. Comm. 22, 1, 75–78.
KOREN, I. 2002. Computer Arithmetic Algorithms 2nd Ed. A. K. Peters, Ltd., Natick, MA.
LIU, J., WEAVER, B., AND WHITE, G. 2006. FPGA implementation of the DCD algorithm. In Proceedings of the London Communications Symposium. University College London, London, UK.
LIU, Q. G., CHAMPAGNE, B., AND HO, K. C. 1996. On the use of a modified fast affine projection
algorithm in subbands for acoustic echo cancelation. In Proceedings of the 7th Digital Signal
Processing Workshop. IEEE, 354–357.
LUENBERGER, D. G. 1984. Linear and Nonlinear Programming, 2nd ed. Addison-Wesley, Reading, MA.
MATOUSEK, R., TICHY, M., POHL, Z., KADLEC, J., SOFTLEY, C., AND COLEMAN, N. 2002. Logarithmic
number system and floating-point arithmetics on FPGA. In Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, M. Glesner, P. Zipf, and M. Renovell,
Ed. Lecture Notes in Computer Science. vol. 2438. Springer-Verlag, Berlin.
OZEKI, K. AND UMEDA, T. 1984. An adaptive filtering algorithm using an orthogonal projection to
an affine subspace and its properties. Electron. Comm. Japan 67-A, 5, 126–132.
POHL, Z., MATOUSEK, R., KADLEC, J., TICHY, M., AND LICKO, M. 2003. Lattice adaptive filter implementation for FPGA. In Proceedings of the 11th International Symposium on Field-Programmable Gate Arrays (FPGA‘03). ACM/SIGDA, 246. Abstract.
SLOCK, D. T. M. AND KAILATH, T. 1991. Numerically stable fast transversal filters for recursive
least squares adaptive filtering. IEEE Trans. Sign. Proces. 39, 1, 92–114.
SUCHA, P., POHL, Z., AND HANZALEK, Z. 2004. Scheduling of iterative algorithms on FPGA with
pipelined arithmetic unit. In Proceedings of the 10th Real-Time and Embedded Technology and
Applications Symposium (RTAS‘04). IEEE, 404–412.
SWARTZLANDER, E. E. AND ALEXOPOULOS, A. G. 1975. The sign/logarithm number system. IEEE
Trans. Comput. C-24, 12, 1238–1242.
TICHY, M. 2006. Fast adaptive filtering algorithms and their implementation using reconfigurable
hardware and log arithmetic. Ph.D. thesis, Faculty of Electrical Engineering, Czech Technical
University in Prague, Czech Republic.
TSIVIDIS, Y. P., GOPINATHAN, V., AND TOTH, L. 1990. Companding in signal processing. Electron.
Lett. 26, 1331–1332.
UNDERWOOD, K. D. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 12th International Symposium on Field-Programmable Gate Arrays (FPGA‘04).
ACM/SIGDA, 171–179.
WHITEHOUSE, H. J. 2006. Implicit sampling analog-to-digital converter. In Proceedings of the
4th Digital Signal Processing Workshop. IEEE, 19–22.
WIDROW, B. AND STEARNS, S. D. 1985. Adaptive Signal Processing. Prentice Hall, Englewood Cliffs,
NJ.
Xilinx, Inc. 2005. Virtex-II Platform FPGAs: Complete Data Sheet, v3.4 Ed. Xilinx, Inc. Product
Specification.
YU, L. K. AND LEWIS, D. M. 1991. A 30-b integrated logarithmic number system processor. IEEE
J. Solid-State Circuits 26, 10, 1433–1440.
ZAKHAROV, Y. AND ALBU, F. 2005. Coordinate descent iterations in fast affine projection algorithm.
IEEE Sign. Process. Lett. 12, 5, 353–356.
ZAKHAROV, Y. V., WEAVER, B., AND TOZER, T. C. 2004. Novel signal processing technique for realtime solution of the least squares problem. In Proceedings of the 2nd International Workshop on
Signal Processing for Wireless Communications. 155–159.
Received January 2007; revised July 2007, August 2007; accepted December 2007