CONVERGENCE OF SPLIT-COMPLEX
BACKPROPAGATION ALGORITHM
WITH A MOMENTUM∗
Huisheng Zhang†, Wei Wu‡
Abstract: This paper investigates a split-complex backpropagation algorithm with
momentum (SCBPM) for complex-valued neural networks. Convergence results
for SCBPM are proved under relaxed conditions and compared with the existing
results. Monotonicity of the error function during the training iteration process
is also guaranteed. Two numerical examples are given to support the theoretical
findings.
Key words: Complex-valued neural networks, split-complex backpropagation
algorithm, convergence
Received: November 18, 2008
Revised and accepted: January 10, 2011
1. Introduction
In recent years, great interest has arisen in complex-valued neural networks (CVNNs) for their powerful capability in processing complex-valued signals [1, 2, 3]. CVNNs are extensions of real-valued neural networks [4, 5]. The fully complex backpropagation (BP) algorithm and the split-complex BP algorithm are the two main types of complex backpropagation algorithms for training CVNNs. Unlike the fully complex BP algorithm [6, 7], the split-complex BP algorithm splits the activation function into a real part and an imaginary part [2, 4, 5, 8]. This avoids the occurrence of singular points in the adaptive training process. This paper considers the split-complex BP algorithm.
∗ This work is partially supported by the National Natural Science Foundation of China
(10871220,70971014), and by the Fundamental Research Funds for the Central Universities
(2009QN075).
† Huisheng Zhang
Department of Mathematics, Dalian Maritime University, Dalian, China, 116026
‡ Wei Wu – Corresponding author
School of Mathematical Sciences, Dalian University of Technology, Dalian, China, 116024, E-mail:
[email protected]
© ICS AS CR 2011
Neural Network World 1/11, 75-90
Convergence properties of a learning algorithm are an important issue for the successful training of neural networks. The convergence of the BP algorithm for real-valued
neural networks has been analyzed by many authors from different aspects [9, 10,
11]. By using the contraction mapping theorem, the convergences in the mean
and in the mean square for recurrent neurons were obtained by Mandic and Goh
[12, 13]. For recent convergence analyses of complex-valued perceptrons and the BP
algorithm for complex-valued neural networks, we refer the readers to [1, 14, 15].
It is well known that the learning process of the BP algorithm can be very slow. To speed up and stabilize the learning procedure, a momentum term is often added to the BP algorithm. The BP algorithm with a momentum (BPM for short) can be
viewed as a memory gradient method in optimization theory [16]. Starting from
a point close enough to a minimum point of the objective function, the memory
gradient method converges under certain conditions [16, 17]. This local convergence
result is also obtained in [18] where the convergence of BPM for a neural network
is considered. However, these results cannot be applied to a more complex case
where the initial weights are chosen stochastically. Some convergence results for
the BPM are given in [19]; these results are of global nature since they are valid for
arbitrarily given initial values of the weights. Our contribution in this paper is to
present some convergence results of the split-complex backpropagation algorithm
with a momentum (SCBPM). We borrow some ideas from [19], but we employ
different proof techniques resulting in a new learning rate restriction which is much
relaxed and easier to check than the counterpart in [19]. Actually, our approach
can be applied to the BPM in [19] to relax the learning rate restriction there. Two
numerical examples are given to support our theoretical findings. We also mention a recent paper on split quaternion nonlinear adaptive filtering [20], and we expect to extend our results to quaternion networks in the future.
The rest of this paper is organized as follows. A CVNN model with one hidden
layer and the SCBPM are described in the next section. Section 3 presents the
main convergence theorem. Section 4 gives two numerical examples to support our
theoretical findings. The proof of the theorem is presented in Section 5.
2. The Structure of the Network and the Learning Method
Fig. 1 shows the CVNN structure considered in this paper. It is a network with one
hidden layer and one output neuron. Let the numbers of input and hidden neurons
be $L$ and $M$, respectively. We use “$(\cdot)^T$” as the vector transpose operator, and write $w_m = w_m^R + iw_m^I = (w_{m1}, w_{m2}, \cdots, w_{mL})^T \in \mathbb{C}^L$ as the weight vector between the input neurons and the $m$-th hidden neuron, where $w_{ml} = w_{ml}^R + iw_{ml}^I$, $w_{ml}^R, w_{ml}^I \in \mathbb{R}$, $i = \sqrt{-1}$, $m = 1, \cdots, M$, and $l = 1, \cdots, L$. Similarly, write $w_{M+1} = w_{M+1}^R + iw_{M+1}^I = (w_{M+1,1}, w_{M+1,2}, \cdots, w_{M+1,M})^T \in \mathbb{C}^M$ as the weight vector between the hidden neurons and the output neuron, where $w_{M+1,m} = w_{M+1,m}^R + iw_{M+1,m}^I$, $m = 1, \cdots, M$. For simplicity, all the weight vectors are incorporated into a total weight vector

$$\mathbf{W} = ((w_1)^T, (w_2)^T, \cdots, (w_{M+1})^T)^T \in \mathbb{C}^{ML+M}. \tag{1}$$
Fig. 1 Structure of a CVNN: input neurons $z_1, z_2, \cdots, z_L$, a hidden layer, and an output neuron $O$.
For an input signal $z = (z_1, z_2, \cdots, z_L)^T = x + iy \in \mathbb{C}^L$, where $x = (x_1, x_2, \cdots, x_L)^T \in \mathbb{R}^L$ and $y = (y_1, y_2, \cdots, y_L)^T \in \mathbb{R}^L$, the input to the $m$-th hidden neuron is

$$U_m = U_m^R + iU_m^I = \sum_{l=1}^{L}\big(w_{ml}^R x_l - w_{ml}^I y_l\big) + i\sum_{l=1}^{L}\big(w_{ml}^I x_l + w_{ml}^R y_l\big) = x^T w_m^R - y^T w_m^I + i\big(x^T w_m^I + y^T w_m^R\big). \tag{2}$$
In order to apply the SCBPM to train a CVNN, we consider the following popular real-imaginary-type activation function [22]

$$f_C(U) = f(U^R) + if(U^I) \tag{3}$$

for any $U = U^R + iU^I \in \mathbb{C}$, where $f$ is a given real function (e.g., a sigmoid function). The output $H_m$ of the hidden neuron $m$ is given by

$$H_m = H_m^R + iH_m^I = f(U_m^R) + if(U_m^I). \tag{4}$$
Similarly, the input to the output neuron is
$$S = S^R + iS^I = \sum_{m=1}^{M}\big(w_{M+1,m}^R H_m^R - w_{M+1,m}^I H_m^I\big) + i\sum_{m=1}^{M}\big(w_{M+1,m}^I H_m^R + w_{M+1,m}^R H_m^I\big) = (H^R)^T w_{M+1}^R - (H^I)^T w_{M+1}^I + i\big((H^R)^T w_{M+1}^I + (H^I)^T w_{M+1}^R\big) \tag{5}$$

and the output of the network is given by

$$O = O^R + iO^I = g(S^R) + ig(S^I), \tag{6}$$
where $H^R = (H_1^R, H_2^R, \cdots, H_M^R)^T$, $H^I = (H_1^I, H_2^I, \cdots, H_M^I)^T$, and $g$ is a given
real function. A bias weight can be added to each neuron, but it is omitted for
simplicity of the presentation and deduction.
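To make the data flow of (2)–(6) concrete, the following NumPy sketch implements one forward pass through the network; the function name, array shapes, and the use of tanh for both $f$ and $g$ are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the split-complex forward pass, Eqs. (2)-(6).
# Assumption: tanh is used for both f and g; weights are stored as separate
# real and imaginary arrays (wR, wI for the hidden layer, vR, vI for the output).
import numpy as np

def forward(x, y, wR, wI, vR, vI, f=np.tanh, g=np.tanh):
    """x, y: real and imaginary parts of one input (length L).
    wR, wI: (M, L) hidden weights; vR, vI: (M,) output weights.
    Returns hidden outputs (HR, HI) and the network output (OR, OI)."""
    UR = wR @ x - wI @ y            # Eq. (2), real part of U_m
    UI = wI @ x + wR @ y            # Eq. (2), imaginary part of U_m
    HR, HI = f(UR), f(UI)           # Eqs. (3)-(4), split activation
    SR = vR @ HR - vI @ HI          # Eq. (5), real part of S
    SI = vI @ HR + vR @ HI          # Eq. (5), imaginary part of S
    return HR, HI, g(SR), g(SI)     # Eq. (6)

# Tiny usage example with L = 2 inputs and M = 3 hidden neurons.
rng = np.random.default_rng(0)
wR, wI = rng.uniform(-0.5, 0.5, (3, 2)), rng.uniform(-0.5, 0.5, (3, 2))
vR, vI = rng.uniform(-0.5, 0.5, 3), rng.uniform(-0.5, 0.5, 3)
print(forward(np.array([1.0, -1.0]), np.array([0.5, 0.5]), wR, wI, vR, vI)[2:])
```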
Let the network be supplied with a given set of training examples $\{z^j, d^j\}_{j=1}^J \subset \mathbb{C}^L \times \mathbb{C}$. For each input $z^j = x^j + iy^j$ ($1 \le j \le J$) from the training set, we write $U_m^j = U_m^{j,R} + iU_m^{j,I}$ ($1 \le m \le M$) as the input for the hidden neuron $m$, $H_m^j = H_m^{j,R} + iH_m^{j,I}$ ($1 \le m \le M$) as the output for the hidden neuron $m$, $S^j = S^{j,R} + iS^{j,I}$ as the input to the output neuron, and $O^j = O^{j,R} + iO^{j,I}$ as the final output. The square error function of the CVNN trained by the SCBPM can be represented as follows:
$$E(\mathbf{W}) = \frac{1}{2}\sum_{j=1}^{J}(O^j - d^j)(O^j - d^j)^* = \frac{1}{2}\sum_{j=1}^{J}\Big[(O^{j,R} - d^{j,R})^2 + (O^{j,I} - d^{j,I})^2\Big] = \sum_{j=1}^{J}\Big[g_{jR}(S^{j,R}) + g_{jI}(S^{j,I})\Big], \tag{7}$$
where “$(\cdot)^*$” denotes the complex conjugate operator, $d^{j,R}$ and $d^{j,I}$ are the real part and imaginary part of the desired output $d^j$, respectively, and

$$g_{jR}(t) = \frac{1}{2}\big(g(t) - d^{j,R}\big)^2, \quad g_{jI}(t) = \frac{1}{2}\big(g(t) - d^{j,I}\big)^2, \quad t \in \mathbb{R}, \ 1 \le j \le J. \tag{8}$$
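As a quick sanity check of (7), the error can be computed directly from the complex outputs and targets; the sketch below assumes the outputs $O^j$ have already been obtained (e.g., with a forward pass as above), and the numbers are made up.

```python
# Sketch of the square error E(W) in Eq. (7): half the sum of squared moduli
# of the complex output errors.  Outputs and targets here are placeholders.
import numpy as np

def square_error(O, d):
    diff = O - d                                  # O^j - d^j
    return 0.5 * np.sum(diff.real**2 + diff.imag**2)

O = np.array([0.9 + 0.1j, 0.1 + 0.0j, 0.8 + 0.9j, 0.1 + 0.9j])  # example outputs
d = np.array([1.0 + 0.0j, 0.0 + 0.0j, 1.0 + 1.0j, 0.0 + 1.0j])  # example targets
print(square_error(O, d))
```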
The purpose of the network training is to find $\mathbf{W}^F$ that minimizes $E(\mathbf{W})$. By writing

$$H^j = H^{j,R} + iH^{j,I} = (H_1^{j,R}, H_2^{j,R}, \cdots, H_M^{j,R})^T + i(H_1^{j,I}, H_2^{j,I}, \cdots, H_M^{j,I})^T, \tag{9}$$
we get the gradients of $E(\mathbf{W})$ with respect to $w_m^R$ and $w_m^I$, respectively, as follows:

$$\frac{\partial E(\mathbf{W})}{\partial w_{M+1}^R} = \sum_{j=1}^{J}\Big[g'_{jR}(S^{j,R})H^{j,R} + g'_{jI}(S^{j,I})H^{j,I}\Big], \tag{10}$$

$$\frac{\partial E(\mathbf{W})}{\partial w_{M+1}^I} = \sum_{j=1}^{J}\Big[-g'_{jR}(S^{j,R})H^{j,I} + g'_{jI}(S^{j,I})H^{j,R}\Big], \tag{11}$$

$$\frac{\partial E(\mathbf{W})}{\partial w_m^R} = \sum_{j=1}^{J}\Big[g'_{jR}(S^{j,R})\big(w_{M+1,m}^R f'(U_m^{j,R})x^j - w_{M+1,m}^I f'(U_m^{j,I})y^j\big) + g'_{jI}(S^{j,I})\big(w_{M+1,m}^I f'(U_m^{j,R})x^j + w_{M+1,m}^R f'(U_m^{j,I})y^j\big)\Big], \quad 1 \le m \le M, \tag{12}$$

$$\frac{\partial E(\mathbf{W})}{\partial w_m^I} = \sum_{j=1}^{J}\Big[g'_{jR}(S^{j,R})\big(-w_{M+1,m}^R f'(U_m^{j,R})y^j - w_{M+1,m}^I f'(U_m^{j,I})x^j\big) + g'_{jI}(S^{j,I})\big(-w_{M+1,m}^I f'(U_m^{j,R})y^j + w_{M+1,m}^R f'(U_m^{j,I})x^j\big)\Big], \quad 1 \le m \le M. \tag{13}$$
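The batch gradients (10)–(13) can be evaluated in a fully vectorized way; the sketch below recomputes the forward quantities internally and assumes tanh for $f$ and $g$ (so $f' = 1 - f^2$ and $g' = 1 - g^2$). All names and shapes are illustrative rather than taken from the authors' code.

```python
# Sketch of the batch gradients in Eqs. (10)-(13) for a tanh/tanh network.
import numpy as np

def gradients(X, Y, d, wR, wI, vR, vI):
    """X, Y: (J, L) real/imaginary parts of the inputs; d: (J,) complex targets.
    wR, wI: (M, L) hidden weights; vR, vI: (M,) output weights.
    Returns (dE/dvR, dE/dvI, dE/dwR, dE/dwI)."""
    f = np.tanh
    UR, UI = X @ wR.T - Y @ wI.T, X @ wI.T + Y @ wR.T   # Eq. (2), shape (J, M)
    HR, HI = f(UR), f(UI)                               # Eq. (4)
    SR, SI = HR @ vR - HI @ vI, HR @ vI + HI @ vR       # Eq. (5), shape (J,)
    OR, OI = np.tanh(SR), np.tanh(SI)                   # Eq. (6)
    # g'_{jR}(S^{j,R}) = g'(S^{j,R})(O^{j,R} - d^{j,R}) from Eq. (8); g' = 1 - g^2
    gpR = (1 - OR**2) * (OR - d.real)
    gpI = (1 - OI**2) * (OI - d.imag)
    # Eqs. (10)-(11): output-layer gradients
    dvR = HR.T @ gpR + HI.T @ gpI
    dvI = -HI.T @ gpR + HR.T @ gpI
    # Eqs. (12)-(13): hidden-layer gradients, written as coefficients of x^j, y^j
    fpR, fpI = 1 - HR**2, 1 - HI**2                        # f'(U^{j,R}), f'(U^{j,I})
    cxR = (gpR[:, None] * vR + gpI[:, None] * vI) * fpR    # x^j coefficient in (12)
    cyR = (-gpR[:, None] * vI + gpI[:, None] * vR) * fpI   # y^j coefficient in (12)
    cxI = (-gpR[:, None] * vI + gpI[:, None] * vR) * fpI   # x^j coefficient in (13)
    cyI = -(gpR[:, None] * vR + gpI[:, None] * vI) * fpR   # y^j coefficient in (13)
    dwR = cxR.T @ X + cyR.T @ Y
    dwI = cxI.T @ X + cyI.T @ Y
    return dvR, dvI, dwR, dwI

# Shape check on random data: J = 4 samples, L = 2 inputs, M = 3 hidden neurons.
r = np.random.default_rng(1)
print([a.shape for a in gradients(r.normal(size=(4, 2)), r.normal(size=(4, 2)),
                                  r.normal(size=4) + 1j * r.normal(size=4),
                                  r.normal(size=(3, 2)), r.normal(size=(3, 2)),
                                  r.normal(size=3), r.normal(size=3))])
```

A finite-difference comparison against a numerical gradient of (7) is a useful way to validate such an implementation.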
Write $\mathbf{W}^n = ((w_1^n)^T, (w_2^n)^T, \cdots, (w_{M+1}^n)^T)^T$ ($n = 0, 1, \cdots$). Let $w_m^0 = w_m^{0,R} + iw_m^{0,I}$ be arbitrarily chosen initial weights, and let $\Delta w_m^{0,R} = \Delta w_m^{0,I} = 0$. Then the SCBPM algorithm updates the real part $w_m^R$ and the imaginary part $w_m^I$ of the weights $w_m$ separately:

$$\Delta w_m^{n+1,R} = w_m^{n+1,R} - w_m^{n,R} = -\eta\,\frac{\partial E(\mathbf{W}^n)}{\partial w_m^R} + \tau_m^{n,R}\Delta w_m^{n,R}, \qquad \Delta w_m^{n+1,I} = w_m^{n+1,I} - w_m^{n,I} = -\eta\,\frac{\partial E(\mathbf{W}^n)}{\partial w_m^I} + \tau_m^{n,I}\Delta w_m^{n,I}, \tag{14}$$

where $\eta \in (0,1)$ is the learning rate, $\tau_m^{n,R}$ and $\tau_m^{n,I}$ are the momentum factors, $m = 1, 2, \cdots, M+1$, and $n = 0, 1, \cdots$.
For simplicity, let us denote

$$p_m^{n,R} = \frac{\partial E(\mathbf{W}^n)}{\partial w_m^R}, \qquad p_m^{n,I} = \frac{\partial E(\mathbf{W}^n)}{\partial w_m^I}. \tag{15}$$
Then (14) can be rewritten as

$$\Delta w_m^{n+1,R} = \tau_m^{n,R}\Delta w_m^{n,R} - \eta\,p_m^{n,R}, \qquad \Delta w_m^{n+1,I} = \tau_m^{n,I}\Delta w_m^{n,I} - \eta\,p_m^{n,I}. \tag{16}$$
Similarly to the BPM [19], we choose the adaptive momentum factors $\tau_m^{n,R}$ and $\tau_m^{n,I}$ as follows:

$$\tau_m^{n,R} = \begin{cases} \dfrac{\tau\|p_m^{n,R}\|}{\|\Delta w_m^{n,R}\|}, & \text{if } \|\Delta w_m^{n,R}\| \ne 0, \\ 0, & \text{else}, \end{cases} \tag{17}$$

$$\tau_m^{n,I} = \begin{cases} \dfrac{\tau\|p_m^{n,I}\|}{\|\Delta w_m^{n,I}\|}, & \text{if } \|\Delta w_m^{n,I}\| \ne 0, \\ 0, & \text{else}, \end{cases} \tag{18}$$

where $\tau \in (0,1)$ is a constant parameter and $\|\cdot\|$ is the usual Euclidean norm.
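In code, one SCBPM iteration amounts to looping over the real and imaginary weight blocks and applying (16) with the adaptive factors (17)–(18); the parameter layout below (parallel lists of weight blocks, previous increments, and gradients) is an illustrative assumption.

```python
# Sketch of one SCBPM update step, Eqs. (16)-(18).
import numpy as np

def scbpm_step(weights, deltas, grads, eta=0.1, tau=0.01):
    """weights, deltas, grads: lists of equally shaped arrays holding the blocks
    w_m^{n,R/I}, the previous increments Delta w_m^{n,R/I}, and the gradients
    p_m^{n,R/I}.  Returns the updated weights and increments."""
    new_w, new_d = [], []
    for w, dw, p in zip(weights, deltas, grads):
        norm_dw = np.linalg.norm(dw)
        # Eqs. (17)-(18): adaptive momentum factor, zero if the increment vanishes
        tau_m = tau * np.linalg.norm(p) / norm_dw if norm_dw > 0 else 0.0
        step = tau_m * dw - eta * p          # Eq. (16)
        new_w.append(w + step)               # w^{n+1} = w^n + Delta w^{n+1}
        new_d.append(step)
    return new_w, new_d
```

With this choice the momentum term $\tau_m^{n,R}\Delta w_m^{n,R}$ always has norm at most $\tau\|p_m^{n,R}\|$, which is exactly what inequality (28) in Section 5 exploits.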
3. Main Results
The following assumptions will be used in our discussion.
(A1): There exists a constant $C_1 > 0$ such that $\max_{t\in\mathbb{R}}\{|f(t)|, |g(t)|, |f'(t)|, |g'(t)|, |f''(t)|, |g''(t)|\} \le C_1$.

(A2): There exists a constant $C_2 > 0$ such that $\|w_{M+1}^{n,R}\| \le C_2$ and $\|w_{M+1}^{n,I}\| \le C_2$ for all $n = 0, 1, 2, \cdots$.

(A3): The set $\Phi_0 = \big\{\mathbf{W} \,\big|\, \frac{\partial E(\mathbf{W})}{\partial w_m^R} = 0,\ \frac{\partial E(\mathbf{W})}{\partial w_m^I} = 0,\ m = 1, \cdots, M+1\big\}$ contains only finitely many points.
Theorem 1. Suppose that Assumptions (A1) and (A2) are valid, and that $\{w_m^n\}$ are the weight vector sequences generated by (14). Then there exists a constant $C^F > 0$ such that for $0 < s < 1$, $\tau = s\eta$ and $\eta \le \frac{1-s}{C^F(1+s)^2}$, the following results hold:

(i) $E(\mathbf{W}^{n+1}) \le E(\mathbf{W}^n)$, $n = 0, 1, 2, \cdots$;

(ii) There is $E^F \ge 0$ such that $\lim_{n\to\infty} E(\mathbf{W}^n) = E^F$;

(iii) $\lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^R}\Big\| = 0$ and $\lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^I}\Big\| = 0$, $1 \le m \le M+1$.

Furthermore, if Assumption (A3) also holds, then there exists a point $\mathbf{W}^F \in \Phi_0$ such that

(iv) $\lim_{n\to\infty}\mathbf{W}^n = \mathbf{W}^F$.
The monotonicity and convergence of the error function $E(\mathbf{W})$ during the learning process are shown in Conclusions (i) and (ii), respectively. Conclusion (iii) indicates the convergence of $\frac{\partial E(\mathbf{W}^n)}{\partial w_m^R}$ and $\frac{\partial E(\mathbf{W}^n)}{\partial w_m^I}$, referred to as weak convergence. The strong convergence of $\mathbf{W}^n$ is given in Conclusion (iv). We note that the restriction $\eta \le \frac{1-s}{C^F(1+s)^2}$ in Theorem 1 is less restrictive and easier to check than the corresponding condition in [19]. We also mention that our results are of a deterministic nature, in contrast to the related work in [1], where convergence in the mean and in the mean square for complex-valued perceptrons is obtained.
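Purely for illustration, the theorem's parameter restriction can be written as a small helper: given $s \in (0,1)$ and an estimate of the constant $C^F$ (which depends on the data and on the bounds in (A1)–(A2) and is not available in closed form in practice), it returns an admissible pair $(\eta, \tau)$. The value used for $C^F$ below is a made-up placeholder.

```python
# Illustrative helper for the learning-rate restriction of Theorem 1:
# tau = s * eta and eta <= (1 - s) / (C^F (1 + s)^2).  C_F is a placeholder.
def choose_eta_tau(s=0.5, C_F=50.0):
    eta = (1.0 - s) / (C_F * (1.0 + s) ** 2)
    tau = s * eta
    return eta, tau

print(choose_eta_tau())   # e.g. (0.004444..., 0.002222...)
```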
4. Numerical Examples
In the following subsections we illustrate the convergence behavior of the SCBPM by using two numerical examples. In both examples, we set the transfer function to be tansig(·) in MATLAB, which is a commonly used sigmoid function, and carry out 10 independent tests with the initial components of the weights randomly chosen in $[-0.5, 0.5]$. The average of the errors and the average of the gradient norms over all tests in each example are plotted.
4.1 XOR problem
The well-known XOR problem is a benchmark used in the literature on neural networks. As in [22], the training samples of the encoded XOR problem for a CVNN are given as follows:

$$\{z^1 = -1 - i,\ d^1 = 1\}, \quad \{z^2 = -1 + i,\ d^2 = 0\}, \quad \{z^3 = 1 - i,\ d^3 = 1 + i\}, \quad \{z^4 = 1 + i,\ d^4 = i\}.$$
This example uses a network with one input node, three hidden nodes, and one output node. The learning rate $\eta$ and the momentum parameter $\tau$ are set to 0.1 and 0.01, respectively. The simulation results are presented in Fig. 2, which shows that the gradient tends to zero and the square error decreases monotonically as the number of iterations increases, finally tending to a constant. This supports our theoretical findings.
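For reference, the encoded XOR training set and the experiment settings of this subsection can be written down directly as data; assembling the full training loop from the forward-pass, gradient, and SCBPM-step sketches given earlier is left implicit, and the random seed below is arbitrary.

```python
# The encoded XOR data and the settings of Section 4.1 (L = 1, M = 3,
# eta = 0.1, tau = 0.01, initial weights uniform in [-0.5, 0.5]).
import numpy as np

z = np.array([-1 - 1j, -1 + 1j, 1 - 1j, 1 + 1j])     # inputs z^1..z^4
d = np.array([ 1 + 0j,  0 + 0j, 1 + 1j, 0 + 1j])     # targets d^1..d^4
X, Y = z.real.reshape(-1, 1), z.imag.reshape(-1, 1)  # one input neuron (L = 1)

M, eta, tau = 3, 0.1, 0.01
rng = np.random.default_rng(42)
wR, wI = rng.uniform(-0.5, 0.5, (M, 1)), rng.uniform(-0.5, 0.5, (M, 1))
vR, vI = rng.uniform(-0.5, 0.5, M), rng.uniform(-0.5, 0.5, M)
```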
Fig. 2 Convergence behavior of SCBPM for solving the XOR problem (norm of gradient $= \sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big)$).
4.2 Approximation problem
In this example, the synthetic complex-valued function [23] defined as

$$h(z) = (z_1)^2 + (z_2)^2 \tag{19}$$

is approximated, where $z$ is a two-dimensional complex-valued vector comprised of $z_1$ and $z_2$. 10000 input points are selected from an evenly spaced $10 \times 10 \times 10 \times 10$ grid on $-0.5 \le \mathrm{Re}(z_1), \mathrm{Im}(z_1), \mathrm{Re}(z_2), \mathrm{Im}(z_2) \le 0.5$, where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ represent the real part and the imaginary part of a complex number, respectively. We use a network with 2 input neurons, 25 hidden neurons and 1 output neuron. The learning rate $\eta$ and the momentum parameter $\tau$ are set to 0.1 and 0.01, respectively. Fig. 3 illustrates the simulation results, which also support our convergence theorem.
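The training set for this example can be generated as follows; the grid construction (10 evenly spaced values per real and imaginary axis of $z_1$ and $z_2$ over $[-0.5, 0.5]$) is our reading of the description above.

```python
# Sketch of the 10 x 10 x 10 x 10 input grid and targets h(z) = z1^2 + z2^2
# for the approximation problem of Section 4.2.
import numpy as np
from itertools import product

pts = np.linspace(-0.5, 0.5, 10)
samples = np.array([(a + 1j * b, c + 1j * e)
                    for a, b, c, e in product(pts, pts, pts, pts)])  # (10000, 2)
targets = samples[:, 0] ** 2 + samples[:, 1] ** 2                    # Eq. (19)
print(samples.shape, targets.shape)
```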
5. Proofs
In this section, we first present two lemmas, then we use them to prove the main
theorem.
Fig. 3 Convergence behavior of SCBPM for solving the approximation problem (norm of gradient $= \sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big)$).

Lemma 1. Suppose that $E: \mathbb{R}^{2ML+2M} \to \mathbb{R}$ is continuous and differentiable on a compact set $\Phi \subset \mathbb{R}^{2ML+2M}$, and that $\Phi_1 = \{\mathbf{v}\,|\,\frac{\partial E(\mathbf{v})}{\partial \mathbf{v}} = 0\}$ contains only finitely many points. If a sequence $\{\mathbf{v}^n\}_{n=1}^{\infty} \subset \Phi$ satisfies

$$\lim_{n\to\infty}\|\mathbf{v}^{n+1} - \mathbf{v}^n\| = 0, \qquad \lim_{n\to\infty}\|\nabla E(\mathbf{v}^n)\| = 0,$$

then there exists a point $\mathbf{v}^F \in \Phi_1$ such that $\lim_{n\to\infty}\mathbf{v}^n = \mathbf{v}^F$.

Proof. This result is almost the same as Theorem 14.1.5 in [21], and the details of the proof are omitted. □
For any $1 \le j \le J$, $1 \le m \le M$ and $n = 0, 1, 2, \cdots$, the following symbols will be used in our proof later on:

$$\begin{aligned}
U_m^{n,j} &= U_m^{n,j,R} + iU_m^{n,j,I} = (x^j)^T w_m^{n,R} - (y^j)^T w_m^{n,I} + i\big((x^j)^T w_m^{n,I} + (y^j)^T w_m^{n,R}\big), \\
H_m^{n,j} &= H_m^{n,j,R} + iH_m^{n,j,I} = f(U_m^{n,j,R}) + if(U_m^{n,j,I}), \\
H^{n,j,R} &= (H_1^{n,j,R}, \cdots, H_M^{n,j,R})^T, \qquad H^{n,j,I} = (H_1^{n,j,I}, \cdots, H_M^{n,j,I})^T, \\
S^{n,j} &= S^{n,j,R} + iS^{n,j,I} = (H^{n,j,R})^T w_{M+1}^{n,R} - (H^{n,j,I})^T w_{M+1}^{n,I} + i\big((H^{n,j,R})^T w_{M+1}^{n,I} + (H^{n,j,I})^T w_{M+1}^{n,R}\big), \\
\psi^{n,j,R} &= H^{n+1,j,R} - H^{n,j,R}, \qquad \psi^{n,j,I} = H^{n+1,j,I} - H^{n,j,I}.
\end{aligned} \tag{20}$$
Lemma 2. Suppose Assumptions (A1) and (A2) hold. Then for any $1 \le j \le J$ and $n = 0, 1, 2, \cdots$, we have

$$|O^{j,R}| \le C_0, \quad |O^{j,I}| \le C_0, \quad \|H^{n,j,R}\| \le C_0, \quad \|H^{n,j,I}\| \le C_0, \tag{21}$$

$$|g'_{jR}(t)| \le C_3, \quad |g'_{jI}(t)| \le C_3, \quad |g''_{jR}(t)| \le C_3, \quad |g''_{jI}(t)| \le C_3, \quad t \in \mathbb{R}, \tag{22}$$

$$\max\{\|\psi^{n,j,R}\|^2, \|\psi^{n,j,I}\|^2\} \le C_4(\tau+\eta)^2\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big), \tag{23}$$

$$\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (H^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (H^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big)\Big) \le (\tau-\eta)\big(\|p_{M+1}^{n,R}\|^2 + \|p_{M+1}^{n,I}\|^2\big), \tag{24}$$

$$\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T w_{M+1}^{n,R} - (\psi^{n,j,I})^T w_{M+1}^{n,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T w_{M+1}^{n,I} + (\psi^{n,j,I})^T w_{M+1}^{n,R}\big)\Big) \le \big(\tau-\eta+C_5(\tau+\eta)^2\big)\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big), \tag{25}$$

$$\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big)\Big) \le C_6(\tau+\eta)^2\sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big), \tag{26}$$

$$\frac{1}{2}\sum_{j=1}^{J}\Big(g''_{jR}(t_1^{n,j})\big(S^{n+1,j,R} - S^{n,j,R}\big)^2 + g''_{jI}(t_2^{n,j})\big(S^{n+1,j,I} - S^{n,j,I}\big)^2\Big) \le C_7(\tau+\eta)^2\sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big), \tag{27}$$

where $C_i$ ($i = 0, 3, \cdots, 7$) are constants independent of $n$ and $j$, each $t_1^{n,j} \in \mathbb{R}$ lies on the segment between $S^{n+1,j,R}$ and $S^{n,j,R}$, and each $t_2^{n,j} \in \mathbb{R}$ lies on the segment between $S^{n+1,j,I}$ and $S^{n,j,I}$.
Proof. The validity of (21) follows easily from (4)–(6) when the set of samples is fixed and Assumptions (A1) and (A2) are satisfied. By (8), we have

$$g'_{jR}(t) = g'(t)\big(g(t) - O^{j,R}\big), \qquad g'_{jI}(t) = g'(t)\big(g(t) - O^{j,I}\big),$$
$$g''_{jR}(t) = g''(t)\big(g(t) - O^{j,R}\big) + (g'(t))^2, \qquad g''_{jI}(t) = g''(t)\big(g(t) - O^{j,I}\big) + (g'(t))^2, \quad 1 \le j \le J, \ t \in \mathbb{R}.$$

Then (22) follows directly from Assumption (A1) by defining $C_3 = C_1(C_1 + C_0) + (C_1)^2$.
By (16), (17) and (18), for $m = 1, \cdots, M+1$,

$$\|\Delta w_m^{n+1,R}\| = \|\tau_m^{n,R}\Delta w_m^{n,R} - \eta\,p_m^{n,R}\| \le \tau_m^{n,R}\|\Delta w_m^{n,R}\| + \eta\|p_m^{n,R}\| \le (\tau+\eta)\|p_m^{n,R}\|. \tag{28}$$

Similarly, we have

$$\|\Delta w_m^{n+1,I}\| \le (\tau+\eta)\|p_m^{n,I}\|. \tag{29}$$
If Assumption (A1) is valid, it follows from (20), (28), (29), the Cauchy–Schwarz inequality and the mean-value theorem for multivariate functions that for any $1 \le j \le J$ and $n = 0, 1, 2, \cdots$

$$\begin{aligned}
\|\psi^{n,j,R}\|^2 &= \|H^{n+1,j,R} - H^{n,j,R}\|^2 = \left\|\begin{pmatrix} f(U_1^{n+1,j,R}) - f(U_1^{n,j,R}) \\ \vdots \\ f(U_M^{n+1,j,R}) - f(U_M^{n,j,R}) \end{pmatrix}\right\|^2 \\
&= \left\|\begin{pmatrix} f'(s_1^{n,j})\big((x^j)^T\Delta w_1^{n+1,R} - (y^j)^T\Delta w_1^{n+1,I}\big) \\ \vdots \\ f'(s_M^{n,j})\big((x^j)^T\Delta w_M^{n+1,R} - (y^j)^T\Delta w_M^{n+1,I}\big) \end{pmatrix}\right\|^2 \\
&= \sum_{m=1}^{M}\Big(f'(s_m^{n,j})\big((x^j)^T\Delta w_m^{n+1,R} - (y^j)^T\Delta w_m^{n+1,I}\big)\Big)^2 \\
&\le 2(C_1)^2\sum_{m=1}^{M}\Big(\big((x^j)^T\Delta w_m^{n+1,R}\big)^2 + \big((y^j)^T\Delta w_m^{n+1,I}\big)^2\Big) \\
&\le 2(C_1)^2\sum_{m=1}^{M}\Big(\|\Delta w_m^{n+1,R}\|^2\|x^j\|^2 + \|\Delta w_m^{n+1,I}\|^2\|y^j\|^2\Big) \\
&\le C_4\sum_{m=1}^{M}\Big(\|\Delta w_m^{n+1,R}\|^2 + \|\Delta w_m^{n+1,I}\|^2\Big) \\
&\le C_4(\tau+\eta)^2\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big),
\end{aligned} \tag{30}$$

where $C_4 = 2(C_1)^2\max_{1\le j\le J}\{\|x^j\|^2, \|y^j\|^2\}$ and $s_m^{n,j}$ is on the segment between $U_m^{n+1,j,R}$ and $U_m^{n,j,R}$ for $m = 1, \cdots, M$. Similarly we can get

$$\|\psi^{n,j,I}\|^2 \le C_4(\tau+\eta)^2\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big). \tag{31}$$

Thus, we have (23).
According to (16), (17) and (18), we have

$$\begin{aligned}
(p_m^{n,R})^T\Delta w_m^{n+1,R} + (p_m^{n,I})^T\Delta w_m^{n+1,I}
&= -\eta\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big) + \tau_m^{n,R}(p_m^{n,R})^T\Delta w_m^{n,R} + \tau_m^{n,I}(p_m^{n,I})^T\Delta w_m^{n,I} \\
&\le -\eta\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big) + \tau_m^{n,R}\|\Delta w_m^{n,R}\|\|p_m^{n,R}\| + \tau_m^{n,I}\|\Delta w_m^{n,I}\|\|p_m^{n,I}\| \\
&\le (\tau-\eta)\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big),
\end{aligned} \tag{32}$$

where $m = 1, \cdots, M, M+1$. This together with (10) and (11) validates (24):

$$\begin{aligned}
&\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (H^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (H^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big)\Big) \\
&= \sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})(H^{n,j,R})^T\Delta w_{M+1}^{n+1,R} + g'_{jI}(S^{n,j,I})(H^{n,j,I})^T\Delta w_{M+1}^{n+1,R} - g'_{jR}(S^{n,j,R})(H^{n,j,I})^T\Delta w_{M+1}^{n+1,I} + g'_{jI}(S^{n,j,I})(H^{n,j,R})^T\Delta w_{M+1}^{n+1,I}\Big) \\
&= (p_{M+1}^{n,R})^T\Delta w_{M+1}^{n+1,R} + (p_{M+1}^{n,I})^T\Delta w_{M+1}^{n+1,I} \\
&\le (\tau-\eta)\big(\|p_{M+1}^{n,R}\|^2 + \|p_{M+1}^{n,I}\|^2\big).
\end{aligned} \tag{33}$$
Next, we prove (25). By (2), (4), (20) and Taylor's formula, for any $1 \le j \le J$, $1 \le m \le M$ and $n = 0, 1, 2, \cdots$, we have

$$H_m^{n+1,j,R} - H_m^{n,j,R} = f(U_m^{n+1,j,R}) - f(U_m^{n,j,R}) = f'(U_m^{n,j,R})\big(U_m^{n+1,j,R} - U_m^{n,j,R}\big) + \frac{1}{2}f''(t_m^{n,j,R})\big(U_m^{n+1,j,R} - U_m^{n,j,R}\big)^2 \tag{34}$$

and

$$H_m^{n+1,j,I} - H_m^{n,j,I} = f(U_m^{n+1,j,I}) - f(U_m^{n,j,I}) = f'(U_m^{n,j,I})\big(U_m^{n+1,j,I} - U_m^{n,j,I}\big) + \frac{1}{2}f''(t_m^{n,j,I})\big(U_m^{n+1,j,I} - U_m^{n,j,I}\big)^2, \tag{35}$$

where $t_m^{n,j,R}$ is an intermediate point on the line segment between the two points $U_m^{n+1,j,R}$ and $U_m^{n,j,R}$, and $t_m^{n,j,I}$ between the two points $U_m^{n+1,j,I}$ and $U_m^{n,j,I}$. Thus, according to (2), (12), (13), (14), (20), (32) and (34)–(35) we have
$$\begin{aligned}
&\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T w_{M+1}^{n,R} - (\psi^{n,j,I})^T w_{M+1}^{n,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T w_{M+1}^{n,I} + (\psi^{n,j,I})^T w_{M+1}^{n,R}\big)\Big) \\
&= \sum_{j=1}^{J}\sum_{m=1}^{M}\Big(g'_{jR}(S^{n,j,R})\,w_{M+1,m}^{n,R}f'(U_m^{n,j,R})\big((x^j)^T\Delta w_m^{n+1,R} - (y^j)^T\Delta w_m^{n+1,I}\big) \\
&\qquad\qquad - g'_{jR}(S^{n,j,R})\,w_{M+1,m}^{n,I}f'(U_m^{n,j,I})\big((x^j)^T\Delta w_m^{n+1,I} + (y^j)^T\Delta w_m^{n+1,R}\big) \\
&\qquad\qquad + g'_{jI}(S^{n,j,I})\,w_{M+1,m}^{n,I}f'(U_m^{n,j,R})\big((x^j)^T\Delta w_m^{n+1,R} - (y^j)^T\Delta w_m^{n+1,I}\big) \\
&\qquad\qquad + g'_{jI}(S^{n,j,I})\,w_{M+1,m}^{n,R}f'(U_m^{n,j,I})\big((x^j)^T\Delta w_m^{n+1,I} + (y^j)^T\Delta w_m^{n+1,R}\big)\Big) + \delta_1 \\
&= \sum_{m=1}^{M}\Bigg(\bigg(\sum_{j=1}^{J}\Big[g'_{jR}(S^{n,j,R})\big(w_{M+1,m}^{n,R}f'(U_m^{n,j,R})x^j - w_{M+1,m}^{n,I}f'(U_m^{n,j,I})y^j\big) + g'_{jI}(S^{n,j,I})\big(w_{M+1,m}^{n,I}f'(U_m^{n,j,R})x^j + w_{M+1,m}^{n,R}f'(U_m^{n,j,I})y^j\big)\Big]\bigg)^T\Delta w_m^{n+1,R} \\
&\qquad\qquad + \bigg(\sum_{j=1}^{J}\Big[g'_{jR}(S^{n,j,R})\big(-w_{M+1,m}^{n,R}f'(U_m^{n,j,R})y^j - w_{M+1,m}^{n,I}f'(U_m^{n,j,I})x^j\big) + g'_{jI}(S^{n,j,I})\big(-w_{M+1,m}^{n,I}f'(U_m^{n,j,R})y^j + w_{M+1,m}^{n,R}f'(U_m^{n,j,I})x^j\big)\Big]\bigg)^T\Delta w_m^{n+1,I}\Bigg) + \delta_1 \\
&= \sum_{m=1}^{M}\Big((p_m^{n,R})^T\Delta w_m^{n+1,R} + (p_m^{n,I})^T\Delta w_m^{n+1,I}\Big) + \delta_1 \\
&\le (\tau-\eta)\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big) + \delta_1,
\end{aligned} \tag{36}$$

where

$$\begin{aligned}
\delta_1 = \frac{1}{2}\sum_{j=1}^{J}\sum_{m=1}^{M}\Big(&g'_{jR}(S^{n,j,R})\,w_{M+1,m}^{n,R}f''(t_m^{n,j,R})\big((x^j)^T\Delta w_m^{n+1,R} + (y^j)^T\Delta w_m^{n+1,I}\big)^2 \\
-\ &g'_{jR}(S^{n,j,R})\,w_{M+1,m}^{n,I}f''(t_m^{n,j,I})\big((x^j)^T\Delta w_m^{n+1,I} + (y^j)^T\Delta w_m^{n+1,R}\big)^2 \\
+\ &g'_{jI}(S^{n,j,I})\,w_{M+1,m}^{n,I}f''(t_m^{n,j,R})\big((x^j)^T\Delta w_m^{n+1,R} + (y^j)^T\Delta w_m^{n+1,I}\big)^2 \\
+\ &g'_{jI}(S^{n,j,I})\,w_{M+1,m}^{n,R}f''(t_m^{n,j,I})\big((x^j)^T\Delta w_m^{n+1,I} + (y^j)^T\Delta w_m^{n+1,R}\big)^2\Big).
\end{aligned} \tag{37}$$
Using Assumptions (A1) and (A2), (17), (18), (22) and the triangle inequality, we immediately get

$$\begin{aligned}
\delta_1 \le |\delta_1| &\le C_5\sum_{m=1}^{M}\big(\|\Delta w_m^{n+1,R}\|^2 + \|\Delta w_m^{n+1,I}\|^2\big) \\
&\le C_5\sum_{m=1}^{M}\Big(\big(\tau_m^{n,R}\|\Delta w_m^{n,R}\| + \eta\|p_m^{n,R}\|\big)^2 + \big(\tau_m^{n,I}\|\Delta w_m^{n,I}\| + \eta\|p_m^{n,I}\|\big)^2\Big) \\
&\le C_5(\tau+\eta)^2\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big),
\end{aligned} \tag{38}$$

where $C_5 = 2JC_1C_2C_3\max_{1\le j\le J}\{\|x^j\|^2 + \|y^j\|^2\}$. Now, (25) results from (36) and (38).
According to (20), (22), (23) and (28) we have

$$\begin{aligned}
&\sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big)\Big) \\
&\le C_3\sum_{j=1}^{J}\Big(\|\Delta w_{M+1}^{n+1,R}\|^2 + \|\Delta w_{M+1}^{n+1,I}\|^2 + \|\psi^{n,j,R}\|^2 + \|\psi^{n,j,I}\|^2\Big) \\
&\le C_3\sum_{j=1}^{J}\bigg((\tau+\eta)^2\big(\|p_{M+1}^{n,R}\|^2 + \|p_{M+1}^{n,I}\|^2\big) + 2C_4(\tau+\eta)^2\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big)\bigg) \\
&\le C_6(\tau+\eta)^2\sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big)
\end{aligned} \tag{39}$$
and

$$\begin{aligned}
&\frac{1}{2}\sum_{j=1}^{J}\Big(g''_{jR}(t_1^{n,j})\big(S^{n+1,j,R} - S^{n,j,R}\big)^2 + g''_{jI}(t_2^{n,j})\big(S^{n+1,j,I} - S^{n,j,I}\big)^2\Big) \\
&\le \frac{C_3}{2}\sum_{j=1}^{J}\Big(\big(S^{n+1,j,R} - S^{n,j,R}\big)^2 + \big(S^{n+1,j,I} - S^{n,j,I}\big)^2\Big) \\
&= \frac{C_3}{2}\sum_{j=1}^{J}\bigg(\Big((H^{n+1,j,R})^T\Delta w_{M+1}^{n+1,R} - (H^{n+1,j,I})^T\Delta w_{M+1}^{n+1,I} + (\psi^{n,j,R})^T w_{M+1}^{n,R} - (\psi^{n,j,I})^T w_{M+1}^{n,I}\Big)^2 \\
&\qquad\qquad + \Big((H^{n+1,j,R})^T\Delta w_{M+1}^{n+1,I} + (H^{n+1,j,I})^T\Delta w_{M+1}^{n+1,R} + (\psi^{n,j,R})^T w_{M+1}^{n,I} + (\psi^{n,j,I})^T w_{M+1}^{n,R}\Big)^2\bigg) \\
&\le 2C_3 J\max\{(C_0)^2 + (C_2)^2\}\Big(\|\Delta w_{M+1}^{n+1,R}\|^2 + \|\Delta w_{M+1}^{n+1,I}\|^2 + \|\psi^{n,j,R}\|^2 + \|\psi^{n,j,I}\|^2\Big) \\
&\le C_7(\tau+\eta)^2\sum_{m=1}^{M+1}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big),
\end{aligned} \tag{40}$$

where $C_6 = JC_3\max\{1, 2C_4\}$ and $C_7 = 2JC_3\max\{(C_0)^2 + (C_2)^2\}\max\{1, 2C_4\}$. Finally we obtain (26) and (27). □
Now, we are ready to prove Theorem 1 using the two lemmas above.
Proof of Theorem 1. First we prove (i). By (24)–(27) and Taylor's formula we have

$$\begin{aligned}
&E(\mathbf{W}^{n+1}) - E(\mathbf{W}^n) \\
&= \sum_{j=1}^{J}\Big(g_{jR}(S^{n+1,j,R}) - g_{jR}(S^{n,j,R}) + g_{jI}(S^{n+1,j,I}) - g_{jI}(S^{n,j,I})\Big) \\
&= \sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big(S^{n+1,j,R} - S^{n,j,R}\big) + g'_{jI}(S^{n,j,I})\big(S^{n+1,j,I} - S^{n,j,I}\big) \\
&\qquad\quad + \frac{1}{2}g''_{jR}(t_1^{n,j})\big(S^{n+1,j,R} - S^{n,j,R}\big)^2 + \frac{1}{2}g''_{jI}(t_2^{n,j})\big(S^{n+1,j,I} - S^{n,j,I}\big)^2\Big) \\
&= \sum_{j=1}^{J}\Big(g'_{jR}(S^{n,j,R})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (H^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((H^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (H^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big) \\
&\qquad\quad + g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T w_{M+1}^{n,R} - (\psi^{n,j,I})^T w_{M+1}^{n,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T w_{M+1}^{n,I} + (\psi^{n,j,I})^T w_{M+1}^{n,R}\big) \\
&\qquad\quad + g'_{jR}(S^{n,j,R})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,R} - (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,I}\big) + g'_{jI}(S^{n,j,I})\big((\psi^{n,j,R})^T\Delta w_{M+1}^{n+1,I} + (\psi^{n,j,I})^T\Delta w_{M+1}^{n+1,R}\big) \\
&\qquad\quad + \frac{1}{2}g''_{jR}(t_1^{n,j})\big(S^{n+1,j,R} - S^{n,j,R}\big)^2 + \frac{1}{2}g''_{jI}(t_2^{n,j})\big(S^{n+1,j,I} - S^{n,j,I}\big)^2\Big) \\
&\le \big(\tau - \eta + (C_6+C_7)(\tau+\eta)^2\big)\big(\|p_{M+1}^{n,R}\|^2 + \|p_{M+1}^{n,I}\|^2\big) + \big(\tau - \eta + (C_5+C_6+C_7)(\tau+\eta)^2\big)\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big),
\end{aligned} \tag{41}$$

where $t_1^{n,j} \in \mathbb{R}$ is on the segment between $S^{n+1,j,R}$ and $S^{n,j,R}$, and $t_2^{n,j} \in \mathbb{R}$ is on the segment between $S^{n+1,j,I}$ and $S^{n,j,I}$.
Obviously, by choosing $C^F = C_5 + C_6 + C_7$ and the learning rate $\eta$ to satisfy

$$0 < \eta < \frac{1-s}{C^F(1+s)^2}, \tag{42}$$

we have

$$\tau - \eta + (C_6+C_7)(\tau+\eta)^2 \le 0, \qquad \tau - \eta + (C_5+C_6+C_7)(\tau+\eta)^2 \le 0.$$

Indeed, since $\tau = s\eta$, $\tau - \eta + C^F(\tau+\eta)^2 = \eta\big((s-1) + C^F(1+s)^2\eta\big) \le \eta\big((s-1) + (1-s)\big) = 0$, and the same bound holds for the smaller constant $C_6 + C_7 \le C^F$. Then there holds

$$E(\mathbf{W}^{n+1}) \le E(\mathbf{W}^n), \quad n = 0, 1, 2, \cdots. \tag{43}$$
Conclusion (ii) immediately results from Conclusion (i), since $E(\mathbf{W}^n) \ge 0$.

Next, we prove (iii). In the following we suppose (42) is valid. Let $\alpha = -\big(\tau - \eta + (C_6+C_7)(\tau+\eta)^2\big)$ and $\beta = -\big(\tau - \eta + (C_5+C_6+C_7)(\tau+\eta)^2\big)$; then $\alpha \ge 0$, $\beta \ge 0$. According to (41), we have

$$\begin{aligned}
E(\mathbf{W}^{n+1}) &\le E(\mathbf{W}^n) - \alpha\big(\|p_{M+1}^{n,R}\|^2 + \|p_{M+1}^{n,I}\|^2\big) - \beta\sum_{m=1}^{M}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big) \\
&\le \cdots \le E(\mathbf{W}^0) - \sum_{k=0}^{n}\bigg(\alpha\big(\|p_{M+1}^{k,R}\|^2 + \|p_{M+1}^{k,I}\|^2\big) + \beta\sum_{m=1}^{M}\big(\|p_m^{k,R}\|^2 + \|p_m^{k,I}\|^2\big)\bigg).
\end{aligned} \tag{44}$$

Since $E(\mathbf{W}^{n+1}) \ge 0$, there holds

$$\sum_{k=0}^{n}\bigg(\alpha\big(\|p_{M+1}^{k,R}\|^2 + \|p_{M+1}^{k,I}\|^2\big) + \beta\sum_{m=1}^{M}\big(\|p_m^{k,R}\|^2 + \|p_m^{k,I}\|^2\big)\bigg) \le E(\mathbf{W}^0).$$

Let $n \to \infty$; then

$$\sum_{k=0}^{\infty}\bigg(\alpha\big(\|p_{M+1}^{k,R}\|^2 + \|p_{M+1}^{k,I}\|^2\big) + \beta\sum_{m=1}^{M}\big(\|p_m^{k,R}\|^2 + \|p_m^{k,I}\|^2\big)\bigg) \le E(\mathbf{W}^0) < \infty.$$

Hence, there holds

$$\lim_{n\to\infty}\bigg(\Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^R}\Big\|^2 + \Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^I}\Big\|^2\bigg) = \lim_{n\to\infty}\big(\|p_m^{n,R}\|^2 + \|p_m^{n,I}\|^2\big) = 0,$$

which implies

$$\lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^R}\Big\| = \lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{W}^n)}{\partial w_m^I}\Big\| = 0, \quad 1 \le m \le M+1. \tag{45}$$
Finally, we prove (iv). We use (14), (15), (28), (29), and (45) to obtain

$$\lim_{n\to\infty}\|w_m^{n+1,R} - w_m^{n,R}\| = 0, \qquad \lim_{n\to\infty}\|w_m^{n+1,I} - w_m^{n,I}\| = 0, \quad m = 1, \cdots, M+1. \tag{46}$$

Write $\mathbf{v} = ((w_1^R)^T, \cdots, (w_{M+1}^R)^T, (w_1^I)^T, \cdots, (w_{M+1}^I)^T)^T$; then $E(\mathbf{W})$ can be viewed as a function of $\mathbf{v}$, denoted $E(\mathbf{v})$:

$$E(\mathbf{W}) \equiv E(\mathbf{v}).$$

Obviously, $E(\mathbf{v})$ is a continuously differentiable real-valued function and

$$\frac{\partial E(\mathbf{W})}{\partial w_m^R} \equiv \frac{\partial E(\mathbf{v})}{\partial w_m^R}, \qquad \frac{\partial E(\mathbf{W})}{\partial w_m^I} \equiv \frac{\partial E(\mathbf{v})}{\partial w_m^I}, \quad m = 1, \cdots, M+1.$$

Let $\mathbf{v}^n = ((w_1^{n,R})^T, \cdots, (w_{M+1}^{n,R})^T, (w_1^{n,I})^T, \cdots, (w_{M+1}^{n,I})^T)^T$; then by (45) we have

$$\lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{v}^n)}{\partial w_m^R}\Big\| = \lim_{n\to\infty}\Big\|\frac{\partial E(\mathbf{v}^n)}{\partial w_m^I}\Big\| = 0, \quad m = 1, \cdots, M+1. \tag{47}$$

From Assumption (A3), (46), (47) and Lemma 1 we know that there is a $\mathbf{v}^F$ satisfying $\lim_{n\to\infty}\mathbf{v}^n = \mathbf{v}^F$. By considering the relationship between $\mathbf{v}^n$ and $\mathbf{W}^n$, we immediately get the desired result. We thus complete the proof. □
References
[1] Mandic D. P., Goh S. L.: Complex Valued Nonlinear Adaptive Filters: Noncircularity,
Widely Linear and Neural Models. New York: Wiley, 2009.
[2] Hirose A.: Complex-Valued Neural Networks. New York: Springer-Verlag New York Inc.,
2006.
[3] Sanjay S. P. R., William W. H.: Complex-valued neural networks for nonlinear complex
principal component analysis. Neural Networks, 18, 2005, pp. 61-69.
[4] George G. M., Koutsougeras C.: Complex Domain Backpropagation. IEEE Trans. Circuits
and Systems – II: Analog and Digital Signal Processing, 39, 5, 1992, pp. 330-334.
[5] Benvenuto T., Piazza F.: On the complex backpropagation algorithm. IEEE Trans. Signal
Processing, 40, 4, 1992, pp. 967-969.
[6] Kim T., Adali T.: Fully complex backpropagation for constant envelope signal processing. In:
Proc. IEEE Workshop Neural Network Signal Process., Sydney, Australia, 2000, pp. 231-240.
[7] Kim T., Adali T.: Nonlinear satellite channel equalization using fully complex feed-forward
neural networks. In: Proc. IEEE Workshop Nonlinear Signal Image Process., Baltimore, MD,
2001, pp. 141-150.
[8] Yang S. S., Siu S., Ho C. L.: Analysis of the initial values in split-complex backpropagation
algorithm. IEEE Trans. Neural Networks, 19, 9, 2008, pp. 1564-1573.
[9] Wu W., Feng G. R., Li Z. X., Xu, Y. S.: Deterministic convergence of an online gradient
method for BP neural networks. IEEE Trans. Neural Networks, 16, 3, 2005, pp. 533-540.
[10] White H.: Some asymptotic results for learning in single hidden-layer feedforward network
models. J. Amer. Statist. Assoc., 84, 1989, pp. 1003-1013.
[11] Zhang H., Wu W., Liu F., Yao M.: Boundedness and convergence of online gradient method
with penalty for feedforward neural networks. IEEE Trans. Neural Networks, 20, 6, 2009,
pp. 1050-1054.
[12] Mandic D. P., Chambers J. A.: Recurrent neural networks for prediction: learning algorithms, architectures and stability. New York: Wiley, 2001.
[13] Goh S. L., Mandic D. P.: A complex-valued RTRL algorithm for recurrent neural networks.
Neural Computation. 16, 12, 2004, pp. 2699-2713.
[14] Xu D. P., Zhang H. S., Liu L. J.: Convergence analysis of three classes of split-complex
gradient algorithms for complex-valued recurrent neural networks. Neural Computation, 22,
10, 2010, pp. 2655-2677.
[15] Zhang H. S., Xu D. P., Wang Z. P.: Convergence of an online split-complex gradient algorithm for complex-valued neural networks. Discrete Dyn. Nat. Soc., 2010, pp. 1-27, DOI:
10.1155/2010/829692.
[16] Polyak B. T.: Introduction to Optimization. New York: Optimization Software, Inc., Publications Division, 1987.
[17] Shi Z. J., Guo J.: Convergence of memory gradient method. International Journal of Computer Mathematics, 85, 2008, pp. 1039-1053.
[18] Qian N.: On the momentum term in gradient descent learning algorithms. Neural Networks,
12, 1999, pp. 145-151.
[19] Wu W., Zhang N. M., Li Z. X., Li L., Liu Y.: Convergence of gradient method with momentum for back-propagation neural networks. Journal of Computational Mathematics, 26,
2008, pp. 613-623.
[20] Ujang B. C., Took C. C., Mandic D. P.: Split quaternion nonlinear adaptive filtering. Neural
Networks, 23, 2010, pp. 426-434.
[21] Ortega J. M., Rheinboldt W. C.: Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic Press, 1970.
[22] Nitta T.: Orthogonality of decision boundaries in complex-valued neural networks. Neural
Computation, 16, 2003, pp. 73-97.
[23] Savitha R., Suresh S., Sundararajan N.: A fully complex-valued radial basis function. International Journal of Neural Systems, 19, 4, 2009, pp. 253-267.
