This is the text version of the file http://www.cs.york.ac.uk/~sara/project/oldprojects/aston/thesis.ps.
G o o g l e automatically generates text versions of documents as we crawl the web.


Google is neither affiliated with the authors of this page nor responsible for its content.
These search terms have been highlighted: natural gradient bfgs 

Computationally efficient natural gradient descent
Page 1
Computationally efficient natural
gradient descent
Sara-Jayne Farmer
Master of Science by Research
in
Pattern Analysis and Neural Networks
Aston University
September 1998
This copy of the thesis has been supplied on condition that anyone who consults
it is understood to recognise that its copyright rests with its author and that no
quotation from the thesis and no information derived from it may be published
without proper acknowledgement.
1

Page 2
Aston University
Computationally ecient natural
gradient descent
Sara-Jayne Farmer
Master of Science by Research
in
Pattern Analysis and Neural Networks, 1998
This study examines the use of matrix momentum terms with the aim of creating a more
computationally ecient natural gradient descent algorithm for on-line learning. It uses
the statistical mechanics framework created by Saad and Solla to describe the evolution of
order parameters for this algorithm in a two-layer student-teacher scenario, and compares
this with results from matrix-momentum natural gradient learning of real datasets.
2

Page 3
Acknowledgements
Thanks go to Magnus Rattray and David Saad for supervision, guidance and several
patient explanations of strange matrix indices: any good stu in this thesis is inevitably
theirs. Thanks also go to Genny Orr for providing the phoneme classication dataset used
in chapter 4.
3

Page 4
Contents
1 Introduction
8
1.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
1.2 Multilayer perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Backpropagation learning . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Online learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Online learning and time-varying data . . . . . . . . . . . . . . . . . 12
1.4 Neural networks as statistical approximators . . . . . . . . . . . . . . . . . 13
1.5 Neural networks as points in parameter space . . . . . . . . . . . . . . . . . 13
1.5.1 Applying dierential geometry to neural networks . . . . . . . . . . 13
1.6 Second-order learning methods . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.1 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6.2 Approximating the Hessian . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 Matrix Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.7.1 Balancing gradient descent and matrix momentum . . . . . . . . . . 18
1.8 Natural gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8.1 The Fisher information matrix . . . . . . . . . . . . . . . . . . . . . 20
1.8.2 Matrix momentum for natural gradient descent (MM-NGD) . . . . . 20
2 Analysis of MM-NGD for online learning
22
2.1 Statistical mechanics analysis of online learning . . . . . . . . . . . . . . . . 22
2.1.1 Evolution of order parameters . . . . . . . . . . . . . . . . . . . . . . 23
2.1.2 Evolution of generalisation error . . . . . . . . . . . . . . . . . . . . 23
2.2 Example networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Generalisation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Analysis of gradient descent algorithms . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Matrix momentum version of Newton’s method . . . . . . . . . . . . 26
2.3.3 MM-NGD with the full Fisher information matrix . . . . . . . . . . 27
2.3.4 MM-NGD with single-input Fisher information . . . . . . . . . . . . 28
3 Numerical results
30
3.1 Learning parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Phases of learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Comparing standard, natural and MM-NGD learning algorithms . . . . . . 31
3.4.1 MM-NGD without teacher noise . . . . . . . . . . . . . . . . . . . . 32
3.4.2 MM-NGD with teacher noise . . . . . . . . . . . . . . . . . . . . . . 33
4

Page 5
3.5 Varying other parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Using MM-NGD on real data
36
4.1 MM-NGD for MLPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1 Calculating the Fisher information matrix . . . . . . . . . . . . . . . 37
4.1.2 Approximating the Fisher information matrix . . . . . . . . . . . . . 37
4.2 Small datasets : the Iris dataset . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Medium-sized datasets : wine classication . . . . . . . . . . . . . . . . . . 39
4.3.1 varying , denitions . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Large datasets : speech (phoneme classication) data . . . . . . . . . . . . . 41
4.4.1 Large Fisher information matrices . . . . . . . . . . . . . . . . . . . 41
4.4.2 Slow learning times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Conclusions
42
5.1 Possible extensions to this work . . . . . . . . . . . . . . . . . . . . . . . . . 43
A Variables and notation
46
A.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
A.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B Analysis of MM-NGD for soft committee machines
49
C The Fisher information matrix
52
C.1 Exact Fisher information matrix for a soft committee network . . . . . . . . 52
C.2 Exact Fisher information matrix for a multilayer perceptron . . . . . . . . . 54
C.2.1 Reducing the calculation time . . . . . . . . . . . . . . . . . . . . . . 55
5

Page 6
List of Figures
3.1
MM-NGD vs NGD and optimal gradient descent
. . . . . . . . . . . . . . . . . 32
3.2
Symmetric phase behaviour of MM-NGD (teacher noise=0)
. . . . . . . . . . . . 33
3.3
Asymptotic behaviour of MM-NGD for noisy teacher (
2
m
= 0
:
01)
. . . . . . . . . 34
3.4 Generalisation error evolution for various values . . . . . . . . . . . . . . 35
4.1
Iris data,
= 0
:
16, various
k . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2
Wine data, varying
k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3
Comparing MM-NGD algorithms
. . . . . . . . . . . . . . . . . . . . . . . . . 40
6

Page 7
List of Tables
C.1 Speech network calculation times (in ops) and dominant calculation sizes
(%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7

Page 8
Chapter 1
Introduction
This study examines the use of matrix momentum terms and natural gradient descent in
the online learning of neural networks. Its aim is to create a more computationally ecient
natural gradient descent algorithm for on-line learning. It uses the statistical mechanics
framework created in [24] to describe the evolution of order parameters for this algorithm
in two-layer networks within a student-teacher scenario.
This chapter introduces the tools needed for this work and some of the issues in their
use.
1.1 Neural networks
A neural network is a exible probabilistic model of an unknown mapping instanced by a
set of examples. A network has nodes, parameters and activation functions. Nodes receive,
modify and pass on (propagate) information. Input nodes receive information in the form
of vectors or individual values from outside. Output nodes modify then pass information
to the user. Hidden nodes modify and pass information between the input and output
nodes, and play the main role in processing incoming data in a network. A network’s
structure (nodes and the layout of links between them) is usually specied by the person
building the network, but there are some algorithms that can adapt the network structure
to best represent the data being learnt.
Each node in a network may have several inputs, either from other nodes or from
outside the network. The values input to a node are termed its activation; the value output
from a node is a function of its activations and is termed its activation function. Activation
functions are specied by the network designer and mostly range from simple summation
8

Page 9
of all the inputs to a function of that sum. For nonlinear networks (networks that can
represent nonlinear functions) and derivative-based learning algorithms, the activation
function of the hidden nodes must be dierentiable, for example tanh or erf fuctions
(these also squash their outputs into a range between 0 and 1 or -1 and +1).
Networks also have numerical parameters which are used to control the behaviour of
a network and are usually the weights on the links between nodes. the weight on a link
is used to modify any output from one node before it is input to the other. Parameters
can be modied to improve the performance of a network: this modication is known as
learning or training. Learning algorithms adapt the parameters of a network so that the
outputs from it are as close as possible to the outputs given by the system that they are
modelling given random inputs from the same distribution as the examples.
Using a network consists of passing information to its input nodes and reading o the
activations of its output nodes. Although the ability to store examples may be useful, the
main aspect of using a neural network that we will focus on is generalisation: the ability to
summarise information and return consistent outputs for inputs that the network has not
been trained on. This is achieved with a learning algorithm. A learning algorithm adjusts
the
parameters
of a network to t a set of
examples
, where each example consists of
an input vector
and the output
that is expected from the network, given that input
and the system that the network is trying to model(here, is an index to the example, as
are most superscripts used in this document).
Because networks learn the underlying function that generated a set of examples rather
than the exact connection between each example’s input and output pair, they are robust:
able to learn from noisy or incomplete examples. How well a network is able to generalise
is determined by the user’s choice of network and activation function; tools like cross-
validation can be used to assess the expected performance. Neural networks are most
commonly used to model systems which can produce example outputs or input-output
pairs but whose underlying rules are either unknown or too expensive to model. There
are several varieties of neural network architectures, for instance multilayer perceptrons
and radial basis functions: in this document, only the networks known as multilayer
perceptrons are considered.
9

Page 10
1.2 Multilayer perceptrons
The most commonly-used neural networks are multilayer perceptrons, or MLPs. The
nodes in an MLP are arranged into layers : an input layer, then one or more hidden layers
and an output layer, where each layer is fully connected to the next layer in the network.
Connections between layers can only be directed forwards in the network: for instance,
connections can be made from a hidden layer to the output layer but connections from the
output layer back to a hidden layer are not allowed. Although not strictly necessary, the
same activation function is usually dened for all hidden and output nodes in an MLP.
Using an MLP consists of activating the input nodes, propagating that input between
the network’s layers and reading the output from its output nodes.
The most common MLP has one input layer, one hidden layer and one output layer
and is known as a two-layer network (the input layer is ignored when counting the number
of layers in a network). There are some variants of MLPs, for example a soft committee
network or soft committee machine [4] is a two-layer MLP whose hidden to output weights
are not allowed to vary, and are usually set to one, and output nodes whose activation
functions are or are proportional to a simple sum of their input activations. Soft committee
machines are used as example networks for much of this document and preserve most of
the characteristics of general two layer networks [20]. Reviews of other MLP variants and
their learning methods can be found in [5][9]).
1.2.1 Backpropagation learning
In the soft committee machines used in this document, the MLP parameters are re-
stricted to the input-to-hidden weights
J
. Before learning occurs, these are set to initial
values which are usually small and randomly distributed.
Learning in MLPs is usually done by backpropagation, which works in two stages.
First, an input
is propagated (modied and passed on) forward from the input layer to
the output layer through all the hidden layers of an MLP, to produce a network output
(
J;
). This is the forward pass of backpropagation. The network output (
J;
)
produced in the forward pass is then compared against the expected output
for the
corresponding input
. This comparison produces a network error
J
(
;
), which is
then propagated back from the output units to the input units, to give the local gradient
i
for each node
i
. This is the backward pass which gives the algorithm its name. When
the backward pass through the network is completed, the local gradients
i
(also known
10

Page 11
as backpropagation delta functions) are used to adjust the network parameters, and the
next forward pass can begin.
The network
error
J
(
;
) is used in the backward pass of the backpropagation
error.
J
(
;
) is a measure for the distance between the actual output (
J;
) and the
expected output
of a network with parameters
J
which has been given the input
.
In this document, this distance is taken to be quadratic:
J
(
;
) =
1
2
[ (
J;
)
)]
2
(1.1)
In
online learning
, the network error for just the latest example (
;
) is propagated back
through the network in the backward pass of the algorithm; in
batch learning
, an average
over all the available input patterns
D
=
f
(
1
;
1
)
:::
(
M
;
M
)
g
is used. This means that
all the examples must be available at every stage of batch learning, but online learning
can process one example at a time.
Although the network error is used in this learning algorithm, the
generalisation error
g
(
J
) is more useful in analysing the performance of a learning algorithm. The generali-
sation error gives the probability that the network will produce the wrong output for an
example that it has not been trained on but is drawn from the same distribution or model
as the training examples; it is dened as the network error averaged over the distribution
of network inputs:
g
(
J
) =
h
J
(
;
)
i
f g
(1.2)
1.2.2 Gradient descent
The errors for dierent values of the network parameters form an error landscape or
surface, where each point represents the error for a particular choice of parameters and
distances between points measure the dierences between the errors.
The error term passed back through the network is not the error at each node, but the
derivative of the network error with respect to the node’s parameters. This is proportional
to the derivative of the nodes activation function, which must therefore be dierentiable
for backpropagation to work.
Most MLP learning algorithms adjust the network parameters
J
by subtracting a
fraction of the error gradient
r
J J
(
;
) from them:
J
+1
=
J
r
J J
(
;
)
(1.3)
11

Page 12
where is known as the
learning rate
. This ensures that, for a general series of examples
(
;
) and an appropriate value of , the learning algorithm will move the network error,
on average, downhill in the error landscape.
The learning rate
controls the step-size on the error landscape. If it is constant
throughout learning a network’s parameters, then the learning algorithm will either be
forced to take small steps and a long time traversing the landscape towards an error
surface minimum, or larger steps which will overshoot the minimum in the nal stages of
learning. These problems can be avoided (or alleviated) by
annealing
the learning rate:
gradually reducing the value of over time.
1.3 Online learning
Algorithms where parameter adjustment is done as each new example is presented (rather
than for all examples at once) are known as
on-line learning
or
pattern learning
. Online
learning is useful when examples are only available serially, the task is nonstationary (the
function that is being learnt by the network is changing over time) or there is less data,
space or time available than is needed to process an entire dataset (group of examples)
simultaneously. This report deals exclusively with online learning.
1.3.1 Online learning and time-varying data
For many of the equations shown below, the input data is assumed to be iid: each input
is uncorrelated with the inputs that went before it. Data for online learning is produced
sequentially or sampled from datasets too big to be loaded completely into a system.
If data is sampled with replacement, then it will not be uncorrelated. Many examples
of real sequential data are from underlying systems whose parameters change over time
which often means that there are weak correlations between successive inputs, and may
imply that older data will be of less value to the learning algorithm and might need to be
forgotten. These factors have not been considered in this report, but may prove signicant
when using real data
1
.
1
For example, Heskes and Coolen[11] report on the eects of correlation between inputs to two-layer
networks, but this has not been considered here
12

Page 13
1.4 Neural networks as statistical approximators
A neural network models the expected value
h
P
(
j
;
J
)
i
f
;
g
(1.4)
of the conditional probability of its outputs
given its inputs , parameterised by its
weights
J
and sometimes also parameterised by its structure. Learning in a network is
the process of adapting the parameters
J
to get the mapping from this model as close
as possible to the underlying rule that generated a set of example input-output pairs
D
=
f
(
1
;
1
)
:::
(
M
;
M
)
g
. Given a prior distribution for the data
P
(
D
), this can also be
seen as maximising
P
(
J
j
D
) with respect to
J
.
1.5 Neural networks as points in parameter space
Information geometry is the application of techniques from dierential geometry to statis-
tical models [13] [1]. It represents families of neural networks as manifolds (k-dimensional
geometric objects which can be mapped onto Euclidean space) in a space (d-dimensional,
where k<d) whose coordinates are the parameters (weights and biases) of the networks, and
each individual network is represented by a point on the manifold which corresponds to its
particular parameters. As an example, a two-layer soft committee network with weights
J
1
:::J
N
can be represented as a point on a manifold of soft committee network mappings
in
N
-dimensional parameter space, and similar networks with dierent parameter values
would be represented by dierent points on the same manifold.
The curvatures of the manifolds are described using tensors. A Riemann tensor is
a matrix which describes the curvature of a surface at a specic point in space. In n-
dimensional space, the tensor is an n-by-n symmetric matrix: e.g. in four dimensions, this
is 4-by-4 with 16 elements but only 10 independent components because of its symmetry.
A complete surface is described by placing tensors at dierent points on the surface, or by
parametrising the tensor so that any point on the surface could be described: the resulting
collection of tensors is known as a vector eld, and dierential geometry is the study of
their properties.
1.5.1 Applying dierential geometry to neural networks
In this representation of a neural network, learning can be seen as moving the point that
represents a particular network closer to a (perhaps theoretical) set of parameters that
13

Page 14
generated a set of given input-output examples. This can be a powerful tool for the
analysis of learning, and it can also be used to create optimal or near-optimal learning
algorithms.
Having described learning as moving points closer together in parameter space, we need
to have a measure of how close two points in that space are in terms of the probability
distributions realised by them. There are several measures
2
(for example Kullback-Leibler,
Hellinger, squared and Euclidean) in statistics for the distance between two distributions
p
and
q
, but one of the most natural and useful is the Kullback-Leibler distance
KL
(
p
(
x
)
;q
(
x
)) =
Z
p
(
x
) log
p
(
x
)
q
(
x
)
dx
(1.5)
which for two distributions
p
(
;
;
J
) and
q
(
;
;
J
) of network inputs and outputs
parameterised by
J
is
KL
(
p
(
;
;
J
)
;q
(
;
;
J
)) =
Z
p
(
;
;
J
) log
p
(
;
;
J
)
q
(
;
;
J
)
d
(
;
)
(1.6)
We need to apply this to the parameter-space representation of neural networks.
In information geometry, the Kullback-Leibler distance between mappings realised by
two nearby points in a space is known as a metric : if we are measuring small distances in
parameter space (which is what we expect in online learning) and can ignore higher orders
of
dJ
, then equation (1.6) becomes
KL
(
p
(
;
;
J
)
;p
(
;
;
J
+
dJ
)) =
1
2
dJ
T
G
(
J
)
dJ
(1.7)
where
G
(
J
) =
h
(
r
J
log
P
J
(
;
;
J
)(
r
J
log
P
J
(
;
;
J
)
T
i
f
p
(
;
)
g
(1.8)
is the metric for this space, and is known as the Fisher information matrix. Gradient
descent in parameter spaces which use the Kullback-Leibler measure of distance is known
as natural gradient descent.
1.6 Second-order learning methods
Several second order methods have been devised to speed up learning. Examining the rst
few terms of a Taylor expansion of an error
J
(
;
) about a point
J
in weight-space
gives [5]:
^
J
(
;
) =
J
(
;
) + (
^
J J
)
r
J J
(
;
)
j
J
+
1
2
(
^
J J
)
T
H
(
^
J J
)
(1.9)
2
these are all dened through delta-divergences [29][30]
14

Page 15
where
H
=
@
J
(
;
)
@J
i
@J
k
j
J
(1.10)
is known as the Hessian matrix and
r
J J
(
;
)
j
J
=
@
J
(
;
)
@J
i
j
J
(1.11)
is the error surface gradient at
J
.
Many gradient descent algorithms use the error’s gradient
r
J J
(
;
) to calculate where
to move next in the energy landscape, and ignore any higher-order terms, but an improve-
ment in their learning speed can be had from using the second-order term
1
2
(
^
J J
)
T
H
(
^
J J
)
too. (Third and higher-order terms aren’t used because they are usually insignicant and
are more dicult to handle).
1.6.1 Newton’s method
The most common second-order network learning method is based on Newton’s method
for nding the minimum of a function[5].
If we assume that the local gradient obtained by dierentiating the second-order
J
(
;
)
estimate (equation 1.9) with respect to (
^
J J
) is zero, then we have
r
J J
(
;
)
j
J
+
H
(
^
J J
) = 0
(1.12)
i.e.
^
J
=
J H
1
r
J J
(
;
)
(1.13)
Since we have ignored higher-order terms in the Taylor expansion, this is not an exact
equation for
^
J
, and an iterative procedure must be used (we also focus on online learning
which requires iterative updates). This gives an update equation for the weights
J
of
J
+1
=
J
H
1
r
J J
(
;
)
(1.14)
is the learning rate as before.
Learning with Newton’s method converges faster than rst-order gradient descent but
it is very sensitive to rescaling (e.g. normalisation) of inputs, and it needs a lot of pro-
cessing and system space to create, store and invert the Hessian matrix. In addition,
employing Newton’s method makes the algorithm converge towards and stabilise on any
xed points in the error landscape regardless of whether they are minima or saddlepoints,
and this may result in suboptimal performance in complex learning dynamics.
15

Page 16
1.6.2 Approximating the Hessian
Accurately calculating the inverse Hessian
H
1 used in Newton’s method requires an
average over all the input data (this is, of course, only available in batch learning), followed
by a (large) matrix inversion, at every step of the gradient descent algorithm. This can
take a long time to calculate, and since the inverse Hessian is a very large matrix, it can
be impractical or impossible to store during calculations. It therefore makes sense to look
for more ecient Hessian algorithms and reasonable approximations to these matrices.
One approximation to the Hessian is to use only some of the input data to evaluate
the Hessian matrix. Where only the latest input is used, this is termed here a single-
step approximation. The calculations used to create the elements of the Hessian matrix
can also be approximated : variations of Newton’s method that use this are known as
pseudo-Newton or quasi-Newton algorithms. Simple approximations to the Hessian in-
clude diagonal approximations and the Levenberg-Marquardt approximation[5]. Diagonal
approximations only contain the elements on the diagonal of the Hessian. they are not
very precise: the o-diagonal elements of the Hessian are signicant in many learning
scenarios, and inverting a diagonal approximation of the Hessian diers signicantly from
the diagonal of its inverse [16]. the Levenberg-Marquardt approximation depends on the
expansion of the Hessian for a sum-of-squares error
J
(
;
) =
1
2
P
n
(
n
n
)
2
into its
components:
@
2
J
(
;
)
@J
i
@J
k
=
X
n
@
n
@J
i
@
n
@J
k
+
X
n
(
n
n
)
@
2
n
@J
i
@J
k
(1.15)
and ignoring the second term in that equation. This is a reasonable approximation if
(
n
n
) is small. Other algorithms build up the Hessian iteratively from rst-order
terms: the most common of these are the BFGS (Broyden-Fletcher-Goldfarb-Shanno) and
the Davidson-Fletcher-Powell procedures.
Other algorithms include approximations to the inverse Hessian [8], but since the
Hessian
H
is often used multiplied by a vector
v
, it can be more practical to create
the product
Hv
directly. Pearlmutter [16] gives an algorithm for doing this eciently
and exactly by setting
J
^
J
=
rv
in the Taylor expansion above then taking limits as
r
approaches zero to give a dierential operator which can be applied to a network to
generate
Hv
.
16

Page 17
1.7 Matrix Momentum
Inverting the Hessian in Newton’s method can be avoided by incorporating it into a matrix
momentum term[15].
Momentum is a method used mostly in batch learning to counter the eect of gradient
descent algorithms oscillating across large gradient changes in one dimension whilst giving
little eort in a gradually-changing but consistent downward slope in another: this is most
apparent when the error surface is a gently-descending valley with steep sides. One way
to ameliorate this eect is to add a
momentum
term [22] to the weight update equations,
which adds a fraction of the previous weight update into the current one. This reduces the
strength of the oscillation across the surface whilst emphasising the smaller (but consistent)
movement in the other dimension. Including a momentum term into the update equation
1.3 in an online learning scenario gives:
J
+1
=
J
r
J J
(
;
) + (
J
J
1
)
(1.16)
where
is the
momentum parameter
and is usually set between 0 and 1 (e.g. ref [10]
suggest = 0
:
9) but it may be adaptive in an attempt to speed up network convergence.
Suggestions for speeding up network convergence also include using a separate
;
for
each network weight, but using momentum already has this eect.
A simple momentum term (e.g. =
constant
) is not terribly useful in online learning
because its eect is merely to rescale the learning rate . This is because, for an error
surface with a gradient which is roughly constant near the current network parameters
J
(this is the situation that we expect in online learning), using the update equation 1.16 is
equivalent to using a learning rate of [5] [27]
e
=
(1
)
(1.17)
If, however, the momentum term in equation 1.16 is replaced with a matrix =
I kH
and the learning rate is replaced with =
k
where
is a scalar learning rate
which can be annealed (depend on the normalised example index =
=N
), the update
equation for the network becomes
J
+1
=
J
k
r
J J
(
;
) + (
I kH
)(
J
J
1
)
(1.18)
and the eective learning rate is
e
=
H
1
(1.19)
17

Page 18
This is known as
matrix momentum
: it is important because it gives a way of eectively
premultiplying the gradient by a matrix inverse
H
1
without having to calculate that
matrix inverse [15]. Matrix momentum has been used by Orr and Leen[15] with the
matrix
H
set to the Hessian to give Newton’s method without inverting the Hessian.
Note that the constant
k
is used to balance the contributions from the gradient and
momentum parts of the update equation, and that the matrix momentum equations above
are only valid for large values of
k
[18].
1.7.1 Balancing gradient descent and matrix momentum
If we scale both and by
N
and introduce a new variable where
i
=
N
(
J
i
J
1
i
)
(1.20)
then we can substitute these into equation 1.16 to get an update equation for
J
of
J
+1
i
=
J
i
N
r
J J
(
;
) +
N
i
(1.21)
We can then use equations 1.20 and 1.21 to also get an update equation for of
+1
i
=
i
r
J J
(
;
) =
i
r
J J
(
;
) + (
1)
i
(1.22)
where equation 1.21 and equation 1.22 form a coupled pair of equations.
Using this coupled pair of equations, reference [15], which is restricted to the asymp-
totic regime, suggests an optimal matrix momentum =
I
0
H
and annealed learning
rate =
0
=
where
0
is an initial learning rate. This gives an eective learning rate of
e
=
(1
)
=
H
1
(1.23)
which is similar to Newton’s method and does not require inverting the matrix
H
.
An alternative scaling of the , terms was suggested by [17], where and (1 ) are
both scaled by
N
. To do this, and are replaced by
= (1
N
)
=
k=N
If we now use this scaling and let
!1
,
k
!1
with
k=
constant, the eective learning
rate becomes [17]
e
=
k=
(1.24)
18

Page 19
This holds for =
I kH=N
and =
k
=N
with large values of
k
[18], again giving an
eective learning rate
e
of
e
=
H
1
(1.25)
where
k
balances the contributions of the gradient and momentum terms as before. Note
that, for eective training, the learning rate should start from a constant e.g.
0
, and
only asymptotically decay as
/
1
=
. These results are equally valid for any matrix,
including the Fisher information matrix
G
(
J
), and this scaling will be used throughout
the rest of this document.
1.8 Natural gradient descent
Natural gradient descent uses the Fisher information matrix as the natural metric in
parameter space. For gradient descent, the eect of using this metric is to premultiply the
gradient by the inverse of the Fisher information matrix, changing the update equation
for online standard gradient descent from equation 1.3 to
J
+1
=
J
N
G
(
J
)
1
(
J
)
r
J J
(
;
)
(1.26)
Useful features of natural gradient learning are that it is invariant to reparameterisation
of the model distribution (e.g. scaling the parameters does not change the eciency of
the learning algorithm), is asymptotically optimal [2], less prone than Newton’s method
to trapping in symmetric phases (trapping occurs when the algorithm becomes stable at
a saddlepoint or local minimum in the error landscape rather than its global minimum)
and the Fisher matrix
G
(
J
) is always positive denite.
The natural gradient descent algorithm requires the inverse
G
(
J
)
1
(
J
) of the Fisher
information matrix. Inverting the Fisher matrix can take a long time. Schemes for cal-
culating this inversion range from exact algorithms like block-wise partitioning and some
data preprocessing to facilitate the calculation [28] to diagonal approximations [2] [21].
In general, these algorithms can be computationally expensive to calculate and calcu-
lating the average
<>
over all inputs in the Fisher matrix also requires knowledge of all
input data: this may not be sensible or practical for many cases of on-line learning (which
is based on the premise that data is not all available at the same time). This document
concentrates on the rst problem: the second is addressed in [25].
19

Page 20
1.8.1 The Fisher information matrix
The loss function
L
of a network is dened [28] as the negative log-likelihood function
L
(
;
;
J
) = log
p
(
;
;
J
) = log
p
(
j
;
J
) + log
p
( )
(1.27)
where
p
(
;
;
J
) is the joint probability density function of the network inputs and outputs
parameterised by its weights
J
.
The Fisher information matrix
G
(
J
) of a network is dened from
L
as
G
(
J
) = [
G
i;k
]
(1.28)
where and are input node indices (1
i;k N
),
i
and
k
are hidden node indices
(1
i;k K
) and [
:
] is a block matrix whose elements are
G
i;k
=
@L
@J
i
@L
@J
k
f
;
g
=
@
log
p
(
j
;
J
)
@J
i
@
log
p
(
j
;
J
)
@J
k
f
;
g
(1.29)
where
<>
f
;
g
is an average over the output distribution , followed by an average over
the input distribution .
1.8.2 Matrix momentum for natural gradient descent (MM-NGD)
Matrix momentum is valid for any matrix. If we use the Fisher information matrix, we
have a learning algorithm that is equivalent to natural gradient descent without inverting
the Fisher information matrix.
The update equations for a matrix momentum form of natural gradient descent (from
now on, abbreviated to MM-NGD) with and scaled by the input size
N
are
J
+1
i
=
J
i
N
r
J J
(
;
) +
N
i
(1.30)
+1
i
=
i
r
J J
(
;
)
(1.31)
=
I
kG
(
J
)
N
(1.32)
=
k
N
(1.33)
where
i
=
N
(
J
i
J
1
i
),
is a scalar learning rate which can be annealed (depend on
the value of =
=N
, the normalised example index), and
G
(
J
) is the Fisher information
matrix.
When is signicantly larger than , the learning is dominated by the gradient and
is expected to be very similar to gradient descent. When is signicantly larger than
, then learning is dominated by the momentum terms and can be expected to depend
20

Page 21
largely on the value of the Fisher information matrix. The next chapter takes this as a
starting point, and analyses the eects of varying the parameters of MM-NGD learning
for soft committee networks.
21

Page 22
Chapter 2
Analysis of MM-NGD for online
learning
This chapter introduces the tools used to analyse the behaviour of a network’s parameters
and output errors during learning. It gives examples of this and equations for learning in
a type of MLP known as a soft committee machine.
2.1 Statistical mechanics analysis of online learning
Statistical mechanics seeks, in general, to describe a system of many interacting particles
(in this case, network parameters) in terms of a smaller number of
order parameters
.
These can be used to describe the system only if it is
self-averaging
: if the parameters
for each individual particle are identical (or very similar to) the average parameters for all
the particles. In the discussion that follows, all the order parameters have very sharply
peaked distributions with small variance, so only their mean values will be analysed.
It is dicult to reduce the number of parameters studied if we are looking at an
individual network, but if we assume that the outputs in the ’correct’ input-output ex-
amples given to a learning algorithm are generated by feeding the inputs into another
network, then we can describe the learning algorithm in terms of the dierence between
this ’teacher’ network and the ’student’ network which is learning from the examples. A
covariance matrix can be formed between the student and teacher activations. This con-
tains the overlaps
Q
ik
=
J
i
J
k
,
R
in
=
J
i
B
n
,
T
nm
=
B
n
B
m
between the two networks :
each of these overlaps can be used as an order parameter for online learning [24][4]. Note
that the teacher network does not have to produce perfect outputs: it may include some
22

Page 23
noise on its output, and that the teacher network does not have to exist: it is merely a
construction to give us a better insight into how a network is learning.
2.1.1 Evolution of order parameters
To derive dierential equations to describe the evolution of order parameters, we need to
use a continuous time variable . Since is discrete, is constructed by setting =
=N
,
which in the thermodynamic limit of
N
!1
becomes continuous.
Equations for the evolution of network order parameters, derived from the simple
weight update equation given above are provided in [24].
2.1.2 Evolution of generalisation error
The generalisation error
g
(
J
) =
<
J
(
;
)
>
f g
of the student network can be described
wholly in terms of
Q
,
R
and
T
. Unlike
J
, the evolution of
Q
,
R
and
T
over time (or
over presented input patterns) can be described deterministically for large values of
N
:
the derivatives of the overlaps form a coupled and closed dierential equation. In this
case, the reduction in number of variables is considerable: since much of the maths in,
for example, [24] assumes an innite input size
N
, the order parameters also have the
advantage of being nite.
Evolution of the generalisation error
g
(
J
) for a network has a characteristic shape:
the error will initially drop rapidly, then sit on a plateau until it continues to drop again.
The plateau is where the network is close to an unstable xed point of the error landscape
where the network weights have become symmetric : this xed point is unstable and
eventually the symmetry will be broken and the learning algorithm will continue towards
an asymptotic stable xed point. Both the length of the plateau and the rate at which the
error drops after it are determined by the system size and the learning rate : a larger
value of will reduce the plateau length but will take much longer to run if -annealing
is used.
2.2 Example networks
To illustrate the eciency and behaviour of dierent gradient descent learning algorithms,
the dynamics of their order parameters and generalisation error have been calculated for
a specic type of network. The example networks used in this section are soft committee
machines with erf hidden unit activations.
23

Page 24
Soft committee machines are used as examples here because they can model a wide
range of functions (in fact, they are universal approximators if node biases are included
[26]). A soft committee machine is a two-layer network with positive, unit-strength cou-
plings from its hidden units to a single output unit[4]. The activation function for hidden
units must be dierentiable: the erf (error) function is used here because its integral is
not too complicated.
Each network has a xed number
N
of input nodes, one output node and weights on
the connections between the input and hidden layer which are labelled
J
f
J
i
g
1
i K
in
the student network and
B
f
B
n
g
1
n M
in the teacher network, and
J
i
= (
J
i
1
,..,
J
iN
)
is the vector of input-to-hidden weights for the
i
-th hidden node in the student network.
Note that the number of hidden nodes in the student network
K
does not have to equal
the number of hidden nodes in the teacher network
M
.
An input pattern is
= (
1
,...,
N
) where is the current input. The pattern output
by the teacher network in response to
is
; training example is therefore the input-
output pair (
,
). The activation of the hidden units given an input pattern
is
x
i
=
J
i
. Similarly, the activation of hidden units in the teacher network given
is
y
n
=
B
n
. The output from a student network is
(
J;
) =
K
X
i
=1
g
(
x
i
)
(2.1)
where
g
(
x
i
) is the activation function of hidden unit
J
i
.
2.2.1 Input data
For this analysis, inputs must be i.i.d. samples from a Gaussian variable
N
(0
;
1). Input
units are also assumed to be uncorrelated (ie
i
is uncorrelated with
j
,
i
6
=
j
) with zero
mean and unit variance (although this is not true for many examples of real data), and
the maths assumes an innite input size
N
, although the analysis is still eective for nite
N
[3].
2.2.2 Activation functions
Although the input-to-hidden activation function
g
(
x
i
) is only restricted to being dieren-
tiable, the erf function, or more specically
g
(
x
) = erf(
x=
p
2) =
2
p
2
R
x
0
e
t
2
=
2
dt
has been
used in this section.
24

Page 25
2.2.3 Generalisation error
The generalisation error of a soft committee network with erf hidden unit activations and
normalised inputs has been calculated in [24] in terms of the order parameters:
g
(
J
) =
1
X
ik
arcsin
Q
ik
p
1+
Q
ii
p
1+
Q
kk
+
X
nm
arcsin
T
nm
p
1+
T
nn
p
1+
T
mm
+
X
in
arcsin
R
in
p
1+
Q
ii
p
1+
T
nn
(2.2)
2.3 Analysis of gradient descent algorithms
It is instructive when looking at the evolution of order parameters for natural gradient
descent to have something to compare its speed and complexity against. The obvious
candidates are gradient descent (already analysed in [24]) and the matrix momentum
form of Newton’s method (second-order, fast and analysed in [19][17]). The calculations
used are similar for the various learning algorithms, and are shown for natural gradient
descent in appendix (B).
For the purposes of analysis, it is assumed that all data is generated by another soft
committee machine: to avoid confusion, this is called the ’teacher’ network and the network
that is being adapted is its ’student’. Although the student and teacher networks have
the same structure and inputs, they do not necessarily have the same number of hidden
nodes. For all forms of learning algorithm shown here, the network error is assumed to be
quadratic and the error gradient
r
J J
(
;
) is therefore
r
J J
(
;
) =
@
@J
i
0
B
@
1
2
0
@
M
X
n
=1
g
(
B
n
)
K
X
j
=1
g
(
J
j
)
1
A
2
1
C
A
=
i
(2.3)
where
i
=
g
0
(
J
i
)
0
@
M
X
n
=1
g
(
B
n
)
K
X
j
=1
g
(
J
j
)
1
A
(2.4)
is the backpropagation delta function.
2.3.1 Gradient descent
Saad and Solla [24] develop update equations for a soft committee network using gradient
descent. The update for a student hidden unit
J
i
in a soft committee network using this
learning algorithm, the learning rate scaled by the input size
N
and a quadratic error
function
J
(
;
) as given in equation 1.1 is [24]:
J
+1
i
=
J
i
+
N
i
(2.5)
25

Page 26
The order parameters for gradient descent are the overlaps between the student and
teacher weight vectors
J
and
B
: these are
Q
ik
=
J
i
J
k
,
R
in
=
J
i
B
n
and
T
nm
=
B
n
B
m
.
Since the teacher network is constant,
T
does not evolve and its derivative
d
Tnm
d
is zero.
The update equations for the other order parameters
Q
and
R
are:
dQ
ik
d
=
h
i
x
k
i
+
h
k
x
i
i
+
2
h
i k
i
dR
in
d
=
h
i
y
n
i
2.3.2 Matrix momentum version of Newton’s method
For matrix momentum, it is convenient to dene a new variable
i
=
N
(
J
i
J
1
i
).
The order parameters are now the overlaps between
J
,
B
and , i.e.
Q
,
R
,
T
and
C
ik
=
i
k
,
D
in
=
i
B
n
and
E
ik
=
J
i
k
. Unlike the overlaps between
J
and
B
,
the overlaps with do not appear to have a direct physical meaning.
With
J
and both being updated simultaneously, it is important that contributions
towards updating the same quantity occur on the same time scale. This is achieved by
scaling both and . For the Newton’s method calculations, it is sucient to scale both
and 1
by
N
(i.e.
= 1
=N
) as described in [14]. For the
J
, updates given
in equations 1.21 and 1.22 with =
I
k
H
N
, =
k
N
and a soft committee machine (i.e.
r
J J
(
xi ;
) =
i
), the order parameter evolutions are [18]:
dQ
ik
d
=
E
ik
+
E
ki
dR
in
d
=
D
in
dC
ik
d
=
k
h
i
z
k
+
k
z
i
i
+
k
2 2
h
i k
i
k
X
m
(
c
im
D
km
+
c
km
D
im
)
k
X
j
(
a
ij
C
kj
+
a
kj
C
ij
+
b
ij
E
jk
+
b
kj
E
ji
)
dD
ik
d
=
k
h
i
y
n
i
k
X
j
(
a
ij
D
jn
+
b
ij
R
jn
)
k
X
m
c
im
T
nm
dE
ik
d
=
C
ik
+
k
h
k
x
i
i
k
X
j
(
a
kj
E
ij
+
b
kj
Q
ij
)
k
X
m
c
km
R
im
where
a
ij
= (1 +
ij
)
@
g
(
J
)
@Q
ij
b
ij
= (1 +
ij
)
"
X
lk
(1 +
lk
)
E
lk
@
2
g
(
J
)
@Q
ij
@Q
kl
+
X
kn
D
kn
@
2
g
(
J
)
@Q
ij
@R
kn
#
26

Page 27
c
in
=
X
lk
(1 +
lk
)
E
lk
@
2
g
(
J
)
@R
in
@Q
kl
+
X
km
D
km
@
2
g
(
J
)
@R
in
@R
km
2.3.3 MM-NGD with the full Fisher information matrix
As with the matrix momentum version of Newton’s method, the order parameters for
MM-NGD are the overlaps between the variables
J
,
B
and :
Q
ik
=
J
i
J
k
,
R
in
=
J
i
B
n
T
nm
=
B
n
B
m
,
C
ik
=
i
k
,
D
in
=
i
B
n
and
E
ik
=
J
i
k
.
Since this is a toy problem, with the input distribution and the model both known, we
can actually calculate the Fisher information matrix directly. For a soft committee network
and a quadratic error function
J
(
;
J
), the exact calculation for the Fisher information
matrix given in equation (1.29) reduces to
G
(
J
) = [
G
i;k
] =
2
4
1
4
m
*
@
J
(
;
J
)
@J
i
@
J
(
;
J
)
@J
k
+
f g
3
5
(2.6)
where
i;k
are indices over the hidden units,
;
are indices over the input elements and
2
m
is the variance of the teacher output noise . This is a matrix consisting of
K K
blocks which each contain
N N
values. Further manipulation of this equation (this is
given in appendix C) gives
G
ik
=
1
2
m
< A
ik
( )
>
f g
(2.7)
where
A
ik
is
A
ik
=
2
p
ik
h
I
1
ik
((1 +
Q
kk
)
J
i
J
T
i
+ (1 +
Q
ii
)
J
k
J
T
k
Q
ik
(
J
i
J
T
k
+
J
k
J
T
i
))
i
(2.8)
I
is the identity matrix and
ik
= (1 +
Q
ii
)(1 +
Q
kk
)
Q
2
ik
(2.9)
For
J
, update equations 1.30 and 1.31, with =
I
k
G
(
J
)
N
and =
k
N
, the equations
of motion for the order parameters are:
dQ
ik
d
=
E
ik
+
E
ki
(2.10)
dR
in
d
=
D
in
(2.11)
dC
ik
d
=
X
j
2
k
p
kj
h
C
ij
1
kj
((1+
Q
jj
)
E
kj
E
ki
+ (1+
Q
kk
)
E
jj
E
ji
Q
kj
(
E
jj
E
ki
+
E
kj
E
ji
))
i
X
j
2
k
p
ij
h
C
jk
1
ij
((1+
Q
jj
)
E
ij
E
ik
+ (1+
Q
ii
)
E
jj
E
jk
27

Page 28
Q
ij
(
E
jj
E
ik
+
E
ij
E
jk
))
i
+
k
h
k
z
i
+
i
z
k
i
+
k
2 2
h
i k
i
(2.12)
dD
in
d
=
X
j
2
k
p
ij
h
D
jn
1
ij
((1+
Q
jj
)
E
ij
R
in
+ (1+
Q
ii
)
E
jj
R
jn
Q
ij
(
E
jj
R
in
+
E
ij
R
jn
))
i
+
k
h
i
y
n
i
(2.13)
dE
ik
d
=
C
ik
X
j
2
k
p
kj
h
E
ij
1
kj
((1+
Q
jj
)
E
kj
Q
ik
+ (1+
Q
kk
)
E
jj
Q
ij
Q
kj
(
E
jj
Q
ik
+
E
kj
Q
ij
))
i
+
k
h
k
x
i
i
(2.14)
The calculations for these are given in appendix B.
There is a more compact form of these equations. If the update equations for
C
,
D
and
E
are written as
dC
ik
d
=
k
T
i
X
j
A
kj
j
k
X
j
(
A
ij
j
)
T
k
+
k
h
k
z
i
+
i
z
k
i
+
k
2 2
h
i k
i
(2.15)
dD
in
d
=
k
X
j
(
A
ij
j
)
T
B
n
+
k
h
i
y
n
i
(2.16)
dE
ik
d
=
C
ik
kJ
T
i
X
j
A
kj
j
+
k
h
k
x
i
i
(2.17)
then the sum
P
ij
A
ij
j
can be calculated from
X
j
A
ij
j
=
2
X
j
1
p
ij
h
j
1
ij
((1+
Q
jj
)
E
ij
J
i
+(1+
Q
ii
)
E
jj
J
j
Q
ij
(
E
jj
J
i
+
E
ij
J
j
))
i
(2.18)
and substituted in.
2.3.4 MM-NGD with single-input Fisher information
When the network being analysed is very large, it may not be possible or practical to calcu-
late the full Fisher information matrix for each input to it. A single-input approximation
to the Fisher information matrix which only uses one example is
G
ik
=
1
2
m
A
ik
( )
(2.19)
where
A
ik
( ) is dened as before. The update equations for this have been calculated by
Scarpetta [25] and are
dQ
ik
d
=
E
ik
+
E
ki
(2.20)
dR
in
d
=
D
in
(2.21)
28

Page 29
dC
ik
d
=
k
h
(
i
i
)
z
k
+ (
k
k
)
z
i
i
+
k
2
h
(
i
i
)(
k
k
)
i
(2.22)
dD
in
d
=
k
h
(
i
i
)
y
n
i
(2.23)
dE
ik
d
=
C
ik
+
k
h
(
k
k
)
x
i
i
(2.24)
where
i
=
g
0
(
x
i
)
X
j
z
j
g
0
(
x
j
)
Although the behaviour of MM-NGD with single-input Fisher information matrix ap-
proximations is important for predicting the behaviour of MM-NGD learning when it is
dicult to evaluate the Fisher information matrix, its analysis does not form part of this
report: preliminary results and discussion of this topic can be found in [25].
29

Page 30
Chapter 3
Numerical results
This chapter shows some of the characteristics of the order parameter and generalisation
error evolutions calculated in the last chapter, calculated for soft committee networks with
two-hidden-unit student networks, two-hidden-unit teacher networks and erf activation
functions. Emphasis has been placed on the behaviour of generalisation error over time
for several values of the learning parameters and
k
and teacher network noise
2
m
.
3.1 Learning parameters
The training parameters of the MM-NGD learning algorithm are the learning rate and
the parameter
k
, representing the balance between gradient descent and momentum. The
noise variance of the teacher network outputs is termed
2
m
. Other variables that can be
adjusted are the numbers of teacher and student hidden nodes, and whether the learning
rate is annealed (gradually reduced over time).
Another factor that aects the learning algorithm outputs, if we have noise on our
inputs, is the point at which learning rate annealing is started. It should be noted that
all times are described in terms of , and that the interplay between the values of and
the number of datasteps used for each stage of learning is also signicant to the eciency
of the learning algorithm.
3.2 Initial conditions
All order parameters are initially zero or sampled from Gaussian distributions:
Q
ik
N
(0
;
0
:
5)
;i
=
k
;
Q
ik
N
(0
;
0
:
001)
;i
6
=
k
30

Page 31
R
in
N
(0
;
0
:
001)
C
ik
N
(0
;
0
:
001)
;i
=
k
;
C
ik
= 0
;i
6
=
k
D
in
= 0
E
ik
= 0
T
nm
= t
init
;n
=
m
;
T
nm
= 0
;n
6
=
m
where the teacher covariance matrix is set to the same values as
T
nm
and t
init
is set
to 1 if there is no teacher noise
2
m
, and 0
:
5 for a teacher noise of 0
:
01. All the order
parameter derivatives were initially set to zero. With these initial values, the networks
were initially symmetric. The same random seed was used for each algorithm so that
when their generalisation errors are compared in this chapter, their initial conditions are
identical.
3.3 Phases of learning
In these output plots, the generalisation errors follow a characteristic curve in time. First,
an initial phase where the generalisation error drops rapidly to a value. Then, a symmetric
phase where the generalisation error stays at that value for some time (this is known as the
symmetric plateau), then a drop to an asymptotic phase during which the error reduces
gradually. In the plots, the initial and symmetric phases are shown for
g
against ; the
asymptotic phases are shown on a log-log scale.
3.4 Comparing standard, natural and MM-NGD learning
algorithms
Figure 3.1 shows the expected generalisation errors of optimal (with respect to the learning
rate) standard gradient descent [23], optimal natural gradient descent and MM-NGD
gradient descent algorithms, with the optimal learning rate ( = 0
:
144855) for the example
network and
k
= 10 in the MM-NGD algorithm (note that is written as alpha in these
graphs).
The natural gradient descent algorithm reaches a plateau less quickly that the standard
gradient descent algorithm as it reweighs the gradient in all directions reducing the strong
dierence between gradients in standard gradient descent which drives the system very
quickly towards the symmetric xed point. The plateau height for the standard gradient
31

Page 32
0
50
100
150
0.1
0.2
alpha
generalisation error
grad desc
ngd
ngd−mm (k = 10)
Figure 3.1:
MM-NGD vs NGD and optimal gradient descent
descent is much higher than the natural gradient plateau: this occurs because although
gradient descent moves quickly towards a minimum in the most signicant gradient di-
rection, it does not move much in the other gradient directions and will therefore be at a
higher level in those smaller directions when it reaches the symmetric phase.
The dierence between the MM-NGD and natural gradient plateaux is discussed in
the next section.
3.4.1 MM-NGD without teacher noise
Figure 3.2 shows the symmetric-phase performance of natural gradient descent algorithms
on a two-hidden-unit network (
K
=
M
= 2) when there is no noise on the teacher network
outputs (
2
m
= 0). In this plot, all algorithms have a learning rate of = 0
:
15. The solid
line is the generalisation error of the natural gradient descent (NGD) algorithm, and the
dotted lines (from right to left) are MM-NGD with
k
set to 0.5, 1.4, 2.1 and 10.
As the value of
k
is reduced, the length of the plateau in the MM-NGD error curve
approaches the plateau length of the NGD curve. Experiments with larger values of
k
(up
to K=100) show that the MM-NGD curve approaches the NGD curve without reaching
it, and with diminishing returns for larger values. Since the number of datapoints needed
32

Page 33
0
50
100
150
200
0.01
0.02
0.03
0.04
0.05
alpha
generalisation error
ngd
k = 0.5
k = 1.4
k = 2.1
k = 10
Figure 3.2:
Symmetric phase behaviour of MM-NGD (teacher noise=0)
to calculate these error curves increases with
k
and
k
= 20 is not visibly very dierent
from
k
= 50 or
k
= 100, none of the plots shown here use a larger value than 20 for
k
.
If MM-NGD and NGD are equivalent for large values of
k
, then the curve for large
k
should be very similar to that for NGD. This is not the case: although the plateaux
lengths are similar, the MM-NGD and NGD algorithms have dierent plateau heights. It
is not certain why this should be the case: the plateau height should be determined by
the
2
terms in the order parameter update equations but experiments with these terms
have proved inconclusive.
3.4.2 MM-NGD with teacher noise
Figure 3.3 shows the asymptotic behaviour of NGD and MM-NGD with teacher noise
variance set to
2
m
= 0
:
01, the teacher covariance matrix
T
set to 0.5 on their diagonals
(
T
mn
=
mn
0
:
5 where
mn
is the Kronecker delta). The upper dotted line is the (theoret-
ical) optimal gradient descent bound: gradient descent learning is much worse than this,
and outputs from this algorithm have not been shown.
The lower dotted line is the Cramer-Rao lower bound: this shows a theoretical limit
33

Page 34
10
3
10
4
10
−7
10
−6
10
−5
10
−4
10
−3
log10(alpha)
log10(Eg)
ngd
k = 0.3
k = 0.7
k = 1.4
k = 10
Figure 3.3:
Asymptotic behaviour of MM-NGD for noisy teacher (
2
m
= 0
:
01)
on the speed at which all algorithms can learn, and is dened as
V
1
M
G
(
J
)
1
(3.1)
where
G
(
J
) is the Fisher information matrix and
M
is the number of examples available.
Since natural gradient descent algorithms are asymptotically optimal, they should be
expected to approach the Cramer-Rao bound: this is seen in the plots, where the lower
solid line is the generalisation error for NGD, and the other learning curves are for MM-
NGD and (from top to bottom)
k
= 0
:
3, 0
:
7, 1
:
4 and 10.
3.5 Varying other parameters
The results shown in this chapter have all been for optimal or near-optimal values of
the learning rate
and dierent values of the momentum balance parameter
k
. This
reects our interest in the momentum parameters and the fact that the behaviour of
generalisation error with dierent values of has been well studied elsewhere. However,
before this algorithm is applied to real data, it is interesting and perhaps prudent to note
what happens to the generalisation error evolution when we vary : since online learning
is sensitive to parameter settings, this might provide some clues to what might be amiss
34

Page 35
0
50
100
150
0.05
0.1
0.15
eta=0.01
eta=0.02
eta=0.04
eta=0.07
eta=0.15
(a) Smaller than optimal values
0
50
100
150
0.1
0.2
0.3
0.4
eta=0.73
eta=0.7
eta=0.6
eta=0.5
eta=0.3
eta=0.15
(b) Larger than optimal values
Figure 3.4: Generalisation error evolution for various values
with them.
Figure 3.4 shows the behaviour of mm-ngd generalisation error for several values of
( = 0
:
01, 0
:
02, 0
:
04, 0
:
07, 0
:
15 and = 0
:
73, 0
:
7, 0
:
6, 0
:
5, 0
:
3: note that is written as
eta in the graphs) with
k
= 8,
2
m
=0.01, and t
init
= 1. For decreasing values of below the
optimum value = 0
:
144885 ( = 0
:
15 is used as an approximation to this in gure 3.4),
the initial phase before a plateau is reached is longer, and the symmetric plateau height
increases until the initial phase and symmetric plateau appear to be part of one smooth
curve. For increasing values of above the optimum value, the initial phase decreases in
time, and the symmetric plateau height increases until at about = 0
:
5, learning is not
possible.
Also of interest is the behaviour of generalisation error with dierent combinations of
values of and
k
, but this has not been studied here.
35

Page 36
Chapter 4
Using MM-NGD on real data
The matrix momentum form of natural gradient descent works well with toy examples,
but its potential value is in its application to real data. To this end, the datasets used for
matrix momentum with standard gradient descent have also been tried with MM-NGD.
The networks used here are multilayer perceptrons with linear hidden to output con-
nections and tanh hidden-layer activation functions. The learning algorithm used was
online learning, using sampling of the input data with replacement. Code was written in
both Matlab (to t into the Netlab framework) and C++.
4.1 MM-NGD for MLPs
The parameter update equation 1.30 with =
I
k
G
( )
N
and =
k
N
was used here. This
gives a parameter update equation of
+1
i
=
i
k
N
2
r
J
(
;
) + (
I
kG
( )
N
)(
i
1
i
)
(4.1)
where
is the set of network parameters
f
J
ij
;b
j
;a
jo
;c
o
g
for a multilayer perceptron
with input-to-hidden weights