Computationally efficient natural gradient descent

This is the text version of the file http://www.cs.york.ac.uk/~sara/project/oldprojects/aston/thesis.ps.
G o o g l e automatically generates text versions of documents as we crawl the web.

Google is neither affiliated with the authors of this page nor responsible for its content.

These search terms have been highlighted:

natural

gradient

bfgs

Computationally efficient natural gradient descent

Page 1

Computationally efficient natural

gradient descent

Sara-Jayne Farmer

Master of Science by Research

Pattern Analysis and Neural Networks

Aston University

September 1998

This copy of the thesis has been supplied on condition that anyone who consults

it is understood to recognise that its copyright rests with its author and that no

quotation from the thesis and no information derived from it may be published

without proper acknowledgement.

Page 2

Aston University

Computationally ecient natural

gradient descent

Sara-Jayne Farmer

Master of Science by Research

Pattern Analysis and Neural Networks, 1998

This study examines the use of matrix momentum terms with the aim of creating a more

computationally ecient natural gradient descent algorithm for on-line learning. It uses

the statistical mechanics framework created by Saad and Solla to describe the evolution of

order parameters for this algorithm in a two-layer student-teacher scenario, and compares

this with results from matrix-momentum natural gradient learning of real datasets.

Page 3

Acknowledgements

Thanks go to Magnus Rattray and David Saad for supervision, guidance and several

patient explanations of strange matrix indices: any good stu in this thesis is inevitably

theirs. Thanks also go to Genny Orr for providing the phoneme classication dataset used

in chapter 4.

Page 4

Contents

1 Introduction

1.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2 Multilayer perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.1 Backpropagation learning . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.2 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Online learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.1 Online learning and time-varying data . . . . . . . . . . . . . . . . . 12

1.4 Neural networks as statistical approximators . . . . . . . . . . . . . . . . . 13

1.5 Neural networks as points in parameter space . . . . . . . . . . . . . . . . . 13

1.5.1 Applying dierential geometry to neural networks . . . . . . . . . . 13

1.6 Second-order learning methods . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.6.1 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.6.2 Approximating the Hessian . . . . . . . . . . . . . . . . . . . . . . . 16

1.7 Matrix Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.7.1 Balancing gradient descent and matrix momentum . . . . . . . . . . 18

1.8 Natural gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.8.1 The Fisher information matrix . . . . . . . . . . . . . . . . . . . . . 20

1.8.2 Matrix momentum for natural gradient descent (MM-NGD) . . . . . 20

2 Analysis of MM-NGD for online learning

2.1 Statistical mechanics analysis of online learning . . . . . . . . . . . . . . . . 22

2.1.1 Evolution of order parameters . . . . . . . . . . . . . . . . . . . . . . 23

2.1.2 Evolution of generalisation error . . . . . . . . . . . . . . . . . . . . 23

2.2 Example networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.2 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.3 Generalisation error . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3 Analysis of gradient descent algorithms . . . . . . . . . . . . . . . . . . . . 25

2.3.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.2 Matrix momentum version of Newton’s method . . . . . . . . . . . . 26

2.3.3 MM-NGD with the full Fisher information matrix . . . . . . . . . . 27

2.3.4 MM-NGD with single-input Fisher information . . . . . . . . . . . . 28

3 Numerical results

3.1 Learning parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Phases of learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Comparing standard, natural and MM-NGD learning algorithms . . . . . . 31

3.4.1 MM-NGD without teacher noise . . . . . . . . . . . . . . . . . . . . 32

3.4.2 MM-NGD with teacher noise . . . . . . . . . . . . . . . . . . . . . . 33

Page 5

3.5 Varying other parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Using MM-NGD on real data

4.1 MM-NGD for MLPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1.1 Calculating the Fisher information matrix . . . . . . . . . . . . . . . 37

4.1.2 Approximating the Fisher information matrix . . . . . . . . . . . . . 37

4.2 Small datasets : the Iris dataset . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 Medium-sized datasets : wine classication . . . . . . . . . . . . . . . . . . 39

4.3.1 varying , denitions . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.4 Large datasets : speech (phoneme classication) data . . . . . . . . . . . . . 41

4.4.1 Large Fisher information matrices . . . . . . . . . . . . . . . . . . . 41

4.4.2 Slow learning times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Conclusions

5.1 Possible extensions to this work . . . . . . . . . . . . . . . . . . . . . . . . . 43

A Variables and notation

A.1 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

A.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

B Analysis of MM-NGD for soft committee machines

C The Fisher information matrix

C.1 Exact Fisher information matrix for a soft committee network . . . . . . . . 52

C.2 Exact Fisher information matrix for a multilayer perceptron . . . . . . . . . 54

C.2.1 Reducing the calculation time . . . . . . . . . . . . . . . . . . . . . . 55

Page 6

List of Figures

3.1

MM-NGD vs NGD and optimal gradient descent

. . . . . . . . . . . . . . . . . 32

3.2

Symmetric phase behaviour of MM-NGD (teacher noise=0)

. . . . . . . . . . . . 33

3.3

Asymptotic behaviour of MM-NGD for noisy teacher (

= 0

01)

. . . . . . . . . 34

3.4 Generalisation error evolution for various values . . . . . . . . . . . . . . 35

4.1

Iris data,

= 0

16, various

k . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.2

Wine data, varying

k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.3

Comparing MM-NGD algorithms

. . . . . . . . . . . . . . . . . . . . . . . . . 40

Page 7

List of Tables

C.1 Speech network calculation times (in ops) and dominant calculation sizes

(%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Page 8

Chapter 1

Introduction

This study examines the use of matrix momentum terms and natural gradient descent in

the online learning of neural networks. Its aim is to create a more computationally ecient

natural gradient descent algorithm for on-line learning. It uses the statistical mechanics

framework created in [24] to describe the evolution of order parameters for this algorithm

in two-layer networks within a student-teacher scenario.

This chapter introduces the tools needed for this work and some of the issues in their

use.

1.1 Neural networks

A neural network is a exible probabilistic model of an unknown mapping instanced by a

set of examples. A network has nodes, parameters and activation functions. Nodes receive,

modify and pass on (propagate) information. Input nodes receive information in the form

of vectors or individual values from outside. Output nodes modify then pass information

to the user. Hidden nodes modify and pass information between the input and output

nodes, and play the main role in processing incoming data in a network. A network’s

structure (nodes and the layout of links between them) is usually specied by the person

building the network, but there are some algorithms that can adapt the network structure

to best represent the data being learnt.

Each node in a network may have several inputs, either from other nodes or from

outside the network. The values input to a node are termed its activation; the value output

from a node is a function of its activations and is termed its activation function. Activation

functions are specied by the network designer and mostly range from simple summation

Page 9

of all the inputs to a function of that sum. For nonlinear networks (networks that can

represent nonlinear functions) and derivative-based learning algorithms, the activation

function of the hidden nodes must be dierentiable, for example tanh or erf fuctions

(these also squash their outputs into a range between 0 and 1 or -1 and +1).

Networks also have numerical parameters which are used to control the behaviour of

a network and are usually the weights on the links between nodes. the weight on a link

is used to modify any output from one node before it is input to the other. Parameters

can be modied to improve the performance of a network: this modication is known as

learning or training. Learning algorithms adapt the parameters of a network so that the

outputs from it are as close as possible to the outputs given by the system that they are

modelling given random inputs from the same distribution as the examples.

Using a network consists of passing information to its input nodes and reading o the

activations of its output nodes. Although the ability to store examples may be useful, the

main aspect of using a neural network that we will focus on is generalisation: the ability to

summarise information and return consistent outputs for inputs that the network has not

been trained on. This is achieved with a learning algorithm. A learning algorithm adjusts

the

parameters

of a network to t a set of

examples

, where each example consists of

an input vector

and the output

that is expected from the network, given that input

and the system that the network is trying to model(here, is an index to the example, as

are most superscripts used in this document).

Because networks learn the underlying function that generated a set of examples rather

than the exact connection between each example’s input and output pair, they are robust:

able to learn from noisy or incomplete examples. How well a network is able to generalise

is determined by the user’s choice of network and activation function; tools like cross-

validation can be used to assess the expected performance. Neural networks are most

commonly used to model systems which can produce example outputs or input-output

pairs but whose underlying rules are either unknown or too expensive to model. There

are several varieties of neural network architectures, for instance multilayer perceptrons

and radial basis functions: in this document, only the networks known as multilayer

perceptrons are considered.

Page 10

1.2 Multilayer perceptrons

The most commonly-used neural networks are multilayer perceptrons, or MLPs. The

nodes in an MLP are arranged into layers : an input layer, then one or more hidden layers

and an output layer, where each layer is fully connected to the next layer in the network.

Connections between layers can only be directed forwards in the network: for instance,

connections can be made from a hidden layer to the output layer but connections from the

output layer back to a hidden layer are not allowed. Although not strictly necessary, the

same activation function is usually dened for all hidden and output nodes in an MLP.

Using an MLP consists of activating the input nodes, propagating that input between

the network’s layers and reading the output from its output nodes.

The most common MLP has one input layer, one hidden layer and one output layer

and is known as a two-layer network (the input layer is ignored when counting the number

of layers in a network). There are some variants of MLPs, for example a soft committee

network or soft committee machine [4] is a two-layer MLP whose hidden to output weights

are not allowed to vary, and are usually set to one, and output nodes whose activation

functions are or are proportional to a simple sum of their input activations. Soft committee

machines are used as example networks for much of this document and preserve most of

the characteristics of general two layer networks [20]. Reviews of other MLP variants and

their learning methods can be found in [5][9]).

1.2.1 Backpropagation learning

In the soft committee machines used in this document, the MLP parameters are re-

stricted to the input-to-hidden weights

. Before learning occurs, these are set to initial

values which are usually small and randomly distributed.

Learning in MLPs is usually done by backpropagation, which works in two stages.

First, an input

is propagated (modied and passed on) forward from the input layer to

the output layer through all the hidden layers of an MLP, to produce a network output

(

). This is the forward pass of backpropagation. The network output (

)

produced in the forward pass is then compared against the expected output

for the

corresponding input

. This comparison produces a network error

(

;

), which is

then propagated back from the output units to the input units, to give the local gradient

for each node

. This is the backward pass which gives the algorithm its name. When

the backward pass through the network is completed, the local gradients

(also known

Page 11

as backpropagation delta functions) are used to adjust the network parameters, and the

next forward pass can begin.

The network

error

(

;

) is used in the backward pass of the backpropagation

error.

(

;

) is a measure for the distance between the actual output (

) and the

expected output

of a network with parameters

which has been given the input

In this document, this distance is taken to be quadratic:

(

;

) =

[ (

)

)]

(1.1)

online learning

, the network error for just the latest example (

;

) is propagated back

through the network in the backward pass of the algorithm; in

batch learning

, an average

over all the available input patterns

(

;

)

:::

(

;

)

is used. This means that

all the examples must be available at every stage of batch learning, but online learning

can process one example at a time.

Although the network error is used in this learning algorithm, the

generalisation error

(

) is more useful in analysing the performance of a learning algorithm. The generali-

sation error gives the probability that the network will produce the wrong output for an

example that it has not been trained on but is drawn from the same distribution or model

as the training examples; it is dened as the network error averaged over the distribution

of network inputs:

(

) =

(

;

)

f g

(1.2)

1.2.2 Gradient descent

The errors for dierent values of the network parameters form an error landscape or

surface, where each point represents the error for a particular choice of parameters and

distances between points measure the dierences between the errors.

The error term passed back through the network is not the error at each node, but the

derivative of the network error with respect to the node’s parameters. This is proportional

to the derivative of the nodes activation function, which must therefore be dierentiable

for backpropagation to work.

Most MLP learning algorithms adjust the network parameters

by subtracting a

fraction of the error gradient

J J

(

;

) from them:

J J

(

;

)

(1.3)

Page 12

where is known as the

learning rate

. This ensures that, for a general series of examples

(

;

) and an appropriate value of , the learning algorithm will move the network error,

on average, downhill in the error landscape.

The learning rate

controls the step-size on the error landscape. If it is constant

throughout learning a network’s parameters, then the learning algorithm will either be

forced to take small steps and a long time traversing the landscape towards an error

surface minimum, or larger steps which will overshoot the minimum in the nal stages of

learning. These problems can be avoided (or alleviated) by

annealing

the learning rate:

gradually reducing the value of over time.

1.3 Online learning

Algorithms where parameter adjustment is done as each new example is presented (rather

than for all examples at once) are known as

on-line learning

pattern learning

. Online

learning is useful when examples are only available serially, the task is nonstationary (the

function that is being learnt by the network is changing over time) or there is less data,

space or time available than is needed to process an entire dataset (group of examples)

simultaneously. This report deals exclusively with online learning.

1.3.1 Online learning and time-varying data

For many of the equations shown below, the input data is assumed to be iid: each input

is uncorrelated with the inputs that went before it. Data for online learning is produced

sequentially or sampled from datasets too big to be loaded completely into a system.

If data is sampled with replacement, then it will not be uncorrelated. Many examples

of real sequential data are from underlying systems whose parameters change over time

which often means that there are weak correlations between successive inputs, and may

imply that older data will be of less value to the learning algorithm and might need to be

forgotten. These factors have not been considered in this report, but may prove signicant

when using real data

For example, Heskes and Coolen[11] report on the eects of correlation between inputs to two-layer

networks, but this has not been considered here

Page 13

1.4 Neural networks as statistical approximators

A neural network models the expected value

(

;

)

;

(1.4)

of the conditional probability of its outputs

given its inputs , parameterised by its

weights

and sometimes also parameterised by its structure. Learning in a network is

the process of adapting the parameters

to get the mapping from this model as close

as possible to the underlying rule that generated a set of example input-output pairs

(

;

)

:::

(

;

)

. Given a prior distribution for the data

(

), this can also be

seen as maximising

(

) with respect to

1.5 Neural networks as points in parameter space

Information geometry is the application of techniques from dierential geometry to statis-

tical models [13] [1]. It represents families of neural networks as manifolds (k-dimensional

geometric objects which can be mapped onto Euclidean space) in a space (d-dimensional,

where k<d) whose coordinates are the parameters (weights and biases) of the networks, and

each individual network is represented by a point on the manifold which corresponds to its

particular parameters. As an example, a two-layer soft committee network with weights

:::J

can be represented as a point on a manifold of soft committee network mappings

-dimensional parameter space, and similar networks with dierent parameter values

would be represented by dierent points on the same manifold.

The curvatures of the manifolds are described using tensors. A Riemann tensor is

a matrix which describes the curvature of a surface at a specic point in space. In n-

dimensional space, the tensor is an n-by-n symmetric matrix: e.g. in four dimensions, this

is 4-by-4 with 16 elements but only 10 independent components because of its symmetry.

A complete surface is described by placing tensors at dierent points on the surface, or by

parametrising the tensor so that any point on the surface could be described: the resulting

collection of tensors is known as a vector eld, and dierential geometry is the study of

their properties.

1.5.1 Applying dierential geometry to neural networks

In this representation of a neural network, learning can be seen as moving the point that

represents a particular network closer to a (perhaps theoretical) set of parameters that

Page 14

generated a set of given input-output examples. This can be a powerful tool for the

analysis of learning, and it can also be used to create optimal or near-optimal learning

algorithms.

Having described learning as moving points closer together in parameter space, we need

to have a measure of how close two points in that space are in terms of the probability

distributions realised by them. There are several measures

(for example Kullback-Leibler,

Hellinger, squared and Euclidean) in statistics for the distance between two distributions

and

, but one of the most natural and useful is the Kullback-Leibler distance

(

)

(

)) =

(

) log

(

)

(

)

(1.5)

which for two distributions

(

;

) and

(

;

) of network inputs and outputs

parameterised by

(

;

)

(

;

)) =

(

;

) log

(

;

)

(

;

)

(

;

)

(1.6)

We need to apply this to the parameter-space representation of neural networks.

In information geometry, the Kullback-Leibler distance between mappings realised by

two nearby points in a space is known as a metric : if we are measuring small distances in

parameter space (which is what we expect in online learning) and can ignore higher orders

, then equation (1.6) becomes

(

;

)

(

;

)) =

(

)

(1.7)

where

(

) =

(

log

(

;

)(

log

(

;

)

(

;

)

(1.8)

is the metric for this space, and is known as the Fisher information matrix. Gradient

descent in parameter spaces which use the Kullback-Leibler measure of distance is known

as natural gradient descent.

1.6 Second-order learning methods

Several second order methods have been devised to speed up learning. Examining the rst

few terms of a Taylor expansion of an error

(

;

) about a point

in weight-space

gives [5]:

(

;

) =

(

;

) + (

J J

)

J J

(

;

)

(

J J

)

(

J J

)

(1.9)

these are all dened through delta-divergences [29][30]

Page 15

where

(

;

)

(1.10)

is known as the Hessian matrix and

J J

(

;

)

(

;

)

(1.11)

is the error surface gradient at

Many gradient descent algorithms use the error’s gradient

J J

(

;

) to calculate where

to move next in the energy landscape, and ignore any higher-order terms, but an improve-

ment in their learning speed can be had from using the second-order term

(

J J

)

(

J J

)

too. (Third and higher-order terms aren’t used because they are usually insignicant and

are more dicult to handle).

1.6.1 Newton’s method

The most common second-order network learning method is based on Newton’s method

for nding the minimum of a function[5].

If we assume that the local gradient obtained by dierentiating the second-order

(

;

)

estimate (equation 1.9) with respect to (

J J

) is zero, then we have

J J

(

;

)

(

J J

) = 0

(1.12)

i.e.

J H

J J

(

;

)

(1.13)

Since we have ignored higher-order terms in the Taylor expansion, this is not an exact

equation for

, and an iterative procedure must be used (we also focus on online learning

which requires iterative updates). This gives an update equation for the weights

J J

(

;

)

(1.14)

is the learning rate as before.

Learning with Newton’s method converges faster than rst-order gradient descent but

it is very sensitive to rescaling (e.g. normalisation) of inputs, and it needs a lot of pro-

cessing and system space to create, store and invert the Hessian matrix. In addition,

employing Newton’s method makes the algorithm converge towards and stabilise on any

xed points in the error landscape regardless of whether they are minima or saddlepoints,

and this may result in suboptimal performance in complex learning dynamics.

Page 16

1.6.2 Approximating the Hessian

Accurately calculating the inverse Hessian

1 used in Newton’s method requires an

average over all the input data (this is, of course, only available in batch learning), followed

by a (large) matrix inversion, at every step of the gradient descent algorithm. This can

take a long time to calculate, and since the inverse Hessian is a very large matrix, it can

be impractical or impossible to store during calculations. It therefore makes sense to look

for more ecient Hessian algorithms and reasonable approximations to these matrices.

One approximation to the Hessian is to use only some of the input data to evaluate

the Hessian matrix. Where only the latest input is used, this is termed here a single-

step approximation. The calculations used to create the elements of the Hessian matrix

can also be approximated : variations of Newton’s method that use this are known as

pseudo-Newton or quasi-Newton algorithms. Simple approximations to the Hessian in-

clude diagonal approximations and the Levenberg-Marquardt approximation[5]. Diagonal

approximations only contain the elements on the diagonal of the Hessian. they are not

very precise: the o-diagonal elements of the Hessian are signicant in many learning

scenarios, and inverting a diagonal approximation of the Hessian diers signicantly from

the diagonal of its inverse [16]. the Levenberg-Marquardt approximation depends on the

expansion of the Hessian for a sum-of-squares error

(

;

) =

(

)

into its

components:

(

;

)

(

)

(1.15)

and ignoring the second term in that equation. This is a reasonable approximation if

(

) is small. Other algorithms build up the Hessian iteratively from rst-order

terms: the most common of these are the BFGS (Broyden-Fletcher-Goldfarb-Shanno) and

the Davidson-Fletcher-Powell procedures.

Other algorithms include approximations to the inverse Hessian [8], but since the

Hessian

is often used multiplied by a vector

, it can be more practical to create

the product

directly. Pearlmutter [16] gives an algorithm for doing this eciently

and exactly by setting

in the Taylor expansion above then taking limits as

approaches zero to give a dierential operator which can be applied to a network to

generate

Page 17

1.7 Matrix Momentum

Inverting the Hessian in Newton’s method can be avoided by incorporating it into a matrix

momentum term[15].

Momentum is a method used mostly in batch learning to counter the eect of gradient

descent algorithms oscillating across large gradient changes in one dimension whilst giving

little eort in a gradually-changing but consistent downward slope in another: this is most

apparent when the error surface is a gently-descending valley with steep sides. One way

to ameliorate this eect is to add a

momentum

term [22] to the weight update equations,

which adds a fraction of the previous weight update into the current one. This reduces the

strength of the oscillation across the surface whilst emphasising the smaller (but consistent)

movement in the other dimension. Including a momentum term into the update equation

1.3 in an online learning scenario gives:

J J

(

;

) + (

)

(1.16)

where

is the

momentum parameter

and is usually set between 0 and 1 (e.g. ref [10]

suggest = 0

9) but it may be adaptive in an attempt to speed up network convergence.

Suggestions for speeding up network convergence also include using a separate

;

for

each network weight, but using momentum already has this eect.

A simple momentum term (e.g. =

constant

) is not terribly useful in online learning

because its eect is merely to rescale the learning rate . This is because, for an error

surface with a gradient which is roughly constant near the current network parameters

(this is the situation that we expect in online learning), using the update equation 1.16 is

equivalent to using a learning rate of [5] [27]

)

(1.17)

If, however, the momentum term in equation 1.16 is replaced with a matrix =

I kH

and the learning rate is replaced with =

where

is a scalar learning rate

which can be annealed (depend on the normalised example index =

), the update

equation for the network becomes

J J

(

;

) + (

I kH

)(

)

(1.18)

and the eective learning rate is

(1.19)

Page 18

This is known as

matrix momentum

: it is important because it gives a way of eectively

premultiplying the gradient by a matrix inverse

without having to calculate that

matrix inverse [15]. Matrix momentum has been used by Orr and Leen[15] with the

matrix

set to the Hessian to give Newton’s method without inverting the Hessian.

Note that the constant

is used to balance the contributions from the gradient and

momentum parts of the update equation, and that the matrix momentum equations above

are only valid for large values of

[18].

1.7.1 Balancing gradient descent and matrix momentum

If we scale both and by

and introduce a new variable where

(

)

(1.20)

then we can substitute these into equation 1.16 to get an update equation for

J J

(

;

) +

(1.21)

We can then use equations 1.20 and 1.21 to also get an update equation for of

J J

(

;

) =

J J

(

;

) + (

(1.22)

where equation 1.21 and equation 1.22 form a coupled pair of equations.

Using this coupled pair of equations, reference [15], which is restricted to the asymp-

totic regime, suggests an optimal matrix momentum =

and annealed learning

rate =

where

is an initial learning rate. This gives an eective learning rate of

)

(1.23)

which is similar to Newton’s method and does not require inverting the matrix

An alternative scaling of the , terms was suggested by [17], where and (1 ) are

both scaled by

. To do this, and are replaced by

= (1

)

k=N

If we now use this scaling and let

with

constant, the eective learning

rate becomes [17]

(1.24)

Page 19

This holds for =

I kH=N

and =

with large values of

[18], again giving an

eective learning rate

(1.25)

where

balances the contributions of the gradient and momentum terms as before. Note

that, for eective training, the learning rate should start from a constant e.g.

, and

only asymptotically decay as

. These results are equally valid for any matrix,

including the Fisher information matrix

(

), and this scaling will be used throughout

the rest of this document.

1.8 Natural gradient descent

Natural gradient descent uses the Fisher information matrix as the natural metric in

parameter space. For gradient descent, the eect of using this metric is to premultiply the

gradient by the inverse of the Fisher information matrix, changing the update equation

for online standard gradient descent from equation 1.3 to

(

)

(

)

J J

(

;

)

(1.26)

Useful features of natural gradient learning are that it is invariant to reparameterisation

of the model distribution (e.g. scaling the parameters does not change the eciency of

the learning algorithm), is asymptotically optimal [2], less prone than Newton’s method

to trapping in symmetric phases (trapping occurs when the algorithm becomes stable at

a saddlepoint or local minimum in the error landscape rather than its global minimum)

and the Fisher matrix

(

) is always positive denite.

The natural gradient descent algorithm requires the inverse

(

)

(

) of the Fisher

information matrix. Inverting the Fisher matrix can take a long time. Schemes for cal-

culating this inversion range from exact algorithms like block-wise partitioning and some

data preprocessing to facilitate the calculation [28] to diagonal approximations [2] [21].

In general, these algorithms can be computationally expensive to calculate and calcu-

lating the average

over all inputs in the Fisher matrix also requires knowledge of all

input data: this may not be sensible or practical for many cases of on-line learning (which

is based on the premise that data is not all available at the same time). This document

concentrates on the rst problem: the second is addressed in [25].

Page 20

1.8.1 The Fisher information matrix

The loss function

of a network is dened [28] as the negative log-likelihood function

(

;

) = log

(

;

) = log

(

;

) + log

( )

(1.27)

where

(

;

) is the joint probability density function of the network inputs and outputs

parameterised by its weights

The Fisher information matrix

(

) of a network is dened from

(

) = [

i;k

]

(1.28)

where and are input node indices (1

i;k N

and

are hidden node indices

i;k K

) and [

] is a block matrix whose elements are

i;k

;

log

(

;

)

log

(

;

)

;

(1.29)

where

;

is an average over the output distribution , followed by an average over

the input distribution .

1.8.2 Matrix momentum for natural gradient descent (MM-NGD)

Matrix momentum is valid for any matrix. If we use the Fisher information matrix, we

have a learning algorithm that is equivalent to natural gradient descent without inverting

the Fisher information matrix.

The update equations for a matrix momentum form of natural gradient descent (from

now on, abbreviated to MM-NGD) with and scaled by the input size

are

J J

(

;

) +

(1.30)

J J

(

;

)

(1.31)

(

)

(1.32)

(1.33)

where

(

is a scalar learning rate which can be annealed (depend on

the value of =

, the normalised example index), and

(

) is the Fisher information

matrix.

When is signicantly larger than , the learning is dominated by the gradient and

is expected to be very similar to gradient descent. When is signicantly larger than

, then learning is dominated by the momentum terms and can be expected to depend

Page 21

largely on the value of the Fisher information matrix. The next chapter takes this as a

starting point, and analyses the eects of varying the parameters of MM-NGD learning

for soft committee networks.

Page 22

Chapter 2

Analysis of MM-NGD for online

learning

This chapter introduces the tools used to analyse the behaviour of a network’s parameters

and output errors during learning. It gives examples of this and equations for learning in

a type of MLP known as a soft committee machine.

2.1 Statistical mechanics analysis of online learning

Statistical mechanics seeks, in general, to describe a system of many interacting particles

(in this case, network parameters) in terms of a smaller number of

order parameters

These can be used to describe the system only if it is

self-averaging

: if the parameters

for each individual particle are identical (or very similar to) the average parameters for all

the particles. In the discussion that follows, all the order parameters have very sharply

peaked distributions with small variance, so only their mean values will be analysed.

It is dicult to reduce the number of parameters studied if we are looking at an

individual network, but if we assume that the outputs in the ’correct’ input-output ex-

amples given to a learning algorithm are generated by feeding the inputs into another

network, then we can describe the learning algorithm in terms of the dierence between

this ’teacher’ network and the ’student’ network which is learning from the examples. A

covariance matrix can be formed between the student and teacher activations. This con-

tains the overlaps

between the two networks :

each of these overlaps can be used as an order parameter for online learning [24][4]. Note

that the teacher network does not have to produce perfect outputs: it may include some

Page 23

noise on its output, and that the teacher network does not have to exist: it is merely a

construction to give us a better insight into how a network is learning.

2.1.1 Evolution of order parameters

To derive dierential equations to describe the evolution of order parameters, we need to

use a continuous time variable . Since is discrete, is constructed by setting =

which in the thermodynamic limit of

becomes continuous.

Equations for the evolution of network order parameters, derived from the simple

weight update equation given above are provided in [24].

2.1.2 Evolution of generalisation error

The generalisation error

(

) =

(

;

)

f g

of the student network can be described

wholly in terms of

and

. Unlike

, the evolution of

and

over time (or

over presented input patterns) can be described deterministically for large values of

the derivatives of the overlaps form a coupled and closed dierential equation. In this

case, the reduction in number of variables is considerable: since much of the maths in,

for example, [24] assumes an innite input size

, the order parameters also have the

advantage of being nite.

Evolution of the generalisation error

(

) for a network has a characteristic shape:

the error will initially drop rapidly, then sit on a plateau until it continues to drop again.

The plateau is where the network is close to an unstable xed point of the error landscape

where the network weights have become symmetric : this xed point is unstable and

eventually the symmetry will be broken and the learning algorithm will continue towards

an asymptotic stable xed point. Both the length of the plateau and the rate at which the

error drops after it are determined by the system size and the learning rate : a larger

value of will reduce the plateau length but will take much longer to run if -annealing

is used.

2.2 Example networks

To illustrate the eciency and behaviour of dierent gradient descent learning algorithms,

the dynamics of their order parameters and generalisation error have been calculated for

a specic type of network. The example networks used in this section are soft committee

machines with erf hidden unit activations.

Page 24

Soft committee machines are used as examples here because they can model a wide

range of functions (in fact, they are universal approximators if node biases are included

[26]). A soft committee machine is a two-layer network with positive, unit-strength cou-

plings from its hidden units to a single output unit[4]. The activation function for hidden

units must be dierentiable: the erf (error) function is used here because its integral is

not too complicated.

Each network has a xed number

of input nodes, one output node and weights on

the connections between the input and hidden layer which are labelled

i K

the student network and

n M

in the teacher network, and

= (

,..,

)

is the vector of input-to-hidden weights for the

-th hidden node in the student network.

Note that the number of hidden nodes in the student network

does not have to equal

the number of hidden nodes in the teacher network

An input pattern is

= (

,...,

) where is the current input. The pattern output

by the teacher network in response to

; training example is therefore the input-

output pair (

). The activation of the hidden units given an input pattern

. Similarly, the activation of hidden units in the teacher network given

. The output from a student network is

(

) =

(

)

(2.1)

where

(

) is the activation function of hidden unit

2.2.1 Input data

For this analysis, inputs must be i.i.d. samples from a Gaussian variable

;

1). Input

units are also assumed to be uncorrelated (ie

is uncorrelated with

) with zero

mean and unit variance (although this is not true for many examples of real data), and

the maths assumes an innite input size

, although the analysis is still eective for nite

[3].

2.2.2 Activation functions

Although the input-to-hidden activation function

(

) is only restricted to being dieren-

tiable, the erf function, or more specically

(

) = erf(

2) =

has been

used in this section.

Page 25

2.2.3 Generalisation error

The generalisation error of a soft committee network with erf hidden unit activations and

normalised inputs has been calculated in [24] in terms of the order parameters:

(

) =

arcsin

(2.2)

2.3 Analysis of gradient descent algorithms

It is instructive when looking at the evolution of order parameters for natural gradient

descent to have something to compare its speed and complexity against. The obvious

candidates are gradient descent (already analysed in [24]) and the matrix momentum

form of Newton’s method (second-order, fast and analysed in [19][17]). The calculations

used are similar for the various learning algorithms, and are shown for natural gradient

descent in appendix (B).

For the purposes of analysis, it is assumed that all data is generated by another soft

committee machine: to avoid confusion, this is called the ’teacher’ network and the network

that is being adapted is its ’student’. Although the student and teacher networks have

the same structure and inputs, they do not necessarily have the same number of hidden

nodes. For all forms of learning algorithm shown here, the network error is assumed to be

quadratic and the error gradient

J J

(

;

) is therefore

J J

(

;

) =

(

)

(

)

(2.3)

where

(

)

(

)

(

)

(2.4)

is the backpropagation delta function.

2.3.1 Gradient descent

Saad and Solla [24] develop update equations for a soft committee network using gradient

descent. The update for a student hidden unit

in a soft committee network using this

learning algorithm, the learning rate scaled by the input size

and a quadratic error

function

(

;

) as given in equation 1.1 is [24]:

(2.5)

Page 26

The order parameters for gradient descent are the overlaps between the student and

teacher weight vectors

and

: these are

and

Since the teacher network is constant,

does not evolve and its derivative

Tnm

is zero.

The update equations for the other order parameters

and

are:

i k

2.3.2 Matrix momentum version of Newton’s method

For matrix momentum, it is convenient to dene a new variable

(

The order parameters are now the overlaps between

and , i.e.

and

. Unlike the overlaps between

and

the overlaps with do not appear to have a direct physical meaning.

With

and both being updated simultaneously, it is important that contributions

towards updating the same quantity occur on the same time scale. This is achieved by

scaling both and . For the Newton’s method calculations, it is sucient to scale both

and 1

(i.e.

= 1

) as described in [14]. For the

, updates given

in equations 1.21 and 1.22 with =

, =

and a soft committee machine (i.e.

J J

(

xi ;

) =

), the order parameter evolutions are [18]:

2 2

i k

(

)

(

)

(

)

(

)

where

= (1 +

)

(

)

= (1 +

)

(1 +

)

(

)

(

)

Page 27

(1 +

)

(

)

(

)

2.3.3 MM-NGD with the full Fisher information matrix

As with the matrix momentum version of Newton’s method, the order parameters for

MM-NGD are the overlaps between the variables

and :

and

Since this is a toy problem, with the input distribution and the model both known, we

can actually calculate the Fisher information matrix directly. For a soft committee network

and a quadratic error function

(

;

), the exact calculation for the Fisher information

matrix given in equation (1.29) reduces to

(

) = [

i;k

] =

(

;

)

(

;

)

f g

(2.6)

where

i;k

are indices over the hidden units,

;

are indices over the input elements and

is the variance of the teacher output noise . This is a matrix consisting of

K K

blocks which each contain

N N

values. Further manipulation of this equation (this is

given in appendix C) gives

< A

( )

f g

(2.7)

where

((1 +

)

+ (1 +

)

(

))

(2.8)

is the identity matrix and

= (1 +

)(1 +

)

(2.9)

For

, update equations 1.30 and 1.31, with =

(

)

and =

, the equations

of motion for the order parameters are:

(2.10)

(2.11)

((1+

)

+ (1+

)

(

))

((1+

)

+ (1+

)

Page 28

(

))

2 2

i k

(2.12)

((1+

)

+ (1+

)

(

))

(2.13)

((1+

)

+ (1+

)

(

))

(2.14)

The calculations for these are given in appendix B.

There is a more compact form of these equations. If the update equations for

and

are written as

(

)

2 2

i k

(2.15)

(

)

(2.16)

(2.17)

then the sum

can be calculated from

((1+

)

+(1+

)

(

))

(2.18)

and substituted in.

2.3.4 MM-NGD with single-input Fisher information

When the network being analysed is very large, it may not be possible or practical to calcu-

late the full Fisher information matrix for each input to it. A single-input approximation

to the Fisher information matrix which only uses one example is

( )

(2.19)

where

( ) is dened as before. The update equations for this have been calculated by

Scarpetta [25] and are

(2.20)

(2.21)

Page 29

(

)

+ (

)

(

)(

)

(2.22)

(

)

(2.23)

(

)

(2.24)

where

(

)

(

)

Although the behaviour of MM-NGD with single-input Fisher information matrix ap-

proximations is important for predicting the behaviour of MM-NGD learning when it is

dicult to evaluate the Fisher information matrix, its analysis does not form part of this

report: preliminary results and discussion of this topic can be found in [25].

Page 30

Chapter 3

Numerical results

This chapter shows some of the characteristics of the order parameter and generalisation

error evolutions calculated in the last chapter, calculated for soft committee networks with

two-hidden-unit student networks, two-hidden-unit teacher networks and erf activation

functions. Emphasis has been placed on the behaviour of generalisation error over time

for several values of the learning parameters and

and teacher network noise

3.1 Learning parameters

The training parameters of the MM-NGD learning algorithm are the learning rate and

the parameter

, representing the balance between gradient descent and momentum. The

noise variance of the teacher network outputs is termed

. Other variables that can be

adjusted are the numbers of teacher and student hidden nodes, and whether the learning

rate is annealed (gradually reduced over time).

Another factor that aects the learning algorithm outputs, if we have noise on our

inputs, is the point at which learning rate annealing is started. It should be noted that

all times are described in terms of , and that the interplay between the values of and

the number of datasteps used for each stage of learning is also signicant to the eciency

of the learning algorithm.

3.2 Initial conditions

All order parameters are initially zero or sampled from Gaussian distributions:

;

001)

Page 31

;

001)

;

001)

;

= 0

= t

init

;

= 0

where the teacher covariance matrix is set to the same values as

and t

init

is set

to 1 if there is no teacher noise

, and 0

5 for a teacher noise of 0

01. All the order

parameter derivatives were initially set to zero. With these initial values, the networks

were initially symmetric. The same random seed was used for each algorithm so that

when their generalisation errors are compared in this chapter, their initial conditions are

identical.

3.3 Phases of learning

In these output plots, the generalisation errors follow a characteristic curve in time. First,

an initial phase where the generalisation error drops rapidly to a value. Then, a symmetric

phase where the generalisation error stays at that value for some time (this is known as the

symmetric plateau), then a drop to an asymptotic phase during which the error reduces

gradually. In the plots, the initial and symmetric phases are shown for

against ; the

asymptotic phases are shown on a log-log scale.

3.4 Comparing standard, natural and MM-NGD learning

algorithms

Figure 3.1 shows the expected generalisation errors of optimal (with respect to the learning

rate) standard gradient descent [23], optimal natural gradient descent and MM-NGD

gradient descent algorithms, with the optimal learning rate ( = 0

144855) for the example

network and

= 10 in the MM-NGD algorithm (note that is written as alpha in these

graphs).

The natural gradient descent algorithm reaches a plateau less quickly that the standard

gradient descent algorithm as it reweighs the gradient in all directions reducing the strong

dierence between gradients in standard gradient descent which drives the system very

quickly towards the symmetric xed point. The plateau height for the standard gradient

Page 32

100

150

0.1

0.2

alpha

generalisation error

grad desc

ngd

ngd−mm (k = 10)

Figure 3.1:

MM-NGD vs NGD and optimal gradient descent

descent is much higher than the natural gradient plateau: this occurs because although

gradient descent moves quickly towards a minimum in the most signicant gradient di-

rection, it does not move much in the other gradient directions and will therefore be at a

higher level in those smaller directions when it reaches the symmetric phase.

The dierence between the MM-NGD and natural gradient plateaux is discussed in

the next section.

3.4.1 MM-NGD without teacher noise

Figure 3.2 shows the symmetric-phase performance of natural gradient descent algorithms

on a two-hidden-unit network (

= 2) when there is no noise on the teacher network

outputs (

= 0). In this plot, all algorithms have a learning rate of = 0

15. The solid

line is the generalisation error of the natural gradient descent (NGD) algorithm, and the

dotted lines (from right to left) are MM-NGD with

set to 0.5, 1.4, 2.1 and 10.

As the value of

is reduced, the length of the plateau in the MM-NGD error curve

approaches the plateau length of the NGD curve. Experiments with larger values of

(up

to K=100) show that the MM-NGD curve approaches the NGD curve without reaching

it, and with diminishing returns for larger values. Since the number of datapoints needed

Page 33

100

150

200

0.01

0.02

0.03

0.04

0.05

alpha

generalisation error

ngd

k = 0.5

k = 1.4

k = 2.1

k = 10

Figure 3.2:

Symmetric phase behaviour of MM-NGD (teacher noise=0)

to calculate these error curves increases with

and

= 20 is not visibly very dierent

from

= 50 or

= 100, none of the plots shown here use a larger value than 20 for

If MM-NGD and NGD are equivalent for large values of

, then the curve for large

should be very similar to that for NGD. This is not the case: although the plateaux

lengths are similar, the MM-NGD and NGD algorithms have dierent plateau heights. It

is not certain why this should be the case: the plateau height should be determined by

the

terms in the order parameter update equations but experiments with these terms

have proved inconclusive.

3.4.2 MM-NGD with teacher noise

Figure 3.3 shows the asymptotic behaviour of NGD and MM-NGD with teacher noise

variance set to

= 0

01, the teacher covariance matrix

set to 0.5 on their diagonals

(

5 where

is the Kronecker delta). The upper dotted line is the (theoret-

ical) optimal gradient descent bound: gradient descent learning is much worse than this,

and outputs from this algorithm have not been shown.

The lower dotted line is the Cramer-Rao lower bound: this shows a theoretical limit

Page 34

−7

−6

−5

−4

−3

log10(alpha)

log10(Eg)

ngd

k = 0.3

k = 0.7

k = 1.4

k = 10

Figure 3.3:

Asymptotic behaviour of MM-NGD for noisy teacher (

= 0

01)

on the speed at which all algorithms can learn, and is dened as

(

)

(3.1)

where

(

) is the Fisher information matrix and

is the number of examples available.

Since natural gradient descent algorithms are asymptotically optimal, they should be

expected to approach the Cramer-Rao bound: this is seen in the plots, where the lower

solid line is the generalisation error for NGD, and the other learning curves are for MM-

NGD and (from top to bottom)

= 0

3, 0

7, 1

4 and 10.

3.5 Varying other parameters

The results shown in this chapter have all been for optimal or near-optimal values of

the learning rate

and dierent values of the momentum balance parameter

. This

reects our interest in the momentum parameters and the fact that the behaviour of

generalisation error with dierent values of has been well studied elsewhere. However,

before this algorithm is applied to real data, it is interesting and perhaps prudent to note

what happens to the generalisation error evolution when we vary : since online learning

is sensitive to parameter settings, this might provide some clues to what might be amiss

Page 35

100

150

0.05

0.1

0.15

eta=0.01

eta=0.02

eta=0.04

eta=0.07

eta=0.15

(a) Smaller than optimal values

100

150

0.1

0.2

0.3

0.4

eta=0.73

eta=0.7

eta=0.6

eta=0.5

eta=0.3

eta=0.15

(b) Larger than optimal values

Figure 3.4: Generalisation error evolution for various values

with them.

Figure 3.4 shows the behaviour of mm-ngd generalisation error for several values of

( = 0

01, 0

02, 0

04, 0

07, 0

15 and = 0

73, 0

7, 0

6, 0

5, 0

3: note that is written as

eta in the graphs) with

= 8,

=0.01, and t

init

= 1. For decreasing values of below the

optimum value = 0

144885 ( = 0

15 is used as an approximation to this in gure 3.4),

the initial phase before a plateau is reached is longer, and the symmetric plateau height

increases until the initial phase and symmetric plateau appear to be part of one smooth

curve. For increasing values of above the optimum value, the initial phase decreases in

time, and the symmetric plateau height increases until at about = 0

5, learning is not

possible.

Also of interest is the behaviour of generalisation error with dierent combinations of

values of and

, but this has not been studied here.

Page 36

Chapter 4

Using MM-NGD on real data

The matrix momentum form of natural gradient descent works well with toy examples,

but its potential value is in its application to real data. To this end, the datasets used for

matrix momentum with standard gradient descent have also been tried with MM-NGD.

The networks used here are multilayer perceptrons with linear hidden to output con-

nections and tanh hidden-layer activation functions. The learning algorithm used was

online learning, using sampling of the input data with replacement. Code was written in

both Matlab (to t into the Netlab framework) and C++.

4.1 MM-NGD for MLPs

The parameter update equation 1.30 with =

( )

and =

was used here. This

gives a parameter update equation of

(

;

) + (

( )

)(

)

(4.1)

where

is the set of network parameters

for a multilayer perceptron

with input-to-hidden weights