New tools for neurohydrologists: using 'network pruning' and 'model breeding' algorithms to discover optimum inputs and architectures

Robert J. Abrahart
Department of Geography, University College Cork, Ireland.
Email: bob@ashville.demon.co.uk

Linda See and Pauline E. Kneale
School of Geography, University of Leeds, Leeds, LS2 9JT.
Email: {l.see}{pauline}@geog.leeds.ac.uk



Abstract

River flow prediction and forecasting are important environmental tasks. The successful application of detailed physical-mathematical models offers one possible means of providing these estimates. But such models are often too complex, or too demanding in terms of data and computer requirements, for practical implementation. Simpler approaches offered through 'conceptual' and 'black-box' modelling are thus attractive alternatives. Foremost in this re-emergent field is the use of computational intelligence tools such as neural networks and genetic algorithms - which are being investigated as potential mechanisms for the provision of detailed hydrological estimates.

However, irrespective of recent computational and methodological advances, several fundamental problems still need to be addressed - such as the selection of an optimal neural network architecture for each given task. A number of simple and novel solutions to this problem have been put forward in the guise of built-in functions and add-on software tools. These computational resources can be used to diminish the amount of subjective guesswork that is needed to resolve difficult network design issues. It is therefore important that scientists begin to examine the various options that are now available, and in particular the extent to which such devices can assist the hydrological modelling effort.

This paper provides some numerical results from an initial investigation into the use of automated neural network design tools for the creation of improved network architectures, based on a 'one-step-ahead prediction' of continuous flow records for the Upper River Wye catchment 1984-6. Four alternative neural network modelling strategies were implemented: the first investigation used standard procedures to create a set of benchmark networks; the next two used simple pruning algorithms to create more efficient architectures; and the last used a genetic algorithm package to breed a set of optimised neural network modelling solutions based on random mutation and survival of the fittest.

1. Introduction

Neural networks have been applied to various hydrological modelling tasks and the application of these multifarious technologies will form an expanding area of scientific investigation throughout the next decade. Basic feedforward backpropagation networks and simplistic 'one-step-ahead river flow predictions' form the bulk of this work. However, in application-based research, hydrological science has now begun to witness the adoption of alternative neural network strategies. For example, Self Organizing Map (SOM) data classification techniques (Kohonen, 1995) have been used to split the data into different subsets, which in turn facilitates more accurate simulation through integrated multi-network modelling (Abrahart & See, 1998). More recent work has also involved the adoption of neural network solutions as embedded functions contained within Third Generation Language (3GL) programs that operate input-output feedback loops (Abrahart, 1998). However, irrespective of these recent computational and methodological advances, fundamental problems still need to be addressed - such as the selection of an optimal neural network architecture. Recent software advances in the guise of built-in functions and add-on tools can now be used to diminish the amount of subjective guesswork that is needed to resolve difficult network design issues. It is therefore important that scientists begin to examine these tools and the extent to which such devices can assist the hydrological modelling effort.

This paper provides some numerical results from an initial investigation into the use of various automated neural network design tools for the creation of improved network architectures, based on a 'one-step-ahead prediction' of continuous flow records for the Upper River Wye catchment 1984-6. Four alternative neural network modelling strategies were implemented: the first investigation used standard procedures to create a set of benchmark networks; the next two used simple pruning algorithms to create more efficient architectures; and the last used a genetic algorithm package to breed a set of optimised neural network modelling solutions based on random mutation and survival of the fittest.

In general terms these collective experiments were designed to investigate the raw power, modelling possibilities and application potential associated with the use of these computer-based algorithms to discover optimum inputs and architectures for neural network rainfall-runoff modelling.

2. Neural networks

Neural networks offer an important alternative to the traditional methods of analysis and modelling. For example, in conventional computing, a model is expressed as a series of equations which are then translated into 3GL code and run on a computer. But a neural network is much more flexible. Instead of being told the precise nature of a relationship or model - the neural network is trained to best represent the relationships and processes that are implicit, albeit invisible, within the data. Neural networks could thus be used to provide a robust error-tolerant multi-dimensional non-linear solution in certain situations that would otherwise present the hydrological modeller with a difficult modelling task. However, in common with regression analysis, the fact that a relationship between input (independent variables) and output (dependent variables) can be modelled with a neural network provides no direct proof that a connection or causal relationship exists. There could indeed be no sensible or logical link between them. So, in all cases, the final model and its internal relationships will demand a theoretical justification that is made on logical grounds; a point that is of particular importance with regard to the use of automated neural network design tools of the kind that are being investigated and reported on here. For readers who require additional information on this subject a more detailed introduction to artificial neural networks can be found in Openshaw & Openshaw (1997).

Neural networks are seen to offer a plethora of good hydrological modelling opportunities and various successes have been reported in the literature, e.g. rainfall forecasting in space and time (French et al., 1992); predicting river flow levels at ungauged sites (Karunanithi et al., 1994); spatial interpolation of aquifer properties (Rizzo & Dougherty, 1994); optimisation of a groundwater model (Rogers & Dowla, 1994); modelling the rainfall-runoff transformation using a combination of areal and point based measurements (Lorrai & Sechi, 1995); synthesizing reservoir inflow records (Raman & Sunilkumar, 1995); modelling synthetic sequences of rainfall-runoff data (Minns & Hall, 1996); and modelling soil water retention curves (Schaap & Bouten, 1996). These powerful CI (Computational Intelligence; see Fischer & Abrahart, forthcoming) tools can be used to model raw data, trained to clone existing models, or implemented in various modes of computational association with equation-based tools. Most research to date has focused on rainfall-runoff applications that range from modelling a 5 x 5 cell synthetic watershed using inputs derived from a stochastic rainfall generator (Hsu et al., 1995), to predicting runoff for the Leaf River Basin (1,949 km²) using five years of daily data (Smith & Eli, 1995), and to constructing robust models of fifteen-minute flows with six hour lead times for the Rivers Amber and Mole (Dawson & Wilby, 1998).

There are no hard and fast rules governing the correct design of a neural network. It is axiomatic that more complex problems will require more complex solutions. However, when there are a large number of free parameters, the network will be (a) slower to train and (b) more susceptible to overfitting. Important factors such as the number of inputs, the number of hidden units, and the arrangement of these units into layers are often determined using 'trial and error' experimental design procedures (e.g. Fischer & Gopal, 1994) or fixed in advance according to the subjective opinion of each individual designer (e.g. Abrahart & Kneale, 1997). The laborious task of testing for optimum inputs and architectures can be time consuming, and the end result is often neither informative nor conclusive. The main aim of this investigation - from a computational perspective - was thus to investigate the use of modern technologies to build better and more efficient neural network hydrological models. In the initial stages of this emergent paradigm it is also important to examine and report on the science involved. Little hard knowledge exists about the temporal dimension of neural network rainfall-runoff modelling, so another major goal of this research was to help generate a better understanding of relevant inputs and operational considerations related to neural network rainfall-runoff modelling. These experiments were also used to provide some additional insights into the modelling process, via an explanation of the significance and power associated with the use of different inputs, in particular past river flow inputs - since such data are peculiar to neural network hydrological modelling operations.

3. Study area and database

The area chosen for this study was the Upper River Wye in Central Wales (Figure 1). This is an upland research catchment that has been used on several previous occasions for various hydrological modelling purposes e.g. Beven et al. (1984); Bathurst (1986); Quinn & Beven (1993). The basin covers an area of some 10.55 km², elevations range from 350-700 m above sea level, and average annual rainfall is in the order of 2500 mm. Ground cover comprises grass or moorland. Soil profiles are thin, most of the area being peat, overlying a podzol or similar type of soil. Runoff response is dominated by saturated sub-surface flow, especially at the interface between the two soil layers, and by overland flow following saturation of the peat layer (Knapp, 1970; Newson, 1976).

Figure 1: Upper Wye Catchment (Beven et al., 1984)

The data that were available for this area were for Cefn Brwyn (Gauging Station Number 55008), comprising rainfall (RAIN), potential evapotranspiration (PET), and river flow ordinates (FLOW) on a one hour timestep for the period 1984-6. The data were first pre-processed into what has now become the standard format for temporal neural network modelling. The resultant multi-column file had separate columns for: annual hour-count (CLOCK), RAIN t, RAIN t-1 to t-6, PET t, PET t-1 to t-6, FLOW t-1 to t-6, and FLOW t. The six-hour historical record was considered sufficient for predictive modelling purposes based on previous reported experiments (Abrahart & Kneale, 1997). It also tallies with the empirical rule that at least five or six points should be used to define the rising limb of a finite-period unit hydrograph, which dates back at least to F.F. Snyder in the late-1930s (Johnstone & Cross, 1949), and is promulgated in the UK Flood Studies Report (NERC, 1975). Given the circular nature of CLOCK these particular values were transformed into their sine and cosine equivalents, making a total of twenty-three variables, as shown in Figure 2. All variables were next subjected to linear normalisation between zero (lowest possible value for that variable in the database) and one (highest possible value for that variable in the database). The normalised file was then split into three individual data sets: 1984, 1985 and 1986. To help keep matters simple - all river flow values are henceforth reported in terms of these 'normalised flow units' (nfu).
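This pre-processing stage is straightforward to reproduce. The sketch below assembles the lagged pattern file and applies the linear normalisation, assuming the raw hourly records are held as NumPy arrays of equal length; note that it rescales each column against its own extremes, whereas the original normalised every variable against its database-wide minimum and maximum.

    import numpy as np

    def build_patterns(clock, rain, pet, flow, lags=6):
        # rows start at t = lags so that six historical values exist
        t = np.arange(lags, len(flow))
        angle = 2.0 * np.pi * clock[t] / 8760.0            # circular annual clock
        cols = [np.sin(angle), np.cos(angle), rain[t]]
        cols += [rain[t - k] for k in range(1, lags + 1)]  # RAIN t-1 to t-6
        cols.append(pet[t])
        cols += [pet[t - k] for k in range(1, lags + 1)]   # PET t-1 to t-6
        cols += [flow[t - k] for k in range(1, lags + 1)]  # FLOW t-1 to t-6
        cols.append(flow[t])                               # target: FLOW t
        X = np.column_stack(cols)                          # 23 columns in total
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)                        # linear [0, 1] normalisation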

Figure 2: Column headings for multi-column data file

4. Method

Four sets of neural network models were created and tested. An initial set of models was developed using standard training procedures. These models were intended to act as an 'experimental control' or 'benchmark' against which the other three sets of models could then be compared. All models had an identical starting point and, where practical considerations allowed, the same training methods and parameters were used.

4.1 Construction of the initial network architecture

The Stuttgart Neural Network Simulator (SNNS) was used to construct a two-hidden-layer feedforward network. This network comprised a 22:16:14:1 architecture, with all standard connections enforced, and with no cross-layer connections permitted (Figure 3). The input nodes were: sin[CLOCK], cos[CLOCK], current RAIN [t], last six RAIN recordings [t-1 to t-6], current PET [t], last six PET recordings [t-1 to t-6], and the last six FLOW ordinates [t-1 to t-6]. The output node corresponded to current FLOW [t]. All connection weights and unit biases were initialised with random numbers - set between plus and minus one. The design of this network was based on earlier work where an identical architecture, that was trained on the same database, had been observed to perform in an acceptable manner (Abrahart & Kneale, 1997). This particular network formed the initial starting point for each subsequent individual investigation.
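For illustration, the forward pass of such a 22:16:14:1 network can be sketched in a few lines of Python. This is a minimal NumPy sketch, not the SNNS implementation; the logistic (sigmoid) transfer function is assumed, as is the random initialisation range quoted above.

    import numpy as np

    rng = np.random.default_rng()

    def init_layer(n_in, n_out):
        # weights and biases initialised between plus and minus one, as in the text
        return rng.uniform(-1, 1, (n_in, n_out)), rng.uniform(-1, 1, n_out)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # 22 inputs -> 16 hidden -> 14 hidden -> 1 output; standard connections only
    layers = [init_layer(22, 16), init_layer(16, 14), init_layer(14, 1)]

    def forward(x):
        for W, b in layers:
            x = sigmoid(x @ W + b)
        return x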

Figure 3: Initial network architecture from which all other models were created
[layers are displayed in two column format]

4.2 Types of algorithm to be tested

In addition to the 'standard procedure' for training a backpropagation network, two network pruning algorithms and a genetic algorithm were used to build alternative models, resulting in numerous additional architectures being created from the initial network for each individual set of training data - as shown in Figure 4:

Figure 4: Extent of alternative model building operation

4.3 Training and testing: standard procedures and pruning functions

The Stuttgart Neural Network Simulator (SNNS) was run in batch file mode and used to perform all standard neural network modelling operations and training procedures. It was also used to implement the two automated network pruning algorithms which are both available for use as internal functions.

In the standard procedure and network pruning experiments the initial 22:16:14:1 network was trained on one annual data set and tested with the other two. This operation was in turn repeated for each of the three individual data sets and an optimal solution for each model building scenario was selected. Statistical and graphical comparisons between the various preferred neural network solutions then followed. This multiple training and testing, using data from three different hydrological periods, facilitated a number of informative comparisons - more so since 1984 was a drought year; 1985 contained a limited number of intermediate events; and 1986 had a far higher proportion of 'information rich' event-related data.

In the standard procedure and network pruning experiments all network training was undertaken using the SNNS 'enhanced backpropagation' algorithm (E-BPROP). All training patterns were presented in random order, with weight updates being implemented after the presentation of each individual pattern. All batch files were set to run for 6500 epochs (training cycles). The 'learning rate' parameter was set at 0.2; the 'momentum' parameter was held constant at 0.1; the 'flat spot elimination' and 'maximum tolerated difference' parameters were kept at zero. The decision to use low levels of learning and momentum was based on earlier experiments. Too much rapid forcing is known to produce wild fluctuations that are difficult to control, a problem that can be attributed to the poor spread of training data, i.e. much of the solution surface is dedicated to modelling a flat response with intermittent storm events.
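The underlying update rule is ordinary gradient descent with a momentum term. A minimal sketch using the parameter values quoted above (the SNNS 'enhanced' variant adds refinements, such as flat spot elimination, which were in any case disabled here):

    def bprop_update(W, grad, prev_dW, eta=0.2, alpha=0.1):
        # one pattern-by-pattern update: gradient step plus momentum term
        dW = -eta * grad + alpha * prev_dW
        return W + dW, dW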

4.3.1 Model creation using standard procedures

Three standard training runs were undertaken to provide an experimental control or benchmark against which some sort of relative comparison could be made. Each annual data set was in turn used to train the initial 22:16:14:1 network. Sum squared error statistics were computed for all three data sets at 100 epoch intervals and these results were then translated into a combined graph from which the best overall modelling solution for each annual data set could be selected using visual inspection (Figures 5 to 7).

4.3.2 Model creation using magnitude based pruning

Three weight-pruning training runs were undertaken. Training was in all but one respect identical to that used for the standard model, the difference being that after each period of 100 training epochs the five connections with the smallest weights were deleted, creating a network that became less and less complicated over time. The idea behind this technique is that the smallest weights will be associated with the weakest connections, which transmit the least significant throughputs. This connection elimination procedure was allowed to run until the network was no longer able to function. The sum squared error statistics at 100 epoch intervals are plotted in Figures 8 to 10.
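A sketch of this elimination step is given below, assuming the weights are held in a flat NumPy array with a boolean mask marking live connections, and interpreting 'smallest weights' as smallest absolute magnitude:

    import numpy as np

    def prune_smallest(weights, mask, n=5):
        # delete the n live connections with the smallest absolute weight;
        # called after every 100 training epochs
        alive = np.flatnonzero(mask)
        order = np.argsort(np.abs(weights[alive]))
        mask[alive[order[:n]]] = False
        return mask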

4.3.3 Model creation using skeletonization

Three node-pruning training runs were undertaken. However, in this instance, after each period of 100 training epochs the node that produced the least amount of overall change in the global sum squared error statistic when omitted was deleted - which again produced a network that became less and less complicated over time. The idea behind this technique is that the lowest change in error would be associated with the least significant node. This node elimination procedure was allowed to run until the network was no longer able to function. The sum squared error statistics are plotted in Figures 11 to 13.
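The same idea can be expressed as a brute-force search, although the SNNS skeletonization function estimates node relevance rather than re-testing every omission; the net.* helpers below are assumptions for the purpose of illustration, not SNNS calls:

    def skeletonize_step(net, patterns, targets):
        # remove the hidden node whose omission changes the global SSE least
        base = net.sse(patterns, targets)
        cost = {i: abs(net.with_node_disabled(i).sse(patterns, targets) - base)
                for i in net.hidden_nodes}
        victim = min(cost, key=cost.get)   # the least significant node
        return net.with_node_removed(victim)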

4.3.4. Selection of an optimal solution

One critical issue for the successful application of a neural network concerns the complex relationship that exists between learning and generalisation. It is important to stress that the ultimate goal of network training is not to learn or reproduce an exact representation of the training data, but rather to build a model of the underlying process(es) which generated that data, in order to achieve good generalisation or out-of-sample performance. It is therefore important to validate the final product not in terms of its training data but in terms of its application to the other two data sets. Network training error also fluctuates at various points during the training process, often to a marked degree, which renders a quantitative assessment difficult. The decision was therefore made to investigate extended runs and to undertake a visual assessment of the performance on the two validation data sets, in each model building scenario, to determine in each case a 'preferred' neural network solution. In most cases the optimal model was selected at the point where the error associated with one or other of the two validation data sets began to increase in a continuous manner with no subsequent fallback. Vertical dashed lines denote the chosen network solutions on each of the nine training graphs depicted in Figures 5 to 13. Attention is drawn to the use of a log scale for plotting the sum squared error statistic.
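Expressed as a rule of thumb, the selection amounts to picking the first checkpoint whose validation error is never undercut later in the run. A sketch of this heuristic, operating on errors sampled at the 100-epoch intervals, is given below; the published selections were nevertheless made by visual inspection:

    def select_checkpoint(val_errors):
        # first checkpoint after which validation error never falls back below it
        for i, e in enumerate(val_errors):
            if all(later >= e for later in val_errors[i + 1:]):
                return i
        return len(val_errors) - 1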

Figure 5: Selection of 'preferred network' created using standard procedures - trained on 1984 data
[dashed vertical line indicates position of chosen model]

Figure 6: Selection of 'preferred network' solution created using standard procedures - trained on 1985 data

Figure 7: Selection of 'preferred network' solution created using standard procedures - trained on 1986 data

Figure 8: Selection of 'preferred network' solution created using magnitude based pruning - trained on 1984 data

Figure 9: Selection of 'preferred network' solution created using magnitude based pruning - trained on 1985 data

Figure 10: Selection of 'preferred network' solution created using magnitude based pruning - trained on 1986 data

Figure 11: Selection of 'preferred network' solution created using skeletonization - trained on 1984 data

Figure 12: Selection of 'preferred network' solution created using skeletonization - trained on 1985 data

Figure 13: Selection of 'preferred network' solution created using skeletonization - trained on 1986 data

4.4 Training and testing: random mutation with resilient learning and hard pruning

ENZO is a dedicated software package that operates the Stuttgart Neural Network Simulator and comprises a genetic algorithm tool adapted for the task of neural network optimisation. Constructing a neural network solution to a given problem can be a difficult task since it involves choosing a particular network architecture (i.e. number of layers, number of units per layer, and patterns of connection), together with a set of network coefficients (i.e. weights, thresholds, etc.), that will, in combination, produce an optimal performance for a given modelling situation. All global optimisation heuristics, when faced with complex optimisation problems, must strike a balance between exploration (broad sampling of the search space) and exploitation (concentration on its most promising regions).

Evolution-based algorithms avoid the problem of becoming trapped in a local minimum through the use of a parallel search process, comprising a population of search points (individuals), and stochastic search steps i.e. stochastic selection of the parents and stochastic generation of their offspring (mutation and crossover). But this search procedure is nevertheless in a broad sense still biased towards exploitation since it is the fittest parents that are selected for the creation of future generations. Moreover, genetic algorithms are problem-independent, and will therefore neglect vital problem-dependent knowledge such as gradient information relating to the solution surface. So the use of a pure evolution-based genetic algorithm will at best produce modest results in comparison to other heuristics that can exploit this additional information. However, in the case of neural networks, each individual model is capable of moving down the solution surface gradient on its own - using standard gradient descent procedures such as backpropagation. The application of a hybrid evolution-based method can therefore enable us to restrict the search space to a set of local optima using a two-phase operation:

Level-1 heuristic: periods that contain coarse steps based on evolution, which are intertwined with ...
Level-2 heuristic: periods that contain fine steps for local optimisation

4.4.1 Level-1 heuristic

ENZO has a large number of adjustable parameters. Although this might in itself appear to pose an additional optimisation problem, the sensible default values implemented within the program make it quite difficult to engineer a poor result, and the code is in fact quite robust. Nevertheless, given that a certain degree of modification could lead to superior model building, and that the required modifications could well be problem dependent, it might at some later date be useful to tailor various specific aspects of the algorithm. Such items are the subject of alternative explorations and further research.

In these initial experiments a straightforward modelling operation was undertaken and various important details relating to the chosen method of application are as follows (a simplified code sketch of the resulting evolutionary loop is given after the list):

  1. Maximum population size to be held in the system was set at 100. This means that at each point in time there were up to 100 different neural network architectures from which alternative models could be created using evolution-based procedures.
  2. 100 new networks were produced from the initial 22:16:14:1 neural network. These networks were not direct copies of the original. The networks were instead created using random selection of the original hidden units; the number of selected units ranged between the minimum number required to learn a given problem (which was determined using an automated binary search procedure) and the number of hidden units contained in the original network. Most parent networks therefore had a different configuration and a reduced size which would in turn lead to more rapid evolution and training. In this initial exercise there was no pre-evolution training, no random selection of inputs, and no random deletion of the original weights. Such measures can sometimes also be used to reduce the amount of training that is required for the children.
  3. The 100 networks created in Step 2 were subjected to random initialisation with all weights and unit biases being set between plus and minus one. Steps 2 and 3 in combination therefore created 100 different networks that were all based in some manner or other on random adjustments to the standard model.
  4. The initialised networks were trained using the batch learning algorithm RPROP (Resilient Propagation). Full details on this learning algorithm and its implementation are provided in the next section. Training was stopped when either mean error for the training data reached 0.0005 nfu or when the training programme had reached 100 epochs.
  5. All insignificant connections in each network were deleted; in this instance comprising those connections which had a connection strength or weighting that was below a fixed threshold of ±0.25. This threshold could have been set to change with time, but a marked reduction in the number of connections was in this case thought important, and thus applied in order to focus maximum effort into producing a minimum number of free parameters. It was assumed that evolution would counteract the worst effects of overpruning since new connections would be added and, when new nodes were created, each new node would be connected to all of the existing nodes in the adjacent layers.
  6. All pruned networks were retrained using the batch learning tool RPROP. The implemented training programme and algorithm parameters were identical to that specified in Step 4. In all but a few difficult cases it took just one further training epoch to reach the desired mean error training goal of 0.0005 nfu.
  7. A fitness value for each network was determined. In this instance the measure of fitness was restricted to an error value associated with a test data set (i.e. an alternative annual data set). No other items were included in the fitness considerations, e.g. penalties associated with the number of nodes or the number of weighted connections, or prescribed according to the number of training epochs needed to obtain a desired goal.
  8. All networks were ranked according to their prescribed fitness and ten new networks were created from the population of 100 trained parents. The picking of these parent networks was based on a random selection procedure, but with preference being given to the fitter networks, i.e. the fittest 12.5% of the population were given the same overall chance of being selected for parenthood as the rest of the population combined.
  9. The ten children were subjected to random architectural mutation. All probabilities associated with weight and link mutation were set at 0.2; all probabilities associated with weight and link soft pruning were set at 0.2; and the probabilities associated with input and hidden unit mutation were set at 0.2 (with probabilities for the insertion and deletion split being set at 0.5). The insertion of bypass connections following the removal of hidden units was permitted. Other points to note are that the range of possibilities for node insertion and deletion was restricted to those that existed within the initial network architecture, that potential mutation was restricted to one unit and one connection per generation, and that all inserted nodes would possess a full set of weighted connections.
  10. Steps 4-7 were applied to the ten children. These networks were first trained, using a fixed stopping condition, and their individual fitness was then assessed using an error value associated with a test data set (i.e. an alternative annual data set).
  11. Each of the ten children was then inserted at an appropriate point in the ranked population and the ten least fit networks were deleted (thus preserving a population total of 100).
  12. Steps 8-11 were repeated. The entire random mutation and network evaluation operation was allowed to run for a full 30 generations at which point the complete population of 100 networks was saved to file and their individual performance evaluated in more detail.
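Taken together, Steps 8 to 12 amount to a steady-state evolutionary loop, sketched below. The train, prune, fitness and mutate arguments stand for the operations described in Steps 4 to 9; they are assumed helpers rather than ENZO routines, and the simple bias towards the fitter half of the ranked population stands in for the rank-based parent selection of Step 8:

    import random

    def evolve(population, train, prune, fitness, mutate,
               generations=30, n_children=10):
        ranked = sorted(population, key=fitness)        # lower test error = fitter
        for _ in range(generations):
            for _ in range(n_children):
                parent = random.choice(ranked[:len(ranked) // 2])  # prefer fitter parents
                child = prune(train(mutate(parent)))    # Step 9, then Steps 4-7
                ranked.append(child)
            ranked.sort(key=fitness)                    # insert children by rank ...
            ranked = ranked[:len(population)]           # ... and delete the least fit
        return ranked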

4.4.2 Level-2 heuristic

Resilient propagation (RPROP) was used to train each of the numerous mutated networks. RPROP is a fast, local adaptive learning scheme, and performs supervised batch learning in multi-layer perceptrons (Riedmiller & Braun, 1993). It is therefore one of the best algorithms for handling a large number of networks that need rapid training. The basic principle behind RPROP is to eliminate the harmful influence of the partial derivative on the weight step. Thus, only the sign of the derivative is used to indicate the direction of the weight update, with the size of the weight change being determined from a weight-specific update value that is also based on a sign-dependent process. Each time the partial derivative of a weight changes its sign, this indicates that the last update value was too big, and that the algorithm has therefore jumped over a local minimum. The update value is therefore decreased. If the derivative maintains its sign the update value is given a small increase in order to accelerate convergence in shallow regions of the solution surface. Since RPROP is attempting to adapt its learning process to the error function, weight-update and adaptation are performed after the gradient information of the whole pattern set is computed, which means that a batch or epoch learning process must be used. Default parameters were set as follows: initial update-value [0.1], limit for maximum step [50.0], and weight-decay exponent [4.0].
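A condensed sketch of this update rule is given below. Only the core sign-based logic is reproduced: 'step' holds the per-weight update values (initialised to 0.1 for every weight, as in the text), the increase and decrease factors of 1.2 and 0.5 are the usual published defaults and are assumed here, and the SNNS weight-decay term is omitted:

    import numpy as np

    def rprop_step(W, grad, prev_grad, step, step_max=50.0):
        # grow the per-weight step where the gradient keeps its sign,
        # shrink it (and skip the update) where the sign has flipped
        same = grad * prev_grad
        step = np.where(same > 0, np.minimum(step * 1.2, step_max), step)
        step = np.where(same < 0, step * 0.5, step)
        grad = np.where(same < 0, 0.0, grad)      # no update after a sign change
        W = W - np.sign(grad) * step              # only the sign of the gradient is used
        return W, grad, step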

4.4.3 Model creation using random mutation with resilient learning and hard pruning

Six random mutation runs were undertaken based on the initial 22:16:14:1 network. The annual data sets were used in paired formation: within each pair, one data set was used to perform the Level-1 heuristic, and the other to perform the Level-2 heuristic. Network results were also calculated for the third data set on each occasion and used for comparative purposes. Architectural changes were restricted to (a) network pruning and (b) single-parent random mutation operations. No fittest-parent feature implantation or direct transfer of short-cut crossover connections was permitted. Low mutation probabilities, with an equal split between insertion and deletion, were selected in each case, which allowed some degree of alteration whilst maintaining a stable network configuration. More radical mutation is left for later experiments. The program was in each case run for 30 generations and the fittest 100 networks were saved to file. Each run therefore created and evaluated a total of 400 networks, comprising the 100 original models and 300 (10 x 30) offspring, making a grand total of 2,400 networks across the six runs. In the previous experiments there were two independent data sets that could be used to evaluate the best overall network solution. This operation could not be repeated on the final population, or at least not in the same manner, because two full data sets had been involved in the model construction process. So at best there was now just one independent data set that could be used for model evaluation purposes - although it was apparent in the previous work that differences in performance between one river flow data set and another were such that one set on its own could not give an adequate representation. Given this situation, together with the experimental nature of this project, it seemed sensible to retain the internal fitness measure for determining the best overall network model. The fittest member was therefore selected from each final population, for further investigation of its architecture, statistics and hydrographs, and for comparison with the optimal solutions derived from the other training exercises. To provide a program validation check on all members of each final population, the six hundred saved networks were all tested in terms of sum squared error on the three annual data sets, and no major discrepancies were identified.

5. Results

5.1 Number of units and connections

Simple summaries can be useful for assessment purposes, so a straightforward counting exercise was performed on the different network architectures that comprised the chosen pruning algorithm and genetic algorithm solutions. The collated information comprises counts of the number of nodes per layer, the total number of connections, and the percentage of original items remaining. These statistics are reproduced below in Tables 1 and 2.

5.1.1 Network pruning experiments

Extensive reduction in the original network architecture was produced from the automated application of both network pruning algorithms (Table 1). Marked differences were also observed to have arisen in the number and distribution of units, between the two different methods of pruning, and between the three different sets of training data. The input layer saw the greatest variation with the number of input units in the final solutions ranging from 3 to 15 units. The hidden units maintained a more balanced profile. Both hidden layers suffered losses of a similar number, with final numbers ranging from 9 to 14 units in the first hidden layer, and from 6 to 11 units in the second hidden layer. The total number of connections in each solution also exhibited considerable variation ranging from 64 to 258, and there appeared to be no explicit relationship between (a) the number of units in each layer, and (b) the total number of connections.

Method                     Training data   Input nodes   1st hidden   2nd hidden   Connections
Initial network            ---             22            16           14           590
Magnitude based pruning    1984            15 [68.2]     14 [87.5]     8 [57.1]    104 [17.6]
Magnitude based pruning    1985             9 [40.9]     13 [81.3]     7 [50.0]     65 [11.0]
Magnitude based pruning    1986             8 [36.4]     13 [81.3]     6 [42.9]     64 [10.8]
Skeletonization pruning    1984             3 [13.6]     11 [68.8]     9 [64.3]    141 [23.9]
Skeletonization pruning    1985             8 [36.4]     13 [81.3]    11 [78.6]    258 [43.7]
Skeletonization pruning    1986             9 [40.9]      9 [56.3]    11 [78.6]    191 [32.4]

Table 1: Number of components in each preferred network [with % of original]

5.1.2 Random mutation with resilient learning and hard pruning

The automated application of the combined pruning and genetic algorithm procedures also produced extensive reductions in network architecture with marked differences in the number and distribution of units depending upon which combinations of data were used (Table 2). The most striking feature in this table is that the fittest networks all contained a full set of input nodes. Program records indicate that a limited amount of input node mutation occurred - but whether or not this aspect of the result is an outcome of low mutation probabilities, improved fitness performance from multiple inputs, or a spurious artifact associated with the training programme and node insertion procedure is for the moment unknown. The final outcome for both hidden layers, in contrast to the input layer, shows a massive reduction in the number of hidden nodes and a modest degree of between-network variation. Final numbers range from 2 to 12 units in the first hidden layer, and from 2 to 5 units in the second hidden layer, with most counts being much lower than those reported in the earlier experiments. The total number of connections in each solution also exhibited considerable variation ranging from 49 to 269 and again there appeared to be no explicit relationship between (a) the number of units in each layer and (b) the total number of connections. The counts in this instance were similar to those reported in the earlier experiments.

RPROP training data   GA fitness data   Input nodes   1st hidden   2nd hidden   Connections
Initial network       ---               22            16           14           590
1984                  1985              22 [100]       8 [50.0]     2 [14.3]    116 [19.7]
1984                  1986              22 [100]       7 [43.8]     4 [28.6]    117 [19.8]
1985                  1984              22 [100]       6 [37.5]     4 [28.6]    150 [25.4]
1985                  1986              22 [100]       2 [12.5]     2 [14.3]     49 [8.3]
1986                  1984              22 [100]       8 [50.0]     5 [35.7]    119 [20.2]
1986                  1985              22 [100]      12 [75.0]     5 [35.7]    269 [45.6]

Table 2: Number of components in each fittest network [with % of original]

5.2 Visual inspection of the architecture

5.2.1 Magnitude based pruning

From the three network architecture diagrams (Figures 14-16) it can be seen that in all cases magnitude based pruning brought about a massive reduction in the number of weighted connections whilst maintaining a reasonable number of hidden units. In so doing this algorithm created a much reduced network which, from a neural network perspective, has a rather simple structure and would require much less time and effort to train and run. Moreover, in all cases, numerous input links have been maintained with the most recent past river flow value (FLOW t-1), which has the greatest number of connections of any input. Likewise all three networks have maintained several input links with current rainfall (RAIN t). What remains of the other input links from nodes associated with earlier FLOW and RAIN data is less clear cut, and there appears to be some degree of variation from network to network - although the main focus of the networks is nonetheless on maintaining some link with previous FLOW and RAIN inputs. The 1984 model had numerous additional RAIN and FLOW links; in the other models this association was less marked. Models built on 1985 and 1986 data had no input connections with PET or CLOCK, while the 1984 network maintained a link with both.

Figure 14: Preferred network architecture from magnitude based pruning exercise using 1984 training data

Figure 15: Preferred network architecture from magnitude based pruning exercise using 1985 training data

Figure 16: Preferred network architecture from magnitude based pruning exercise using 1986 training data

5.2.2 Skeletonization

As in the case of magnitude based pruning, skeletonization also reduced the number of weighted connections whilst maintaining a reasonable number of hidden units (Figures 17-19). In contrast to magnitude based pruning this algorithm created networks with similar or fewer inputs, although the final number of connections is much greater. Further details are provided in Table 1. From a neural network perspective these reduced networks would also require much less time and effort to train and run. Moreover, in all cases, several links have been maintained with the two most recent past river flow records. The 1985 and 1986 models also have links with other past river flow records, whereas the 1984 model does not. The models for 1985 and 1986 have maintained links with current rainfall (RAIN t), and 1986 also with RAIN t-1; no other rainfall links exist. The 1984 and 1985 models both have links to CLOCK, and the 1986 model has links with past PET values.

Figure 17: Preferred network architecture from skeletonization pruning exercise using 1984 training data

Figure 18: Preferred network architecture from skeletonization pruning exercise using 1985 training data

Figure 19: Preferred network architecture from skeletonization pruning exercise using 1986 training data

5.2.3 Random mutation with resilient learning and hard pruning

The output from these experiments is more difficult to interpret (Figures 20-25). Again, in a similar manner to the earlier pruning experiments, the applied combination of evolution-based model breeding and hard pruning brought about a massive reduction in the number of weighted connections. But, as explained earlier, in sharp contrast to the previous experiments all input units have been maintained - albeit with varying degrees of connection. The question of input relevance must therefore be determined from an examination of the connection patterns alone. The number of hidden units shows more substantial variation, between the least complex 22:2:2:1 and most complex 22:12:5:1 solutions. For the two networks trained on 1984 data, both contained a large number of connections, most of which were associated with RAIN and FLOW inputs; PET inputs likewise had a substantial number of links, while CLOCK inputs were less pronounced. There is little real difference between the two architectures. For the two networks trained on 1985 data, both exhibited a state of almost full connection and no differentiation in terms of input relevance can be made. The two networks do, however, look quite different as a result of massive differences in the number of hidden nodes: the 22:6:4:1 network (1984 fitness data) has a substantial number of connections and the 22:2:2:1 network (1986 fitness data) has just a few. For the two networks trained on 1986 data, both contained a substantial number of connections. The network evaluated on 1984 fitness data contained a large number of connections, most of which were associated with RAIN and FLOW inputs; PET inputs again had a substantial number of links, and CLOCK inputs were again less pronounced. The network evaluated on 1985 fitness data exhibited a state of almost full connection with little overall differentiation. This network also has more units in the first hidden layer, which means that it had a more substantial system of inter-connected parameters.


Figure 20: Fittest network architecture from random mutation exercise
using 1984 (training) and 1985 (fitness evaluation) data

Figure 21: Fittest network architecture from random mutation exercise
using 1984 (training) and 1986 (fitness evaluation) data

Figure 22: Fittest network architecture from random mutation exercise
using 1985 (training) and 1984 (fitness evaluation) data

Figure 23: Fittest network architecture from random mutation exercise
using 1985 (training) and 1986 (fitness evaluation) data

Figure 24: Fittest network architecture from random mutation exercise
using 1986 (training) and 1984 (fitness evaluation) data

Figure 25: Fittest network architecture from random mutation exercise
using 1986 (training) and 1985 (fitness evaluation) data

5.3 Statistical analysis of the output data

One major problem in assessing neural network solutions is the use of global statistics. When neural networks are used to model one-step-ahead predictions the solution will in most cases produce a high or near-perfect goodness-of-fit statistic. Such measures therefore give no real indication of what the network is getting right and wrong or where improvements could be made. Indeed, neural networks are designed to minimise global measures, and a more appropriate metric that identifies real problems and between-network differences is now long overdue. But most other river flow prediction tools suffer from the same problem, so until a recognised solution is available, one or more simple measures must suffice. Since there is no one right method or definitive evaluation test, a multi-criteria assessment was carried out. Eight global evaluation statistics were applied to each output and a brief description of each statistic is given below:

  Min. - minimum (largest negative) prediction error
  Max. - maximum (largest positive) prediction error
  SEE - standard error of the estimate
  SSE - sum of squared errors
  S4E - sum of errors raised to the fourth power, which gives extra weight to the largest deviations
  RMSE - root mean squared error
  MAE - mean absolute error
  % COE - coefficient of efficiency expressed as a percentage
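For completeness, the sketch below computes all eight measures from vectors of predicted and observed flows (in nfu); the precise SEE and % COE formulations used by the original software are assumptions:

    import numpy as np

    def evaluate(pred, obs):
        e = pred - obs                       # one-step-ahead prediction errors
        return {
            'Min.':  e.min(),
            'Max.':  e.max(),
            'SEE':   np.sqrt((e ** 2).sum() / (len(e) - 2)),
            'SSE':   (e ** 2).sum(),
            'S4E':   (e ** 4).sum(),
            'RMSE':  np.sqrt((e ** 2).mean()),
            'MAE':   np.abs(e).mean(),
            '% COE': 100.0 * (1.0 - (e ** 2).sum()
                              / ((obs - obs.mean()) ** 2).sum()),
        }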

5.3.1 Standard procedures, magnitude based pruning and skeletonization

Test statistics related to these modelling operations are provided in Tables 3a to 3c. Each table pertains to one annual test data set, and from an examination of these tables it is apparent that there was no one best overall solution. The best result in each table for a given test statistic produced with a given set of validation data has been coloured red, e.g. the best validation result for the minimum error statistic produced with a network tested on 1984 data is -0.05744 nfu. Looking at the pattern of best performing statistics for each different type of model, three main points can be extracted. First, the overall level of prediction is quite similar and all models produced good results; there is no strong evidence of overfitting, which validates the method of selection. Second, there is no single outright winner. The different models would appear to have different qualities, so in all cases the criteria for selection must be determined according to the task in hand, and the use of alternative objective functions should be considered, e.g. those specific to reservoir management, flood forecasting, or habitat preservation purposes. Third, the different training sets contained different types or amounts of information, and thus produced different levels of generalisation for each given situation. This problem would also have an important influence on the use of individual modelling solutions that were created for a given period, or point in time, and then applied to another one ... which is in fact temporal extrapolation.

                 Standard procedures            Magnitude based pruning        Skeletonization
Training data:   1984      1985      1986       1984      1985      1986       1984      1985      1986
Min.             -0.02756  -0.12122  -0.28000   -0.02508  -0.09577  -0.07803   -0.30212  -0.05744  -0.06943
Max.              0.02321   0.27475   0.30732    0.02441   0.21130   0.27976    0.10633   0.15183   0.15285
SEE               0.00218   0.00962   0.01080    0.00195   0.00727   0.00913    0.00793   0.00727   0.00897
SSE               0.04311   0.83934   1.02444    0.03453   0.49241   0.77598    0.56167   0.49381   0.72999
S4E               0.00000   0.01539   0.02568    0.00000   0.00565   0.01374    0.01002   0.00217   0.00321
RMSE              0.00222   0.00981   0.01083    0.00199   0.00751   0.00943    0.00802   0.00752   0.00914
MAE               0.00156   0.00466   0.00414    0.00133   0.00394   0.00455    0.00436   0.00428   0.00586
% COE            99.75     95.04     93.75      99.80     97.17     95.53      96.63     97.17     95.69

Table 3a: Testing preferred networks with 1984 data

                 Standard procedures            Magnitude based pruning        Skeletonization
Training data:   1984      1985      1986       1984      1985      1986       1984      1985      1986
Min.             -0.29983  -0.17659  -0.12994   -0.23272  -0.17125  -0.14350   -0.22984  -0.19186  -0.16209
Max.              0.09493   0.12502   0.24817    0.19779   0.17271   0.24238    0.18123   0.13218   0.17916
SEE               0.01528   0.00618   0.00715    0.00981   0.00694   0.00686    0.00960   0.00749   0.00781
SSE               2.14472   0.33445   0.45156    0.85264   0.42344   0.42052    0.80978   0.49327   0.53466
S4E               0.03548   0.00209   0.00678    0.00984   0.00315   0.00542    0.01010   0.00357   0.00423
MAE               0.00875   0.00296   0.00265    0.00477   0.00315   0.00314    0.00409   0.00350   0.00432
RMSE              0.01565   0.00618   0.00718    0.00987   0.00695   0.00693    0.00961   0.00750   0.00781
% COE            88.33     98.09     97.44      95.18     97.59     97.65      95.39     97.19     96.95

Table 3b: Testing preferred networks with 1985 data

                 Standard procedures            Magnitude based pruning        Skeletonization
Training data:   1984      1985      1986       1984      1985      1986       1984      1985      1986
Min.             -0.43069  -0.37030  -0.17399   -0.25342  -0.31483  -0.18853   -0.32382  -0.31386  -0.19456
Max.              0.10313   0.14081   0.11269    0.19395   0.15832   0.18385    0.16744   0.18657   0.21652
SEE               0.02180   0.01203   0.00683    0.01346   0.01182   0.00933    0.01518   0.01263   0.01036
SSE               4.25539   1.26821   0.41285    1.59396   1.22539   0.78178    2.02339   1.39775   0.94140
S4E               0.18922   0.04350   0.00280    0.02529   0.02747   0.00921    0.05687   0.03159   0.01141
MAE               0.00955   0.00428   0.00270    0.00558   0.00428   0.00393    0.00551   0.00479   0.00504
RMSE              0.02204   0.01203   0.00687    0.01349   0.01183   0.00945    0.01520   0.01263   0.01037
% COE            90.35     97.06     99.05      96.32     97.16     98.23      95.32     96.76     97.82

Table 3c: Testing preferred networks with 1986 data

5.3.2 Random mutation with resilient learning and hard pruning

Test statistics related to the evolution-based approaches are listed in Tables 4a to 4c. Each table contains the test statistics relating to one annual test data set from which it is again apparent that there was no one best overall solution. It must be stressed at this point that the figures from these tables cannot be used for the purpose of a direct comparison with those provided in the earlier exercise. In the initial exercise an attempt was made to find the optimal modelling solution whereas in these follow-on genetic algorithm exercises the aim was to find an optimal architecture based on a mean error stopping condition of 0.0005 nfu. The two operations would therefore be expected to produce different levels of generalisation because the networks that were involved had been trained to produce different levels of output error. However, in other respects, the pattern of variation in the test statistics is similar to that which was produced in the earlier tables and so the three general observations that resulted from an examination of Tables 3a to 3c can also be applied to these figures.

RPROP training data:   1984                  1985                  1986
GA fitness data:       1985       1986       1984       1986       1984       1985
Min.                   -0.27165   -0.43809   -0.33194   -0.37998   -0.07401   -0.17967
Max.                    0.14306    0.09893    0.24663    0.15666    0.25383    0.21670
SEE                     0.01753    0.01583    0.01541    0.01423    0.01650    0.01544
SSE                     3.96643    2.35324    2.13284    1.82772    2.43064    2.08911
S4E                     0.02361    0.10731    0.04185    0.05640    0.04509    0.01450
MAE                     0.01681    0.00944    0.01058    0.00858    0.00246    0.00099
RMSE                    0.02132    0.01642    0.01563    0.01447    0.01669    0.01547
% COE                  83.54      86.57      87.28      89.15      85.41      87.23

Table 4a: Testing fittest networks with 1984 data

RPROP training data:   1984                  1985                  1986
GA fitness data:       1985       1986       1984       1986       1984       1985
Min.                   -0.44597   -0.61062   -0.50871   -0.55396   -0.25340   -0.24718
Max.                    0.14213    0.10342    0.12665    0.14620    0.17607    0.16051
SEE                     0.02757    0.02458    0.01727    0.01952    0.01235    0.01449
SSE                     6.82100    5.31593    2.61340    3.33685    1.33988    1.84931
S4E                     0.20778    0.46206    0.17787    0.27223    0.01591    0.01929
MAE                     0.01896    0.01232    0.00782    0.00840    0.00673    0.00849
RMSE                    0.02790    0.02463    0.01727    0.01952    0.01237    0.01453
% COE                  61.98      69.77      85.08      80.94      92.36      89.50

Table 4b: Testing fittest networks with 1985 data

RPROP training data:   1984                  1985                  1986
GA fitness data:       1985       1986       1984       1986       1984       1985
Min.                   -0.66836   -0.77372   -0.68467   -0.71219   -0.42376   -0.36923
Max.                    0.15809    0.10349    0.14185    0.15715    0.17378    0.17015
SEE                     0.03592    0.04334    0.03496    0.03808    0.01994    0.02358
SSE                    11.65647   16.49307   10.71488   12.71874    3.48315    4.90989
S4E                     1.38230    4.18160    2.10349    2.78505    0.17819    0.19223
MAE                     0.01887    0.01426    0.01140    0.01174    0.00835    0.01061
RMSE                    0.03648    0.04339    0.03497    0.03810    0.01994    0.02367
% COE                  73.82      61.89      75.21      70.58      91.93      88.71

Table 4c: Testing fittest networks with 1986 data

All of the above statistics required some form of reduction. A simple scoring system (which could have been weighted in some manner or other) was first devised for the rapid identification of the two 'best' overall modelling solutions. In each individual table a score of one was given to the preferred or fittest solution for each best performing 'fitness evaluation' or 'unseen validation data' assessment statistic. No mark could be awarded for those scenarios in which the testing data were identical to those on which the network was trained. This procedure therefore equates to one mark being awarded for each statistical measure per annual data set. These values are coloured red in the error statistic tables and were subject to a more detailed explanation earlier. Final marks are provided in Tables 5 and 6. To aid interpretation these scores are also summed to provide 'per training data' and 'per modelling process' totals.

                          Training data
                          1984   1985   1986   Row total
Standard procedures       2      0      2      4
Magnitude based pruning   2      10     4      16
Skeletonization           0      3      1      4
Column total              4      13     7      24

Table 5: Final scores for pruning algorithm exercise

                 Training data
Fitness data     1984   1985   1986   Row total
1984             0      1      13     14
1985             0      0      3      3
1986             2      5      0      7
Column total     2      6      16     24

Table 6: Final scores for genetic algorithm exercise

In addition to the highest performer it is also useful to investigate the level of variation exhibited for each data set between the different neural network solutions. High levels of variation reflect marked differences in a test statistic, which is indicative of dissimilar generalisation or poor modelling capabilities. This could be applicable on an annual basis, on a more localised event-type basis, or on some combination of both; further investigation of the output data would be required to provide a definitive answer to this question. Two further tabulations have therefore been constructed to provide appropriate between-model variation measures, which in this instance have been computed and reported using the coefficient of variation (CV), i.e. standard deviation expressed as a percentage of the mean (Table 7). Numerous differences in variation were observed within the CV statistics, which ranged from a minimum of 2.10% to a maximum of 129.13%. S4E statistics exhibited the greatest level of variation and these numbers have been coloured red. % COE statistics exhibited the least amount of variation and these numbers have been coloured blue. Important patterns can also be observed within the variation statistics, between the various annual data sets related to each method, and between each annual data set for one method and its counterpart in the other.
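The variation measure itself reduces to a couple of lines; whether the original computation used the sample or the population standard deviation is an assumption:

    import numpy as np

    def percent_cv(values):
        # standard deviation expressed as a percentage of the mean
        v = np.asarray(values, dtype=float)
        return 100.0 * v.std(ddof=1) / v.mean()   # sample standard deviation assumed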

             Standard procedure and pruning experiments   Model breeding experiments
Test data:   1984      1985      1986                     1984      1985      1986
Min.         87.84     27.45     31.03                    48.12     35.30     27.52
Max.         62.67     29.45     23.28                    33.59     17.87     17.08
SEE          43.45     32.65     33.42                     7.05     30.38     27.52
SSE          61.14     78.43     72.41                    31.01     59.85     49.25
S4E         102.14    115.35    129.13                    67.98     87.12     86.33
MAE          38.17     44.82     37.39                    70.81     43.80     29.04
RMSE         43.14     33.65     33.68                    14.45     30.75     27.55
% COE         2.10      3.17      2.62                     2.20     14.79     14.72

Table 7: % Coefficient of Variation grouped according to method

5.4 Graphical analysis of the hydrograph plots

Time series plots of forecast and actual flows were inspected for bias in network performance. It is important to determine whether the computed global measures applied to all stages of flow and in different seasons because it is possible to get significant statistical relationships on long time series, where the low flows are modelled in an accurate manner, but where high flows are in error. These visual plots were also used to check for a consistent temporal response. It was anticipated, for example, that there could be greater errors in forecasts for winter snowmelt events which are rare occurrences in the training data. It was also important to check: (a) the timing of events; (b) the accuracy of the starting point of rising flow; and (c) the ability to model peak discharge in terms of both time and volume.

Three representative hydrographs that provide typical illustrations of the overall results were selected for graphical presentation (Figure 26[a - c]). These graphs contain information on three different individual 50 hour periods taken from 1986. Each individual hydrograph depicts the output response to a different type of situation: [a] low flow; [b] medium flow; and [c] high flow events. The three periods are not connected to each other in time but occur within a similar hydrological season and form part of a longer sequence of storm events. The two chosen models were those that produced the best overall score per model building exercise. Each graph contains three plots. Blue lines represent actual river flow values, red lines represent predictions associated with the best performing pruned network model (magnitude based pruning / 1985 training data), and green lines represent predictions associated with the best performing genetic algorithm network model (1984 fitness evaluation data / 1986 training data). The decision to use 1986 test data in these plots was based on three considerations. First, it was important to use data that had a high 'information content'. Second, to use data on which an optimised solution had been developed would be to use an unrepresentative sample. Third, to minimise exogenous factors, it was important to have a near temporal juxtaposition of storm events, comprising low flow, medium flow and high flow situations.

Figure 26: Three 50 hour hydrograph plots taken from Autumn / Winter 1986.
Each individual graph contains three plots: real values (blue); highest scoring pruning algorithm model predictions (red); and highest scoring genetic algorithm model predictions (green).

6. Discussion

A series of initial investigations has been performed using automated approaches to help resolve the complex task of designing an optimal neural network architecture for rainfall-runoff modelling, but the results were inconclusive. This was not surprising given the nature of neural network modelling, in which multiple solutions of a similar but not identical nature are the most probable outcome of such investigations. However, the exploration process yielded a great deal of useful basic knowledge about the need for simple architectures. In addition, there is now some hard evidence to support previous suppositions about which inputs have the most important influence.

Different modelling operations resulted in different network architectures. One possible interpretation of the architectures produced from the pruning exercises is that models trained on poorer data sets have to look beyond the basic rainfall-runoff information in order to create a reasonable solution surface, and that this requirement varies both within and between the individual data sets. Both sets of pruning results also suggest that different factors are being taken into account owing to differences in the function being modelled, and that this problem will be reflected in poor results when attempting to transfer the final models from (a) testing with the original training data to (b) testing with validation data from a different period in time.

Although all of the networks produced a reasonable set of results, major variation in network complexity existed, which is another important feature to emerge from these investigations. This is both of scientific interest and a possible cause for concern. From a positive standpoint, these experiments have been investigating items at the 'mesostructure' scale, i.e. the manner in which the neural network is organised, including such features as the number of layers, the connection patterns, and the flow of information. But perhaps the 'microstructure' is also having a strong influence, e.g. the processing characteristics of each node within the network, so these two levels should be examined together in some linked manner. From a negative standpoint, given the overall lack of a consistent result, it is possible that no real optimal solution can be achieved and that what appear to be improved architectural solutions are in fact just manifestations of a random sampling process, with no hidden meaning in the arrangement of the network nodes and weights. These differences were taken to extremes in the model breeding experiments, where one solution had 4 hidden units and 49 weighted connections (Figure 23) and another had 17 hidden units and 269 weighted connections (Figure 25). However, it is also possible to conclude from these results that the exact intricacies of the architecture are not critical, which in turn suggests that less effort should be expended on searching for an optimal solution when, for most practical purposes, a simple sub-optimal solution would be sufficient for the task in hand and much quicker to obtain. More radical and extensive analytical experimentation, coupled with a more detailed internal inspection of the final models, is therefore required to test this hypothesis. A minimal sketch of the breeding mechanism is given below.
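
To make the breeding mechanism concrete, the following toy Python sketch evolves a single architectural parameter (hidden-layer size) by random mutation and survival of the fittest. The fitness function here is a hypothetical stand-in for training a network and scoring it on fitness-evaluation data; the actual package evolved richer connection structures than this.

```python
import random

def fitness(hidden_units, rng):
    """Stand-in for training a network and scoring it on evaluation data.
    In the real experiments this would be an error statistic computed on
    the fitness-evaluation year; here a noisy penalty on over- and
    under-sized nets is used instead. Smaller is better."""
    return abs(hidden_units - 8) + rng.random()

def breed(generations=20, pop_size=10, seed=0):
    rng = random.Random(seed)
    population = [rng.randint(2, 20) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda h: fitness(h, rng))
        survivors = scored[: pop_size // 2]                     # survival of the fittest
        children = [max(1, h + rng.choice([-2, -1, 1, 2]))      # random mutation
                    for h in survivors]
        population = survivors + children
    return sorted(population, key=lambda h: fitness(h, rng))[0]

print("Fittest hidden-layer size:", breed())
```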

Statistical assessment was problematic and no single definite winner could be established. Table 5 indicates that the outright favourite among the networks created using standard procedures and pruning is magnitude based pruning trained on 1985 data. This solution was awarded ten of the twenty-four marks and was also a clear leader in terms of both training data and modelling process scores. Table 6 indicates that the clear favourite among the networks created using random mutation with resilient learning and hard pruning is RPROP training on 1986 data coupled with GA fitness evaluation on 1984 data. This solution was awarded thirteen of the twenty-four marks. It is interesting to note that for training purposes the 1986 data set now moves into first position, whereas the 1985 data set was the best performer in the earlier pruning exercises. The successful use of the 1984 data set for network fitness evaluation and model breeding purposes is more controversial, since this data set is known to be 'information poor' and, all other things being equal, should therefore have given the weakest performance. Irrespective of these difficulties, the integrated combination of batch training, a fixed stopping condition, and fitness evaluation has nevertheless managed to create not just a reasonable modelling generalisation but one that is also transferable to alternative data sets. Table 7 also contains important information: high values in this table for a particular method indicate variable results. S4E in all but one instance exhibited the greatest degree of variation, which is interesting because it is this statistic that places particular emphasis on model fit at peak flows. It is therefore the fit of the various neural network models to such phenomena that exhibits the greatest amount of variation across the different solutions. %COE produced the least variation per test data set and was therefore unable to offer sufficient differentiation between the numerous neural network models. In various instances, marked similarities can be observed between the results obtained from testing with 1985 and 1986 data, and marked differences between these two results and those obtained for 1984. Important between-method comparisons can also be made for the various statistical measures related to each annual data set, e.g. there are substantial differences associated with the results for 1984.
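
For reference, plausible forms of the reported test statistics can be sketched as follows. S4E is assumed here to denote the sum of fourth-power errors (consistent with its stated sensitivity to peak flows) and %COE the Nash-Sutcliffe coefficient of efficiency expressed as a percentage; SEE is omitted because its exact formulation is not given in the text.

```python
import numpy as np

def flow_statistics(observed, predicted):
    """Assumed forms of the test statistics used in the tables."""
    o = np.asarray(observed, dtype=float)
    e = o - np.asarray(predicted, dtype=float)
    return {
        "MAE": np.mean(np.abs(e)),                # mean absolute error
        "SSE": np.sum(e ** 2),                    # sum of squared errors
        "S4E": np.sum(e ** 4),                    # sum of fourth-power errors (peak sensitive)
        "RMSE": np.sqrt(np.mean(e ** 2)),         # root mean squared error
        "%COE": 100.0 * (1.0 - np.sum(e ** 2) / np.sum((o - o.mean()) ** 2)),
    }

# Hypothetical four-step flow series, purely for illustration.
print(flow_statistics([3.0, 5.0, 12.0, 7.0], [2.8, 5.5, 9.0, 7.4]))
```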

Graphical output associated with the highest scoring pruning algorithm solution [PAS] and the highest scoring genetic algorithm solution [GAS] is provided in Figure 26[a-c]. These three hydrographs contain a wealth of additional information about the underlying functions being reproduced, and a clear indication of the output response associated with a given situation for each individual model. In all three temporal windows most low flow situations have been modelled in a reasonable manner; PAS and GAS produce similar and accurate results for this section of the solution surface. Nevertheless, even at low levels of flow, GAS generates a noticeable number of underpredictions when the level of flow is falling. The small to medium events in hydrographs [a] and [b] are also modelled in an acceptable manner, but with clear problems in the timing and magnitude of peak flow predictions. PAS generates the greater peak flow errors and its peak predictions are all late; GAS is therefore the better model in medium event situations, which is the exact opposite of the pattern observed in falling low-flow instances. Hydrograph [c] depicts a peaked high flow event, and the relative difference between the models produces a striking plot. Neither model performed particularly well: PAS was the better of the two, with reasonable timing but poor level prediction, while GAS produced a broad, near-flat peak that fell well short of the high levels required. From these observations it is apparent that the three hydrographs together confirm the statistical results. Low flow and limited-change situations are modelled quite well; peak flow events are not, and considerable variation exists between the different modelling solutions in the manner in which such events are modelled. Moving from description to explanation, it must be remembered that a direct comparison between the PAS and GAS models is limited, because one model was produced from a desire to create an optimal solution whereas the other was produced from a desire to create an optimal architecture based on a fixed level of error. But one is still forced to wonder about the extent to which these differences in output can be attributed to differences in the method of model creation. For example, did the use of a batch update procedure and a fixed stopping condition prevent the GAS model from producing more accurate high flow predictions? Questions of this nature are the subject of further research.

The discussion thus far has focused on the neural network aspects of river flow prediction. Although much of the reported work has been of a computational nature, it is also important to view these results from a hydrological perspective. Simple neural network models can be produced and evaluated using automated techniques: in this work several thousand models were created and tested using batch programs and overnight runs, a method of model creation with clear cost-benefit implications for model development and application times. Although each neural solution is in fact just a combination of simple processing elements and weighted connections, the power of these interconnected elements to act in concert and produce complex non-linear models is considerable. In the reported experiments acceptable results were produced from a limited number of input measurements. These computational devices can also perform data fusion operations using different types of data, from various sources and at different resolutions, which was hitherto more or less impracticable. Such capabilities have clear implications for the collection or purchase of useful input data, which is often unavailable or, if available, in an inappropriate format. Most existing hydrological models also focus on peak flow prediction. However, whilst offering a complete hydrological modelling solution, the neural networks were also found to be excellent low flow predictors, which is of particular merit for water resource applications in drought prone regions. Other areas in which good low flow predictions would be beneficial include reservoir management in drought periods or semi-arid locations, river balance planning and water supply operations, the design of irrigation and water extraction regimes, and the promotion of various ecological and aesthetic interests.
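
A hypothetical sketch of the kind of batch driver behind such overnight runs is given below; the function name, method labels and CSV log are illustrative assumptions rather than the actual scripts used in these experiments.

```python
import csv
import itertools
import random

def build_and_score(train_year, method, seed):
    """Stand-in for building one network and returning its test RMSE;
    the real routine would train and prune (or breed) a model in batch mode."""
    return round(random.Random(hash((train_year, method, seed))).uniform(1.5, 3.0), 3)

# Every combination of training year, method and random seed is built,
# scored and logged for later multi-criteria assessment.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["train_year", "method", "seed", "rmse"])
    for year, method, seed in itertools.product(
            [1984, 1985, 1986], ["magnitude_prune", "optimal_prune", "breed"], range(10)):
        writer.writerow([year, method, seed, build_and_score(year, method, seed)])
```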

7. Conclusions

Simple iterative learning, which has until now been the main network optimisation tool used in most geographical modelling operations, was extended in this research to create a more complex procedure focused on the progressive removal of 'unimportant components' in a destructive cycle of training and pruning. Network reconstruction sequences and fitness testing using alternative criteria were then added to this operation to create a powerful automated model building environment. In certain instances there was evidence to suggest that a more suitable network architecture, with improved generalisation capabilities, had been found. In all cases, however, there was a substantial reduction in network architecture, which produced neural network models with fewer computational overheads. The removal of various non-essential inputs, which has clear implications for data collection requirements and information processing times, was another characteristic of the pruned networks. Further extended and more radical evolution-based river flow forecasting and prediction modelling investigations are now planned. The destructive cycle is sketched below.
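
A minimal Python sketch of one such destructive cycle follows, assuming simple magnitude-based pruning of a weight vector; the retraining step between prunes is noted but omitted, and the weight vector itself is a random stand-in for a trained network.

```python
import numpy as np

def magnitude_prune(weights, fraction=0.1):
    """Zero out the smallest-magnitude fraction of the remaining weights,
    mimicking one destructive step of the training-and-pruning cycle."""
    w = weights.copy()
    alive = np.flatnonzero(w)
    k = max(1, int(fraction * alive.size))
    smallest = alive[np.argsort(np.abs(w[alive]))[:k]]
    w[smallest] = 0.0
    return w

rng = np.random.default_rng(0)
w = rng.normal(size=50)              # stand-in for a trained weight vector
for cycle in range(5):
    w = magnitude_prune(w)           # prune ...
    # ... then retrain the surviving weights before the next cycle (omitted)
    print(f"cycle {cycle}: {np.count_nonzero(w)} weights remain")
```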

There is still no reliable scoring system in existence that can overcome the difficulties of measuring peaks and troughs or perform event-based separation of the appropriate statistical descriptors. Multi-criteria evaluation, with appropriate weightings based on specific end-user requirements, offers one possible method through which this goal could be achieved, but the application of all such subjective approaches must be examined in a rigorous and comprehensive manner. There is also a pressing need for dedicated software programs that can perform multi-criteria assessment, perhaps in an interactive manner or with direct links to an EDA (Exploratory Data Analysis) toolbox.
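
One possible shape for such a multi-criteria scheme is sketched below; the statistics, normalisation and weights are hypothetical illustrations of end-user priorities rather than a recommended configuration.

```python
def multi_criteria_score(stats, weights):
    """Weighted aggregation of already-normalised error statistics,
    with weights reflecting hypothetical end-user priorities."""
    return sum(weights[name] * value for name, value in stats.items())

# Hypothetical normalised statistics (0 = best, 1 = worst) for two models.
model_a = {"RMSE": 0.30, "S4E": 0.80, "MAE": 0.25}
model_b = {"RMSE": 0.40, "S4E": 0.35, "MAE": 0.45}

# A flood-warning user might weight peak-sensitive S4E most heavily.
flood_warning_weights = {"RMSE": 0.2, "S4E": 0.6, "MAE": 0.2}
for name, stats in [("A", model_a), ("B", model_b)]:
    print(name, round(multi_criteria_score(stats, flood_warning_weights), 3))
```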

Neural networks work. These computational tools offer great scientific promise and real practical benefits, but a great deal more exploration still needs to be undertaken on the use of these tools and their application to different types of practical problems in different areas of geographical research. There is also a need to examine the available options for building better networks and to investigate the different methodologies for their efficacious application.

References

Abrahart, R.J. 1998. "Neural networks and the problem of accumulated error: an embedded solution that offers new opportunities for modelling and testing". Proceedings Hydroinformatics'98: Third International Conference on Hydroinformatics, Copenhagen, Denmark, 24-26 August 1998.

Abrahart, R.J. and Kneale, P.E. 1997. "Exploring Neural Network Rainfall-Runoff Modelling". Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.35-9.44.

Abrahart, R.J. and See, L. 1998. "Neural Network vs. ARMA Modelling: constructing benchmark case studies of river flow prediction". Proceedings GeoComputation'98: Third International Conference on GeoComputation, University of Bristol, United Kingdom, 17-19 September 1998.

Bathurst, J. 1986. "Sensitivity analysis of the Systeme Hydrologique Europeen for an upland catchment", Journal of Hydrology, 87, 103-123.

Beven, K., Kirkby, M.J., Schofield, N. and Tagg, A.F. 1984. "Testing a physically-based flood forecasting model (TOPMODEL) for three U.K. catchments", Journal of Hydrology, 69, 119-143.

Blackie, J.R. and Eeles, W.O. 1985. "Lumped catchment models". Chapter 11 in: Anderson, M.G. and Burt, T.P. Eds. 1985. Hydrological Forecasting. Chichester: John Wiley & Sons Ltd.

Dawson, C.W. and Wilby, R. 1998. "An artificial neural network approach to rainfall-runoff modelling", Hydrological Sciences Journal, 43, 1, 47-66.

Fischer, M.M. and Abrahart, R.J. (forthcoming). "Neurocomputing - Tools for Geographers". Chapter 8 in: Openshaw, S., Abrahart, R.J. and Harris, T.E. Eds. GeoComputation. Reading: Gordon & Breach.

Fischer, M.M. and Gopal, S. 1994. "Artificial neural networks: a new approach to modelling interregional telecommunication flows", Journal of Regional Science, 34, 503-527.

French, M.N., Krajewski, W.F. and Cuykendall, R.R. 1992. "Rainfall forecasting in space and time using a neural network", Journal of Hydrology, 137, 1-31.

Hsu, K-L, Gupta, H.V. and Sorooshian, S. 1995. "Artificial neural network modeling of the rainfall-runoff process", Water Resources Research, 31, 10, 2517-2530.

Johnstone, D. and Cross, W.P. 1949. Elements of Applied Hydrology. New York: Ronald. Cited in: Minns, A.W. and Hall, M.J. 1997. "Living with the ultimate black box: more on artificial neural networks", Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.45-9.49.

Karunanithi, N., Grenney, W.J., Whitley, D. and Bovee, K. 1994. "Neural Networks for River Flow Prediction", Journal of Computing in Civil Engineering, 8, 2, 201-220.

Knapp, B.J. 1970. Patterns of water movement on a steep upland hillside, Plynlimon, Central Wales, Unpublished PhD Thesis, Department of Geography, University of Reading, Reading.

Kohonen, T. 1995. Self-Organizing Maps. Heidelberg: Springer-Verlag.

Lorrai, M. and Sechi, G.M. 1995. "Neural nets for modelling rainfall-runoff transformations", Water Resources Management, 9, 299-313.

Minns, A.W. and Hall, M.J. 1996. "Artificial neural networks as rainfall-runoff models", Hydrological Sciences Journal, 41, 3, 399-417.

Newson, M.D. 1976. The physiography, deposits and vegetation of the Plynlimon catchments, Institute of Hydrology, Wallingford, Oxon. Report No. 30.

NERC (Natural Environment Research Council). 1975. Flood Studies Report, Vols 1-5. London: Natural Environment Research Council. Cited in: Minns, A.W. and Hall, M.J. 1997. "Living with the ultimate black box: more on artificial neural networks", Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.45-9.49.

Openshaw, S. and Openshaw, C. 1997. Artificial Intelligence in Geography. Chichester: John Wiley & Sons Ltd.

Quinn, P. F. and Beven, K. J. 1993. "Spatial and temporal predictions of soil moisture dynamics, runoff, variable source areas and evapotranspiration for Plynlimon, Mid-Wales", Hydrological Processes, 7, 425-448.

Raman, H. and Sunilkumar, N. 1995. "Multivariate modelling of water resources time series using artificial neural networks", Hydrological Sciences Journal, 40, 2, 145-163.

Riedmiller, M. and Braun, H. 1993. "A direct adaptive method for faster backpropagation learning: The RPROP algorithm". In Proceedings ICNN'93: IEEE International Conference on Neural Networks, 1993.

Rizzo, D.M. and Dougherty, D.E. 1994. "Characterization of aquifer properties using artificial neural networks: Neural kriging", Water Resources Research, 30, 2, 483-497.

Rogers, L.L. and Dowla, F.U. 1994. "Optimization of groundwater remediation using artificial neural networks with parallel solute transport modeling", Water Resources Research, 30, 2, 457-481.

Schaap, M.G. and Bouten, W. 1996. "Modelling water retention curves of sandy soils using neural networks", Water Resources Research, 32, 10, 3033-3040.

Smith, J. and Eli, R.N. 1995. "Neural-Network Models of Rainfall-Runoff Process", Journal of Water Resources Planning and Management, 121, 6, 499-509.