Robert J. Abrahart
Department of Geography, University College Cork, Ireland.
Email: bob@ashville.demon.co.uk
Linda See
Centre for Computational Geography
School of Geography, University of Leeds, UK.
Email: l.see@geog.leeds.ac.uk
This paper provides forecasting benchmarks for river flow prediction in the form of a numerical comparison between neural networks and ARMA models. Naive predictions are also provided. Benchmarking was based on a three-year period of continuous river flow data for two catchments: the Upper River Wye (Central Wales) and the River Ouse (Yorkshire). Two sets of benchmarks have been established: (i) modelling the central year, with the two adjacent years being used for validation purposes; and (ii) modelling the entire three-year period, with the relative performance of each individual year being used as a metric. The choice of an appropriate neural network architecture for hydrological forecasting, in terms of hidden layers and nodes, was first investigated. Two simple neural network architectures were selected for more detailed evaluation, from which one final design was then chosen for comparison with the ARMA model and naive prediction forecasts. Six global evaluation measures were used to provide ‘goodness-of-fit’ statistics. Alternative evaluation measures were also used to examine specific performance on storm events, such as peak prediction and time-to-peak. The benchmark results showed that simple neural networks were able to produce similar results to ARMA models given the same data inputs. Finally, a self-organising map was used to split the data series into classes, where each class represents a different type of hydrological event e.g. rising limb, falling limb, etc. Two rising event types were then modelled with a neural network and improved forecasting performance was obtained. The results from this data-splitting operation suggest some interesting possibilities for future multi-network modelling explorations, and point to the clear need for current benchmarks against which all such advances can be compared.
River flow forecasts are an essential requirement for solving a wide range of scientific and/or management problems. Physical models offer one possible forecasting method, but such tools are often considered to be too complex, or too demanding, for most practical implementations. Simpler approaches offered through ‘conceptual’ and ‘black-box’ solutions are thus fast becoming attractive alternatives. Artificial neural networks, for instance, have much to offer and show great promise as effective tools for modelling the rainfall-runoff response. Nevertheless, within the hydrological sciences, the widespread acceptance and adoption of these new data-driven tools has been slow. The most demanding application of neural networks in the area of rainfall-runoff simulation modelling has in most cases involved nothing more sophisticated than a simple one-step-ahead time series prediction. Such basic implementations are thus comparable to standard statistical methods of the kind from which hydrologists have long since moved forward. But if neural networks are to be given serious consideration as forecasting instruments in their own right, then direct comparison with more formidable equation-based tools and/or state-of-the-art distributed process models should be the long-term goal of all serious hydrological modelling neural network advocates. However, before this can happen, it must be demonstrated that neural networks possess sufficient capabilities, or show sufficient promise, to make such experimentation worthwhile. The role of these tools must therefore be extended both above and beyond their current limited or inconsequential usage, which in turn creates a requirement for well-reported benchmark case studies against which these various improvements can be assessed. This paper provides a group of dedicated benchmarks for two catchments: the Upper Wye (Central Wales) and the River Ouse (Yorkshire).
Benchmarking involved a numerical comparison between standard backpropagation networks and a recognised statistical time series predictor. In addition, naive predictions, which use the current value as the prediction, were produced for comparison as a bottom-line benchmark. Numerous different network architectures were first explored in a comprehensive effort to find an optimal model. Two main modelling operations were then implemented in each catchment using continuous river flow time series data for a three-year period. For the first set of benchmarks, models were developed on data for the central year, with the other two years being retained for model validation purposes. For the second set of benchmarks, the entire three-year data set was modelled, and the relative performance of each individual year against the global three-year model assessed. In addition to producing standard ‘goodness-of-fit’ statistics, some alternative performance measures, more relevant to high flow situations and flood forecasting, were also investigated. The final part of this paper contains a report on the application of a self-organising map (Kohonen, 1995) to the same data set. This tool was used to perform a classification of the data series into distinct hydrological event types, two of which were then modelled on an individual basis to determine if more accurate neural network solutions could be achieved.
Neural networks offer an important alternative to traditional methods of data analysis and modelling. In conventional computing a model is expressed as a series of equations which are then translated into 3GL code, such as C or Pascal, and run on a computer. But a neural network is much more flexible. Instead of being told the precise nature of a relationship or model, the neural network is trained to best represent the relationships and processes that are implicit, albeit invisible, within the data. There are several different types of neural network. The neural network that is of greatest interest at the moment is the feedforward multi-layered perceptron (Figure 1). The basic structure is not complicated. It consists of a number of simple processing elements (also known as neurons or nodes), which are arranged in a number of different layers, and joined together to form a network. The processing elements sum their inputs, effect a non-linear data squashing process (e.g. using a sigmoidal function), and then transmit a single output to all processing elements in the next layer via the connections. In Figure 1, for example, data enters the network on the left and is then fed forward through successive layers to emerge on the right. This is called a feedforward network because the flow of information is all in one direction, going from input nodes (left) to output nodes (right). The outer layer where information is presented to the network is called the input layer. The layer on the far side, where processed information is retrieved, is called the output layer. All layers between the two outer ones are called hidden layers (being hidden from direct contact with the outside world). To avoid confusion the recommended method for describing a neural network is based on the number of hidden layers. Figure 1, for example, is a two-hidden-layer network.
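The feedforward computation described above can be sketched in a few lines. This is a minimal illustration only, not the simulator used in this study: the function names and the toy 2:2:1 network with all-zero weights are assumptions introduced here for clarity.

```python
import math

def sigmoid(z):
    """Logistic squashing function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each node sums its weighted inputs plus a bias, then squashes the sum."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def feed_forward(inputs, layers):
    """Pass a pattern from the input side through every (weights, biases) layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical 2:2:1 network with all-zero weights, purely for illustration.
toy_network = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer: 2 nodes
    ([[0.0, 0.0]], [0.0]),                   # output layer: 1 node
]
print(feed_forward([0.3, 0.7], toy_network))  # [0.5]
```

With zero weights every node receives a summed input of zero, so each sigmoid returns 0.5; training replaces these weights with values that reproduce the desired outputs.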
There are weights on each of the interconnections and it is these weights that are altered during the training process to ensure that the inputs produce an output that is close to the desired value, with an appropriate ‘training rule’ being employed to adjust the weights in accordance with the data that is presented to the network e.g. backpropagation (Rumelhart et al., 1986; Tveter, 1996-8). Through the process of training the network will ‘learn’ from examples and in so doing acquire some capabilities for generalisation beyond the training data. After training has stopped, input data are then passed through the network in its non-training mode, where the data are transformed within the hidden layers to provide an appropriate output value or set of output values, a process termed ‘associative mapping’.
Figure 1: Basic configuration of a feedforward multilayer perceptron 
Neural networks are seen to offer a plethora of good hydrological modelling opportunities and various successes have now been reported in the literature e.g. French et al. (1992); Karunanithi et al. (1994); Rizzo & Dougherty (1994); Rogers & Dowla (1994); Lorrai & Sechi (1995); Raman & Sunilkumar (1995); Minns & Hall (1996); Schaap & Bouten (1996). Most neural network related hydrological research has to date focused on rainfall-runoff applications that range from modelling a 5 x 5 cell synthetic watershed using inputs derived from a stochastic rainfall generator (Hsu et al., 1995) to predicting runoff for the Leaf River Basin (1,949 km^{2}) using five years of daily data (Smith & Eli, 1995). A growing appreciation of the potential associated with these data-driven technologies and their hydrological application has therefore encouraged us to explore their effectiveness as a forecasting tool at the catchment scale. In the initial stages of this emergent paradigm it is of prime importance to examine and report on the science involved. In this respect we have now identified three principal directions for current neural network hydrological research:
This paper is our first report produced under item three, although the results do in fact address several broader issues relating to items one and two. The benchmarking exercise was restricted to an investigation of models developed on the minimum type of original data that other workers would have access to i.e. river flow ordinates. Associated data of a more specialist or catchment-specific nature were excluded from consideration so that the whole modelling operation could be reproduced on alternative catchments or in other institutions. We also sought to build models that ran on maximum information extracted from minimum amounts of data, using simple pattern formats and spreadsheet-based data manipulation techniques. Likewise, within the modelling process, we sought to discover a suitable network architecture and training programme with sufficient power to provide a universal solution for the problematic task of river flow prediction. The overriding aim of this work was however to generate a set of practical results for subsequent use in later modelling exercises, which is of course the main function of benchmarks and a justification for their existence.
The two areas that were chosen for this benchmarking exercise are the Upper River Wye in Central Wales and the River Ouse in Yorkshire (Figure 2). The Upper Wye comprises an upland research catchment that has been used on several previous occasions for various hydrological modelling purposes e.g. Bathurst (1986); Quinn & Beven (1993). The basin covers an area of some 10.55 km^{2}, elevations range from 350-700 m above sea level, and average annual rainfall is in the order of 2500 mm. Ground cover comprises grass or moorland and soil profiles are thin, most of the area being peat, overlying a podzol or similar type of soil (Knapp, 1970; Newson, 1976). This small catchment has a quick response. Data were available from the gauging station at Cefn Brwyn. The Ouse has a much larger catchment, which covers an area of 3,286 km^{2}, and encompasses an assorted mixture of urban and rural land uses. Gauging stations are distributed throughout the catchment along each of its three main tributaries: the Nidd, Swale and Ure. Two gauging stations were chosen for this exercise: (i) Skelton, located just north of York on the River Ouse; and (ii) Kilgram, located further upstream on the River Ure. Skelton, with its downstream location, far from the headwaters, has a relatively stable regime. Kilgram, situated further upstream and hence closer to the headwaters, has a flashier regime, with corresponding flood types that are more difficult to predict.
Figure 2: Location map for the (a) River Ouse and (b) Upper River Wye catchments 
Previous research on the use of artificial neural networks for river flow prediction has focused on the implementation of standard backpropagation architectures and fixed-length moving time frame windows. In most cases true past river flow records, or the differences between each individual time step, have been used as input data to predict the next actual ‘flow’ or ‘change in flow’ value (e.g. Abrahart & Kneale, 1997; Minns & Hall, 1997). Although each of these two input-output combinations will have its own particular set of merits, to reject one in favour of another is to (a) create an information sink and (b) disregard potential synergistic gains. It also goes against the neurocomputing principle of no a priori models. If both input formats have something to offer then both should be included, thus allowing the network to decide for itself which inputs are relevant in each given situation. The case for outputs is less clear-cut. Whilst both output values could in fact be modelled at the same time, modelling one item with one network should make for quicker training, and be somewhat less demanding in terms of an overall neural network solution surface. We therefore opted where possible to model combined inputs (‘past flow’ and ‘past change in flow’) against each of the individual outputs (‘current flow’ or ‘current change in flow’).
All available data were on a one-hour time step for the period 1984-6. River flow values for the three stations were first pre-processed into what has now become the standard format for temporal neural network modelling. This format comprised a fixed-length moving time frame window wherein each window contained related data for both present and past river flow values, these being grouped together into paired input-output patterns for presentation to the network. River flow forecasts can then be made using previous river flow data (inputs) to predict current river flow events (output). The initial multi-column pattern files had separate columns for: annual hour-count (CLOCK), FLOW for t-6 to t-1 (consecutive river flow records), DIFF for t-6 to t-1 (difference between consecutive river flow records), and either FLOW or DIFF at time t (the value to be predicted). A ‘seasonal’ modelling component (CLOCK) was also incorporated to allow for variation in system output according to ‘time-of-year’ and thus avoid possible excessive generalisation. For example, an agricultural catchment might be expected to produce different responses in summer (drier) and winter (wetter). The six-hour historical record was considered sufficient for predictive modelling purposes based on previous reported experiments (Abrahart & Kneale, 1997). It also tallies with the empirical rule that at least five or six points should be used to define the rising limb of a finite-period unit hydrograph, which dates back at least to F.F. Snyder in the late 1930s (Johnstone & Cross, 1949), and is promulgated in the UK Flood Studies Report (NERC, 1975). Given the circular nature of CLOCK these particular values were next transformed into their sine and cosine equivalents, with the final pattern files containing fourteen input variables and one output variable (Figures 3 and 4).
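The moving-window pattern format can be illustrated with a short sketch. The original pattern files were prepared with spreadsheet tools, so the function below is an assumption-laden illustration: the name `build_patterns`, the column order, and the default of 8760 hours per year (a leap year such as 1984 has 8784) are all choices made here, not details taken from the study.

```python
import math

def build_patterns(flow, target="FLOW", lags=6, hours_per_year=8760):
    """Return (inputs, output) pairs from an hourly flow series.

    Each input vector holds sin(CLOCK), cos(CLOCK), FLOW at t-6..t-1 and
    DIFF at t-6..t-1 (14 values when lags=6). The column order is an
    assumption made for illustration.
    """
    patterns = []
    # DIFF at t-lags needs the flow at t-lags-1, so start at lags + 1.
    for t in range(lags + 1, len(flow)):
        # Map the circular hour-count onto the unit circle.
        angle = 2.0 * math.pi * (t % hours_per_year) / hours_per_year
        past_flow = [flow[t - k] for k in range(lags, 0, -1)]
        past_diff = [flow[t - k] - flow[t - k - 1] for k in range(lags, 0, -1)]
        inputs = [math.sin(angle), math.cos(angle)] + past_flow + past_diff
        output = flow[t] if target == "FLOW" else flow[t] - flow[t - 1]
        patterns.append((inputs, output))
    return patterns

# Toy series: ten hourly values rising by one unit per hour.
patterns = build_patterns([float(i) for i in range(10)])
print(len(patterns), len(patterns[0][0]))  # 3 14
```

The sine and cosine pair means that the last hour of December and the first hour of January sit next to each other in input space, which a raw hour-count would not achieve.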
All variables were subjected to linear normalisation between 0.1 (lowest value per variable per station) and 0.9 (highest value per variable per station). Three additional datasets were also created from each of the main files, comprising one new pattern file per station, for each of the individual years. This made a grand total of eight normalised pattern files for each gauging station, as shown in Table 1.
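A minimal sketch of the linear normalisation step follows. The function names are hypothetical; the inverse transform is included because network outputs must be mapped back to flow units before error statistics are computed.

```python
def normalise(values, low=0.1, high=0.9):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("a constant series cannot be rescaled")
    scale = (high - low) / (vmax - vmin)
    return [low + (v - vmin) * scale for v in values]

def denormalise(scaled, vmin, vmax, low=0.1, high=0.9):
    """Invert normalise() given the original minimum and maximum."""
    return [vmin + (s - low) * (vmax - vmin) / (high - low) for s in scaled]

print(normalise([2.0, 4.0, 6.0]))
```

Keeping the targets inside 0.1 to 0.9 avoids asking a sigmoid output node to reach its asymptotes at 0 and 1.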
Figure 3: Column headings for absolute multi-column data file
Figure 4: Column headings for difference multi-column data file
Table 1: The eight pattern files for each station 
SSE (sum squared error) is a global total that provides an overall estimate of modelling performance. No allowance is made for sample size, so a direct comparison of unequal samples will generate misleading information. It is the most common objective criterion used in fitting and testing models and is the error minimisation function used in the backpropagation training procedure. The result is weighted in favour of significant errors associated with high flow events.
Although high flow events will exert a strong influence on the SSE statistic, particular emphasis can also be placed on model fit at peak flows using a higher even power (Blackie & Eeles, 1985). Error values in this investigation were therefore raised to the fourth power and then summed (S4E).
MAE (mean absolute error) is a global average wherein all deviations from the original data, be these positive or negative, are treated on an equal basis. Variations in sample size are also accounted for and the statistic is not weighted towards high flow events, which tend to be amongst the poorest predictions.
RMSE is a common statistical measure. It is often used in neural network operations and in hydrological modelling to report on the fit of model output to hydrograph data. Sample size is taken into consideration and the result is adjusted to reduce the impact of significant errors associated with high flow events.
The coefficient of efficiency (COE) is a standard hydrological statistic based on moments about the mean (Nash & Sutcliffe, 1970; Diskin & Simon, 1977). The final value is multiplied by 100 to convert it to a percentage. In addition to its intrinsic value, this descriptive statistic facilitates a comparison for the Upper Wye with the TOPMODEL work of Quinn & Beven (1993), although their implementation was restricted to the nine snow-free months (April-December).
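The five global measures described above can be computed together, as in the sketch below. The function name and dictionary keys are assumptions made here; the coefficient of efficiency follows the usual Nash & Sutcliffe form, one minus the ratio of the error sum of squares to the sum of squared deviations of the observations about their mean, expressed as a percentage.

```python
import math

def global_measures(observed, predicted):
    """Return SSE, S4E, MAE, RMSE and COE (%) for paired flow series."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    mean_obs = sum(observed) / n
    sse = sum(e ** 2 for e in errors)    # sum squared error
    s4e = sum(e ** 4 for e in errors)    # sum of errors to the fourth power
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sse / n)
    # Spread of the observations about their own mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    coe = 100.0 * (1.0 - sse / spread)
    return {"SSE": sse, "S4E": s4e, "MAE": mae, "RMSE": rmse, "COE": coe}

# Toy example: three perfect predictions and one error of two units.
print(global_measures([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

In the toy example the single error of two units contributes 4 to SSE but 16 to S4E, which shows how the fourth power concentrates the statistic on the largest misses.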
Positive and negative errors underpin all assessment procedures, yet simple descriptive items such as the largest positive error (overprediction) and largest negative error (underprediction) are seldom reported. The range of prediction errors is nevertheless a useful indicator of troublesome matters that should not be ignored, although it is often the case that some concession is required to allow for those ‘extreme numbers’ that can sometimes arise from problematic data, exceptional circumstances, or true outliers. Good numerical representations can therefore be obtained using prediction errors in a percentage format, expressed in terms of observed (i.e. actual) river flow values, and reported in terms of grouped information. This method of reporting errors also facilitates a direct comparison between the different gauging stations which would otherwise be inappropriate for absolute numbers.
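One way to produce grouped percentage errors is sketched below. The four groups and the 5% band are assumptions chosen for illustration; the groupings actually used in the results tables may differ.

```python
def percentage_error_summary(observed, predicted, band=5.0):
    """Group prediction errors, expressed as a percentage of observed flow.

    Returns the share (%) of predictions that are exactly correct, within
    +/- band percent, overpredicted beyond the band, or underpredicted
    beyond the band. The grouping scheme is hypothetical.
    """
    counts = {"correct": 0, "within_band": 0, "over": 0, "under": 0}
    used = 0
    for o, p in zip(observed, predicted):
        if o == 0:
            continue  # percentage error is undefined for zero observed flow
        used += 1
        pct = 100.0 * (p - o) / o
        if pct == 0:
            counts["correct"] += 1
        elif abs(pct) <= band:
            counts["within_band"] += 1
        elif pct > 0:
            counts["over"] += 1
        else:
            counts["under"] += 1
    if used == 0:
        raise ValueError("no non-zero observations to assess")
    return {name: 100.0 * c / used for name, c in counts.items()}

print(percentage_error_summary([10.0, 10.0, 10.0, 10.0], [10.0, 10.4, 12.0, 7.0]))
```

Because every error is scaled by the observed flow, the resulting shares can be compared between a small upland station and a large lowland one.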
Global error statistics, which take all instances into account, will provide relevant albeit indiscriminate information on the overall modelling situation. Indeed, a significant problem with global evaluation measures is that this kind of statistical assessment does not provide specific information about model performance at high levels of flow, which in a flood forecasting context is of critical importance. Peak flow is also the item that is most difficult to predict and such instances will often contain the greatest amounts of error. Given these related problems, two additional storm-specific measures of evaluation were also used to assess this particular aspect of the modelling operation: the average difference in peak prediction over all flood events calculated using MAE_{pp} and RMSE_{pp}, and the percentage of early, late and correct occurrences in predicting individual peaks. Both measures, acting in concert, should thus provide a more accurate indication of model performance on storm events in a flood forecasting context.
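The two storm-specific measures can be sketched as follows, assuming that storm events have already been identified and that each is summarised by its observed and predicted peak magnitude and peak time. The tuple layout and function name are assumptions introduced here; how events were delimited in the study is not restated.

```python
import math

def peak_measures(events):
    """Summarise peak prediction over a list of storm events.

    Each event is (observed_peak, predicted_peak, observed_hour,
    predicted_hour). Returns MAE_pp, RMSE_pp and the percentage of
    early, late and correct peak timings.
    """
    n = len(events)
    diffs = [pred - obs for obs, pred, _, _ in events]
    mae_pp = sum(abs(d) for d in diffs) / n
    rmse_pp = math.sqrt(sum(d * d for d in diffs) / n)
    early = sum(1 for _, _, t_obs, t_pred in events if t_pred < t_obs)
    late = sum(1 for _, _, t_obs, t_pred in events if t_pred > t_obs)
    correct = n - early - late
    timing = {"early": 100.0 * early / n,
              "late": 100.0 * late / n,
              "correct": 100.0 * correct / n}
    return mae_pp, rmse_pp, timing

# Four hypothetical storm events.
events = [(100.0, 90.0, 10, 10), (50.0, 53.0, 20, 21),
          (80.0, 84.0, 30, 29), (60.0, 60.0, 40, 40)]
print(peak_measures(events))
```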
Two benchmark comparisons were produced from the following operations:
ARMA[p,q] (autoregressive moving average) models (Box & Jenkins, 1976) use a weighted linear combination of previous values and shocks which can be written as:
x_{t} = \sum_{i=1}^{p} \phi_{i} x_{t-i} + \sum_{i=1}^{q} \theta_{i} a_{t-i}
where x_{t} is the predicted value, the a_{i} are the shocks or residuals, and the \phi_{i} and \theta_{i} are the weights associated with each previous observation and shock respectively. The [p,q] notation refers to the number of autoregressive and moving average terms in the model. The Box-Jenkins method, which can be tested for statistical significance, provides a systematic iterative approach to determine the optimal number of terms and assigns varying weights until an optimal set of weights is discovered. Standard ARMA models were fitted to the data using software developed and supplied with Masters' textbook on time series prediction (Masters, 1995). For the first set of benchmarks, ARMA models were fitted to the central year (1985), with the other data sets being used to assess the solution. In the second set of benchmarks, ARMA models were developed on the full three-year data set (1984-6) and the relative performance of the three individual years compared. The data series from each of the stations exhibited non-stationarity; the mean had a clear trend that could be observed from plots of the autocorrelation function. A single adjacent point differencing operation was therefore required to render the time series stationary. The final models were ARMA[1,1] for Kilgram and the Upper Wye and ARMA[1,2] for Skelton.
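To show how an ARMA[p,q] model with single differencing yields one-step-ahead flow forecasts, a small sketch follows. It is not the Masters (1995) software: the function name is hypothetical, the weights passed in are placeholders rather than fitted values, initial shocks are assumed to be zero, and the moving-average weights are assumed to enter with a positive sign.

```python
def arma_one_step(series, phi, theta):
    """One-step-ahead forecasts from an ARMA[p,q] model on first differences.

    phi and theta are the autoregressive and moving-average weights
    (hypothetical values supplied by the caller, not fitted here).
    Returns forecasts for series[1:], each made from data up to the
    previous hour.
    """
    # A single adjacent-point differencing removes the trend in the mean.
    diffs = [series[i] - series[i - 1] for i in range(1, len(series))]
    p, q = len(phi), len(theta)
    shocks = [0.0] * len(diffs)  # assumption: shocks start at zero
    forecasts = []
    for t in range(len(diffs)):
        ar = sum(phi[i] * diffs[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[i] * shocks[t - 1 - i] for i in range(q) if t - 1 - i >= 0)
        predicted_change = ar + ma
        shocks[t] = diffs[t] - predicted_change  # residual once the value is known
        # Undo the differencing: last observed flow plus the predicted change.
        forecasts.append(series[t] + predicted_change)
    return forecasts

print(arma_one_step([1.0, 2.0, 4.0, 7.0], phi=[0.5], theta=[0.0]))  # [1.0, 2.5, 5.0]
```

When both weight lists are empty the predicted change is always zero, so the output collapses to the naive forecast of the last observed value.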
Naive predictions substitute the last known figure as the current prediction and represent a good bottom-line benchmark against which all other one-step-ahead predictions can be measured. Evaluation measures for the naive predictions were calculated for each individual year and on the full three-year database for the three gauging stations.
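The naive benchmark needs no fitting at all, as the following sketch shows; the function names are chosen here for illustration.

```python
def naive_forecast(flow):
    """Forecast for hour t is simply the observed flow at hour t-1."""
    return flow[:-1]

def naive_sse(flow):
    """Sum squared error of the naive forecast against flow[1:]."""
    predictions = naive_forecast(flow)
    observed = flow[1:]
    return sum((p - o) ** 2 for p, o in zip(predictions, observed))

print(naive_sse([1.0, 3.0, 6.0]))  # 13.0
```

During long low-flow periods, where consecutive hourly values are identical, this forecast is exactly right, which is worth bearing in mind when comparing percentage-correct figures.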
There is no general procedure to determine the optimum number of nodes or layers, although one or two ‘rules of thumb’ have been put forward. Trial and error, using different architectures and fixed stopping conditions, still appears to be the best approach, although random initialisations and training with random pattern selection could have an influence on model performance. Fischer & Gopal (1994) for example used a one-hidden-layer neural network with three inputs and one output to model inter-regional telecommunication flows in Austria. Trial runs were performed on ten alternative architectures containing 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 hidden units in the hidden layer. Each network was subjected to five random starts and training was stopped at 3000 epochs. The results were then assessed using two global evaluation measures: (i) average relative variance and (ii) coefficient of determination. Variation in the final output values due to initial conditions was quite small, which may or may not have been significant for the selection of an optimal model with the best generalisation capabilities.
We adopted a similar initial stance but with three major exceptions: exploration involved both one-hidden-layer and two-hidden-layer networks; just one random start was performed on each network (with between-network discrepancies being used to highlight training problems); and just one basic evaluation measure was computed (SSE). The Stuttgart Neural Network Simulator (SNNS Group, 1990-98) was used to construct twelve one-hidden-layer and twelve two-hidden-layer feedforward networks with 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66 and 72 hidden nodes. All networks had 14 input nodes and 1 output node. These networks were trained to predict first FLOW and then DIFF values using the 1985 pattern sets for each individual station. The stopping condition for each run was set at 800 epochs and trained networks were saved at 100 epoch intervals. All connections and unit biases were initialised with random weights set between plus and minus one. Training was undertaken using the inbuilt ‘enhanced backpropagation’ algorithm with ‘learning rate’ and ‘momentum’ parameters being set at the following ratios: 0.8/0.6 for 200 epochs, 0.6/0.4 for 200 epochs, 0.4/0.2 for 200 epochs, and 0.2/0.1 for 200 epochs. The ‘flat spot elimination’ and ‘maximum tolerated difference’ parameters were kept at zero. Weight updates were implemented after the presentation of each pattern, and random pattern selection was employed throughout all stages of the training programme. There were 144 runs in total, comprising 24 different networks, 3 different stations, and 2 different output variables. The appropriate annual data sets were then passed through the 576 saved networks, in non-training mode, and the 3 SSE values for each network computed. These results were tabulated and the average error computed over all three years for each network.
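The staged training programme and the size of the experimental design can be expressed compactly. The sketch below is independent of SNNS and makes no use of its interface; the function names and the 1-based epoch numbering are assumptions made here.

```python
from itertools import product

def training_schedule(epoch, total_epochs=800):
    """Return (learning_rate, momentum) for a 1-based epoch number.

    The run is divided into four equal stages with progressively
    smaller parameter values.
    """
    stages = [(0.8, 0.6), (0.6, 0.4), (0.4, 0.2), (0.2, 0.1)]
    stage_length = total_epochs // len(stages)
    index = min((epoch - 1) // stage_length, len(stages) - 1)
    return stages[index]

def hidden_node_counts():
    """Hidden node totals explored: 6, 12, ..., 72."""
    return list(range(6, 73, 6))

# Enumerate the design: 12 sizes x 2 depths x 3 stations x 2 outputs.
runs = list(product(hidden_node_counts(), (1, 2),
                    ("Upper Wye", "Kilgram", "Skelton"), ("FLOW", "DIFF")))
print(len(runs))                 # 144
print(training_schedule(201))    # (0.6, 0.4)
```

Starting with a large learning rate and momentum and then stepping both down lets early epochs make coarse moves across the error surface while later epochs settle more gently.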
We performed one random initialisation per network, but given that there were 24 networks per output variable, there should have been sufficient runs to overcome potential difficulties associated with the limited individual impact of a poor initialisation or a situation in which the network became trapped in a false minimum. The adoption of a high initial momentum factor was also intended to help counteract the worst of these effects. Our final results for the different architectures were all quite similar and no clear-cut trend was observed. The training programme also produced numerous instances of wild fluctuations. In several cases function approximation did not follow a straightforward path and the network did not converge in a smooth or uneventful manner. Some networks also got stuck at a high point on the error curve and experienced little change thereafter.
The main results from this initial analysis suggest that the use of additional hidden nodes had little or no real impact on the end result. The assumption that river flow modelling is a complex problem requiring a complex solution must therefore be abandoned; any simple network of modest size would appear to be sufficient. Moreover, no substantial difference could be found between the one-hidden-layer and two-hidden-layer networks, which supports this conclusion i.e. even with the extra weighted connections and enhanced architectural complexities there was no additional information to be extracted from the training data or further learning that could be performed. These findings are in accordance with other recent work in which the benefits of using a second hidden layer were considered marginal to the rainfall-runoff modelling problem (Minns & Hall, 1996). Moreover, this related work was also considered representative of numerous non-linear rainfall-runoff situations, which suggests that a one-hidden-layer network should be sufficient for most associated real-world applications. The same authors also found that just four hidden nodes provided good generalisation of various features found in their verification data set, which lends further support to our belief that current neural network methods of river flow prediction require a simple solution. It should also be stressed that a simple solution is not synonymous with a poor result, since minimal architectures can sometimes offer better generalised performance than more complex networks (Rumelhart et al., 1994).
Given no ‘definitive answer’ or ‘optimal architecture’, two representative one-hidden-layer networks were chosen for more extensive training. These networks comprised 14:6:1 and 14:12:1 configurations and, as before, both networks were trained to predict first FLOW and then DIFF values using the 1985 pattern sets for each individual station. The stopping condition for each run was increased to 4000 epochs. Trained networks were saved at 250 epoch intervals. All connections and unit biases were initialised with random weights set between plus and minus one. Training was undertaken using the inbuilt ‘enhanced backpropagation’ algorithm with the ‘learning rate’ and ‘momentum’ parameters being set at the following ratios: 0.8/0.6 for 1000 epochs, 0.6/0.4 for 1000 epochs, 0.4/0.2 for 1000 epochs, and 0.2/0.1 for 1000 epochs. All other training parameters were as before. Weight updates were implemented after the presentation of each pattern, and random pattern selection was employed throughout all stages of the training programme. There were 12 runs in total, comprising 2 different networks, 3 different stations, and 2 different output variables. The appropriate annual data sets were then passed through the last 6 saved networks from each of the 12 training runs, in non-training mode, and the 3 sum squared error values for each network computed.
In each case there was again little difference between the final SSE statistics and no clear trend from which to ascertain a winner. The error associated with the validation data sets showed minor fluctuations but there was no sign of a progressive trend: neither (i) a decrease, which would suggest that more training was required, nor (ii) an increase, which would be associated with overfitting. Given that there was little difference between the final SSE values for the 6 and 12 hidden node networks, these results also support our earlier conclusion that a global solution will in all likelihood be a simple one. The six trained 14:6:1 neural networks were therefore selected for full-scale analysis and testing. The four applicable data sets (1984; 1985; 1986; 1984-86) for each network were passed through each trained network, in non-training mode, and a full output file in each instance produced for statistical examination. Six further training runs were also instigated using the 14:6:1 architecture and 4000 epoch training programme. These networks were trained to predict first FLOW and then DIFF values using the full 1984-86 pattern sets for each individual station. The four applicable data sets (1984; 1985; 1986; 1984-86) for each network were then passed through each trained network, in non-training mode, and a full output file in each instance produced for statistical examination.
A complete set of numerical results is provided in Tables 2-12. These numbers represent the benchmarks against which later work can be compared. However, the main results are also summarised below to provide a brief description that is more intelligible than an extensive set of tabulated numbers.
For the first set of benchmarks, models were developed on data for the central year (1985) with the other two years (1984 & 1986) being used for model validation purposes. SSE, S4E, MAE and RMSE all show a clear trend: the Upper Wye has the highest levels of error, followed by Kilgram, and then Skelton. COE showed a similar trend, but in reverse, which is to be expected since this particular evaluation measure is assessing a positive as opposed to a negative attribute. In more detail, SSE, S4E, MAE and RMSE are dominated by a single high error in the neural network FLOW predictions for Wye’86, which is associated with fifteen or so individual large underpredictions. Global measures for the other neural network and ARMA model predictions at each station are all quite similar with just minor fluctuations in ranking. Naive predictions tended to produce the highest relative SSE, S4E, RMSE and MAE statistics. COE values for each station are all quite high and at a similar level, although efficiencies associated with the neural network DIFF predictions are in each case a little less than the other three. The degree of disagreement between the neural network DIFF predictions and the other evaluations was greatest for the Upper Wye, less substantial for Kilgram, and slight for Skelton.
The majority of all predictions are within 5% of the observed value, although the percentage of correct predictions is in most cases less than 1% for the neural networks and for most of the ARMA models, the exception being the Upper Wye ARMA model, where the percentage of correct predictions is much higher. Naive predictions in all cases had the highest percentage of correct predictions. This is thought to reflect the large number of low flow situations, in which there is no change over time, and where a naive prediction just happens to provide the best answer. The Upper Wye had the highest number of FLOW underpredictions, followed by Kilgram, and then Skelton. This pattern is not preserved in the DIFF underpredictions, where the Upper Wye had the lowest percentage, followed by Skelton, and then Kilgram. The ARMA modelling, which showed no clear relationship to either pattern, was sometimes better and sometimes worse than its neural network counterparts.
For the second set of benchmarks, the entire three-year data set was modelled and the relative performance of each individual year (1984; 1985; 1986) against the global three-year model (1984-86) assessed. Results from the three-year model were in most cases similar to those produced using the one-year model but with one main difference. There was no exceptional error associated with the neural network FLOW predictions for Wye’86, which can doubtless be attributed to the fact that there was no longer any unseen data, or unknown circumstances, for the network to struggle with.
For the first set of benchmarks, MAE and RMSE values for peak prediction show a matching trend: the Upper Wye has the highest levels of average error, followed by Kilgram, and then Skelton. Kilgram had the highest number of storm events over the three-year period with 75, followed by the Upper Wye with 69 and Skelton with 58, although the Upper Wye had the highest concentration of events in 1986. There was, however, for both MAE and RMSE a more pronounced ‘division’ between the Upper Wye and Kilgram/Skelton than in previous assessments. The Upper Wye had high errors while Kilgram and Skelton had much lower errors. The highest MAE was associated with the Wye’84 ARMA model predictions but the highest RMSE was associated with the neural network FLOW predictions for Wye’86. Naive predictions often produced the highest relative MAE statistic but did not always have the worst RMSE value at each station. Predictions for the Upper Wye and Skelton tended to be on time, or late, while those for Kilgram were generally on time.
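The event-specific measures can be sketched in the same way. The rules used to delimit storm events are not given in this section, so the sketch below simply assumes that each event has already been cut out as a pair of aligned observed and predicted windows. The peak error is taken as the difference between the largest predicted and the largest observed value in the window, and the timing error as the difference in time steps between the two peaks; both are illustrative assumptions, as are the function names and the toy values.

```python
import numpy as np

def peak_errors(observed_event, predicted_event):
    """Peak magnitude and timing error for one storm event window."""
    obs = np.asarray(observed_event, dtype=float)
    pred = np.asarray(predicted_event, dtype=float)
    obs_step = int(np.argmax(obs))
    pred_step = int(np.argmax(pred))
    return {
        "peak_error": float(pred[pred_step] - obs[obs_step]),
        "timing_error": pred_step - obs_step,  # > 0 late, 0 on time, < 0 early
    }

def summarise_peaks(events):
    """MAE and RMSE of peak prediction, plus counts of early/on-time/late peaks."""
    results = [peak_errors(obs, pred) for obs, pred in events]
    peak = np.array([r["peak_error"] for r in results])
    timing = np.array([r["timing_error"] for r in results])
    return {
        "MAE": float(np.mean(np.abs(peak))),
        "RMSE": float(np.sqrt(np.mean(peak ** 2))),
        "early": int(np.sum(timing < 0)),
        "on_time": int(np.sum(timing == 0)),
        "late": int(np.sum(timing > 0)),
    }

# Two toy events (hypothetical values): one late peak, one on-time peak.
events = [
    ([1.0, 3.0, 6.0, 4.0], [1.0, 2.0, 4.0, 5.0]),
    ([2.0, 5.0, 3.0], [2.0, 4.0, 3.0]),
]
summary = summarise_peaks(events)
print(summary)
```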
For the second set of benchmarks, the results from the three-year model were again in most cases similar to those produced using the one-year model and there were no major differences.
[Benchmark results tables. Rows in each table: NN1985, NNALL, NN1985 DIF, NNALL DIF, ARMA1985, ARMAALL, Naïve]

Up to this point we have been comparing global neural network solutions with ARMA models and naive predictions all trained and validated on the same datasets. Neural network solutions, in general terms, were observed to:
We have thus demonstrated that simple neural network solutions can be used for one-step-ahead predictions. These tools are able to generate reasonable results, in line with existing statistical time series predictors, which also now become part of the formal benchmark.
Similar neural network and ARMA model results are not at all surprising, given that both forecasting tools were trained on much the same information, so it is now appropriate to consider ways in which improvements could be achieved. There are two possibilities, although in combination these two options would in all likelihood produce even better results. The first method involves adding more inputs to the neural network, such as river level data from upstream stations, or other relevant hydrological and climatological information, which might allow the network to learn the training patterns better, and perhaps also bring about greater generalisation capabilities. However, the individual benefits of each inclusion would need to be assessed in a stringent manner, and the rather unique nature of such data makes this an unsuitable route for formal benchmarks to follow. The second method, which avoids all such problems, would be to cluster the original flow data into different event types or hydrograph behaviours (where an event is taken to mean a short section of the river flow record). Moreover, commensurate with the idea of creating a neural network benchmark, a self-organising map (SOM) could be used for this purpose (Kohonen, 1995). A SOM is another type of neural network, in which the neurons compete against each other to discover the inherent ‘organisation’ that exists within the data; such networks are often used to perform complex classifications. This reductionist approach has already been shown to improve the overall performance of neural network forecasts at Skelton using simple split-level partitioning (See et al., 1997).
The SOM approach was therefore adopted to provide some useful information about the degree of improved performance that could perhaps be obtained using a data-splitting, multi-modelling approach. This work forms part of the benchmark in as much as it provides a pointer to how improved modelling could perhaps be done. The computational software for these experiments was obtained from the SOM Internet ftp site (NNRC, 1998). 2x2, 4x4, 6x6 and 8x8 SOMs were examined using different input data. The original 14 variables listed in Figure 3 produced final clusters differentiated according to season and not on differing level behaviour. Adjacent differences in river flow levels also added little to the clustering process; hence the final input data chosen for the classification exercise were FLOWs t-1 to t-6. The best results were produced using an 8x8 SOM (64 clusters). This gave reasonable differentiation between event behaviours at high levels of flow. It also produced a large number of similar events at low levels of flow. Figure 5 shows an example of the different event type behaviours over the 6 hour time period for Kilgram using data for all three years. To facilitate this presentation, all profiles were forced through the origin, and all clusters with near-identical behaviour were omitted from the plot. The three main types of hydrograph event can be seen in this diagram, comprising flat, rising and falling behaviours, each of which can be further partitioned into low, medium and high flow situations.
Figure 5: SOM classification of different event types at Kilgram 
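The classification step can be sketched as follows. This is not the software used in the experiments (which was obtained from the NNRC site); it is a minimal online SOM written with NumPy, assuming a Gaussian neighbourhood, linearly decaying learning rate and radius, and a synthetic flow series in place of the station records. The function names, parameter values and training schedule are all hypothetical choices made for illustration.

```python
import numpy as np

def lagged_inputs(flow, n_lags=6):
    """One row per time step t >= n_lags: [FLOW t-1, FLOW t-2, ..., FLOW t-n_lags]."""
    flow = np.asarray(flow, dtype=float)
    return np.column_stack(
        [flow[n_lags - k: len(flow) - k] for k in range(1, n_lags + 1)]
    )

def train_som(data, rows=8, cols=8, n_iter=2000, lr0=0.5, seed=0):
    """Online SOM training; returns one weight vector per map unit."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Grid position of every unit, used to measure distance on the map itself.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the units from randomly chosen input patterns.
    weights = data[rng.integers(0, len(data), rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for i in range(n_iter):
        frac = i / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        x = data[rng.integers(0, len(data))]
        # Competition: the best matching unit is the one closest to the input.
        bmu = int(np.argmin(np.sum((weights - x) ** 2, axis=1)))
        # Cooperation: units near the winner on the grid are moved as well.
        grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_clusters(data, weights):
    """Label each input pattern with the index of its best matching unit."""
    data = np.asarray(data, dtype=float)
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

# Synthetic stand-in for an hourly flow record (not real station data).
rng = np.random.default_rng(1)
flow = np.abs(np.cumsum(rng.normal(size=500))) + 1.0
patterns = lagged_inputs(flow)           # FLOW t-1 ... t-6
weights = train_som(patterns)            # 8x8 map, 64 clusters
labels = assign_clusters(patterns, weights)
print(patterns.shape, weights.shape, np.bincount(labels, minlength=64).max())
```

Each row of `weights` plays the role of one profile of the kind plotted in Figure 5, and `labels` gives the cluster membership needed to extract a subset of cases for separate modelling.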
In order to give a brief illustration of this data-splitting technique, and its potential application, an 8x8 SOM was produced for each station for classification purposes. To ensure a sufficient number of cases in each cluster for subsequent training with a neural network, clusters were created from the complete three-year data set. By examining plots, like that shown in Figure 5, the two most prevalent rising events at each station were identified for modelling purposes. Table 13 lists the total number of cases in each of the rising clusters. The subset of data corresponding to each of these cluster types was used to train 6 neural network models (2 cluster types for 3 stations) with network architectures and training parameters identical to those used in the earlier benchmarking exercise. MAE, RMSE and COE statistics were then calculated for the network output. Corresponding ARMA model and naïve predictions were also extracted for the relevant subsets and assessed. Numerical results for this comparison are listed in Tables 14 to 19.
Table 13: The number of cases in each rising cluster 
There is a clear improvement in neural network performance on these rising events over and above that produced by the ARMA models. By enabling the network to concentrate on a small, well-defined task rather than the entire global spectrum of hydrograph behaviours, the network was able to better approximate the rising limb of the hydrograph given the same set of input data. Moreover, ARMA models are at a disadvantage when considering improvements of this nature, because these tools require a continuous time series and therefore cannot benefit from data disaggregation.
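The subset comparison described above, in which predictions from each model are extracted for the cases in one rising cluster and scored separately, can be sketched as follows. The sketch assumes that cluster labels and aligned one-step-ahead predictions are already available; the arrays, model names and cluster numbers are hypothetical, and COE is computed against the mean of the observations within the subset, which is an assumption.

```python
import numpy as np

def subset_scores(observed, predictions, labels, cluster_id):
    """MAE, RMSE and COE for each model, restricted to one cluster of cases."""
    obs = np.asarray(observed, dtype=float)
    mask = np.asarray(labels) == cluster_id
    obs_sub = obs[mask]
    spread = float(np.sum((obs_sub - obs_sub.mean()) ** 2))
    scores = {}
    for name, pred in predictions.items():
        err = np.asarray(pred, dtype=float)[mask] - obs_sub
        scores[name] = {
            "n": int(mask.sum()),
            "MAE": float(np.mean(np.abs(err))),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
            "COE": 1.0 - float(np.sum(err ** 2)) / spread,
        }
    return scores

# Toy example (hypothetical values): cluster 1 stands for a rising event type.
observed = [1.0, 2.0, 4.0, 7.0, 6.0, 5.0]
labels = [0, 1, 1, 1, 2, 2]
predictions = {
    "Naive": [1.0, 1.0, 2.0, 4.0, 7.0, 6.0],     # previous value carried forward
    "Cluster model": [1.0, 2.5, 4.5, 6.5, 6.0, 5.0],
}
rising = subset_scores(observed, predictions, labels, cluster_id=1)
for name, stats in rising.items():
    print(name, stats)
```

Scoring every model on the same subset keeps the comparison fair: the ARMA and naive predictions are still generated from the continuous series and only the evaluation is restricted to the cluster.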
[Tables 14 to 19: results for the rising clusters. Rows in each table: NNALL, NNALL DIF, ARMAALL, Naïve]

The main problem associated with generating neural network river flow predictions is the dominant frequency of low flow events. High flow events, together with the rapid change that occurs between low flow and high flow situations, will always have an inadequate representation. Neural networks and ARMA models will thus in most cases be attempting to fit a universal solution to what is in effect a ‘poor data series’ with inevitable limitations in modelling output. Global statistical assessment will thus be focused on low flow events. The interaction between these two opposing factors, in the form of alternative evaluation measures, must therefore be addressed before further progress can be made.
The most influential factors in this analysis appear to arise from modelling problems relating to the appropriate identification of a clear deterministic function. Better results will also be produced when function approximation is performed on one or more smooth surfaces, which means that the identification of such items for incorporation within the modelling process will also need to be part of a holistic solution.
Differences were observed between the three stations across all evaluation measures. The Upper Wye has the highest levels of error, Kilgram has less error than the Upper Wye, and Skelton has less error than Kilgram. Further investigation is required to explain this phenomenon, with detailed consideration of such items as catchment characteristics, upstream area and response time.
During the course of this exercise two important additional questions have emerged:
With regard to evaluation measures, there are still no acceptable testing procedures for neural network solutions, although we are now getting improved levels of modelling performance that require the next generation of more powerful evaluation tools. The event-specific evaluation measures employed here attempted to characterise the capabilities of the models to predict high flow events. However, as hydrological modelling advances, the assessment tools will also need to move beyond the provision of simple descriptive statistics, accommodate more authoritative measures and facilitate multiple output values.
Two suggested improvements were discussed and a limited demonstration of multi-network modelling has been provided. More input variables should provide additional predictive power for the neural network modelling process, but this hypothesis still requires detailed testing and analysis. What factors to include, and the manner in which such items should be included, is a problematic issue in its own right. Comprehensive testing against physical models is likewise needed.
The authors would like to hear from others who are interested in using neural networks for river flow prediction. All datasets can be made available for collaborative ventures.
Abrahart, R.J. 1998. ‘Neural networks and the problem of accumulated error: an embedded solution that offers new opportunities for modelling and testing’. Proceedings 3rd International Conference on Hydroinformatics, Copenhagen, Denmark, 24-26 August 1998.
Abrahart, R.J. and Kneale, P.E. 1997. ‘Exploring Neural Network Rainfall-Runoff Modelling’. Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.35-9.44.
Bathurst, J. 1986. ‘Sensitivity analysis of the Systeme Hydrologique Europeen for an upland catchment’, Journal of Hydrology, 87, 103-123.
Blackie, J.R. and Eeles, W.O. 1985. ‘Lumped catchment models’. Chapter 11 in: Anderson, M.G. and Burt, T.P. eds. 1985. Hydrological Forecasting. Chichester: John Wiley & Sons Ltd.
Box, G.E.P. and Jenkins, G.M. 1976. Time Series Analysis: Forecasting and Control. Oakland (CA): Holden-Day.
Fischer, M.M. and Gopal, S. 1994. ‘Artificial neural networks: a new approach to modelling interregional telecommunication flows’, Journal of Regional Science, 34, 503-527.
French, M.N., Krajewski, W.F. and Cuykendall, R.R. 1992. ‘Rainfall forecasting in space and time using a neural network’, Journal of Hydrology, 137, 1-31.
Hornik, K., Stinchcombe, M. and White, H. 1989. ‘Multilayer feedforward networks are universal approximators’, Neural Networks, 2, 359-366.
Hsu, K.-L., Gupta, H.V. and Sorooshian, S. 1995. ‘Artificial neural network modeling of the rainfall-runoff process’, Water Resources Research, 31, 2517-2530.
Johnstone, D. and Cross, W.P. 1949. Elements of Applied Hydrology. New York: Ronald. Cited in: Minns, A.W. and Hall, M.J. 1997. ‘Living with the ultimate black box: more on artificial neural networks’, Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.45-9.49.
Knapp, B.J. 1970. Patterns of water movement on a steep upland hillside, Plynlimon, central Wales, PhD Thesis, Department of Geography, University of Reading, Reading.
Karunanithi, N., Grenney, W.J., Whitley, D. and Bovee, K. 1994. ‘Neural Networks for River Flow Prediction’, Journal of Computing in Civil Engineering, 8, 201-220.
Kohonen, T. 1995. Self-Organizing Maps. Heidelberg: Springer-Verlag.
Lorrai, M. and Sechi, G.M. 1995. ‘Neural nets for modelling rainfall-runoff transformations’, Water Resources Management, 9, 299-313.
Masters, T. 1995. Neural, Novel & Hybrid Algorithms for Time Series Prediction. New York: John Wiley & Sons.
Minns, A.W. and Hall, M.J. 1996. ‘Artificial neural networks as rainfall-runoff models’, Hydrological Sciences Journal, 41, 399-417.
Minns, A.W. and Hall, M.J. 1997. ‘Living with the ultimate black box: more on artificial neural networks’, Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.45-9.49.
NERC (Natural Environment Research Council). 1975. Flood Studies Report, Vols 1-5. London: Natural Environment Research Council. Cited in: Minns, A.W. and Hall, M.J. 1997. ‘Living with the ultimate black box: more on artificial neural networks’, Proceedings Sixth National Hydrology Symposium, University of Salford, 15-18 September 1997, 9.45-9.49.
Newson, M.D. 1976. The physiography, deposits and vegetation of the Plynlimon catchments, Institute of Hydrology, Wallingford, Oxon., Report No. 30.
NNRC, 1998. Neural Network Research Centre. http://www.cis.hut.fi/nnrc/nnrcprograms.html
Openshaw, S. and Openshaw, C. 1997. Artificial Intelligence in Geography. Chichester: John Wiley & Sons Ltd.
Quinn, P.F. and Beven, K.J. 1993. ‘Spatial and temporal predictions of soil moisture dynamics, runoff, variable source areas and evapotranspiration for Plynlimon, Mid-Wales’, Hydrological Processes, 7, 425-448.
Raman, H. and Sunilkumar, N. 1995. ‘Multivariate modelling of water resources time series using artificial neural networks’, Hydrological Sciences Journal, 40, 145-163.
Rizzo, D.M. and Dougherty, D.E. 1994. ‘Characterization of aquifer properties using artificial neural networks: Neural kriging’, Water Resources Research, 30, 483-497.
Rogers, L.L. and Dowla, F.U. 1994. ‘Optimization of groundwater remediation using artificial neural networks with parallel solute transport modeling’, Water Resources Research, 30, 457-481.
Rumelhart, D.E., Hinton, G.E. and Williams, R.J. 1986. ‘Learning internal representations by error propagation’. In: Rumelhart, D.E. and McClelland, J.L. eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, 318-362. Cambridge (MA): MIT Press.
Rumelhart, D.E., Widrow, B. and Lehr, M.A. 1994. ‘The basic ideas in neural networks’, Communications of the ACM, 37, 3, 87-92.
Schaap, M.G. and Bouten, W. 1996. ‘Modelling water retention curves of sandy soils using neural networks’, Water Resources Research, 32, 3033-3040.
See, L., Corne, S., Dougherty, M. and Openshaw, S. 1997. ‘Some initial experiments with neural network models of flood forecasting on the River Ouse’, GeoComputation’97: Proceedings 2nd International Conference on GeoComputation, University of Otago, Dunedin, New Zealand, 26-29 August 1997.
Solomatine, D.P. and Avila Torres, L.A. 1996. ‘Neural network approximation of a hydrodynamic model in optimizing reservoir operation’. In: Müller, A., ed., Hydroinformatics’96: Proceedings 2nd International Conference on Hydroinformatics, Zurich, Switzerland, 9-13 September 1996, Vol. 1, 201-206. Rotterdam: A.A. Balkema.
Smith, J. and Eli, R.N. 1995. ‘Neural-Network Models of Rainfall-Runoff Process’, Journal of Water Resources Planning and Management, 121, 499-509.
SNNS Group. 1990-98. Stuttgart Neural Network Simulator. http://wwwra.informatik.unituebingen.de/SNNS/
Tveter, D. 1996-8. Backpropagator’s Review. http://www.mcs.com/~drt/bprefs.html
Van den Boogaard, H.F.P. and Kruisbrink, A.C.H. 1996. ‘Hybrid modelling by integrating neural networks and numerical models’. In: Müller, A., ed., Hydroinformatics’96: Proceedings 2nd International Conference on Hydroinformatics, Zurich, Switzerland, 9-13 September 1996, Vol. 2, 471-477. Rotterdam: A.A. Balkema.