Building New Spatial Models: A Supercomputing Approach

Gary Diplock
Centre for Computational Geography, School of Geography, University of Leeds, Leeds LS2 9JT

Abstract
The majority of spatial models used in geography are "old". Most first appeared over 25 years ago and have merely been embellished in various ways since. One reason is that they work well enough. Another is that it has traditionally been easier to "borrow" and extend existing models than to develop new ones from scratch. Perhaps, also, model building was simply not fashionable. A further reason is that model design in a geographical context has never been easy: the mathematical and statistical aspects are complex, relevant theoretical bases are at best weak and often missing, and until fairly recently there were hardly any data on which to calibrate and evaluate models. The GIS era has breathed fresh life into this area by providing an increasingly rich spatial data environment within which to build models, and also a growing end-user need for the intelligent use of GIS by modelling, rather than merely describing, spatial behaviour and relationships.

The question now is how to ease the task of model building. There are today various possibilities. Artificial neural networks provide one approach: they are highly non-linear and model-free, in that the form of the model is learnt from data examples rather than imposed by the model builder. Neural networks constitute a powerful non-linear modelling technology capable of handling complex relationships and noisy data. The principal difficulty is that they are essentially a grey- or black-box technology: they cannot tell you what the model is, only that a model exists, and there is little opportunity to input knowledge into the process. More flexible is fuzzy logic systems modelling. This is also equation-free but works best on problems with far fewer variables, where data are scarce and human knowledge is an important ingredient. A third approach is that of model breeding, using either Genetic Algorithms or Genetic Programming; see Openshaw (1988). Genetic Programming (GP) is essentially an attempt to create new models of the traditional mathematical type by "breeding" well-performing equations that provide a good description of one or more sets of data.

The underlying idea is very simple. Mathematical and statistical modelling can be regarded as a search for equations that provide a "good" fit either to data or to the underlying theoretical assumptions that are assumed to be relevant. However, the search process implicit in traditional methods is somewhat inefficient. Mathematical model-building methods such as entropy maximising attempt to identify equations that are the "most likely" solution to a theoretical modelling problem given certain information, in a data-independent way. By contrast, in interactive statistical modelling the user tries out alternative equations by assessing their performance on a data set, and theory is much less important. Both approaches are flawed. The former is highly dependent on the theoretical formulation of the macro- and meso-state constraints and is data independent; whether it works well or not at all has to be determined later via a model evaluation exercise, and it offers no mechanism for explicitly optimising model fit. The latter, statistical, approach contains little or no theory, is heavily assumption dependent, and whether it works well or not at all depends strongly on the skills of the human operator. In both cases, moreover, only a minuscule fraction of the universe of all possible model equations that could be built from the available variables, functions, rules of syntax, etc. will have been searched. Maybe they find the global "best" result; maybe they don't! Until recently there was no way of knowing, or of doing much better. Genetic Programming changes this in a potentially highly significant way by offering a practical means of searching, in an intelligent manner, the universe of all possible models for good model forms.
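To get a feel for the scale of that universe, consider models formed as binary expression trees. A tree with n operator nodes has n + 1 leaves, so with v candidate variables and b binary operators there are Catalan(n) * b^n * v^(n+1) distinct equations of that size. The short Python sketch below (illustrative only; the figures are not from the paper) shows how quickly this grows:

    from math import comb

    def catalan(n):
        # Number of distinct binary tree shapes with n internal nodes.
        return comb(2 * n, n) // (n + 1)

    def universe_size(n_ops, n_vars, n_binops):
        # Binary expression trees with n_ops operator nodes have n_ops + 1
        # leaves; multiply the tree shapes by operator and leaf labellings.
        return catalan(n_ops) * n_binops ** n_ops * n_vars ** (n_ops + 1)

    # Illustrative: 10 candidate variables, 4 binary operators.
    for n in range(1, 11):
        print(n, universe_size(n, 10, 4))
    # At n = 10 the count is already about 1.8e21, before functions such as
    # log or exp are even admitted, so exhaustive search is hopeless.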

A genetic programming approach views model building as a search process. You define the available model pieces (variables, functions, binary operators, rules of syntax). The GP starts with a random selection of equations built from these pieces. It then uses the principles of natural evolution to breed new generations of equations, such that the better-performing equations have a greater probability of having offspring. It is similar to a genetic algorithm except that it deals with the equations (i.e. models) directly, as a set of symbols, rather than coding them as fixed-length bit strings. This imparts a much greater degree of flexibility to the model design process. In essence, GP is an attempt to breed a computer program (here representing a spatial model) that best fits one or more data sets. Since its development by Koza (1992) it has attracted considerable attention. In a GIS context it offers another path to building spatial models able to thrive in spatial data rich environments. The principal impediments to its wider use are: the need for high performance computing, uncertainty about the benefits, and concern about the robustness and meaning of the resulting models.
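As a minimal sketch of this loop (in Python, not the paper's Fortran 77, and with an illustrative function set, tree depth, and sum-of-squares fitness measure), equations can be held as expression trees and bred directly as symbols:

    import math
    import random

    OPS = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b if abs(b) > 1e-9 else 1.0}  # protected division
    VARS = ['x1', 'x2']  # illustrative terminal set

    def random_tree(depth=3):
        # Grow a random equation: tuples are operator nodes, strings are
        # variables, bare numbers are constants.
        if depth == 0 or random.random() < 0.3:
            return random.choice(VARS) if random.random() < 0.7 \
                else round(random.uniform(-5, 5), 2)
        op = random.choice(list(OPS))
        return (op, random_tree(depth - 1), random_tree(depth - 1))

    def evaluate(tree, env):
        if isinstance(tree, tuple):
            op, left, right = tree
            return OPS[op](evaluate(left, env), evaluate(right, env))
        return env[tree] if isinstance(tree, str) else tree

    def fitness(tree, data):
        # Sum of squared errors; arithmetic failures score as worst possible.
        try:
            sse = sum((evaluate(tree, env) - y) ** 2 for env, y in data)
            return sse if math.isfinite(sse) else float('inf')
        except (OverflowError, ValueError, RecursionError):
            return float('inf')

    def nodes(tree, path=()):
        # Enumerate every subtree together with the path that leads to it.
        yield path, tree
        if isinstance(tree, tuple):
            for i, child in enumerate(tree[1:], 1):
                yield from nodes(child, path + (i,))

    def replace(tree, path, new):
        if not path:
            return new
        parts = list(tree)
        parts[path[0]] = replace(parts[path[0]], path[1:], new)
        return tuple(parts)

    def crossover(a, b):
        # Swap a random subtree of parent a for a random subtree of parent b.
        path, _ = random.choice(list(nodes(a)))
        _, subtree = random.choice(list(nodes(b)))
        return replace(a, path, subtree)

    def mutate(tree):
        path, _ = random.choice(list(nodes(tree)))
        return replace(tree, path, random_tree(depth=2))

    def evolve(data, pop_size=200, generations=50):
        pop = [random_tree() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda t: fitness(t, data))
            parents = pop[:pop_size // 2]  # better equations breed (truncation selection)
            pop = parents + [mutate(crossover(random.choice(parents),
                                              random.choice(parents)))
                             for _ in range(pop_size - len(parents))]
        return min(pop, key=lambda t: fitness(t, data))

    # Toy target y = x1 * x2 + 3 on a small grid of training points.
    data = [({'x1': a, 'x2': b}, a * b + 3.0) for a in range(1, 5) for b in range(1, 5)]
    best = evolve(data)
    print(best, fitness(best, data))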

The strengths and limitations of GP are examined via two case studies: a simple spatial regression model and a more complicated spatial interaction model. The paper examines the issues in parallelising the GP approach using Fortran 77 on the 512-processor Cray T3D parallel supercomputer. Efficient use of this highly parallel hardware requires that the serial code be re-written using a standard message-passing library (MPI) and that the GP be re-cast in a form analogous to an asynchronous genetic algorithm, to allow a task-farm form of parallelisation. The methods used to achieve this are described in the full text of the paper. Another problem with classical GP is that any parameters are treated as part of the genetic optimisation process, which also has to find suitable equations. This is inefficient, since a good equation with the "wrong" parameters may perform far worse than a poor equation with "good" parameter values. To overcome this difficulty an alternative approach is used here. The GP is restricted to creating model equations that may contain one or more unknown parameters; values for these parameters are then estimated using an embedded genetic algorithm based on Schwefel (1995). A conventional non-linear optimiser cannot be used because of the high risk of arithmetic problems, given the arbitrary nature of the model forms being created. An evolutionary parameter estimator was therefore developed that is robust to NaNs and other arithmetic problems and is also able to find good values for non-convex and discontinuous functions. The cost is another 10- to 100-fold increase in compute times.
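The task-farm structure can be sketched as follows, here in Python with mpi4py rather than the Fortran 77/MPI of the actual implementation. evaluate() and breed() are placeholders for the real fitness computation (which, as described above, embeds a genetic-algorithm parameter search with NaNs penalised rather than fatal) and the real symbolic breeding operators:

    from mpi4py import MPI
    import random

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

    def evaluate(model):
        # Placeholder for the expensive step: calibrate the candidate equation
        # and return its goodness of fit. In the paper this embeds a GA-based
        # parameter search, with NaNs scored as worst-possible fitness.
        return random.random()

    def breed(fitnesses):
        # Placeholder for selection/crossover/mutation: favour the better half.
        ranked = sorted(fitnesses, key=fitnesses.get)
        return random.choice(ranked[:max(2, len(ranked) // 2)]) + "'"

    if rank == 0:
        # Master: keep every worker busy; as each fitness arrives, breed a
        # replacement, so the population evolves asynchronously (steady state)
        # rather than in lock-stepped generations.
        pending = ['model_%d' % i for i in range(100)]  # initial random population
        fitnesses, evaluations, budget = {}, 0, 1000
        status = MPI.Status()
        for worker in range(1, size):
            comm.send(pending.pop(), dest=worker, tag=TAG_WORK)
        while evaluations < budget:
            model, fit = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT,
                                   status=status)
            fitnesses[model] = fit
            evaluations += 1
            if not pending:
                pending.append(breed(fitnesses))
            comm.send(pending.pop(), dest=status.Get_source(), tag=TAG_WORK)
        for worker in range(1, size):  # drain in-flight results, then stop
            comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT)
            comm.send(None, dest=worker, tag=TAG_STOP)
    else:
        # Worker: evaluate whatever the master sends until told to stop.
        status = MPI.Status()
        while True:
            model = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
            if status.Get_tag() == TAG_STOP:
                break
            comm.send((model, evaluate(model)), dest=0, tag=TAG_RESULT)

    # Run with at least two processes, e.g.: mpirun -n 8 python gp_farm.py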

Both case studies are run on the Cray T3D parallel supercomputer at Edinburgh and the results are compared against more traditional models run as benchmarks. The conclusion is that there is an opportunity here to create new generations of high-performing models relevant to GIS that are less of a black box than neural networks and can handle higher degrees of variable complexity than fuzzy logic modelling. The principal remaining problems relate to questions of model validation, the conversion of models into theories, and how best to simplify the mathematical expressions that are produced.

References
Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press.

Openshaw, S., 1988. "Building an automated modelling system to explore a universe of spatial interaction models", Geographical Analysis, 20, 31-46.

Schwefel, H.P., 1995. Evolution and Optimum Seeking. New York: Wiley.