Random Aggregations for Use in Studying the Modifiable Areal Unit Problem

Alistair Geddes, Mick Green and Robin Flowerdew
North West Regional Research Laboratory, Lancaster University, United Kingdom

In many social science applications of spatial analysis, geographical data are available only as aggregated measurements for the households or individuals living within a set of arbitrarily defined areal units, commonly census zones. Unfortunately, this means that the results of analysing these area aggregations may depend on the particular configuration of zones used. This effect is known as the modifiable areal unit problem (MAUP) and has two related aspects. First, statistical analysis may give different results for zones of different sizes, even if the geographical area covered is exactly the same; this is the scale effect. Second, results may also differ between different ways of aggregating exactly the same data to the same scale; this may be called the zonation effect.

Although these effects were first identified several decades ago, our knowledge about the types of MAUP effect encountered under various conditions is still quite limited. Work at Lancaster has explored the effects of the MAUP on regression models relating two or more variables. The primary approach has centred on investigating how the MAUP may be generated in the presence of spatial cross-correlation, that is, when the value of the response variable Y at one location is related not only to the value of the explanatory variable X at the same location, but also to X values at neighbouring locations. Findings have suggested that the MAUP effects that such spatial cross-correlation produces in regression analyses may be modelled by including two coefficients for each X in the regression equation: one representing a local effect, the other a regional effect. If an analysis includes only the local effect, it will be mis-specified. On the other hand, if the zones used in the analysis are large enough or shaped appropriately, they may capture the regional effect as well, and the model will be accurate. Unfortunately, the appropriate size and shape of the region over which the regional effect operates is unknown and must be estimated for the given zoning arrangement.
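As an illustration only, a regression with one explanatory variable and both effects might be written as follows. The notation is assumed here rather than taken from the Lancaster work: R(i) denotes a hypothetical set of locations making up the region around location i, and summarising the regional X values by their mean is just one possible choice.

```latex
Y_i = \alpha + \beta_{L} X_i + \beta_{R} \bar{X}_{R(i)} + \varepsilon_i,
\qquad
\bar{X}_{R(i)} = \frac{1}{|R(i)|} \sum_{j \in R(i)} X_j
```

Under this reading, a model fitted with only the X_i term omits the regional term and so is mis-specified whenever the regional coefficient is non-zero, and the practical difficulty lies in choosing R(i).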

Trials based on simulated data have provided evidence in support of these findings, but we also wanted to test them empirically using 1991 census data. Our main case study has been concerned with the relationship between ethnic group and unemployment rates in Northwest England. Regression parameters for this relationship were found to change substantially depending on the level of analysis, from enumeration districts (EDs) to wards or districts; in other words, there is a strong MAUP effect.

So that the effectiveness of differing definitions of the regional effects can be compared, we needed to create new aggregations from the underlying census geography of the study area. In the first instance, the primary goal has been to derive sets of simulated 'pseudo-ward' zoning configurations from the 1991 digitised ED boundaries. However, the sheer volume of data associated with the number of input zones (over 14 000 EDs were defined within the Northwest), together with our desire to generate a large number of trials, implied a significant computational overhead. Therefore, we turned to a multi-processing computing facility to run this simulated zoning process faster and more efficiently than would be possible using traditional sequential programming methods.

Typically, multi-processor applications can best be considered as consisting of a front-end 'host' and a 'compute box' containing the processors that actually run the jobs. The parallel processing facility which we have been using currently includes 26 Intel i860 processors, each with 16 Mbyte of memory, front-ended by a Sun SPARC 5 UNIX workstation. Source code for a zone generation algorithm was first written in FORTRAN77 before being converted to produce i860-compatible executables. The algorithm, which incorporates a contiguity constraint, is intended to produce reasonably compact and realistic zones using polygon data extracted from an existing ARC/INFO zoning coverage. Two input files are first created. One contains the polygon identifiers, simplified to ensure that only one polygon is included per ED, together with the ED totals for each relevant census variable. The other lists the polygon identifiers on both sides of each arc, and is used to determine contiguities.
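The way contiguities follow from the arc file can be sketched in a few lines of Python. This is not the FORTRAN77 code described above; the function name build_contiguity, the in-memory list of (left, right) pairs and the use of 0 as the code for the area outside the coverage are all assumptions made for illustration.

```python
from collections import defaultdict

def build_contiguity(arcs, outside=0):
    """Derive a neighbour list from (left_id, right_id) pairs, one pair per arc.

    'outside' is the assumed identifier for the area beyond the study region.
    """
    neighbours = defaultdict(set)
    for left, right in arcs:
        # Skip arcs on the outer boundary and arcs with the same polygon on both sides.
        if left == outside or right == outside or left == right:
            continue
        # Two polygons sharing an arc are contiguous, in both directions.
        neighbours[left].add(right)
        neighbours[right].add(left)
    return {polygon: sorted(ids) for polygon, ids in neighbours.items()}

# Made-up arcs for four polygons; 0 marks the outside of the coverage.
arcs = [(1, 2), (2, 3), (1, 0), (3, 4), (2, 4), (4, 0)]
contiguity = build_contiguity(arcs)
```

A polygon with no neighbours at all, such as an island, would not appear in this dictionary and would need separate handling.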

A master process running on the host is used to store this base polygon data, and to manage the allocation of slave aggregation processes running within the compute box. After randomly selecting a specified number of 'seed' zones, each slave process executes the basic aggregation algorithm. The seed zones are examined in order and, for each seed, one of the zones contiguous to it is randomly selected and joined to it to form a new zone. On the second pass, any zone contiguous either to the original seed or to the zone joined to it on the first pass becomes eligible. The aggregation procedure is repeated until no unassigned zones remain adjacent to the new zones that have been formed. The slave process then returns the aggregated results to the master, which passes back a new set of input data, until the requested number of iterations is reached. Output data files are fed back into ARC/INFO so that the data from the constituent old zones can be aggregated to their respective new zones. These results can then be mapped and inspected visually using the tools within ARCPLOT. Lastly, the regression analyses are repeated for the different sets of zones to investigate the sensitivity of the model to aggregation.
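The basic aggregation step run by a single slave process can be sketched as follows. This is a minimal Python illustration of the procedure as described, not the original FORTRAN77 implementation; the names grow_zones and grid_contiguity and the 4 x 4 grid of test zones are hypothetical, and the master/slave message passing is left out.

```python
import random

def grow_zones(contiguity, n_seeds, rng):
    """Grow n_seeds contiguous zones from randomly chosen seed EDs."""
    seeds = rng.sample(sorted(contiguity), n_seeds)
    assignment = {seed: zone for zone, seed in enumerate(seeds)}
    members = [[seed] for seed in seeds]
    grew = True
    while grew:  # one iteration of this loop is one pass over the seeds
        grew = False
        for zone, eds in enumerate(members):
            # Unassigned EDs contiguous to any ED already in this zone.
            candidates = sorted({n for ed in eds for n in contiguity[ed]
                                 if n not in assignment})
            if candidates:
                pick = rng.choice(candidates)
                assignment[pick] = zone
                eds.append(pick)
                grew = True
    return assignment

def grid_contiguity(rows, cols):
    """Rook contiguity for a rows x cols grid, used only as toy input."""
    cont = {}
    for r in range(rows):
        for c in range(cols):
            cont[r * cols + c] = [
                (r + dr) * cols + (c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return cont

grid = grid_contiguity(4, 4)
assignment = grow_zones(grid, n_seeds=4, rng=random.Random(1991))
```

Because a zone only ever absorbs an ED adjacent to one of its existing members, every zone produced this way is contiguous. EDs in a part of the map that contains no seed would remain unassigned, so a real run would need to check for this.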

The analysis described is appropriate for the study of the zonation effect, but it can also be used for studying the scale effect, simply by changing the number of seed zones. In this way we are looking to investigate groupings of pseudo-ward coverages into higher-level aggregations. The overall approach seems suitable for other kinds of simulation work using GIS coverages as input data.
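The final step, repeating the regression over alternative zonings of the same data, can be illustrated with a small Python sketch. All figures below are invented for the example and are not census values; the function names, the (population, ethnic group count, unemployed count) layout and the two hand-made zonations are assumptions, and a simple one-variable least-squares fit stands in for the models used in the study.

```python
def aggregate(totals, assignment):
    """Sum per-ED count tuples into per-zone totals."""
    zones = {}
    for ed, counts in totals.items():
        acc = zones.setdefault(assignment[ed], [0] * len(counts))
        for i, count in enumerate(counts):
            acc[i] += count
    return zones

def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def zonal_regression(totals, assignment):
    """Regress the unemployment rate on the ethnic group rate across zones."""
    zones = aggregate(totals, assignment)
    x = [eth / pop for pop, eth, _ in zones.values()]
    y = [unemp / pop for pop, _, unemp in zones.values()]
    return ols_slope_intercept(x, y)

# Invented ED totals: (population, ethnic group count, unemployed count).
totals = {1: (100, 10, 5), 2: (100, 30, 10), 3: (100, 20, 10),
          4: (100, 40, 20), 5: (100, 5, 5), 6: (100, 50, 15)}

# Two different ways of grouping the same six EDs into three zones.
zonation_a = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
zonation_b = {1: 0, 5: 0, 2: 1, 3: 1, 4: 2, 6: 2}

slope_a, intercept_a = zonal_regression(totals, zonation_a)
slope_b, intercept_b = zonal_regression(totals, zonation_b)
```

The two zonings contain identical underlying counts at the same scale, yet the fitted slopes differ, which is the zonation effect in miniature. Changing the number of zones rather than their boundaries would illustrate the scale effect in the same way.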