Defining Boundaries from Synthetic Data

Mike Coombes
C.U.R.D.S., University of Newcastle-upon-Tyne, Claremont Bridge, Newcastle-upon-Tyne NE1 7RU

The paper derives from a project¹ (part of the ESRC-sponsored Census Initiative) which set itself both methodological and substantive challenges.

The preliminary task was to review recent developments in regionalisation methods. The literature is relatively slight, partly because the development of spatial analysis facilities in Geographic Information System (GIS) packages has scarcely reached the regionalisation field. Three main 'traditions' were identified:
  1. the longest standing approach is rooted in manual methods and involves a multi-step procedure which typically starts by identifying central cities and moves out to assign other areas to these foci - the official Metropolitan Area definitions in the USA are the classic example (Dahmann & Fitzsimmons, 1995);
  2. more statistical approaches derive from numerical taxonomy principles - there are a number of alternative 'black box' methods ranging from cluster analysis to regionalisation-specific algorithms (e.g. INTRAMAX), but they typically revolve around a single procedure seeking to maximise a statistical criterion which represents the objectives set for the definitions (e.g. "maximise internal coherence, subject to a minimum size"); and
  3. a hybrid of the above two alternatives, adopting a multi-step approach which is based on a traditional understanding of cities as foci for hinterlands, but which uses more statistical methods and criteria with successive stages of the analysis in order to ensure that final boundaries all meet strictly pre-defined objectives, and can be 'optimised' in relation to these objectives.
A recent review for Eurostat (1992) concluded that the European Regionalisation Algorithm (ERA) - a hybrid (type 3) approach - provided the most flexible and reliable form of local labour market area definition (Coombes et al., 1986). This project adopted the ERA software as its standard, but recognised that even as the 'best practice' method it can only mitigate the problems of definitions based on a single run on a single dataset. As a response, the innovation has been to split the whole regionalisation procedure into two phases. It is the second phase which required technical innovations.

The solution devised here centres on creating "synthetic data" which provides the basis for phase two of the method by using as input the initial, phase one, analyses. Each of these analyses produces a classification of all parts of the country (viz. the 10,529 wards (sectors in Scotland) in the 1991 Census). Such a classification identifies which of these 'building block' areas are grouped together as a single region in this set of commuting or migration regions (or whatever the classification represents). Thus the key information in each classification can be re-expressed as a binary matrix of 10,529 x 10,529 cells, in which a cell is 1 if the two areas fall in the same region and 0 otherwise (the matrix is inherently symmetrical, so only half of it is needed).
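The re-expression of a classification in binary form can be sketched as follows. This is only an illustration under stated assumptions, not the project's software: a classification is assumed to be a list giving a region label for each area in a fixed order, and the function names `comembership_pairs` and `to_binary_matrix` are hypothetical. Because the matrix is symmetrical, the first function stores only the pairs (i, j) with i < j that are grouped together; for 10,529 areas there are about 55 million such candidate pairs.

```python
from itertools import combinations


def comembership_pairs(labels):
    """Return the set of pairs (i, j), i < j, of areas sharing a region.

    `labels[k]` is the region label of area k (assumed input format).
    """
    groups = {}
    for area, region in enumerate(labels):
        groups.setdefault(region, []).append(area)
    pairs = set()
    for members in groups.values():
        # members are appended in increasing order, so each pair has i < j
        pairs.update(combinations(members, 2))
    return pairs


def to_binary_matrix(labels):
    """Full binary matrix form: cell [i][j] is 1 if areas i and j share a region."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


# Toy classification of five areas into two regions
example_labels = ["A", "A", "B", "A", "B"]
example_pairs = comembership_pairs(example_labels)
example_matrix = to_binary_matrix(example_labels)
```

The pair-set form is the more practical of the two at national scale, since most pairs of areas are never grouped together and need not be stored.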

The crucial benefit from re-expressing each separate classification in this binary form is that these matrices can then be cumulated to produce the synthetic data needed. In GIS terms, it is analogous to layering the sets of boundaries on top of each other and counting the number of layers in which there is no boundary between each pair of areas. It can be seen that this approach provides an assessment of the 'strength of evidence' that two areas should be grouped together. The final synthetic dataset is, then, an ideal basis for the second phase of the definitional procedure - and it can be analysed with a version of ERA which has been optimised for this purpose. Other forms of analysis of the synthetic dataset have also been examined.
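The cumulation step can be illustrated with a minimal sketch, again assuming that each classification is a list of region labels over the same ordered set of areas. The function name `cumulate` is hypothetical, and every layer is given equal weight here purely for simplicity. The result records, for each pair of areas, the number of layers in which no boundary separates them.

```python
from collections import Counter
from itertools import combinations


def cumulate(classifications):
    """Count, for each pair (i, j) with i < j, how many classifications
    place areas i and j in the same region."""
    counts = Counter()
    for labels in classifications:
        groups = {}
        for area, region in enumerate(labels):
            groups.setdefault(region, []).append(area)
        for members in groups.values():
            counts.update(combinations(members, 2))
    return counts


# Three toy classifications of the same four areas
layers = [
    ["A", "A", "B", "B"],
    ["X", "X", "X", "Y"],
    ["P", "Q", "Q", "Q"],
]
synthetic = cumulate(layers)
```

In this toy case areas 0 and 1 are grouped together in two of the three layers, while areas 0 and 3 are never grouped together and so do not appear in the result at all.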

The methodological innovation of creating synthetic data removes the technical limitations which arise from relying upon a single analysis of a single dataset. In particular, the huge benefit of the synthetic data method is the ability to draw upon analyses of different datasets. Virtually all previous regionalisations have centred on the analysis of a single dataset of flows between areas (most usually commuting flows, but sometimes migration). The synthetic data, however, can draw upon the evidence of many different sets of flows. The synthetic dataset has also been enriched by taking as a further form of input a range of existing sets of boundaries - such as local authority areas - because these are also indicative of which areas might be better kept together and which kept separate.
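Once evidence from several kinds of input has been pooled in this way, regions have to be read off the pooled counts. The sketch below is not the ERA procedure; it is a deliberately simple stand-in, assumed here only for illustration, which links any two areas grouped together in at least `threshold` layers and then takes the connected components. The function name and the input format (a mapping from area pairs to counts) are hypothetical.

```python
def group_by_threshold(n_areas, pair_counts, threshold):
    """Label each area with the smallest area index in its component,
    linking pairs whose count is at least `threshold`."""
    parent = list(range(n_areas))

    def find(x):
        # Follow parent links to the root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), count in pair_counts.items():
        if count >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as root so labels are deterministic
                parent[max(ri, rj)] = min(ri, rj)

    return [find(area) for area in range(n_areas)]


# Toy pooled counts for five areas
toy_counts = {(0, 1): 3, (1, 2): 1, (2, 3): 2, (3, 4): 3}
toy_regions = group_by_threshold(5, toy_counts, threshold=2)
```

A linkage rule of this kind can chain together areas that are only indirectly connected, and it enforces no size or coherence objectives, which is one reason a purpose-built procedure with explicit criteria may be preferred.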

The paper will illustrate the value of GIS in compiling the synthetic data, for the 10,529 areas, from a large number of boundary sets which were originally defined in terms of many different sets of 'building block' areas. More fundamentally, it is nearly certain that such a method would not only have been scarcely practicable prior to the diffusion of GIS techniques, but would not even have been conceived of without the GIS-based experience of overlaying one boundary set on top of another. Thus it is GIS which has helped to stimulate this innovative methodology, which visualises localities as areas cut through by relatively few of the many existing and definable sets of boundaries that may be relevant to this project's concern to identify localities for use by researchers with a wide range of interests.
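Bringing a boundary set defined on other building blocks onto the common set of areas might be sketched as below. The source does not describe how this was done, so everything here is an assumption: the input is taken to be the result of a GIS overlay, listed as (ward, block, overlap area) records, and each ward is assigned to the block with which it overlaps most. The names `wards_to_regions`, `overlaps` and `block_to_region` are hypothetical.

```python
def wards_to_regions(overlaps, block_to_region, n_wards):
    """Express a boundary set as one region label per ward.

    overlaps        -- iterable of (ward, block, overlap_area) records,
                       assumed to come from a GIS overlay
    block_to_region -- region label for each of the other building blocks
    n_wards         -- number of wards, indexed 0 .. n_wards - 1
    """
    best = {}
    for ward, block, area in overlaps:
        # Keep the block with the largest overlap for each ward
        if ward not in best or area > best[ward][1]:
            best[ward] = (block, area)
    # Raises KeyError if a ward has no overlap record at all
    return [block_to_region[best[ward][0]] for ward in range(n_wards)]


# Toy overlay: three wards, three blocks, two regions
toy_overlaps = [
    (0, "b1", 5.0), (0, "b2", 1.0),
    (1, "b2", 4.0),
    (2, "b2", 2.0), (2, "b3", 3.0),
]
toy_block_regions = {"b1": "R1", "b2": "R1", "b3": "R2"}
toy_labels = wards_to_regions(toy_overlaps, toy_block_regions, 3)
```

The output is a list of region labels in ward order, which is the same form as any other classification and can therefore be added to the synthetic data as one more layer.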

Coombes, M.G., Green, A.E. and Openshaw, S. 1986. "An efficient algorithm to generate official statistical reporting areas: the case of the 1984 Travel-to-Work Areas", Journal of the Operational Research Society, 37, 943-953.

Dahmann, D.C. and Fitzsimmons, J.D. (eds) 1995. "Metropolitan and Nonmetropolitan Areas: new approaches to geographical definition", US Bureau of the Census Working Paper 12, Washington D.C.

Eurostat. 1992. Study on Employment Zones. Eurostat E/LOC/20 Luxembourg.

¹ Localities City Region Project (ESRC Ref: H507255 129) undertaken at NorthEast Regional Research Laboratory (NE.RRL) by Mike Coombes (CURDS), Colin Wymer (Planning Dept) and David Atkins (Geography Dept) in conjunction with Prof Stan Openshaw (Leeds University, School of Geography).