Exploring Relationships in Higher Dimensional Data Sets

A. Stewart Fotheringham, Martin Charlton and Chris Brunsdon
Department of Geography, University of Newcastle-upon-Tyne, Newcastle-upon-Tyne NE1 7RU, United Kingdom

Whilst exploratory techniques have come to dominate recent advances in spatial data analysis, many of these techniques are confined to use in rather simple univariate or bivariate data sets. This paper describes two techniques which have been developed for exploring relationships in more realistic higher-dimensional data sets having large numbers of variables. The two techniques are those of Geographically Weighted Regression and Parallel Coordinates.

The objective of ordinary least squares (OLS) regression is to produce a single set of parameter estimates given data on a dependent variable and one or more independent variables. When applied to spatial data, the single set of relationships described by the parameter estimates is assumed to apply equally to all parts of the region from which the data are drawn. That is, the relationship between the dependent variable and any independent variable is assumed to be stationary over space. However, non-stationarity can occur for two reasons: i) there are intrinsic differences in relationships over space; and ii) the regression equation is misspecified, either because it includes incorrect functional forms of relationships between variables (such as a non-linear relationship between two variables being described by a linear one) or because it excludes relevant variables, or both. In the case of the latter, these relevant variables may be unknown to the researcher or unmeasurable. Whatever the cause of non-stationarity, it is possible to measure its intensity and to map it using geographically weighted regression (GWR). In GWR, the data are weighted geographically around a point in space so that nearby observations are weighted more heavily than those further away. In this way, a separate set of parameter estimates is produced for each point in space, so that the estimates can be mapped and spatial variations in relationships explored. The mapping and interrogation tools of a GIS are clearly useful for this exploration. We demonstrate GWR with a national UK data set on house prices which we use to calibrate both a global hedonic price model and a set of local models. Variations across the country in the relationships that determine house prices are then described.
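
The weighting idea described above can be illustrated with a minimal sketch. It is not the calibration procedure used in the paper: the Gaussian distance-decay kernel, the fixed bandwidth, the function name gwr_local_estimates and the synthetic data are all assumptions made for illustration. At each location a weighted least squares regression is fitted in which nearby observations count for more, and the result is compared with a single global OLS fit.

```python
import numpy as np

def gwr_local_estimates(coords, X, y, bandwidth):
    """Fit one weighted least squares regression per location.

    Assumption: weights follow a Gaussian kernel of the distance to the
    regression point, with a fixed (hypothetical) bandwidth.
    """
    n = coords.shape[0]
    Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)   # distances to point i
        w = np.exp(-0.5 * (d / bandwidth) ** 2)          # nearer data weigh more
        XtW = Xd.T * w                                   # X'W without forming W
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)    # (X'WX)^-1 X'Wy
    return betas

# Synthetic data (illustrative only): the slope drifts from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
x = rng.normal(size=200)
slope = 1.0 + 0.2 * coords[:, 0]
y = 2.0 + slope * x + rng.normal(scale=0.1, size=200)

# Local estimates: one (intercept, slope) pair per location, ready for mapping.
local = gwr_local_estimates(coords, x, y, bandwidth=1.5)

# Global OLS: a single (intercept, slope) pair for the whole region.
global_beta = np.linalg.lstsq(np.column_stack([np.ones(200), x]), y, rcond=None)[0]
```

In this sketch the global model returns one slope for the whole region, whereas the second column of local varies with the easting of each point, which is the kind of spatial variation that would be mapped. The bandwidth controls how quickly the weights decay with distance and would need to be chosen for any real application.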

The second exploratory technique for examining relationships in multivariate data sets is that of parallel coordinates. Exploration of multivariate data can be undertaken with several existing techniques such as multiple regression, principal components analysis, cluster analysis and correspondence analysis. However, a drawback to these techniques for exploring patterns and relationships within data is that they are not very visual. We might transform data to principal components and construct a scatterplot of the locations of the data cases with respect to the first two or three components, but our ability to see a four or higher dimensional space in which our data lie is limited. To solve this problem, Inselberg has proposed a technique which he refers to as Parallel Coordinates and which, despite being relatively unknown within geography, has interesting potential applications to spatial data.

The technique of placing data within a series of parallel coordinates is very simple and yet can produce quite powerful visual patterns of relationships between attributes. Suppose a data set has n attributes. The procedure involves drawing n parallel lines (axes), each of which is labelled according to the values of a specific attribute, with the lowest end of the line corresponding to the lowest value of the attribute. An observation is then plotted by placing the values of each attribute on the axes and joining these points by straight lines. By plotting a set of data in this way, a series of lines forms between the axes and the patterns in these lines can be used to explore relationships within the data. This can be done either by selecting various lines that appear to exhibit similar patterns or by selecting a range of values of one attribute (perhaps an attribute that might be defined as the dependent variable in a regression framework) and then examining the values of the other attributes that are linked to points within this range. In an extension to Inselberg's work, further exploratory work is possible with spatial data by linking the parallel coordinates display to a map and selecting groups of data exhibiting similar patterns in the parallel coordinates display to see their locations on the map. A demonstration of this technique will be given using data on the spatial cognition of UK cities.
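
The construction just described can be sketched in a few lines. This is a minimal illustration rather than the software used for the demonstration: the function names to_parallel_coords and brush, the four-attribute data matrix and the locations array are all hypothetical. Each attribute is rescaled so that its lowest value sits at the bottom of its axis, each observation becomes a polyline across the n axes, and selecting a range on one attribute returns the observations whose map locations can then be highlighted.

```python
import numpy as np

def to_parallel_coords(data):
    """Rescale each attribute (column) to [0, 1]; 0 is the lowest value on its axis."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
    return (data - lo) / span

def brush(data, attribute, low, high):
    """Return indices of observations whose chosen attribute lies in [low, high]."""
    col = data[:, attribute]
    return np.flatnonzero((col >= low) & (col <= high))

# Hypothetical data: 4 observations with n = 4 attributes each.
data = np.array([
    [10.0, 200.0, 3.0, 0.5],
    [20.0, 150.0, 1.0, 0.9],
    [30.0, 100.0, 2.0, 0.1],
    [40.0,  50.0, 5.0, 0.7],
])
# Hypothetical map coordinates for the same 4 observations.
locations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

heights = to_parallel_coords(data)
axes_x = np.arange(data.shape[1])            # horizontal position of each axis

# One polyline per observation: (axis position, height on that axis) vertices.
polylines = [np.column_stack([axes_x, row]) for row in heights]

# Select a range on the first attribute, then look up the linked map locations.
selected = brush(data, attribute=0, low=25.0, high=45.0)
selected_locations = locations[selected]
```

The polylines can be passed to any line-drawing routine, and selected_locations is what a linked map display would highlight. In this toy data the first two attributes move in opposite directions, so the line segments between their axes cross, which is the kind of visual pattern the display is intended to reveal.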