Stephen Wise, Robert Haining and Paola Signoretta

Department of Geography and Sheffield Centre for Geographic Information
and Spatial Analysis,

University of Sheffield, United Kingdom.

Email: s.wise@shef.ac.uk

A consistent theme in recent work on developing exploratory spatial data analysis (ESDA) has been the importance attached to visualization techniques, often justified in two ways:

- modern computer graphics can be used to **analyse** rather than **present** data.
- graphical, exploratory methods are felt to be more intuitive for non-specialists to use than methods of numerical spatial statistics.

Numerous software packages have been developed which provide visualization facilities to help with the analysis of area data. This poster will use one developed at Sheffield to present:

- Examples of the use of visualization in this type of analysis
- A theoretical framework for assessing and developing visualization tools for ESDA

**Figure 1:** Screenshot of a SAGE session

Figure 1 illustrates some of the key features of the SAGE system:

- A tabular display of the attributes for each area.
- A map of the areas.
- A text window for reporting analytical results.
- One or more graphical views of the data.
- Linkage between the windows - selection of extreme values on the boxplot has caused the relevant areas on the map and rows in the table to be highlighted.

SAGE provides a range of graphical and numerical tools for undertaking ESDA. In order to assess the effectiveness of these tools, a conceptual model has been developed which has two elements:

- a data model, based on the distinction between rough and smooth properties of spatial data, that defines what an analyst is looking for in data (Haining et al 1998a)
- a theoretical model for assessing the quality of visualization tools (Cleveland 1994).

Exploratory Spatial Data Analysis has certain key characteristics:

- Methods are descriptive rather than confirmatory.
- Aims are to detect patterns, to formulate hypotheses and to assess spatial models.
- Techniques are visual and resistant to unusual data values.
- Techniques 'stay close to the data' in the sense that few data transformations are employed.

Spatial data can be modelled as having two components, smooth and rough.

Both the spatial and non-spatial elements of spatial data can be considered to have these two components as shown in the table below:

| | Smooth | Rough |
|---|---|---|
| Non-Spatial | Properties of distribution e.g. median, interquartile range | Outliers in distribution |
| Spatial | Trend; Spatial autocorrelation | Localised clusters of high values; Spatial outliers |

- Software for ESDA should therefore provide tools which allow the analyst to explore all four components of the data.
- The model therefore provides a framework for assessing the **range** of tools provided by a package.
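
As a minimal sketch of the non-spatial row of this model, the smooth component can be summarised by the median and interquartile range, and the rough component by the values falling outside the boxplot fences. The function name `smooth_and_rough`, the sample values and the 1.5 × IQR fence rule are illustrative assumptions, not a description of how SAGE computes them.

```python
import statistics

def smooth_and_rough(values, fence=1.5):
    """Split a batch of values into a smooth summary and rough outliers.

    Smooth: median and interquartile range (IQR).
    Rough: values lying more than `fence` * IQR beyond the quartiles
    (the usual boxplot rule, assumed here for illustration).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"median": statistics.median(values), "iqr": iqr, "outliers": outliers}

# Hypothetical batch with one extreme value.
summary = smooth_and_rough([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```

The spatial row of the model needs neighbourhood information as well as the values, so it cannot be computed from the batch alone.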

According to Cleveland (1994), statistical graphs are used for two purposes, each of which requires the viewer to undertake one or more of three tasks:

| Activity | Description | Perceptual tasks required by viewer |
|---|---|---|
| Table Look Up | Reading off value for an individual case | scanning (relating the case to the axis); interpolating (estimating the value of the case from the tick marks on the axis); matching (linking the case symbol back to the key) |
| Pattern Perception | Identifying trends, patterns or regularities in whole set of data | detection (recognizing how relationships between values are coded on the graph e.g. distances between symbols relate to differences in values of observations); assembly (grouping objects on the graph together e.g. all cases relating to a given year); estimation (of the differences between the grouped cases e.g. Year 1 values tend to be greater than those for Year 2) |

Good graphical displays can be defined as those that are 'easy to read' i.e. their design assists the viewer in undertaking the necessary perceptual tasks.

The model can be extended to maps and can therefore be used
to assess the **quality** of the visualization tools provided
in ESDA software.

A full assessment of the visualization tools in SAGE is contained in Haining et al (1998b). The figures here illustrate some of the key features of the system with comments on their strengths and weaknesses. The data used relate to the uptake of the breast cancer screening service in Sheffield. Enumeration district level data (there are 1159 EDs in Sheffield) have been aggregated into approximately 300 areas so that the illustrations can be seen in the prints here. The grouping (implemented in SAGE) placed together EDs with similar Townsend deprivation scores, whilst also trying to create areas of similar population size, with areal compactness as a secondary requirement (for details see Wise et al 1997).

**Figure 2:** Screenshot to illustrate some features of SAGE which facilitate
the exploration of data

Figure 2 shows an example of some of the features of SAGE which facilitate exploration of the basic properties of the data. The boxplot shows the distribution of uptake rates. It has been used for a table look up operation, namely to determine the value of the lowest rate. This task can be assisted in three ways:

- Re-sizing the window (to bring the graph closer to the axis)
- Turning on gridlines to assist in relating the point to the axis.
- Zooming in on the boxplot window (shown as a separate window on the right)

The linked windows facility makes it easy to see **where**
in Sheffield this outlier is located, and gives a second method
of determining the uptake rate, by highlighting the row in the
table.
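
The linking idea can be sketched as a shared selection that every window listens to. This is only an illustration of the concept: the class names `LinkedSelection` and `View`, the area IDs and the uptake values are hypothetical and do not reflect SAGE's internal design.

```python
class LinkedSelection:
    """Shared selection of area IDs that notifies every registered view."""

    def __init__(self):
        self._views = []
        self.selected = set()

    def register(self, view):
        self._views.append(view)

    def select(self, area_ids):
        # Store the selection, then push it to every linked window.
        self.selected = set(area_ids)
        for view in self._views:
            view.highlight(self.selected)


class View:
    """Stand-in for a map, table or graph window."""

    def __init__(self, name):
        self.name = name
        self.highlighted = set()

    def highlight(self, area_ids):
        self.highlighted = set(area_ids)


# Hypothetical uptake rates (%) keyed by area ID.
uptake = {"area_1": 71.0, "area_2": 68.5, "area_3": 42.0, "area_4": 73.2}

selection = LinkedSelection()
map_view, table_view = View("map"), View("table")
selection.register(map_view)
selection.register(table_view)

# Picking the lowest point on the boxplot highlights it in every view,
# and the table row gives the exact value without reading the axis.
lowest = min(uptake, key=uptake.get)
selection.select([lowest])
print(map_view.highlighted, table_view.highlighted, uptake[lowest])
```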

**Figure 3:** Illustration of linked windows using map and box plot

The breast cancer screening service is provided in a single location in Sheffield (near one of the major hospitals). It is therefore of interest to see whether distance from this centre affects the proportion of women who use the service, i.e. whether there is a strong SMOOTH element in the spatial pattern of uptake rates. The graph on the right of Figure 3 shows a series of boxplots of the uptake rate, calculated for zones lying at increasing lag distances from the zone containing the centre. The zones at lag three have been highlighted (by selecting the entire boxplot in the right hand window) showing that lag is a reasonable proxy for distance from the screening centre, at least up to lag 3. The graph shows that, perhaps surprisingly, distance from the screening centre does not appear to be a strong factor in determining whether women use the service.
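
The grouping behind a spatially lagged boxplot can be sketched as follows, assuming that lag means the number of zone-to-zone adjacency steps from the zone containing the centre. The function names, the five-zone adjacency list and the rates are hypothetical; SAGE's own definition of lag may differ in detail.

```python
from collections import deque
import statistics

def lag_orders(adjacency, centre):
    """Breadth-first search: number of adjacency steps from `centre`."""
    lags = {centre: 0}
    queue = deque([centre])
    while queue:
        zone = queue.popleft()
        for neighbour in adjacency[zone]:
            if neighbour not in lags:
                lags[neighbour] = lags[zone] + 1
                queue.append(neighbour)
    return lags

def rates_by_lag(adjacency, centre, rates):
    """Group uptake rates by lag order, ready for one boxplot per lag."""
    groups = {}
    for zone, lag in lag_orders(adjacency, centre).items():
        groups.setdefault(lag, []).append(rates[zone])
    return dict(sorted(groups.items()))

# Hypothetical example; zone "A" contains the screening centre.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
rates = {"A": 70, "B": 65, "C": 75, "D": 60, "E": 80}

groups = rates_by_lag(adjacency, "A", rates)
medians = {lag: statistics.median(vals) for lag, vals in groups.items()}
print(groups)
print(medians)
```

A steady rise or fall of the medians with lag would indicate a distance trend; roughly level medians correspond to the weak distance effect described above.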

**Figure 4:** Illustration of linked windows using map and Moran plot

An alternative possibility is that women are influenced by the social and economic conditions in their neighbourhood. One way to assess this is to look for spatial clustering.

The graph in Figure 4 is a Moran plot in which values for a region are plotted on the Y axis, and average values in neighbouring regions on the X axis. The presence of a positive trend in this graph is evidence of positive spatial autocorrelation - another form of SPATIAL SMOOTH pattern in the data. However, there are also some regions which are outliers from this positive relationship, and these are spatial outliers. Six regions have been selected on the graph (the six at the bottom of the graph) in which the uptake rate is lower than in neighbouring areas. However, as the map shows, these outliers are scattered across the city.
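
The quantities behind such a plot can be sketched as below, following the axis convention described above (neighbour average on X, region value on Y). The sketch assumes row-standardised weights, i.e. a simple average over adjacent regions, and that every region has at least one neighbour. The function names, the `gap` threshold and the data are hypothetical.

```python
def neighbour_means(values, neighbours):
    """Average value over each region's neighbours (X axis of the plot)."""
    return {
        r: sum(values[n] for n in neighbours[r]) / len(neighbours[r])
        for r in values
    }

def morans_i(values, neighbours):
    """Moran's I with row-standardised weights."""
    mean = sum(values.values()) / len(values)
    z = {r: v - mean for r, v in values.items()}
    lagged = neighbour_means(z, neighbours)
    return sum(z[r] * lagged[r] for r in z) / sum(d * d for d in z.values())

def low_spatial_outliers(values, neighbours, gap):
    """Regions whose value lies more than `gap` below their neighbour mean."""
    means = neighbour_means(values, neighbours)
    return [r for r in values if means[r] - values[r] > gap]

# Hypothetical chain of four regions: r1 - r2 - r3 - r4.
neighbours = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"], "r4": ["r3"]}

trend = {"r1": 1.0, "r2": 2.0, "r3": 3.0, "r4": 4.0}
moran = morans_i(trend, neighbours)
print(moran)  # positive: similar values sit next to each other

dip = {"r1": 10.0, "r2": 10.0, "r3": 2.0, "r4": 10.0}
outliers = low_spatial_outliers(dip, neighbours, gap=5.0)
print(outliers)
```

Whether flagged outliers cluster or scatter can only be judged by viewing them on the map, which is the role of the linked windows.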

- SAGE provides tools for exploring all four components of the ESDA data model.
- An additional facility is the ability to build areal frameworks suitable for the particular analysis by grouping together existing areal units, such as EDs, under a variety of criteria.
- A good range of standard statistical graphs provided - boxplot, scatter plot, histogram, rankit plot.
- A range of tools provided for exploring spatial properties of data - spatially lagged boxplot; Moran plot, smoothed maps (mean, median and relative risk smoothers)
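
To illustrate what a mean or median map smoother does, the sketch below replaces each area's value with a summary of the values in the area and its adjacent areas. The window definition (the area plus its immediate neighbours), the function name `smooth_map` and the data are assumptions for illustration; the smoothers in SAGE, including the relative risk smoother, may be defined differently.

```python
import statistics

def smooth_map(values, neighbours, smoother=statistics.median):
    """Replace each area's value with a summary over the area and its neighbours."""
    return {
        area: smoother([values[area]] + [values[n] for n in neighbours[area]])
        for area in values
    }

# Hypothetical chain of four areas: a - b - c - d, with one extreme value at c.
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 10, "b": 12, "c": 50, "d": 11}

median_map = smooth_map(values, neighbours)
mean_map = smooth_map(values, neighbours, smoother=statistics.mean)
print(median_map)
print(mean_map)
```

In this example the median smoother suppresses the extreme value at `c` while the mean smoother is pulled towards it, which is why the median form is the more resistant of the two.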

- Linked windows facility a useful addition to the standard graphics.
- Flexibility of graphical tools good.
- Map drawing is inflexible and cumbersome. This is provided by ARC/INFO and is geared towards presentation graphics rather than visualization.

- Model of ESDA and Cleveland model of visualization provide a useful framework for assessing visualization software to assist ESDA.
- Visualization methods provide a useful set of tools for undertaking ESDA.
- Other features which could usefully be developed include:
  - Dynamic brushing, in which graphs change in real time as different cases are selected from the map or another graph. (This is not possible in SAGE due to its client-server architecture).
  - Values calculated for areas may have different levels of reliability, due to variable size and nature of population between areal units. Graphs could be modified to reflect this e.g. by highlighting rates calculated on low base populations.
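
One way the reliability idea might be sketched is to attach a standard error and a low-base flag to each rate, so that a graph could draw flagged areas differently. The binomial standard error, the `min_population` threshold of 50 and the function name are assumptions chosen for illustration, not a feature of SAGE.

```python
import math

def rate_with_reliability(events, population, min_population=50):
    """Return a rate with its binomial standard error and a low-base flag."""
    rate = events / population
    std_error = math.sqrt(rate * (1 - rate) / population)
    return {
        "rate": rate,
        "std_error": std_error,
        "low_base": population < min_population,
    }

# Two hypothetical areas with the same uptake rate but different populations.
small = rate_with_reliability(events=30, population=40)
large = rate_with_reliability(events=300, population=400)
print(small)
print(large)
```

Both areas have the same rate, but the smaller one has a larger standard error and is flagged, which is the distinction a modified graph would need to convey.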

ESRC grant number R000234470 "Developing spatial statistical software for the analysis of area based health data linked to a GIS" enabled the development of SAGE; a grant from the Joint Information Systems Committee (JISC) and the ESRC made possible the visualization assessment. Thanks to Jingsheng Ma for the development of SAGE and to Dawn Thompson for the use of the breast cancer screening uptake data.

Cleveland W.S. (1994) The elements of graphing data. AT&T Bell Laboratories, Murray Hill NJ.

Haining R.P., Wise S.M. and Ma J. (1998a) Exploratory spatial data analysis in a Geographic Information System Environment. *The Statistician* (in press).

Haining R.P., Wise S.M. and Signoretta P. (1998b) Providing scientific visualization for spatial data analysis: criteria and an assessment of SAGE. Paper presented at the 38th Congress of the European Regional Science Association, Vienna, Aug 28th-Sept 1st 1998.

Wise S.M., Haining R.P. and Ma J. (1997) Regionalisation tools for the exploratory spatial analysis of health data. In M. Fischer and A. Getis (Eds) *Recent Developments in Spatial Analysis: Spatial statistics, behavioural modelling and neuro-computing.* Berlin, Springer-Verlag, p83-100.

For further details on SAGE see: http://www.shef.ac.uk/~scgisa