GeoComputation 99

Geographical Data Mining: Key Design Issues

OPENSHAW, Stan (Stan@geog.leeds.ac.uk), University of Leeds, Center for Computational Geography, School of Geography, Leeds LS2 9JT

Key Words: geographical data mining, artificial intelligence, GIS, geographical analysis, high performance computing

The explosion in geographically referenced data, combined with the lack of suitable analysis tools in current Geographic Information Systems (GIS) software, means that many important data sets are not being fully or appropriately analyzed. In the machine age, information is the raw material for creating new knowledge and new discoveries and for developing new products and services; however, most existing data mining tools are not suited to making the most of geodata. Many data mining tools claim to work well with any data, but this overlooks the fact that geographical data are different and special. If the data riches created by GIS are to be fully used, it seems essential to try to develop a Geographical Data Mining (GDM) technology that meets at least some of the needs of exploratory spatial analysis. This review paper examines the issues and outlines some possible solutions.

It is well known that GIS data consist of three broad classes of data types: (1) geographical coordinates, (2) temporal coordinate(s), and (3) multivariate attributes relating to the geographical entities. The three types are measured on unrelated scales. The traditional difficulty is that most geographical analysis starts only after necessary, but data-damaging, decisions about data selection have already strangled the data. Indeed, most users are so pre-conditioned by traditional thinking that they seldom realize how they have unwittingly harmed the unknown patterns and structures that once existed in their data. The problem is that the spatial patterns found in geographic map space depend on decisions made in temporal space and in multivariate data space, and vice versa. For example, if you study disease data for the wrong time period, you might find that the disease does not cluster. Change the time period and it does cluster. Unfortunately, the choice of time period is subjective, and the problem gets worse. Change the definition of the disease (i.e., add or subtract one or two possibly related types) and you may get totally different patterns. Unfortunately, the choice of disease definition is also subjective. Change both the disease classification and the time period and you may even discover different spatial patterns in different parts of the map. None of this is easy, and as you gain access to more and more data at finer and finer levels of resolution, the problems become more severe. Once, data were so restricted that you had no choice but to analyze whatever you had access to. Now, thanks to developments in Information Technology (IT) and GIS, you have so much choice that conventional tools cannot cope without making the most outrageous data reduction decisions based on ignorance.
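The dependence of an apparent cluster on the time period and disease definition can be illustrated with a minimal Python sketch. The case records, the circle centre and radius, and the function name `cases_in_circle` are all hypothetical and chosen only for illustration; they do not come from any real data set or from the paper.

```python
from math import hypot

# Hypothetical case records: (x, y, year, disease_type).
# Synthetic values, for illustration only.
CASES = [
    (1.0, 1.0, 1990, "A"),
    (1.2, 0.9, 1990, "A"),
    (0.9, 1.1, 1991, "B"),
    (1.1, 1.2, 1991, "A"),
    (5.0, 5.0, 1990, "A"),
    (7.5, 2.0, 1992, "B"),
    (3.0, 8.0, 1993, "A"),
    (1.0, 0.8, 1994, "B"),
]


def cases_in_circle(cases, centre, radius, years, types):
    """Count cases inside a circle, restricted to an inclusive
    year range and a set of disease types."""
    cx, cy = centre
    first, last = years
    return sum(
        1
        for x, y, year, kind in cases
        if hypot(x - cx, y - cy) <= radius
        and first <= year <= last
        and kind in types
    )


centre, radius = (1.0, 1.0), 0.5

# Same place on the map, three different (subjective) selections.
narrow = cases_in_circle(CASES, centre, radius, (1990, 1990), {"A"})
wider_time = cases_in_circle(CASES, centre, radius, (1990, 1994), {"A"})
wider_both = cases_in_circle(CASES, centre, radius, (1990, 1994), {"A", "B"})

print(narrow, wider_time, wider_both)  # prints: 2 3 5
```

The same circle holds two, three, or five cases depending only on the time window and the disease types selected, which is the sensitivity described above in miniature.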

Currently, there is only one method known to the author that has been developed to explore all three spaces simultaneously: the so-called Space Time Attribute Creature (STAC), dating from the early 1990s. This paper describes the original STAC idea, how it worked, and how it can be generalized and improved. Consideration is given to devising practical solutions to the problems of parameterizing the search process, leaving a trail that can be visualized in hyperspace, devising an appropriate objective function, searching for multiple near optima, handling data uncertainty, and coping with the results. The hypersearch method uses fuzzy logic to handle some of these problems. The outcome is a system called Geographical Data Miner (GDM/1). It is tested on both real and synthetic data sets.
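As a rough illustration of what searching all three spaces at once might involve, the sketch below runs an exhaustive coarse grid search over circle centre, radius, time window, and attribute subset, and keeps the best few candidates. This is not the STAC or GDM/1 algorithm, whose details are not given here. The data are synthetic, the objective (observed count minus the count expected if cases were spread uniformly over space, time, and type, ignoring edge effects) is an assumption made for this example, and all names (`Case`, `Hit`, `excess`, `search`) are hypothetical.

```python
import heapq
from itertools import product
from math import hypot, pi
from typing import NamedTuple


class Case(NamedTuple):
    x: float
    y: float
    year: int
    kind: str


class Hit(NamedTuple):
    score: float
    centre: tuple
    radius: float
    years: tuple
    kinds: frozenset


SIDE = 10.0                # assumed square study region, 10 x 10 units
YEARS = range(1990, 1995)  # assumed study period, 1990 to 1994
KINDS = ("A", "B")         # assumed attribute categories

# Synthetic data: six type-A cases planted near (2, 2) in 1991-1992,
# plus six scattered background cases.
CASES = [
    Case(2.8, 2.0, 1991, "A"), Case(1.2, 2.0, 1991, "A"),
    Case(2.0, 2.8, 1991, "A"), Case(2.0, 1.2, 1992, "A"),
    Case(2.3, 2.3, 1992, "A"), Case(1.7, 1.7, 1992, "A"),
    Case(8.0, 8.0, 1990, "B"), Case(6.0, 1.0, 1993, "A"),
    Case(9.0, 3.0, 1994, "B"), Case(5.0, 7.0, 1990, "A"),
    Case(7.0, 5.0, 1992, "B"), Case(1.0, 9.0, 1993, "A"),
]


def excess(cases, centre, radius, years, kinds):
    """Observed minus expected count for one space-time-attribute selection.
    The expectation assumes uniformity in all three spaces (an assumption)."""
    first, last = years
    observed = sum(
        1
        for c in cases
        if hypot(c.x - centre[0], c.y - centre[1]) <= radius
        and first <= c.year <= last
        and c.kind in kinds
    )
    expected = (
        len(cases)
        * (pi * radius ** 2 / SIDE ** 2)
        * ((last - first + 1) / len(YEARS))
        * (len(kinds) / len(KINDS))
    )
    return observed - expected


def search(cases, top_k=5):
    """Evaluate every grid combination and return the top_k by score."""
    centres = product(range(11), repeat=2)
    radii = (1.0, 2.0)
    windows = [(a, b) for a in YEARS for b in YEARS if a <= b]
    subsets = [frozenset("A"), frozenset("B"), frozenset("AB")]
    hits = (
        Hit(excess(cases, centre, radius, years, kinds),
            centre, radius, years, kinds)
        for centre, radius, years, kinds
        in product(centres, radii, windows, subsets)
    )
    return heapq.nlargest(top_k, hits, key=lambda h: h.score)


results = search(CASES)
best = results[0]
print(best.centre, best.radius, best.years, sorted(best.kinds))
```

On this synthetic data the top candidate is the unit circle at (2, 2) for 1991 to 1992 and type A only, which recovers the planted cluster. The remaining candidates are mostly overlapping variants of the same region, which hints at why searching for multiple distinct near optima and coping with the results need separate treatment. An exhaustive grid is only feasible at toy scale; a realistic search would need a smarter strategy.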