The advances in the field of data warehousing have led to the creation of huge integrated (usually relational) databases. Organisations that have access to these databases are now using data-mining/knowledge discovery tools to discover hidden relationships between entities of interest to them. At the same time, advances in the field of Geographic Information Systems have led to the availability of equally huge spatial databases, and again data-mining/knowledge discovery tools and techniques have been applied to spatial databases in recent years. There is obviously a synergistic effect in applying knowledge discovery techniques to a combination of these two kind of databases. Example questions are: Is there a correlation between the dietary habits of people and the distances of their home from the sea; Do people change their shopping habits depending on local weather conditions (i.e. - shop closer to home or in covered arcades when it snows); 'Do people who live close to sea-beaches and buy pet food go on fewer holidays'. Such correlations could be discovered in a spatio-temporal database that holds the non-spatial attributes related to sales or (with some difficulty) by querying an appropriate spatial database and a non-spatial database that holds temporal sales data and locational references (e.g. postcodes). Though the first option is the more promising one, there are few - if any - such databases available. The second option poses a problem because of the data being located in two kinds of data structures in two databases. One way to overcome this difficulty would be to copy the required information from one database to the other. However, our aim is to discover hitherto unknown relationships based on the database, so we do not know a priori what data to copy from one database to another. Furthermore we may need to consider several attributes together (as in the case of the third question above) in order to discover interesting correlations.
Previous work in related areas A number of researchers have worked in the area of data-mining/knowledge discovery in general, and knowledge discovery in spatial databases in particular. A good overview of current developments and future directions can be found in the papers by Ferguson (1995) and Piatetsky-Shapiro (1993).
A study of Discovery of Association Rules in Spatial Databases, by Koperski et al. (1995) discusses the issue of discovering association rules (as opposed to Characteristic and Discriminant rules). In their approach, the spatial objects are held in one database and the other attributes are held in another relational database, but in Step I of the algorithm the (user-identified) attributes/relations of interest are copied to a unified database. Ester et al. (1995) stress the importance of an efficient interface to the spatial database, to avoid the need for copying attributes to a specialised database. The need of avoiding the use of a knowledge base or a user interface to pre-identify areas of interest and providing a concept hierarchy, is also stressed. An architecture similar to that of Matheus et al. (1993) is proposed in their approach.
The techniques developed by both Koperski et al. (1995) and Ester et al. (1995) could be used in our proposed approach to make the operations on the Spatial database efficient by elimination of non-promising routes in the early stages of the search process.
Beaubouef et al. (1993) have developed a Rough set extension of Relational databases. This approach could be used to achieve generalisation of values of attributes similar to the kind achieved by a concept hierarchy.
Openshaw et al. (1995) examine the use of neural net based classifiers and data optimal segmentation techniques for the discovery of patterns from socio-economic data. The segmentation technique proposed could be used for knowledge discovery from geodemographic data.
Openshaw (1995) suggests the possibility of using artificial intelligence methods for exploration of spatial data.
In this paper we present a two step hypothesis testing methodology which could use an extension of the architecture proposed in Ester et al. (1995). The two steps are:
Conversely, the classification operation corresponding to a hypothesis may be performed in the spatial database and tested in the non-spatial one. For example - a subset of postcodes of locations that had heavy snow on a particular day may be created from the spatial database and used to discover association rules in the non-spatial sales database. Thus leading to the discovery of the correlation implied by the second question.
The generated rules can be screened using methods for testing 'Interestingness', as suggested by Piatetsky-Shapiro et al. (1994) and/or by the user depending on the complexity of the rules and the number of rules generated.
This approach avoids the requirement of holding the non-spatial attributes in the spatial database and the alternative approach of copying spatial relationships as attributes into the non-spatial database. It also has the advantage of restricting its operation to a small subset of the hypothesis testing database, thereby improving response times. For example - the postcodes in the two subsets used to test the no-of-holidays hypothesis, are a very small proportion of postcodes in the whole country. Thus expensive spatial operations like close to are performed only on the small subsets.
The paper will discuss the details of the methodology and how it relates to previous work.
Koperski, K., Han, J. 1995. "Discovery of Spatial Association Rules in Geographic Information Databases", Advances in Spatial Databases - Proceedings of 4th International Symposium, SSD 95, Portland, ME, 47-66.
Ester, M., Kriegel, H., Xu, X. 1995. "Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification", Advances in Spatial Databases - Proceedings of 4th International Symposium, SSD 95, Portland, ME, 67-82.
Beaubouef, T., Petry, F.E. 1993. "A Rough Set Model for Relational Databases". Rough Sets, Fuzzy Sets and Knowledge Discovery - Proceedings of the International Workshop on Rough Sets and Knowledge Discovery, RSKD 93, Banff, Alberta, Canada, 100-107.
Piatetsky-Shapiro, G., Matheus, C.J. 1994. "The Interestingness of Deviations", Proceedings of 1994 Workshop on Knowledge Discovery in Databases, Seatle, WA, 25-36.
Matheus, C.J., Chan, P.K., Piatetsky-Shapiro, G. 1993. "Systems for Knowledge Discovery in Databases", IEEE Transactions on Knowledge and Data Engineering, 5(6), 903-913.
Ferguson, M. 1995. "Tools and Techniques for Analysing and Mining Warehouse data", Info DB, 9(3), 13-18.
Openshaw, S., Blake, M. 1995. "Geodemographic Segmentation Systems for Screening Health Data", Journal of Epidemiology and Community Health, 49(S2), S34-S38.
Openshaw, S. 1995. "Developing Automated and Smart Spatial Pattern Exploration tools for Geographical Information Systems Applications", Statistician, 44(1), 3-16.
Piatetsky-Shapiro, G. 1993. "An Overview of Knowledge Discovery in Databases -Recent Progress and Challenges", Rough Sets, Fuzzy Sets and Knowledge Discovery - Proceedings of the International Workshop on Rough Sets and Knowledge Discovery, RSKD 93, Banff, Alberta, Canada, 1-10.