InfoStats: a Library of Simple Statistical Functions for Use with Info Files in Arc/Info

Julii Brainard
School of Environmental Sciences, University of East Anglia, Norwich, United Kingdom

Many authors have criticised the lack of statistical functionality in most GIS, including Rhind (1988), Burrough (1990), Openshaw (1990), and Goodchild et al. (1992), and the issue continues to be a topic of presentations at international conferences (e.g. Karimi, 1996). Anselin and Getis (1992) and Fedra (1993) suggested ways of improving spatial analysis in GIS. A central theme in these papers is tight versus loose coupling, which refers to how closely GIS and spatial analysis are integrated. There is some debate about which approach is preferable, but in practice loose coupling is the more common reality for users. Meanwhile, efforts to 'tighten' the coupling have been relatively few and tend to have limited utility.

Some GIS have been deliberately designed to enable the addition of advanced analytical modules (e.g. Idrisi, Eastman, 1993; GRASS, U.S. Army Corps of Engineers, 1993). Examples of this approach to developing tools for spatial analysis can be found in Maclennon (1990) and Dodson (1993). But many other GIS, including some of the industry's leaders, are poorly equipped for the incorporation of users' own programs.

Much attention has focused on ESRI's Arc/Info package, probably because of its widespread penetration in many market areas, particularly research environments. Numerous efforts have been made to introduce better spatial and analytical tools into Arc/Info. Foremost among these is the Splus-Arc/Info link, S+GISLINK, distributed by StatSci and ESRI, which includes utilities to transfer data between the two systems (ESRI, 1996). The primary strength of S+GISLINK is that it makes all of the capabilities of the Splus software available for use on Arc/Info data, but the cost is prohibitive: it requires both a separate software license and facility in using Splus. An earlier endeavour was made by Ding and Fotheringham (1992), who produced the Spatial Analysis Module (SAM) software. Most of SAM's operations revolved around the calculation of Moran's coefficient and related proximity measures. Ongoing work is also being undertaken in the UK by Regional Research Laboratory staff (Gatrell et al., 1994) to develop a Spatial Analysis Toolkit (SAT) for use with Arc/Info. The initial version of SAT was designed to perform point pattern analysis, auto-correlation tests, zoning algorithms and modelling of raised incidences. Further aims of the project are to include probability mapping, clustering algorithms and "spatial regression using a neural net" (p. 16).

Ambitious ventures like SAM and SAT are certainly worthy of praise, but they still will not enable Arc/Info to calculate a simple correlation coefficient. Rudimentary operations, such as finding quartiles for an item in an attribute table, testing that a distribution is normal, or performing a simple OLS regression, are equally beyond most GIS.
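To make concrete how small these operations are, the following is a minimal Python sketch of quartiles and a Pearson correlation coefficient for columns of attribute values. It is an illustration only and is not InfoStats code (which is written in AML and C); the function names, the linear-interpolation rule for quartiles, and the sample values are all assumptions made for the example.

```python
import math


def quartiles(values):
    """Return (Q1, median, Q3) by linear interpolation between order statistics."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")

    def quantile(p):
        pos = p * (n - 1)            # fractional index into the sorted data
        lo = math.floor(pos)
        hi = min(lo + 1, n - 1)
        return data[lo] + (pos - lo) * (data[hi] - data[lo])

    return quantile(0.25), quantile(0.5), quantile(0.75)


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("columns must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical attribute columns, invented purely for illustration.
    travel_minutes = [10, 20, 30, 40, 50]
    visits = [90, 70, 65, 40, 30]
    print(quartiles(travel_minutes))
    print(pearson_r(travel_minutes, visits))
```

Other quartile conventions exist and give slightly different values on small samples, so the interpolation rule here should be read as one reasonable choice rather than the one InfoStats uses.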

The omission of even the most elementary functions, particularly in a widely used GIS, can partly be justified by the nature of spatial information. Geographically referenced data tends to be severely auto-correlated, which violates assumptions required by standard statistical techniques. However, recognition of this has not deterred analysts from employing these methods. This situation could be blamed on ignorance on the part of the users, but it also reflects pragmatic difficulties in applying techniques that can properly handle auto-correlated data. Furthermore, to a certain extent, conventional methods can still produce robust, if limited, descriptions of the data and phenomena under study. In short, there is a demonstrated demand for simple statistical functionality in GIS. Admittedly, measures produced without consideration of spatial relationships and collinearity must be used with caution, but their shortcomings do not make them worthless.
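The degree of spatial auto-correlation referred to here is commonly summarised by Moran's coefficient, $I = \frac{n}{W}\,\frac{\sum_i \sum_j w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_i (x_i-\bar{x})^2}$, where $W$ is the sum of all weights $w_{ij}$. The Python sketch below computes it for a small set of zones; it is illustrative only, and the function name, the binary adjacency weights and the zone values are assumptions rather than anything taken from SAM, SAT or InfoStats.

```python
def morans_i(values, weights):
    """Moran's I for zone values and a square spatial weights matrix.

    weights[i][j] is the weight linking zone i to zone j (0 if unrelated).
    """
    n = len(values)
    if n < 2 or any(len(row) != n for row in weights) or len(weights) != n:
        raise ValueError("weights must be an n x n matrix matching values")
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    denom = sum(d * d for d in dev)
    if w_total == 0 or denom == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    # Cross-products of deviations, weighted by spatial proximity.
    numer = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_total) * (numer / denom)


if __name__ == "__main__":
    # Four hypothetical zones in a row; neighbours share a boundary.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(morans_i([1, 2, 3, 4], chain))   # smooth gradient: positive I
    print(morans_i([1, 4, 1, 4], chain))   # alternating values: negative I
```

Values of I well above zero indicate that neighbouring zones resemble one another, which is the situation in which conventional significance tests become least trustworthy.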

I have created a small library of functions, InfoStats, for making basic statistical queries of an Arc/Info database. It combines AML (the macro language provided for use with Arc/Info) code with a library of C routines developed by ESRI staff (Stellhorn, 1994) that enables direct access to Info files. InfoStats includes a test for normality, scatter plots, pairwise plots, derivation of quantiles, a correlation coefficient, t-test and kappa statistics, and ordinary least squares and Poisson regression. The library is designed to be integrated relatively painlessly, and potentially seamlessly, into a user's repertoire of commands.
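As an indication of what the ordinary least squares function involves, here is a minimal Python sketch of a one-predictor OLS fit on two attribute columns. It is not the InfoStats implementation, whose interface is not described here; the function name, the return values and the sample columns are hypothetical.

```python
def ols_fit(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("columns must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor column is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Goodness of fit from residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return intercept, slope, r_squared


if __name__ == "__main__":
    # Hypothetical columns, invented for illustration only.
    travel_minutes = [10, 20, 30, 40, 50]
    visit_rate = [9.0, 7.2, 6.1, 4.3, 2.9]
    intercept, slope, r2 = ols_fit(travel_minutes, visit_rate)
    print(f"intercept={intercept:.3f} slope={slope:.3f} R^2={r2:.3f}")
```

A fit of this kind ignores spatial relationships between observations, so its standard errors and R-squared should be read with the cautions about auto-correlation in mind.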

An empirical example of how InfoStats can be used in a specific application area will be described. A study was made to model visitors to a recreational woodland near Thetford, UK. Demographic measures from the 1991 Census, travel times and an index for substitute forest areas were determined on a ward level for the areas around the study site. Using InfoStats, simple queries could be made of these variables, and eventually a statistical model was derived to predict visitation rates to this woodland using these determinants.

There are qualifications to be made about this simplistic approach to spatially referenced data. I would argue, however, that it makes little sense to develop higher order functions when few GIS can do the basics, and that InfoStats provides much needed tools. It is also a valid starting point for more complex analyses. This version of InfoStats is restricted in that it can only analyse Info (attribute) files, and not geometric features such as lines and points. I hope to expand the library in the future to include data that is more explicitly spatially referenced. The presentation will conclude with some discussion of how InfoStats will be distributed.

Anselin, L. and Getis, A. 1992. "Spatial statistical analysis and geographic information systems", The Annals of Regional Science, 26(1), 19-33.

Burrough, P.A. 1990. "Methods of Spatial Analysis in GIS", International Journal of Geographical Information Systems, 4(3), 221.

Ding, Y. and Fotheringham, A.S. 1992. "The integration of Spatial Analysis and GIS", Computers, Environment and Urban Systems, 16, 3-19.

Dodson, D. (ed) 1993. A Laboratory Guide for the Integrated SpaceStat, Idrisi Environment, Technical Report 93-5, NCGIA, University of California at Santa Barbara.

Eastman, J.R. (ed) 1994. Idrisi 4.1, Graduate School of Geography, Clark University, Worcester, MA.

ESRI (Environmental Systems Research Institute) 1996. "Statistical Sciences, Inc.", URL:

Fedra, K. 1993. "GIS and Environmental Modeling". In Goodchild, M.F., Parks, B.O. and Steyaert, L.T., 1993. Environmental Modeling with GIS. New York and Oxford: Oxford University Press, pp. 35-50.

Gatrell, A., Openshaw, S., Brunsdon, C., Charlton, M., Rowlingson, B. and Rao, L. 1994. "A Spatial Analysis Toolkit for ARC/INFO", paper presented at the Institute of British Geographers Annual Conference, University of Nottingham, January, 1994.

Goodchild, M., Haining, R., Wise, S. and 12 others 1992. "Integrating GIS and spatial data analysis: problems and possibilities", International Journal of Geographical Information Systems, 6(5), 407-424.

Karimi, H.A. 1996. "Analysis of Strategies for Integrating Environmental Models with GIS", paper presented at GIS '96, Vancouver, 19-21 March.

Maclennon, M. 1990. Second-order analysis of point patterns in GRASS, NCGIA Research Report, University of California, Santa Barbara.

Openshaw, S.J. 1990. "Spatial analysis and geographical information systems: a review of progress and possibilities". In Scholten, H.J. and Stillwell, J.C.H. (eds) 1990. Geographical Information Systems for Urban and Rural Planning. Dordrecht: Kluwer Academic Publishers, pp. 153-163.

Rhind, D. 1988. "A GIS Research Agenda", International Journal of Geographical Information Systems, 2(1), 23-28.

Stellhorn, T. 1994. "A set of C language Functions for Accessing INFO Files", available from the author or using anonymous ftp as part of:

U.S. Army Corps of Engineers. 1993. GRASS 4.1 Reference Manual, U.S. Army Corps of Engineers, Construction Engineering Laboratories, Champaign, Illinois.