Spatial Data Quality Control through the use of an Active Repository

Sophie Cockcroft
University of Otago
Email: scockcroft@infoscience.otago.ac.nz



Abstract

The issue of spatial data quality is of concern to researchers and practitioners in the field of Spatial Information Systems (SIS). The results of any spatial analysis depend heavily on the data on which the analysis is based. Despite this, the users of most spatial data sets have no idea of the accuracy of the data contained within them. They base their subsequent analyses on the assumption that the data are error free or that errors are kept to an 'acceptable' level. To have confidence in the results of analysis, it is imperative that a facility for reporting the data quality of the dataset is provided so that error levels can be monitored. To this end it is now becoming common for data providers to furnish their clients with metadata (that is, data about data) on quality, lineage and age. Data quality research issues in SIS include topological consistency; consistency between spatial and attribute data; and consistency between spatial objects' representation and their true representation on the ground. The last category is subdivided into spatial accuracy and attribute accuracy. To some extent these errors can be reduced by adopting a more rigorous approach to integrity constraint management, including the imposition of constraints upon data entered into the database. This paper describes an integrated development environment for SIS. At the core of this environment is an active repository that stores and maintains integrity constraints defined by the SIS developer. The result of such an approach is control over the quality of data captured by the resulting systems and the facility to report on the quality of the data stored within their databases.

Keywords: Spatial Data Quality, Integrity Constraints, Repository.

1. Introduction

Most users of spatial information have no idea of the quality of the data contained within the databases they use, and yet they rely on the outputs. Various authors have highlighted the folly of such an approach. The problem was identified as early as 1990 by Marble (1990), who commented: "Incredible as it may seem, spatial databases costing hundreds of thousands of dollars to create are being distributed in a form that severely threatens their integrity. Nothing is known about the origins of the data, the digitising protocols used, subsequent "corrections" etc". There are two distinct research agendas in addressing this problem of data quality: first, to arm users with knowledge of a dataset's quality and content, and second, to reduce the errors in the first place. One major initiative with respect to improving the level of knowledge users have about the contents of spatial datasets has been the Federal Geographic Data Committee (FGDC 1994) metadata content standard, which aims to provide a common set of definitions and terminology for the documentation of digital geospatial data. The second approach is exemplified by the work of Tanzi and Ubeda (1996), Ubeda and Servigne (1996) and Ubeda and Egenhofer (1997). In these studies topological errors, specifically those introduced by the digitisation process, were identified and categorised from a theoretical standpoint. They added weight to the argument above by noting that the unreliability of much spatial data makes spatial reasoning impossible. The subject of spatial data error, however, is a broad one that spans the whole cycle from acquisition to implementation. For the interested reader, Collins and Smith (1994) provide a full taxonomy of Geographical Information Systems (GIS) error.

The work presented here contributes both to providing information on data quality and to reducing errors. Through the use of an active repository, rules concerning "allowed" relationships between pairs of spatial objects can be stored and imposed at data entry. The means by which the repository imposes these rules is through integrity constraints. A full discussion of integrity constraints as they apply to SIS is given in (Cockcroft 1997). Data quality information concerning the resulting dataset's lineage and age can also, to some extent, be gathered automatically, reducing the data-entry overhead for those setting up geographic databases. This paper presents the results of a system development project concerned with improving SIS data quality. A repository has been developed which stores not only object and attribute details but also the business rules which apply to them. Data quality reports conforming to the FGDC (1994) standards for metadata can be produced from the repository, along with error reporting (see Figure 1). The remainder of the paper is organised as follows: section 2 describes data quality issues in the current research and outlines the areas that this research seeks to address; section 3 discusses metadata and introduces some of the emerging standards; section 4 describes the reporting facility of the integrated software engineering environment developed here, and shows how the quality reports produced from the repository support metadata standards.

 
 Figure 1 Integrated Spatial Software Engineering Environment

2. Data Quality

In this section a review of data quality issues is given. Improvement of data quality provides one of the key motivations in establishing integrity constraints in spatial databases.

2.1 Accuracy and Correctness

Accuracy can be subdivided into accuracy of attribute values, spatial references and temporal references. Positional error results in the coordinates associated with a feature being wrongly described; attribute error concerns its characteristics or qualities. Positional and attribute errors are often discussed together, but there are good reasons for dealing with them separately (Collins and Smith 1994). The main reason is that positional accuracy can be quantified against some true value, and error models are emerging for this purpose. Attribute accuracy, on the other hand, is qualitative in nature: the incorrectness of an attribute's description cannot be quantified. The implementation of integrity constraints discussed here has the potential to improve attribute accuracy. Positional accuracy, however, is not affected by the imposition of integrity constraints because the source of this error lies in measurement rather than knowledge.

Correctness concerns the completeness of, and the consistency between, the data and the original source about which the data are collected. Consistency in spatial data is given a thorough treatment in (Laurini and Milleret-Raffort 1991).

Hunter (1996) highlighted some more general quality issues. These include protecting the reputation of the data provider, minimising exposure to the risk of litigation and reducing the likelihood of product misuse through quality reporting. On the last point Hunter coined the phrase 'there is really no such thing as bad data just inappropriate data' (Hunter 1996: page 96). The example given was that of a data set with inaccurate road centre line data. For a utility manager who wanted to pinpoint the exact location of water mains this would be a severe error. For a marketing manager wanting to identify target addresses along the road in question, however, it would be insignificant. It is now becoming more common for data suppliers to provide their clients with metadata, that is data about data, on quality, lineage and age. A further discussion of this is given in section 3.

2.2 Error in spatial databases

Figure 2 illustrates the types of errors that can occur in Spatial Information Systems. Of particular concern in spatial data quality are how observations are taken, how measurements are made and input into the computer, how data are processed and how results are presented. These are represented in the bottom section of Figure 2. The problem of errors in the final product, illustrated in the top section of Figure 2, was expressed in the foreword to a recent conference (Congalton 1994: page 3):

"Funds for developing digital natural resource data bases are often meagre, and/or hard to justify. Corners are cut; little attention is paid to quality. It may be shocking to some, but themes are often digitised directly on unrectified aerial photographs, or ragged and creased paper map sheets that noone has any idea of how they were produced or where they came from [sic]. I wish I could say this situation was the exception, but it is closer to the rule"

 

Figure 2 A classification of error in Geographic Information Systems (Hunter and Beard 1992)

The errors described in the above quote, and their causes, were originally catalogued by Aronoff (1989). Collins and Smith (1994) presented them in the form shown in Table 1. The use of unrectified or poor quality maps is of concern at the stage when data is being prepared for input; this is a form of data collection error. The lack of supporting information, or metadata, for data sets has implications for the use of results in the final row of Table 1. This will be discussed in greater depth in section 3. There are also implications for data manipulation if topological integrity is not maintained. There has been some work on checking the consistency of spatial data already entered into a database as well as at data entry (Laurini and Milleret-Raffort 1991; Ubeda and Servigne 1996). Ubeda and Servigne (1996) also specified a means for detecting and correcting topological errors within existing data sets. There has also been some work on improving the results of queries through the imposition of spatial integrity constraints (Egenhofer 1994). The work presented here has most relevance to the errors presented in the second row of Table 1, data input, because at data entry database constraints can be imposed to ensure the integrity of attribute data.

Data collection
    • Inaccuracies in field measurements 
    • Inaccurate equipment 
    • Incorrect recording procedures 
    • Errors in analysis of remotely sensed data
Data input
    • Digitising error 
    • Nature of fuzzy natural boundaries 
    • Other forms of data entry
Data storage
    • Numerical precision 
    • Spatial precision (in raster systems)
Data manipulation
    • Wrong class intervals 
    • Boundary errors 
    • Spurious polygons and error propagation with overlay operations
Data output
    • Scaling 
    • Inaccurate output device
Use of results
    • Incorrect understanding of information 
    • Incorrect use of data
 

Table 1 Separation of error into time phases (Collins and Smith 1994)
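To make the data-input row of Table 1 concrete, the following minimal Python sketch shows how an attribute integrity constraint could reject a bad value at data entry. It is an illustration only; the attribute name and domain are hypothetical and do not come from the system described in this paper.

```python
# Illustrative sketch: an attribute integrity constraint checked at
# data entry, addressing the "data input" category of Table 1.
# The attribute (pipe diameter) and its domain are invented examples.

def check_attribute(value, domain):
    """Return True if the entered value lies within the allowed domain."""
    lo, hi = domain
    return lo <= value <= hi

# A repository might hold one domain constraint per attribute, e.g.
# pipe diameters limited to 50-600 mm.
PIPE_DIAMETER_MM = (50, 600)

assert check_attribute(150, PIPE_DIAMETER_MM)       # accepted at entry
assert not check_attribute(9999, PIPE_DIAMETER_MM)  # rejected at entry
```

A real system would store such domains in the repository rather than in code, but the principle of validating each attribute value before it reaches the database is the same.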

3. Metadata and Reporting of Data Quality

At its simplest level metadata is additional information that is necessary for data to be useful. This is the definition implied in much of the GIS literature. Henderson (1987) explained metadata as data that describes what the data in an organisation's databank are, and what they mean. She also classified metadata into dictionary metadata, describing characteristics, relationships and uses, and directory metadata, describing where the data is and how it can be accessed. This definition was followed by Tannenbaum (1994). Almond (1994) also followed this classification, dividing metadata into those that pertain to storage and use and those describing the informational entity being stored.

In the Geographical Information Systems context a detailed set of metadata content standards has been laid down (FGDC 1994). These are complete for the purposes of directory metadata. Dictionary metadata is also largely covered by this standard, although information concerned with processing and use is specifically excluded. This type of metadata would obviously be hard to gather and is inappropriate in many cases, although it may assist in solving data integration problems. Metadata concerning the physical representation of each value domain (often technology dependent) and the physical storage structure of aggregated data items (often arbitrary) is specifically excluded from the standard in (FGDC 1994), since the introduction states that '...the standard does not specify the means by which information is organised in a computer system or in a data transfer, nor the means by which the information is transmitted, communicated or presented to the user...'. The standard does contain reference to entity and attribute information, which forms part of the design metadata stored in the repository, but excludes data pertaining to relationships, which is considered important from a dictionary metadata point of view.

It is contended that the standards in place emphasise directory metadata above all else. The repository approach is put forward as a means of capturing both dictionary and directory metadata. One significant benefit of this approach is that the repository is active both in development and in production. This is a particularly useful feature since dictionary metadata is established prior to data entry whereas directory metadata is gathered at data entry. Reports of this metadata, produced by the repository, are given in appendix 1. It should be noted that the examples given in appendix 1 of this paper are for illustration only. Ideally, in order for data sets to be useful, a large amount of quality metadata needs to be provided. This topic has been explored thoroughly elsewhere (Diederich and Milton 1991; Anderson and Stonebraker 1994; Griffiths and Kertis 1994; Shoshani 1994; Dutton 1996; Medyckyj-Scott, Cuthbertson et al. 1996). In addition, tools have been developed for gathering it (see, for example, Everett 1994; ANZLIC 1996). Therefore the main purpose of presenting the metadata aspects of the repository is to illustrate that it is the natural place for metadata to be stored and that it is extensible to manage a wide variety of metadata should that be required.

3.1 The Main Sections for FGDC Metadata: (FGDC 1998)

Identification Information 

Basic information about the data set. Who created it? Why was it collected? What period of time is represented by the data set? Where on earth is this data anyway?

Data Quality Information

General assessment of the quality of the data set. How good is this data? Where did the attributes come from? Was the QA/QC process accurate? What kind of topology was there? What kind of media is the dataset stored on? Who was the source?

Spatial Data Organization Information

Mechanism used to represent spatial information in the data set.

Spatial Reference Information

What map projection is the data in? What are the longitude and latitude of a point in the zone? What is the projection zone number? Is this vector, point, or raster data? This section refers to the projection parameters of the data set.
  Planar Encoding Method: the means used to represent horizontal positions. ADS & MOSS data is coordinate pair; MAPS data is row & column.

Entity and Attribute Information

Information about the information content of the data set, including the entity types, their attributes, and the domains from which attribute values may be assigned. What kind of attributes do we have? What is the content of the data set? What type of entity information does the data set contain?
In the following section the reports produced by the repository are described.

4. Reporting from the repository

Figure 3 Structure of report menu

At the core of the system, as illustrated in Figure 1, is a repository. The repository stores constraints on topological relationships and attribute values; these are then imposed at data entry. Of particular interest is the user integrity constraint, which allows database consistency to be maintained according to user-defined constraints analogous to business rules in a non-spatial DBMS. For example, in a pipe network, it may be determined that a pipe of a particular material cannot intersect with a given valve under the prevailing conditions where the network is to be built. When a user attempts to enter a case where this material is used, a user rule would be activated. A full analysis of the types of rules that can be stored, and how the repository imposes them, is given in (Cockcroft 1996).
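The pipe/valve example above can be sketched as a stored rule checked at data entry. The following Python fragment is an illustrative simplification, not the repository's actual implementation; the object type names, the rule representation and the function name are all assumptions.

```python
# Hypothetical sketch of a user integrity rule: a forbidden topological
# relationship between two object types, analogous to the pipe/valve
# example in the text. All identifiers here are invented.

FORBIDDEN_RELATIONSHIPS = {
    ("pvc_pipe", "intersects", "high_pressure_valve"),
}

def violates_user_rule(type_a, relation, type_b):
    """True if entering this relationship would break a stored rule."""
    return (type_a, relation, type_b) in FORBIDDEN_RELATIONSHIPS

# Entering a PVC pipe that intersects a high-pressure valve triggers
# the rule; a steel pipe in the same position does not.
assert violates_user_rule("pvc_pipe", "intersects", "high_pressure_valve")
assert not violates_user_rule("steel_pipe", "intersects", "high_pressure_valve")
```

In the system described, such rules live in the active repository and are evaluated automatically whenever a new spatial relationship is created, rather than being hard-coded as above.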

The spatial data management system provides reports as illustrated in Figure 3. The error log report lists all the errors that have occurred, together with the user rule that was violated, the ID of the object in question and its coordinates. The FGDC subset report shows the data outlined in Table 2 below. The entity/attribute report lists all entities in a project, their attributes and any constraints on those attributes. This report fulfils the requirements of the entity and attribute section of the FGDC standard, and could be included in that report. However, it has the potential to report on a much broader range of design elements, so for the purposes of this study it is kept separate.
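The structure of an error log entry described above (rule violated, object ID, coordinates) might be sketched as follows. The field names and report format are assumptions for illustration, not the system's actual schema.

```python
# Minimal sketch of an error-log entry of the kind the error log
# report lists: the rule broken, the object's ID and its coordinates.
from dataclasses import dataclass

@dataclass
class ErrorLogEntry:
    rule: str        # the user rule that was violated
    object_id: int   # ID of the offending spatial object
    x: float         # coordinates of the object
    y: float

    def report_line(self):
        """Render one line of the error log report."""
        return f"{self.object_id}: {self.rule} at ({self.x}, {self.y})"

log = [ErrorLogEntry("pipe/valve intersection", 42, 170.5, -45.9)]
print(log[0].report_line())  # 42: pipe/valve intersection at (170.5, -45.9)
```

Because the repository logs each violation automatically, a report of this form can be produced without any extra effort from the data-entry operator.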
 

Spatial Reference

1. Name of the coordinate system 
2. Coordinate Units
3. The area units used

Identification

4. Bounding coordinates
5. Scale
6. Author
7. Date input
8. Source 
Table 2 Contents of the FGDC subset report

The data presented in this report could be more extensive, but this would necessitate the development of a more advanced metadata entry tool, which is not the main purpose of this work and in any case has been done quite adequately elsewhere; see, for example, the ANZLIC (1996) entry screen shown in Figure 4. In the case of this study the report has been restricted to information which can be collected automatically.

Figure 4 Directory metadata entry screen (ANZLIC 1996)

4.1 Automatic collection of metrics

Whilst there is no detail in the FGDC subset report about data quality, the repository has the potential to gather data automatically on the number of constraint violations, either as a proportion of the total attribute values entered (in the case of attribute constraints) or as a proportion of the total relationships entered (in the case of user rules). This is a topic for further research.

5. Conclusion

This paper describes the role of the reporting facility of an integrated software engineering environment developed for spatial information systems. Part of the philosophy of this work is that metadata is more than just a means of cataloguing datasets; it can be used in database design as well. Through more controlled data entry, the repository has the potential to improve SIS data quality. Reporting facilities provide reports on errors in data entry; this data is gathered automatically by the repository, since a log is made each time there is a constraint violation. Secondly, the repository can provide metadata describing the identity and lineage of datasets entered using the system. These two reporting facilities raise awareness of the data quality of the dataset in question and therefore inhibit inappropriate use. Finally, a full report of the repository content can be provided, which assists in database administration.

References

Anderson, J. T. and M. Stonebraker (1994). "Sequoia 2000 metadata schema for satellite images." SIGMOD Record 23(4): 42-48.

ANZLIC (1996). ANZLIC Guidelines: Core metadata elements, ANZLIC.

Aronoff, S. (1989). Geographic Information Systems: A Management Perspective. Ottawa, Canada, WDL Publications.

Cockcroft, S. (1996). Towards The Automatic Enforcement of Integrity Rules in Spatial Database Systems. SIRC'96: Proceedings of the 8th Annual Colloquium of the Spatial Information Research Centre, University of Otago, University of Otago.

Cockcroft, S. (1997). "A Taxonomy of Spatial Data Integrity Constraints." Geoinformatica 1(4): 327-343.

Collins, F. C. and J. L. Smith (1994). Taxonomy for error in GIS. International symposium on spatial accuracy in Natural Resource Data Bases "Unlocking the puzzle", Williamsburg, Virginia, American Society for Photogrammetry and Remote Sensing. 1-7

Congalton, R. G. (1994). International symposium on spatial accuracy in Natural Resource Data Bases "Unlocking the puzzle". Williamsburg, Virginia, American Society for Photogrammetry and Remote Sensing.

Diederich, J. and J. Milton (1991). "Creating domain specific metadata for scientific data and knowledge bases." 3(4): 421-434.

Dutton, G. (1996). "Improving locational specificity of map data - a multi resolution metadata driven approach and notation." International journal of geographic information systems 10(3): 253-268.

Egenhofer, M. J. (1994). "Pre-processing Queries with spatial constraints." Photogrammetric Engineering and Remote sensing 60(6): 783-790.

Everett, Y. (1994). About the Klamath metadata dictionary. Technical report, Department of Landscape Architecture University of California, Berkley

FGDC (1994). Content standards for digital Geospatial Metadata. Washington, D.C., Federal Geographic Data Committee.

FGDC (1998) http://www.fgdc.gov/index.html

Griffiths, J. M. and K. K. Kertis (1994). Intelligent, self documenting audit mechanisms and extension of metadata, IEEE metadata workshop,University of Tennessee.

Henderson, M. M. (1987). "The importance of data administration in information management." Information Management Review 2(4): 41-47.

Hunter, G. and M. Beard (1992). "Understanding error in spatial databases." The Australian Surveyor 37(2): 108-119.

Hunter, G. J. (1996). Management issues in GIS: Accuracy and Data Quality. Conference on managing Geographic Information Systems for success, Melbourne, Australia, Aurisa.

Laurini, R. and F. Milleret-Raffort (1991). Using integrity constraints for checking consistency of spatial databases. GIS/LIS 91, Atlanta, Georgia. 634-642

Marble, D. F. (1990). The extended data dictionary: A critical element in building viable spatial databases. 11th annual ESRI user conference. 245-261

Medyckyj-Scott, D., M. Cuthbertson, et al. (1996). "Discovering environmental data: metadatabases, network information, resource tools and the GENIE system." International Journal of Geographical Information Systems 10(1): 65-84.

Shoshani, A. (1994). Experience with creating a metadatabase and a metadata browsing tool. IEEE workshop on metadata for scientific and technical data management, Washington, May 1994.

Tannenbaum, A. (1994). Implementing a corporate repository. New York, John Wiley and Sons.

Tanzi, T. and T. Ubeda (1996). "Contrôle topologique de la cohérence dans les bases de données géographiques." Revue Internationale de Geomatique 5(2): 131-155.

Ubeda, T. and M. Egenhofer (1997). Topological Error Correcting in GIS. Advances in Spatial Databases 5th International Symposium, SSD '97, Berlin, Germany, Springer Verlag. 283-297

Ubeda, T. and S. Servigne (1996). Geometric and Topological Consistency of Spatial Data. 1st International Conference on Geocomputation, Leeds UK. 830-842

 

Appendix

Error Log Report


 

FGDC Subset Report

Repository content Report