Keywords: Spatial Data Quality, Integrity Constraints, Repository.
The work presented here contributes both to providing information on data quality and to reducing errors. Through the use of an active repository, rules concerning "allowed" relationships between pairs of spatial objects can be stored and imposed at data entry. The repository imposes these rules through integrity constraints. A full discussion of integrity constraints as they apply to SIS is given in (Cockcroft 1997). Data quality information concerning the resulting dataset's lineage and age can also, to some extent, be gathered automatically, reducing the data entry overhead for those setting up geographic databases.
This paper presents the results of a system development project concerned with improving SIS data quality. A repository has been developed which stores not only object and attribute details but also the business rules which apply to them. Data quality reports conforming to the FGDC (FGDC 1994) standards for metadata can be produced from the repository, along with error reports (see Figure 1). Section 2 describes data quality issues in the current research and outlines the areas that this research seeks to address. Section 3 discusses metadata and introduces some of the emerging standards. Section 4 discusses the reporting facility of the integrated software engineering environment developed here and describes how the quality reports produced from the repository support metadata standards.
Figure 1 Integrated Spatial Software Engineering Environment
Correctness concerns the completeness of, and the consistency between, the data and the original source about which the data are collected. Consistency in spatial data is given a thorough treatment in (Laurini and Milleret-Raffort 1991).
Hunter (Hunter 1996) highlighted some more general quality issues. They include protecting the reputation of the data provider, minimising the exposure to risk of litigation and reducing the likelihood of product misuse through quality reporting. On the last point Hunter coined the phrase 'there is really no such thing as bad data just inappropriate data' (Hunter 1996: page 96). The example given was that of a data set with inaccurate road centre line data. For a utility manager who wanted to exactly pinpoint the location of water mains this would be a severe error. For a marketing manager wanting to identify target addresses along the road in question, however, this would be insignificant. It is now becoming more common for data suppliers to provide their clients with metadata, that is data about data, on quality, lineage and age. A further discussion of this will be given in section 3.
"Funds for developing digital natural resource data bases are often meagre, and/or hard to justify. Corners are cut; little attention is paid to quality. It may be shocking to some, but themes are often digitised directly on unrectified aerial photographs, or ragged and creased paper map sheets that noone has any idea of how they were produced or where they came from [sic]. I wish I could say this situation was the exception, but it is closer to the rule"
Figure 2 A classification of error in Geographic Information Systems (Hunter and Beard 1992)
The errors described in the above quote, and their causes, were originally catalogued by Aronoff (Aronoff 1989). Collins and Smith (Collins and Smith 1994) presented them in the form shown in Table 1. The use of unrectified or poor-quality maps is of concern at the stage when data is being prepared for input. This is a form of data collection error. The lack of supporting information, or metadata, for data sets has implications for the use of results in the final row of Table 1. This will be discussed in greater depth in section 3. There are also implications for data manipulation if topological integrity is not maintained. There has been some work on checking the consistency of spatial data already entered into a database as well as at data entry (Laurini and Milleret-Raffort 1991; Ubeda and Servigne 1996). Ubeda and Servigne's research (Ubeda and Servigne 1996) also specified a means for detecting and correcting topological errors within existing data sets. There has also been some work on improving the results of queries through the imposition of spatial integrity constraints (Egenhofer 1994). The work presented here has most relevance to the errors presented in the second row of Table 1, data input, because database constraints that ensure the integrity of attribute data can be imposed at data entry.
Data collection |
|
Data input |
|
Data storage |
|
Data manipulation |
|
Data output |
|
Use of results |
|
In the Geographical Information Systems context a detailed set of metadata content standards has been laid down (FGDC 1994). These are complete for the purposes of directory metadata. Dictionary metadata is also largely covered by this standard, although information concerned with processing and use is specifically excluded. This type of metadata would be hard to gather and is inappropriate in many cases, although it may assist in solving data integration problems. Metadata concerning the physical representation of each value domain (often technology dependent) and the physical storage structure of aggregated data items (often arbitrary) is also excluded from the standard (FGDC 1994), whose introduction states that '...the standard does not specify the means by which information is organised in a computer system or in a data transfer, nor the means by which the information is transmitted, communicated or presented to the user...'. The standard does refer to entity and attribute information, which forms part of the design metadata stored in the repository, but it excludes data pertaining to relationships, which is considered important from a dictionary metadata point of view.
It is contended that the standards in place emphasise directory metadata above all else. The repository approach is put forward as a means of capturing both dictionary and directory metadata. One significant benefit of using this approach is that the repository is active both in development and in production. This is a particularly useful feature since dictionary metadata is established prior to data entry whereas directory metadata is gathered at data entry. Reports of this metadata, produced by the repository, are given in appendix 1. It should be noted that the examples given in appendix 1 of this paper are for illustration only. Ideally, in order for data sets to be useful, a large amount of quality metadata needs to be provided. This topic has been explored thoroughly elsewhere (Diederich and Milton 1991; Anderson and Stonebraker 1994; Griffiths and Kertis 1994; Shoshani 1994; Dutton 1996; Medyckyj-Scott, Cuthbertson et al. 1996). In addition, tools have been developed for gathering it, for example (Everett 1994; ANZLIC 1996). Therefore the main purpose of presenting the metadata aspects of the repository is to illustrate that it is the natural place for metadata to be stored and that it is extensible to manage a wide variety of metadata should that be required.
Identification Information | Basic information about the data set. Who created it? Why was it collected? What period of time is represented by the data set? Where on earth is this data anyway?
Data Quality Information | General assessment of the quality of the data set. How good is this data? Where did the attributes come from? Was the QA/QC process accurate? What kind of topology was there? What kind of media is the dataset stored on? Who was the source?
Spatial Data Organization Information | Mechanism used to represent spatial information in the data set.
Spatial Reference Information | Refers to the projection parameters of the data set. What map projection is the data in? What is the Longitude and Latitude of Point in Zone? What is the Projection Zone Number? Is this vector, point, or raster data? Planar Encoding Method: the means used to represent horizontal positions. ADS & MOSS data is coordinate pair. MAPS data is row & column.
Entity and Attribute Information | Information about the information content of the data set, including the entity types, their attributes, and the domains from which attribute values may be assigned. What kind of attributes do we have? What is the content of the data set? What type of entity information does the data set contain?
At the core of the system, as illustrated in Figure 1, is a repository. The repository stores constraints on topological relationships and attribute values; these are then imposed at data entry. Of particular interest is the user integrity constraint, which allows database consistency to be maintained according to user-defined constraints analogous to business rules in a non-spatial DBMS. For example, in a pipe network, it may be determined that a pipe of a particular material cannot intersect with a given valve under the prevailing conditions where the network is to be built. When a user attempts to enter a case where this material is used, a user rule is activated. A full analysis of the types of rules that can be stored, and how the repository imposes them, is given in (Cockcroft 1996).
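The pipe and valve example can be made concrete with a minimal sketch. It is an illustration under stated assumptions and not the repository implementation described in (Cockcroft 1996): the class names `SpatialObject`, `UserRule` and `Repository`, the attribute values, and the use of bounding-box overlap as a stand-in for a true topological intersection test are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class SpatialObject:
    obj_id: str
    entity: str                      # entity class, e.g. "pipe" or "valve"
    bbox: BBox
    attributes: Dict[str, str] = field(default_factory=dict)


def intersects(a: SpatialObject, b: SpatialObject) -> bool:
    """Bounding-box overlap; an assumed stand-in for a real topological test."""
    ax0, ay0, ax1, ay1 = a.bbox
    bx0, by0, bx1, by1 = b.bbox
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


@dataclass
class UserRule:
    """A forbidden relationship between two classes of spatial object."""
    name: str
    entity_a: str
    condition_a: Dict[str, str]      # attribute values that activate the rule
    entity_b: str
    condition_b: Dict[str, str]
    forbidden: Callable[[SpatialObject, SpatialObject], bool]

    def violated_by(self, a: SpatialObject, b: SpatialObject) -> bool:
        def matches(obj: SpatialObject, entity: str, cond: Dict[str, str]) -> bool:
            return obj.entity == entity and all(
                obj.attributes.get(k) == v for k, v in cond.items())
        return (matches(a, self.entity_a, self.condition_a)
                and matches(b, self.entity_b, self.condition_b)
                and self.forbidden(a, b))


class Repository:
    """Stores rules and checks every new object against them at data entry."""

    def __init__(self, rules: List[UserRule]):
        self.rules = list(rules)
        self.objects: List[SpatialObject] = []
        self.error_log: List[dict] = []

    def insert(self, new: SpatialObject) -> bool:
        for rule in self.rules:
            for old in self.objects:
                # Check both orderings, since either object may be entered first.
                if rule.violated_by(new, old) or rule.violated_by(old, new):
                    self.error_log.append(
                        {"rule": rule.name, "object_id": new.obj_id, "bbox": new.bbox})
                    return False     # entry rejected, violation logged
        self.objects.append(new)
        return True


# Hypothetical rule: a PVC pipe must not intersect a gate valve.
rule = UserRule("pvc_pipe_must_not_meet_gate_valve",
                "pipe", {"material": "PVC"}, "valve", {"type": "gate"}, intersects)
repo = Repository([rule])
repo.insert(SpatialObject("V1", "valve", (0, 0, 1, 1), {"type": "gate"}))
repo.insert(SpatialObject("P1", "pipe", (0.5, 0.5, 3, 0.6), {"material": "PVC"}))
```

In this sketch the second insert is rejected and a log entry is written, whereas the same pipe in a different material would be accepted. Keeping the rule as data, separate from the entry code, is what allows new rules to be added without changing the data entry routine.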
The spatial data management system provides the reports illustrated in
Figure 3. The error log report lists all the errors that have occurred,
giving for each the user rule that was violated, the ID of the object
in question and its coordinates. The FGDC subset report shows the data
outlined in Table 2 below. The entity/attribute report lists
all entities in a project, their attributes and any constraints on those
attributes. This report fulfils the requirements of the entity and attribute
section of the FGDC standard, and could be included in that report. However,
it has the potential to report on a much broader range of design elements,
so for the purposes of this study it is kept separate.
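As a rough illustration of the error log and entity/attribute reports, the sketch below formats both as plain text. The record layouts, field names and column order are assumptions made for the example; the paper does not specify the actual report formats, which are shown in Figure 3.

```python
def error_log_report(entries):
    """One line per error: violated rule, object ID and coordinates."""
    lines = ["RULE | OBJECT ID | COORDINATES"]
    for entry in entries:
        x, y = entry["coordinates"]
        lines.append(f"{entry['rule']} | {entry['object_id']} | ({x}, {y})")
    return "\n".join(lines)


def entity_attribute_report(entities):
    """List each entity, its attributes, and any constraint on each attribute."""
    lines = []
    for entity, attributes in entities.items():
        lines.append(entity)
        for name, constraint in attributes.items():
            lines.append(f"  {name}: {constraint or 'no constraint'}")
    return "\n".join(lines)


# Hypothetical repository contents, for illustration only.
errors = [{"rule": "pvc_pipe_must_not_meet_gate_valve",
           "object_id": "P1", "coordinates": (10.5, 20.0)}]
schema = {"pipe": {"material": "one of PVC, steel", "diameter_mm": "> 0"},
          "valve": {"type": None}}

print(error_log_report(errors))
print(entity_attribute_report(schema))
```

Both functions read only from stored records, which reflects the point made above: once rules and design details are held in the repository, reporting on them needs no additional data entry.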
Spatial Reference |
|
|
|
Identification |
|
|
|
|
|
The data presented in this report could be more extensive, but this would necessitate the development of a more advanced metadata entry tool, which is not the main purpose of this work and in any case has been done quite adequately elsewhere; see, for example, Figure 4 (ANZLIC 1996). In this study the report has been restricted to information which can be collected automatically.
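To indicate what automatic collection might involve, the following sketch derives a spatial extent and an entry-date range from the records themselves, with no input from the person entering the data. Which elements the actual system collects is not stated here, so the function name `collect_directory_metadata`, the record layout and the output keys are all hypothetical and do not claim to be FGDC element names.

```python
from datetime import date


def collect_directory_metadata(records):
    """Derive extent and entry-date range from (x, y, entered_on) records."""
    if not records:
        raise ValueError("no records to summarise")
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    dates = [r[2] for r in records]
    return {
        # Bounding coordinates of everything entered so far.
        "west": min(xs), "east": max(xs),
        "south": min(ys), "north": max(ys),
        # Age of the data set, taken from entry timestamps.
        "first_entry": min(dates).isoformat(),
        "last_entry": max(dates).isoformat(),
        "object_count": len(records),
    }


# Hypothetical entry records: x, y, and the date each object was entered.
records = [
    (170.50, -45.87, date(1997, 3, 1)),
    (170.62, -45.91, date(1997, 3, 4)),
    (170.48, -45.80, date(1997, 2, 27)),
]
metadata = collect_directory_metadata(records)
```

Elements that cannot be derived in this way, such as the purpose of collection, would still need a dedicated entry tool of the kind shown in Figure 4.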
Figure 4 Directory metadata entry screen (ANZLIC 1996)
ANZLIC (1996). ANZLIC Guidelines: Core metadata elements, ANZLIC.
Aronoff, S. (1989). Geographic Information Systems: A Management Perspective. Ottawa, Canada, WDL Publications.
Cockcroft, S. (1996). Towards The Automatic Enforcement of Integrity Rules in Spatial Database Systems. SIRC'96: Proceedings of the 8th Annual Colloquium of the Spatial Information Research Centre, University of Otago, University of Otago.
Cockcroft, S. (1997). "A Taxonomy of Spatial Data Integrity Constraints." Geoinformatica 1(4): 327-343.
Collins, F. C. and J. L. Smith (1994). Taxonomy for error in GIS. International symposium on spatial accuracy in Natural Resource Data Bases "Unlocking the puzzle", Williamsburg, Virginia, American Society for Photogrammetry and Remote Sensing. 1-7
Congalton, R. G. (1994). International symposium on spatial accuracy in Natural Resource Data Bases "Unlocking the puzzle". Williamsburg, Virginia, American Society for Photogrammetry and Remote Sensing.
Diederich, J. and J. Milton (1991). "Creating domain specific metadata for scientific data and knowledge bases." 3(4): 421-434.
Dutton, G. (1996). "Improving locational specificity of map data - a multi resolution metadata driven approach and notation." International Journal of Geographical Information Systems 10(3): 253-268.
Egenhofer, M. J. (1994). "Pre-processing Queries with spatial constraints." Photogrammetric Engineering and Remote Sensing 60(6): 783-790.
Everett, Y. (1994). About the Klamath metadata dictionary. Technical report, Department of Landscape Architecture, University of California, Berkeley.
FGDC (1994). Content standards for digital Geospatial Metadata. Washington, D.C., Federal Geographic Data Committee.
FGDC (1998) http://www.fgdc.gov/index.html
Griffiths, J. M. and K. K. Kertis (1994). Intelligent, self documenting audit mechanisms and extension of metadata, IEEE metadata workshop, University of Tennessee.
Henderson, M. M. (1987). "The importance of data administration in information management." Information Management Review 2(4): 41-47.
Hunter, G. and M. Beard (1992). "Understanding error in spatial databases." The Australian Surveyor 37(2): 108-119.
Hunter, G. J. (1996). Management issues in GIS: Accuracy and Data Quality. Conference on managing Geographic Information Systems for success, Melbourne, Australia, Aurisa.
Laurini, R. and F. Milleret-Raffort (1991). Using integrity constraints for checking consistency of spatial databases. GIS/LIS 91, Atlanta, Georgia. 634-642
Marble, D. F. (1990). The extended data dictionary: A critical element in building viable spatial databases. 11th annual ESRI user conference. 245-261
Medyckyj-Scott, D., M. Cuthbertson, et al. (1996). "Discovering environmental data: metadatabases, network information, resource tools and the GENIE system." International Journal of Geographical Information Systems 10(1): 65-84.
Shoshani, A. (1994). Experience with creating a metadatabase and a metadata browsing tool. IEEE workshop on metadata for scientific and technical data management, Washington, May 1994.
Tannenbaum, A. (1994). Implementing a corporate repository. New York, John Wiley and Sons.
Tanzi, T. and T. Ubeda (1996). "Contrôle topologique de la cohérence dans les bases de données géographiques." Revue Internationale de Geomatique 5(2): 131-155.
Ubeda, T. and M. Egenhofer (1997). Topological Error Correcting in GIS. Advances in Spatial Databases 5th International Symposium, SSD '97, Berlin, Germany, Springer Verlag. 283-297
Ubeda, T. and S. Servigne (1996). Geometric and Topological Consistency of Spatial Data. 1st International Conference on Geocomputation, Leeds UK. 830-842