Return to GeoComputation 99 Index

Developing Quality Training Data for a Statistical Decision Tree Classifier in a Spatial Environment

CROWTHER, Paul (P.Crowther@utas.edu.au), HARTNET, Jacky, WILLIAMS, Ray, PENDELBURY, Steve, University of Tasmania, School of Computing, P.O. Box 1214, LAUNCESTON, Tasmania, 7250, Australia

Key Words: training data, remote sensing, statistical classifier

One of the strengths of a remotely sensed data set is that it represents a complete spatial population; however, to make sense of this population, the most successful classifiers first require that areas of the image be selected as training data. This paper describes a method for selecting suitable training sites when it is not practical to select them using ground truth. The method was developed to provide a training data set for a statistical decision tree classifier used to analyse the National Oceanic and Atmospheric Administration's (NOAA) Advanced Very High Resolution Radiometer (AVHRR) multispectral satellite images of Antarctica. It was developed in response to a user request for a tool to build a data set for the statistical decision tree package "S-Plus". It was found that the same tool could be used to create training data for other systems.

The accuracy of a statistical classifier depends more on the quality of its training data than on the algorithm used for classification. The training data set must be representative of the whole area to be classified, and the populations of pixels used for training must be large enough to be statistically significant. This means that there is a need to know the minimum number of observations required to characterise a particular site to an acceptable level of error.
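One common way to estimate such a minimum, assuming the pixel values at a site are roughly normally distributed with a known or estimated standard deviation, is the standard sample-size formula n = (z * sigma / E)^2, rounded up. The sketch below is illustrative only: the function name, the 95% confidence default and the example figures are assumptions and are not taken from the paper.

```python
import math


def min_observations(sigma: float, margin: float, z: float = 1.96) -> int:
    """Smallest n for which the mean of n observations lies within
    `margin` of the true mean at the confidence level implied by `z`.

    sigma  -- estimated standard deviation of pixel values (assumed known)
    margin -- acceptable error, in the same units as the pixel values
    z      -- normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if sigma < 0 or margin <= 0:
        raise ValueError("sigma must be non-negative and margin positive")
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical example: standard deviation of 10 units, acceptable error of 2 units.
n_required = min_observations(sigma=10.0, margin=2.0)
```

Halving the acceptable error roughly quadruples the number of observations required, which is why the acceptable level of error has to be fixed before sampling begins.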

Ground-truthed training data are difficult to obtain in Antarctica. In other domains, such as vegetation classification, training data are developed using ground truth combined with aerial and satellite interpretation to identify representative vegetation types, which is feasible because vegetation is relatively stable over time. In Antarctica, features such as sea ice can change rapidly and cannot easily be verified on the ground; therefore, training data must be developed directly from satellite images.

The tool allows an expert image interpreter to choose sample points on an image and apply a label to each one. The label and the pixel values at that point on all bands are then stored. The user can either sample an image automatically or sample it in a directed way. In automatic sampling, the user specifies the coarseness of the sampling grid and is then asked to supply a label at each sampling point. The user can switch between image bands to help decide which label to attach to a sampling point. It is also possible to switch between automatic and user-directed sampling; this feature was added because a coarse automatic sampling grid could miss important features.
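The two sampling modes described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the image is assumed to be a list of equally sized bands, each a list of rows, and the names `grid_sample`, `directed_sample` and `label_at` are hypothetical. In the real tool the label comes from the expert interpreter; here a simple threshold rule stands in for the expert.

```python
from typing import Callable, List, Sequence, Tuple

Band = Sequence[Sequence[float]]                   # one band: rows of pixel values
Sample = Tuple[str, int, int, Tuple[float, ...]]   # (label, row, col, values on all bands)


def pixel_values(bands: Sequence[Band], row: int, col: int) -> Tuple[float, ...]:
    """Return the value of the pixel at (row, col) on every band."""
    return tuple(band[row][col] for band in bands)


def grid_sample(
    bands: Sequence[Band],
    step: int,
    label_at: Callable[[int, int, Tuple[float, ...]], str],
) -> List[Sample]:
    """Automatic mode: visit every `step`-th pixel and ask for a label."""
    height, width = len(bands[0]), len(bands[0][0])
    samples: List[Sample] = []
    for row in range(0, height, step):
        for col in range(0, width, step):
            values = pixel_values(bands, row, col)
            samples.append((label_at(row, col, values), row, col, values))
    return samples


def directed_sample(bands: Sequence[Band], row: int, col: int, label: str) -> Sample:
    """User-directed mode: record one point chosen and labelled by the user."""
    return (label, row, col, pixel_values(bands, row, col))


# Hypothetical 4 x 4 image with two bands.
band_1 = [[r * 4 + c for c in range(4)] for r in range(4)]
band_2 = [[100 + r * 4 + c for c in range(4)] for r in range(4)]
image = [band_1, band_2]

# Stand-in for the expert: label by a threshold on the first band.
samples = grid_sample(image, step=2, label_at=lambda r, c, v: "ice" if v[0] >= 8 else "water")

# A coarse grid can miss features, so a directed point is added by hand.
samples.append(directed_sample(image, 1, 3, "cloud"))
```

Each stored record carries the label together with the values on all bands, which is the form a classifier needs as training input.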

This paper describes the operation of the spatial sampling tool and the results of submitting the resulting sample to the S-Plus decision tree package to develop a decision tree. The results of applying this decision tree to classify Antarctic sea ice are also presented.
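To illustrate how a decision tree can be grown from labelled samples of this kind, the sketch below finds the single best threshold split over all bands. It is not the S-Plus procedure: Gini impurity is used here purely as an illustrative splitting criterion, and the function names and sample values are hypothetical. A full tree would apply the same search recursively to each side of the split.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

LabelledPixel = Tuple[str, Sequence[float]]   # (label, values on all bands)


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a list of labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(samples: List[LabelledPixel]) -> Optional[Tuple[int, float, float]]:
    """Return (band index, threshold, weighted impurity) of the best split,
    or None if no band has two distinct values to split between."""
    best = None
    n = len(samples)
    for band in range(len(samples[0][1])):
        observed = sorted({values[band] for _, values in samples})
        # Candidate thresholds lie midway between consecutive observed values.
        for low, high in zip(observed, observed[1:]):
            threshold = (low + high) / 2
            left = [label for label, values in samples if values[band] <= threshold]
            right = [label for label, values in samples if values[band] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (band, threshold, impurity)
    return best


# Hypothetical two-band training sample.
training = [
    ("water", (10, 200)),
    ("water", (12, 180)),
    ("ice", (40, 190)),
    ("ice", (45, 210)),
]
split = best_split(training)
```

In this made-up sample the first band separates the two classes completely, while no threshold on the second band does, so the search selects the first band.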