Parallel Processing for Geographical Applications: A Layered Approach
Michael J. Mineter and Steve Dowers
Parallel Architectures Laboratory, Department of Geography, University of Edinburgh, Drummond Street, Edinburgh EH8 9XP
Significant trends in environmental modelling, GIS and remotely sensed data require increasingly powerful software and hardware, and so favour the exploitation of parallel computing. Despite recent progress in technology, exploiting parallel processing is still difficult, so few applications have been developed in the environmental and geographical domains. With reference to designs for four parallel geographical applications, the paper illustrates a number of key issues which must be addressed in the parallel processing of grid, raster and vector-topological data. The emphasis is upon developing layered software to facilitate the development of additional applications in future.
The demand for increasingly powerful analyses of geographical data is evident in a range of areas including:
- Large volumes of geographical data are being generated in both commerce and research. In particular, the desire to monitor the environment has led to a proliferation of remote-sensing satellites, but the sophistication of the satellite instruments has outstripped the ability to extract information.
- The developing 'information superhighways' will accelerate the demand for shared access to datasets via powerful data servers and integrators.
- Data models are being developed to include more 'real-world' concepts, such as 'fuzzy boundaries', errors and time-dependence.
- Increasingly complex models of environmental processes are being developed.
- The benefits of faster processing to support both real-time analyses and environmental decision support systems are being recognised.
- The fields of remote-sensing and GIS are converging. For example, classification of remotely-sensed data often exploits contextual GIS datasets, but performance constrains these analyses to the processing of small datasets.
- Environmental modelling is also converging with both the remote-sensing and the GIS fields as models become more closely coupled to the spatial distribution of data.
A number of commercial and technical developments have led to parallel processing becoming more viable:
- State-of-the-art processors are now used in parallel architectures. As single-processor performance increases, so parallel computers can benefit. Previously there was a 'leap-frogging' whereby parallel computers used relatively weak, bespoke processors. Their effective lifetimes were short as new generations of single-processor machines outperformed them. If the anticipated plateau in single-processor performance arrives in the next decade then parallelism will be the primary route towards further increasing processing power.
- Parallel hardware is no longer primarily a research topic: most computer manufacturers now build multi-processor architectures; major database packages (some supporting spatial data) now exploit these.
- Parallel software standards are now established for inter-process communication. MPI (Message-Passing Interface) is supported on many major architectures. This extends the lifetime and usability of applications, which can now be developed using ANSI C or FORTRAN and ported between architectures.
- Alternative approaches to explicit message passing, whereby parallel constructs are added to FORTRAN and C, are also becoming standardised.
- Network speeds are improving, and now allow effective use of a number of workstations as a parallel resource, reducing the necessity for extra investment in parallel computers.
- A range of parallel libraries, performing generic functions such as 'regular decomposition' of arrays across a number of processors, is now available.
The lack of parallel libraries specific to environmental processing now limits the take-up of parallel processing in the geographical domain. These libraries would form a layer of software upon which applications can be built. This layer would support the decomposition of data across multiple processors, would support the creation of the resultant datasets from the distributed subsets of data, and would encapsulate the message-passing. By in effect hiding the parallelism from the developer of an environmental application, the layer would reduce the parallel computing expertise demanded of that developer.
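As a rough illustration of the kind of software layer discussed above, the following Python sketch shows a 'regular decomposition' of a grid into row strips with one-row overlaps, a stand-in per-strip computation, and the collation of the output into a single dataset. All names (`strip_bounds`, `decompose`, `smooth_strip`, `gather`) are hypothetical and are not taken from any existing library. The strips are processed by ordinary function calls rather than by real message passing, so only the decomposition and collation logic is shown.

```python
from typing import List, Tuple

Grid = List[List[float]]
# (first owned row, one past last owned row, first stored row, stored rows)
Part = Tuple[int, int, int, Grid]


def strip_bounds(n_rows: int, n_procs: int) -> List[Tuple[int, int]]:
    """Split n_rows into n_procs contiguous [start, stop) strips of near-equal size."""
    base, extra = divmod(n_rows, n_procs)
    bounds = []
    start = 0
    for p in range(n_procs):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def decompose(grid: Grid, n_procs: int, halo: int = 1) -> List[Part]:
    """Give each process its strip plus up to `halo` overlap rows on each side."""
    n = len(grid)
    parts = []
    for start, stop in strip_bounds(n, n_procs):
        lo = max(0, start - halo)
        hi = min(n, stop + halo)
        parts.append((start, stop, lo, [row[:] for row in grid[lo:hi]]))
    return parts


def smooth_strip(part: Part) -> Tuple[int, Grid]:
    """Stand-in for application code: vertical mean of each owned row and its
    available neighbours. It sees only the rows held locally."""
    start, stop, lo, rows = part
    out = []
    for r in range(start, stop):
        i = r - lo  # index of global row r within the local rows
        neighbours = [rows[j] for j in (i - 1, i, i + 1) if 0 <= j < len(rows)]
        out.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return start, out


def gather(results: List[Tuple[int, Grid]]) -> Grid:
    """Collate per-strip results by starting row, whatever order they arrive in."""
    merged: Grid = []
    for _, rows in sorted(results, key=lambda item: item[0]):
        merged.extend(rows)
    return merged
```

An application written against such a layer would call `decompose`, apply its own per-strip function, and call `gather`, without handling any communication itself. In a real implementation the list returned by `decompose` would be scattered to processes (for example with MPI), and the overlap rows would need refreshing between passes of an iterative algorithm.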
Two implementations illustrate options available for the decomposition of raster or grid data across multiple processors in iterative algorithms, and raise the issues of efficient interprocess communication and load balancing:
- The simplification of remotely-sensed data uses established libraries for regular data decomposition, and makes use of a network of workstations as a parallel resource.
- A 3-D finite difference ice sheet model presents more complex issues of data decomposition (due to the evolving number and shape of ice sheets within the grid used by the model). This model was implemented upon the Cray T3D, using low-level shared memory functions and scattered spatial decomposition techniques.
These issues are further explored for both raster and vector data in the context of parallel programs in which data for sub-areas of the complete dataset are distributed to each process. Designs for the conversions between raster and vector-topological data are used to illustrate:
- The distribution of raster data for algorithms which are not iterative. (One-pass algorithms do not require all data to be held in memory throughout the processing, unlike iterative algorithms.)
- The creation of raster data. This entails ordering output from the processes containing parts of the data.
- The decomposition of vector-topological data. This requires a 'sort and join' phase, to allow data held in different record types to be associated with each other, and to distribute data for different sub-areas to different processors.
- The creation of vector-topological data in parallel. A consequence of creating the data in parallel is that parts of one object (e.g. an edge between two nodes) are generated on a number of processors. The processors need to cooperate in 'stitching' the objects together so that the fragments of each object can be recognised, the topological relationships between the fragmented objects discovered, and complete vector-topological records collated. Approaches which seek to minimise the overhead of stitching are discussed.
It is proposed that the implementation of a layer of software which can underlie an expanding range of applications in remote sensing, GIS and modelling should be seen as a matter of some urgency. The complexity is such that its development needs to anticipate the foreseeable processing bottlenecks.
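To make the 'stitching' of fragmented objects more concrete, the following Python sketch joins edge fragments produced for different sub-areas into complete edges. It is a simplified illustration and not the design described in the paper. It assumes that every fragment keeps the direction of the original edge, that at most one edge crosses a sub-area boundary at any given point, and that a caller-supplied predicate `is_boundary` identifies points lying on a sub-area boundary. The names `stitch` and `is_boundary` are hypothetical. Only the geometry is joined; the discovery of topological relationships is not attempted.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Point = Tuple[int, int]
Fragment = List[Point]


def stitch(fragments: List[Fragment],
           is_boundary: Callable[[Point], bool]) -> List[Fragment]:
    """Join fragments that meet at a point on a sub-area boundary.

    Assumes consistent fragment direction and at most one crossing per
    boundary point (both are simplifying assumptions of this sketch).
    """
    # Index the fragments that begin on a sub-area boundary by their first point.
    starts: Dict[Point, int] = {
        frag[0]: i for i, frag in enumerate(fragments) if is_boundary(frag[0])
    }
    # For each fragment ending on a boundary, find the fragment that continues it.
    next_of: List[Optional[int]] = [
        starts.get(frag[-1]) if is_boundary(frag[-1]) else None
        for frag in fragments
    ]
    has_prev: Set[int] = {j for j in next_of if j is not None}
    visited: Set[int] = set()

    def walk(first: int) -> Fragment:
        """Follow the chain of continuations from `first`, merging points."""
        line = list(fragments[first])
        visited.add(first)
        j = next_of[first]
        while j is not None and j not in visited:
            line.extend(fragments[j][1:])  # skip the shared boundary point
            visited.add(j)
            j = next_of[j]
        return line

    edges: List[Fragment] = []
    # Open chains: start from fragments that nothing else continues into.
    for i in range(len(fragments)):
        if i not in has_prev:
            edges.append(walk(i))
    # Anything left belongs to a closed ring crossing boundaries.
    for i in range(len(fragments)):
        if i not in visited:
            edges.append(walk(i))
    return edges
```

For example, with sub-areas divided at x = 4 and x = 8, the predicate could be `lambda p: p[0] in (4, 8)`, and the fragments gathered from all processes could be passed to `stitch` in any order. The sketch performs the join in one place after gathering; the overhead this implies is the kind of cost that approaches for minimising stitching aim to reduce.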