James Macgill, Stan Openshaw, and Ian Turton
Department of Geography, University of Leeds, Leeds LS2 9JT United Kingdom
Email: [j.macgill, stan, ian]@geog.leeds.ac.uk
web: www.geog.leeds.ac.uk/pgrads/j.macgill/gis.html
The well-documented 'data explosion' that has followed the GIS revolution of the mid-1980s (Openshaw 1995) has led to the rapid production of large quantities of high-quality, spatially-referenced data. This trend shows no sign of slowing; if anything, the rate at which new geographic data is produced is still increasing. Indeed, geographic data sets are now being produced many times faster than they can be analysed (Estivill-Castro 1998).
The data being generated is linked, to an ever greater degree, to non-spatial attribute data, due partly to the ease with which many new and legacy databases can be joined via their geography, for example through postcodes (Raper 1992). This has resulted in a rapid build-up of data that is, in a large number of cases, simply being archived. The under-analysis of a number of key spatial data sets has been referred to by Openshaw (1995) as a CRIME!
As a result, it has become clear that there is an urgent need for semi- or even fully-autonomous analysis tools that can start to clear this backlog of under-analysed data, whilst also equipping us to tackle the ever richer sources of data that will arrive in the future.
However, in seeking to meet this challenge it is important not to neglect the unique contribution that intelligent human analysis can make. Additionally, if the aim is the creation of intelligent analysis agents, then a number of hurdles must be overcome: a demonstration of added value, evidence of performance under conditions where manual methods fail, proof of safety, and preferably more than one good reason for wanting to approach the problem by a route that many would naturally fear and wish to criticise.
The last few years have seen the development of a number of spatial-analysis tools broadly based on what might be termed geographical data mining.
One of the most common and important tasks in spatial analysis is the ability to spot patterns. For many key data sets, such as crime and health data, the most important pattern to be able to identify is a cluster, and it is this capability that has been the main focus of the research. Each of the systems developed has had to cope with the problem of a varying background population: at every location and scale in a study region there is a different population density. This population-density surface is critical to cluster analysis because a cluster is effectively a region where the incidence of something exceeds the expected rate. Ignoring it would be unwise, as almost all of the clustering detected would then simply reflect unevenness in the population distribution. Simply put, a large number of cases in a densely populated city may be of less interest than a moderate number of cases in a rural area.
The core technique used by each system is the same: hyper-circles are generated which have a position and radius in geographic space (here the radius is expressed in geographic units, though it could just as easily be defined in terms of nearest neighbours) as well as a span in each of the attribute dimensions. The number of cases and the total at-risk population that fall within the hyper-circle are then found, and a statistical test appropriate to the numbers and type of data is applied to see whether the number of cases represents a potentially significant excess over what would be expected for the captured population at risk.
If the excess is considered potentially significant then the result from that hyper-circle is passed on to be added to a result surface as a kernel distribution around the centre of the hyper-circle.
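This evaluate-and-accumulate step, shared by all of the engines, can be sketched in Java (the language used for the prototypes described later). The class and method names, the simple Poisson exceedance test and the linear kernel below are illustrative assumptions for a purely geographic circle, not the actual code or statistics used by any of the systems.

```java
/** Illustrative sketch of the shared evaluate-and-accumulate step for a purely
 *  geographic hyper-circle. The names, the Poisson exceedance test and the
 *  linear kernel are assumptions, not the actual code of any of the systems. */
public class HyperCircleSketch {

    /** A candidate search region: a centre and radius in geographic space. */
    static class HyperCircle {
        final double x, y, radius;
        HyperCircle(double x, double y, double radius) {
            this.x = x; this.y = y; this.radius = radius;
        }
    }

    /** P(X >= cases) under a Poisson model with the given expected count. */
    static double poissonExceedance(int cases, double expected) {
        double term = Math.exp(-expected), cumulative = 0.0;
        for (int k = 0; k < cases; k++) {
            cumulative += term;            // accumulate P(X = k)
            term *= expected / (k + 1);    // next Poisson term
        }
        return 1.0 - cumulative;
    }

    /** Count cases and at-risk population inside the circle, test for excess and,
     *  if the excess looks interesting, spread a kernel onto the result surface. */
    static void evaluate(HyperCircle c, double[][] casePoints, double[][] popPoints,
                         double overallRate, double[][] resultSurface, double cellSize) {
        int cases = countInside(c, casePoints);
        int population = countInside(c, popPoints);   // people at risk captured by the circle
        double expected = population * overallRate;   // cases expected for that population
        if (expected > 0 && poissonExceedance(cases, expected) < 0.01) {
            addKernel(resultSurface, c, cellSize);    // accumulate evidence around the centre
        }
    }

    static int countInside(HyperCircle c, double[][] points) {
        int n = 0;
        for (double[] p : points)
            if (Math.hypot(p[0] - c.x, p[1] - c.y) <= c.radius) n++;
        return n;
    }

    /** Very crude kernel: weight every result cell inside the circle,
     *  falling off linearly with distance from the centre. */
    static void addKernel(double[][] surface, HyperCircle c, double cellSize) {
        for (int i = 0; i < surface.length; i++)
            for (int j = 0; j < surface[i].length; j++) {
                double d = Math.hypot(i * cellSize - c.x, j * cellSize - c.y);
                if (d <= c.radius) surface[i][j] += 1.0 - d / c.radius;
            }
    }
}
```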
Where the methods differ, however, is predominantly in the way the circles are generated. The following covers four of the systems under consideration:
The Geographical Analysis Machine (GAM) is the oldest and most established of the methods used here. The 'Machine' part of its name gives a clue to the nature of GAM's approach: it uses a brute-force search that explores the entire data space at every geographic location and at every scale. In doing so, it is guaranteed not to miss any potential clusters. Unfortunately, as the data sets under investigation grow in size, and particularly in dimensionality, the number of permutations that GAM has to try explodes.
A single GAM run can generate millions of hyper-circles for testing (in multi-attribute data this can escalate to billions).
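To give a feel for where those numbers come from, the sketch below simply counts the circles that a GAM-style exhaustive pass would generate for an assumed 500 km square study region; the radius range and overlap factor are made-up values, not GAM's actual settings.

```java
/** Sketch of GAM's exhaustive search pattern: every location at every scale.
 *  The study-region extent, radius range and overlap factor are made-up values;
 *  here the loops simply count the circles that would have to be evaluated. */
public class GamSearchSketch {
    public static void main(String[] args) {
        double minX = 0, minY = 0, maxX = 500_000, maxY = 500_000;   // a 500 km square region
        long circles = 0;
        for (double radius = 500; radius <= 20_000; radius += 500) { // scales from 0.5 km to 20 km
            double step = radius * 0.8;                              // overlapping circles at each scale
            for (double y = minY; y <= maxY; y += step)
                for (double x = minX; x <= maxX; x += step)
                    circles++;                                       // each one would need a statistical test
        }
        System.out.println("Circles to test: " + circles);           // already runs into the millions
    }
}
```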
Map Explorer (MAPEX) has been developed to take advantage of the non-linear optimisation capabilities offered by genetic algorithms (GAs).
A brief overview is given here for readers familiar with GAs; for more details on MAPEX, see Openshaw and Perrée (1996).
For each generation of the GA, a population of bitstrings is created, each bitstring encoding a potential hyper-circle in terms of its x,y coordinates and radius, together with bits representing spans in time and attribute space. Each member of the population is then evaluated to see which produces the most interesting level of excess. This value is used to select the fittest members of the population, from which the next generation is bred.
This process continues until MAPEX converges on a cluster in the database. As this may be only a local optimum, or just one of many such clusters, the search is then restarted, this time with the cases at that location removed or masked out.
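For illustration, the sketch below shows one way a bitstring might be decoded into a geographic hyper-circle; the 8-bit field widths and the scaling are assumptions made for this example and are not MAPEX's actual chromosome layout (which also carries time and attribute spans).

```java
/** Illustrative decoding of a GA bitstring into a geographic hyper-circle.
 *  The 8-bit fields and the scaling are assumptions, not MAPEX's actual layout. */
public class BitstringDecodeSketch {

    /** Interpret 'length' bits starting at 'from' as an unsigned integer. */
    static int bitsToInt(boolean[] bits, int from, int length) {
        int value = 0;
        for (int i = 0; i < length; i++)
            value = (value << 1) | (bits[from + i] ? 1 : 0);
        return value;
    }

    /** Decode 24 bits into x, y (within the study region) and a radius. */
    static double[] decode(boolean[] bits, double minX, double maxX,
                           double minY, double maxY, double maxRadius) {
        double x = minX + (maxX - minX) * bitsToInt(bits, 0, 8) / 255.0;
        double y = minY + (maxY - minY) * bitsToInt(bits, 8, 8) / 255.0;
        double r = maxRadius * (bitsToInt(bits, 16, 8) + 1) / 256.0;
        return new double[] { x, y, r };
    }

    public static void main(String[] args) {
        boolean[] chromosome = new boolean[24];
        java.util.Random rng = new java.util.Random(42);
        for (int i = 0; i < chromosome.length; i++) chromosome[i] = rng.nextBoolean();
        double[] circle = decode(chromosome, 0, 500_000, 0, 500_000, 20_000);
        System.out.printf("x=%.0f y=%.0f radius=%.0f%n", circle[0], circle[1], circle[2]);
    }
}
```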
The flock engine utilises a number of independent agents (called geoBoids) that roam through the database foraging for clusters. They communicate and direct their movements cooperatively by borrowing behavioural traits based on flocking (Reynolds 1987).
Each member of the flock is able to evaluate and broadcast its performance. Based on this, other flock members can choose to steer towards well-performing geoBoids to assist them or to steer to avoid poorly performing boids. In addition, each geoBoid changes its own behaviour based on its performance. For example, a poorly-performing boid will speed up in order to find a more interesting area. Likewise, a well-performing geoBoid will slow down to investigate an interesting region more carefully.
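That speed adjustment can be expressed very simply; the thresholds and speed limits in the sketch below are made-up values, not those of the actual flock engine.

```java
/** Illustrative speed adjustment for a geoBoid based on how interesting its
 *  current position looks; thresholds and speed limits are made-up values. */
class GeoBoidSpeedSketch {
    static final double MIN_SPEED = 0.2, MAX_SPEED = 3.0;
    double speed = 1.0;

    /** 'interest' might be the level of excess found inside the boid's current circle. */
    void adjustSpeed(double interest) {
        if (interest > 0.5) {
            speed = Math.max(MIN_SPEED, speed * 0.8);  // promising area: slow down to examine it
        } else {
            speed = Math.min(MAX_SPEED, speed * 1.2);  // dull area: speed up and move on
        }
    }
}
```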
Macgill and Openshaw (1998) go into more detail about the flock algorithm. The following figure shows a flock analysis in mid-run.
Fig 1 A Flock run.
Random search is the simplest of all the approaches, as its name suggests: each hyper-circle is generated entirely at random, with no intelligence or heuristics to guide it. Its inclusion is partly to act as a benchmark (do any of the allegedly smart techniques actually outperform a random search?), but it can also perform a useful function when used alongside other search techniques that might otherwise get stuck at local optima.
| Method | Speed (per test) | Thoroughness | Scalability |
|---|---|---|---|
| Random search | Fast | Low | High |
| MAPEX (GA) | Average | Average | High |
| Flock | Average | High | High |
| GAM | Fast | ~Total | Low |
| GAM-T | Slow | Good | Very low |
| GEM | Slow | Good | ~None |

Table 1
Table 1 summarises the relative abilities of each of the analysis engines outlined above. Given the near-total thoroughness of the GAM engine, it is tempting to use it exclusively. Many tests with GAM have shown that it has a high success rate at locating clusters and, just as importantly, a good tolerance against making false-positive detections (Boyle 1996).
However, GAM owes its efficiency to its highly optimised spatial-data retrieval and to the fact that it works best on a purely geographical domain. It does not scale to more complex problems involving higher-dimensional, multivariate data sets. Although space-attribute and space-time versions have been developed (GEM and GAM-T), they really require HPC platforms and, even then, they are limited to low-dimensional problems because they do not scale.
The non-brute-force systems are attractive for their scalability, but using them means sacrificing the level of resolution and completeness that brute-force, GAM-like methods deliver in their final results.
Fig 2
Fig 2 illustrates how the search might have progressed for three of the methods shortly after starting. GAM has been very thorough but has yet to complete one pass along a single line at one scale. The random and flock searches have covered more ground and may have found something interesting, but it will be some time before either of them develops a clearer picture of what, if anything, it has found.
For very large temporal, multi-attribute data sets, the computational time of a full GAM search becomes prohibitively large, with CPU time heading towards the practically infinite. Using GAM to exhaustively search one small part of a spatial data set is far less intensive, however; so when a less thorough engine finds something of potential interest, it can call on GAM to exhaustively search that region. In this way the search is narrowed down by reducing the dimensionality, at which point the GAM style of exploration again becomes computationally viable. Once such an exhaustive search has finished, that section of the data set can be marked as done and avoided by the flock for the rest of the search.
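This hand-over can be expressed as a simple rule, sketched below; the interface names and the interest threshold are illustrative assumptions rather than MESAT's actual control logic.

```java
/** Illustrative hand-over rule: when a light-weight engine reports a promising
 *  region, an exhaustive mini-GAM is dispatched over just that region, which is
 *  then marked as done. Names and the threshold are assumptions, not MESAT code. */
class HandOverSketch {
    static final double INTEREST_THRESHOLD = 0.8;    // made-up value

    interface MiniGam    { void searchRegion(double minX, double minY, double maxX, double maxY); }
    interface SearchMask { void markDone(double minX, double minY, double maxX, double maxY); }

    static void onRegionReported(double interest,
                                 double minX, double minY, double maxX, double maxY,
                                 MiniGam gam, SearchMask mask) {
        if (interest < INTEREST_THRESHOLD) return;    // not yet worth an exhaustive pass
        gam.searchRegion(minX, minY, maxX, maxY);     // exhaustive search of the small region only
        mask.markDone(minX, minY, maxX, maxY);        // the flock steers around this area afterwards
    }
}
```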
Fig 3
Fig 3 illustrates a potential cooperative scenario in which the random search, the flock and GAM are working together. In this scenario, one of the flock members (magenta) thinks that it has found something potentially interesting and has called in a miniGAM search (orange) to examine the area thoroughly. Meanwhile, one of the random circles (cyan) has attracted the attention of two other members of the flock for additional study.
Until recently, the database functions, flock search and interactive visualization were tightly coupled into a single application. Now, however, the process has been successfully split into three tiers to enable such a multi-engine cooperative system to be developed. It is anticipated that the remaining search engines will be implemented in the same way over the coming months.
The following section looks at the architecture that is being followed for the implementation of the Multi-Engine Spatial Analysis Tool (MESAT).
Having decided to implement MESAT, it was necessary to choose an appropriate architecture. It soon became apparent that the task divided most naturally into three components: spatial-data storage and query, the spatial-exploration engines themselves, and visualization and control.
The following diagram illustrates the overall architecture of the three-tier system.
Fig 4 The three-tier architecture
The first tier is some form of spatial-data storage and query facility. In the trials performed here, this is simply a flat file stored on a server; in a more complete final application it is more likely to be an advanced online spatial-data warehouse such as that proposed by Han et al. (1998). Whatever its form, it should be able to store large amounts of data and respond rapidly to spatially-referenced queries.
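A minimal sketch of the query facility this tier needs to offer is given below; the interface and method names are assumptions made for illustration, not the interface of the trial flat-file server or of Han et al.'s spatial data warehouse.

```java
/** Sketch of the query facility expected of the first tier.
 *  The interface and method names are illustrative assumptions only. */
public interface SpatialDataStore {
    /** Number of cases falling within 'radius' of (x, y). */
    int casesWithin(double x, double y, double radius);

    /** At-risk population captured within 'radius' of (x, y). */
    double populationWithin(double x, double y, double radius);

    /** Bounding box of the study region, as {minX, minY, maxX, maxY}. */
    double[] extent();
}
```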
The second tier consists of one or more spatial-exploration engines that interrogate the spatial data held in the first tier and attempt to find spatial patterns in it. By developing within the object-oriented paradigm, each engine implements a common set of interfaces that allows it to appear from the outside as an identical, generic engine. Additional engines can therefore be implemented and 'plugged in' to the system, able to communicate immediately with both the first and third tiers without either having to change.
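In Java, such a common interface might look something like the sketch below (reusing the hypothetical SpatialDataStore from the first-tier sketch); the method names and the SharedResultSurface type are illustrative assumptions, not the project's actual interface definitions.

```java
/** Illustrative common interface for a plug-in tier-two search engine.
 *  Method names and the SharedResultSurface type are assumptions. */
interface SharedResultSurface {
    /** Add evidence around (x, y), e.g. a kernel contribution of the given weight. */
    void addEvidence(double x, double y, double radius, double weight);
}

public interface SearchEngine {
    void setDataStore(SpatialDataStore store);            // tier-one data access (earlier sketch)
    void setResultSurface(SharedResultSurface surface);   // tier-three shared result surface
    void start();                                         // begin or resume the search
    void stop();                                          // pause the search

    /** Ask the engine to concentrate on, or to avoid, a rectangular region. */
    void focusOn(double minX, double minY, double maxX, double maxY);
    void avoid(double minX, double minY, double maxX, double maxY);
}
```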
The top tier of the system is the visualization engine. This collates the information coming in from the different search tools and displays it in an interactive visualization. The visualization is built up in real time, allowing the user of the system to see the progress of the search as it occurs. The major advantage of this is that the user can employ their expertise or intuition to assist and direct the search process.
The visualizer serves a dual role: aside from the obvious job of displaying the results, it also coordinates the activities of the different search engines, starting and stopping them, dividing up and sharing the search space, and setting regions to focus on or ignore.
Fig 5 The result display and control panel
Its controlling role could be entirely user-driven, entirely autonomous, or a mixture of the two. Under human control, the user could drag out boxes to indicate regions that different engines should concentrate on or avoid. In autonomous mode, the application could use the results generated so far to decide how best to guide the remaining search.
A number of implementation strategies were available. However, as the computer science department was interested in a collaborative project involving CORBA, a bridge was set up that allows the three tiers to communicate over the Internet, with each component running on a separate machine if necessary, or with one or more of them sharing a single computer. The architecture is scalable, allowing extra computers running additional analysis engines to be added to and removed from the system during an analysis.
The CORBA bridge allows the use of components built in multiple languages. For now, the language of choice has been Java, for its platform independence and the speed with which the system can be prototyped, though specific components may later be re-implemented in C/C++ to take advantage of the speed boost that this would provide.
Whilst running multiple search engines on separate machines undoubtedly gives a speed increase, since it effectively parallelizes the search process, on its own this is little more than a way of providing more computational power; the same effect could be achieved by simply running the task on a more powerful computer. Instead, the goal of this project is to achieve better performance, and a better set of results, by running different search techniques that communicate with one another in parallel, such that together they produce a better result than any single engine could, even if all of the engines are in fact running on a single machine.
As stated above, the role of the visualizer is more than that of communicating results to the user: it is also a shared result surface (in AI terms it could be considered a 'blackboard system' (Englemore and Morgan 1989)). As such, it must also act as the central communication centre for the different agents. Given that the shared result surface is available to all of the analysis engines, what kind of information should be passed back? The first option is to send the entire result surface; however, given its size, bandwidth limitations make that impractical. Instead, the shared result surface must provide a useful summary of the search so far, such as which regions have already been examined, which have been marked as empty or uninteresting, and which appear to merit a more focused search.
How each engine would take advantage of this information depends very much on the mechanism of the engine itself. For example, a GAM might sit idle on a machine until it receives a specific request to perform a focused search on a particular small region. The flock, on the other hand, employs a number of behavioural patterns to respond to such information.
The implementation of the flock used until recently was based on a fairly simplistic version of the boids algorithm. Whilst this allowed the first prototypes to be set up quickly, it began to restrict the alterations and improvements that could be made to the flock members' behaviour. Following the work of Craig Reynolds, the boids have been re-implemented to allow a more complete set of steering behaviours to be used (Reynolds 1998). For example, the previous boids would avoid, or 'run away' from, areas of the data set that had been marked as empty or uninteresting. This could mean that a search progressing near, but not towards, such an area could be unnecessarily deflected. In the current implementation, a steering behaviour capable of 'obstacle avoidance' is used instead, so that only boids in direct danger of entering an empty section of the database will be deflected.
The shared result surface generates these obstacles, quickly marking off uninteresting parts of the database; areas with zero population, such as seas, are eliminated early on. As additional information arrives from the shared result surface about further regions to avoid, the geoBoids use the same behaviour to steer around them.
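A crude sketch of that obstacle test is shown below; the look-ahead distance and the circular obstacle representation are simplifying assumptions, and Reynolds' steering behaviours are considerably more sophisticated.

```java
/** Crude sketch of obstacle avoidance: a boid is only deflected if its current
 *  heading would carry it into an empty region. Obstacles are circles here, and
 *  the look-ahead distance is a simplifying assumption. */
class ObstacleAvoidanceSketch {

    /** Returns a new heading (radians); unchanged unless the look-ahead point
     *  falls inside the obstacle. */
    static double steer(double x, double y, double heading, double lookAhead,
                        double obstacleX, double obstacleY, double obstacleRadius) {
        double aheadX = x + Math.cos(heading) * lookAhead;
        double aheadY = y + Math.sin(heading) * lookAhead;
        boolean collision =
            Math.hypot(aheadX - obstacleX, aheadY - obstacleY) < obstacleRadius;
        if (!collision) return heading;                        // not heading into the empty region
        double away = Math.atan2(y - obstacleY, x - obstacleX); // direction away from the obstacle
        return heading + 0.5 * angleDifference(away, heading);  // turn part-way towards 'away'
    }

    /** Smallest signed difference between two angles, in the range (-pi, pi]. */
    static double angleDifference(double a, double b) {
        double d = a - b;
        while (d > Math.PI) d -= 2 * Math.PI;
        while (d <= -Math.PI) d += 2 * Math.PI;
        return d;
    }
}
```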
A snapshot of the applet running can be seen here. This applet shows how a swarm-optimization search might take advantage of search knowledge as it builds up. The applet is intended for illustration only, as it is not running on real data; however, it does show how past search knowledge can be used to avoid unnecessary repetition of the search. The large boid in the demonstration is not actually following any form of intelligent search behaviour: it is simply avoiding regions that it has already discovered to be practically empty. (The small boid is attracted to, or repelled from, the large boid depending on the level of excess of the dreaded red-dot disease found inside the large boid.)
Although a single GAM run will generate millions of circles, its systematic nature means that it will never generate two circles of the same size at the same location. As such, every result can be added to the result surface in the knowledge that it is unique.
Members of the flock search, however, have no such restriction and may well generate results for one section of the map multiple times, each time adding to the surface of evidence. This problem increases as multiple analysis engines are used, since they may occasionally explore the same areas independently. As a result, an area of relatively little importance can be added to the result surface enough times to make it seem interesting. This is particularly true for maps with little or no real clustering, as the analysis engines will hunt out the closest thing they can find to a real cluster and keep studying it, trying to find a scale at which it might become one.
The simple solution was to build two result surfaces. In addition to the normal result surface, a second surface records the number of times each part of the map has been examined; this is referred to as the search-intensity surface. The result surface can then be divided by the search-intensity surface to cancel out the effects of redundant result generation.
As a side effect, the search-intensity surface gives a clear indication not just of where has been searched but, more importantly, of where has not yet been searched, providing for the first time a measure of how thorough the flock search is.
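The correction itself is a cell-by-cell division, sketched below with illustrative names; cells that have never been visited are simply left at zero.

```java
/** Sketch of normalising the result surface by the search-intensity surface.
 *  Cells that have never been visited are left at zero rather than divided by zero. */
public class SurfaceNormaliser {
    static double[][] normalise(double[][] result, double[][] searchIntensity) {
        double[][] corrected = new double[result.length][result[0].length];
        for (int i = 0; i < result.length; i++)
            for (int j = 0; j < result[i].length; j++)
                if (searchIntensity[i][j] > 0)
                    corrected[i][j] = result[i][j] / searchIntensity[i][j];
        return corrected;
    }
}
```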
The hybrid approach offers many potential advantages over using any one of the methods independently. In the worst case, the system should still perform at least as well as its best contributing component. This alone may be useful since, given the different capabilities of each engine, it would otherwise be difficult to choose which engine to employ.
It is, however, with the introduction of time-attribute clusters that the new system has the greatest potential. In this scenario, any individual cluster may be purely spatial, purely temporal, or both, as well as being confined within certain attribute constraints. It is searches such as these that push the computational intensity of the problem upwards, and it is here that the hybrid approach is likely to have the greatest effect: the wide exploratory powers of the flock and the GA will, it is hoped, find some indication of possible clustering, which can then be exhaustively examined by a focused GAM run.
Extensive testing is required, however, to give a full indication of the level of gain offered by combining the different approaches in a cooperative, expandable, distributed architecture. Such testing is already underway, involving a series of blind trials on generated data sets that range from simple spatial clustering to complex multi-dimensional data sets (as well, of course, as entirely random ones with no clustering).
The results of this testing should be ready for inclusion in the Conference presentation, at which point a copy of the results will be available at this address: http://www.ccg.leeds.ac.uk/james/geocomp99/results.html
Estivill-Castro, V. and Murray, A. (1998): In: Wu, X., Kotagiri, R. and Korb, K. (eds) Research and Development in Knowledge Discovery and Data Mining: Second Pacific-Asia Conference (PAKDD-98), Australia.
Englemore, R. and Morgan, T. (eds) (1989): Blackboard Systems. Addison-Wesley, Reading, MA.
Han, J., Stefanovic, N. and Koperski, K. (1998): Selective materialization: an efficient method for spatial data cube construction. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'98), Australia.
Openshaw, S. (1995): Developing automated and smart spatial pattern exploration tools for geographical information system applications. The Statistician, 44(1), pp. 3-16.
Openshaw, S. and Perrée, T. (1996): User-centered intelligent spatial analysis of point data. In: Parker, D. (ed) Innovations in GIS 3, pp. 119-124. Taylor and Francis, London.
Macgill, J. and Openshaw, S. (1998): The use of flocks to drive a Geographic Analysis Machine. In: GeoComputation 98 Proceedings, Bristol.
Raper, J.F., Rhind, D. and Shepherd, J.W. (1992): Postcodes: The New Geography. Longman, Harlow.
Reynolds, C.W. (1987): Flocks, herds, and schools: a distributed behavioral model. Computer Graphics, 21(4), pp. 25-34.
Reynolds, C.W. (1998): Steering Behaviors for Autonomous Characters. Online: http://www.hmt.com/cwr/steer/
As a side effect of splitting the system into distributed parts, it is now possible to envisage a 'DIY' supercomputer that employs the idle CPU cycles of all the computers across the School of Geography's intranet. Indeed, by the time of the GeoComputation conference this may have been extended to the Internet, such that simply pointing your browser at the online version of this paper could donate your computer's spare CPU cycles to help solve a problem.