AUTOCLUST: Automatic clustering via boundary extraction for mining massive
point-data sets
Vladimir Estivill-Castro and Ickjai Lee
Department of Computer Science & Software Engineering, The University
of Newcastle, Callaghan, NSW 2308, Australia
Email: vlad@cs.newcastle.edu.au,
ijlee@cs.newcastle.edu.au
Abstract
Widespread clustering methods require user-specified arguments and prior
knowledge to produce their best results. This demands pre-processing and/or
several trial-and-error steps, both of which are extremely expensive and
inefficient for massive data sets. The need to find best-fit arguments in
semi-automatic clustering is not the only concern; the manipulation of data
to find the arguments opposes the philosophy of ''let the data speak for
themselves'' that underpins exploratory data analysis. Our new approach
consists of effective and efficient methods for discovering cluster
boundaries in point-data sets. The approach automatically extracts
boundaries based on Voronoi modelling and Delaunay Diagrams. Parameters are
not specified by users in our automatic clustering. Rather, values for
parameters are revealed by the proximity structures of the Voronoi
modelling, and thus an algorithm, AUTOCLUST, calculates them from the
Delaunay Diagram. This not only removes human-generated bias, but also
reduces exploration time. The effectiveness of our approach allows us to
detect not only clusters of different densities, but also sparse clusters
adjacent to high-density clusters. Multiple bridges linking clusters are
identified and removed. All this is performed within O(n log n) expected
time, where n is the number of data points. We evaluate AUTOCLUST's time
efficiency and clustering quality, and we compare and contrast AUTOCLUST
with other algorithms for clustering large georeferenced sets of points. A
series of detailed performance comparisons with both synthetic data sets
and real data sets confirms the virtues of our approach.
1 Introduction
As the size of spatial data increases exponentially
and the structure of data becomes more complex, data mining and knowledge
discovery techniques become essential tools for successful analysis of
large spatial data sets [17].
Clustering is central to Spatial Data Mining [7,20]
or Geographical Data Mining [24].
It partitions point-data sets into smaller groups according to proximity.
The resulting groups represent geographically interesting concentrations,
in which further investigations should be undertaken to find possible
causal factors. Numerous
clustering approaches have been suggested within data mining [1,12,13],
spatial databases [4,5,16,20,25,26,30],
statistics [9,18,19,28]
and GIS communities [6,7,15,22].
Different clustering methods have been proven to offer different capabilities,
different domains of applicability and different computational requirements.
Nevertheless, clustering methods share the need for user-specified
arguments and prior knowledge to produce their best results. Such
information is supplied as density thresholds, merge/split conditions,
number of parts, prior probabilities, assumptions about the distribution
of continuous attributes within classes, and/or kernel size for intensity
testing (for example, grid size for raster-like clustering and radius for
vector-like clustering). This parameter tuning is extremely expensive and
inefficient for huge data sets because it demands pre-processing and/or
several trial-and-error steps.
The need to find best-fit arguments in widespread semi-automatic clustering
is not the only concern; the manipulation of data to find the arguments
opposes Openshaw's famous postulation [23]
to ''let the data speak for themselves''. This postulation underpins
exploratory spatial data analysis. Hypotheses should be uncovered from the
data, not from the user's prior knowledge and/or assumptions. So far,
clustering approaches detect relatively high-density areas (localised
excesses) within the study region. However, this clustering principle alone
has difficulty in determining exactly what is high density and what is
relatively low density. Thus, it frequently fails to find slightly sparse
groups when they are adjacent to high-density groups, despite their great
importance for further investigations. We present specific illustrations of
these problems in Section 2.
We overcome these problems by precise modelling of spatial proximity
in association with density. Density is a natural principle, but it should
be combined with relative proximity. Precise modelling of spatial proximity
means a capability to obtain unambiguous information about global effects
and local effects, since geographical phenomena are the result of global
first-order effects and local second-order effects [2].
Global effects indicate the degree of excess in a global sense, while local
effects represent local variations. With the combination of both effects,
localised groups emerge.
Gold [10]
argues that the traditional modelling approaches are not able to capture
spatial proximity uniquely or consistently for point-data sets. As an
alternative, Voronoi modelling has several advantages for proximity
relations of points over conventional modelling based on either vector-like
or raster-like representations. In particular, Voronoi models uniquely
represent discrete point-data sets and consistently capture spatial
proximity [10].
We show that, as a consequence, automatic clustering is feasible.
Delaunay Diagrams (these remove ambiguities from Delaunay Triangulations
when co-circular quadruples are present), as duals of Voronoi Diagrams,
are used as the source of our approach. Section 3
discusses the use of Delaunay Triangulations in modelling spatial proximity
and in clustering. With this structure, we find clusters by locating sharp
gradient changes in point density (cluster boundaries). In our automatic
clustering, users do not specify arguments. We believe any parameters must
be encoded in the proximity structures of the Voronoi modelling, and thus
our algorithm, AUTOCLUST, calculates them from the Delaunay Diagram. This
not only removes human-generated bias, but also reduces exploration time.
The intuition behind boundary finding is that, in the Delaunay Diagram,
points on the border of a cluster tend to have a greater standard deviation
in the lengths of their incident edges, since they have both short edges
and long edges. The former connect points within a cluster, while the
latter straddle between clusters or between a cluster and noise. This
characteristic of border points is the essence of our dynamic
edge-elimination criterion. The criterion has the flavour of a statistical
test, labelling edges whose lengths lie some number of standard deviations
away from the mean. However, the principles behind our methods go far
beyond this initial intuition, and at least two justifications appear in
Section 4.
Removal of edges labelled as too long from the Delaunay Diagram is the
first phase that extracts initial rough boundaries. This phase takes into
account local and global variations to ensure that relatively homogeneous
clusters are not broken up into tiny uninteresting subsets and relatively
heterogeneous subsets are segmented into meaningful sub-clusters. Section
4 details this process as well as two subsequent
analysis phases that eliminate some difficulties with graph-based
clustering, including the removal of bridges (line-like paths with a small
number of points) that connect clusters.
The effectiveness of our approach allows us to detect not only clusters
of different densities, but also sparse clusters adjacent to high-density
clusters. Multiple bridges linking clusters are identified and removed. All
this is performed within O(n log n) expected time, where n is
the number of data points. We confirm AUTOCLUST's time efficiency and
clustering quality experimentally in Section 5. These
performance evaluations use both synthetic data sets and real data sets.
We contrast AUTOCLUST with other algorithms for clustering large
georeferenced sets of points. Detailed information can be found in a
companion technical report [8],
which categorises clustering methods into six groups on the basis of their
working principles and discusses their major problems in the light of
AUTOCLUST. Finally, we point out some other areas of impact of our approach
with concluding remarks in Section 6.
2 Mere density fails
Spatial clustering for massive data sets has become central in exploratory
data analysis. Early efforts like GAM [22]
and CLARANS [20]
detect clusters within huge data sets but demand high CPU effort (and thus
are costly in computational time). Later, BIRCH [30]
was able to find clusters in the presence of noise. DBSCAN [5]
and others [4,12,15]
detect clusters of arbitrary shapes. Recently, CHAMELEON [16]
and AMOEBA [7]
report clusters of different densities even when data sets exhibit bridges
and high discrepancy in density and noise. Despite this level of maturity
in spatial clustering, methods still demand parameter tuning from users
and fail to find some interesting groups.
2.1 Sparse clusters adjacent to high-density clusters
Clustering approaches focus on finding relatively high-density areas
(density-based and grid-based), merging closer groups (hierarchical and
model-based), assigning to closer prototypical representatives
(partitioning and model-based) or detecting shorter or stronger
interactions (graph-based). All typically miss relatively sparse clusters
adjacent to high-density clusters. Consider the data set shown in
Figure 1(a).
Figure 1: A sparse cluster adjacent to two high-density clusters: (a) a
data set (n = 600); (b) three high-density clusters; (c) one big
cluster and one high-density cluster; (d) one sparse cluster; (e) four
clusters.
Most clustering algorithms report either

three very dense clusters (C1, C2 and C3 in Figure
1(b)) when user-supplied arguments combine to reveal
very high-density clusters, or

one big cluster (C23 in Figure 1(c)) and
one dense cluster (C1), when user-supplied arguments are tuned to
relax the criterion function in order to detect the sparse cluster (C4
in Figure 1(d)).
These algorithms are not able to detect the three high-density clusters
and the sparse cluster (Figure 1(e)) simultaneously.
Clustering methods are inclined to report the big cluster (C23
in Figure 1(c)) as one cluster because the intra-cluster
distance of cluster C4 (the sparse cluster) is greater than the
inter-cluster distance between the high-density clusters (C2 or C3) and
the sparse cluster (C4). Apart from the three high-density clusters,
the sparse cluster is also important for further correlation investigation
between themes. Spotting and locating interesting groups in a particular
theme is the starting point of exploratory data analysis. If we assume
that points represent crime incidents around cities and surrounding
suburbs, then mere density is unable to find correlations between crime
incidents and suburbs, because it misses the distinction between the sparse
cluster and the high-density clusters. This phenomenon is often found in
real data, such as big cities having more crime incidents, surrounded by
suburbs recording relatively fewer. Moreover, detection of a boundary
allows for exploration of a geographical feature that can explain the sharp
decrease in density and frequency (the crimes reported are for an interval
of time).
2.2 More than one line-like bridge between two
clusters
Graph-based clustering [4,7,15,16,29]
methods have gained popularity as alternatives to density methods since
they work from an underlying graph model. The observations (points) are
assigned to vertices and a discrete relation is encoded, with two related
vertices connected by an edge. In this abstract structure, a cut-point
is a vertex whose removal disconnects the graph and a bridge is
an edge whose removal disconnects the graph [14].
Connected components are the most elementary clusters. Edges have a weight
corresponding to the geographical distance between the associated
end-vertices.
Hierarchical clustering with single linkage corresponds to computing
the Minimum Spanning Tree of the weighted graph. However, it is well known
that single linkage suffers from bridges. Problems also arise for other
strategies when clusters are connected only at bridges or chains.
Well-known solutions for this find all bridges and cut-points [29].
This works only when a single bridge connects two clusters.
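The bridge-finding step referred to here is classically done with Tarjan's DFS low-link technique in O(V + E) time. A minimal sketch follows; the six-point graph (two triangles joined by one edge) is a hypothetical example, not one of the paper's data sets:

```python
def find_bridges(adj):
    """Return the bridges of an undirected simple graph given as an
    adjacency list; classical DFS low-link method, O(V + E)."""
    n = len(adj)
    disc, low = [0] * n, [0] * n
    visited = [False] * n
    timer = [1]
    bridges = []

    def dfs(u, parent):
        visited[u] = True
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if visited[v]:
                low[u] = min(low[u], disc[v])   # back edge
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:            # v cannot climb above u
                    bridges.append((u, v))

    for s in range(n):
        if not visited[s]:
            dfs(s, -1)
    return bridges

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
bridges = find_bridges(adj)  # only (2, 3) is a bridge
```

As the text notes, this only separates clusters joined by a single bridge; with two or more parallel bridges, no single edge disconnects the graph and none is reported.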
Figure 2: Multiple bridges: (a) a data set (n = 555); (b) three
clusters; (c) three clusters and bridges; (d) one big cluster and one small
cluster.
Consider the data in Figure 2(a). Figure 2(b)
shows three clusters (C1, C2 and
C3). Figure 2(c)
shows that there are three path-like bridges connecting two of the clusters
(C1 and C2), while one connects the other two (C2 and C3).
There are no bridges or cut-points between clusters C1
and C2, while there are between clusters C2 and C3.
Therefore, the approach using bridges and cut-points fails
to detect two clusters linked by multiple bridges, and thus generates two
clusters, C12
and C3, as shown in Figure 2(d).
2.3 Closely located high-density clusters
Clusters emerge as a combination of global effects and local effects.
Clusters may share global first-order effects, but have their own local
second-order effects [7].
Global user-supplied arguments are appropriate only when the set of points
is under the same global circumstances. Thus, the use of global
user-supplied arguments should be minimised, not only to allow for local
variation, but also to spot high-quality clusters.
Figure 3: Closely located high-density clusters: (a) a data set (n
= 570); (b) five clusters; (c) four high-density clusters; (d) three
high-density clusters and one sparse cluster.
Visual inspection of Figure 3(a) reveals four
high-density clusters and one sparse cluster. Figure 3(b)
highlights the four high-density clusters with labels C1 to C4
and one sparse cluster with label C5. However, using only global
arguments, it is very difficult to obtain a reply from a clustering system
that suggests these five clusters simultaneously. Settings of user-supplied
arguments alternate between a reply that

detects four high-density clusters (C1, C2, C3 and
C4), leaving the sparse cluster (C5) out, perhaps as noise
(refer to Figure 3(c)), or

detects three high-density clusters (C1, C2 and C34)
and one sparse cluster (C5), as illustrated in Figure 3(d).

The former is the result of arguments slightly out of tune, for example, a
shorter cut-off criterion function (graph-based), a higher density
parameter (density-based) or a smaller kernel, normally a circle or
rectangle (density-based or grid-based). By relaxing the global arguments,
the sparse cluster can now be identified. But typically, because the
separation between C3 and C4 is smaller than the intra-cluster distance in
cluster C5,
the two are combined into one high-density cluster C34 (refer to
Figure 3(d)).
2.4 Argument-free
Minimising the need to adjust arguments reduces the risk of tampering with
the analysis tools. It also promises greater user friendliness. With our
approach, non-experts can now expect good-quality clusters without
assistance from experts in argument tuning. In addition, it reduces the
experimentation time spent searching for best-fit arguments. Clustering for
knowledge discovery is for revealing implicitly hidden information and
patterns in data. Thus, our approach supports automatic exploratory
analysis [23],
rather than human-driven confirmatory analysis.
3 Using Delaunay Diagrams
3.1 Modelling proximity with Delaunay Diagrams
Two-dimensional clustering is the non-trivial process of grouping
geographically close points into the same cluster. Therefore, a model of
spatial proximity for a discrete point-data set
P = {p_{1},p_{2},...,p_{n}}
must provide robust answers to which are the neighbours of a point p_{i}
and how far the neighbours are relative to the context of the entire data
set P. Once spatial proximity is properly modelled, clustering becomes
a matter of finding relatively sudden change.
This is similar to finding sharp gradient changes in point density, that
is, the boundaries of clusters. Note that we do not have a continuous
density but a discrete point set. The model that does not extrapolate or
interpolate is the discrete graph. One possibility is to use the complete
graph, with all edges weighted by the geographical distance. Here, the
strength of interaction between points is inversely proportional to the
length of the edge (recall the distance-decay effect [2]).
However, if we have n data points, there is no need to have a model
of size O(n^{2}). Moreover, clustering
is a summarisation operation and the goal is to obtain the discrete
relation ''p_{i} is in the same cluster as p_{j}'' for
all p_{i},p_{j} ∈ P (a sub-relation of the complete graph). Thus,
clustering amounts to detecting and removing inconsistent interactions
(edges) in the graph modelling proximity.
We already mentioned that, although raster and vector representations
are widely used in modelling geographical databases, they do not
consistently capture spatial proximity for point-data sets [10,11].
The Voronoi diagram offers an alternative that overcomes the drawbacks of
conventional data models [10].
The Voronoi diagram uniquely captures spatial proximity and represents
the topology explicitly with its dual graph, known as the Delaunay
Triangulation. Each Delaunay edge is an explicit representation of a
neighbourhood relation between points.
Because Delaunay Triangulations are not unique when co-circularity occurs,
we use Delaunay Diagrams. These are the result of removing all edges e
in the Delaunay Triangulation that are sides of a triangle whose vertices
are co-circular with a fourth point. Even in the presence of
co-circularity, Delaunay Diagrams guarantee a unique topology. They can be
constructed effectively and efficiently (in only O(n log n) time).
The structure is succinct [21];
the number of edges is linear in the number of points and the expected
number of edges incident to a point is a constant. In addition, the
effect of long edges is minimised since the Delaunay Diagram is as
equilateral as possible. Further, it is a hybrid of the two widely adopted
models: it is geographically comprehensive like the raster model and holds
explicit adjacency like the vector model.
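As a concrete illustration (not part of the paper's own implementation), a Delaunay edge set for a planar point set can be obtained in O(n log n) expected time; this sketch assumes NumPy and SciPy's `scipy.spatial.Delaunay` are available:

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the undirected Delaunay edge set of a 2-D point array,
    as pairs of point indices with i < j."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:       # each simplex is a triangle
        for a, b in ((0, 1), (1, 2), (0, 2)):
            i, j = sorted((simplex[a], simplex[b]))
            edges.add((i, j))
    return edges

rng = np.random.default_rng(0)
pts = rng.random((100, 2))
edges = delaunay_edges(pts)
# Succinctness: a planar triangulation of n points has at most 3n - 6 edges.
```

SciPy (via Qhull) breaks co-circular ties arbitrarily, so for exact reproduction of the Delaunay Diagram of Section 3.1 one would additionally have to remove edges of co-circular quadruples.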
3.2 Previous clustering methods using Delaunay
Triangulations
Delaunay Triangulations are so powerful at succinctly representing
proximity relationships that several clustering methods have been based on
them. The simplest example is, again, the Minimum Spanning Tree, or
single-linkage hierarchical clustering.
More robust alternatives have been proposed along the paradigm of
edge-removal from the Delaunay Triangulation to obtain the clusters. One
alternative looks at the perimeter (sum of the lengths of the edges) of the
triangles (or facets) of the Delaunay Diagram. The intuition is that, given
the nature of Delaunay Diagrams (where the circumcircle of a Delaunay
triangle does not include any other data point), triangles with large
perimeters must have vertices that are not in the same cluster.
Estivill-Castro and Houle [6]
successfully derive the number of clusters using a one-dimensional
classification of perimeter values. Other approaches are those of Kang
et al. [15]
and Eldershaw and Hegland [4].
They analyse the lengths of Delaunay edges to find clusters of arbitrary
shapes. The intuition is similar: edges that are too long are classified
as inter-cluster edges and removed. The relative success of these previous
approaches demonstrates that Delaunay Triangulations contain much implicit
proximity information.
However, these previous ideas use Delaunay Triangulations directly and
return inconsistent results in the presence of co-circularity. More
seriously, they use a global argument as a threshold to discriminate
perimeter values or edge lengths. This kind of static approach is not able
to detect local variations, and again fails to identify clusters in
situations like those illustrated in Section 2. Recently, Estivill-Castro
and Lee [7]
proposed a dynamic approach which utilises both global and local
variations. The idea behind this method is to find relatively high-density
clusters, or localised excesses, based on the Delaunay Diagram. This
approach overcomes some problems of static approaches and produces
localised multi-level clusters. But it still fails to find relatively
sparse clusters in situations like those in Sections 2.1 and 2.3.
4 AUTOCLUST
4.1 Intuition and definitions
The previous approaches justify the analysis of Delaunay Triangulations
or Delaunay Diagrams to detect inter-cluster edges (Delaunay edges whose
endpoints are in different clusters). We postulate that, in Delaunay
Diagrams, border points of clusters exhibit greater variability (dispersion
or spread) in the lengths of their incident edges, since they have both
inter-cluster edges and intra-cluster edges (Delaunay edges whose endpoints
are in the same cluster).
There are several measures for describing dispersion in statistical
spatial analysis. Undoubtedly, the standard deviation and its square, the
variance, are the most valuable and popular [27].
Thus, for the moment, we may say that AUTOCLUST detects points whose
incident edges exhibit lengths with an unusually large standard deviation.
In order to define this notion precisely, we introduce the following
notation. In what follows, let P = {p_{1},p_{2},...,
p_{n}} denote a set of n points in 2 dimensions and
let DD(P) denote the corresponding Delaunay Diagram of P.
The graph DD(P) is a planar map that contains vertices and
edges. Edges of DD(P) will be called Delaunay edges. For
a point p_{i} ∈ P (a vertex
of DD(P)), the neighbourhood N(p_{i})
is the set of Delaunay edges incident to point
p_{i}.
Definition 4.1.1 We denote by Local_Mean(p_{i}) the
mean length of the edges in N(p_{i}). That is,

Local_Mean(p_{i}) = (1 / d(p_{i})) Σ_{j=1}^{d(p_{i})} |e_{j}|,

where d(p_{i}) denotes the number of Delaunay edges incident to
p_{i} (the degree of p_{i} in the graph-theory sense),
and |e_{j}| denotes the length of the Delaunay edge e_{j} ∈ N(p_{i}).
Definition 4.1.2 We denote by Local_St_Dev(p_{i})
the standard deviation of the lengths of the edges in N(p_{i}). That
is,

Local_St_Dev(p_{i}) = sqrt( (1 / d(p_{i})) Σ_{j=1}^{d(p_{i})} (Local_Mean(p_{i}) - |e_{j}|)^{2} ).
Thus, an edge e_{j} incident to a point p_{i}
could potentially be labelled as inter-cluster (too long) if

|e_{j}| > Local_Mean(p_{i}) + Local_St_Dev(p_{i}).

However, this would only involve local effects. More seriously,
Local_Mean(p_{i})
and Local_St_Dev(p_{i}) are statistical indicators
based on only d(p_{i}) length values. This is a very
small sample, because the expected value of d(p_{i}) is 6 in Delaunay
Diagrams. Thus, we now introduce measures that incorporate global and local
effects, and in fact, these measures will result in tests that make a
decision on each edge using the data of all edges in
DD(P).
Definition 4.1.3 We denote by Mean_St_Dev(P)
the average of the
Local_St_Dev(p_{i}) values as we let p_{i}
be each point in
P. That is,

Mean_St_Dev(P) = (1/n) Σ_{i=1}^{n} Local_St_Dev(p_{i}).
Note that a value of edge-length spread such as Local_St_Dev(p_{i})
is large or small only in relation to the global value reflected in
Mean_St_Dev(P).
Thus, we introduce an indicator of the relative size of
Local_St_Dev(p_{i}) with respect to Mean_St_Dev(P).
Definition 4.1.4 We let Relative_St_Dev(p_{i})
denote the ratio of
Local_St_Dev(p_{i}) to Mean_St_Dev(P). That
is,

Relative_St_Dev(p_{i}) = Local_St_Dev(p_{i}) / Mean_St_Dev(P).
With this tool we propose to discriminate edges into those unusually
short, those unusually long and those that seem to be proper intra-cluster
edges.
Definition 4.1.5 An edge e_{j} ∈
N(p_{i}) belongs to Short_Edges(p_{i}) if and only if the
length of e_{j} is less than
Local_Mean(p_{i}) - Mean_St_Dev(P).
That is,

Short_Edges(p_{i}) = { e_{j} ∈ N(p_{i}) : |e_{j}| < Local_Mean(p_{i}) - Mean_St_Dev(P) }.
Definition 4.1.6 An edge e_{j} ∈
N(p_{i}) belongs to
Long_Edges(p_{i}) if and only if the
length of e_{j} is greater than Local_Mean(p_{i}) + Mean_St_Dev(P).
That is,

Long_Edges(p_{i}) = { e_{j} ∈ N(p_{i}) : |e_{j}| > Local_Mean(p_{i}) + Mean_St_Dev(P) }.
Definition 4.1.7 Other_Edges(p_{i}) denotes the subset
of N(p_{i}) whose edges are greater than or equal to Local_Mean(p_{i})
- Mean_St_Dev(P), and less than or equal to Local_Mean(p_{i}) +
Mean_St_Dev(P).
That is,

Other_Edges(p_{i}) = { e_{j} ∈ N(p_{i}) : Local_Mean(p_{i}) - Mean_St_Dev(P) ≤ |e_{j}| ≤ Local_Mean(p_{i}) + Mean_St_Dev(P) }.
Note that, for each p_{i}, the neighbourhood N(p_{i})
is partitioned into exactly three subsets: Short_Edges(p_{i}),
Long_Edges(p_{i})
and Other_Edges(p_{i}). However, an edge linking
p_{i} with p_{j} may belong, for example,
to both
Short_Edges(p_{i}) and Other_Edges(p_{j}).
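The classification in Definitions 4.1.1-4.1.7 can be sketched as follows. This is an illustration only, not the authors' implementation; the point coordinates and edge list at the bottom are hypothetical toy inputs (a tight triangle plus one far-away point):

```python
import math
from collections import defaultdict

def classify_edges(points, edges):
    """Partition each point's incident edges into Short/Long/Other
    following Definitions 4.1.1-4.1.7."""
    length = {e: math.dist(points[e[0]], points[e[1]]) for e in edges}
    incident = defaultdict(list)
    for e in edges:
        incident[e[0]].append(e)
        incident[e[1]].append(e)
    # Local_Mean(p) and Local_St_Dev(p) over the edges in N(p)
    local_mean = {p: sum(length[e] for e in es) / len(es)
                  for p, es in incident.items()}
    local_sd = {p: math.sqrt(sum((local_mean[p] - length[e]) ** 2
                                 for e in es) / len(es))
                for p, es in incident.items()}
    # Mean_St_Dev(P): the average of the local standard deviations
    mean_sd = sum(local_sd.values()) / len(local_sd)
    short, long_, other = (defaultdict(list) for _ in range(3))
    for p, es in incident.items():
        for e in es:
            if length[e] < local_mean[p] - mean_sd:
                short[p].append(e)      # Short_Edges(p)
            elif length[e] > local_mean[p] + mean_sd:
                long_[p].append(e)      # Long_Edges(p)
            else:
                other[p].append(e)      # Other_Edges(p)
    return short, long_, other

# Hypothetical toy input: points 0-2 form a tight triangle, point 3 is far.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
short, long_, other = classify_edges(points, edges)
```

Note that the same edge can land in different classes at its two endpoints, exactly as remarked above: edge (0, 1) is Short at point 1 (whose local mean is inflated by the long edge to point 3) but Other at point 0.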
4.2 Working principle
The first phase of our algorithm involves the identification, for each
p_{i}, of Long_Edges(p_{i}), Short_Edges(p_{i})
and
Other_Edges(p_{i}). While it is now possible
to declare
Long_Edges as links between points in different clusters,
simply because they clearly indicate that the points are too far away,
it is not as simple to assume that Short_Edges are links for points
within a cluster. In some cases, these very short links also correspond
to points that align themselves on bridges between clusters. For example,
consider the data in Figure 4(a). The Delaunay
Diagram of these 200 points is shown in Figure 4(b).
Some Short_Edges of the data set are illustrated in Figure 4(c).
They correspond to two bridges between the two clusters. This is because
other Delaunay edges incident to points in the bridges are long, so the
edges on the bridges are exceptionally short. Note that edges inside any
of the two clusters are of approximately the same length as edges on the
bridge. But, edges incident to a point p_{i} inside a cluster
are almost all of equal length. Thus, removing all edges e
Short_Edges(p_{i}) eliminates some bridges,
while if such e happens inside a cluster, its removal can later
be recuperated using analysis of connected components. Later phases of
our algorithm handle other types of bridges and recuperate
Short_Edges
that are intracluster links.
Figure 4: Two clusters connected by two bridges: (a) a data set (n
= 200); (b) its Delaunay Diagram; (c) the
Short_Edges on bridges.
4.2.1 First justification
While we have the intuitively supported rationale in AUTOCLUST to remove
all edges that are in Long_Edges(p_{i}) ∪
Short_Edges(p_{i}) for some p_{i} ∈
P, we now give another justification. Note that the analysis of
edges in DD(P) is the means towards grouping points, towards
clustering. We noted that the relation ''p_{i} is in the same
cluster as p_{j}'' is a subset of the complete graph and encodes
the clustering. This is what establishes the connection between grouping
points and the analysis of edges. Previous clustering algorithms using
Delaunay Triangulations [4,6,15]
focus much more on edges, because all operations and cut-off criterion
functions are performed and formulated based on edge analysis. AUTOCLUST
is closer to the objective of grouping points because
Local_Mean(p_{i})
and Local_St_Dev(p_{i}) report on local behaviour
around p_{i}. These values represent the behaviour of points
more than that of edges. Global effects are incorporated in discriminating
Long_Edges(p_{i}), Short_Edges(p_{i})
and
Other_Edges(p_{i}), with the use of Mean_St_Dev(P).
This makes all n points influence the decisions for any other
p_{i}. Perturbations of a point p_{j} far
from a point p_{i} do not affect the Delaunay Diagram in
the vicinity of p_{i}. However, the relative configuration
of other points p_{j} away from p_{i} does
influence Mean_St_Dev(P).
4.2.2 Second justification
As a second justification of AUTOCLUST's removal of all edges that are
in Long_Edges(p_{i}) ∪
Short_Edges(p_{i}) for some
p_{i} ∈
P, first recall that border points p_{i} of a cluster
C are expected to have Local_Mean(p_{i}) and
Local_St_Dev(p_{i})
larger than points totally inside the cluster C. Thus, if Local_Mean(p_{i})
and
Local_St_Dev(p_{i}) were statistically meaningful,
in that they were supported by a large sample, a statistical test to remove
unusually long edges would have the form

|e_{j}| > Local_Mean(p_{i}) + k × Local_St_Dev(p_{i}),

where k would be a constant to evaluate how unusual an edge is.
However, for a point p_{i} inside a cluster C,
there is uniformity around it, so all its incident edges have similar
lengths and Local_St_Dev(p_{i}) is relatively small. In this
case, we would set the value of k to be larger than 1 in order to
widen the acceptance interval and make sure we preserve edges inside a
cluster. On the other hand, for a point p_{i} that does
not belong to any cluster (an outlier or noise), there are long edges
around it reaching other points. This makes
Local_Mean(p_{i})
large, and because
Local_St_Dev(p_{i})
is in the same units as Local_Mean(p_{i}),
Local_St_Dev(p_{i})
is relatively large as well. Thus, in this case, we would set the value of
k to be less than one in order to restrict the acceptance interval further
and ensure we remove more edges incident to an outlier.
The inverse of Relative_St_Dev(p_{i}) fulfils
perfectly the role required of the factor k. It is less than one
for those points
p_{i} that locally exhibit more spread
(of the lengths of their incident edges) than the generality of all points
in P. It is larger than one for those points p_{i} that
locally exhibit less spread than the generality of all points in P.
Incorporating k = Relative_St_Dev(p_{i})^{-1}
into the statistical test, we have the following derivation:

Local_Mean(p_{i}) - k × Local_St_Dev(p_{i}) ≤ |e_{j}| ≤ Local_Mean(p_{i}) + k × Local_St_Dev(p_{i})
⟺ Local_Mean(p_{i}) - (Mean_St_Dev(P) / Local_St_Dev(p_{i})) × Local_St_Dev(p_{i}) ≤ |e_{j}| ≤ Local_Mean(p_{i}) + (Mean_St_Dev(P) / Local_St_Dev(p_{i})) × Local_St_Dev(p_{i})
⟺ Local_Mean(p_{i}) - Mean_St_Dev(P) ≤ |e_{j}| ≤ Local_Mean(p_{i}) + Mean_St_Dev(P).

This is the acceptance interval for Other_Edges(p_{i}).
4.3 A three-phase clustering algorithm
AUTOCLUST consists of three phases. Phase
I classifies incident edges into Short_Edges(p_{i}),
Long_Edges(p_{i}) and
Other_Edges(p_{i})
for each p_{i}. Phase I also removes all edges in Short_Edges(p_{i})
and Long_Edges(p_{i}) for all points p_{i}.
This produces the boundaries of clusters, perhaps with some bridges and
perhaps isolating some points. For each point p_{i}, Phase
II recovers edges in Short_Edges(p_{i}) as long as
they do not merge non-trivial connected components; if this happens, it
may re-establish the connected components by removing some
Other_Edges(p_{i}).
Finally, Phase III looks for bridges in a local region of each p_{i}.
Figure 5 shows the operation of these phases
on a data set with 4 clusters. The Delaunay Diagram is presented in Figure
5(a), while Figure 5(b)
shows the conclusion of Phase I. Application of the second phase results
in Figure 5(c): isolated data points are attached
back. Phase III breaks the bridges between the two bottom clusters.
Figure 5: Cleaning procedures: (a) Delaunay Diagram; (b) after Phase I
(initial 3 clusters); (c) after Phase II (3 clusters with refinement);
(d) after Phase III (4 clusters).
4.3.1 Phase I: Finding boundaries
Our earlier discussion justifies in detail the removal of all edges in
all Long_Edges(p_{i}) from the Delaunay Diagram.
They are simply too long and merge two distinct clusters. We stress
that edges e = (p_{i},p_{j}) in Short_Edges(p_{i})
where p_{j} is in the same cluster are only temporarily removed,
to be recuperated in the next phase because p_{i} and
p_{j} will be in the same connected component.
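Phase I's removal step can be sketched as below. The Short/Long labellings passed in are hypothetical inputs of the kind produced by the classification of Section 4.1; this is an illustration, not the authors' code:

```python
def phase_one(edges, short_edges, long_edges):
    """Phase I sketch: remove every edge labelled Short or Long by at
    least one endpoint; removed Short edges may return in Phase II."""
    removed = set()
    for es in short_edges.values():
        removed.update(es)
    for es in long_edges.values():
        removed.update(es)
    return [e for e in edges if e not in removed]

# Hypothetical labelling for a small five-edge example: edge (0, 1) is
# Short at point 1; edges (1, 3) and (2, 3) are Long at points 1 and 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
kept = phase_one(edges, {1: [(0, 1)]}, {1: [(1, 3)], 2: [(2, 3)]})
```

The surviving edges define the initial rough boundaries; connected components computed over them feed Phase II.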
4.3.2 Phase II: Restoring and reattaching
The goal of this phase is to restore the edges e = (p_{i},p_{j})
in Short_Edges(p_{i}) where p_{i}
and p_{j} are in the same cluster. A further goal is to
remove some Other_Edges(p_{i}) that are creating
paths that may result in bridges or links between border points and noise
points. Thus, the first step of Phase II is to compute connected components
and label each point
p_{i} with its connected component
CC[p_{i}]. If a connected component has more than
one element, it is called non-trivial. If
p_{i} is the only
point in a connected component, then the point p_{i} is
isolated. In that case, CC[p_{i}] is not a cluster.
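Computing and labelling CC[p_{i}] is a standard connected-components task. A minimal union-find sketch (not the paper's implementation) is:

```python
def connected_components(n, edges):
    """Label each of n points with a connected-component representative,
    using union-find with path halving."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    return [find(i) for i in range(n)]

# Points 0-2 form one non-trivial component, 3-4 another; 5 is isolated.
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

In Phase II terms, components of size one (like point 5 here) are isolated points, and only components with more than one element qualify as non-trivial.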
To introduce the processing of Phase II, we bring attention to the
following observation.
Observation 4.3.1 Let p_{i} ∈
P be such that
Other_Edges(p_{i}) ≠ ∅.
Then, all edges e ∈ Other_Edges(p_{i}) connect
p_{i} to the same non-trivial connected component.
Using our notation, this means that if two edges
e_{j}
= (p_{i},p_{j}) and e_{k}
= (p_{i},p_{k}) are in
Other_Edges(p_{i}),
then CC[p_{j}] = CC[p_{k}].
The observation follows trivially from the fact that p_{j}
and p_{k} are connected by the path e_{j},e_{k}.
Now, we recuperate
Short_Edges, and in some cases, revise the connected
component for p_{i}. If all edges in Short_Edges(p_{i})
unanimously suggest one connected component, p_{i} is integrated
into that suggestion. If there are two edges in Short_Edges(p_{i})
suggesting different connected components, the largest is adopted. While
Other_Edges(p_{i}) participate in the creation
of the connected components, they now delegate precedence to the
suggestion by Short_Edges(p_{i}) (which in the large
majority of cases is the same suggestion). The component swap typically
occurs at isolated boundary points that have no Other_Edges (for
example, the isolated points on the left of the bottom cluster in Figure
5(b)). The details of this process are as follows.

For each point p_{i}, we perform the following steps if
Short_Edges(p_{i}) ≠ ∅
and there exists an edge e = (p_{i},p_{j}) ∈
Short_Edges(p_{i}) such that CC[p_{j}]
is a nontrivial connected component.

Determine a candidate connected component C for p_{i}.

If there are two edges
e_{j} = (p_{i},p_{j})
and
e_{k} = (p_{i},p_{k})
in
Short_Edges(p_{i}) with
CC[p_{j}] ≠ CC[p_{k}], then

Compute, for each edge e = (p_{i},p_{j}) ∈
Short_Edges(p_{i}), the size |CC[p_{j}]|,
and let M = max_{e = (p_{i},p_{j}) ∈ Short_Edges(p_{i})} |CC[p_{j}]|.

Let C be the class label of the largest connected component (if
there are two different connected components with cardinality M,
we let C be the one with the shortest edge to p_{i}).

Otherwise, let C be the label of the connected component that all edges
e ∈ Short_Edges(p_{i})
connect p_{i} to.

If the edges in Other_Edges(p_{i}) connect
to a connected component different from C, remove them. Note that

because of Observation 4.3.1 above, when this
case applies, all edges in Other_Edges(p_{i})
are removed, and

only in this case, will p_{i} swap nontrivial connected
components.

Restore all edges e ∈ Short_Edges(p_{i})
that connect to C.

Recompute connected components.

For each singleton p_{i} (|CC[p_{i}]|
= 1), we restore all Short_Edges(p_{i}) if,
for every edge e = (p_{i},p_{j}) ∈
Short_Edges(p_{i}), CC[p_{j}]
is the same connected component C.
Thus, Phase II incorporates points that are isolated in singleton connected
components to nontrivial connected components if and only if they are
very close (they have at least one Short_Edge to a nontrivial connected
component) in step 1. Singletons whose incident edges connected to another
singletons are reattached in step 3. Phase II also reassigns a point
p_{i} linked to a cluster C by Other_Edges(p_{i})
if there are
Short_Edges(p_{i}) that indicate membership
to .
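The adoption rule in step 1 can be sketched as a small helper. This is a hypothetical illustration of the rule described above (the largest nontrivial connected component wins, with ties on cardinality M broken by the shortest edge to p_{i}); the function name and input encoding are our own, not the authors' code.

```python
def choose_component(short_edges, cc, sizes):
    """Pick the component that p_i should join.

    short_edges: list of (neighbour, edge_length) pairs for Short_Edges(p_i)
    cc:          neighbour -> connected-component label
    sizes:       component label -> cardinality
    Returns None when no neighbour lies in a nontrivial component.
    """
    # Only nontrivial components (more than one element) are candidates.
    candidates = {cc[q] for q, _ in short_edges if sizes[cc[q]] > 1}
    if not candidates:
        return None
    best = max(sizes[c] for c in candidates)
    tied = [c for c in candidates if sizes[c] == best]
    if len(tied) == 1:
        return tied[0]
    # Tie on cardinality M: adopt the component reached by the shortest edge.
    return min((l, cc[q]) for q, l in short_edges if cc[q] in tied)[1]

sizes = {0: 5, 1: 5, 2: 1}          # component 2 is a singleton, ignored
cc = {10: 0, 11: 1, 12: 2}          # neighbours' component labels
print(choose_component([(10, 0.7), (11, 0.4), (12, 0.2)], cc, sizes))  # -> 1
```

In the example, components 0 and 1 tie with cardinality 5, so the component reached through the shorter edge (length 0.4, to neighbour 11) is adopted.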
4.3.3 Phase III: Detecting second-order inconsistency
After the first two phases, the current subgraph of
DD(P)
is the result of local analysis with regard to global effects. However,
to complete the analysis, the immediate neighbourhood is extended slightly.
Thus, we extend the notion of neighbourhood of a vertex p_{i}
in a graph G as follows.
Definition 4.3.1 The k-neighbourhood N_{k,G}(p_{i})
of a vertex p_{i} in graph G is the set of edges of G that belong
to a path of k or fewer edges starting at p_{i}.
The third phase now computes Local_Mean for 2-neighbourhoods.
Definition 4.3.2 We denote by Local_Mean_{k,G}(p_{i})
the mean length of edges in N_{k,G}(p_{i}). That is,

Local_Mean_{k,G}(p_{i}) = ( Σ_{e ∈ N_{k,G}(p_{i})} |e| ) / |N_{k,G}(p_{i})|.

For each point p_{i}, this phase simply removes all edges
in N_{2,G}(p_{i}) that are long edges.
That is, all e ∈ N_{2,G}(p_{i})
such that |e| > Local_Mean_{2,G}(p_{i})
+ Mean_St_Dev(P).
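Under these definitions, Phase III can be sketched as follows. The helper names and the hard-coded four-point path graph are our own illustration, not the authors' code; Mean_St_Dev(P) is passed in as a precomputed value.

```python
from statistics import mean

def k_neighbourhood(adj, p, k=2):
    """Edges of G lying on some path of k or fewer edges starting at p."""
    frontier, found = {p}, set()
    for _ in range(k):
        step = set()
        for u in frontier:
            for v in adj[u]:
                found.add(tuple(sorted((u, v))))  # undirected edge
                step.add(v)
        frontier = step
    return found

def phase3_long_edges(adj, lengths, p, mean_st_dev):
    """Edges of N_{2,G}(p) longer than Local_Mean_{2,G}(p) + Mean_St_Dev(P)."""
    n2 = k_neighbourhood(adj, p, 2)
    local_mean = mean(lengths[e] for e in n2)
    return {e for e in n2 if lengths[e] > local_mean + mean_st_dev}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # a 4-point path graph
lengths = {(0, 1): 1.0, (1, 2): 5.0, (2, 3): 6.0}
print(phase3_long_edges(adj, lengths, 0, 0.5))    # -> {(1, 2)}
```

From point 0, the 2-neighbourhood contains edges (0, 1) and (1, 2); their Local_Mean is 3.0, so with Mean_St_Dev = 0.5 the edge of length 5.0 exceeds the threshold 3.5 and is removed, even though it was not flagged in the first-order analysis at its own endpoints.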
4.4 AUTOCLUST only requires O(n
log n) time
Subquadratic time complexity is central for clustering algorithms in spatial
data mining. AUTOCLUST is applicable to clustering of very large georeferenced
data since it only requires O(n log n) time. AUTOCLUST
requires O(n log n) time to construct the Delaunay
Diagram. Once this model of proximity is available, AUTOCLUST performs
the remaining work in linear time. The calculation of
Mean_St_Dev(P)
is bounded by twice the number of edges, and the edge elimination in Phase
I requires O(n) time. To compute connected components only
requires O(n) time, since this is linear in the sum of both
number of edges and number of points. Each test to restore Short_Edges(p_{i})
in Phase II is bounded by d(p_{i}) and thus, the
total time for Phase II is also linear. Similarly, the local subgraph around
p_{i} in Phase III has size O(d(p_{i})^{2}).
Because in the Delaunay Diagram we have
E[d(p_{i})]
= 6, the total execution in this phase requires time proportional to the
sum of the degrees of all vertices, which is
O(n). Consequently,
the time complexity of AUTOCLUST is dominated by computing Delaunay Diagram
in O(n log n) time.
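The linear-time claim for the connected-components step can be illustrated with a standard breadth-first labelling, which visits each point and each edge a constant number of times and so runs in O(n + m); this is a generic sketch, not the authors' code.

```python
from collections import deque

def connected_components(adj):
    """Label every vertex with a component id in O(n + m) via BFS."""
    cc, label = {}, 0
    for start in adj:
        if start in cc:
            continue                      # already labelled by an earlier BFS
        cc[start] = label
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:              # each edge examined twice in total
                if v not in cc:
                    cc[v] = label
                    queue.append(v)
        label += 1
    return cc

adj = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
print(connected_components(adj))  # three components: {0,1}, {2,3}, {4}
```

Since the Delaunay Diagram has O(n) edges, this step stays linear in n, consistent with the accounting above.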
5 Performance evaluation
This section presents results of experiments with synthetic data sets and
real data sets that illustrate the remarkable robustness of AUTOCLUST.
5.1 Clustering quality with synthetic data sets
Figure 6: Synthetic data sets: (a) Dataset I (n = 8000 and 8 clusters);
(b) Dataset II (n = 8000 and 8 clusters); (c) Dataset III (n
= 8000 and 6 clusters); (d) Dataset IV ( n = 3250 and 12 clusters).
The data in Figure 6(a) is the result of
synthetic data generation aimed at creating situations where clustering
methods typically fail. Section 2 explains
these types of situations. The clustering obtained with AUTOCLUST demonstrates
its robustness. AUTOCLUST correctly identifies two clusters connected by
multiple bridges in the top-left corner of Figure 6(a).
AUTOCLUST has no difficulty in detecting a cross-shaped, but sparse, cluster
around a high-density cluster at the top-right corner of that data set.
Again, this mimics real-world cases like a densely populated city surrounded
by slightly less dense suburbs. AUTOCLUST successfully reports the two
high-density clusters separated by a twisting narrow gap (at the bottom-left
of the data set). This mimics real-world situations where a river separates
two highly populated areas. AUTOCLUST correctly reports the doughnut-shaped
cluster and the surrounded island-shaped cluster (at the bottom-right of
that data set).
The data sets in Figure 6(b) and Figure
6(c) belong to CHAMELEON's benchmark [16].
Regardless of the challenge posed by different sizes, different densities,
and different separations (inter-cluster distances), AUTOCLUST correctly
identifies 8 clusters and 6 clusters, respectively. Some of these clusters
have very little space between them, but AUTOCLUST still detects these
boundaries.
The data in Figure 6(d) consists of 12
clusters of various shapes and different densities. Note that many are
not convex. AUTOCLUST successfully detects all of them.
5.2 Clustering quality with real data sets
The data sets displayed in Figure 7 are real data
sets of Queensland in Australia. Figure 7(a) displays
centres for living areas in Queensland and AUTOCLUST clustering results.
In the bottom-right of the data set, Brisbane, the capital city of Queensland,
creates one large cluster (incorporating 70 percent of the living areas).
AUTOCLUST detects this and simultaneously finds three relatively smaller
clusters along the coast which correspond to big cities such as Cairns,
Atherton and Townsville. Inland areas have no clusters. This highlights
the pattern of urban settlement in coastal areas and around capital cities.
Figure 7: Real data sets: (a) living areas (n = 1,837) and AUTOCLUST
grouping into 4 clusters; (b) landing grounds (n = 1,579) and AUTOCLUST
grouping into 3 clusters (greater than 1% of n); (c) national parks
(n = 227) and AUTOCLUST grouping into 4 clusters (greater than 2%
of n).
The data set displayed in Figure 7(b) represents
landing grounds including aerodromes and airports. Visual inspection does
not suggest any significant cluster. Landing grounds seem to be scattered
all over Queensland. AUTOCLUST groups 90 percent of them into a large cluster
that covers almost all Queensland. AUTOCLUST's results are consistent with
the fact that there is no special pattern in landing grounds. AUTOCLUST
also detects two small clusters in the southern portion of Brisbane. Note
that visual inspection reveals a lower density of landing grounds
surrounding the concentrated living areas near Brisbane, creating
the two other clusters south and east of Brisbane. Densely populated urban
areas are not likely to have enough space for many airports. Conversely,
sparsely populated inland areas are likely to have enough space for small
aerodromes.
Figure 7(c) illustrates a data set for national
parks in Queensland. National parks are generally located along the coast
and the Great Barrier Reef. AUTOCLUST successfully derives clusters
directly overlying the major cities, urban areas and coastal areas. It
finds two high-density clusters including more than half of the national
parks on the outer margin of Australia's continental shelf. Two less dense clusters
are detected around urban areas.
5.3 CPU time comparison
While AUTOCLUST requires O(n log n) time, we are interested
in verifying experimentally that the constants under the O notation
are small. This is to be expected, since the dominating
O(n
log n) time term in AUTOCLUST is the computation of the Delaunay
Diagram. Everything else requires O(n) time.
We compare AUTOCLUST with AMOEBA [7].
AMOEBA also requires O(n log n) time and has been
shown to be very efficient (a small constant under the O notation).
We generated 7 types of data sets based on a simple mixture model. The
number of points varies from 1,000 to 100,000. Each data set has 10 randomly
distributed cluster centres within the unit square [0, 1] × [0, 1], and 10
percent of the total number of points are just noise points (selected at
random within the unit square [0, 1] × [0, 1]). Since AUTOCLUST and AMOEBA
require spatial proximity modelling
(Delaunay Diagram), the run time comparison shown in Figure 8
includes the time for constructing their respective data structures. All
the experiments were performed on a 550MHz Pentium III workstation with
128MB memory.
Figure 8: Comparison of CPU times in seconds.
Figure 8 demonstrates that AUTOCLUST outperforms
AMOEBA with respect to CPU time by a small margin. Note, however, that
the exploration time to find best-fit arguments for AMOEBA is not included.
AUTOCLUST is argument-free. Thus, in exploratory data analysis exercises
AUTOCLUST will easily outperform AMOEBA or other clustering
methods that need parameter tuning. The experimental results shown in Figure
8 include the time requirements for I/O and the clustering
process. These experimental results illustrate that AUTOCLUST is efficient
in terms of time complexity and scalable to large data sets.
6 Final remarks
Spatial data mining is the discovery of interesting patterns, structures,
relationships or characteristics that may exist implicitly in spatial databases
[20].
This information is hidden and unknown. Thus, we do not know what kinds
of patterns exist, what causes them, how many clusters are embedded and
where they are. Only the data can genuinely reveal this information. Hypotheses
about the patterns are suggested from the data in exploratory data analysis.
This is very different from the confirmation of human-generated hypotheses
in confirmatory data analysis. Hence, for exploratory data analysis, the
need to supply prior knowledge should be minimised, removing human-generated
biases. AUTOCLUST is a clustering tool for massive georeferenced data
sets free of user-supplied arguments. Moreover, this freedom eliminates
expensive human-based exploration time for finding the best-fit arguments needed
by alternative approaches. AUTOCLUST incorporates global variations and
local variations. Hence, it is robust to noise, outliers, bridges and the
type of distribution. It is able to detect clusters of arbitrary shapes,
different sizes and different densities. Clusters with multiple bridges
are also correctly separated. It only requires O(n log n)
time, where
n is the size of the data set. A series of comparisons
demonstrates the strengths of AUTOCLUST in terms of efficiency and effectiveness.
We expect that our methods will be extended in two directions. First, the
approach can be extended to 3 dimensions by computing a spanner graph of
linear size and obtaining local and global information efficiently [3]
from this graph. Second, the approach can naturally be extended to other
metrics with well-defined Delaunay Diagrams, such as the Manhattan distance
or other L_p distances. Further research will explore these directions.
References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace
Clustering of High Dimensional Data for Data Mining Applications. In Proceedings
of the ACM SIGMOD International Conference on Management of Data, pages
94-105, 1998.

[2] T. C. Bailey and A. C. Gatrell. Interactive Spatial Analysis. Wiley,
New York, 1995.

[3] M. T. Dickerson and D. Eppstein. Algorithms for proximity problems in higher
dimensions. Computational Geometry: Theory and Applications, 5:277-291,
1996.

[4] C. Eldershaw and M. Hegland. Cluster Analysis using Triangulation. In B.
J. Noye, M. D. Teubner, and A. W. Gill, editors, Computational Techniques
and Applications: CTAC97, pages 201-208. World Scientific, Singapore,
1997.

[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm
for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings
of the 2nd International Conference on Knowledge Discovery and Data Mining,
pages 226-231, 1996.

[6] V. Estivill-Castro and M. E. Houle. Robust Clustering of Large Georeferenced
Data Sets. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge
Discovery and Data Mining, pages 327-337, 1999.

[7] V. Estivill-Castro and I. Lee. AMOEBA: Hierarchical Clustering Based on
Spatial Proximity Using Delaunay Diagram. In Proceedings of the 9th
International Symposium on Spatial Data Handling, 2000. To appear.

[8] V. Estivill-Castro and I. Lee. AUTOCLUST: Automatic Clustering via Boundary
Extraction for Mining Massive Point-Data Sets. Technical Report 2000-03,
Department of Computer Science and Software Engineering, University of
Newcastle, 2000.

[9] C. Fraley. Algorithms for Model-Based Gaussian Hierarchical Clustering.
SIAM Journal on Scientific Computing, 20(1):270-281, 1998.

[10] C. M. Gold. Problems with handling spatial data: the Voronoi approach.
CISM Journal ACSGC, 45(1):65-80, 1991.

[11] C. M. Gold. The meaning of Neighbour. In G. Tinhofer and G. Schmidt, editors,
Theories and Methods of Spatio-Temporal Reasoning in Geographic Space,
pages 220-235, Berlin, 1992. Springer-Verlag Lecture Notes in Computer
Science 639.

[12] S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm
for Large Databases. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 73-84, 1998.

[13] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for
Categorical Attributes. In Proceedings of the International Conference
on Data Engineering, pages 512-521, 1999.

[14] F. Harary. Graph Theory. Addison-Wesley, Reading, MA, 1969.

[15] I. Kang, T. Kim, and K. Li. A Spatial Data Mining Method by Delaunay Triangulation.
In Proceedings of the 5th International Workshop on Advances in Geographic
Information Systems (GIS-97), pages 35-39, 1997.

[16] G. Karypis, E. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. IEEE Computer: Special Issue on Data
Analysis and Mining, 32(8):68-75, 1999.

[17] K. Koperski, J. Han, and J. Adhikary. Mining Knowledge in Geographical
Data. Communications of the ACM. To appear.

[18] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms.
The Computer Journal, 26:354-359, 1983.

[19] F. Murtagh. Cluster analysis using proximities. In J. Hampton, R. Michalski,
P. Theuns, and I. V. Mechelen, editors, Categories and Concepts: Theoretical
Views and Inductive Data Analysis, pages 225-245. Academic Press, New
York, 1993.

[20] R. T. Ng and J. Han. Efficient and Effective Clustering Method for Spatial
Data Mining. In Proceedings of the 20th International Conference on
Very Large Data Bases (VLDB), pages 144-155, 1994.

[21] A. Okabe, B. N. Boots, and K. Sugihara. Spatial Tessellations: Concepts
and Applications of Voronoi Diagrams. John Wiley & Sons, West Sussex,
1992.

[22] S. Openshaw. A Mark 1 Geographical Analysis Machine for the automated analysis
of point data sets. International Journal of GIS, 1(4):335-358,
1987.

[23] S. Openshaw. Two exploratory space-time-attribute pattern analysers relevant
to GIS. In S. Fotheringham and P. Rogerson, editors, Spatial Analysis
and GIS, pages 83-104. Taylor and Francis, London, 1994.

[24] S. Openshaw. Geographical data mining: key design issues. In Proceedings
of GeoComputation 99, 1999.

[25] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach
to Spatial Data Mining. In Proceedings of the 23rd International Conference
on Very Large Data Bases (VLDB), pages 144-155, 1997.

[26] W. Wang, J. Yang, and R. Muntz. STING+: An Approach to Active Spatial Data
Mining. In Proceedings of the International Conference on Data Engineering,
pages 116-125, 1999.

[27] R. Webster and M. A. Oliver. Statistical Methods in Soil and Land Resource
Survey. Oxford University Press, New York, 1990.

[28] C. S. Wallace. Intrinsic Classification of Spatially Correlated Data. The
Computer Journal, 41(8):602-611, 1998.

[29] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt
Clusters. IEEE Transactions on Computers, 20(1):68-86, 1971.

[30] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering
Method for Very Large Databases. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, pages 103-114, 1996.