An Assessment of the Effect of Training Sample Size on Bayesian Network
Learning

GRIFFIN, A.L.

The Pennsylvania State University, USA

Email: alg207@psu.edu

Key words: Bayesian Network Learning, Training Sample Size, Overfitting,
Underfitting

Bayesian networks have often been used to classify data sets for which
a relatively large number of training sites is available (e.g., remotely
sensed imagery). They have been shown to classify more accurately
than Gaussian statistical methods (e.g., maximum-likelihood classifiers),
especially when the proportion of training sites is large relative to the
total number of classification sites and attributes. Their utility
in solving complex classification problems where the proportion of available
training samples is small has not been examined. To address this
question, this paper examines the shape of the classification accuracy
curve as the proportion of training samples decreases, and compares Bayesian
network learning (BNL) performance to that achieved with traditional statistics
(cluster and discriminant analysis). Several Bayesian networks were constructed
to represent the problem of predicting dominant overstory and understory
species at a particular field site based on a suite of environmental measurements.
Data from the Oregon Woody Plant and Environment Database (OWPED) were
used to test the ability of the networks to generalize using varying proportions
of the total number of observations (n = 2254) as training data. The classification
accuracy curves and measures of attribute value variability of each of
the networks were quantified and compared to determine whether there are
common curve responses to variation in both the proportion of available
training sites and attribute variability. Knowledge of the shape
of this curve helps in assessing whether BNL is an appropriate technique
for exploring a particular data set.
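
The idea of a classification accuracy curve can be illustrated with a
minimal sketch. Everything in it is an assumption made for illustration:
the data are synthetic (only the sample size, n = 2254, echoes the text),
the three discrete attributes and three class labels are hypothetical, and
a naive Bayes classifier (the simplest Bayesian network structure) stands
in for the networks built in the paper. It does not use OWPED data and is
not the method used in the study.

```python
import math
import random
from collections import Counter, defaultdict


def make_data(n, seed=0):
    """Synthetic stand-in for field observations (hypothetical, not OWPED)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.choice(["A", "B", "C"])
        base = {"A": 0, "B": 1, "C": 2}[label]
        # Each attribute takes the class-specific value 70% of the time,
        # otherwise a uniformly random value in {0, 1, 2}.
        attrs = tuple(
            base if rng.random() < 0.7 else rng.randrange(3) for _ in range(3)
        )
        rows.append((attrs, label))
    return rows


def train_naive_bayes(rows):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in rows)
    cond_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for attrs, label in rows:
        for i, value in enumerate(attrs):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts


def predict(model, attrs, n_values=3):
    """Return the class with the highest log posterior (Laplace smoothed)."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(attrs):
            smoothed = (cond_counts[(i, label)][value] + 1) / (count + n_values)
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy_curve(rows, fractions, seed=0):
    """Accuracy on a fixed held-out set for each training fraction."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5  # hold out 20% for testing
    test, pool = shuffled[:n_test], shuffled[n_test:]
    curve = []
    for fraction in fractions:
        n_train = max(1, int(fraction * len(pool)))
        model = train_naive_bayes(pool[:n_train])
        correct = sum(predict(model, attrs) == label for attrs, label in test)
        curve.append((fraction, correct / len(test)))
    return curve


# Trace the curve as the proportion of training samples decreases.
for fraction, accuracy in accuracy_curve(make_data(2254), [1.0, 0.5, 0.1, 0.01]):
    print(f"training fraction {fraction:.2f}: accuracy {accuracy:.3f}")
```

Holding the test set fixed while shrinking only the training pool keeps
the accuracy values comparable across fractions, so the printed pairs
trace a single curve rather than a set of unrelated estimates.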