A method is provided for selecting a representative set of training data
for training a statistical model in a machine condition monitoring
system. The method reduces the time required to choose representative
samples from a large data set by using a nearest-neighbor sequential
clustering technique in combination with a kd-tree. A distance threshold
is used to limit the geometric size of the clusters. Each node of the
kd-tree is assigned a representative sample from the training data, and
similar samples are subsequently discarded.
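
The selection scheme summarized above can be illustrated with a minimal Python sketch. It is one plausible reading of the abstract, not the claimed implementation. The sketch assumes that samples are processed one at a time, that each kept sample becomes a kd-tree node, and that a new sample is discarded when its nearest kept sample lies within the distance threshold. The names `Node`, `insert`, `nearest_dist`, and `select_representatives` are hypothetical, as are the use of Euclidean distance and the absence of tree rebalancing.

```python
import math


class Node:
    """One kd-tree node holding a single representative sample."""

    def __init__(self, point, axis):
        self.point = point   # representative sample (tuple of floats)
        self.axis = axis     # coordinate index this node splits on
        self.left = None     # samples with a smaller value on `axis`
        self.right = None    # samples with an equal or larger value


def insert(root, point):
    """Insert `point` as a new leaf and return the (possibly new) root."""
    if root is None:
        return Node(point, 0)
    node = root
    while True:
        go_left = point[node.axis] < node.point[node.axis]
        child = node.left if go_left else node.right
        if child is None:
            new_node = Node(point, (node.axis + 1) % len(point))
            if go_left:
                node.left = new_node
            else:
                node.right = new_node
            return root
        node = child


def nearest_dist(node, point, best=math.inf):
    """Return the distance from `point` to its nearest stored sample."""
    if node is None:
        return best
    best = min(best, math.dist(node.point, point))
    diff = point[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_dist(near, point, best)
    # Visit the far side only if the splitting plane is closer than `best`.
    if abs(diff) < best:
        best = nearest_dist(far, point, best)
    return best


def select_representatives(samples, threshold):
    """Keep a sample only if no kept sample lies within `threshold` of it."""
    root = None
    representatives = []
    for sample in samples:
        sample = tuple(sample)
        if nearest_dist(root, sample) > threshold:
            root = insert(root, sample)       # start a new cluster
            representatives.append(sample)
        # Otherwise the sample is similar to a kept one and is discarded.
    return representatives
```

For example, calling `select_representatives` on the samples `(0, 0)`, `(0.1, 0)`, `(1, 1)`, `(1.05, 1)`, and `(5, 5)` with a threshold of `0.5` keeps `(0, 0)`, `(1, 1)`, and `(5, 5)`. In this sketch the threshold bounds the cluster radius around each representative. The tree is never rebalanced, so search time depends on the order in which samples arrive.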