In one exemplary embodiment, the invention provides a data mining system for
use in finding clusters of data items in a database or any other data
storage medium. A portion of the data in the database is read from a
storage medium and brought into a rapid access memory buffer whose size is
determined by the user or operating system depending on available memory
resources. Data contained in the data buffer is used to update the
original model data distributions in each of the K clusters in a
clustering model. Some of the data belonging to a cluster is summarized or
compressed and stored in a reduced form as sufficient statistics of that
data. More data is then accessed from the database
and the models are updated. An updated set of parameters for the clusters
is determined from the summarized data (sufficient statistics) and the
newly acquired data. Stopping criteria are evaluated to determine if
further data should be accessed from the database. Each time data is
read from the database, a holdout set of data is used to evaluate the
then-current model as well as other possible cluster models chosen from a
candidate set of cluster models. Evaluation on the holdout set allows a
cluster model with a different cluster number K' to be chosen if that
model describes the holdout data more accurately.
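
The loop described above can be illustrated with a minimal Python sketch. It is not the claimed implementation, and it rests on several assumptions that do not come from the text: the data is one-dimensional, each point is assigned to its nearest cluster mean and compressed immediately into count, sum, and sum-of-squares statistics (the embodiment compresses only some of the data), all candidate models are updated in parallel, the holdout score is an average Gaussian-mixture log-likelihood, and the stopping criterion is a small change in that score. The names `ClusterStats`, `init_model`, `update`, `holdout_score`, `cluster_stream`, and `tol` are hypothetical.

```python
import math
import random


class ClusterStats:
    """Sufficient statistics for one 1-D cluster: count, sum, sum of squares."""

    def __init__(self, seed_mean):
        self.n, self.s, self.ss = 0, 0.0, 0.0
        self.seed_mean = seed_mean  # used only until the first point arrives

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    @property
    def mean(self):
        return self.s / self.n if self.n else self.seed_mean

    @property
    def var(self):
        if self.n < 2:
            return 1.0  # assumed default variance for nearly empty clusters
        return max(self.ss / self.n - self.mean ** 2, 1e-6)


def init_model(buf, k):
    """Seed K clusters at evenly spaced quantiles of the first buffer."""
    s = sorted(buf)
    return [ClusterStats(s[(2 * i + 1) * len(s) // (2 * k)]) for i in range(k)]


def update(model, buf):
    """Fold each buffered point into the statistics of its nearest cluster."""
    for x in buf:
        min(model, key=lambda c: abs(x - c.mean)).add(x)


def holdout_score(model, holdout):
    """Average log-likelihood of the holdout set under the cluster model."""
    total = sum(c.n for c in model)
    ll = 0.0
    for x in holdout:
        p = 0.0
        for c in model:
            if c.n == 0:
                continue  # empty clusters carry no weight
            v = c.var
            dens = math.exp(-(x - c.mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            p += (c.n / total) * dens
        ll += math.log(max(p, 1e-300))  # guard against log(0)
    return ll / len(holdout)


def cluster_stream(chunks, holdout, candidate_ks, tol=1e-3):
    """Read buffers one at a time; after each, pick the best K on the holdout set."""
    models, best_k, prev = {}, None, -math.inf
    for buf in chunks:
        if not models:
            models = {k: init_model(buf, k) for k in candidate_ks}
        for model in models.values():
            update(model, buf)
        scores = {k: holdout_score(m, holdout) for k, m in models.items()}
        best_k = max(scores, key=scores.get)
        if abs(scores[best_k] - prev) < tol:  # assumed stopping criterion
            break
        prev = scores[best_k]
    return best_k, models.get(best_k)


# Example run on synthetic data drawn from two well-separated groups.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(600)]
data += [random.gauss(10, 1) for _ in range(600)]
random.shuffle(data)
holdout, train = data[:200], data[200:]
chunks = [train[i:i + 100] for i in range(0, len(train), 100)]
best_k, model = cluster_stream(chunks, holdout, candidate_ks=[1, 2, 3, 4])
print(best_k, [round(c.mean, 2) for c in model])
```

Because only the three running totals per cluster are retained, memory use in this sketch is independent of how many buffers have been read. Keeping every candidate K updated in parallel is one simple way to make a switch to a different K' possible at each read; other ways of generating the candidate models are equally consistent with the description.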