A system and associated method for tuning a data clustering program to a
clustering task determine at least one internal parameter of the data
clustering program. The determination of one or more of the internal
parameters of the data clustering program occurs before the clustering
begins. Consequently, clustering does not need to be performed
iteratively, which reduces the processing time and processing resources
that the clustering program requires. The system
provides pairs of data records; the user indicates whether or not these
data records should belong to the same cluster. The similarity values of
the records of the selected pairs are calculated based on the default
parameters of the clustering program. From the resulting similarity
values, an optimal similarity threshold is determined. When the
optimization criterion does not yield a single optimal similarity
threshold range, equivalent candidate ranges are selected. To select one
of the candidate ranges, pairs of data records having a calculated
similarity value within the critical region are offered to the user.
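
The threshold determination described above can be illustrated with a minimal sketch. The abstract does not state the optimization criterion, so the sketch assumes one: minimize the number of labeled pairs that the threshold misclassifies, where a pair is predicted to share a cluster when its similarity is at least the threshold. The function name `candidate_threshold_ranges` and the `(similarity, same_cluster)` pair layout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (similarity value, user says the two records belong to the same cluster)
Pair = Tuple[float, bool]


def candidate_threshold_ranges(pairs: List[Pair]) -> List[Tuple[float, float]]:
    """Return every threshold interval (lo, hi] with the fewest misclassified pairs.

    Assumed criterion (not taken from the source): a pair is predicted
    "same cluster" when similarity >= threshold, and the best threshold
    minimizes the count of predictions that disagree with the user.
    """
    values = sorted({s for s, _ in pairs})
    # Thresholds inside one interval between neighboring similarity values
    # all produce identical predictions, so one evaluation per interval suffices.
    bounds = [float("-inf")] + values + [float("inf")]

    scored = []
    for lo, hi in zip(bounds, bounds[1:]):
        # For any threshold t with lo < t <= hi: similarity >= t  <=>  similarity >= hi
        errors = sum((s >= hi) != same for s, same in pairs)
        scored.append((lo, hi, errors))

    best = min(e for _, _, e in scored)

    ranges: List[Tuple[float, float]] = []
    for lo, hi, errors in scored:
        if errors != best:
            continue
        if ranges and ranges[-1][1] == lo:
            # Adjacent intervals with equal error form one continuous range.
            ranges[-1] = (ranges[-1][0], hi)
        else:
            ranges.append((lo, hi))
    return ranges


labeled = [
    (0.9, True), (0.8, True), (0.6, False),
    (0.5, True), (0.3, False), (0.2, False),
]
print(candidate_threshold_ranges(labeled))
```

In this example two separate ranges tie with one misclassified pair each, which corresponds to the case where no single optimal range exists and further pairs would have to be offered to the user to decide between the candidates.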