A probabilistic classifier is used to classify data items in a data
stream. The probabilistic classifier is trained, and an initial
classification threshold is set, using unique training and evaluation
data sets (i.e., data sets that do not contain duplicate data items).
Unique data sets are used for training and for setting the initial
classification threshold so that the classifier is not improperly biased
by similarity rates in the training and evaluation data sets that do not
reflect the similarity rates encountered during operation. During
operation, information regarding the actual
similarity rates of data items in the data stream is obtained and used to
adjust the classification threshold such that misclassification costs are
minimized given the actual similarity rates.
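
As an illustration only, the threshold adjustment described above might be sketched as follows. The sketch assumes that the observed similarity rate can be expressed as a per-item weight (roughly, how many near-duplicates of each unique evaluation item are expected in the stream), and that the costs of a false positive and a false negative are known constants. The function name pick_threshold, the weights, and the cost values are hypothetical and are not taken from the text.

```python
def pick_threshold(scores, labels, weights, cost_fp, cost_fn):
    """Return (threshold, cost) with the lowest weighted misclassification cost.

    scores  -- classifier outputs for a unique (duplicate-free) evaluation set
    labels  -- true classes, 1 for positive and 0 for negative
    weights -- assumed duplication weight per item (1 means no duplicates)
    """
    # Candidate thresholds: every observed score, plus "classify nothing as positive".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        cost = 0.0
        for s, y, w in zip(scores, labels, weights):
            predicted_positive = s >= t
            if predicted_positive and y == 0:
                cost += w * cost_fp      # false positive, scaled by duplication
            elif not predicted_positive and y == 1:
                cost += w * cost_fn      # false negative, scaled by duplication
        if cost < best_cost:             # strict "<" keeps the lowest tied threshold
            best_t, best_cost = t, cost
    return best_t, best_cost


# Hypothetical unique evaluation set.
scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]

# Initial threshold: every unique item counts once.
initial = pick_threshold(scores, labels, [1, 1, 1, 1], cost_fp=5, cost_fn=1)

# Adjusted threshold: assume positives are seen about ten times as often in the stream.
adjusted = pick_threshold(scores, labels, [1, 10, 1, 10], cost_fp=5, cost_fn=1)

print(initial)   # (0.9, 1.0)
print(adjusted)  # (0.4, 5.0)
```

In this toy data, heavier duplication among positive items makes each missed positive more expensive in aggregate, so the cost-minimizing threshold drops from 0.9 to 0.4. A real system would estimate the weights from the stream rather than fixing them by hand.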