Techniques are provided that identify near-duplicate items in large
collections of items. A list of (value, frequency) pairs is received, and
a sample (value, instance) is returned. The value is chosen from the
values in the received list, and the instance is a value less than the
frequency paired with that value. The sample is chosen in such a way
that the probability of selecting the same sample from two lists is
equal to the similarity of the two lists.
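
The sampling property above can be illustrated with a minimal sketch.
It is not necessarily the technique described here. It assumes
non-negative integer frequencies, and it assumes that "similarity"
means the weighted Jaccard similarity (sum of per-value minimum
frequencies divided by sum of per-value maximum frequencies). The names
consistent_sample, weighted_jaccard, _hash, and the seed parameter are
hypothetical. The sketch naively expands every pair into its
(value, instance) elements and keeps the one with the smallest hash, so
its cost grows with the total frequency.

```python
import hashlib


def _hash(seed, value, instance):
    """Deterministic 64-bit hash of (seed, value, instance)."""
    data = f"{seed}|{value!r}|{instance}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def consistent_sample(pairs, seed=0):
    """Return a (value, instance) sample from (value, frequency) pairs.

    Each pair is treated as the elements (value, 0) .. (value, frequency - 1),
    and the element with the smallest hash wins. Returns None for an
    empty list. Assumes integer frequencies.
    """
    best, best_hash = None, None
    for value, frequency in pairs:
        for instance in range(frequency):
            h = _hash(seed, value, instance)
            if best_hash is None or h < best_hash:
                best, best_hash = (value, instance), h
    return best


def weighted_jaccard(a, b):
    """Assumed similarity measure: sum of minima over sum of maxima."""
    fa, fb = dict(a), dict(b)
    keys = set(fa) | set(fb)
    numerator = sum(min(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    denominator = sum(max(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 1.0


if __name__ == "__main__":
    a = [("x", 3), ("y", 1)]
    b = [("x", 1), ("y", 1), ("z", 2)]
    trials = 2000
    # Count how often the two lists yield the same sample across seeds.
    matches = sum(
        consistent_sample(a, seed) == consistent_sample(b, seed)
        for seed in range(trials)
    )
    print("empirical match rate:", matches / trials)
    print("weighted Jaccard:    ", weighted_jaccard(a, b))
```

Under these assumptions, two lists produce the same sample exactly when
the minimum-hash element of their combined expansion belongs to both
expansions, so the match rate over many seeds should approach the
weighted Jaccard similarity.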