A first embodiment of the invention provides a system that automatically
classifies documents in a collection into clusters based on the
similarities between documents, that automatically classifies new
documents into the appropriate clusters, and that may change the number or
parameters of clusters under various circumstances. A second embodiment
of the invention provides a technique for comparing two documents, in
which a fingerprint or sketch of each document is computed. In
particular, this embodiment of the invention uses a specific algorithm to
compute the document's fingerprint. One embodiment uses a sentence in the
document as a logical delimiter or window from which significant words
are extracted and, thereafter, a hash is computed of all pair-wise
permutations of the extracted words. Words are extracted based on their
weight in the document,
which can be computed using measures such as term frequency and the
inverse document frequency.
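
By way of illustration only, the fingerprint computation described above might be sketched as follows. The sketch is not a definitive implementation of any embodiment. The function names, the `top_k` cutoff, the smoothed inverse document frequency formula, the use of SHA-1 truncated to 64 bits, and the Jaccard overlap used to compare two fingerprints are all assumptions made for the example; the description above does not specify them.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import permutations


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?'.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    # Lowercase alphanumeric runs stand in for "words".
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_weights(doc_tokens, corpus_token_sets):
    # Weight = term frequency in the document * smoothed inverse
    # document frequency over the collection (one possible measure).
    tf = Counter(doc_tokens)
    n_docs = len(corpus_token_sets)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for tokens in corpus_token_sets if word in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(doc_tokens)) * idf
    return weights


def fingerprint(text, corpus, top_k=3):
    # Each sentence is a window: keep its top_k heaviest distinct words,
    # then hash every ordered pair of those words.
    corpus_token_sets = [set(tokenize(d)) for d in corpus]
    weights = tf_idf_weights(tokenize(text), corpus_token_sets)
    hashes = set()
    for sentence in split_sentences(text):
        distinct = set(tokenize(sentence))
        # Ties in weight are broken alphabetically for determinism.
        words = sorted(distinct, key=lambda w: (-weights[w], w))[:top_k]
        for a, b in permutations(words, 2):
            digest = hashlib.sha1(f"{a} {b}".encode("utf-8")).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def similarity(fp_a, fp_b):
    # Assumption: fingerprints are compared by Jaccard overlap.
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat. The dog barked.",
        "Stock markets fell sharply today.",
    ]
    fp0 = fingerprint(corpus[0], corpus)
    fp1 = fingerprint(corpus[1], corpus)
    print(len(fp0), similarity(fp0, fp0), similarity(fp0, fp1))
```

In this sketch the ordered pairs (a, b) and (b, a) hash to different values, which is one reading of "pair-wise permutations"; an implementation that treats the pairs as unordered combinations would produce half as many hashes per sentence.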