A method and apparatus are provided for determining when electronic documents stored
in a large collection of documents are similar to one another. Multiple types of similarity
information are derived from the documents. The similarity information may be based
on a variety of factors, including hyperlinks in the documents, text similarity,
user click-through information, similarity in the titles of the documents or their
location identifiers, and patterns of user viewing. The similarity information
is fed to a combination function that synthesizes the various measures of similarity
information into combined similarity information. Using the combined similarity
information, an objective function is iteratively maximized in order to yield a
generalized similarity value that expresses the similarity of particular pairs
of documents. In an embodiment, the generalized similarity value is used to determine
the proper category, among a taxonomy of categories in an index, cache or search
system, to which certain documents belong.
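
The flow described above (several similarity measures, a combination function, and an iteratively maximized objective function) can be illustrated with a minimal sketch. The abstract does not specify the form of either function, so everything here is an assumption made for illustration: the combination function is taken to be a weighted average, the objective is a simple concave function maximized by gradient ascent, and the names `combine`, `objective`, `maximize`, the weights, and the sample scores are all hypothetical.

```python
def combine(measures, weights):
    """Assumed combination function: weighted average of per-pair scores.

    measures maps a measure name to {(doc_a, doc_b): score};
    weights maps the same measure names to non-negative weights.
    A pair missing from a measure contributes a score of 0.0.
    """
    total = sum(weights.values())
    pairs = set().union(*(m.keys() for m in measures.values()))
    return {
        pair: sum(weights[name] * m.get(pair, 0.0)
                  for name, m in measures.items()) / total
        for pair in pairs
    }


def objective(s, c):
    """Assumed toy objective: F(s) = sum over pairs of c*s - s^2 / 2.

    It is concave in s and is maximized where s equals c.
    """
    return sum(c[p] * s[p] - 0.5 * s[p] ** 2 for p in c)


def maximize(c, step=0.5, iterations=50):
    """Gradient ascent on the toy objective, starting from all zeros."""
    s = {p: 0.0 for p in c}
    for _ in range(iterations):
        for p in c:
            # dF/ds[p] = c[p] - s[p]; move a fraction of the way uphill.
            s[p] += step * (c[p] - s[p])
    return s


# Hypothetical scores for two document pairs under two measures.
measures = {
    "hyperlinks": {("a", "b"): 1.0, ("a", "c"): 0.0},
    "text": {("a", "b"): 0.6, ("a", "c"): 0.2},
}
weights = {"hyperlinks": 2.0, "text": 1.0}

combined = combine(measures, weights)
generalized = maximize(combined)
```

With this toy objective the generalized similarity simply converges to the combined similarity; an objective that also rewards consistency among related pairs, or that scores category assignments, would yield values that differ from the combined input. The sketch shows only the shape of the pipeline, not the objective function used in any embodiment.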