A method automatically determines groups of words or phrases that serve
as descriptive names for a small set of documents, and infers concepts
for that set of documents that are more general and more specific than
the descriptive names, without any prior knowledge of the hierarchy or
the concepts and in a language-independent manner. The descriptive names
and the concepts need not be explicitly contained in the documents.
The primary application of the invention is searching the World Wide
Web, but the invention is not limited to the World Wide Web and may be
applied to any set of documents. Classes of features
are identified in order to promote understanding of a set of documents.
Preferably, there are three classes of features. "Self" features or terms
describe the cluster as a whole. "Parent" features or terms describe more
general concepts. "Child" features or terms describe specializations of
the cluster. The self features can be used as a recommended name for a
cluster, while parents and children can be used to place the clusters in
the space of a larger collection. Parent features suggest a more general
concept, while child features suggest concepts that describe a
specialization of the self feature(s). Automatic discovery of parent,
self, and child features is useful for several purposes, including
automatic labeling of web directories and improving information
retrieval.
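
The three feature classes can be illustrated with a minimal sketch. The text above does not specify how features are assigned to classes, so the rule used here is an assumption: each term's document frequency inside the cluster is compared with its document frequency in a larger collection. The function names, the threshold values, and the whitespace tokenization are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Return, for each term, the fraction of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Count each term once per document (whitespace tokenization is
        # an assumption; a real system would need a proper tokenizer).
        counts.update(set(doc.lower().split()))
    total = len(docs)
    return {term: count / total for term, count in counts.items()}


def classify_features(cluster_docs, collection_docs,
                      self_min=0.6, child_min=0.2, common_min=0.1):
    """Label cluster terms as 'parent', 'self', or 'child'.

    Assumed heuristic (not taken from the source text):
      parent: frequent in the cluster and also common in the collection
      self:   frequent in the cluster but rare in the collection
      child:  moderately frequent in the cluster and rare in the collection
    All thresholds are illustrative values.
    """
    in_cluster = document_frequency(cluster_docs)
    in_collection = document_frequency(collection_docs)
    labels = {}
    for term, f_cluster in in_cluster.items():
        f_collection = in_collection.get(term, 0.0)
        if f_cluster >= self_min and f_collection >= common_min:
            labels[term] = "parent"
        elif f_cluster >= self_min:
            labels[term] = "self"
        elif f_cluster >= child_min and f_collection < common_min:
            labels[term] = "child"
    return labels


# Toy data: a cluster about biology and a sample of a larger collection.
cluster = [
    "biology science genetics",
    "biology science botany",
    "biology science genetics",
    "biology science zoology",
    "biology science botany",
]
collection = (
    ["science physics"] * 3
    + ["science chemistry"] * 2
    + ["sports news"] * 14
    + ["biology science genetics"]
)

for term, label in sorted(classify_features(cluster, collection).items()):
    print(term, label)
```

On this toy data the sketch labels "science" as a parent, "biology" as the self feature, and "genetics", "botany", and "zoology" as children. A practical version would also need to exclude terms that are common everywhere, such as stop words, which this heuristic would otherwise label as parents.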