The present invention provides a method, system and computer program for naming
a cluster, or a hierarchy of clusters, of words and phrases that have been extracted
from a set of documents. The invention takes these clusters as the input and generates
appropriate labels for the clusters using a lexical database. Naming first identifies,
using the lexical database, all possible senses of every word in the cluster, and then
augments each word sense with words that are semantically similar to that sense to
form a respective definition vector. Thereafter, word sense disambiguation is
performed to determine the most relevant sense of each word.
The definition vectors are then clustered into groups, each of which represents a
concept. These concepts are ranked by their support. Finally, a pre-specified
number of words and phrases from the definition vectors of the dominant concepts
are selected as labels, based on their generality in the lexical database. Consequently,
the labels need not consist of the original words in the cluster. A
hierarchy of clusters is named recursively, starting from the leaf clusters.
Dominant concepts in child clusters are propagated to their parent clusters to reduce
the labeling complexity of the parent clusters.
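
The naming steps described above can be illustrated with a minimal sketch for a single flat cluster. Everything in it is an assumption made for illustration and is not taken from the invention itself: the toy lexical database `TOY_LEXICON`, the generality scores in `TOY_DEPTH`, the overlap-based disambiguation, the greedy Jaccard grouping with its `threshold`, and the reading of "support" as the number of cluster words that fall into a concept. A real lexical database would replace the two dictionaries.

```python
# Hypothetical toy lexical database: word -> list of (sense_id, related terms).
TOY_LEXICON = {
    "apple": [("apple.fruit", {"fruit", "food", "produce"}),
              ("apple.company", {"company", "computer", "business"})],
    "banana": [("banana.fruit", {"fruit", "food", "produce"})],
    "orange": [("orange.fruit", {"fruit", "food", "citrus"}),
               ("orange.color", {"color", "hue"})],
    "bank": [("bank.finance", {"finance", "institution"}),
             ("bank.river", {"river", "slope"})],
}
# Hypothetical generality scores: a smaller depth means a more general term.
TOY_DEPTH = {"food": 1, "produce": 2, "fruit": 3, "citrus": 4}


def definition_vectors(word):
    """Return one definition vector (a set of terms) per sense of `word`."""
    return [(sense, frozenset(related | {word}))
            for sense, related in TOY_LEXICON.get(word, [])]


def disambiguate(words):
    """Pick, for each word, the sense that overlaps most with the other words."""
    chosen = []
    for word in words:
        context = set()
        for other in words:
            if other != word:
                for _, vec in definition_vectors(other):
                    context |= vec
        senses = definition_vectors(word)
        if senses:
            _, best_vec = max(senses, key=lambda s: len(s[1] & context))
            chosen.append((word, best_vec))
    return chosen


def group_concepts(chosen, threshold=0.25):
    """Greedily group definition vectors whose Jaccard similarity is high enough."""
    groups = []
    for word, vec in chosen:
        for group in groups:
            terms = group["terms"]
            if len(vec & terms) / len(vec | terms) >= threshold:
                group["words"].append(word)
                group["terms"] |= vec
                break
        else:
            # No existing group was similar enough, so start a new concept.
            groups.append({"words": [word], "terms": set(vec)})
    return groups


def name_cluster(words, n_labels=2):
    """Label a cluster with the most general terms of its best-supported concept."""
    groups = group_concepts(disambiguate(words))
    if not groups:
        return []
    dominant = max(groups, key=lambda g: len(g["words"]))  # support = member count
    ranked = sorted(dominant["terms"], key=lambda t: (TOY_DEPTH.get(t, 99), t))
    return ranked[:n_labels]


print(name_cluster(["apple", "banana", "orange", "bank"]))  # ['food', 'produce']
```

In this toy run the fruit senses of "apple" and "orange" are selected because they share terms with the rest of the cluster, the fruit concept has the highest support, and the resulting labels ("food", "produce") are not among the original cluster words. The sketch does not cover the recursive naming of a hierarchy of clusters.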