The invention is a method, system and computer program for automatically
discovering concepts from a corpus of documents and automatically
generating a labeled concept hierarchy. The method extracts
signatures from the corpus of documents. The similarity between
signatures is computed using a statistical measure. The frequency
distribution of signatures is refined to alleviate any inaccuracy in the
similarity measure. The signatures are also disambiguated to address the
polysemy problem. The similarity measure is recomputed based on the
refined frequency distribution and disambiguated signatures. The
recomputed similarity measure, which better reflects the actual similarity
between signatures, is then used for clustering
related signatures. The signatures are clustered to generate concepts and
the concepts are arranged in a concept hierarchy. The concept hierarchy
is used to automatically generate a query for a particular concept and to
retrieve the relevant documents associated with that concept.
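
The core of the pipeline described above (signature extraction, pairwise similarity, clustering into concepts) can be illustrated with a minimal sketch. It is not the claimed method: the text only specifies "a statistical measure", so cosine similarity over per-document counts is assumed here, single-link threshold clustering stands in for the clustering step, and the frequency refinement, disambiguation and hierarchy-building steps are omitted. The names extract_signatures, cosine, cluster and STOPWORDS, and the parameters min_docs and threshold, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for", "on"}


def extract_signatures(docs, min_docs=2):
    """Map each candidate signature (here: a single word) to its per-document counts.

    Only words occurring in at least `min_docs` documents are kept.
    """
    occurrences = defaultdict(Counter)
    for doc_index, text in enumerate(docs):
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                occurrences[word][doc_index] += 1
    return {w: counts for w, counts in occurrences.items() if len(counts) >= min_docs}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(signatures, threshold=0.5):
    """Group signatures whose similarity reaches `threshold` (single-link, union-find)."""
    names = sorted(signatures)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine(signatures[first], signatures[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(groups.values())


docs = [
    "the bank approved the loan",
    "loan interest at the bank",
    "the river flooded the valley",
    "valley of the river",
]
concepts = cluster(extract_signatures(docs))
print(concepts)  # [['bank', 'loan'], ['river', 'valley']]
```

In this toy corpus each resulting cluster plays the role of one concept. Recomputing the similarity after refinement and disambiguation, as the abstract describes, would amount to calling cluster again on the adjusted count vectors.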