A method and platform for statistically extracting terms from large sets
of documents are described. An importance vector is determined for each
document in the set of documents based on importance values for words in
each document. A binary document classification tree is formed by
clustering the documents into clusters of similar documents based on the
importance vector for each document. An infrastructure is built for the
set of documents by generalizing the binary document classification tree.
The document clusters are determined by dividing the generalized tree of
the infrastructure into two parts and cutting away the upper part.
Statistically significant individual key words are extracted from the
clusters of similar documents. The key words are treated as seeds, and
terms are extracted by starting from each seed and extending it into its
left or right context.
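
The steps above can be illustrated with a minimal Python sketch. It is not the described implementation; every concrete choice in it is an assumption made for illustration: TF-IDF as the importance value, cosine similarity with greedy pairwise merging to form the binary tree, a similarity threshold (`min_sim`) to cut away the upper part of the tree, summed importance restricted to words shared by several cluster documents as a stand-in for statistical significance, and a frequency threshold (`min_count`) plus a small stop-word list for seed extension. The generalization of the binary tree into an infrastructure is not specified here and is omitted. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [w for w in (t.strip(".,;:!?()\"'").lower() for t in text.split()) if w]

def importance_vectors(docs):
    """Assumed importance measure: one sparse TF-IDF vector per document."""
    tokens = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokens for w in set(toks))
    n = len(docs)
    return [{w: c / len(toks) * math.log(n / df[w]) for w, c in Counter(toks).items()}
            for toks in tokens]

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_binary_tree(vecs):
    """Greedy agglomerative clustering: repeatedly merge the two most similar nodes."""
    nodes = [{"docs": [i], "vec": v, "children": None, "sim": 1.0}
             for i, v in enumerate(vecs)]
    while len(nodes) > 1:
        pairs = ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes)))
        i, j = max(pairs, key=lambda p: cosine(nodes[p[0]]["vec"], nodes[p[1]]["vec"]))
        a, b = nodes[i], nodes[j]
        merged_vec = Counter(a["vec"])
        merged_vec.update(b["vec"])  # summed member vectors; cosine ignores the scale
        merged = {"docs": a["docs"] + b["docs"], "vec": dict(merged_vec),
                  "children": (a, b), "sim": cosine(a["vec"], b["vec"])}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

def cut_tree(node, min_sim):
    """Discard upper nodes merged below min_sim; the remaining subtrees are clusters."""
    if node["children"] is None or node["sim"] >= min_sim:
        return [sorted(node["docs"])]
    left, right = node["children"]
    return cut_tree(left, min_sim) + cut_tree(right, min_sim)

def key_words(cluster, vecs, top_k=3, min_docs=2):
    """Rank words found in at least min_docs cluster documents by summed importance."""
    total, seen = Counter(), Counter()
    for i in cluster:
        total.update(vecs[i])
        seen.update(vecs[i].keys())
    ranked = sorted((w for w in total if seen[w] >= min_docs), key=lambda w: -total[w])
    return ranked[:top_k]

def extend_seed(seed, token_lists, min_count=2, stop=frozenset({"the", "a", "of"})):
    """Grow a term leftwards or rightwards while the longer n-gram stays frequent."""
    term = (seed,)
    while True:
        n, cands = len(term), Counter()
        for toks in token_lists:
            for k in range(len(toks) - n + 1):
                if tuple(toks[k:k + n]) != term:
                    continue
                if k > 0 and toks[k - 1] not in stop:
                    cands[(toks[k - 1],) + term] += 1
                if k + n < len(toks) and toks[k + n] not in stop:
                    cands[term + (toks[k + n],)] += 1
        best = cands.most_common(1)
        if not best or best[0][1] < min_count:
            return " ".join(term)
        term = best[0][0]

docs = [
    "The support vector machine is a classifier.",
    "The support vector machine separates two classes.",
    "The stock market fell sharply today.",
    "The stock market rose sharply today.",
]
vecs = importance_vectors(docs)
clusters = cut_tree(build_binary_tree(vecs), min_sim=0.1)
token_lists = [tokenize(d) for d in docs]
for cluster in clusters:
    seeds = key_words(cluster, vecs)
    print(cluster, [extend_seed(s, token_lists) for s in seeds])
```

On this toy corpus the two machine-learning documents and the two stock-market documents fall into separate clusters, and the seed "vector" grows into the term "support vector machine". The thresholds are tuned to the example and would need to be chosen differently for a real document set.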