Method and platform for term extraction from large collection of documents

A method and platform for statistically extracting terms from large sets of documents are described. An importance vector is determined for each document in the set, based on importance values for the words in that document. A binary document classification tree is formed by clustering the documents into clusters of similar documents according to their importance vectors. An infrastructure is built for the set of documents by generalizing the binary document classification tree. The document clusters are determined by dividing the generalized tree of the infrastructure into two parts and cutting away the upper part. Statistically significant individual key words are extracted from the clusters of similar documents. The key words are then treated as seeds, and terms are extracted by starting from each seed and extending into its left or right context.
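The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how importance values, similarity, or statistical significance are computed, so the sketch assumes TF-IDF weights as importance values, cosine similarity with greedy pairwise merging as the binary clustering, stopping at `k` clusters as a stand-in for cutting away the upper part of the tree, and a simple document-frequency filter as the key word test. All function names and the `k`, `min_docs`, and `min_count` parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def importance_vectors(docs_tokens):
    """Assumed importance values: TF-IDF weight per word, per document."""
    n = len(docs_tokens)
    df = Counter(w for toks in docs_tokens for w in set(toks))
    return [{w: (c / max(len(toks), 1)) * math.log(n / df[w])
             for w, c in Counter(toks).items()} for toks in docs_tokens]

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def add_vectors(a, b):
    out = dict(a)
    for w, v in b.items():
        out[w] = out.get(w, 0.0) + v
    return out

def cluster(vectors, k):
    """Merge the two most similar clusters until k remain.
    Each merge corresponds to one node of a binary tree; stopping at k
    is an assumed stand-in for cutting away the upper part of the tree."""
    clusters = [([i], v) for i, v in enumerate(vectors)]
    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        merged = (clusters[i][0] + clusters[j][0],
                  add_vectors(clusters[i][1], clusters[j][1]))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [sorted(ids) for ids, _ in clusters]

def key_words(docs_tokens, members, min_docs=2):
    """Assumed significance test: in >= min_docs cluster documents, in none outside."""
    inside = Counter(w for i in members for w in set(docs_tokens[i]))
    outside = {w for i, toks in enumerate(docs_tokens)
               if i not in members for w in toks}
    return sorted(w for w, c in inside.items() if c >= min_docs and w not in outside)

def extend_seed(cluster_tokens, seed, min_count=2):
    """Grow a seed one word to the left or right while the longer phrase
    still occurs at least min_count times in the cluster."""
    term = (seed,)
    while True:
        ext = Counter()
        for toks in cluster_tokens:
            for i in range(len(toks) - len(term) + 1):
                if tuple(toks[i:i + len(term)]) != term:
                    continue
                if i > 0:
                    ext[(toks[i - 1],) + term] += 1
                if i + len(term) < len(toks):
                    ext[term + (toks[i + len(term)],)] += 1
        if not ext or ext.most_common(1)[0][1] < min_count:
            return " ".join(term)
        term = ext.most_common(1)[0][0]

# Toy corpus (invented for illustration only).
docs = [
    "the support vector machine separates classes",
    "a support vector machine needs kernels",
    "fresh bread dough rises slowly",
    "knead bread dough before baking",
]
tokens = [d.lower().split() for d in docs]
clusters = cluster(importance_vectors(tokens), k=2)
for members in clusters:
    seeds = key_words(tokens, members)
    terms = {extend_seed([tokens[i] for i in members], s) for s in seeds}
    print(members, seeds, sorted(terms))
```

On this toy corpus the two machine-learning sentences and the two baking sentences end up in separate clusters, and each seed grows into the multi-word phrase shared by its cluster. A real implementation would need a proper significance test and a principled rule for where to cut the tree, neither of which the abstract specifies.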
