A keyphrase extraction system and method are provided. The system and
method can be employed to create an automatic summary of a subset of
document(s). The system can automatically extract a list of keyword(s),
operating on multiple documents and across many different
domains. The system is unsupervised and requires no prior learning. A term
identifier identifies candidate terms (e.g., words and/or phrases) in the
document subset which are used to form a document-term matrix. A
probability computation component calculates the following values: (1)
the joint probability of a word (e.g., term) and a document, (2) the
marginal probability of the word (e.g., term), and (3) the marginal
probability of the document. Based on the probability values, a partial
mutual information metric can be calculated for each candidate term.
Based on the partial mutual information metric, one or more of the terms
can be identified as summary keyphrases.
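
The pipeline described above can be illustrated with a minimal sketch. The abstract does not give the exact formula for the partial mutual information metric, so this example assumes one common form: for each term w, the sum over documents d of P(w, d) * log(P(w, d) / (P(w) * P(d))). The function names `partial_mutual_information` and `top_keyphrases` are hypothetical, and candidate terms are simplified to lowercase whitespace-separated words rather than words and/or phrases.

```python
import math
from collections import Counter


def partial_mutual_information(docs):
    """Score each term w by sum_d P(w,d) * log(P(w,d) / (P(w) * P(d))).

    Assumed form of the metric; the source text does not spell it out.
    """
    # Document-term matrix: counts[d][w] = occurrences of term w in document d.
    counts = [Counter(doc.lower().split()) for doc in docs]
    total = sum(sum(c.values()) for c in counts)
    if total == 0:
        return {}

    # (3) Marginal probability of each document.
    p_doc = [sum(c.values()) / total for c in counts]

    # (2) Marginal probability of each term.
    term_totals = Counter()
    for c in counts:
        term_totals.update(c)
    p_term = {w: n / total for w, n in term_totals.items()}

    scores = {}
    for w in p_term:
        score = 0.0
        for d, c in enumerate(counts):
            if c[w] == 0:
                continue  # a zero joint probability contributes nothing
            # (1) Joint probability of the term and the document.
            p_joint = c[w] / total
            score += p_joint * math.log(p_joint / (p_term[w] * p_doc[d]))
        scores[w] = score
    return scores


def top_keyphrases(docs, k=3):
    """Return the k highest-scoring terms as summary keyphrases."""
    scores = partial_mutual_information(docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    docs = ["apple banana apple", "cherry banana cherry"]
    for term, score in sorted(partial_mutual_information(docs).items()):
        print(f"{term}: {score:.4f}")
```

In this sketch, a term spread evenly across the documents (here "banana") scores zero, while terms concentrated in a single document score higher. No training data is needed, which is consistent with the unsupervised operation described above.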