A method for document categorization, and a storage medium that includes
instructions for causing a computer to implement the method, are
presented. The method includes identifying terms occurring in a
collection of documents, and determining a cohesion score for each of the
terms. The cohesion score is a function of a cosine difference between
each of the documents containing the term and a centroid of all the
documents containing the term. The method further includes sorting the
terms based on the cohesion scores. The method also includes creating
categories based on the cohesion scores of the terms, wherein each of the
categories includes only documents (i) containing a selected one of the
terms and (ii) that have not already been assigned to a category. The
method still further includes moving each of the documents to the category
having the nearest centroid, thereby refining the categories.
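
The steps summarized above can be illustrated with a minimal Python sketch. It is not the claimed implementation, and several details are assumptions made only for illustration: documents are represented as bag-of-words count vectors, the cohesion score is taken to be the mean cosine similarity between each document containing a term and the centroid of those documents (so a higher score means a more cohesive term), and the hypothetical parameters `num_categories`, `min_docs`, and `refine_passes` control how many categories are created, how rare a term may be, and how many refinement passes are run. All function names are likewise hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words count vector (an assumed document representation)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors; {} if the list is empty."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion(term, vectors):
    """Assumed form of the score: mean cosine similarity between each
    document containing `term` and the centroid of those documents."""
    members = [v for v in vectors if term in v]
    c = centroid(members)
    return sum(cosine(v, c) for v in members) / len(members)


def categorize(texts, num_categories, min_docs=2, refine_passes=1):
    vectors = [vectorize(t) for t in texts]

    # Identify candidate terms; min_docs is an assumed filter, since a term
    # found in a single document trivially has a perfect score.
    doc_freq = Counter(term for v in vectors for term in v)
    terms = [t for t, n in doc_freq.items() if n >= min_docs]

    # Sort terms by cohesion, most cohesive first (ties broken alphabetically).
    ranked = sorted(terms, key=lambda t: (-cohesion(t, vectors), t))

    # Create categories: each takes only documents that contain the selected
    # term and have not already been assigned to a category.
    assignment, labels = {}, []
    for term in ranked:
        if len(labels) == num_categories:
            break
        fresh = [i for i, v in enumerate(vectors)
                 if term in v and i not in assignment]
        if fresh:
            labels.append(term)
            for i in fresh:
                assignment[i] = term
    if not labels:
        return assignment

    # Refine: move every document to the category with the nearest centroid.
    for _ in range(refine_passes):
        cents = {lab: centroid([vectors[i] for i, l in assignment.items()
                                if l == lab])
                 for lab in labels}
        assignment = {i: max(labels,
                             key=lambda lab: cosine(vectors[i], cents[lab]))
                      for i in range(len(vectors))}
    return assignment
```

For example, calling `categorize(["cat purr", "cat meow", "dog bark", "dog fetch"], 2)` groups the first two documents under the term `cat` and the last two under `dog`. In this sketch the refinement pass also places documents that matched none of the selected terms, which is one possible reading of "each of the documents".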