Methods, apparatus and systems to generate from a set of training
documents a set of training data and a set of features for a taxonomy of
categories. In this generated taxonomy, the degree of feature overlap
among categories is minimized in order to optimize use with a
machine-based categorizer. However, the categories still make sense to a
human because a human makes the decisions regarding category definitions.
In an example embodiment, for each category, a plurality of training
documents is selected using Web search engines, the documents are
winnowed to produce a more refined set of training documents, and a set
of features that is highly differentiating for that category within a set
of categories (a supercategory) is extracted. This set of training documents or
differentiating features is used as input to a categorizer, which
determines for a plurality of test documents the plurality of categories
to which they best belong.
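
The feature-extraction and categorization steps described above can be illustrated with a minimal sketch. The abstract does not specify how "highly differentiating" is measured or how the categorizer scores a document, so everything here is an assumption made for illustration: a feature's score is taken to be its document frequency in its own category minus its highest document frequency in any sibling category, and the categorizer simply counts matching features. The names tokenize, doc_frequency, differentiating_features, and categorize, the top_k parameter, and the toy training documents are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a document and keep only purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def doc_frequency(docs):
    """Fraction of documents in which each term appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return {term: n / len(docs) for term, n in counts.items()}


def differentiating_features(training, top_k=2):
    """Pick, per category, the terms that best separate it from its siblings.

    Assumed scoring (not taken from the source): a term's document frequency
    in its own category minus its highest frequency in any other category of
    the same supercategory. Only positively scored terms are kept, which
    keeps feature overlap among categories low.
    """
    freqs = {cat: doc_frequency(docs) for cat, docs in training.items()}
    features = {}
    for cat, own in freqs.items():
        scores = {}
        for term, freq in own.items():
            rival = max(
                (freqs[other].get(term, 0.0) for other in freqs if other != cat),
                default=0.0,
            )
            scores[term] = freq - rival
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features[cat] = {t for t in ranked[:top_k] if scores[t] > 0}
    return features


def categorize(doc, features):
    """Assign a test document to the category sharing the most features."""
    tokens = set(tokenize(doc))
    return max(sorted(features), key=lambda cat: len(tokens & features[cat]))


# Toy supercategory ("pets") with two categories; the documents are made up.
training = {
    "dogs": [
        "dogs bark and fetch",
        "a dog will bark at strangers",
        "puppies fetch sticks",
    ],
    "cats": [
        "cats purr and nap",
        "a cat will purr when happy",
        "kittens nap all day",
    ],
}

features = differentiating_features(training, top_k=2)
label = categorize("my puppy loves to fetch and bark", features)
print(features, label)
```

In this sketch, terms shared by both categories (such as "and" or "will") score zero and are dropped, so the resulting feature sets do not overlap. A real system would presumably use a much larger winnowed training set and a more robust scoring function.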