A method and system for augmenting a training set used to train a
document classifier are provided. The augmentation system augments a
training set with training data derived from features of the documents
in a document hierarchy. The training data of the initial training set
may be derived from the root document of each document hierarchy.
The augmentation system generates additional training data that includes
an aggregate feature that represents the overall characteristics of a
hierarchy of documents, rather than just the root document. After the
training data is generated, the augmentation system augments the initial
training set with the newly generated training data.
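
The augmentation described above can be illustrated with a minimal sketch. The abstract does not specify how the aggregate feature is computed, so the sketch assumes, purely for illustration, that it is the per-document average of term counts over every document in a hierarchy. The names Document, term_counts, aggregate_feature, and augment are hypothetical and do not come from the source.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """A node in a document hierarchy (hypothetical structure)."""
    text: str
    children: list = field(default_factory=list)  # child Document nodes


def term_counts(doc):
    """Features of a single document: lower-cased whitespace-token counts."""
    return Counter(doc.text.lower().split())


def walk(doc):
    """Yield the root document and every descendant, depth first."""
    yield doc
    for child in doc.children:
        yield from walk(child)


def aggregate_feature(root):
    """Assumed aggregate: mean term count per document across the hierarchy."""
    docs = list(walk(root))
    total = Counter()
    for d in docs:
        total.update(term_counts(d))
    return {term: count / len(docs) for term, count in total.items()}


def augment(training_set, labeled_roots):
    """Return a new training set with one aggregate example per hierarchy.

    training_set: list of (features, label) pairs, e.g. from root documents.
    labeled_roots: list of (root Document, label) pairs.
    """
    augmented = list(training_set)  # leave the initial set unmodified
    for root, label in labeled_roots:
        augmented.append((aggregate_feature(root), label))
    return augmented
```

For example, an initial training set built only from root documents could be passed to augment together with the labeled roots; the result would contain the original root-only examples followed by one hierarchy-level example per root. Averaging is only one possible choice here; a sum, a maximum, or a weighted combination would fit the same outline.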