The present invention discloses a document descriptor extraction method
and system. The method and system create a document descriptor by
generalizing input sequences within a document;
factoring the input sequences and generalized input sequences; and
selecting a document descriptor from the input sequences, generalized
sequences, and factored sequences, preferably using minimum description
length (MDL) principles. Novel algorithms are employed to perform the
generalizing, factoring, and selecting.
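
The three steps named above can be illustrated with a minimal sketch. It is not the claimed algorithm: the abstract does not specify how generalizing, factoring, or selecting are carried out, so every detail below is an assumption made for illustration. Input sequences are assumed to be tuples of string tokens, generalizing is assumed to replace digit tokens with a class label `<NUM>`, factoring is assumed to collapse a run of identical tokens into a single `token+` symbol, and the bit costs (`SYMBOL_BITS`, `SLOT_BITS`, `COUNT_BITS`, `LITERAL_BITS`) are hypothetical constants chosen only to make the two-part MDL trade-off visible.

```python
# Hypothetical bit costs for a two-part MDL code (illustrative values only).
SYMBOL_BITS = 8   # cost of stating one symbol of the descriptor itself
SLOT_BITS = 4     # cost of filling one generalized slot (e.g. which number)
COUNT_BITS = 3    # cost of stating the repeat count for one "token+" symbol
LITERAL_BITS = 8  # cost per token when an input must be spelled out in full


def generalize(seq):
    """Assumed generalization: replace every digit token with a class label."""
    return tuple("<NUM>" if tok.isdigit() else tok for tok in seq)


def factor(seq):
    """Assumed factoring: collapse each run of identical tokens into 'token+'."""
    out, i = [], 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        out.append(seq[i] + "+" if j - i > 1 else seq[i])
        i = j
    return tuple(out)


def count_plus(form):
    """Number of collapsed-run symbols (assumes no literal token ends in '+')."""
    return sum(1 for tok in form if tok.endswith("+"))


def residual_bits(candidate, seq):
    """Bits needed to recover seq exactly, given the candidate descriptor."""
    gen = generalize(seq)
    slots = sum(1 for a, b in zip(seq, gen) if a != b)
    forms = (
        (seq, 0),
        (gen, slots * SLOT_BITS),
        (factor(seq), count_plus(factor(seq)) * COUNT_BITS),
        (factor(gen), slots * SLOT_BITS + count_plus(factor(gen)) * COUNT_BITS),
    )
    for form, bits in forms:
        if candidate == form:
            return bits
    # The candidate does not cover this input: spell the input out literally.
    return len(seq) * LITERAL_BITS


def description_length(candidate, inputs):
    """Two-part code: bits for the descriptor plus bits for the data given it."""
    model_bits = len(candidate) * SYMBOL_BITS
    return model_bits + sum(residual_bits(candidate, s) for s in inputs)


def select_descriptor(inputs):
    """Pick the candidate with the smallest total description length."""
    candidates = set()
    for s in inputs:
        g = generalize(s)
        candidates.update({s, g, factor(s), factor(g)})
    # Ties are broken by tuple ordering so the result is deterministic.
    return min(candidates, key=lambda c: (description_length(c, inputs), c))


inputs = [
    ("row", "1", "2", "3"),
    ("row", "4", "5"),
    ("row", "6", "7", "8", "9"),
]
print(select_descriptor(inputs))
```

Under these assumed costs, the generalized-and-factored candidate `("row", "<NUM>+")` is selected because it is both short to state and covers all three inputs, whereas each raw or merely generalized candidate covers only one input and forces the others to be spelled out in full.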