A method is provided for quickly and automatically comparing a new document
to a large number of previously seen documents and identifying its document type.
First, provide a plurality of document type distributions, where each document
type distribution describes layout characteristics of an independent
document type and may include a plurality of data points. Each document
type distribution includes data derived from at least one basis document
signature, which may include data defining pixels of a low-resolution image
of the independent basis document resolved to between 1 and 75 dots per
inch or may include document segmentation data derived from the
independent basis document. Next, provide a new electronic document. Then
create a new document signature from the new electronic document. Next,
calculate distances between the new document signature and each of the
plurality of document type distributions using an algorithm based on a
Bayesian framework for a Gaussian distribution. The distances calculated
may be Euclidean distances or may be Mahalanobis distances. Additionally,
calculating the distances may include weighting the value given to each of a
plurality of data points in the document signatures based on the
usefulness of each of the plurality of data points in distinguishing
between the document signatures. Next, select at least one candidate
document type for the new electronic document from among the independent
document types described by the plurality of document type distributions.
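
The signature and distance steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `make_signature` and `distance`, the block-averaging downsampling, the default of 10 dots per inch, and the diagonal (per-data-point) variance are all assumptions made for illustration. Each document type distribution is assumed to be summarized by a mean vector and an optional variance vector; with no variances the function returns a Euclidean distance, and with variances it returns a Mahalanobis distance under a diagonal covariance. The normalization and prior terms of a full Gaussian Bayesian score are omitted.

```python
import math


def make_signature(pixels, src_dpi, target_dpi=10):
    """Block-average a grayscale image (list of rows) down to target_dpi.

    Hypothetical helper: the source only says the signature may hold the
    pixels of an image resolved to between 1 and 75 dots per inch.
    """
    if not 1 <= target_dpi <= 75:
        raise ValueError("target_dpi must be between 1 and 75")
    block = max(1, round(src_dpi / target_dpi))  # source pixels per signature pixel
    rows, cols = len(pixels), len(pixels[0])
    signature = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            cells = [pixels[r][c]
                     for r in range(top, min(top + block, rows))
                     for c in range(left, min(left + block, cols))]
            signature.append(sum(cells) / len(cells))
    return signature


def distance(signature, mean, variance=None, weights=None):
    """Distance from a signature to one document type distribution.

    variance=None -> Euclidean distance to the distribution mean.
    variance given -> Mahalanobis distance, assuming a diagonal covariance.
    weights scale each data point by how useful it is for discrimination.
    """
    n = len(signature)
    variance = variance or [1.0] * n
    weights = weights or [1.0] * n
    total = 0.0
    for x, mu, var, w in zip(signature, mean, variance, weights):
        total += w * (x - mu) ** 2 / var
    return math.sqrt(total)
```

Setting a weight to zero removes a data point from the comparison entirely, which is one simple way to express that the data point does not help distinguish between document types.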
The selection of the at least one candidate document type may include
selecting a preselected fixed number of the independent document types or
may include selecting the independent document types described by those of
the plurality of document type distributions having calculated distances
that are within a preselected threshold distance of the smallest of the
distances calculated. In addition, the invention provides a program
storage medium readable by a computer, tangibly embodying a program of
instructions executable by the computer to perform the method steps
described above.
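
The two candidate selection options described above (a preselected fixed number of document types, or every document type whose distance is within a preselected threshold of the smallest distance) might be sketched as follows. The function name `select_candidates`, its parameters, and the dictionary of distances keyed by document type name are hypothetical and chosen only for illustration.

```python
def select_candidates(distances, top_n=None, threshold=None):
    """Pick candidate document types from a {type_name: distance} mapping.

    top_n: keep a fixed number of the nearest document types.
    threshold: keep every type whose distance is within `threshold`
               of the smallest calculated distance.
    With neither option set, only the single nearest type is returned.
    """
    if not distances:
        return []
    ranked = sorted(distances, key=distances.get)  # nearest first
    if top_n is not None:
        return ranked[:top_n]
    if threshold is not None:
        best = distances[ranked[0]]
        return [name for name in ranked if distances[name] - best <= threshold]
    return ranked[:1]


# Example with made-up distances for three document types.
example = {"invoice": 1.0, "receipt": 1.4, "letter": 3.0}
fixed = select_candidates(example, top_n=2)
near = select_candidates(example, threshold=0.5)
```

The threshold option adapts to how ambiguous a document is: a clear match yields a single candidate, while several close distances yield several candidates.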