An apparatus and method for determining whether a query document matches one or more
of a plurality of documents in a database. In a coarse matching stage, a compressed
file or other query document is scanned to produce a bit profile. Global statistics
such as line spacing and text height are calculated from the bit profile and used
to narrow the field of documents to be searched in an image database. The bit profile
is cross-correlated with bit profiles of documents in the search space to identify
candidates for a detailed matching stage. If the coarse matching stage generates
multiple candidates, a set of endpoint features is extracted from the query
document and used in the detailed matching stage. Endpoint features
contain sufficient information for various levels of processing, including page
skew and orientation estimation. In addition, endpoint features are stable, symmetric
and easily computable from commonly used compressed files including, but not limited
to, CCITT Group 4 compressed files. Endpoint features extracted in the detailed
matching stage are used to correctly identify a matching document in a high percentage
of cases.
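
The coarse matching stage can be illustrated with a minimal sketch. It assumes that a bit profile is a list holding one numeric value per scan row, and that candidates are the database profiles whose normalized cross-correlation with the query profile meets a threshold. The function names, the shift window, and the threshold value are hypothetical choices made for illustration and are not taken from the abstract.

```python
from math import sqrt


def normalized_cross_correlation(a, b, max_shift=5):
    """Peak normalized correlation of two profiles over small vertical shifts.

    Assumption: each profile is a list with one numeric value per scan row.
    """
    def centered(profile):
        mean = sum(profile) / len(profile)
        return [v - mean for v in profile]

    a, b = centered(a), centered(b)
    norm = sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    if norm == 0:
        # A flat profile carries no structure to correlate against.
        return 0.0

    best = -1.0
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for i, v in enumerate(a):
            j = i + shift
            if 0 <= j < len(b):
                total += v * b[j]
        best = max(best, total / norm)
    return best


def coarse_candidates(query, database, threshold=0.8):
    """Return ids of database profiles that correlate well with the query.

    `database` maps a document id to its bit profile; `threshold` is an
    illustrative cutoff, not a value given in the source.
    """
    scores = {
        doc_id: normalized_cross_correlation(query, profile)
        for doc_id, profile in database.items()
    }
    hits = [doc_id for doc_id, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda doc_id: -scores[doc_id])
```

If more than one id is returned, the detailed matching stage on endpoint features would be needed to choose among the candidates.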