Systems and methods for performing clustering of a document image are
disclosed. A property of an extracted mark from a document is compared to
the properties of the existing clusters. If the property of the mark
fails to match any of the properties of the existing clusters, the mark
is added as a new cluster to the set of existing clusters. One property that
can be utilized is x size and y size, that is, the width and height of the
existing clusters. Another property that can be employed is ink size,
which refers to the ratio of black pixels to total pixels in a cluster.
Yet another property that can be utilized is a reduced mark or image,
which is a reduced-pixel-size version of the bitmap of the mark and/or
cluster. The above properties can be employed to identify mismatches and
thereby reduce the number of bit-by-bit comparisons performed.
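
The screening described above can be sketched as follows. This is a minimal illustration in Python, not the disclosed implementation. The names describe, bit_errors and cluster_marks, the tolerance ink_tol, the threshold max_errors, the reduction factor, and the rule that a reduced pixel is black when any pixel in its block is black are all assumptions made for the example. Bitmaps are assumed to be non-empty lists of rows holding 1 for black and 0 for white, and sizes and reduced images are required to match exactly.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # rows of pixels: 1 = black, 0 = white


def describe(bitmap: Bitmap, factor: int = 2) -> Dict:
    """Compute the screening properties of a mark (hypothetical helper)."""
    y_size, x_size = len(bitmap), len(bitmap[0])
    black = sum(sum(row) for row in bitmap)
    # Reduced image: one pixel per factor-by-factor block, black if any
    # pixel in the block is black (an assumed reduction rule).
    reduced = [
        [
            int(any(bitmap[y][x]
                    for y in range(by, min(by + factor, y_size))
                    for x in range(bx, min(bx + factor, x_size))))
            for bx in range(0, x_size, factor)
        ]
        for by in range(0, y_size, factor)
    ]
    return {
        "bitmap": bitmap,
        "size": (x_size, y_size),          # x size and y size
        "ink": black / (x_size * y_size),  # black pixels / total pixels
        "reduced": reduced,
        "members": 1,
    }


def bit_errors(a: Bitmap, b: Bitmap) -> int:
    """Bit-by-bit comparison of two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def cluster_marks(marks: List[Bitmap], ink_tol: float = 0.1,
                  max_errors: int = 0) -> Tuple[List[Dict], int]:
    """Assign each mark to a matching cluster or start a new cluster.

    Returns the clusters and the number of bit-by-bit comparisons run.
    """
    clusters: List[Dict] = []
    full_compares = 0
    for bitmap in marks:
        mark = describe(bitmap)
        for cluster in clusters:
            # Cheap screening tests: any mismatch skips the full comparison.
            if mark["size"] != cluster["size"]:
                continue
            if abs(mark["ink"] - cluster["ink"]) > ink_tol:
                continue
            if mark["reduced"] != cluster["reduced"]:
                continue
            full_compares += 1
            if bit_errors(mark["bitmap"], cluster["bitmap"]) <= max_errors:
                cluster["members"] += 1
                break
        else:
            # No existing cluster matched, so the mark becomes a new cluster.
            clusters.append(mark)
    return clusters, full_compares
```

In this sketch the tests are ordered from cheapest to most expensive, so a candidate cluster reaches the bit-by-bit comparison only after its size, ink size, and reduced image have all failed to rule it out.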