After a document image has been scanned and processed by optical character
recognition (OCR), the OCR text output is analyzed to determine a text
format (typeface and font size) that matches the OCR text to the originally
scanned image. The text format is identified by matching word sizes rather
than individual character sizes. In particular, for each word and for each
of a plurality of candidate typefaces, a scaling factor is calculated to
match a typeface rendering of the word to the width of the word in the
originally scanned image. After all of the scaling factors have been
calculated, a cluster analysis is performed to identify tight clusters of
scaling factors for a typeface. A tight cluster indicates a good typeface
fit at a constant scaling factor (font size).
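
The word-level matching described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes each word's width in the scanned image and its rendered width in each candidate typeface, at a common reference font size, have already been measured elsewhere. The function names, the sample widths, and the clustering rule (the largest group of scaling factors lying within a 5% relative tolerance) are all hypothetical choices made for illustration; the text above does not specify a particular cluster analysis.

```python
from statistics import median

def scaling_factors(image_widths, rendered_widths):
    """One factor per word: scanned width / width rendered at the reference size."""
    return [iw / rw for iw, rw in zip(image_widths, rendered_widths) if rw > 0]

def tightest_cluster(factors, tolerance=0.05):
    """Largest run of sorted factors whose maximum is within (1 + tolerance) of its minimum.

    Assumed stand-in for the cluster analysis; any clustering method could be used.
    """
    ordered = sorted(factors)
    best = []
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it satisfies the tolerance.
        while ordered[end] > ordered[start] * (1 + tolerance):
            start += 1
        if end - start + 1 > len(best):
            best = ordered[start:end + 1]
    return best

def best_fit(image_widths, widths_by_typeface, tolerance=0.05):
    """Return (typeface, scale, cluster_size) for the typeface with the largest tight cluster."""
    results = []
    for typeface, rendered in widths_by_typeface.items():
        cluster = tightest_cluster(scaling_factors(image_widths, rendered), tolerance)
        if cluster:
            spread = cluster[-1] - cluster[0]
            # Prefer more words in the cluster, then a smaller spread.
            results.append((len(cluster), -spread, typeface, median(cluster)))
    if not results:
        raise ValueError("no usable word widths")
    size, _, typeface, scale = max(results)
    return typeface, scale, size

# Hypothetical measurements: word widths in the scan (pixels) and the same
# words rendered in two made-up typefaces at the reference font size.
image_widths = [120.0, 200.0, 90.0, 150.0]
widths_by_typeface = {
    "Serif A": [60.0, 100.0, 45.0, 75.0],
    "Sans B": [50.0, 110.0, 40.0, 90.0],
}
typeface, scale, size = best_fit(image_widths, widths_by_typeface)
print(typeface, scale, size)
```

In this made-up data every word gives the same factor for "Serif A", so all four words fall into one cluster, while the factors for "Sans B" scatter and no two of them lie within the tolerance. Multiplying the selected scale by the reference font size would give the estimated font size.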