Techniques for identifying discrete records within a multi-record document
are provided. According to one technique, a document is encoded based on
some combination of visual tag encoding, text category encoding, and text
content encoding that produces hash values based on the contents of
portions of the document. According to one technique, repeating candidate
patterns are identified in a document so encoded. The candidate patterns
may be identified in a "fuzzy" manner that allows for some
inconsistencies in the individual pattern instances. According to one
technique, the identified candidate patterns are validated based on
specified factors to determine a "best" pattern. According to one
technique, the boundaries of discrete records in a multi-record document
are marked based on the portions of the document that correspond to an
identified repeating pattern.
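
The four steps summarized above (encoding, fuzzy candidate detection, validation, and boundary marking) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method described here: the function names `encode`, `candidate_patterns`, and `best_pattern` are hypothetical, the symbol alphabet (`<tag>`, `NUM`, `TXT`) is invented, fuzzy matching is approximated with `difflib.SequenceMatcher` from the Python standard library, and the "best" pattern is chosen by a simple count of matching instances instead of any specific validation factors.

```python
import hashlib
import re
from difflib import SequenceMatcher

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html, use_content_hash=False):
    """Turn markup into a flat list of symbols (hypothetical encoding)."""
    symbols = []
    for token in TOKEN_RE.findall(html):
        if token.startswith("<"):
            m = TAG_RE.match(token)
            if m:  # comments and doctype declarations are skipped
                symbols.append(f"<{m.group(1)}{m.group(2).lower()}>")
            continue
        text = token.strip()
        if not text:
            continue
        # Text category encoding: a crude numeric / non-numeric split.
        has_digit = any(c.isdigit() for c in text)
        has_alpha = any(c.isalpha() for c in text)
        category = "NUM" if has_digit and not has_alpha else "TXT"
        if use_content_hash:
            # Text content encoding: a short hash of the text itself.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
            symbols.append(f"{category}:{digest}")
        else:
            symbols.append(category)
    return symbols


def candidate_patterns(symbols, min_similarity=0.7):
    """Propose one candidate per opening tag that occurs at least twice."""
    candidates = []
    anchors = sorted(
        {s for s in symbols if s.startswith("<") and not s.startswith("</")}
    )
    for anchor in anchors:
        starts = [i for i, s in enumerate(symbols) if s == anchor]
        if len(starts) < 2:
            continue
        # Each instance runs from one anchor occurrence to the next.
        spans = list(zip(starts, starts[1:] + [len(symbols)]))
        template = symbols[spans[0][0]:spans[0][1]]
        # Fuzzy match: keep instances that are similar enough to the first.
        records = [
            (a, b)
            for a, b in spans
            if SequenceMatcher(None, template, symbols[a:b]).ratio()
            >= min_similarity
        ]
        candidates.append(
            {"anchor": anchor, "template": template, "records": records}
        )
    return candidates


def best_pattern(candidates):
    """Stand-in validation: prefer more instances, then longer templates."""
    if not candidates:
        return None
    return max(
        candidates, key=lambda c: (len(c["records"]), len(c["template"]))
    )


doc = (
    "<ul><li><b>Widget</b> 10</li><li><b>Gadget</b> 25</li>"
    "<li><b>Gizmo</b><i>new</i> 7</li></ul>"
)
symbols = encode(doc)
best = best_pattern(candidate_patterns(symbols))
print(best["anchor"], best["records"])
```

In this example the third list item carries an extra `<i>` element, yet it still clears the similarity threshold against the first item, which is the kind of inconsistency a fuzzy match is meant to tolerate. The `records` entries are half-open index ranges into the symbol list and serve as the record boundaries; the last range also absorbs the trailing `</ul>`, which a fuller implementation would presumably trim.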