The process generates a parser to extract records from a set of documents.
The process operates on a sample document from the set. The sample
document is an XML document or is converted to an XML document. Simple
XPaths of the XML document are identified. Complex extensions of the
simple XPaths are clustered according to common substructures. The complex
XPath clusters are scored according to content in instances or
differences in content among instances. Candidate parsers are created.
Each candidate consists of a single record XPath and one or more field
value XPaths that are descendants of the record XPath. The candidate
parsers are ranked using the XPath scores.
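
The flow described above can be illustrated with a minimal sketch. It is not the claimed method: it skips the clustering of complex XPaths and works only on index-free element paths. The sample XML, the function names (simple_xpaths, score, candidate_parsers), and the scoring rule (count of distinct non-empty values among instances) are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document standing in for one member of the document set.
SAMPLE = """<catalog>
  <book><title>Dune</title><price>9.99</price></book>
  <book><title>Emma</title><price>7.50</price></book>
  <book><title>Ulysses</title><price>7.50</price></book>
</catalog>"""


def simple_xpaths(root):
    """Map each index-free path (e.g. /catalog/book/title) to its instance texts."""
    texts = defaultdict(list)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        texts[path].append((elem.text or "").strip())
        for child in elem:
            walk(child, path)

    walk(root, "")
    return dict(texts)


def score(values):
    """Assumed scoring rule: number of distinct non-empty values among instances."""
    return len({v for v in values if v})


def candidate_parsers(texts):
    """Pair each repeating path (record) with descendant paths that carry text (fields)."""
    candidates = []
    for record, instances in texts.items():
        if len(instances) < 2:
            continue  # a record path should match more than one instance
        fields = [
            p for p in texts
            if p.startswith(record + "/") and score(texts[p]) > 0
        ]
        if fields:
            total = sum(score(texts[p]) for p in fields)
            candidates.append((total, record, sorted(fields)))
    # Rank by descending total score, then by path for a stable order.
    return sorted(candidates, key=lambda c: (-c[0], c[1]))


ranked = candidate_parsers(simple_xpaths(ET.fromstring(SAMPLE)))
for total, record, fields in ranked:
    print(total, record, fields)
```

In this sketch, the only candidate for the sample pairs the record path /catalog/book with the field paths for title and price. A fuller implementation would also need to cluster the complex XPaths and score the clusters before ranking.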