An extraction-rule generation and training system uses information
obtained from multiple markup language documents (e.g., web pages) of
similar structure to generate an extraction rule for extracting
datapoints from markup language documents. By using information extracted
from multiple documents of similar structure, including information
regarding correlations between such documents, the system produces data
extraction rules that provide improved datapoint extraction reliability.
Where the structures of two or more documents are not sufficiently
similar, the system maintains separate extraction rules for the same
datapoint, and applies these separate extraction rules in combination to
particular markup language documents to extract the datapoint.
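
The following is a minimal sketch of the idea, not the claimed implementation. It assumes that documents are well-formed XML-like markup, that an extraction rule is simply a path of child indices from the root element, that a rule is chosen by majority vote over training documents of similar structure, and that "in combination" means trying each rule in turn until one yields a value. The names `path_to_text`, `generate_rule`, `apply_rule`, and `extract` are hypothetical and introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_to_text(root, target):
    """Return the child-index path to the first element whose text equals target."""
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if (node.text or "").strip() == target:
            return path
        # Push children in reverse so the leftmost child is visited first.
        for i in reversed(range(len(node))):
            stack.append((node[i], path + (i,)))
    return None


def generate_rule(examples):
    """Pick the most common path across (markup, known_value) training pairs.

    Assumption: documents in `examples` share a similar structure, so the
    path that recurs most often is taken as the extraction rule.
    """
    paths = Counter()
    for markup, value in examples:
        path = path_to_text(ET.fromstring(markup), value)
        if path is not None:
            paths[path] += 1
    return paths.most_common(1)[0][0] if paths else None


def apply_rule(markup, rule):
    """Follow a child-index path; return the stripped text, or None on a miss."""
    node = ET.fromstring(markup)
    for i in rule:
        if i >= len(node):
            return None
        node = node[i]
    return (node.text or "").strip() or None


def extract(markup, rules):
    """Apply separate rules for the same datapoint; the first hit wins (assumed policy)."""
    for rule in rules:
        value = apply_rule(markup, rule)
        if value is not None:
            return value
    return None


# Two groups of documents whose structures are not sufficiently similar,
# so a separate rule is kept for each group (hypothetical sample data).
group_a = [
    ("<html><body><h1>Widget</h1><span>9.99</span></body></html>", "9.99"),
    ("<html><body><h1>Gadget</h1><span>4.50</span></body></html>", "4.50"),
]
group_b = [
    ("<html><body><div><p>Price</p><p>7.25</p></div></body></html>", "7.25"),
]
rules = [generate_rule(group_a), generate_rule(group_b)]

new_a = "<html><body><h1>Gizmo</h1><span>12.00</span></body></html>"
new_b = "<html><body><div><p>Price</p><p>3.10</p></div></body></html>"
print(extract(new_a, rules))
print(extract(new_b, rules))
```

In this sketch the first rule fails on the second document because the indexed child does not exist, so the second rule supplies the value. A practical system would likely use more robust rule representations and a less naive way of combining rules than first-match fallback.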