A method and system for generating wrappers for hierarchically organized
documents by jointly optimizing template detection and wrapper generation
are provided. A wrapper generation system generates a wrapper for
documents with similar templates by identifying a cluster of document
trees and generating a wrapper tree for the cluster. A wrapper tree
defines the wrapper for documents that match the template of the cluster.
The wrapper generation system builds a cluster by first generating a
wrapper tree for the cluster from an initial document tree. The
wrapper generation system then repeatedly determines whether any other
document tree matches or nearly matches the wrapper tree for the cluster
and, if so, adds the document tree to the cluster and adjusts the wrapper
tree as needed so that every document tree in the cluster, including the
newly added one, matches the wrapper tree.
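
The cluster-growing loop described above can be illustrated with a
minimal sketch. It is not the claimed implementation. Several
simplifications are assumptions made only for illustration: each
document tree is reduced to its set of root-to-node tag paths, the
wrapper tree is modeled as a set of required paths plus a set of
optional paths, and "nearly matches" is taken to mean a mismatch ratio
at or below a hypothetical `tolerance` parameter. The names `tag_paths`,
`WrapperTree`, and `build_cluster` are likewise hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: a document tree is a nested tuple (tag, [children...]).


def tag_paths(tree, prefix=""):
    """Return the set of root-to-node tag paths in a document tree."""
    tag, children = tree
    path = f"{prefix}/{tag}"
    paths = {path}
    for child in children:
        paths |= tag_paths(child, path)
    return paths


@dataclass
class WrapperTree:
    """Simplified wrapper: paths every member has, plus paths some have."""
    required: set
    optional: set = field(default_factory=set)

    def mismatch(self, paths):
        """Fraction of paths on which a document and the wrapper disagree."""
        unknown = paths - self.required - self.optional  # not in the wrapper
        missing = self.required - paths                  # absent from the doc
        total = len(paths | self.required | self.optional)
        return (len(unknown) + len(missing)) / total

    def absorb(self, paths):
        """Generalize the wrapper so the new document matches exactly."""
        self.optional |= paths - self.required - self.optional
        self.optional |= self.required - paths  # demote paths the doc lacks
        self.required &= paths


def build_cluster(trees, seed_index=0, tolerance=0.25):
    """Grow one cluster from a seed tree, adjusting the wrapper as it grows."""
    path_sets = [tag_paths(t) for t in trees]
    wrapper = WrapperTree(required=set(path_sets[seed_index]))
    members = [seed_index]
    changed = True
    while changed:  # repeat: a looser wrapper may admit further trees
        changed = False
        for i, paths in enumerate(path_sets):
            if i not in members and wrapper.mismatch(paths) <= tolerance:
                wrapper.absorb(paths)
                members.append(i)
                changed = True
    return members, wrapper


if __name__ == "__main__":
    docs = [
        ("html", [("body", [("h1", []), ("p", [])])]),
        ("html", [("body", [("h1", []), ("p", []), ("img", [])])]),
        ("rss", [("channel", [("item", [])])]),
    ]
    members, wrapper = build_cluster(docs)
    print(members, sorted(wrapper.optional))
```

In this sketch, absorbing a document only ever moves paths from
required to optional or adds new optional paths, so every earlier
member of the cluster still has a mismatch of zero after each
adjustment. That mirrors the property stated above, that all document
trees in the cluster match the adjusted wrapper tree.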