A template or wrapper tree for a document such as a web page is
generalized from the bottom up (from leaf toward root of a logical tree
structure of the template). At a given level in the tree, sub-trees are
clustered and the clustered sub-trees are generalized, and the process is
repeated at a next higher level in the tree, resulting in a generalized
template or wrapper tree. At each level, this can be done by generating
a nested pattern
regular expression based on the sub-tree clusters, merging sub-trees
based on the nested pattern regular expression, and then replacing
sub-trees in a tree-based regular expression of the template or wrapper
at the given level with the merged sub-trees. These steps are repeated
at each successively higher level of the tree (progressing from leaf
toward root) until the wrapper or tree-based regular expression that
represents the template is fully generalized.
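
The bottom-up procedure described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the method as claimed: sub-trees are clustered only when they are adjacent siblings with identical structure, and the nested pattern is reduced to a single "one or more" marker rendered as `+`. The names `Node`, `key`, `merge`, `generalize`, and `to_regex` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    repeated: bool = False  # True means "one or more" ('+' in the tree regex)


def key(node: Node) -> str:
    """Clustering key (assumption): structure only, ignoring repetition marks."""
    return f"{node.tag}({''.join(key(c) for c in node.children)})"


def merge(a: Node, b: Node) -> Node:
    """Merge two sub-trees with the same key, keeping any repetition marks."""
    kids = [merge(x, y) for x, y in zip(a.children, b.children)]
    return Node(a.tag, kids, a.repeated or b.repeated)


def generalize(node: Node) -> Node:
    """Generalize children first (leaf toward root), then this level."""
    kids = [generalize(c) for c in node.children]
    merged: List[Node] = []
    for kid in kids:
        if merged and key(merged[-1]) == key(kid):
            # Same cluster as the previous sibling: fold into one repeated node.
            combined = merge(merged[-1], kid)
            combined.repeated = True
            merged[-1] = combined
        else:
            merged.append(kid)
    return Node(node.tag, merged, node.repeated)


def to_regex(node: Node) -> str:
    """Render the tree as a simple tree-based regular expression string."""
    inner = " ".join(to_regex(c) for c in node.children)
    text = f"{node.tag}({inner})" if inner else node.tag
    return text + "+" if node.repeated else text


def li() -> Node:
    return Node("li", [Node("a")])


page = Node("div", [Node("ul", [li(), li(), li()]), Node("ul", [li()])])
print(to_regex(generalize(page)))  # prints: div(ul(li(a)+)+)
```

In this example the two lists have different lengths, so they only fall into the same cluster once their own children have been generalized. That is the reason for working from the leaves toward the root.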