Methods and apparatus, including computer program products, for
identifying Web page content with a granularity finer than individual Web
pages, e.g., finer than individual HTML documents. The invention provides
a computer-implemented method for identifying Web page content. The
method includes receiving a string of markup language source code that
includes tags. The method includes identifying sub-sequences in which
tags occur in the string. Each sub-sequence is associated with the
portion of the string that starts with the first tag of the sub-sequence
and ends with the last tag of the sub-sequence. The sub-sequences
identified are ones that satisfy criteria for being classified as
associated with a portion of the string that defines Web page content
constituting an entire listing. The criteria include a requirement that
an identified sub-sequence be repeated in tandem, either exactly or
approximately, in the string. The method includes returning the
identified sub-sequences.
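
The tandem-repeat requirement described above can be illustrated with a minimal sketch. It is not the claimed method: it handles exact repeats only, it ignores approximate repetition and any further criteria for classifying a portion as an entire listing, and the names `extract_tags`, `find_tandem_repeats`, `min_len`, and `min_repeats` are hypothetical, chosen here for illustration.

```python
import re

# Assumption: a simple regular expression is enough to locate tags for this
# sketch; a real implementation would likely use a proper markup parser.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def extract_tags(source):
    """Return a list of (tag_token, start_offset, end_offset) tuples."""
    tags = []
    for m in TAG_RE.finditer(source):
        # Attributes are dropped, so <a href="1"> and <a href="2"> both map to "a".
        token = (m.group(1) + m.group(2)).lower()
        tags.append((token, m.start(), m.end()))
    return tags


def find_tandem_repeats(source, min_len=2, min_repeats=2):
    """Find tag sub-sequences that repeat back to back (exact matches only).

    Returns a list of (unit, repeats, portion), where `portion` is the slice
    of `source` running from the first tag of the first occurrence of the
    sub-sequence through its last tag.
    """
    tags = extract_tags(source)
    tokens = [t[0] for t in tags]
    n = len(tokens)
    results = []
    for start in range(n):
        # A unit of this length must fit at least min_repeats times.
        for length in range(min_len, (n - start) // min_repeats + 1):
            unit = tokens[start:start + length]
            repeats = 1
            pos = start + length
            # Count consecutive copies of the unit immediately following it.
            while tokens[pos:pos + length] == unit:
                repeats += 1
                pos += length
            if repeats >= min_repeats:
                first_tag = tags[start]
                last_tag = tags[start + length - 1]
                portion = source[first_tag[1]:last_tag[2]]
                results.append((tuple(unit), repeats, portion))
    return results
```

For a three-item list such as `<ul><li><a href="1">A</a></li>...</ul>`, the results include the sub-sequence `li, a, /a, /li` repeated three times, together with rotated variants of it (for example `a, /a, /li, li`). Choosing among such overlapping candidates is where the additional classification criteria would apply, and this sketch leaves that step open.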