A dynamic-content web crawler is disclosed. These New Crawlers (NCs) are
located at points between the server and user, and monitor content from
said points, for example by proxying the web traffic or sniffing the
traffic as it goes by. Web page content is recursively parsed into
subcomponents. Sub-components are fingerpinted with a cyclic redundancy
check code or other loss-full compression in order to be able to detect
recurrence of the sub-component in subsequent pages. Those sub-components
which persist in the web traffic, as measured by the frequency NCs (6)
are defined as having substantive content of interest to data-mining
applications. Where a substantive content sub-component is added to or
removed from a web page, then this change is significant and is sent to a
duplication filter (11) so that if multiple NCs (6) detect a change in a
web page only one announcement of the changed URL will be broadcast to
data-mining applications (8). The NC (6) identifies substantive content
sub-components which repeatably are part of a page pointed to by a URL.
Provision is also made for limiting monitoring to pages having a flag
authorizing discovery of the page by a monitor.