This invention pioneers an enhanced crawling mechanism and technique
called "Enhanced Browser Based Web Crawling". It permits the
fault-tolerant gathering of dynamic data documents on the World Wide Web
(WWW). The Enhanced Browser Based Web Crawler technology of this
invention is implemented by incorporating the intricate functionality of
a web browser into the crawler engine so that documents are properly
analyzed. Essentially, the Enhanced Browser Based Crawler acts similarly
to a web browser after retrieving the initially requested document. It
then loads additional or included documents as needed (e.g.,
inline frames, frames, images, applets, audio, video, or equivalents).
The Crawler then executes client side script or code and produces the
final HTML markup. In a web browser, this final HTML markup is ordinarily
used to render the document for presentation to the user. However, unlike
a web browser, this invention does not render the composed document for
viewing purposes. Rather, it analyzes or summarizes the document, thereby extracting
valuable metadata and other important information contained within the
document. This invention also integrates optical character recognition
(OCR) techniques into the crawler architecture, so that the web crawler's
summarization process can properly summarize image content (e.g., GIF,
JPEG, or an equivalent) without errors.
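
As an illustration only, the analysis step described above (finding the included documents a browser would load, and extracting metadata instead of rendering) might be sketched as follows. This is a minimal sketch using Python's standard library, not the implementation of this invention. The names `DocumentSummarizer`, `summarize`, and `INCLUDE_TAGS` are hypothetical. Client-side script execution and OCR are omitted here, since they would require a browser engine and an OCR component respectively.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical selection of tags whose "src" attribute names an included
# document that a browser would load in addition to the requested page.
INCLUDE_TAGS = {"iframe", "frame", "img", "audio", "video", "embed"}


class DocumentSummarizer(HTMLParser):
    """Collects included-document URLs and basic metadata from HTML markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.includes = []    # absolute URLs of included documents
        self.metadata = {}    # <meta name=... content=...> pairs
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve each included document against the page URL, as a browser would.
        if tag in INCLUDE_TAGS and attrs.get("src"):
            self.includes.append(urljoin(self.base_url, attrs["src"]))
        # Record named metadata instead of rendering anything.
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html, base_url):
    """Return the title, metadata, and included-document URLs of a page."""
    parser = DocumentSummarizer(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "metadata": parser.metadata,
        "includes": parser.includes,
    }
```

In a fuller crawler, each URL in `includes` would be fetched in turn, and the markup passed to `summarize` would be the final markup produced after script execution rather than the raw response.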