A computer processing method and apparatus for searching and retrieving Web
pages to collect people and organization information are disclosed. A Web site
of potential interest is accessed. A subset of Web pages from the accessed site
is determined for processing. According to the types of content found on a subject
Web page, extraction of people and organization information is enabled. Internal
links of a Web site are collected and recorded in a links-to-visit table. To avoid
duplicate processing of Web sites, unique identifiers or Web site signatures are
utilized. Respective time thresholds (time-outs) for processing a Web site and
for processing a Web page are employed. A database is maintained for storing indications
of domain URLs, names of respective owners of the URLs as identified from the corresponding
Web sites, type of each Web site, processing frequencies, dates of last processing,
outcomes of last processing, size of each domain, and number of data items found
in the last processing of each Web site.
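
The crawl loop described above might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the names `crawl_site`, `site_signature`, `LinkParser`, and the caller-supplied `fetch(url, timeout)` function are hypothetical, the Web site signature is assumed to be a hash of the home page content, and the time-out values are arbitrary. Page selection by content type and the extraction step itself are omitted.

```python
import hashlib
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from anchor tags on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def site_signature(home_page_html):
    """Assumed signature scheme: SHA-256 of the home page content."""
    return hashlib.sha256(home_page_html.encode("utf-8")).hexdigest()


def crawl_site(start_url, fetch, seen_signatures,
               site_timeout=60.0, page_timeout=5.0):
    """Visit internal pages of one site, honoring site and page time-outs.

    fetch(url, timeout) is a hypothetical callable that returns the page
    HTML as a string, or None on failure or page time-out.
    """
    site_deadline = time.monotonic() + site_timeout
    host = urlparse(start_url).netloc

    home = fetch(start_url, page_timeout)
    if home is None:
        return {"outcome": "unreachable", "pages": []}

    # Skip sites already processed under a different URL.
    signature = site_signature(home)
    if signature in seen_signatures:
        return {"outcome": "duplicate", "pages": []}
    seen_signatures.add(signature)

    links_to_visit = deque([start_url])  # the links-to-visit table
    recorded = {start_url}               # every link ever queued
    pages = []

    while links_to_visit:
        if time.monotonic() > site_deadline:
            return {"outcome": "site_timeout", "pages": pages}
        url = links_to_visit.popleft()
        html = home if url == start_url else fetch(url, page_timeout)
        if html is None:
            continue
        pages.append(url)

        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href).split("#")[0]
            # Only internal links (same host) are recorded.
            if urlparse(absolute).netloc == host and absolute not in recorded:
                recorded.add(absolute)
                links_to_visit.append(absolute)

    return {"outcome": "complete", "pages": pages}
```

The returned outcome and page count correspond loosely to the "outcome of last processing" and "number of data items" fields that the database is said to store; a real system would persist them per domain URL along with the other recorded fields.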