A system and method facilitating incremental web crawl(s) using chunk(s)
is provided. The system can be employed, for example, to facilitate a
web-crawling system that crawls (e.g., continuously) the Internet for
information (e.g., data) and indexes the information so that it can be
used as part of a web search engine.The system facilitates incremental
re-crawls and/or selective updating of information (e.g., documents)
using a structure called a chunk to simplify the process of an
incremental crawl. A chunk is a set of documents that can be manipulated
as a set (e.g., of up to 65,536 (64K) documents). "Document" refers to a
corpus of data that is stored at a particular URL (e.g., HTML, PDF, PS,
PPT, XLS, and/or DOC Files etc.)A chunk is created by an indexer. The
indexer can place into a chunk documents that have similar property(ies).
These property(ies) include but are not limited to: average time between
change and average importance. These property(ies) can be stored at the
chunk level in a chunk map. The chunk map can then be employed (e.g., on
a daily basis) to determine which chunk(s) should be re-crawled.