A process for constructing a server for collecting, arranging and storing
data that defines the connectivity of pages on the World Wide Web (Web).
The process input is a set of compressed ASCII links files, wherein each
links file is a series of source URLs and corresponding destination URLs.
A temporary URLs_info Table is created and initialized. The links files
and URL metadata are read. Unique URLs from the links files are buffered,
sorted, and written into URL runs. An ID Index is created from the
URLs_info Table. CS_ids are assigned to URLs and written to the ID Index.
Both a compressed URL data structure and a URL Index are created. A Host
Table is created. URL fingerprints are converted to CS_ids, and
preliminary outstarts and outlinks tables are created. Compressed
outstarts and outlinks tables are created
from the preliminary tables. Subsequently, compressed instarts and
inlinks tables are created based on the outstarts and outlinks tables.
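
The relationship between the outstarts/outlinks tables and the instarts/inlinks tables can be illustrated with a minimal Python sketch. It rests on assumptions not stated above: that a "starts" table holds, for each ID, the offset into the matching "links" table where that ID's neighbours begin, and that IDs are dense integers handed out in sorted-URL order. The function names assign_ids and build_adjacency and the sample URLs are hypothetical, and fingerprints, the Host Table, and compression are omitted.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def assign_ids(links: List[Tuple[str, str]]) -> Dict[str, int]:
    """Give every distinct URL a dense integer ID in sorted-URL order.

    Assumption: this stands in for CS_id assignment; the real ordering
    used by the process is not specified in the text above.
    """
    urls = sorted({url for pair in links for url in pair})
    return {url: i for i, url in enumerate(urls)}


def build_adjacency(edges: List[Edge], n: int) -> Tuple[List[int], List[int]]:
    """Return (starts, links) for n nodes.

    links[starts[i]:starts[i + 1]] lists the neighbours of node i.
    Duplicate edges are dropped and neighbours are kept in sorted order.
    """
    edges = sorted(set(edges))
    starts = [0] * (n + 1)
    for src, _ in edges:
        starts[src + 1] += 1          # count neighbours per source
    for i in range(n):
        starts[i + 1] += starts[i]    # prefix sums turn counts into offsets
    links = [dst for _, dst in edges]
    return starts, links


# Hypothetical links-file content: (source URL, destination URL) pairs.
link_pairs = [
    ("a.com/", "b.com/"),
    ("a.com/", "c.com/"),
    ("b.com/", "c.com/"),
    ("c.com/", "a.com/"),
]

ids = assign_ids(link_pairs)
edges = [(ids[src], ids[dst]) for src, dst in link_pairs]
n = len(ids)

# Forward tables, built directly from the link pairs.
outstarts, outlinks = build_adjacency(edges, n)

# Reverse tables, derived from the forward edges by swapping endpoints.
instarts, inlinks = build_adjacency([(dst, src) for src, dst in edges], n)
```

Under these assumptions, the pages that link to ID i are inlinks[instarts[i]:instarts[i + 1]], which is why the reverse tables can be derived entirely from the forward ones without rereading the links files.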