Metadata files representing Web document content are parsed in accordance
with a specification file, one specification file being generated for each class
of documents, e.g., HTML pages, newsgroup articles, and Java programs. Each specification
file has the same format, i.e., schema, as a metadata file for the associated document
class. Within each specification file, each element in the hierarchy is associated
with a weight. When a metadata file is received, both the metadata file and the
specification file are walked through top-down to parse data out of the metadata
file into an index file in accordance with the weights in the specification file.
For example, a data element having a weight of zero is not written to the index
file, an element with a weight of two is written to the index file twice, and so
on. Importantly, the tags in the metadata file are not written out to the index
file. The index file is then used by an index engine to build an index, which can
then be accessed by a query executor to respond to a user query for Web documents
without having to search through an index containing tags and other data that is
irrelevant to the search.
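
The weighted walk described above can be illustrated with a minimal Python sketch. It is not the claimed implementation, and it rests on several assumptions that the text does not state: the metadata and specification files are XML, each specification element carries its weight in a hypothetical `weight` attribute (defaulting to 1), metadata elements with no counterpart in the specification are skipped, and the "index file" is modeled as a list of lines. The function names `walk` and `build_index_lines` are likewise illustrative.

```python
import xml.etree.ElementTree as ET


def walk(meta, spec, out):
    """Walk a metadata element and its specification element in parallel, top-down."""
    # Assumption: the weight lives in a "weight" attribute and defaults to 1.
    weight = int(spec.get("weight", "1"))
    text = (meta.text or "").strip()
    if text:
        # Weight 0 writes nothing, weight 2 writes the data twice, and so on.
        # Only the element's data is written; the tag itself never is.
        out.extend([text] * weight)
    for child in meta:
        spec_child = spec.find(child.tag)
        # Assumption: elements absent from the specification are skipped.
        if spec_child is not None:
            walk(child, spec_child, out)


def build_index_lines(metadata_xml, spec_xml):
    """Return the lines that would be written to the index file."""
    meta_root = ET.fromstring(metadata_xml)
    spec_root = ET.fromstring(spec_xml)
    out = []
    if meta_root.tag == spec_root.tag:
        walk(meta_root, spec_root, out)
    return out


# Hypothetical specification for one document class (same schema as the metadata).
SPEC = """
<doc weight="1">
  <title weight="2"/>
  <author weight="1"/>
  <body weight="1"/>
  <checksum weight="0"/>
</doc>
"""

# Hypothetical metadata file for one document of that class.
META = """
<doc>
  <title>Weighted indexing</title>
  <author>A. Writer</author>
  <body>Tags are dropped.</body>
  <checksum>9f3c</checksum>
</doc>
"""

index_lines = build_index_lines(META, SPEC)
for line in index_lines:
    print(line)
```

In this sketch the title appears twice in the output, the checksum does not appear at all, and no tag names are emitted, so a downstream index engine would see only weighted data.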