System for weighted indexing of hierarchical documents

Metadata files representing Web document content are parsed in accordance with a specification file, with one specification file being generated for each class of documents, e.g., HTML pages, newsgroup articles, and JAVA programs. Each specification file has the same format, i.e., schema, as a metadata file for the associated document class. Within each specification file, each element in the hierarchy is associated with a weight. When a metadata file is received, the metadata file and the specification file are walked through together, top-down, and data is parsed out of the metadata file into an index file in accordance with the weights in the specification file. For example, a data element having a weight of zero is not written to the index file, an element with a weight of two is written out twice to the index file, and so on. Importantly, the tags in the metadata file are not written out to the index file. The index file is then used by an index engine to build an index, which can then be accessed by a query executor to respond to a user query for Web documents without having to search through an index containing tags and other data that is irrelevant to the search.
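The parallel top-down walk described above can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in the abstract: both files are XML, each weight is carried in a hypothetical `weight` attribute on the specification element, a missing attribute defaults to a weight of 1, a weight applies only to an element's own text rather than to its children, and metadata elements with no counterpart in the specification file are skipped. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET


def weighted_index_text(metadata_xml: str, spec_xml: str) -> str:
    """Return index-file text: element data repeated per weight, tags omitted."""
    meta_root = ET.fromstring(metadata_xml)
    spec_root = ET.fromstring(spec_xml)
    out = []
    _walk(meta_root, spec_root, out)
    return " ".join(out)


def _walk(meta_elem, spec_elem, out):
    # Assumption: the weight is stored in a "weight" attribute, default 1.
    weight = int(spec_elem.get("weight", "1"))
    text = (meta_elem.text or "").strip()
    # Weight 0 writes nothing; weight n writes the data n times.
    if text and weight > 0:
        out.extend([text] * weight)
    # Descend through both hierarchies together, matching children by tag.
    for meta_child in meta_elem:
        spec_child = spec_elem.find(meta_child.tag)
        if spec_child is not None:  # Assumption: unlisted elements are skipped.
            _walk(meta_child, spec_child, out)


if __name__ == "__main__":
    spec = (
        '<doc weight="0"><title weight="2"/>'
        '<author weight="0"/><body weight="1"/></doc>'
    )
    metadata = (
        "<doc><title>Weighted indexing</title>"
        "<author>Smith</author><body>parse metadata files</body></doc>"
    )
    print(weighted_index_text(metadata, spec))
```

In this sketch the title data appears twice in the output, the author data is dropped, and no tag names are emitted, which mirrors the weight-zero and weight-two behavior given as examples in the abstract.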