A system and method for retrieving and intelligently grouping definitions
with common semantic meaning is disclosed. In response to a user's
textual query for the definition of a term or phrase, a set of documents
is retrieved from a repository of structured documents. The retrieved
documents are labeled with a prediction score based upon predetermined
glossary characteristics of the documents. In order to determine whether
the retrieved documents are likely to be definitions, features commonly
found in definitions are identified. The identified features are
encoded as numeric values and weighted using a support vector
regression algorithm. Definitions that fail to meet a predetermined
threshold score are discarded, and those that exceed the
threshold score are labeled and stored in a local database.
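
The scoring and filtering steps described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the feature names, weights, bias, and threshold below are hypothetical values chosen for illustration, and a hand-set linear scoring function stands in for a model that would actually be trained by support vector regression.

```python
import re

# Hypothetical feature weights and bias. In the described system these would
# be learned by support vector regression rather than set by hand.
WEIGHTS = {
    "term_near_start": 1.5,
    "has_copula": 1.0,
    "is_short": 0.5,
}
BIAS = -1.0
THRESHOLD = 1.0  # assumed cut-off; the source does not give a value


def extract_features(term, text):
    """Encode definition-like cues as numeric values (0.0 or 1.0)."""
    lowered = text.lower().strip()
    words = lowered.split()
    return {
        # The queried term appears within the first five words.
        "term_near_start": float(term.lower() in " ".join(words[:5])),
        # The text contains a verb pattern typical of definitions.
        "has_copula": float(
            bool(re.search(r"\b(is|are|refers to|means)\b", lowered))
        ),
        # The text is at most 40 words long.
        "is_short": float(len(words) <= 40),
    }


def score(term, text):
    """Compute a linear prediction score from the weighted features."""
    features = extract_features(term, text)
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())


def filter_definitions(term, candidates, threshold=THRESHOLD):
    """Keep (score, text) pairs that reach the threshold; discard the rest.

    Treating a score equal to the threshold as passing is an assumption.
    """
    scored = [(score(term, text), text) for text in candidates]
    return [(s, t) for s, t in scored if s >= threshold]


if __name__ == "__main__":
    candidates = [
        "A kernel is the core component of an operating system.",
        "Click here to download the latest kernel patches today.",
    ]
    for s, text in filter_definitions("kernel", candidates):
        print(f"{s:.1f}  {text}")
```

In this sketch the first candidate passes because it places the term near the start, uses a definitional verb, and is short, while the second matches only the length cue and is discarded. A real system would store the retained entries in its database instead of printing them.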