The invention relates to a method and apparatus for generating translations of
natural language terms from a first language to a second language. A plurality
of terms are extracted from unaligned comparable corpora of the first and second
languages. Comparable corpora are sets of documents in different languages that
come from the same domain and have similar genre and content. Unaligned documents
are not translations of one another and are not linked in any other way. By accessing
monolingual thesauri of the first and second languages, a category is assigned
to each extracted term. Category-to-category translation probabilities are then
estimated and, in turn, used to estimate term-to-term translation
probabilities. The invention preferably exploits class-based
normalization of probability estimates, bi-directionality, and relative frequency
normalization. The most important applications are cross-language text retrieval,
semi-automatic bilingual thesaurus enhancement, and machine-aided human translation.
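
By way of illustration only, the step from category-level to term-level probabilities might be sketched as follows. The abstract does not state the estimation formula, so the combination rule used here (category-to-category probability multiplied by the candidate's relative frequency within its own category) is an assumption, as are the function name `term_translation_probs`, the category labels, and all numeric values, which are invented for the example.

```python
from collections import defaultdict


def term_translation_probs(src_category, tgt_terms, cat_probs):
    """Score target-language candidates for one source-language term.

    src_category: thesaurus category of the source term (hypothetical label)
    tgt_terms:    dict mapping target term -> (category, corpus frequency)
    cat_probs:    dict mapping (source category, target category) -> probability

    Assumed rule (not taken from the source):
        P(t2 | t1) = P(cat(t2) | cat(t1)) * freq(t2) / total freq of cat(t2)
    """
    # Total corpus frequency per target category, used to turn raw counts
    # into relative frequencies within each category.
    cat_totals = defaultdict(int)
    for category, freq in tgt_terms.values():
        cat_totals[category] += freq

    scores = {}
    for term, (category, freq) in tgt_terms.items():
        total = cat_totals[category]
        if total == 0:
            # No occurrences observed for this category: no evidence.
            scores[term] = 0.0
            continue
        p_cat = cat_probs.get((src_category, category), 0.0)
        scores[term] = p_cat * freq / total
    return scores


# Invented example: English "bank" (category FINANCE) against French candidates.
tgt_terms = {
    "banque": ("FINANCES", 30),
    "crédit": ("FINANCES", 10),
    "rive": ("GEOGRAPHIE", 5),
}
cat_probs = {
    ("FINANCE", "FINANCES"): 0.8,
    ("FINANCE", "GEOGRAPHIE"): 0.2,
}
scores = term_translation_probs("FINANCE", tgt_terms, cat_probs)
best = max(scores, key=scores.get)
print(best, scores)
```

Because each candidate's score is scaled by its share of its own category, frequent terms in a small category do not automatically outrank terms in a large one. This is one possible reading of class-based and relative frequency normalization, not necessarily the one claimed.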