A parallel bilingual training corpus is parsed into its content words.
Word association scores are then computed for each pair of content
words consisting of a word of language L1 and a word of language L2
that occur in sentences aligned to each other in the bilingual corpus. A
pair of words is considered "linked" in a pair of aligned sentences if
one of the words is the most highly associated, of all the words in its
sentence, with the other word. The occurrence of compounds is
hypothesized in the training data by identifying maximal, connected sets
of linked words in each pair of aligned sentences in the processed and
scored training data. Whenever one of these maximal, connected sets
contains more than one word in either or both of the languages, the
subset of its words in each such language is hypothesized as a compound.
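
The linking and compound-hypothesis steps described above can be
illustrated with a minimal Python sketch. It is not the actual
implementation: the function names are hypothetical, the text does not
name an association measure, so a Dice coefficient over sentence-level
co-occurrence is assumed as a stand-in, and ties between equally
associated words are broken alphabetically for determinism.

```python
from collections import Counter, defaultdict
from itertools import product


def dice_scores(aligned):
    """Assumed stand-in score: Dice coefficient of co-occurrence in aligned sentences."""
    n1, n2, joint = Counter(), Counter(), Counter()
    for s1, s2 in aligned:
        u1, u2 = set(s1), set(s2)
        n1.update(u1)
        n2.update(u2)
        joint.update(product(u1, u2))
    return {(a, b): 2 * n / (n1[a] + n2[b]) for (a, b), n in joint.items()}


def linked_pairs(s1, s2, score):
    """Link (a, b) if b is a's best partner in s2, or a is b's best partner in s1."""
    w1, w2 = sorted(set(s1)), sorted(set(s2))
    if not w1 or not w2:
        return set()
    links = set()
    for a in w1:
        # max() keeps the first of several tied words (alphabetical order here).
        links.add((a, max(w2, key=lambda b: score.get((a, b), 0.0))))
    for b in w2:
        links.add((max(w1, key=lambda a: score.get((a, b), 0.0)), b))
    return links


def hypothesize_compounds(s1, s2, score):
    """Return (L1 words, L2 words) for each maximal connected set with a multi-word side."""
    graph = defaultdict(set)
    for a, b in linked_pairs(s1, s2, score):
        graph[(1, a)].add((2, b))
        graph[(2, b)].add((1, a))
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search collects one maximal connected set of linked words.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        l1 = tuple(sorted(w for lang, w in component if lang == 1))
        l2 = tuple(sorted(w for lang, w in component if lang == 2))
        if len(l1) > 1 or len(l2) > 1:
            result.append((l1, l2))
    return result
```

With illustrative, hand-picked scores in which "ice" and "cream" are both
most strongly associated with "glace", the aligned sentences
["ice", "cream", "good"] and ["glace", "bonne"] yield one connected set
whose L1 side, ("cream", "ice"), is hypothesized as a compound.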