A method and apparatus are provided for automatically training or
modifying one or more models of acoustic units in a speech recognition
system. Acoustic models are modified based on information about a
particular application with which the speech recognizer is used,
including speech segment alignment data for at least one correct
alignment and at least one wrong alignment. The correct alignment
correctly represents a phrase that the speaker uttered. The wrong
alignment represents a phrase that the speech recognition system
incorrectly recognized. The alignment data is compared segment by
segment to identify competing segments and the segments that induced
the recognition error. When an erroneous segment is identified, acoustic
models of the phoneme in the correct alignment are modified by moving
their mean values closer to the segment's acoustic features.
Concurrently, acoustic models of the phoneme in the wrong alignment are
modified by moving their mean values farther from the acoustic features
of the segment of the wrong alignment. As a result, the acoustic models
converge toward improved values based on empirical utterance data
representing recognition errors.
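
The segment comparison and mean adjustment described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `Segment`, `competing_segments`, `update_means`, and the `step` parameter are hypothetical. The sketch assumes that segments compete when they overlap in time and carry different phoneme labels, and that each mean is shifted by a fixed fraction of its distance to the segment's feature vector; the abstract specifies neither rule.

```python
from typing import NamedTuple

import numpy as np


class Segment(NamedTuple):
    """One aligned segment (hypothetical structure for illustration)."""
    phoneme: str
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive


def competing_segments(correct, wrong):
    """Pair segments that overlap in time but disagree on the phoneme.

    Assumption: time overlap plus a label mismatch marks a competing pair.
    """
    pairs = []
    for c in correct:
        for w in wrong:
            overlaps = c.start < w.end and w.start < c.end
            if overlaps and c.phoneme != w.phoneme:
                pairs.append((c, w))
    return pairs


def update_means(correct_mean, wrong_mean,
                 correct_features, wrong_features, step=0.1):
    """Shift the correct phoneme's mean toward its segment features and
    the wrong phoneme's mean away from its segment features.

    Assumption: a fixed step size scales each shift.
    """
    mu_c = np.asarray(correct_mean, dtype=float)
    mu_w = np.asarray(wrong_mean, dtype=float)
    x_c = np.asarray(correct_features, dtype=float)
    x_w = np.asarray(wrong_features, dtype=float)
    new_c = mu_c + step * (x_c - mu_c)  # move closer to the features
    new_w = mu_w - step * (x_w - mu_w)  # move farther from the features
    return new_c, new_w
```

In this sketch, each pair returned by `competing_segments` would supply the two phoneme models to adjust, and `update_means` would be applied once per pair using feature vectors computed over the respective segments.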