A system and method for generating a video sequence having mouth movements
synchronized with speech sounds are disclosed. The system utilizes a
database of n-phones as the smallest selectable unit, wherein n is greater
than 1 and is preferably 3. The system calculates a target cost for each
candidate n-phone for a target frame using a phonetic distance, a
coarticulation parameter, and a speech rate. For each n-phone in a target
sequence, the system searches for candidate n-phones that are visually
similar according to the target cost. The system samples each candidate
n-phone to obtain the same number of frames as in the target sequence and
builds a video frame lattice of candidate video frames. The system
assigns a joint cost to each pair of adjacent frames and searches the
video frame lattice to construct the video sequence by finding the path
through the lattice that minimizes the sum of the target costs and the
joint costs over the sequence.
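
The lattice search described above can be illustrated with a short dynamic-programming (Viterbi-style) sketch. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name best_path, the table target_costs, and the callback joint_cost are hypothetical, and the sketch assumes the costs have already been computed and that every frame has at least one candidate.

```python
from typing import Callable, List, Sequence, Tuple


def best_path(
    target_costs: Sequence[Sequence[float]],
    joint_cost: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Find the minimum-cost path through a frame lattice.

    target_costs[t][i] is the (assumed precomputed) target cost of
    candidate i at frame t. joint_cost(t, i, j) is the (assumed) cost of
    placing candidate i at frame t-1 next to candidate j at frame t.
    Returns the chosen candidate index per frame and the total cost.
    """
    if not target_costs:
        return [], 0.0

    # best[i]: cheapest total cost of any path ending in candidate i
    # at the current frame.
    best = list(target_costs[0])
    back: List[List[int]] = []  # back[t-1][j]: best predecessor of j at frame t

    for t in range(1, len(target_costs)):
        new_best: List[float] = []
        pointers: List[int] = []
        for j, tc in enumerate(target_costs[t]):
            # Pick the predecessor that minimizes accumulated + joint cost.
            i_min = min(
                range(len(best)),
                key=lambda i: best[i] + joint_cost(t, i, j),
            )
            new_best.append(best[i_min] + joint_cost(t, i_min, j) + tc)
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    end = min(range(len(best)), key=best.__getitem__)
    total = best[end]
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total
```

Called with a three-frame lattice of two candidates per frame and a joint cost that penalizes switching candidates, the function returns one candidate index per frame together with the summed cost. The search takes time proportional to the number of frames times the square of the number of candidates per frame, which avoids enumerating every path through the lattice.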