Methods and apparatus are disclosed for predicting events using acoustic
and visual cues. The present invention processes audio and video information to
identify one or more (i) acoustic cues, such as intonation patterns, pitch and
loudness, (ii) visual cues, such as gaze, facial pose, body postures, hand gestures
and facial expressions, or (iii) a combination of the foregoing, that are typically
associated with an event, such as behavior exhibited by a video conference participant
before he or she speaks. In this manner, the present invention allows a video
processing system to predict events, such as the identity of the next speaker.
The predictive speaker identifier operates in a learning mode to learn the characteristic
profile of each participant in terms of the concept that the participant "will
speak" or "will not speak" in the presence or absence of one or more predefined
visual or acoustic cues. The predictive speaker identifier operates in a predictive
mode to compare the learned characteristics embodied in the characteristic profile
to the audio and video information and thereby predict the next speaker.
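The two modes described above, a learning mode that accumulates per-participant cue/outcome statistics and a predictive mode that scores participants against their learned profiles, can be sketched as follows. This is an illustrative frequency-count model only; the class and method names (`PredictiveSpeakerIdentifier`, `learn`, `predict`) and the example cue labels are assumptions, not terms from the disclosure.

```python
from collections import defaultdict

class PredictiveSpeakerIdentifier:
    """Hypothetical sketch of the predictive speaker identifier:
    per-participant profiles mapping observed cue patterns to
    'will speak' / 'will not speak' frequency counts."""

    def __init__(self):
        # profiles[participant][cue_pattern] = [will_speak, will_not_speak]
        self.profiles = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def learn(self, participant, cues, spoke):
        """Learning mode: record whether the participant spoke
        while the given set of acoustic/visual cues was present."""
        pattern = frozenset(cues)
        self.profiles[participant][pattern][0 if spoke else 1] += 1

    def score(self, participant, cues):
        """Estimated probability that the participant will speak,
        given the currently observed cue pattern."""
        will, wont = self.profiles[participant][frozenset(cues)]
        total = will + wont
        return will / total if total else 0.0

    def predict(self, observed):
        """Predictive mode: observed maps each participant to the
        cues currently detected for them; return the participant
        whose learned profile best matches a 'will speak' outcome."""
        return max(observed, key=lambda p: self.score(p, observed[p]))

# Example: Alice's profile associates leaning forward with speaking;
# Bob's profile associates the same cues with staying silent.
psi = PredictiveSpeakerIdentifier()
psi.learn("alice", {"leans_forward", "opens_mouth"}, spoke=True)
psi.learn("alice", {"gaze_averted"}, spoke=False)
psi.learn("bob", {"leans_forward", "opens_mouth"}, spoke=False)

next_speaker = psi.predict({
    "alice": {"leans_forward", "opens_mouth"},
    "bob": {"leans_forward", "opens_mouth"},
})
```

A real system would of course use a trained classifier over continuous audio-visual features rather than exact cue-set matching; the sketch only shows the profile-then-compare structure of the two modes.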