An apparatus for tracking and identifying objects includes an audio
likelihood module which determines corresponding audio likelihoods for
each of a plurality of sounds received from corresponding different
directions, each audio likelihood indicating a likelihood that the sound is an
object to be tracked; a video likelihood module which receives a video
and determines video likelihoods for each of a plurality of images
disposed in corresponding different directions in the video, each video
likelihood indicating a likelihood that the image is an object to be
tracked; and an identification and tracking module which determines
correspondences between the audio likelihoods and the video likelihoods
and, if a correspondence is determined to exist between one of the audio
likelihoods and one of the video likelihoods, identifies and tracks a
corresponding one of the objects using each determined pair of audio and
video likelihoods.
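
The correspondence step described above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes each likelihood is tagged with a direction in degrees, that an audio and a video likelihood correspond when their directions fall within a tolerance of each other, and that a pair is accepted when the product of the two likelihoods reaches a threshold. The names Likelihood, angular_gap, and pair_likelihoods, the tolerance, the threshold, and the product rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Likelihood:
    """A likelihood value observed in a given direction (hypothetical type)."""
    direction_deg: float  # direction of the sound or image, in degrees
    value: float          # likelihood that the source is an object to track


def angular_gap(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two directions, handling wraparound."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def pair_likelihoods(audio, video, tolerance_deg=10.0, threshold=0.5):
    """Pair each audio likelihood with the strongest video likelihood nearby.

    Assumption: a correspondence exists when directions agree within
    tolerance_deg and the product of the two likelihoods is >= threshold.
    Returns a list of (audio, video, joint_likelihood) tuples.
    """
    pairs = []
    for a in audio:
        candidates = [
            v for v in video
            if angular_gap(a.direction_deg, v.direction_deg) <= tolerance_deg
        ]
        if not candidates:
            continue  # no video likelihood in this direction
        best = max(candidates, key=lambda v: v.value)
        joint = a.value * best.value
        if joint >= threshold:
            pairs.append((a, best, joint))
    return pairs


audio = [Likelihood(30.0, 0.9), Likelihood(200.0, 0.8)]
video = [Likelihood(32.0, 0.8), Likelihood(120.0, 0.9)]
pairs = pair_likelihoods(audio, video)
```

In this example only the sound near 30 degrees has a video likelihood in a matching direction, so it is the only candidate passed on for identification and tracking. A real system would likely replace the fixed tolerance and product rule with a probabilistic model.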