The present invention provides a system and method for automatically
combining image and audio data to create a multimedia presentation. In
one embodiment, audio and image data are received by the system. The
audio data includes a list of events that correspond to points of
interest in an audio file. The audio data may also include an audio file
or audio stream. The received images are then matched to the audio file
or stream according to the times of the events. In one embodiment, the events represent times
within the audio file or stream at which there is a certain feature or
characteristic in the audio file. The audio event list may be processed
to remove, sort, predict, or otherwise generate audio events. Image
processing may also occur, and may include analyzing images to determine
how they match the event list, deleting images, and processing images
to incorporate effects. Image effects may include cropping, panning,
zooming and other visual effects.
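
The flow summarized above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `Slide`, `clean_events`, and `match_images`, the `min_gap` threshold, and the one-image-per-event rule are all assumptions introduced here for illustration. The sketch takes a list of event times in seconds, sorts it, removes events that fall too close together, and then assigns each image to the interval that starts at an event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slide:
    image: str    # hypothetical image identifier, e.g. a file name
    start: float  # seconds into the audio at which the image appears
    end: float    # seconds into the audio at which the image is replaced


def clean_events(events: List[float], min_gap: float) -> List[float]:
    """Sort event times and drop any event closer than min_gap
    to the previously kept event (an assumed filtering rule)."""
    kept: List[float] = []
    for t in sorted(events):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept


def match_images(images: List[str], events: List[float],
                 audio_length: float) -> List[Slide]:
    """Assign one image per event, in order. Each image is shown until
    the next used event; the last image is shown until the audio ends.
    Surplus images or events are ignored."""
    n = min(len(images), len(events))
    slides: List[Slide] = []
    for i in range(n):
        end = events[i + 1] if i + 1 < n else audio_length
        slides.append(Slide(images[i], events[i], end))
    return slides


if __name__ == "__main__":
    # Hypothetical event times (seconds) and image names.
    events = clean_events([12.0, 0.0, 4.1, 4.0, 8.5], min_gap=1.0)
    for slide in match_images(["a.jpg", "b.jpg", "c.jpg"], events, 15.0):
        print(slide)
```

Image effects such as cropping, panning, or zooming would be applied to each `Slide` in a later step. They are omitted here because the summary does not specify how they are selected.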