Methods and apparatus are disclosed for tracking an object of interest in a
video processing system, using clustering techniques. An area is
partitioned into approximate regions, referred to as clusters, each
associated with an object of interest. Each cluster has associated average
pan, tilt and zoom values. Audio or video information, or both, are used
to identify the cluster associated with a speaker (or another object of
interest). Once the cluster of interest is identified, the camera is
focused on the cluster, using the recorded pan, tilt and zoom values, if
available. An event accumulator initially accumulates audio (and
optionally video) events for a specified time, to allow several speakers
to speak. The accumulated audio events are then used by a cluster
generator to generate clusters associated with the various objects of
interest. After initialization of the clusters, the illustrative event
accumulator gathers events at periodic intervals. A similarity estimator
then uses the mean of the pan and tilt values (and zoom value, if
available) occurring in each time interval to compute the distance to each
cluster in the database, and compares that distance against an empirically
set threshold. If the distance to every existing cluster is greater than
the threshold, then a new cluster is formed, corresponding to a new
speaker, and indexed into the database. Fuzzy clustering techniques allow
the camera to be focused on
more than one cluster at a given time, when the object of interest may be
located in more than one cluster.
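
The threshold test described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names Cluster and assign_event, the use of Euclidean distance, the running-average update, and the sample pan/tilt numbers are all assumptions made for the example, and zoom is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster record: average pan/tilt of one object of interest."""
    pan: float
    tilt: float
    count: int = 1

    def update(self, pan: float, tilt: float) -> None:
        # Incremental mean, so individual events need not be stored.
        self.count += 1
        self.pan += (pan - self.pan) / self.count
        self.tilt += (tilt - self.tilt) / self.count


def assign_event(clusters, pan, tilt, threshold):
    """Assign one interval's mean (pan, tilt) to a cluster and return its index.

    A new cluster is appended when every existing cluster is farther away
    than `threshold`. Euclidean distance is an assumption of this sketch.
    """
    best_index, best_distance = None, math.inf
    for i, cluster in enumerate(clusters):
        distance = math.hypot(pan - cluster.pan, tilt - cluster.tilt)
        if distance < best_distance:
            best_index, best_distance = i, distance
    if best_index is None or best_distance > threshold:
        clusters.append(Cluster(pan, tilt))  # treated as a new speaker
        return len(clusters) - 1
    clusters[best_index].update(pan, tilt)   # refine the existing average
    return best_index


clusters = []
a = assign_event(clusters, 10.0, 5.0, threshold=4.0)  # first speaker
b = assign_event(clusters, 12.0, 5.0, threshold=4.0)  # within threshold: same cluster
c = assign_event(clusters, 40.0, 6.0, threshold=4.0)  # beyond threshold: new cluster
```

In this sketch the second event falls within the threshold of the first cluster and shifts its average pan value, while the third event lies beyond the threshold of the only existing cluster and therefore starts a second one.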