A method (100) and apparatus (700) are disclosed for detecting and
tracking human faces across a sequence of video frames. Spatiotemporal
segmentation is used to segment (115) the sequence of video frames into
3D segments. 2D segments are then formed from the 3D segments, with each
2D segment being associated with one 3D segment. Features are extracted
(140) from the 2D segments and grouped into groups of features. For each
group of features, a probability that the group of features includes
human facial features is calculated (145) based on the similarity of the
geometry of the group of features with the geometry of a human face
model. Each group of features is also matched with a group of features in
a previous 2D segment, and an accumulated probability that said group of
features includes human facial features is calculated (150). Each 2D
segment is classified (155) as a face segment or a non-face segment based
on the accumulated probability. Human faces are then tracked by finding,
in subsequent frames, 2D segments whose associated 3D segments are also
associated with face segments.
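
The accumulation (150), classification (155), and tracking steps can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the abstract does not say how the accumulated probability is computed, so the weighted blend in `accumulate`, the `threshold` and `weight` values, and the names `Segment2D` and `track_faces` are all assumptions made for illustration. The per-frame probability from step 145 is taken as a given input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment2D:
    frame: int        # index of the video frame
    segment3d: int    # id of the 3D segment this 2D segment is associated with
    face_prob: float  # per-frame probability from the geometry match (step 145)


def accumulate(prev: float, current: float, weight: float = 0.5) -> float:
    """Blend the previous accumulated probability with the current one.

    Assumed rule: the source does not specify the accumulation formula.
    """
    return weight * prev + (1.0 - weight) * current


def track_faces(segments, threshold=0.6, weight=0.5):
    """Return {frame: [ids of 3D segments tracked as faces in that frame]}."""
    accumulated = {}  # 3D segment id -> accumulated probability so far
    face_ids = set()  # 3D segments that have had a 2D segment classified as a face
    tracked = {}
    for seg in sorted(segments, key=lambda s: s.frame):
        # The first observation of a 3D segment starts from its own probability.
        prev = accumulated.get(seg.segment3d, seg.face_prob)
        acc = accumulate(prev, seg.face_prob, weight)
        accumulated[seg.segment3d] = acc
        # Step 155 (assumed threshold test): face segment or non-face segment.
        if acc >= threshold:
            face_ids.add(seg.segment3d)
        # Tracking: a 2D segment is followed when its 3D segment is a face.
        if seg.segment3d in face_ids:
            tracked.setdefault(seg.frame, []).append(seg.segment3d)
    return tracked
```

In this sketch a 3D segment stays tracked once any of its 2D segments has been classified as a face, even if a later per-frame probability drops. That reflects the idea of tracking through the 3D segment association instead of re-detecting the face in every frame.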