A method and system for training an audio analyzer (114) to identify asynchronous
segments of audio types using sample data sets, the sample data sets being representative
of audio signals for which segmentation is desired. The system and method then
label asynchronous segments of audio samples, collected at the target site, into
a plurality of categories by cascading hidden Markov models (HMMs). The cascaded
HMMs consist of two stages, with the output of the first-stage HMM (208) being
transformed and used as the observation input to the second-stage HMM (212).
This cascaded HMM approach allows processes with complex temporal
characteristics to be modeled from training data. It also provides a flexible
framework that accommodates segments of varying duration. The system and method are particularly
useful for identifying segments of the human voice and separating them from other
audio, such as music, for voice recognition systems.
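
The two-stage cascade described above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes discrete observation symbols, Viterbi decoding at both stages, and an identity transform between stages. The names viterbi, cascaded_decode, STAGE1 and STAGE2, and all probability values, are hypothetical and chosen only for illustration; in practice the parameters would be estimated from the sample data sets.

```python
import numpy as np


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state path of a discrete-observation HMM."""
    obs = np.asarray(obs, dtype=int)
    n_states = trans_p.shape[0]
    # Work in the log domain; zeros in a probability table become -inf.
    with np.errstate(divide="ignore"):
        log_start = np.log(start_p)
        log_trans = np.log(trans_p)
        log_emit = np.log(emit_p)
    n_steps = len(obs)
    score = np.empty((n_steps, n_states))
    back = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # Trace the best path backwards from the final time step.
    path = np.empty(n_steps, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def cascaded_decode(obs, stage1, stage2, transform=lambda states: states):
    """Decode with stage 1, transform its output, then decode with stage 2."""
    first_stage_states = viterbi(obs, *stage1)
    # Assumption: the transform is the identity; the source does not specify it.
    second_stage_obs = transform(first_stage_states)
    return viterbi(second_stage_obs, *stage2)


# Hypothetical parameters: (start, transition, emission) for each stage.
STAGE1 = (
    np.array([0.5, 0.5]),
    np.array([[0.9, 0.1], [0.1, 0.9]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
)
STAGE2 = (
    np.array([0.5, 0.5]),
    np.array([[0.9, 0.1], [0.1, 0.9]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
)

# Quantized frame features (hypothetical); labels 0 and 1 could stand for
# categories such as voice and music.
labels = cascaded_decode([0, 0, 0, 1, 1, 1, 1], STAGE1, STAGE2)
```

In this sketch the second stage sees the first stage's state labels as its observation symbols, so its transition probabilities govern how long a labeled segment tends to persist. That is one simple way to let segment durations vary without fixing them in advance.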