A method and apparatus for building a training set for an automated speech
recognition-based system. The invention determines the statistically
optimal number of frequently requested responses to automate in order to
achieve a desired automation rate. The invention may be used to select the
appropriate tokens and responses to train the system and to achieve a
desired "phrase coverage" of the many different ways human beings may
phrase a request that calls for one of a plurality of frequently
requested responses. The invention also determines the
statistically optimal number of tokens (spoken requests) required to
train a speech recognition-based system to achieve the desired phrase
coverage, as well as the optimal allocation of those tokens over the set
of responses that are to be automated.
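
The two selection steps described above can be illustrated with a minimal sketch. It is not the claimed statistical method: it assumes that the automation rate is simply the share of all requests answered by the automated responses, and that tokens are allocated in proportion to request frequency. The function names, the sample response labels, and the counts are hypothetical and chosen only for illustration.

```python
def responses_for_automation_rate(request_counts, target_rate):
    """Pick the fewest most-frequent responses whose combined share of
    all requests reaches target_rate (assumed definition of automation rate)."""
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")
    total = sum(request_counts.values())
    if total <= 0:
        raise ValueError("request_counts must contain at least one request")

    selected, covered = [], 0
    # Visit responses from most to least frequently requested.
    ranked = sorted(request_counts.items(), key=lambda kv: kv[1], reverse=True)
    for response, count in ranked:
        selected.append(response)
        covered += count
        if covered / total >= target_rate:
            break
    return selected, covered / total


def allocate_tokens(request_counts, selected, total_tokens):
    """Split total_tokens over the selected responses in proportion to
    request frequency (an assumed allocation rule), using largest-remainder
    rounding so the integer allocations sum exactly to total_tokens."""
    weight_sum = sum(request_counts[r] for r in selected)
    raw = {r: total_tokens * request_counts[r] / weight_sum for r in selected}
    alloc = {r: int(share) for r, share in raw.items()}
    leftover = total_tokens - sum(alloc.values())
    # Hand the remaining tokens to the largest fractional parts first.
    by_remainder = sorted(raw, key=lambda r: raw[r] - alloc[r], reverse=True)
    for r in by_remainder[:leftover]:
        alloc[r] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical request log: response label -> number of requests observed.
    counts = {"hours": 50, "directions": 30, "balance": 15, "other": 5}
    chosen, rate = responses_for_automation_rate(counts, target_rate=0.75)
    print(chosen, rate)
    print(allocate_tokens(counts, chosen, total_tokens=80))
```

Under these assumptions, raising the target automation rate adds less frequently requested responses to the automated set, and each added response takes a share of the token budget proportional to how often it is requested.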