Descriptions of audio contents available for playback via voice calls
are displayed visually on a screen of a telephone handset (such as a cellular phone
supporting Internet access, or a personal digital assistant supporting voice telephony).
On selection of an audio content, the handset places a voice call to a computer
that plays the audio content to the user during the voice call. A data connection
is used to retrieve description(s) for visual display, but this data connection
is not used for retrieval of a file containing the audio content. Instead, a voice
call is placed in the normal telephony manner, and the audio content is played
by the computer that receives the voice call. The just-described method and system
eliminate the prior-art need for the user to navigate through a set of voice prompts
to identify an audio content to be played, e.g., as required by an interactive voice
response system. The user instead uses the display of the handset and the
related input mechanism (such as a touch screen or keypad) to navigate, e.g., through
a list of hits from a search engine, or through a number of categories and subcategories,
to identify an audio content. The combination of a conventional visual interface
for navigation and a conventional audio interface for serving audio contents provides
the benefits of both: the ease of navigation provided by web pages, and the quality
of audio playback provided by the telephone handset.
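
The flow described above can be illustrated with a minimal sketch in Python. It is
not the claimed implementation. The catalog contents, the field names, the telephone
numbers, and the idea that each description carries its own number to dial are all
assumptions made for illustration. The sketch retrieves only descriptions (standing in
for the data connection), renders them as a visual list, and on selection produces a
`tel:` URI that a handset could use to place an ordinary voice call; no audio file is
transferred at any point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioDescription:
    """A description shown on the handset screen (hypothetical fields)."""
    title: str
    category: str
    dial_number: str  # assumption: number of the computer that plays this content


# Hypothetical catalog standing in for descriptions fetched over the data connection.
CATALOG = [
    AudioDescription("Morning news summary", "News", "+15550100"),
    AudioDescription("Traffic report", "News", "+15550101"),
    AudioDescription("Jazz hour", "Music", "+15550102"),
]


def fetch_descriptions(category):
    """Simulate retrieval over the data connection: descriptions only, no audio."""
    return [d for d in CATALOG if d.category == category]


def render_menu(descriptions):
    """Build the numbered list the handset would display visually."""
    return [f"{i}. {d.title}" for i, d in enumerate(descriptions, start=1)]


def voice_call_uri(selection):
    """Return a tel: URI; the audio is then played during the voice call itself."""
    return f"tel:{selection.dial_number}"


news = fetch_descriptions("News")
menu = render_menu(news)
uri = voice_call_uri(news[1])  # the user selects the second item on the screen
```

Navigation here happens entirely in the visual list, and the only action taken on
selection is to dial, which mirrors the separation between the data connection and
the voice call.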