Systems and methods are disclosed for packet voice conferencing. An encoding
system accepts two sound field signals, representing the same sound field sampled
at two spatially separated points. The relative delay between the two sound field
signals is detected over a given time interval. The sound field signals are combined
and then encoded as a single audio signal, e.g., by a method suitable for monophonic
VoIP. The encoded audio payload and the relative delay are placed in one or more
packets and sent to a decoding device via the packet network. The decoding device
uses the relative delay to drive a playout splitter. Once the encoded audio
payload has been decoded, the playout splitter creates multiple presentation channels
by inserting the transmitted relative delay into the decoded signal for one (or more)
of the presentation channels. The listener thus perceives a speaker's voice as
originating from a location related to the speaker's physical position at the other
end of the conference. An advantage of these embodiments is that a pseudo-stereo
conference can be conducted with virtually the same bandwidth as a monophonic conference,
since only the relative delay is sent in addition to the single encoded audio signal.
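
The flow described above can be illustrated with a minimal sketch in Python. It is not the disclosed implementation: the abstract does not say how the relative delay is detected or how the signals are combined, so the sketch assumes a brute-force cross-correlation search for the delay and an align-then-average combiner. The function names (estimate_relative_delay, combine, playout_split), the dictionary used as a "packet", and the omission of the audio codec are all hypothetical choices made for illustration.

```python
def estimate_relative_delay(left, right, max_lag):
    """Return the lag in samples that maximizes the cross-correlation.

    A positive result means `right` lags `left` by that many samples.
    Assumption: brute-force search; the source does not name a method.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def combine(left, right, lag):
    """Combine two sound field signals into one mono signal.

    Assumption: time-align `right` to `left` using `lag`, then average.
    Where the aligned sample falls outside `right`, keep `left` alone.
    """
    n = len(left)
    mono = []
    for i in range(n):
        j = i + lag
        if 0 <= j < n:
            mono.append(0.5 * (left[i] + right[j]))
        else:
            mono.append(left[i])
    return mono


def delay_signal(x, d):
    """Delay `x` by `d` samples, zero-padding the front and keeping the length."""
    d = min(max(d, 0), len(x))
    return [0.0] * d + list(x[:len(x) - d])


def playout_split(mono, lag):
    """Create two presentation channels by re-inserting the relative delay.

    Returns (left_channel, right_channel). The channel that lagged at the
    far end is the one that receives the delay.
    """
    if lag >= 0:
        return list(mono), delay_signal(mono, lag)
    return delay_signal(mono, -lag), list(mono)


# Sending side: two captures of the same sound, the second arriving 2 samples later.
left_in = [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
right_in = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0]
lag = estimate_relative_delay(left_in, right_in, max_lag=3)

# Hypothetical packet: one mono payload (codec step omitted) plus the delay.
packet = {"payload": combine(left_in, right_in, lag), "delay": lag}

# Receiving side: rebuild two presentation channels from the single payload.
left_out, right_out = playout_split(packet["payload"], packet["delay"])
```

In this sketch the packet carries one audio payload plus a single integer, which is the sense in which the pseudo-stereo stream costs roughly the same bandwidth as a monophonic one. A real system would run the payload through a voice codec and would likely smooth the delay estimate across successive time intervals; both are left out here.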