An attention direction of a robot, indicated by a face, eyes or the like
thereof, can be aligned with a directivity direction of a microphone
array. Specifically, an acoustic signal from a sound source can be
captured, and input signals for individual microphones can be generated.
A direction of the sound source can be estimated from the input signals.
A line of sight of the robot, a posture thereof, or both, can be controlled
such that the attention direction of the robot coincides with the
direction of the sound source. Then, the directivity direction of the
microphone array can be aligned with the attention direction. Thereafter,
voice recognition can be performed using, as its input, a delay-sum output
corresponding to the directivity direction.
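
By way of a non-limiting illustration, the direction estimation and the
delay-sum steps described above might be sketched as follows in Python. The
sketch assumes a uniform linear array, a far-field source, a sound speed of
343 m/s, whole-sample delays, and a simple output-power scan over candidate
angles as the direction estimator. These choices, as well as all function
names and numeric parameters, are hypothetical and are not taken from the
present description.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not from the description


def steering_delays(num_mics, spacing_m, angle_deg, sample_rate):
    """Arrival delay of each microphone, in whole samples, for a far-field
    source at angle_deg (0 = broadside) on a uniform linear array."""
    s = math.sin(math.radians(angle_deg))
    return [round(m * spacing_m * s / SPEED_OF_SOUND * sample_rate)
            for m in range(num_mics)]


def delay_sum(channels, delays):
    """Advance each channel by its delay and average the aligned samples.
    Samples that would fall outside the recording are treated as zero."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for x, d in zip(channels, delays):
            i = t + d
            if 0 <= i < n:
                acc += x[i]
        out.append(acc / len(channels))
    return out


def estimate_direction(channels, spacing_m, sample_rate, candidates):
    """Pick the candidate angle whose delay-sum output has the most power
    (a stand-in for whatever estimator an actual embodiment uses)."""
    def power(angle):
        delays = steering_delays(len(channels), spacing_m, angle, sample_rate)
        return sum(v * v for v in delay_sum(channels, delays))
    return max(candidates, key=power)


# Hypothetical scenario: 4 microphones, 10 cm apart, 16 kHz sampling,
# broadband source located 30 degrees off broadside.
NUM_MICS, SPACING_M, SAMPLE_RATE, TRUE_ANGLE = 4, 0.10, 16000, 30
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(400)]

# Simulate the input signal of each microphone as a delayed copy of the source.
true_delays = steering_delays(NUM_MICS, SPACING_M, TRUE_ANGLE, SAMPLE_RATE)
channels = [[source[t - d] if t - d >= 0 else 0.0 for t in range(len(source))]
            for d in true_delays]

# Estimate the source direction, then steer the directivity toward it.
estimated = estimate_direction(channels, SPACING_M, SAMPLE_RATE,
                               range(-90, 91, 10))
enhanced = delay_sum(channels, steering_delays(NUM_MICS, SPACING_M,
                                               estimated, SAMPLE_RATE))
print("estimated direction (degrees):", estimated)
```

In this sketch, `estimated` stands in for the direction toward which the
attention direction of the robot would be turned, and `enhanced` stands in for
the delay-sum output that would be supplied to the voice recognition. A
practical embodiment would likely use fractional delays and a more robust
direction estimator.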