Animals are able to robustly localize sound sources using a variety of auditory cues [Bregman]. Experiments have shown that one of the most important cues for determining azimuthal position is the interaural time difference, the slight difference in the time it takes a sound to reach each of the two ears [Konishi]. Similarly, the two microphones on the head of our robot are separated horizontally in space to maximize the interaural time difference between their audio signals. This information is then combined with visual cues by the convolutional neural network to determine the overall saliency of the different locations in its field of view. In order for the neural network to process the auditory information in the same manner as it does visual information, the raw audio data must first be converted into an auditory space map, as depicted in Figure 4.
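To make the geometric cue concrete, the sketch below converts an interaural time difference into an azimuth estimate under a simple far-field model; the microphone spacing, speed of sound, and function name used here are illustrative assumptions, not parameters taken from the robot described in the text.

```python
import numpy as np

# Far-field model: a plane wave arriving from azimuth theta reaches the two
# microphones with a time difference  itd = (d / c) * sin(theta).
# Inverting this gives a coarse azimuth estimate from a measured ITD.

SPEED_OF_SOUND = 343.0   # m/s, assumed value at room temperature
MIC_SPACING = 0.20       # m, hypothetical horizontal separation of the microphones

def itd_to_azimuth(itd_seconds: float) -> float:
    """Map an interaural time difference (seconds) to an azimuth (radians)."""
    # Clip to the physically admissible range before taking the arcsine.
    s = np.clip(itd_seconds * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.arcsin(s))

# Example: a sound leading at one microphone by 0.3 ms lies roughly 31 degrees
# off the midline under these assumed parameters.
print(np.degrees(itd_to_azimuth(0.3e-3)))
```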
The audio signals from the left and right microphones are first digitized at 16000 Hz by the sound card. These waveforms are then passed through a rectification nonlinearity to emphasize the envelope of the sounds present in the recordings. The resulting signals are then cross-correlated, giving the following time correlation function:
c(\Delta t) = \sum_{t} l(t)\, r(t + \Delta t)    (1)

where l(t) and r(t) denote the rectified left and right microphone signals and \Delta t is the interaural delay.
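A minimal sketch of this processing chain is given below, operating on NumPy arrays sampled at 16 kHz; half-wave rectification and numpy.correlate are assumed stand-ins for the rectification nonlinearity and the correlation implementation, which the text does not specify in detail, and the lag range is an illustrative choice.

```python
import numpy as np

FS = 16000  # sampling rate (Hz), as stated in the text

def interaural_correlation(left: np.ndarray, right: np.ndarray, max_lag: int = 16):
    """Rectify both channels and cross-correlate them over a small lag range.

    Returns (lags, correlation); the lag of the correlation peak is an
    estimate of the interaural time difference in samples.
    """
    # Rectification nonlinearity (half-wave rectification assumed here)
    # emphasizes the envelope of each channel.
    l = np.maximum(left, 0.0)
    r = np.maximum(right, 0.0)

    # Full cross-correlation, then keep only lags within +/- max_lag samples.
    full = np.correlate(l, r, mode="full")
    center = len(l) - 1                      # index of zero lag
    lags = np.arange(-max_lag, max_lag + 1)
    corr = full[center - max_lag : center + max_lag + 1]
    return lags, corr

# Example: a synthetic source reaching the right microphone 4 samples
# (0.25 ms at 16 kHz) before the left one.
rng = np.random.default_rng(0)
sig = rng.standard_normal(FS // 10)
right = sig
left = np.roll(sig, 4)                       # left channel lags by 4 samples
lags, corr = interaural_correlation(left, right)
print("peak lag (samples):", lags[np.argmax(corr)])   # -> 4
```

The location of the correlation peak along the lag axis is what the auditory space map encodes spatially, so that the network can read the interaural time difference off the map in the same way it reads positions out of the visual input.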