Animals are able to robustly localize sound sources using a variety of auditory cues [Bregman]. Experiments have shown that one of the most important cues for determining azimuthal position is the interaural time difference, the slight difference in the time it takes a sound to reach each of the two ears [Konishi]. Similarly, the two microphones on the head of our robot are separated horizontally in space to maximize the interaural time difference between their audio signals. This information is then combined with visual cues by the convolutional neural network to determine the overall saliency of the different locations in its field of view. In order for the neural network to process the auditory information in the same manner as it does visual information, the raw audio data must first be converted into an auditory space map, as depicted in Figure 4.
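To make the geometric cue concrete, the sketch below converts an interaural time difference into an azimuth estimate under a simple far-field model; the microphone spacing, speed of sound, and function name used here are illustrative assumptions, not parameters taken from the robot described in the text.

```python
import numpy as np

# Far-field model: a plane wave arriving from azimuth theta reaches the two
# microphones with a time difference  itd = (d / c) * sin(theta).
# Inverting this gives a coarse azimuth estimate from a measured ITD.

SPEED_OF_SOUND = 343.0   # m/s, assumed value at room temperature
MIC_SPACING = 0.20       # m, hypothetical horizontal separation of the microphones

def itd_to_azimuth(itd_seconds: float) -> float:
    """Map an interaural time difference (seconds) to an azimuth (radians)."""
    # Clip to the physically admissible range before taking the arcsine.
    s = np.clip(itd_seconds * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.arcsin(s))

# Example: a sound leading at one microphone by 0.3 ms lies roughly 31 degrees
# off the midline under these assumed parameters.
print(np.degrees(itd_to_azimuth(0.3e-3)))
```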
The audio signals from the left and right microphones are first digitized at 16000 Hz by the sound card. These waveforms are then passed through a rectification nonlinearity to emphasize the envelope of the sounds present in the recordings. The resulting signals are then cross-correlated, giving the following time correlation function:
c(\Delta t) = \sum_{t} l(t)\, r(t + \Delta t)    (1)

where l(t) and r(t) denote the rectified left and right microphone signals and \Delta t is the interaural delay.
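A minimal sketch of this processing chain is given below, operating on NumPy arrays sampled at 16 kHz; half-wave rectification and numpy.correlate are assumed stand-ins for the rectification nonlinearity and the correlation implementation, which the text does not specify in detail, and the lag range is an illustrative choice.

```python
import numpy as np

FS = 16000  # sampling rate (Hz), as stated in the text

def interaural_correlation(left: np.ndarray, right: np.ndarray, max_lag: int = 16):
    """Rectify both channels and cross-correlate them over a small lag range.

    Returns (lags, correlation); the lag of the correlation peak is an
    estimate of the interaural time difference in samples.
    """
    # Rectification nonlinearity (half-wave rectification assumed here)
    # emphasizes the envelope of each channel.
    l = np.maximum(left, 0.0)
    r = np.maximum(right, 0.0)

    # Full cross-correlation, then keep only lags within +/- max_lag samples.
    full = np.correlate(l, r, mode="full")
    center = len(l) - 1                      # index of zero lag
    lags = np.arange(-max_lag, max_lag + 1)
    corr = full[center - max_lag : center + max_lag + 1]
    return lags, corr

# Example: a synthetic source reaching the right microphone 4 samples
# (0.25 ms at 16 kHz) before the left one.
rng = np.random.default_rng(0)
sig = rng.standard_normal(FS // 10)
right = sig
left = np.roll(sig, 4)                       # left channel lags by 4 samples
lags, corr = interaural_correlation(left, right)
print("peak lag (samples):", lags[np.argmax(corr)])   # -> 4
```

The location of the correlation peak along the lag axis is what the auditory space map encodes spatially, so that the network can read the interaural time difference off the map in the same way it reads positions out of the visual input.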