Training example

**Figure 6:** Fast online adaptation of the neural network. The head location error in pixels in a $120 \times 160$ image is plotted as a function of frame number (5 frames/sec).
$\includegraphics[width=3.0in]{ploterror}$

An example of how quickly the robot is able to adapt is given by the learning curve in Figure 6. In this particular example, the system was trained to track the head of one of the authors as he moved around and talked in his office. The weights were first initialized to small random values and the learning parameters were set to $g_m = 0$, $g_n = 1$, and $\eta = 0.1$. The robot was then corrected in an online fashion using supervisory inputs to follow the author's head.

After only a few seconds of training at 5 frames/sec (200 ms processing time per image), the system was able to locate the head to within four pixels, as determined by hand-labelling the video data afterwards. Saccadic eye movements, initiated at the times indicated by the arrows in Figure 6, brought new parts of the office into view and occasionally produced a large error. However, over time these errors were corrected, and the neural network learned to robustly discriminate the head from the office surroundings.
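The error measure plotted in Figure 6 can be illustrated with a minimal sketch. It assumes that the location error is the Euclidean distance between the network's estimate and the hand-labelled head position, which the text does not state explicitly. The function names and the sample coordinates are hypothetical and are not taken from the recorded data.

```python
import math

FRAME_RATE_HZ = 5.0  # from the text: 5 frames/sec, i.e. 200 ms per image


def location_errors(predicted, labelled):
    """Per-frame pixel distance between predicted and hand-labelled positions.

    Assumption: the error is the Euclidean distance between (x, y) points.
    """
    if len(predicted) != len(labelled):
        raise ValueError("need one label per predicted position")
    return [
        math.hypot(px - lx, py - ly)
        for (px, py), (lx, ly) in zip(predicted, labelled)
    ]


def first_frame_within(errors, tolerance_px=4.0):
    """Index of the first frame whose error is within the tolerance, or None."""
    for index, error in enumerate(errors):
        if error <= tolerance_px:
            return index
    return None


def frame_to_seconds(frame_index):
    """Convert a frame number to elapsed time at the stated frame rate."""
    return frame_index / FRAME_RATE_HZ


# Made-up positions (x, y) in a 120 x 160 image, for illustration only.
predicted = [(80, 60), (70, 50), (62, 43), (60, 41)]
labelled = [(60, 40), (60, 40), (60, 40), (60, 41)]

errors = location_errors(predicted, labelled)
frame = first_frame_within(errors)   # index 2 for the sample data above
elapsed = frame_to_seconds(frame)    # 0.4 s at 5 frames/sec
```

Because saccades occasionally produce a large error later in the sequence, the first frame within tolerance is only a rough indicator of adaptation speed; the full error curve is the more informative quantity.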

Figure 7 shows the inputs and weights of the network after a minute of training as the author walked around his office. The kernels necessarily appear slightly smeared because they are adapted to be invariant to slight changes in head position, rotation, and scale. But they clearly depict the dark hair, facial features, and skin color of the head. The relative weighting ($c_Y, c_U, c_V > c_D, c_A$) of the different input channels shows that the luminance and color information are the most reliable for tracking the head. This is probably because the presence of other moving body parts and external noise sources in the office made the motion and auditory channels relatively unreliable.
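The effect of the channel weights can be illustrated with a small sketch. It assumes that each channel produces a response map, that the maps are summed linearly with scalar weights $c$, and that the head position is read out as the location of the maximum; this section does not specify these details, so they are illustrative assumptions. The function names, the $3 \times 3$ maps, and the weight values are hypothetical.

```python
def combine_channels(responses, weights):
    """Weighted sum of per-channel response maps of identical shape.

    Assumption: channels are combined linearly, one scalar weight each.
    """
    names = list(responses)
    rows = len(responses[names[0]])
    cols = len(responses[names[0]][0])
    total = [[0.0] * cols for _ in range(rows)]
    for name in names:
        c = weights[name]
        for r in range(rows):
            for k in range(cols):
                total[r][k] += c * responses[name][r][k]
    return total


def argmax_2d(grid):
    """Return (row, col) of the largest entry; the first one wins on ties."""
    best_value = grid[0][0]
    best_position = (0, 0)
    for r, row in enumerate(grid):
        for k, value in enumerate(row):
            if value > best_value:
                best_value = value
                best_position = (r, k)
    return best_position


# Made-up response maps: luminance (Y) peaks at the head in cell (1, 2),
# while motion (D) peaks at a distractor in cell (0, 0).
responses = {
    "Y": [[0.1, 0.2, 0.1], [0.2, 0.3, 0.9], [0.1, 0.2, 0.3]],
    "D": [[1.2, 0.1, 0.0], [0.0, 0.1, 0.2], [0.0, 0.0, 0.1]],
}

# Down-weighting the unreliable motion channel keeps the estimate on the head.
tuned = argmax_2d(combine_channels(responses, {"Y": 1.0, "D": 0.2}))

# With equal weights the distractor dominates the combined map.
equal = argmax_2d(combine_channels(responses, {"Y": 1.0, "D": 1.0}))
```

In this toy case the small motion weight plays the same role as the learned ordering of the channel weights: a channel that often responds to something other than the head contributes little to the final estimate.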

**Figure 7:** Example showing the inputs and weights used in tracking a head ($c_A = 0.05$). The head position as calculated by the neural network is marked with a box.
$\includegraphics[width=3.0in]{headpos}$

More complicated neural network architectures could be used for combining the different sensory inputs to achieve better tracking performance. However, this example shows how a simple convolutional network architecture can efficiently integrate the different visual and auditory cues in order to learn how to robustly track an object. Moreover, by using fast online adaptation of the neural network weights, the system is able to rapidly accommodate changing environments.