Let us consider the problem of programming the robot to watch someone with its head, following them as they walk around the room. This type of active perception is critical for proper functioning of the human visual system [Yarbus]. Because human retinas have small, high-resolution foveal regions surrounded by visual fields of much lower resolution, the eyes must move constantly to keep objects of interest focused on the foveas. Thus, the role of the human oculomotor system is to direct relevant objects in the visual world onto these high-resolution areas. To accomplish this task, biology has evolved many complex neural circuits for controlling eye movements, including the saccadic system, which rapidly acquires new objects, and the smooth pursuit system, which tracks slowly moving objects [Horiuchi,Rao].
In analogy with the biological system, our robot must first determine which feature in its visual field is most relevant and then direct its gaze towards that object. As the target person moves about, the robot's tracking system attempts to stabilize the person's image in the center of the field of view. Unfortunately, the physical performance of our hardware falls far short of biology. A fixed lens is used on the camera which gives it roughly a horizontal and vertical field of view. With the video digitized at frames of pixels, this corresponds to a little less than half a degree of angular resolution per pixel, roughly 25 times worse than human foveal acuity. In addition, the maximum saccadic velocity of the servo motors is about 300 deg/sec, roughly half the peak velocity of human eye movements. The tracking algorithm must compensate for these physical limitations.
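To make the comparison concrete, the back-of-the-envelope arithmetic looks like the sketch below. The field-of-view and frame-width values are illustrative placeholders rather than the actual camera specifications; only the 300 deg/sec servo limit comes from the text.

```python
# Back-of-the-envelope comparison of the robot's optics and servos with human
# vision.  The field of view and frame width below are illustrative
# placeholders, NOT the actual camera specifications.
FOV_HORIZONTAL_DEG = 54.0          # assumed horizontal field of view of the fixed lens
FRAME_WIDTH_PIXELS = 120           # assumed width of the digitized frames

deg_per_pixel = FOV_HORIZONTAL_DEG / FRAME_WIDTH_PIXELS        # ~0.45 deg per pixel
foveal_acuity_deg = 1.0 / 60.0                                 # ~1 arcminute
acuity_ratio = deg_per_pixel / foveal_acuity_deg               # ~27x coarser

servo_saccade_deg_per_sec = 300.0    # maximum servo velocity (from the text)
human_saccade_deg_per_sec = 600.0    # typical peak human saccadic velocity
speed_ratio = human_saccade_deg_per_sec / servo_saccade_deg_per_sec

print(f"{deg_per_pixel:.2f} deg per pixel, {acuity_ratio:.0f}x coarser than the fovea")
print(f"servo saccades are {speed_ratio:.1f}x slower than human saccades")
```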
A naive heuristic for the robot to track someone is simply to follow the largest moving object in the image. However, such a system is easily fooled when multiple objects are moving in the room. More sophisticated algorithms [Darrell,Petajan] locate human heads using rules such as flesh color detection [Yang] or matching ellipses to head outlines [Eleftheriadis]. Another approach uses directional microphones and audio cues to locate a person who is speaking [Bregman]. But all of these algorithms fail in situations for which they were not designed; such predetermined tracking rules tend to be brittle and break when the surrounding environment is highly variable.
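For concreteness, a minimal version of the largest-moving-object heuristic might look like the following sketch, based on simple frame differencing and connected components; it is meant only to illustrate why such a rule is easily fooled, not to reproduce any of the cited systems.

```python
import numpy as np
from scipy import ndimage

def largest_motion_target(prev_frame, frame, threshold=20):
    """Naive heuristic: return the (row, col) centroid of the largest moving
    blob found by frame differencing, or None if nothing is moving.
    Assumes grayscale frames given as uint8 arrays."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    moving = diff > threshold                      # binary motion mask
    labels, num_blobs = ndimage.label(moving)      # connected components
    if num_blobs == 0:
        return None
    sizes = ndimage.sum(moving, labels, index=range(1, num_blobs + 1))
    biggest_label = int(np.argmax(sizes)) + 1
    return ndimage.center_of_mass(moving, labels, biggest_label)
```

The robot would servo its head toward the returned centroid, so as soon as a larger object moves anywhere else in the room, the estimate jumps to the wrong target.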
Recently, a potentially more robust method for head detection has been demonstrated using neural networks that learn the appropriate grayscale features of a face [Sung,Nowlan,Rowley]. Although these networks can detect faces in images very accurately, they need to be trained on a very large set of labelled face and non-face images. Since they are typically trained in batch mode on a preset ensemble of faces, they also do not learn to discriminate one person from another.
For our robot, we use a convolutional neural network that rapidly learns to locate and track a particular person's head. The system learns this task in real time from online supervisory signals. The network architecture integrates multimodal information in a natural way, and the fast online adaptation of the network's weights allows it to adjust the relative importance of its inputs to the current environment. This enables the system to track a person robustly against a cluttered background and in the presence of other moving objects.
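The architecture and learning rule are described in detail later; purely as a sketch of the idea, suppose the network produces one saliency map per modality (e.g., grayscale, motion, color, audio), combines them with adaptive weights into a single output map whose peak gives the head position, and nudges those weights online whenever a supervisory signal provides the true position. The modality names and the simple reweighting rule below are illustrative assumptions, not the actual network.

```python
import numpy as np

class OnlineMultimodalTracker:
    """Illustrative sketch only: combine per-modality saliency maps with
    weights that are adapted online from a supervisory position signal."""

    def __init__(self, modalities, learning_rate=0.05):
        # Start with equal importance for every cue; the cue names
        # (grayscale, motion, color, audio) are assumptions for this example.
        self.weights = {m: 1.0 / len(modalities) for m in modalities}
        self.lr = learning_rate

    def predict(self, saliency_maps):
        """saliency_maps: dict of modality name -> 2-D array in [0, 1].
        Returns (row, col) of the peak of the weighted combination."""
        combined = sum(self.weights[m] * saliency_maps[m] for m in self.weights)
        return np.unravel_index(np.argmax(combined), combined.shape)

    def adapt(self, saliency_maps, true_position):
        """Online update: cues that respond strongly at the supervised target
        location gain weight relative to the others, so the relative
        importance of each modality tracks the current environment."""
        for m in self.weights:
            response = float(saliency_maps[m][true_position])
            self.weights[m] = (1.0 - self.lr) * self.weights[m] + self.lr * response
        total = sum(self.weights.values()) or 1.0
        for m in self.weights:
            self.weights[m] /= total
```

The actual system adapts the weights of a convolutional network rather than a handful of scalar gains, but the principle is the same: cues that are reliable in the current environment are weighted up online, and unreliable ones are weighted down.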