The raw video signal from the robot must first be preprocessed
before it can be fed to the convolutional neural network. The
composite video signal from the CCD camera is digitized with a video
capture board into a time series of raw
RGB images as
shown in Figure 3. Each RGB color image is then
converted into its YUV representation, and a difference (D) image is
also computed from the absolute value of the difference between
consecutive frames. The Y component represents the luminance or
grayscale information in the image, while the U and V channels contain
the chromatic or color data. Motion information in the video stream
is thus captured in the D image, where moving objects appear highlighted.
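
As a concrete illustration of this preprocessing step, the following is a minimal NumPy sketch of the YUV conversion and the D-image computation. The BT.601 conversion coefficients and the choice to difference the luminance channel are assumptions made here for illustration; the paper does not specify either.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an RGB frame (H, W, 3) to Y, U, V planes.

    BT.601 coefficients are assumed; the capture board's
    exact YUV variant is not specified in the paper.
    """
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance (grayscale)
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return y, u, v

def difference_image(curr_y, prev_y):
    """D image: absolute per-pixel difference between two
    consecutive frames, so moving objects appear highlighted."""
    return np.abs(curr_y - prev_y)
```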
At each time step, the four YUVD images are subsampled successively to yield representations at lower and lower resolutions. The resulting ``image pyramids'' allow the network to achieve recognition invariance across many different image scales without having to train a separate neural network for each resolution. Instead, a single neural network with the same set of weights is run simultaneously across the different resolutions, and the maximally active resolution and position are selected.
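
The sketch below illustrates this multi-resolution search on a single image plane. The pyramid depth, the factor-of-2 block-averaging filter, and the `net` callable (a stand-in for the shared-weight convolutional network, assumed to return a 2-D activation map) are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def build_pyramid(plane, num_levels=4, factor=2):
    """Successively subsample a single image plane (H, W) by
    `factor`, using simple block averaging; each of the YUVD
    planes would be processed the same way."""
    levels = [plane.astype(np.float32)]
    for _ in range(num_levels - 1):
        img = levels[-1]
        h = (img.shape[0] // factor) * factor
        w = (img.shape[1] // factor) * factor
        img = img[:h, :w].reshape(
            h // factor, factor, w // factor, factor).mean(axis=(1, 3))
        levels.append(img)
    return levels

def detect_across_scales(pyramid, net):
    """Run one network (same weights) over every pyramid level
    and return the maximally active (level, position) pair."""
    best = (-np.inf, None, None)  # (activation, level, (row, col))
    for lvl, img in enumerate(pyramid):
        act = net(img)                               # 2-D activation map
        pos = np.unravel_index(np.argmax(act), act.shape)
        if act[pos] > best[0]:
            best = (act[pos], lvl, pos)
    return best
```

Because the same weights are applied at every level, detection at a coarser level corresponds to recognizing the object at a proportionally larger apparent size in the original frame, which is what yields the scale invariance described above.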