The raw video signal from the robot must be preprocessed before it can be fed to the convolutional neural network. The composite video signal from the CCD camera is digitized with a video capture board into a time series of raw RGB images, as shown in Figure 3. Each RGB color image is then converted to its YUV representation: the Y component carries the luminance (grayscale) information, while the U and V channels carry the chromatic, or color, data. In addition, a difference (D) image is computed as the absolute value of the difference between consecutive frames; moving objects appear highlighted in the D image, which thus encodes the motion information in the video stream.
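The following is a minimal sketch of this per-frame preprocessing, assuming the standard BT.601 RGB-to-YUV conversion coefficients and assuming the D image is computed on the luminance channel; the paper specifies neither the YUV variant produced by the capture board nor which channel the difference is taken over.

\begin{verbatim}
import numpy as np

# BT.601 RGB -> YUV conversion matrix (an assumed choice; the
# capture board's exact YUV variant is not specified).
RGB2YUV = np.array([[ 0.299,    0.587,    0.114  ],
                    [-0.14713, -0.28886,  0.436  ],
                    [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(rgb):
    """Convert an (H, W, 3) RGB frame to its YUV representation."""
    return rgb.astype(np.float32) @ RGB2YUV.T

def preprocess(prev_rgb, curr_rgb):
    """Raw RGB frames in, a stacked (H, W, 4) YUVD array out."""
    prev_yuv = rgb_to_yuv(prev_rgb)
    curr_yuv = rgb_to_yuv(curr_rgb)
    # D image: absolute difference between consecutive luminance
    # frames, so moving objects appear highlighted.
    d = np.abs(curr_yuv[..., 0] - prev_yuv[..., 0])
    return np.dstack([curr_yuv, d])
\end{verbatim}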
At each time step, the four YUVD images are subsampled successively to yield representations at lower and lower resolutions. The resulting ``image pyramids'' allow the network to achieve recognition invariance across many image scales without training a separate network for each resolution. Instead, a single network with the same set of weights is run simultaneously across the different resolutions, and the maximally active resolution and position are selected, as sketched below.
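A minimal sketch of the multi-scale search follows. The subsampling factor of 2 and the number of pyramid levels are assumptions, and \texttt{score\_map} is a hypothetical stand-in for the shared-weight convolutional network, taken here to return a 2-D activation map over positions.

\begin{verbatim}
import numpy as np

def subsample(img, factor=2):
    """Average-pooling subsample (factor 2 is an assumption)."""
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    img = img[:h, :w]
    return img.reshape(h // factor, factor,
                       w // factor, factor, -1).mean(axis=(1, 3))

def build_pyramid(yuvd, levels=4):
    """Successively subsample the YUVD stack into an image pyramid."""
    pyramid = [yuvd]
    for _ in range(levels - 1):
        pyramid.append(subsample(pyramid[-1]))
    return pyramid

def detect(pyramid, score_map):
    """Run one shared-weight network over every pyramid level and
    return the maximally active (scale, position)."""
    best = (-np.inf, None)
    for level, img in enumerate(pyramid):
        act = score_map(img)  # same weights at every scale
        r, c = np.unravel_index(np.argmax(act), act.shape)
        if act[r, c] > best[0]:
            best = (act[r, c], (level, r, c))
    return best  # (activation, (scale, row, col))
\end{verbatim}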