The authors propose a technique for adapting a CNN trained on RGB images to leverage depth images at test time. This is done through a mid-level fusion of RGB and depth CNNs.
The approach is based on the observation that CNNs are largely task- and category-agnostic but domain-specific at lower layers, and the opposite at higher layers.
The question they try to answer: is there a way to use a large amount of labeled RGB data, along with some RGB-D data, to train detectors that can use RGB-D data at test time to boost performance over the RGB detector, even for objects that have no labeled depth data?
The resulting RGB-D detector is able to utilize the depth data provided at test time to improve detection, without ever being trained on any depth data for some categories (U in the paper).
They average the fc6 activations of the RGB and depth networks, after the ReLU, and pass the result on to the fc7 layer.
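The averaging step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function names `relu` and `fuse_fc6` are hypothetical, and the tiny three-element vectors stand in for the real fc6 pre-activations of the two networks.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def fuse_fc6(fc6_rgb_pre, fc6_depth_pre):
    """Average the post-ReLU fc6 activations of the RGB and depth streams.

    Both inputs are pre-activation fc6 outputs with the same shape.
    The result is what would be fed to fc7.
    """
    return 0.5 * (relu(fc6_rgb_pre) + relu(fc6_depth_pre))

# Toy pre-activations (hypothetical values, for illustration only).
rgb_pre = np.array([1.0, -2.0, 3.0])
depth_pre = np.array([-1.0, 4.0, 1.0])

fused = fuse_fc6(rgb_pre, depth_pre)
print(fused)
```

Because the two streams share the fc6 output size, the average has the same shape as either input, so the fc7 weights of the RGB network can consume it unchanged.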
Training begins with the RGB network (AlexNet architecture), using all categories.
The weights are then copied to the depth network, which is trained with the available depth data. Depth data is HHA encoded, which encodes the image geocentrically using three channels: horizontal disparity, height above ground, and the angle between the pixel's local surface normal and the inferred gravity direction.
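To make the third HHA channel concrete, here is a minimal sketch of the angle between a surface normal and a gravity direction for a single pixel. It assumes both vectors are already available (estimating them from a depth map is the hard part and is not shown), and the function name `angle_with_gravity`, the sign convention for gravity, and the example vectors are all assumptions made for illustration.

```python
import numpy as np

def angle_with_gravity(normal, gravity):
    """Angle in degrees between a surface normal and the gravity direction."""
    n = normal / np.linalg.norm(normal)
    g = gravity / np.linalg.norm(gravity)
    # Clip guards against rounding pushing the cosine outside [-1, 1].
    cos_angle = np.clip(np.dot(n, g), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

# Assumed convention: gravity points along -y, normals point away from surfaces.
gravity = np.array([0.0, -1.0, 0.0])
floor_normal = np.array([0.0, 1.0, 0.0])   # horizontal surface facing up
wall_normal = np.array([1.0, 0.0, 0.0])    # vertical surface

floor_angle = angle_with_gravity(floor_normal, gravity)
wall_angle = angle_with_gravity(wall_normal, gravity)
```

Under this convention, horizontal and vertical surfaces get clearly separated angle values, which is the kind of geometric cue the depth network can pick up on.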
After this, the final network is assembled. For layers before the merge point, use the RGB and depth weights directly. At the merge point, average the fc6 activations as described above. For layers after the merge point, use the RGB weights.
The authors claim that this improves the performance of the RGB-only detector trained on this data by 21%, among other interesting results.
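The assembly rule can be written down as a small weight-selection routine. This is a sketch under assumptions: the layer list follows the usual AlexNet naming (`conv1` to `conv5`, `fc6`, `fc7`, `fc8`), where only fc6 and fc7 are named in these notes; the function `assemble_fused_network` is hypothetical; and strings stand in for the actual weight tensors.

```python
# Assumed AlexNet-style layer order; only fc6 and fc7 are named in the notes.
LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7", "fc8"]
MERGE_AT = "fc6"  # activations are averaged after this layer

def assemble_fused_network(rgb_weights, depth_weights,
                           layers=LAYERS, merge_at=MERGE_AT):
    """Pick per-layer weights for the two-stream network.

    Layers up to and including the merge point keep separate RGB and
    depth weights; layers after it reuse the RGB weights only.
    """
    cut = layers.index(merge_at) + 1
    lower, upper = layers[:cut], layers[cut:]
    return {
        "rgb_stream": {name: rgb_weights[name] for name in lower},
        "depth_stream": {name: depth_weights[name] for name in lower},
        "shared_head": {name: rgb_weights[name] for name in upper},
    }

# Placeholder "weights": strings instead of real tensors.
rgb_weights = {name: "rgb:" + name for name in LAYERS}
depth_weights = {name: "depth:" + name for name in LAYERS}

final_net = assemble_fused_network(rgb_weights, depth_weights)
print(final_net["shared_head"])
```

Reusing the RGB weights above the merge point is what lets categories without depth labels still benefit: the upper layers were trained on all categories, while the depth stream only supplies extra mid-level features.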