An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, Jason Yosinski
arXiv e-Print archive, 2018
Keywords:
cs.CV, cs.LG, stat.ML
First published: 2018/07/09
Abstract: Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either perfect translation invariance or varying
degrees of translation dependence, as required by the task. CoordConv solves
the coordinate transform problem with perfect generalization, 150 times
faster and with 10--100 times fewer parameters than convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse, as the transform between high-level spatial latents and
pixels became easier to learn. A Faster R-CNN model trained on MNIST digit
detection showed 24% better IoU when using CoordConv, and in the RL domain,
agents playing Atari games benefited significantly from the use of CoordConv
layers.
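
The mechanism described in the abstract, concatenating extra coordinate channels to the input before an ordinary convolution, is simple enough to sketch directly. Below is a minimal illustration in PyTorch; the class name CoordConv2d and the exact details are assumptions for illustration (the paper's released code differs in specifics), but the core idea of appending row and column coordinates normalized to [-1, 1] follows the paper's description:

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Sketch of a CoordConv layer: concatenate normalized (i, j)
    coordinate channels to the input, then apply a standard Conv2d."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        # Two extra input channels carry the row and column coordinates.
        self.conv = nn.Conv2d(in_channels + 2, out_channels,
                              kernel_size, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        # Coordinate grids scaled to [-1, 1], per the paper's description.
        ys = torch.linspace(-1.0, 1.0, steps=h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, steps=w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([yy, xx]).unsqueeze(0).expand(b, -1, -1, -1)
        # Append the coordinate channels and convolve as usual.
        return self.conv(torch.cat([x, coords], dim=1))
```

Used as a drop-in replacement, CoordConv2d(3, 16, kernel_size=3, padding=1) behaves like the corresponding nn.Conv2d but lets filters condition on position. If the learned weights on the two coordinate channels go to zero, the layer reduces to an ordinary convolution, which is how the paper accounts for CoordConv retaining the option of perfect translation invariance while also permitting translation dependence when the task requires it.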