This paper was presented at ICML 2019.
Do you remember greedy layer-wise training? Are you curious what a modern take on the idea can achieve? This is the paper for you then. And it has its own very good summary:
> We use standard convolutional and fully connected network architectures, but instead of globally back-propagating errors, each weight layer is trained by a local learning signal that is not back-propagated down the network. The learning signal is provided by two separate single-layer sub-networks, each with their own distinct loss function. One sub-network is trained with a standard cross-entropy loss, and the other with a similarity matching loss.
If it's a bit unclear, this figure might help:
![local_error_signal](https://user-images.githubusercontent.com/8659132/59717441-ff672980-91e5-11e9-90d5-8f81c3468391.png)
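To make the structure concrete, here is a minimal PyTorch sketch of one such locally trained block. The names (`LocalBlock`, `pred_head`, `sim_head`) and the layer sizes are my own, not the paper's code, and the real architecture differs in details such as pooling and sub-network dimensions:

```python
import torch.nn as nn

class LocalBlock(nn.Module):
    """One weight layer plus the two single-layer sub-networks that
    produce its local error signals (hypothetical naming, not the paper's)."""
    def __init__(self, in_channels, out_channels, num_classes, sim_channels=32):
        super().__init__()
        # The weight layer being trained; its output is what the next block receives.
        self.layer = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )
        # Sub-network 1: a single linear layer for the local cross-entropy loss.
        self.pred_head = nn.Linear(out_channels, num_classes)
        # Sub-network 2: a single conv layer whose output feeds the similarity loss.
        self.sim_head = nn.Conv2d(out_channels, sim_channels, 3, padding=1)

    def forward(self, x):
        h = self.layer(x)
        # Global average pooling before the linear prediction head.
        logits = self.pred_head(h.mean(dim=(2, 3)))
        sim_features = self.sim_head(h)
        return h, logits, sim_features
```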
The cross-entropy loss is the standard classification loss. The similarity matching loss compares a similarity matrix computed from the layer's output (passed through the small sub-network) to the similarity matrix of the one-hot encoded labels:
$$
L_{\mathrm{sim}} = \left\| S(\text{NeuralNet}(H)) - S(Y) \right\|_{F}^{2}
$$
Here $S(\cdot)$ builds a cosine similarity matrix whose elements are:
$$
s_{ij} = s_{ji} = \frac{\tilde{\mathbf{x}}_{i}^{T} \tilde{\mathbf{x}}_{j}}{\|\tilde{\mathbf{x}}_{i}\|_{2} \, \|\tilde{\mathbf{x}}_{j}\|_{2}}
$$
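A minimal sketch of both pieces, assuming the features $\tilde{\mathbf{x}}$ are simply flattened and L2-normalized; the paper's exact preprocessing and the reduction over matrix entries may differ (I average rather than sum here):

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the flattened examples in a batch."""
    x = x.flatten(start_dim=1)          # shape: (batch, features)
    x = F.normalize(x, dim=1)           # divide each row by its L2 norm
    return x @ x.t()                    # entry (i, j) is cos(x_i, x_j)

def similarity_matching_loss(sim_features, targets, num_classes):
    """|| S(NeuralNet(H)) - S(Y) ||_F^2, averaged over matrix entries here."""
    y_onehot = F.one_hot(targets, num_classes).float()
    return F.mse_loss(cosine_similarity_matrix(sim_features),
                      cosine_similarity_matrix(y_onehot))
```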
The method is used to train VGG-like models on MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, SVHN and STL-10. While it gets near-SOTA results up to CIFAR-10, it is not there yet on more complex datasets: it reaches 80% accuracy on CIFAR-100, where SOTA is around 90%. Still, this is better than a standard ResNet, for example.
Why would we prefer a local loss to a global one? A big advantage is that the weights can be updated during the forward pass, which removes the need to keep all activations in memory for a global backward pass.
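Here is a hedged sketch of what that looks like, reusing the `LocalBlock` and `similarity_matching_loss` helpers from the sketches above; the loss weighting and optimizer handling are illustrative, not the paper's exact recipe:

```python
import torch.nn.functional as F

def forward_and_update(blocks, optimizers, x, targets, num_classes, beta=0.99):
    """Update every block during a single forward pass (illustrative sketch)."""
    for block, opt in zip(blocks, optimizers):
        h, logits, sim_features = block(x)
        pred_loss = F.cross_entropy(logits, targets)
        sim_loss = similarity_matching_loss(sim_features, targets, num_classes)
        loss = (1 - beta) * pred_loss + beta * sim_loss  # weighting is illustrative
        opt.zero_grad()
        loss.backward()      # gradients stay inside this block
        opt.step()
        x = h.detach()       # the next block sees a gradient-free input
    return x
```

The `detach()` call is what severs the blocks: without it, autograd would still build one global graph across layers and the memory advantage would disappear.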
There was another paper on a similar topic, which I didn't read: [Greedy Layerwise Learning Can Scale to ImageNet](https://arxiv.org/abs/1812.11446).
# Comments
- While this is clearly not ready to replace standard backprop, I find this line of work very interesting, as it casts doubt on one of the assumptions behind backprop: that we need a global signal to learn complex functions.
- Though not mentioned in the paper, wouldn't a local loss naturally avoid vanishing and exploding gradients?