Here is a video overview: https://www.youtube.com/watch?v=t-fow6GJepQ

Here is an image of the poster: https://i.imgur.com/Ti9btj9.png
## Headline Summary

The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the trade-off between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality. The authors find that a hard orthogonality constraint can be too restrictive: it preserves gradients, but at the cost of slowing learning.

### Why orthogonal matrices are likely to help with exploding gradients

For a neural network with $N$ hidden layers we can write the gradient of the loss with respect to the weight matrix $W_i$ between hidden states as:

$$\frac{\partial L}{\partial W_i} = \frac{\partial a_i}{\partial W_i} \prod_{j=i}^{N} \left(\frac{\partial h_j}{\partial a_j} \frac{\partial a_{j+1}}{\partial h_j}\right) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial W_i} \prod_{j=i}^{N} \left(D_j W_{j+1}\right) \frac{\partial L}{\partial a_{N+1}}$$

where $L$ is the loss function and $D_j$ is the Jacobian of the activation function. It is in this long chain of products that the gradient tends to either vanish or explode. The problem is particularly bad in recurrent nets, where the chain is as long as the unrolled computation graph. The norm of each factor satisfies

$$\left\|\frac{\partial h_j}{\partial a_j} \frac{\partial a_{j+1}}{\partial h_j}\right\| = \|D_j W_{j+1}\| \leq \|D_j\| \, \|W_{j+1}\| \leq \lambda_{D_j} \lambda_{W_{j+1}}$$

where $\lambda_{D_j}$ and $\lambda_{W_{j+1}}$ are the largest singular values of the respective matrices. It's clear that if these singular values are consistently less than or greater than 1, the norm of the gradient will decay or grow exponentially with the length of the chain of products given above. Constraining the weight matrices to be orthogonal ensures that their singular values are all exactly 1 and so preserves the gradient norm (a small numerical sketch of this appears below).

### How to enforce the orthogonality constraints

The set of all orthogonal matrices forms a continuous manifold known as the Stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then perform gradient optimisation on this manifold; the orthogonality of the matrix can be preserved by a Cayley transformation outlined in the paper. In order to allow the weight matrix to deviate slightly from the Stiefel manifold, the authors take the SVD $W = USV^T$ and parameterise the singular values. They use geodesic gradient descent to maintain the orthogonality of $U$ and $V$, and the singular values are restricted to $1 \pm m$, where $m$ is a margin, by parameterising them with a sigmoid.
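To make the margin parameterisation above concrete, here is a minimal NumPy sketch. This is my own illustration, not the authors' code: the function and variable names are assumptions, and the orthogonal factors are simply drawn at random rather than updated by geodesic descent.

```python
import numpy as np

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

def make_transition_matrix(U, V, p, margin):
    """Build W = U diag(s) V^T with singular values s kept in (1 - margin, 1 + margin).

    U and V are orthogonal factors (kept on the Stiefel manifold in the paper);
    p is an unconstrained parameter vector that the sigmoid squashes into the margin.
    """
    s = 1.0 + margin * (2.0 * sigmoid(p) - 1.0)  # s lies in (1 - margin, 1 + margin)
    return U @ np.diag(s) @ V.T

# Hypothetical usage with random orthogonal factors obtained from a QR decomposition.
rng = np.random.default_rng(0)
n = 64
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
p = rng.standard_normal(n)

for m in [0.0, 0.01, 0.1, 1.0]:
    W = make_transition_matrix(U, V, p, margin=m)
    svals = np.linalg.svd(W, compute_uv=False)
    print(f"margin={m:>4}: singular values in [{svals.min():.3f}, {svals.max():.3f}]")
```

With margin 0 the transition matrix is exactly orthogonal; as the margin grows, the spectrum is allowed to spread around 1, which is exactly the knob the paper turns.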
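As a small numerical illustration of the gradient-norm argument above (again my own sketch, not from the paper; the scaling factors 0.95 and 1.05 are arbitrary choices): multiplying a vector through a long chain of orthogonal matrices preserves its norm, while singular values only slightly below or above 1 make it vanish or explode.

```python
import numpy as np

rng = np.random.default_rng(0)
n, chain_length = 64, 200

# One orthogonal matrix (all singular values equal to 1) and two scaled copies
# whose largest singular values sit just below and just above 1.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
matrices = {
    "orthogonal  (sv = 1.00)": Q,
    "contractive (sv = 0.95)": 0.95 * Q,
    "expansive   (sv = 1.05)": 1.05 * Q,
}

g = rng.standard_normal(n)  # stands in for the gradient being backpropagated
for name, W in matrices.items():
    v = g.copy()
    for _ in range(chain_length):
        v = W @ v  # one factor in the chain of products
    print(f"{name}: norm went from {np.linalg.norm(g):.2f} to {np.linalg.norm(v):.3g}")
```

Over 200 steps the orthogonal chain leaves the norm untouched, while the 0.95 and 1.05 chains shrink or blow it up by several orders of magnitude.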
### Experiment Summary

They experimented on a number of synthetic and real datasets with Elman and factorized RNNs. The synthetic tasks are the Hochreiter memory and addition tasks and classifying sequential MNIST; the real task is language modelling on the Penn Treebank dataset. On the synthetic memory tasks they found that pure orthogonalisation hindered performance, but that a small margin away from orthogonality was better than no orthogonality constraint at all. On the MNIST tasks, the LSTM outperformed all versions of the simple RNNs, but the simple RNNs with orthogonality constraints did quite well, and orthogonal initialisation outperformed identity and Glorot initialisation. On the language modelling task, orthogonality constraints were detrimental on short sentences but beneficial on longer ones; they did not compare to an LSTM here.

They also found that orthogonal initialisation was almost as good as enforcing orthogonality constraints throughout learning, suggesting that preserving gradient flow matters most early in training.

### My Two Cents

This paper explores an interesting approach to learning long-term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacity of an LSTM, so for the time being there don't seem to be any major practical takeaways.