First published: 2017/01/31 (7 years ago) Abstract: It is well known that it is challenging to train deep neural networks and
recurrent neural networks for tasks that exhibit long term dependencies. The
vanishing or exploding gradient problem is a well known issue associated with
these challenges. One approach to addressing vanishing and exploding gradients
is to use either soft or hard constraints on weight matrices so as to encourage
or enforce orthogonality. Orthogonal matrices preserve gradient norm during
backpropagation and can therefore be a desirable property; however, we find
that hard constraints on orthogonality can negatively affect the speed of
convergence and model performance. This paper explores the issues of
optimization convergence, speed and gradient stability using a variety of
different methods for encouraging or enforcing orthogonality. In particular we
propose a weight matrix factorization and parameterization strategy through
which we can bound matrix norms and therein control the degree of expansivity
induced during backpropagation.
## Headline Summary
The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the trade-off between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality. They find that a hard orthogonality constraint can be too restrictive and perseveres gradients at the cost of slowing learning.
### Why orthogonal matrices are likely to help with exploding gradients
For a neural network with N hidden layers we can write the gradient of the loss wrt to the weight matrix between hidden states $W_i$ as:
$ \frac{\partial L}{\partial W_i} = \frac{\partial a_i}{\partial w_i} \prod_j^N (\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i}) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial w_i} \prod_j^N (D_i W_{j+1}) \frac{\partial L}{\partial a_{N+1}}$
where L is the loss function and D is the jacobian of the activation function. It's in this long chain of products that the gradient tends to either vanish or explode. This problem is particularly bad in recurrent nets where the chain is the length of the unrolled computation graph.
The norm of the derivative $\|\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i} \|= \|D_i W_{j+1}\| \leq \|D_i \| \| W_{j+1}\| \leq \lambda_D \lambda_W$
where $\lambda_{D/W}$ is the largest singular value of the given matrix. Its clear that if $\lambda$ is less than or greater than 1, that the norm of the gradient will grow or decay exponentially with length of the chain of products given above.
Constraining the weight matrices to be orthogonal ensures that the singular values are 1 and so perseveres the gradient norm.
### How to enforce the orthogonality constraints
The set of all orthogonal matrices from a continuos manifold known as the stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then to perform gradient optimisation on this manifold. The orthogonality of the matrix can be preserved by a Caley transformation outlined in the paper.
In order to allow the weight matrix to deviate slightly from the stiefel manifold, the authors took the SVD of $W = USV^T $ and parameterised the singular values. They used geodesic gradient descent to maintain the orthogonality of $U$ and $V$. The singular values were restricted to between $ 1 ^+_- m$ where $m$ is a margin by parameterisation with a sigmoid.
### Experiment Summary
They experiment on a number of synthetic and real data-sets with elman and factorial RNNs. The synthetic tasks are the hochrieter memory and addition tasks, classifying sequential mnist and the real task is a language modelling task using the Penn Tree Bank dataset.
In the synthetic memory tasks they found that pure othogonalisation hindered performance but that a small margin away from orthogonality was better than no orthogonal constraint.
On the mnist tasks, the LSTM outperformed all other versions of simple RNNs but the simple RNNs with orthogonal constraints did quite well. Orthonal initialisation outperformed identity and glorot initialisation.
On the language modelling task with short sentences, orthogonal constraints were detrimental but on the longer sentences beneficial. They did not compare to LSTM. They also found that orthogonal initialisation was almost as good as orthogonal constraints during learning suggesting that preserving gradient flow is most important early in learning.
### My Two Cents
This paper explores an interesting approach to learning long term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacities of an LSTM so for the time being it doesn't seem that there are any major practical take aways.