[link]
## Headline Summary The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the tradeoff between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality. They find that a hard orthogonality constraint can be too restrictive and perseveres gradients at the cost of slowing learning. ### Why orthogonal matrices are likely to help with exploding gradients For a neural network with N hidden layers we can write the gradient of the loss wrt to the weight matrix between hidden states $W_i$ as: $ \frac{\partial L}{\partial W_i} = \frac{\partial a_i}{\partial w_i} \prod_j^N (\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i}) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial w_i} \prod_j^N (D_i W_{j+1}) \frac{\partial L}{\partial a_{N+1}}$ where L is the loss function and D is the jacobian of the activation function. It's in this long chain of products that the gradient tends to either vanish or explode. This problem is particularly bad in recurrent nets where the chain is the length of the unrolled computation graph. The norm of the derivative $\\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i} \= \D_i W_{j+1}\ \leq \D_i \ \ W_{j+1}\ \leq \lambda_D \lambda_W$ where $\lambda_{D/W}$ is the largest singular value of the given matrix. Its clear that if $\lambda$ is less than or greater than 1, that the norm of the gradient will grow or decay exponentially with length of the chain of products given above. Constraining the weight matrices to be orthogonal ensures that the singular values are 1 and so perseveres the gradient norm. ### How to enforce the orthogonality constraints The set of all orthogonal matrices from a continuos manifold known as the stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then to perform gradient optimisation on this manifold. The orthogonality of the matrix can be preserved by a Caley transformation outlined in the paper. In order to allow the weight matrix to deviate slightly from the stiefel manifold, the authors took the SVD of $W = USV^T $ and parameterised the singular values. They used geodesic gradient descent to maintain the orthogonality of $U$ and $V$. The singular values were restricted to between $ 1 ^+_ m$ where $m$ is a margin by parameterisation with a sigmoid. ### Experiment Summary They experiment on a number of synthetic and real datasets with elman and factorial RNNs. The synthetic tasks are the hochrieter memory and addition tasks, classifying sequential mnist and the real task is a language modelling task using the Penn Tree Bank dataset. In the synthetic memory tasks they found that pure othogonalisation hindered performance but that a small margin away from orthogonality was better than no orthogonal constraint. On the mnist tasks, the LSTM outperformed all other versions of simple RNNs but the simple RNNs with orthogonal constraints did quite well. Orthonal initialisation outperformed identity and glorot initialisation. On the language modelling task with short sentences, orthogonal constraints were detrimental but on the longer sentences beneficial. They did not compare to LSTM. They also found that orthogonal initialisation was almost as good as orthogonal constraints during learning suggesting that preserving gradient flow is most important early in learning. ### My Two Cents This paper explores an interesting approach to learning long term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacities of an LSTM so for the time being it doesn't seem that there are any major practical take aways.
Your comment:
