Advances in Optimizing Recurrent Networks
Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu
arXiv e-Print archive - 2012 via Local arXiv
Keywords:
cs.LG
First published: 2012/12/04
Abstract: After a more than decade-long period of relatively little research activity
in the area of recurrent neural networks, several new developments will be
reviewed here that have allowed substantial progress both in understanding and
in technical solutions towards more efficient training of recurrent networks.
These advances have been motivated by and related to the optimization issues
surrounding deep learning. Although recurrent networks are extremely powerful
in what they can in principle represent in terms of modelling sequences, their
training is plagued by two aspects of the same issue regarding the learning of
long-term dependencies. Experiments reported here evaluate the use of clipping
gradients, spanning longer time ranges with leaky integration, advanced
momentum techniques, using more powerful output probability models, and
encouraging sparser gradients to help symmetry breaking and credit assignment.
The experiments are performed on text and music data and show off the combined
effects of these techniques in generally improving both training and test
error.
#### Introduction
* Recurrent Neural Networks (RNNs) are very powerful at modelling sequences, but they are difficult to train to capture long-term dependencies.
* The paper discusses the reasons behind this difficulty and evaluates several techniques to mitigate it.
* [Link to the paper.](https://arxiv.org/abs/1212.0901)
#### Optimization Difficulty
* RNNs form a deterministic state variable $h_t$ as a function of the current input observation and the previous state.
* Learnable parameters decide what is remembered about the past sequence.
* Local optimisation techniques like Stochastic Gradient Descent (SGD) are unlikely to find optimal values of these tunable parameters.
* When the computations performed by an RNN are unfolded through time, a deep neural network with shared weights is realised.
* The cost function of this deep network depends on the outputs of its hidden layers.
* The gradients propagated back through the unfolded layers, and hence the gradient descent updates, can "explode" (become very large) or "vanish" (become very small).
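
The exploding and vanishing behaviour can be illustrated with a minimal NumPy sketch. It assumes a purely linear recurrence (no non-linearity) whose weight matrix is a scaled orthogonal matrix, so the back-propagated gradient norm shrinks or grows geometrically with the number of unfolded steps. The function name `backprop_norms` and the scales 0.9 and 1.1 are illustrative choices, not taken from the paper.

```python
import numpy as np


def backprop_norms(W, steps):
    """Norm of a gradient vector after each of `steps` backward passes
    through the linear recurrence h_t = W @ h_{t-1} (assumed toy model)."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g  # one step of back-propagation through time
        norms.append(np.linalg.norm(g))
    return norms


rng = np.random.default_rng(0)
# Orthogonal matrix: every singular value is 1, so scaling it by s
# multiplies the gradient norm by exactly s at each unfolded step.
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

vanishing = backprop_norms(0.9 * Q, 50)  # norm decays like 0.9 ** t
exploding = backprop_norms(1.1 * Q, 50)  # norm grows like 1.1 ** t
```

With a non-linearity such as tanh the per-step Jacobian also includes the derivative of the activation, which tends to shrink the gradient further.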
#### Training Recurrent Networks
* **Clip Gradient** - when the norm of the gradient vector ($g$) is above a threshold, the update is done in the direction of $g$ with its norm rescaled to the threshold, i.e. using $\text{threshold} \cdot g/||g||$. This normalisation implements a simple form of second-order optimisation (the second-order derivative will also be large in regions of exploding gradient).
* Use a **leaky integration** state-to-state map:
$h_{t, i} = \alpha_{i}h_{t-1, i} + (1-\alpha _{i})F_{i}(h_{t-1}, x_{t})$
Different values of $\alpha_{i}$ allow a different amount of the previous state to "leak" through the unfolded layers to later time steps. This only stretches the time-scale over which gradients vanish; it does not remove the problem entirely.
* Use more powerful **output probability models**, such as a Restricted Boltzmann Machine (RBM) or NADE, to capture higher-order dependencies between the output variables in the case of multivariate prediction.
* By using **rectifier non-linearities**, the gradient on hidden units becomes sparse, and these sparse gradients help the hidden units to specialise. The basic idea is that if the gradient is concentrated in fewer paths (in the unfolded computational graph), the vanishing gradient effect is limited.
* A **simplified Nesterov Momentum** rule is proposed. Nesterov momentum allows past velocities to be stored for a longer time (a larger momentum coefficient) while actually using these velocities more conservatively. The new formulation is also easier to implement.
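
Three of the techniques above (gradient clipping, leaky integration and the simplified Nesterov update) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, $F$ is assumed to be `tanh(W h + U x)`, and the learning rate `lr` and momentum coefficient `mu` are assumed constant across steps. Under that assumption, the Nesterov step is written directly in terms of the look-ahead parameters `theta + mu * v`.

```python
import numpy as np


def clip_gradient(g, threshold):
    """Rescale g to have norm `threshold` when its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g


def leaky_step(h_prev, x, alpha, W, U):
    """Leaky integration: h_t = alpha * h_{t-1} + (1 - alpha) * F(h_{t-1}, x_t).

    F is assumed to be tanh(W h + U x); alpha holds one leak rate per hidden unit.
    """
    candidate = np.tanh(W @ h_prev + U @ x)
    return alpha * h_prev + (1.0 - alpha) * candidate


def nesterov_step(theta, v, grad_fn, lr, mu):
    """One simplified Nesterov momentum step with constant lr and mu (assumed).

    `theta` plays the role of the look-ahead parameters, so the gradient is
    evaluated at `theta` itself and no separate look-ahead copy is needed.
    """
    g = grad_fn(theta)
    v_new = mu * v - lr * g
    theta_new = theta + mu * v_new - lr * g
    return theta_new, v_new


# Usage: a gradient of norm 5 is rescaled to norm 1, keeping its direction.
clipped = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)
```

In `leaky_step`, units with `alpha` close to 1 change slowly and so carry information over longer time spans, while units with `alpha` close to 0 behave like an ordinary recurrent unit.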
#### Results
* On the text and music datasets, SGD combined with these techniques generally outperforms vanilla SGD in both training and test error.