PhD at UCL CS under David Barber. Interested in Approximate (Variational) Inference, RNNs and applications to Machine Comprehension.

Training Deep and Recurrent Networks with Hessian-Free Optimization

Martens, James and Sutskever, Ilya

Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy

Keywords: dblp

Martens, James and Sutskever, Ilya

Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy

Keywords: dblp

[link]
## Very Short Summary The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are: * Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian. * Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic. * Use an efficient recursion to calculate the GGN. * Initialise each round of conjugated gradients with the final vector of the previous iteration. * A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold. * Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective. ## Less Short Summary ### Hessian Free Optimisation in General Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get: $M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $ Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks. There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse. To run vanilla hessian free all you need to do at each iteration is: 1) Calculate the gradient vector using standard backprop. 2) Calculate $H\theta$ product using an efficient recursion. 3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$ The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary. ## Take aways Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive. More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit. |

On orthogonality and learning recurrent networks with long term dependencies

Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.NE

**First published:** 2017/01/31 (5 years ago)

**Abstract:** It is well known that it is challenging to train deep neural networks and
recurrent neural networks for tasks that exhibit long term dependencies. The
vanishing or exploding gradient problem is a well known issue associated with
these challenges. One approach to addressing vanishing and exploding gradients
is to use either soft or hard constraints on weight matrices so as to encourage
or enforce orthogonality. Orthogonal matrices preserve gradient norm during
backpropagation and can therefore be a desirable property; however, we find
that hard constraints on orthogonality can negatively affect the speed of
convergence and model performance. This paper explores the issues of
optimization convergence, speed and gradient stability using a variety of
different methods for encouraging or enforcing orthogonality. In particular we
propose a weight matrix factorization and parameterization strategy through
which we can bound matrix norms and therein control the degree of expansivity
induced during backpropagation.
more
less

Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.NE

[link]
## Headline Summary The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the trade-off between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality. They find that a hard orthogonality constraint can be too restrictive and perseveres gradients at the cost of slowing learning. ### Why orthogonal matrices are likely to help with exploding gradients For a neural network with N hidden layers we can write the gradient of the loss wrt to the weight matrix between hidden states $W_i$ as: $ \frac{\partial L}{\partial W_i} = \frac{\partial a_i}{\partial w_i} \prod_j^N (\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i}) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial w_i} \prod_j^N (D_i W_{j+1}) \frac{\partial L}{\partial a_{N+1}}$ where L is the loss function and D is the jacobian of the activation function. It's in this long chain of products that the gradient tends to either vanish or explode. This problem is particularly bad in recurrent nets where the chain is the length of the unrolled computation graph. The norm of the derivative $\|\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i} \|= \|D_i W_{j+1}\| \leq \|D_i \| \| W_{j+1}\| \leq \lambda_D \lambda_W$ where $\lambda_{D/W}$ is the largest singular value of the given matrix. Its clear that if $\lambda$ is less than or greater than 1, that the norm of the gradient will grow or decay exponentially with length of the chain of products given above. Constraining the weight matrices to be orthogonal ensures that the singular values are 1 and so perseveres the gradient norm. ### How to enforce the orthogonality constraints The set of all orthogonal matrices from a continuos manifold known as the stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then to perform gradient optimisation on this manifold. The orthogonality of the matrix can be preserved by a Caley transformation outlined in the paper. In order to allow the weight matrix to deviate slightly from the stiefel manifold, the authors took the SVD of $W = USV^T $ and parameterised the singular values. They used geodesic gradient descent to maintain the orthogonality of $U$ and $V$. The singular values were restricted to between $ 1 ^+_- m$ where $m$ is a margin by parameterisation with a sigmoid. ### Experiment Summary They experiment on a number of synthetic and real data-sets with elman and factorial RNNs. The synthetic tasks are the hochrieter memory and addition tasks, classifying sequential mnist and the real task is a language modelling task using the Penn Tree Bank dataset. In the synthetic memory tasks they found that pure othogonalisation hindered performance but that a small margin away from orthogonality was better than no orthogonal constraint. On the mnist tasks, the LSTM outperformed all other versions of simple RNNs but the simple RNNs with orthogonal constraints did quite well. Orthonal initialisation outperformed identity and glorot initialisation. On the language modelling task with short sentences, orthogonal constraints were detrimental but on the longer sentences beneficial. They did not compare to LSTM. They also found that orthogonal initialisation was almost as good as orthogonal constraints during learning suggesting that preserving gradient flow is most important early in learning. ### My Two Cents This paper explores an interesting approach to learning long term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacities of an LSTM so for the time being it doesn't seem that there are any major practical take aways. |

An Empirical Analysis of Deep Network Loss Surfaces

Im, Daniel Jiwoong and Tao, Michael and Branson, Kristin

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Im, Daniel Jiwoong and Tao, Michael and Branson, Kristin

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

[link]
### Brief Summary: Does what it says on the tin. The authors examine the loss surface of the Network-In-Network and VGG neural networks near the critical points found by 5 different stochastic optimisation algorithms. These algorithms are Adam, Adadelta, SGD, SGD with momentum and a new Runge-Kutta based optimiser. They examine the loss function by interpolating along lines between the initial and final parameter values. They find that the different algorithms find qualitatively genuinely different types of minima. They find that the basin around the minima of ADAM is bigger than those for other optimisation algorithms. They find also that all of the minima are similarly good. |

Generative Temporal Models with Memory

Mevlana Gemici and Chia-Chun Hung and Adam Santoro and Greg Wayne and Shakir Mohamed and Danilo J. Rezende and David Amos and Timothy Lillicrap

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.NE

**First published:** 2017/02/15 (5 years ago)

**Abstract:** We consider the general problem of modeling temporal data with long-range
dependencies, wherein new observations are fully or partially predictable based
on temporally-distant, past observations. A sufficiently powerful temporal
model should separate predictable elements of the sequence from unpredictable
elements, express uncertainty about those unpredictable elements, and rapidly
identify novel elements that may help to predict the future. To create such
models, we introduce Generative Temporal Models augmented with external memory
systems. They are developed within the variational inference framework, which
provides both a practical training methodology and methods to gain insight into
the models' operation. We show, on a range of problems with sparse, long-term
temporal dependencies, that these models store information from early in a
sequence, and reuse this stored information efficiently. This allows them to
perform substantially better than existing models based on well-known recurrent
neural networks, like LSTMs.
more
less

Mevlana Gemici and Chia-Chun Hung and Adam Santoro and Greg Wayne and Shakir Mohamed and Danilo J. Rezende and David Amos and Timothy Lillicrap

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.NE

[link]
#### Very Brief Summary: This paper combines stochastic variational inference with memory-augmented recurrent neural networks. The authors test 4 variants of their models against the Variational Recurrent Neural Network on 7 artificial tasks requiring long term memory. The reported log-likelihood lower bound is not obviously improved by the new models on all tasks but is slightly better on tasks requiring high capacity memory. #### Slightly Less Brief Summary: The authors propose a general class of generative models for time-series data with both deterministic and stochastic latents. The deterministic latents, $h_t$, evolve as a recurrent net with augmented memory and the stochastic latents, $z_t$ are gaussians whose mean and variance are a deterministic function of $h_t$. The observations at each time-step $x_t$ are also gaussians whose mean and variance are parametrised by a function of $h_{<t}, x_{<t}$. #### Generative Temporal Models without Augmented Memory: The family of generative temporal models is fairly broad and includes kalman filters, non-linear dynamical systems, hidden-markov models and switching state-space models. More recent non-linear models such as the variational RNN are most similar to the new models in this paper. In general all of the mentioned temporal models can be written as: $P_\theta(x_{\leq T}, z_{\leq T} ) = \prod_t P_\theta(x_t | f_x(z_{\leq t}, x_{\leq t}))P_\theta(z_t | f_z(z_{\leq t}, x_{\leq t}))$ The differences between models then come from the the exact forms of $f_x$ and $f_z$ with most models making strong conditional independence assumptions and/or having linear dependence. For example in a Gaussian State Space model both $f_x$ and $f_z$ are linear, the latents form a first order Markov chain and the observations $x_t$ are conditionally independent of everything given $z_t$. In the Variational Recurrent Neural Net (VRNN) an additional deterministic latent variable $h_t$ is introduced and at each time-step $x_t$ is the output of a VAE whose prior $z_t$ is conditioned on $h_t$. $h_t$ evolves as an RNN. #### Types of Model with Augmented Memory: This paper follows the same strategy as the VRNN but adds more structure to the underlying recurrent neural net. The authors motivate this by saying that the VRNN "scales poorly when higher capacity storage is required". * "Introspective" Model: In the first augmented memory model, the deterministic latent M_t is simply a concatenation of the last $L$ latent stochastic variables $z_t$. A soft method of attention over the latent memory is used to generate a "memory context" vector at each time step. The observed output $x_t$ is a gaussian with mean and variance parameterised by the "memory context' and the stochastic latent $z_t$. Because this model does not learn to write to memory it is faster to train. * In the later models the memory read and write operations are the same as those in the neural turing machine or differentiable neural computer. #### My Two Cents: In some senses this paper feels fairly inevitable since VAE's have already been married with RNNs and so it's a small leap to add augmented memory. The actual read write operations introduced in the "introspective" model feel a little hacky and unprincipled. The actual images generated are quite impressive. I'd like to see how these kind of models do on language generation tasks and wether they can be adapted for question answering. |

About