Summary by Peter O'Connor
# Really Short
A method for training a recurrent network that avoids doing backpropagation through time by instead approximately forward-propagating the derivatives of the recurrent-state variables with respect to the parameters.
# Short
This paper deals with learning in the **Online** setting, where we have an infinite sequence of points $<(x_t, y_t): t \in \mathbb N>$ and at each time $t$ we would like to predict $y_t$ given $x_1, ... x_t$. We'd like to do this by training a **Recurrent** model, $o_t , s_t := f_\theta(x_t, s_{t-1})$, and to **Optimize** its parameters $\theta$ so as to minimize the error of the next prediction: $\mathcal L_t := \ell(o_t, y_t)$.
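To make the setting concrete, here is a minimal Python sketch of the online loop; `model.step`, `loss_fn`, and `update_fn` are hypothetical names, and how `update_fn` obtains $\frac{\partial \mathcal L_t}{\partial \theta}$ without waiting for the whole sequence is exactly the question the rest of this summary is about.

```python
# A minimal sketch of the online recurrent learning setting (hypothetical interface).
def online_loop(model, data_stream, s0, loss_fn, update_fn):
    """At each step: predict y_t from x_1..x_t, observe the loss, update theta."""
    s = s0
    for x_t, y_t in data_stream:
        o_t, s = model.step(x_t, s)      # o_t, s_t = f_theta(x_t, s_{t-1})
        L_t = loss_fn(o_t, y_t)          # L_t = l(o_t, y_t)
        update_fn(model, L_t)            # how to get dL_t/dtheta online is the open question
```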
The standard way to do this is Truncated Backpropagation Through Time (TBPTT). Every $T$ steps, we "unroll" the network over the last $T$ steps (the truncation window) and update $\theta$ to minimize $\sum_{\tau=t-T+1}^t \mathcal L_\tau$, given the last $T$ data points $< (x_{\tau}, y_\tau) : \tau \in [t-T+1 .. t]>$ and the previous recurrent state $s_{t-T}$. This has the disadvantage of having to store $T$ intermediate hidden states and perform $2T$ sequential operations in the forward/backward pass. Moreover, it gives a biased gradient estimate, because it ignores the effect of $\theta$ on $s_{t-T}$.
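As a rough illustration (not the paper's code), here is a PyTorch sketch of this TBPTT loop; the cell sizes, the `data_stream()` generator, and the squared-error loss are all assumptions made for the example.

```python
# Hedged sketch of truncated BPTT: unroll T steps, backprop, detach the state.
import torch
import torch.nn as nn

T = 10                                          # truncation window (assumed)
cell = nn.RNNCell(input_size=4, hidden_size=8)
readout = nn.Linear(8, 1)
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=1e-2)

s = torch.zeros(1, 8)                           # recurrent state
losses = []
for t, (x_t, y_t) in enumerate(data_stream()):  # assumed generator of (1,4) / (1,1) tensors
    s = cell(x_t, s)                            # s_t = f_theta(x_t, s_{t-1})
    o_t = readout(s)
    losses.append((o_t - y_t).pow(2).mean())    # L_t
    if (t + 1) % T == 0:
        opt.zero_grad()
        sum(losses).backward()                  # backprop through the last T steps only
        opt.step()
        s = s.detach()                          # cut the graph: effect of theta on s_{t-T} is ignored
        losses = []
```

Keeping the graph for the last $T$ steps is what costs the $T$ intermediate states, and the `detach()` is what makes the gradient biased.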
Another option, which is usually even more expensive, is Real-Time Recurrent Learning (RTRL). RTRL is the application of [forward mode automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation#The_chain_rule,_forward_and_reverse_accumulation) to recurrent networks (Backpropagation Through Time is reverse-mode automatic differentiation). Instead of first computing the loss and then backpropagating the gradient, we forward-propagate the Jacobian defining the derivative of the state with respect to the parameters: $\frac{\partial s_t}{\partial \theta} = \frac{\partial s_t}{\partial s_{t-1}} \frac{\partial s_{t-1}}{\partial \theta} + \frac{\partial s_t}{\partial \theta}|_{s_{t-1}} \in \mathbb R^{|S| \times |\Theta|}$ (where the second term is "the derivative with $s_{t-1}$ held constant"), and update $\theta$ using the gradient $\frac{\partial \mathcal L_t}{\partial \theta} = \frac{\partial \mathcal L_t}{\partial s_{t-1}} \frac{\partial s_{t-1}}{\partial \theta} + \frac{\partial \mathcal L_t}{\partial \theta}|_{s_{t-1}}$. This is very expensive because it involves computing, storing, and multiplying the $|S| \times |\Theta|$ Jacobian matrix at every step.
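As a rough illustration of what RTRL actually propagates, here is a numpy sketch for a vanilla RNN $s_t = \tanh(W s_{t-1} + U x_t)$, with a toy loss placed directly on the state as a stand-in for $\ell(o_t, y_t)$; the names, shapes, and `data_stream()` generator are assumptions, not the paper's code. The point to notice is the full $|S| \times |\Theta|$ Jacobian `J` that must be stored and multiplied at every step.

```python
# Hedged numpy sketch of RTRL for s_t = tanh(W s_{t-1} + U x_t).
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                              # state and input sizes (assumed)
W = rng.normal(scale=0.1, size=(n, n))
U = rng.normal(scale=0.1, size=(n, m))
P = n * n + n * m                        # |Theta|, with theta = [vec(W); vec(U)] (row-major)
lr = 1e-2

s = np.zeros(n)                          # s_{t-1}
J = np.zeros((n, P))                     # d s_{t-1} / d theta -- the expensive |S| x |Theta| object

for x_t, y_t in data_stream():           # assumed generator of x_t in R^m, y_t in R^n
    s_new = np.tanh(W @ s + U @ x_t)
    D = np.diag(1.0 - s_new ** 2)        # d s_t / d a_t, for a_t = W s_{t-1} + U x_t

    # Immediate Jacobian d a_t / d theta (with s_{t-1} held constant)
    M = np.hstack([np.kron(np.eye(n), s[None, :]),
                   np.kron(np.eye(n), x_t[None, :])])       # shape (n, P)

    # Forward-propagate the Jacobian: d s_t/d theta = (d s_t/d s_{t-1}) J + immediate term
    J = D @ (W @ J + M)

    # Online gradient step with toy loss L_t = 0.5 * ||s_t - y_t||^2
    grad = (s_new - y_t) @ J             # d L_t / d theta, shape (P,)
    W -= lr * grad[:n * n].reshape(n, n)
    U -= lr * grad[n * n:].reshape(n, m)
    s = s_new
```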
This paper proposes doing an approximate form of RTRL where the state-derivative is stochastically approximated as a Rank-1 matrix: $\frac{\partial s_t}{\partial \theta} \approx \tilde s_t \otimes \tilde \theta_t$, with $\tilde s_t \in \mathbb R^{|S|}, \tilde \theta_t \in \mathbb R^{|\Theta|}$, where $\otimes$ denotes the outer product. The approximation uses the "Rank-1 Trick"\*.
They show that this approximation is **Unbiased**\*\* (i.e. $\mathbb E[\tilde s_t \otimes \tilde \theta_t] = \frac{\partial s_t}{\partial \theta}$), and that using it we can do much more computationally efficient updates than RTRL, without the bias or the backtracking over the last $T$ steps that TBPTT requires.
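To make the update concrete, here is a hedged numpy sketch of one UORO-style step for the same toy RNN as in the RTRL sketch above, applying the rank-1 trick from the footnote below twice per step; the paper's variance-reduction coefficients $\rho_k$ are omitted, and the names and toy loss are assumptions rather than the paper's code. Note that nothing of size $|S| \times |\Theta|$ is ever formed: only the two vectors $\tilde s_t$ and $\tilde \theta_t$.

```python
# Hedged numpy sketch of a UORO-style update for s_t = tanh(W s_{t-1} + U x_t),
# keeping only the rank-1 factors s_tilde, theta_tilde of d s_t / d theta.
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4
W = rng.normal(scale=0.1, size=(n, n))
U = rng.normal(scale=0.1, size=(n, m))
lr = 1e-2

s = np.zeros(n)                          # s_{t-1}
s_tilde = np.zeros(n)                    # d s/d theta ~ s_tilde (outer) theta_tilde
theta_tilde = np.zeros(n * n + n * m)

for x_t, y_t in data_stream():           # assumed generator of x_t in R^m, y_t in R^n
    s_new = np.tanh(W @ s + U @ x_t)
    d = 1.0 - s_new ** 2                 # tanh'(a_t)

    # 1st rank-1 trick: immediate Jacobian dF/dtheta ~ nu (outer) (nu^T dF/dtheta).
    # The vector-Jacobian product nu^T dF/dtheta is cheap thanks to the RNN structure.
    nu = rng.choice([-1.0, 1.0], size=n)
    g = nu * d
    theta_prime = np.concatenate([np.outer(g, s).ravel(), np.outer(g, x_t).ravel()])

    # 2nd rank-1 trick: compress (dF/ds_{t-1})(s_tilde (outer) theta_tilde) + nu (outer) theta_prime
    # back into a single outer product with fresh random signs.
    mu1, mu2 = rng.choice([-1.0, 1.0], size=2)
    s_tilde = mu1 * (d * (W @ s_tilde)) + mu2 * nu
    theta_tilde = mu1 * theta_tilde + mu2 * theta_prime

    # Gradient estimate with toy loss L_t = 0.5 * ||s_t - y_t||^2:
    # dL/dtheta ~ (dL/ds_t . s_tilde) * theta_tilde
    grad = ((s_new - y_t) @ s_tilde) * theta_tilde
    W -= lr * grad[:n * n].reshape(n, n)
    U -= lr * grad[n * n:].reshape(n, m)
    s = s_new
```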
They demonstrate this on some toy datasets: they construct a situation where TBPTT fails (due to the bias of its gradient) while UORO converges, and show that on other tasks UORO achieves performance comparable to TBPTT.
----
\* The "Rank-1 Trick", is that if matrix $A \in \mathbb R ^{M\times N}$ can be decomposed as $A = \sum_k^K v_k \otimes w_k: v_k \in \mathbb R^M, w_k \in \mathbb R^N$, then we can define $\tilde A :=\left( \sum_k^K \nu_k v_k \right) \otimes \left( \sum_k^K \nu_k w_k \right) \approx A$, with $\nu_k = \{1 \text{ with } p=\frac12 \text{ otherwise } -1\}$, and show that $\mathbb E[\tilde A] = A$. This trick is applied twice to approximate the RTRL updates: First to approximate $s'\otimes \theta' :\approx \frac{\partial s_t}{\partial \theta}|_{s_{t-1}}$, then $\tilde s_t \otimes \tilde \theta_t :\approx \frac{\partial s_t}{\partial s_{t-1}} (\tilde s_{t-1} \otimes \tilde \theta_{t-1}) + s' \otimes \theta'$. (Note that in the paper they add the additional terms $\rho_k$ to this equation to reduce the variance of this estimator).
** The "unbiasedness" is a bit of a lie, because it is only true if $\theta$ is not changing over time, which it is during training. This is the case for RTRL in general, and not just this work.