Early Inference in Energy-Based Models Approximates Back-Propagation

Yoshua Bengio and Asja Fischer

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

**First published:** 2015/10/09 (7 years ago)

**Abstract:** We show that Langevin MCMC inference in an energy-based model with latent
variables has the property that the early steps of inference, starting from a
stationary point, correspond to propagating error gradients into internal
layers, similarly to back-propagation. The error that is back-propagated is
with respect to visible units that have received an outside driving force
pushing them away from the stationary point. Back-propagated error gradients
correspond to temporal derivatives of the activation of hidden units. This
observation could be an element of a theory for explaining how brains perform
credit assignment in deep hierarchies as efficiently as back-propagation does.
In this theory, the continuous-valued latent variables correspond to averaged
voltage potential (across time, spikes, and possibly neurons in the same
minicolumn), and neural computation corresponds to approximate inference and
error back-propagation at the same time.
more
less

Yoshua Bengio and Asja Fischer

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

[link]
# Very Short The authors define a neural network as a nonlinear dynamical system whose fixed points correspond to the minima of some **energy function**. They then show that if one were to start at a fixed-point and *perturb* the output units in the direction that minimizes a loss, the initial perturbation that would flow back through the network would be proportional to the gradient of the neural activations with respect to this loss. Thus, the initial propagation of those propagations (i.e. **early inference**) **approximates** the **backpropagated** gradients of the loss. |

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Lotter, William and Kreiman, Gabriel and Cox, David D.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

Lotter, William and Kreiman, Gabriel and Cox, David D.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

[link]
# Very Short The authors propose a deep, recurrent, convolutional architecture called PredNet, inspired by the idea of predictive coding from neuroscience. In PredNet, first layer attempts to predict the input frame, based on past frames and input from higher layers. The next layer then attempts to predict the *prediction error* of the first layer, and so on. The authors show that such an architecture can predict future frames of video, and predict the parameters synthetically-generated video, better than a conventional recurrent autoencoder. # Short ## The Model PredNet has the following architecture: https://i.imgur.com/7vOcGwI.png Where the R blocks are Recurrent Neural Networks, and the A blocks are Convolutional Layers. $E_l$ indictes the prediction error at layer $l$. The network is trained from snippets of video, and the loss is given as: $L_{train} = \sum_{t=1}^T \sum_l \frac{\lambda_l}{n_l} \sum_i^{2n_l} [E_l^t]_i$ Where $t$ indexes the time step, $l$ indexes the layer, $n_l$ is the number of units in the layer, $E_l^t = [ReLU(A_l^t-\hat A_l^t) ; ReLU(\hat A_l^t - A_l^t) ]$ is the concatenation of the negative and positive components of the error, $\lambda_l$ is a hyperparameter determining the effect that layer $l$ error should have on the loss. In the experiments, they use two settings for the $\lambda_l$ hyperparameters. In the "$L_0$" setting, they set $\lambda_0=1, \lambda_{>0}=0$, which ends up being optimal when trying to optimize next-frame L1 error. In the "$L_{all}$" setting, they use $\lambda_0=1, \lambda_{>0}=0.1$, which, in the synthetic-images experiment, seems to be better at predicting the parameters of the synthetic-image generator. ## Results They apply the model on two tasks: 1) Predicting the future frames of a synthetic video generated by a graphics engine. Here they predict both the next frame (in which their $L_0$ model does best), and the parameters (face characteristics, rotation, angle) of the program that generates the synthetic faces, (on which their $L_{all}$ model does best). They predict face generating parameters by first training the model, and then freezing weights and regressing from the learned representations at a given layer to the parameters. They show that both the $L_0$ and $L_{all}$ models outperform a more conventional recurrent autoencoder. https://i.imgur.com/S8PpJnf.png **Next-frame predictions on a sequence of faces (note: here, predictions are *not* fed back into the model to generate the next frame)** 2) Predicting future frames of video from dashboard cameras. https://i.imgur.com/Zus34Vm.png **Next-frame predictions of dashboard-camera images** The authors conclude that allowing higher layers to model *prediction errors*, instead of *abstract representations* can lead to better modeling of video. |

Spiking Boltzmann Machines

Hinton, Geoffrey E. and Brown, Andrew D.

Neural Information Processing Systems Conference - 1999 via Local Bibsonomy

Keywords: dblp

Hinton, Geoffrey E. and Brown, Andrew D.

Neural Information Processing Systems Conference - 1999 via Local Bibsonomy

Keywords: dblp

[link]
# Very Short The authors first describe how spiking neurons could be used to represent a multimodal probability distribution that evolves through time, and then use this idea to design a variant on a Restricted Boltzmann machine which can learn a temporal sequence. # Short ## 1. Population codes and energy landscapes Imagine that a neural network perceives an object, which has some *instantiation parameters* (e.g. position, velocity, size, orientation), and we want to infer the values of these *instantiation parameters* through some kind of noisy observation process (for instance, the object being represented as an image). Due to the noisiness of the observation process, we will never be able to exactly infer the parameters, but need to infer distributions over these parameters. One way to interpret the state of a neural network as a probability distribution is to imagine that each neuron corresponds to a certain probability distribution in the space of instantiation parameters, and the activations of the neurons represent *unnormalized mixing proportions* i.e. the extent to which each neuron is right about the instantiation parameters. The distribution represented by the network is then a weighted sum of the distributions represented by each neuron. This is called a *disjunctive* representation. The distribution of the network can never be as sharp as the distribution of each neuron. Another option is to instead to do a weighted addition of neurons' *Energy landscapes* (i.e. negative log probability distributions), where the neuron's activation represents this weight. This is called a *conjunctive* representation. When neuron distributions are combined like this, the distribution of the network can be *sharper* than the distributions for each neuron. https://i.imgur.com/u6W9kBU.png ## 2 Representing the coefficients on the basis functions A biological spiking neuron outputs sparse binary signals. However, these spikes are convolved by a temporal kernel when they cross synapses so that other neurons see them as smoothly-varying *postsynaptic potentials*. If we consider these postsynaptic potentials to be the weights on the neuron's contribution the the energy landscape, we can see how a spiking neural network could define a probability distribution that varies smoothly in time. https://i.imgur.com/pTWOCPx.png **Left: Blue vertical lines are spikes. Red lines are postsynaptic potentials. Blue horizontal lines are contour lines of the distribution defined by adding both neurons contributions to the energy landscape. Right: The effects of spikes on two neurons on the probability distribution. The effects of a spike can be seen as an hourglass, with a "neck" at the point of maximum activation of the postsynaptic potential.** ## 3. A learning algorithm for restricted Boltzmann machines Restricted Boltzmann Machines are an example of neural networks that use a *conjunctive* representation. They consist of a layer of visible units connected to a layer of hidden units through symmetric connections. They learn by *contrastive divergence*, which involves taking an input (a vector of visible activations $s_i^0$), projecting it to the hidden layer to get hidden layer activations $s_j^0$, where $\Pr(s_j=1) = \sigma\left(\sum_i s_i w_{ij} \right)$, where $\sigma = (1+e^{-x})^{-1}$ is the logistic sigmoid. Hidden activations then similarily reconstruct visible activations $s_i^1$, and those are used to create new hidden activations $s_j^1$. Their learning rule is: $\Delta w_{ij} = \epsilon ( < s_i s_j >^0 - < s_i s_j >^1)$ Where : $\Delta w_{ij}$ is the change in the symmetric weight connecting visible unit $i$ to hidden unit $j$ $\epsilon$ is a learning rate. $< s_i s_j >^0$ and $< s_i s_j >^1$ are the averages (over samples) of the product of the visible activation $s_i$ and hidden activation $s_j$ on the first and second pass, respectively. ## 4. Restricted Boltzmann Machines through time Finally, the authors propose a restricted Boltzmann machine through time, where an RBM is replicated across time. The new weight matrix is defined as $w_{ij\tau} = w_{ij} r(\tau)$. Where $r(\tau)$ is a fixed, causal, temporal kernel: https://i.imgur.com/mLmfAhp.png The authors also allow hidden-to-hidden and visible-to-visible connections. These act as a predictive prior over activations. To overcome the problem that introducing hidden-to-hidden connections makes sampling hidden states intractable, they use an approximation wherein the sampled values of past hidden activations are treated like data, and inference is not done over them. The forward pass thus becomes: $\Pr(s_j=1) = \sigma \left( \sum_i w_{ij} \sum_{\tau=0}^\infty s_i(t-\tau) r(\tau) + \sum_i w_{kj} \sum_{\tau=0}^\infty s_k(t-\tau) r(\tau) \right)$ Where $w_{ij}$'s index vidible to hidden connections and $w_{kj}$'s index hidden-to-hidden connections. The hidden-to-visible pass is computed similarly. The constrastive-divergence update rule then becomes: $\Delta w_{ij} = \epsilon \sum_{t=1}^{\infty} \sum_{\tau=0}^\infty r(\tau) \left( < s_j(t) s_i(t-\tau)>^0 - < s_j(t) s_i(t-\tau)>^1 \right)$ ## 5. Results: The authors demonstrate that they can use this to train a network to learn a temporal model on a toy dataset consisting of an image of a ball that travels repeatedly along a circular path. --- ## Notes: A description of Spiking Restricted Boltzmann Machines in the context of other variants on Boltzmann Machines can be found in [Boltzmann Machines for Time Series](https://arxiv.org/pdf/1708.06004.pdf) (Sept 2017) by Takayuki Osogami |

Learning to learn by gradient descent by gradient descent

Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.NE, cs.LG

**First published:** 2016/06/14 (6 years ago)

**Abstract:** The move from hand-designed features to learned features in machine learning
has been wildly successful. In spite of this, optimization algorithms are still
designed by hand. In this paper we show how the design of an optimization
algorithm can be cast as a learning problem, allowing the algorithm to learn to
exploit structure in the problems of interest in an automatic way. Our learned
algorithms, implemented by LSTMs, outperform generic, hand-designed competitors
on the tasks for which they are trained, and also generalize well to new tasks
with similar structure. We demonstrate this on a number of tasks, including
simple convex problems, training neural networks, and styling images with
neural art.
more
less

Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.NE, cs.LG

[link]
# Very Short The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**. # Short Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$, (the *optimizee*) which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent - where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here). The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc) replace the multiplication of the gradient by $-\alpha_t$ with some sort of weighted sum of the history of gradients. This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: $$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$ Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. A set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer networks's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing: $$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$ Where $\theta^*(f, \phi)$ are the final optimizee parameters. In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated). https://i.imgur.com/HMaCeip.png **The computational graph of the optimization of the optimizer, unrolled across 3 time-steps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.** The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG), on a synthetic dataset, MNIST, CIFAR-10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network. |

Unbiased Online Recurrent Optimization

Tallec, Corentin and Ollivier, Yann

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Tallec, Corentin and Ollivier, Yann

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
# Really Short A method for training a recurrent network that avoids doing backpropagation through time by instead approximately forward-propagating the derivatives of the recurrent-state variables with respect to the parameters. # Short This paper deals with learning **Online** setting, where we have an infinite sequence of points $<(x_t, y_t): t \in \mathbb N>$, where at each time $t$ we would like to predict $y_t$ given $x_1, ... x_t$. We'd like do this by training a **Recurrent** model: $o_t , s_t := f_\theta(x_t, s_{t-1})$ to **Optimize** parameters $\theta$ such that we minimize the error of the next prediction: $\mathcal L_t := \ell(o_t, y_t)$ The standard way to do this is is Truncated Backpropagation through Time (TBPTT). We "unroll" the network for T steps (the truncation window), every T steps, and update $\theta$ to minimize $\sum_{\tau=t-T+1}^t \mathcal L_\tau$ given the last T data points: $< (x_{\tau}, y_\tau) : \tau \in [t-T+1 .. t]>$ and the previous recurrent state $s_{t-T}$. This has the disadvantage of having to store T intermediate hidden states and do 2T sequential operations in the forward/backward pass. Moreover it gives a biased gradient estimate because it ignores the effect of $\theta$ on $s_{t-T}$. Another option which usually is even more expensive is Real-Time Recurrent Learning (RTRL). RTRL is the application of [forward mode automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation#The_chain_rule,_forward_and_reverse_accumulation) to recurrent networks (Backpropagation Through Time is reverse-mode automatic differentiation). Instead of first computing the loss and then backpropagating the gradient, we forward-propagate the jacobian defining the derivitive of the state with respect to the parameters: $\frac{\partial s_t}{\partial \theta} = \frac{\partial s_t}{\partial s_{t-1}} \frac{\partial s_{t-1}}{\partial \theta} + \frac{\partial s_t}{\partial \theta}|_{s_{t-1}} \in \mathbb R^{|S| \times |\Theta|}$ (where the second term is "the derivative with $s_{t-1}$ held constant"), and updating $\theta$ using the gradient $\frac{\partial \mathcal L_t}{\partial \theta} = \frac{\partial \mathcal L_t}{\partial s_{t-1}} \frac{\partial s_{t-1}}{\partial \theta} + \frac{\partial \mathcal L_t}{\partial \theta}|_{s_{t-1}}$. This is very expensive because it involves computing, storing and multiplying large Jacobian matrices. This paper proposes doing an approximate form of RTRL where the state-derivative is stochastically approximated as a Rank-1 matrix: $\frac{\partial s_t}{\partial \theta} \approx \tilde s_t \otimes \tilde \theta_t: \tilde s_t \in \mathbb R^{|S|}, \tilde \theta_t \in \mathbb R^{|\Theta|}$ where $\otimes$ denotes the outer-product. The approximation uses the "Rank-1 Trick"*****. They show that this approximation is **Unbiased**** (i.e. $\mathbb E[\tilde s_t \otimes \tilde \theta_t] = \frac{\partial s_t}{\partial \theta}$), and that using this approximation we can do much more computationally efficient updates than RTRL, and without the biasedness and backtracking required in TBPTT. They demonstrate this result on some toy datasets. They demonstrate that it's possible to construct a situation where TBPTT fails (due to biasedness of the gradient) and UORO converges, and that for other tasks they achieve comparable performance to TBPTT. ---- \* The "Rank-1 Trick", is that if matrix $A \in \mathbb R ^{M\times N}$ can be decomposed as $A = \sum_k^K v_k \otimes w_k: v_k \in \mathbb R^M, w_k \in \mathbb R^N$, then we can define $\tilde A :=\left( \sum_k^K \nu_k v_k \right) \otimes \left( \sum_k^K \nu_k w_k \right) \approx A$, with $\nu_k = \{1 \text{ with } p=\frac12 \text{ otherwise } -1\}$, and show that $\mathbb E[\tilde A] = A$. This trick is applied twice to approximate the RTRL updates: First to approximate $s'\otimes \theta' :\approx \frac{\partial s_t}{\partial \theta}|_{s_{t-1}}$, then $\tilde s_t \otimes \tilde \theta_t :\approx \frac{\partial s_t}{\partial s_{t-1}} (\tilde s_{t-1} \otimes \tilde \theta_{t-1}) + s' \otimes \theta'$. (Note that in the paper they add the additional terms $\rho_k$ to this equation to reduce the variance of this estimator). ** The "unbiasedness" is a bit of a lie, because it is only true if $\theta$ is not changing over time, which it is during training. This is the case for RTRL in general, and not just this work. |

About