# Very Short

The authors first describe how spiking neurons could be used to represent a multimodal probability distribution that evolves through time, and then use this idea to design a variant of a Restricted Boltzmann Machine which can learn a temporal sequence.

# Short

## 1. Population codes and energy landscapes

Imagine that a neural network perceives an object which has some *instantiation parameters* (e.g. position, velocity, size, orientation), and we want to infer the values of these parameters through some kind of noisy observation process (for instance, the object being presented as an image). Because the observation process is noisy, we can never infer the parameters exactly; instead we must infer distributions over them.

One way to interpret the state of a neural network as a probability distribution is to imagine that each neuron corresponds to a certain probability distribution in the space of instantiation parameters, and the activations of the neurons represent *unnormalized mixing proportions*, i.e. the extent to which each neuron is right about the instantiation parameters. The distribution represented by the network is then a weighted sum of the distributions represented by the individual neurons. This is called a *disjunctive* representation. The distribution of the network can never be sharper than the distributions of the individual neurons.

Another option is instead to do a weighted addition of the neurons' *energy landscapes* (i.e. negative log probability distributions), where each neuron's activation is the weight on its landscape. This is called a *conjunctive* representation. When neuron distributions are combined this way, the distribution of the network can be *sharper* than the distribution of any individual neuron.

https://i.imgur.com/u6W9kBU.png

## 2. Representing the coefficients on the basis functions

A biological spiking neuron outputs sparse binary signals. However, these spikes are convolved with a temporal kernel when they cross synapses, so that other neurons see them as smoothly varying *postsynaptic potentials*. If we take these postsynaptic potentials to be the weights on each neuron's contribution to the energy landscape, we can see how a spiking neural network could define a probability distribution that varies smoothly in time.

https://i.imgur.com/pTWOCPx.png

**Left: Blue vertical lines are spikes. Red lines are postsynaptic potentials. Blue horizontal lines are contour lines of the distribution defined by adding both neurons' contributions to the energy landscape. Right: The effect of the two neurons' spikes on the probability distribution. The effect of a spike looks like an hourglass, with a "neck" at the point of maximum activation of the postsynaptic potential.**

## 3. A learning algorithm for restricted Boltzmann machines

Restricted Boltzmann Machines are an example of neural networks that use a *conjunctive* representation. They consist of a layer of visible units connected to a layer of hidden units through symmetric connections. They learn by *contrastive divergence*, which involves taking an input (a vector of visible activations $s_i^0$) and projecting it to the hidden layer to get hidden activations $s_j^0$, where $\Pr(s_j=1) = \sigma\left(\sum_i s_i w_{ij} \right)$ and $\sigma(x) = (1+e^{-x})^{-1}$ is the logistic sigmoid. The hidden activations then similarly reconstruct visible activations $s_i^1$, and those are used to create new hidden activations $s_j^1$.

Their learning rule is:

$\Delta w_{ij} = \epsilon \left( \langle s_i s_j \rangle^0 - \langle s_i s_j \rangle^1 \right)$

where:

- $\Delta w_{ij}$ is the change in the symmetric weight connecting visible unit $i$ to hidden unit $j$,
- $\epsilon$ is a learning rate,
- $\langle s_i s_j \rangle^0$ and $\langle s_i s_j \rangle^1$ are the averages (over samples) of the product of the visible activation $s_i$ and hidden activation $s_j$ on the first and second pass, respectively.
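To make the rule above concrete, here is a minimal NumPy sketch of one CD-1 update for a binary RBM. This is only an illustration, not the authors' code: the toy binary data, the omission of bias terms, and names such as `cd1_update` are my own assumptions.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, w, epsilon, rng):
    """One CD-1 step for a binary RBM (biases omitted for brevity).

    v0 : (batch, n_visible) binary data vectors s_i^0
    w  : (n_visible, n_hidden) symmetric weights w_ij
    Returns the updated weight matrix.
    """
    # First pass: sample hidden activations s_j^0 given the data.
    p_h0 = sigmoid(v0 @ w)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Reconstruction: sample visible activations s_i^1 from the hiddens.
    p_v1 = sigmoid(h0 @ w.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)

    # Second pass: sample new hidden activations s_j^1 from the reconstruction.
    # (In practice the probabilities p_h1 are often used here to reduce variance.)
    p_h1 = sigmoid(v1 @ w)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)

    # Delta w_ij = epsilon * ( <s_i s_j>^0 - <s_i s_j>^1 ), averaged over the batch.
    positive = v0.T @ h0 / len(v0)
    negative = v1.T @ h1 / len(v0)
    return w + epsilon * (positive - negative)

# Tiny usage example with random binary "data".
rng = np.random.default_rng(0)
w = 0.01 * rng.standard_normal((6, 4))            # 6 visible, 4 hidden units
data = (rng.random((32, 6)) < 0.5).astype(float)
for _ in range(100):
    w = cd1_update(data, w, epsilon=0.1, rng=rng)
```

The same structure carries over to the temporal variant in section 4, where each activation is first filtered through the kernel $r(\tau)$.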
## 4. Restricted Boltzmann Machines through time

Finally, the authors propose a restricted Boltzmann machine through time, where an RBM is replicated across time. The new weight matrix is defined as $w_{ij\tau} = w_{ij}\, r(\tau)$, where $r(\tau)$ is a fixed, causal, temporal kernel:

https://i.imgur.com/mLmfAhp.png

The authors also allow hidden-to-hidden and visible-to-visible connections. These act as a predictive prior over activations. To overcome the problem that introducing hidden-to-hidden connections makes sampling hidden states intractable, they use an approximation in which the sampled values of past hidden activations are treated like data, and inference is not done over them. The forward pass thus becomes:

$\Pr(s_j(t)=1) = \sigma \left( \sum_i w_{ij} \sum_{\tau=0}^\infty s_i(t-\tau)\, r(\tau) + \sum_k w_{kj} \sum_{\tau=0}^\infty s_k(t-\tau)\, r(\tau) \right)$

where the $w_{ij}$ index visible-to-hidden connections and the $w_{kj}$ index hidden-to-hidden connections. The hidden-to-visible pass is computed similarly (a sketch of this filtered forward pass appears after the notes below). The contrastive-divergence update rule then becomes:

$\Delta w_{ij} = \epsilon \sum_{t=1}^{\infty} \sum_{\tau=0}^\infty r(\tau) \left( \langle s_j(t)\, s_i(t-\tau) \rangle^0 - \langle s_j(t)\, s_i(t-\tau) \rangle^1 \right)$

## 5. Results

The authors demonstrate that they can use this to train a network to learn a temporal model on a toy dataset consisting of an image of a ball that travels repeatedly along a circular path.

---

## Notes:

A description of Spiking Restricted Boltzmann Machines in the context of other variants on Boltzmann Machines can be found in [Boltzmann Machines for Time Series](https://arxiv.org/pdf/1708.06004.pdf) (Sept 2017) by Takayuki Osogami.
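As referenced in section 4, here is a minimal NumPy sketch of the temporally filtered forward pass $\Pr(s_j(t)=1)$. It is only an illustration under assumptions of my own: the exponential shape chosen for $r(\tau)$, the truncation of the history to $T$ steps, and names such as `temporal_hidden_probs` are not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def causal_kernel(length, decay=0.8):
    """A fixed causal kernel r(tau); an exponential decay is assumed here."""
    return decay ** np.arange(length)

def filtered(history, r):
    """Compute sum_tau s(t - tau) * r(tau) over a truncated history.

    history : (T, n_units) array, where history[tau] holds the activations
              at time t - tau (history[0] is the current time step).
    """
    return (r[: len(history), None] * history).sum(axis=0)

def temporal_hidden_probs(v_history, h_history, w_vh, w_hh, r):
    """Pr(s_j(t) = 1) for each hidden unit.

    Past hidden samples are treated as data, per the approximation above.
    v_history : (T, n_visible) visible activations back in time
    h_history : (T, n_hidden) previously sampled hidden activations
    w_vh      : (n_visible, n_hidden) visible-to-hidden weights w_ij
    w_hh      : (n_hidden, n_hidden) hidden-to-hidden weights w_kj
    """
    total_input = filtered(v_history, r) @ w_vh + filtered(h_history, r) @ w_hh
    return sigmoid(total_input)

# Tiny usage example with random spike histories.
rng = np.random.default_rng(0)
T, n_v, n_h = 10, 6, 4
r = causal_kernel(T)
v_hist = (rng.random((T, n_v)) < 0.5).astype(float)
h_hist = (rng.random((T, n_h)) < 0.5).astype(float)
w_vh = 0.01 * rng.standard_normal((n_v, n_h))
w_hh = 0.01 * rng.standard_normal((n_h, n_h))
print(temporal_hidden_probs(v_hist, h_hist, w_vh, w_hh, r))
```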