ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Learning End-to-end Video Classification with Rank-Pooling
Fernando, Basura and Gould, Stephen
International Conference on Machine Learning - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an end-to-end trained neural network architecture.

Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters $\bf{u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling.

While previous work has focused on using rank pooling on hand-designed and fixed representations, this paper proposes to use ConvNet features (pre-trained on ImageNet) for the representation and backpropagate through rank pooling to fine-tune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21). Finally, since its gradient requires inverting a matrix (corresponding to a hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rank-one matrices (see Lemma 3) or to simply use an approximation based on using a diagonal hessian.

In experiments on two small scale video activity recognition datasets (UCF-Sports and Hollywood2), the authors show that fine-tuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling.

**My two cents**

This paper was eye opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought provoking approach to pooling representations in a way that preserves some organizational information about the original representations.

I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guaranties on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability.

I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available. 

Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).

2 Comments

arxiv.org
scholar.google.com

Building Machines That Learn and Think Like People
Brenden M. Lake and Tomer D. Ullman and Joshua B. Tenenbaum and Samuel J. Gershman
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.AI, cs.CV, cs.LG, cs.NE, stat.ML
more

[link] Summary by Abhishek Das 8 years ago

This paper performs a comparitive study of recent advances in deep learning with human-like learning from a cognitive science point of view. Since natural intelligence is still the best form of intelligence, the authors list a core set of ingredients required to build machines that reason like humans.

- Cognitive capabilities present from childhood in humans.
- Intuitive physics; for example, a sense of plausibility of object trajectories, affordances.
- Intuitive psychology; for example, goals and beliefs.
- Learning as rapid model-building (and not just pattern recognition).
- Based on compositionality and learning-to-learn.
- Humans learn by inferring a general schema to describe goals, object types and interactions. This enables learning from few examples.
- Humans also learn richer conceptual models.
- Indicator: variety of functions supported by these models: classification, prediction, explanation, communication, action, imagination and composition.
- Models should hence have strong inductive biases and domain knowledge built into them; structural sharing of concepts by compositional reuse of primitives.
- Use of both model-free and model-based learning.
- Model-free, fast selection of actions in simple associative learning and discriminative tasks.
- Model-based learning when a causal model has been built to plan future actions or maximize rewards.
- Selective attention, augmented working memory, and experience replay are low-level promising trends in deep learning inspired from cognitive psychology.
- Need for higher-level aforementioned ingredients.

arxiv.org
arxiv-vanity.com
scholar.google.com

Online Continual Learning with Maximally Interfered Retrieval
Rahaf Aljundi and Lucas Caccia and Eugene Belilovsky and Massimo Caccia and Laurent Charlin and Tinne Tuytelaars
arXiv e-Print archive - 2019 via Local arXiv
Keywords: cs.LG, stat.ML
more

[link] Summary by Massimo Caccia 6 years ago

Disclaimer: I am an author

# Intro

Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner. Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks!  

In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. The method is called Maximally Interfered Retrieval (MIR).

## Cartoon for explanation

https://i.imgur.com/5F3jT36.png

Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy.

# Method

1) incoming data: $(X_t,Y_t)$

2) foreseen parameter update: $\theta^v= \theta-\alpha\nabla\mathcal{L}(f_\theta(X_t),Y_t)$

### applied to ER (ER-MIR)
3) Search for the top-$k$ values $x$ in the stored memories using the criterion $$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x),y) -\mathcal{L}(f_{\theta}(x),y)$$

### or applied to GEN (GEN-MIR)
3)   
$$
     \underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)),Y^*\big) -\mathcal{L}\big(f_{\theta}(g_\gamma(Z)),Y^*\big)
$$
$$
         \text{s.t.}   \quad ||z_i-z_j||_2^2 > \epsilon \forall  z_i,z_j \in Z \,\text{with} \, z_i\neq z_j
$$
i.e. search in the latent space of a generative model $g_\gamma$ for samples that are the most forgotten given the foreseen update.

4) Then add theses memories to incoming data $X_t$ and train $f_\theta$

# Results

### qualitative
https://i.imgur.com/ZRNTWXe.png

Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows) which are similar to 8s and 9s respectively.

### quantitative 

GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases.

ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases.


# Other stuff
### (for avid readers)

We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder dataset like e.g. CIFAR-10.

arxiv.org
arxiv-vanity.com
scholar.google.com

Language GANs Falling Short
Massimo Caccia and Lucas Caccia and William Fedus and Hugo Larochelle and Joelle Pineau and Laurent Charlin
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL, cs.LG
more

[link] Summary by CodyWild 7 years ago

This paper’s high-level goal is to evaluate how well GAN-type structures for generating text are performing, compared to more traditional maximum likelihood methods. In the process, it zooms into the ways that the current set of metrics for comparing text generation fail to give a well-rounded picture of how models are performing.

In the old paradigm, of maximum likelihood estimation, models were both trained and evaluated on a maximizing the likelihood of each word, given the prior words in a sequence. That is, models were good when they assigned high probability to true tokens, conditioned on past tokens. However, GANs work in a fundamentally new framework, in that they aren’t trained to increase the likelihood of the next (ground truth) word in a sequence, but to generate a word that will make a discriminator more likely to see the sentence as realistic. Since GANs don’t directly model the probability of token t, given prior tokens, you can’t evaluate them using this maximum likelihood framework.

This paper surveys a range of prior work that has evaluated GANs and MLE models on two broad categories of metrics, occasionally showing GANs to perform better on one or the other, but not really giving a way to trade off between the two.
- The first type of metric, shorthanded as “quality”, measures how aligned the generated text is with some reference corpus of text: to what extent your generated text seems to “come from the same distribution” as the original. BLEU, a heuristic frequently used in translation, and also leveraged here, measures how frequently certain sets of n-grams occur in the reference text, relative to the generated text. N typically goes up to 4, and so in addition to comparing the distributions of single tokens in the reference and generated, BLEU also compares shared bigrams, trigrams, and quadgrams (?) to measure more precise similarity of text.
- The second metric, shorthanded as “diversity” measures how different generated sentences are from one another. If you want to design a model to generate text, you presumably want it to be able to generate a diverse range of text - in probability terms, you want to fully sample from the distribution, rather than just taking the expected or mean value. Linguistically, this would be show up as a generator that just generates the same sentence over and over again. This sentence can be highly representative of the original text, but lacks diversity. One metric used for this is the same kind of BLEU score, but for each generated sentence against a corpus of prior generated sentences, and, here, the goal is for the overlap to be as low as possible

The trouble with these two metrics is that, in their raw state, they’re pretty incommensurable, and hard to trade off against one another. The authors of this paper try to address this by observing that all models trade off diversity and quality to some extent, just by modifying the entropy of the conditional token distribution they learn. If a distribution is high entropy, that is, if it spreads probability out onto more tokens, it’s likelier to bounce off into a random place, which increases diversity, but also can make the sentence more incoherent. By contrast, if a distribution is too low entropy, only ever putting probability on one or two words, then it will be only ever capable of carving out a small number of distinct paths through word space.
The below table shows a good example of what language generation can look like at high and low levels of entropy
https://i.imgur.com/YWGXDaJ.png

The entropy of a softmax distribution be modified, without changing the underlying model, by changing the *temperature* of the softmax calculation. So, the authors do this, and, as a result, they can chart out that model’s curve on the quality/diversity axis. Conceptually, this is asking “at a range of different quality thresholds, how good is this model’s diversity,” and vice versa. I mentally analogize this to a ROC curve, where it’s not really possible to compare, say, precision of models that use different thresholds, and so you instead need to compare the curve over a range of different thresholds, and compare models on that.

https://i.imgur.com/C3zdEjm.png

When they do this for GANs and MLEs, they find that, while GANs might dominate on a single metric at a time, when you modulate the temperature of MLE models, they’re able to achieve superior quality when you tune them to commensurate levels of diversity.

papers.nips.cc
scholar.google.com

Variational Dropout and the Local Reparameterization Trick
Blum, Avrim and Haghtalab, Nika and Procaccia, Ariel D.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 10 years ago

This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop.

The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the pre-activation (i.e. before the activation fonction) vectors during forward propagation. This is due to the fact that computing the pre-activation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can efficiently then add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for all the minibatch.

The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.

Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of a SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is an hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model.

Experiments on MNIST confirm that tuning that parameter works and allows to reach good performance of various network sizes, compared to using a default dropout rate.

##### My two cents

This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed to make a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that that connection also yielded a method for tuning the dropout rate automatically.

I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.

Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.