ShortScience.org - Making Science Accessible!

arxiv.org
arxiv-vanity.com
scholar.google.com

Perceiver: General Perception with Iterative Attention
Andrew Jaegle and Felix Gimeno and Andrew Brock and Andrew Zisserman and Oriol Vinyals and Joao Carreira
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS

[link] Summary by CodyWild 3 years ago

This new architecture out of DeepMind combines information extraction and bottlenecks with a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed.

Currently, self-attention models are quite powerful and capable, but because attention is quadratic in sequence length in both time and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This paper proposes what the authors call "cross-attention," where a smaller-dimensional latent attends to the input (the latent generates the queries, the input the keys and values). This lets the network pull information out of the larger-dimensional input into a smaller latent whose size is fixed by a hyperparameter. From there, multiple self-attention layers are applied to generate a new latent, which can be fed back into the beginning of the process to query new information from the input, accounting for the "iterative" in the title of this work.
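To make the cost argument concrete, here is a minimal NumPy sketch of the cross-attention-then-self-attention loop described above. It is not the authors' implementation: the sizes `M`, `N`, `C`, `D`, the single attention head, the shared weights across iterations, and the absence of layer norm and MLP blocks are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries_from, keys_values_from, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    q = queries_from @ w_q                     # (n_q, d)
    k = keys_values_from @ w_k                 # (n_kv, d)
    v = keys_values_from @ w_v                 # (n_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_q, n_kv)
    return softmax(scores) @ v                 # (n_q, d)

rng = np.random.default_rng(0)
# Illustrative sizes (assumptions): input length, latent length,
# input channels, latent width.
M, N, C, D = 4096, 64, 3, 32
inputs = rng.normal(size=(M, C))
latent = rng.normal(size=(N, D))

# Cross-attention projections: queries come from the latent,
# keys and values come from the input.
w_q_cross = rng.normal(size=(D, D)) * 0.1
w_k_cross = rng.normal(size=(C, D)) * 0.1
w_v_cross = rng.normal(size=(C, D)) * 0.1
# Self-attention projections: everything comes from the latent.
w_q_self, w_k_self, w_v_self = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

for _ in range(2):  # "iterative": re-query the input with the updated latent
    # Score matrix is N x M, linear in the input length M.
    latent = latent + attention(latent, inputs, w_q_cross, w_k_cross, w_v_cross)
    for _ in range(3):
        # Score matrix is N x N, independent of the input length.
        latent = latent + attention(latent, latent, w_q_self, w_k_self, w_v_self)
```

With these toy sizes the cross-attention score matrix has 64 x 4096 entries, whereas full self-attention over the input would need 4096 x 4096.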

The authors argue this approach lets them take larger inputs, and create deeper models, because the cost of each self-attention layer (going from latent-dim to latent-dim) is small and controlled. Like many other Transformer-based architectures, they use positional encodings, theirs based on Fourier features at different frequencies. 
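As a rough illustration of Fourier-feature positional encodings, the sketch below maps 1-D positions in [-1, 1] to sines and cosines at several frequencies. The linear band spacing, the `max_freq / 2` upper limit, and the choice to concatenate the raw position are assumptions for this example rather than details taken from the summary.

```python
import numpy as np

def fourier_features(positions, num_bands, max_freq):
    # positions: (n,) array of coordinates scaled to [-1, 1].
    # Frequency bands spaced linearly from 1 up to max_freq / 2 (assumed).
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)      # (num_bands,)
    angles = np.pi * positions[:, None] * freqs[None, :]     # (n, num_bands)
    # Concatenate the raw position with its sine and cosine features.
    return np.concatenate(
        [positions[:, None], np.sin(angles), np.cos(angles)], axis=-1
    )

positions = np.linspace(-1.0, 1.0, 224)   # e.g. one axis of a 224-pixel image
features = fourier_features(positions, num_bands=8, max_freq=224)
# Each position becomes a vector of length 1 + 2 * num_bands.
```

For multi-dimensional inputs such as images, the same function would be applied per axis and the results concatenated.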

https://i.imgur.com/Wc8rzII.png

My overall take from the results presented is that it is competitive on many of the audio and vision tasks tested, with none of the convolutional priors that even something like Vision Transformer (which does coarse convolution-style preprocessing before going into Transformer layers) requires, though it didn't dramatically outperform the state-of-the-art on any of the tested tasks. One thing that was strange to me was that they didn't (at least in the main paper, haven't read the appendix) seem to evaluate on text, which would seem like an obvious benchmark if you're proposing a Transformer-alternate architecture.


VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari and Liangzhe Yuan and Rui Qian and Wei-Hong Chuang and Shih-Fu Chang and Yin Cui and Boqing Gong
arXiv e-Print archive - 2021 via Local arXiv
Keywords: cs.CV, cs.AI, cs.LG, cs.MM, eess.IV

[link] Summary by CodyWild 3 years ago

This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. 

The basic premise is: 

- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding.
- Run all of these embeddings through a Transformer with shared weights for all three modalities
- Take the final projected CLS representation for each of the video patches, and perform contrastive learning against both an aligned audio patch and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text.
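The "project into a shared space, then contrast" step for the video-audio pair can be sketched as follows. This uses a generic symmetric InfoNCE-style loss as a stand-in; it is not claimed to be the paper's exact loss, and the soft text-video alignment is not shown. The function names, the temperature value, and the identity projections in the toy data are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(video_feats, audio_feats, w_video, w_audio, temperature=0.07):
    # Project each modality into the shared audio-video space.
    v = l2_normalize(video_feats @ w_video)    # (B, d)
    a = l2_normalize(audio_feats @ w_audio)    # (B, d)
    logits = v @ a.T / temperature             # (B, B); diagonal = aligned pairs
    diag = np.arange(len(v))
    # Video-to-audio and audio-to-video directions, averaged.
    loss_v2a = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_a2v = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_v2a + loss_a2v)

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
audio = video + 0.1 * rng.normal(size=(8, 16))   # toy "aligned" audio features
w_video = np.eye(16)   # identity projections, only to keep the toy easy to check
w_audio = np.eye(16)

aligned_loss = contrastive_loss(video, audio, w_video, w_audio)
# Misaligning the pairs (rolling the batch by one) should raise the loss.
shuffled_loss = contrastive_loss(video, np.roll(audio, 1, axis=0), w_video, w_audio)
```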

One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just said "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence length cost?" This obviously leads to some performance cost, but they found it not very dramatic.
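A minimal sketch of the idea, assuming tokens already carry their positional encodings so that dropping some of them does not lose position information. The function name `drop_tokens` and the `keep_rate` parameter are hypothetical, chosen for this example.

```python
import numpy as np

def drop_tokens(tokens, keep_rate, rng):
    # tokens: (S, d) sequence of embedded tokens (positions already added).
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_rate)))
    # Sample a random subset without replacement, preserving original order.
    keep_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 32))
kept, keep_idx = drop_tokens(tokens, keep_rate=0.5, rng=rng)

# Attention cost scales with S^2, so keeping half the tokens
# cuts the score matrix to a quarter of its original size.
cost_ratio = (kept.shape[0] ** 2) / (tokens.shape[0] ** 2)
```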

Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.


Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned
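The first advantage can be illustrated with a small sketch. It uses fully connected layers instead of convolutions and omits batch normalization, both simplifications made for this example; the point is only that all-zero weights already give the identity mapping.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): the residual the block has to learn (two-layer toy version).
    residual = relu(x @ w1) @ w2
    # Skip connection: add the input back before the final non-linearity.
    return relu(x + residual)

rng = np.random.default_rng(0)
# Non-negative input, as it would be after a preceding ReLU.
x = np.abs(rng.normal(size=(4, 8)))

# With all-zero weights the residual is 0 and the block passes x through.
zero_w = np.zeros((8, 8))
identity_out = residual_block(x, zero_w, zero_w)

# With non-zero weights the block outputs x plus a learned correction.
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
out = residual_block(x, w1, w2)
```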

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)


Reward Augmented Maximum Likelihood for Neural Structured Prediction
Mohammad Norouzi and Samy Bengio and Zhifeng Chen and Navdeep Jaitly and Mike Schuster and Yonghui Wu and Dale Schuurmans
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

[link] Summary by Jon Gauthier 8 years ago

(See also a more thorough summary in [a LaTeX PDF][1].)

This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure.

We can agree on some deficiencies of maximum likelihood learning:

- ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the ground-truth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.
- ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are non-differentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest.

Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize non-differentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:

- Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80-token sequences where each word is drawn from a vocabulary of 80,000 words).
- The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the $80{,}000^{80}$ possible outputs in the translation problem in the paper.
- Vanilla RL doesn't take into account the ground-truth outputs available to us in structured prediction.

This paper designs a solution which combines supervised learning with a reinforcement learning-inspired smoothing method. Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to high-reward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the ground-truth outputs in the supervised data. We end up optimizing the following objective:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$

This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the non-stationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$
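Here is a toy sketch of the two forms of the objective for a single training pair. It assumes a reward equal to the negative Hamming distance to $y^*$ and a payoff distribution proportional to $\exp(r / \tau)$, and it enumerates every candidate output, which is only feasible because the output space here is tiny (binary sequences of length 3). The model is a placeholder that assigns uniform probability.

```python
import itertools
import numpy as np

def hamming_reward(y, y_star):
    # Negative Hamming distance: 0 for an exact match, more negative otherwise.
    return -float(sum(a != b for a, b in zip(y, y_star)))

def payoff_distribution(candidates, y_star, tau):
    # q(y | y*; tau) proportional to exp(reward / tau), normalized over candidates.
    rewards = np.array([hamming_reward(y, y_star) for y in candidates])
    logits = rewards / tau
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def rml_loss(q, log_p):
    # Inner sum of the objective: -sum_y q(y | y*) log p_theta(y | x).
    return -float(np.sum(q * log_p))

candidates = list(itertools.product([0, 1], repeat=3))
y_star = (1, 0, 1)
q = payoff_distribution(candidates, y_star, tau=0.5)

# Placeholder model: uniform log-probabilities over the 8 candidates.
log_p_uniform = np.full(len(candidates), np.log(1.0 / len(candidates)))
exact_loss = rml_loss(q, log_p_uniform)

# Second form: sample y ~ q (not from the model) and average -log p_theta.
rng = np.random.default_rng(0)
sampled = rng.choice(len(candidates), size=1000, p=q)
mc_loss = -float(np.mean(log_p_uniform[sampled]))
```

Note that the samples come from `q`, which never changes during training, rather than from the model being trained.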

### Model details

If you're interested in the low-level details, I wrote up the gist of the math in [this PDF][1].

### Analysis

#### Relationship to label smoothing

This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In next-word prediction language modeling, a popular trick involves smoothing the target distributions by combining the ground-truth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the one-hot vector from our supervised data and a normalized frequency vector calculated on some corpus.) Mathematically, the cross entropy with label smoothing is

$$\mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$

(The equation above leaves out a constant entropy term.)

The gradient of this objective looks exactly the same as the reward-augmented ML gradient from the paper:

$$\nabla_\theta \mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}(y; y^*)} \left[ \nabla_\theta \log p_\theta(y \mid x) \right]$$

So reward-augmented likelihood is equivalent to label smoothing, where our smoothing distribution is log-proportional to our downstream reward function.
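For reference, the weighted-sum form of label smoothing mentioned in this subsection can be sketched as below. The mixing weight `epsilon`, the three-word vocabulary, and the unigram frequencies are made-up values for illustration.

```python
import numpy as np

def smoothed_targets(true_index, base_distribution, epsilon):
    # Weighted sum of the one-hot ground truth and a base distribution.
    one_hot = np.zeros_like(base_distribution)
    one_hot[true_index] = 1.0
    return (1.0 - epsilon) * one_hot + epsilon * base_distribution

def cross_entropy(targets, log_probs):
    # -sum_y p_smooth(y; y*) log p_theta(y | x) for one example.
    return -float(np.sum(targets * log_probs))

unigram = np.array([0.5, 0.3, 0.2])        # hypothetical word frequencies
targets = smoothed_targets(true_index=2, base_distribution=unigram, epsilon=0.1)

model_log_probs = np.log(np.array([0.2, 0.2, 0.6]))   # placeholder model output
loss = cross_entropy(targets, model_log_probs)
```

Swapping `unigram` for a reward-derived distribution over whole output sequences gives the reward-augmented version.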

#### Relationship to distillation

Optimizing the reward-augmented maximum likelihood is equivalent to minimizing the KL divergence $$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$

This divergence reaches zero iff $q = p$. We can say, then, that the effect of optimizing on $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$).
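The equivalence stated at the start of this subsection follows in two lines from the definitions. Writing $\mathbb H(q)$ for the entropy of the payoff distribution (notation introduced here for convenience):

$$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x)) = \sum_y q(y \mid y^*; \tau) \log \frac{q(y \mid y^*; \tau)}{p_\theta(y \mid x)} = - \mathbb H(q(y \mid y^*; \tau)) - \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x)$$

Taking the expectation over the dataset gives

$$\mathbb E_{x, y^* \sim \mathcal D}\left[ D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x)) \right] = \mathcal L_\text{RML} - \mathbb E_{x, y^* \sim \mathcal D}\left[ \mathbb H(q(y \mid y^*; \tau)) \right]$$

The entropy term does not depend on $\theta$, so the two objectives have the same gradients and the same minimizers.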

It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit efficient sampling.

#### Relationship to adversarial training

We can also view reward-augmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function.

Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the ground-truth supervised data. A high-level comparison:

- Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.
- Adversarial training is a **model-based data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **data-based augmentation**, which could in theory be done independent of the actual training process.

---

Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary.

[1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view


Rethinking Pre-training and Self-training
Zoph, Barret and Ghiasi, Golnaz and Lin, Tsung-Yi and Cui, Yin and Liu, Hanxiao and Cubuk, Ekin D. and Le, Quoc V.
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 4 years ago

Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result.

The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights of that model (except for the final prediction head) and using them as initialization for training on your downstream task. Self-training also uses an external dataset, but doesn't use that external data's labels. It works by

1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on

2) Using that model to make label predictions (for the label set of your downstream task), for the external dataset

3) Retraining a model from scratch with the combined set of human labels and predicted labels from step (2)
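The three steps can be sketched on a toy problem. This uses a nearest-centroid classifier on synthetic 2-D Gaussian data purely for illustration; the paper works with detection models on real images, and every name and number below is an assumption made for the example.

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    # "Training": one mean vector per class.
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, x):
    # Assign each point to its nearest class centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])

def sample(n):
    y = np.arange(n) % 2                       # alternate classes deterministically
    return class_means[y] + rng.normal(size=(n, 2)), y

x_labeled, y_labeled = sample(20)      # small labeled downstream dataset
x_external, _ = sample(500)            # external data; its labels are never used
x_test, y_test = sample(200)

# Step 1: train a teacher on the labeled downstream data.
teacher = fit_centroids(x_labeled, y_labeled, num_classes=2)
# Step 2: use the teacher to pseudo-label the external dataset.
pseudo_labels = predict(teacher, x_external)
# Step 3: retrain from scratch on human labels plus pseudo labels.
x_combined = np.concatenate([x_labeled, x_external])
y_combined = np.concatenate([y_labeled, pseudo_labels])
student = fit_centroids(x_combined, y_combined, num_classes=2)

student_accuracy = float((predict(student, x_test) == y_test).mean())
```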

https://i.imgur.com/HaJTuyo.png

This intuitively feels like cheating, something that shouldn't quite work, and yet the authors find that it equals or outperforms pretraining and self-supervised learning in the setting they examined (transferring from ImageNet as an external dataset to COCO as a downstream task, and using data augmentations on COCO). They particularly find this to be the case when they're using stronger data augmentations, and when they have more labeled COCO data to train with from the pretrained starting point. They also find that self-training outperforms self-supervised (e.g. contrastive) learning in similar settings. They further demonstrate that self-training and pre-training can stack; you can get marginal value from one, even if you're already using the other. They do acknowledge that, because it requires training a model on your dataset twice rather than reusing an existing model directly, their approach is more computationally costly than the pretrained-ImageNet alternative.

This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which I believe has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works.

The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a half-baked after-the-fact justification.

One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pre-training do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. Their best guess is that it demonstrates the value of task-specificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels.