ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Reward Augmented Maximum Likelihood for Neural Structured Prediction
Mohammad Norouzi and Samy Bengio and Zhifeng Chen and Navdeep Jaitly and Mike Schuster and Yonghui Wu and Dale Schuurmans
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Jon Gauthier 8 years ago

(See also a more thorough summary in [a LaTeX PDF][1].)

This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure.

We can agree on some deficiencies of maximum likelihood learning:

- ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the ground-truth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.
- ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are non-differentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest.

Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize non-differentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:

- Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80-token sequences where each word is drawn from a vocabulary of 80,000 words).
- The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the 80^80K possible outputs in the translation problem in the paper.
- Vanilla RL doesn't take into account the ground-truth outputs available to us in structured prediction.

This paper designs a solution which combines supervised learning with a reinforcement learning-inspired smoothing method. Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to high-reward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the ground-truth outputs in the supervised data. We end up optimizing the following objective:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$

This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the non-stationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$

### Model details

If you're interested in the low-level details, I wrote up the gist of the math in [this PDF][1].

### Analysis

#### Relationship to label smoothing

This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In next-word prediction language modeling, a popular trick involves smoothing the target distributions by combining the ground-truth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the one-hot vector from our supervised data and a normalized frequency vector calculated on some corpus.) Mathematically, the cross entropy with label smoothing is

$$\mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$

(The equation above leaves out a constant entropy term.)

The gradient of this objective looks exactly the same as the reward-augmented ML gradient from the paper:

$$\nabla_\theta \mathcal L_\text{ML-smooth} = \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}} \left[ \log p_\theta(y \mid x) \right]$$

So reward-augmented likelihood is equivalent to label smoothing, where our smoothing distribution is log-proportional to our downstream reward function.

#### Relationship to distillation

Optimizing the reward-augmented maximum likelihood is equivalent to minimizing the KL divergence $$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$

This divergence reaches zero iff $q = p$. We can say, then, that the effect of optimizing on $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$).

It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit to efficient sampling.

#### Relationship to adversarial training

We can also view reward-augmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function.

Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the ground-truth supervised data. A high-level comparison:

- Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.
- Adversarial training is a **model-based data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **data-based augmentation**, which could in theory be done independent of the actual training process.

---

Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary.

[1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view

arxiv.org
scholar.google.com

Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning
Ren, Zhongzheng and Yeh, Raymond A. and Schwing, Alexander G.
- 2020 via Local Bibsonomy
Keywords: dataset, semi-supervised, machine-learning, data, 2020

[link] Summary by CodyWild 4 years ago

This paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly, when the unsupervised component of the loss for each example is just added together directly. Instead, it tries to learn weights for each specific data example, through a meta-learning-esque process. 

The form of semi-supervised learning being discussed here is label-based consistency loss, where a labeled image is augmented and run through the current version of the model, and the model is optimized to try to induce the same loss for the augmented image as the unaugmented one. The premise of the authors argument for learning per-example weights is that, ideally, you would enforce consistency loss less on examples where a model was unconfident in its label prediction for an unlabeled example. 

As a way to solve this, the authors suggest learning a vector of parameters - one for each example in the dataset - where element i in the vector is a weight for element i of the dataset, in the summed-up unsupervised loss. They do this via a two-step process, where first they optimize the parameters of the network given the example weights, and then the optimize the example weights themselves. To optimize example weights, they calculate a gradient of those weights on the post-training validation loss, which requires backpropogating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced better validation loss). This requires calculating the inverse Hessian (second derivative matrix of the loss), which is, generally speaking, a quite costly operation for huge-parameter nets. To lessen this cost, they pretend that only the final layer of weights in the network are being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably those were the only ones we have enough information to upweight or downweight. With this model, the authors achieve modest improvements - performance comparable to or within-error-bounds better than the current state of the art, FixMatch. 

Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach: 

- They don't seem to have benchmarked against the simpler baseline of some inverse of using Dropout-estimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example"
- If the presumed need for this is the lack of certainty of the model, that's a non-stationary problem that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target
- Despite using techniques rooted in meta-learning, it doesn't seem like this models learns anything generalizable - it's learning index-based weights on specific examples, which doesn't give it anything useful it can do with some new data point it finds that it wasn't specifically trained on

Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem like it was worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme)

openreview.net
scholar.google.com

Comparing Rewinding and Fine-tuning in Neural Network Pruning
Renda, Alex and Frankle, Jonathan and Carbin, Michael
International Conference on Learning Representations - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. 

Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 
1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned 

but, also 
2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost 

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. 

All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.

arxiv.org
scholar.google.com

Mask R-CNN
He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 8 years ago

Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN).

Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class.

##### Major Changes and intutions

**Mask prediction**

Mask prediction segmentation predicts a binary mask for each RoI using fully convolution - and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask.

Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation

**RoIAlign**

RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of  quantization of the RoI boundaries
or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).

**Backbone architecture**

Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures - ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of single-scale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) 

**Training Objective**

The training objective looks like this 
![](https://i.imgur.com/snUq73Q.png)

Lmask is the addition from Faster RCNN. The method to calculate was mentioned above

#### Observation

Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper

scholar.google.com

Distributed representations of words and phrases and their compositionality
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff
Advances in neural information processing systems - 2013 via Local Bibsonomy
Keywords: thema:deepwalk, language, modelling, representation

[link] Summary by NIPS Conference Reviews 9 years ago

The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper): which learns linear word embeddings that are particularly useful for analogical reasoning type tasks. The extensions proposed (namely, negative sampling and sub-sampling of high frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase level embeddings by slightly tweaking the original training algorithm.

This paper proposes 3 improvements for the skip-gram model which allows for learning embeddings for words. The first improvement is subsampling frequent word, the second is the use of a simplified version of noise constrastive estimation (NCE) and finally they propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad-hoc. In practice, both the subsampling and negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.

The article does not propose any explanation as to why the negative sampling produces better results than NCE which it is suppose to loosely approximate. In fact it doesn't explain why besides the obvious generalization gain the negative sampling scheme should be preferred to NCE since they achieve similar speeds.