Welcome to ShortScience.org! 
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts  the usefulness of pruning, and the success of weight rewinding  the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune lowmagnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphortoneuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. 
[link]
Oh et al. propose two different approaches for whitening black box neural networks, i.e. predicting details of their internals such as architecture or training procedure. In particular, they consider attributes regarding architecture (activation function, dropout, max pooling, kernel size of convolutional layers, number of convolutionaly/fully connected layers etc.), attributes concerning optimization (batch size and optimization algorithm) and attributes regarding the data (data split and size). In order to create a dataset of models, they trained roughly 11k models on MNIST; they ensured that these models have at least 98% accuracy on the validation set and they also consider ensembles. For predicting model attributes, they propose two models, called kenneno and kenneni, see Figure 1. Kenneno takes as input a set of $100$ predictions of the models (i.e. final probability distributions) and tries to directly learn the attributes using a MLP of two fully connected layers. Kenneni instead crafts a single input which allows to reason about a specific model attribute. An example for kenneni is shown in Figure 2. In experiments, they demonstrate that both models are able to predict model attributes significantly better than chance. For details, I refer to the paper. https://i.imgur.com/YbFuniu.png Figure 1: Illustration of the two proposed approaches, kenneno (top) and kenneni (bottom). https://i.imgur.com/ZXj22zG.png Figure 2: Illustration of the images created by kenneni to classify different attributes. See the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
**Background:** The goal of this work is to indicate image features which are relevant to the prediction of a neural network and convey that information to the user by displaying a counterfactual image animation. **The Latent Shift Method:** This method works on any pretrained encoder/decoder and classifier which is differentiable. No special considerations are needed during model training. With this approach they want the exact opposite of an adversarial attack but it is using the same idea. They want to perturb the input image so that the classifier reduces its prediction. If they just compute $\frac{\partial f}{\partial x}$ and move the pixels directly then they will get an imperceivable difference like an adversarial attack. Using a decoder they can regularize the transformation so it will only yield value images. The encoder takes the input image and encodes it into a latent representation $z$. Then the decoder reconstructs the image and feeds this image into the classifier. The gradient is computed from the output of the classifier with respect to $z$. Subtracting the gradient from z and reconstructing the image generates a counterfactual. https://i.imgur.com/iuZGUTH.gif They found that if they change the prediction by 30% the images come out pretty good. So an iterative search along the vector defined by the gradient in the latent space until the prediction is reduced by 30%. From this sequence a 2D image can be reconstructed which is similar to a traditional attribution map by taking the maximum pixel wise difference between every image and the unperturbed reconstruction. https://i.imgur.com/V3PCgXZ.png The results look great! https://i.imgur.com/DBki84c.gif https://i.imgur.com/kFfQNKD.gif In order to validate if this approach can help spot false positive predictions, two radiologists to evaluate how confident they were in a models predictions. For each image, radiologists viewed the prediction in two ways, using traditional methods or the Latent Shift images. Traditional methods includes the image gradient, guided backprop, and integrated gradients. The Latent Shift Counterfactual includes the animation as well as the 2D version. https://i.imgur.com/TlUBhzL.png What they would like to see, that for true positives, the results are all 5 and for false positives they are all 1. What they observe however, is that many false positives still cause high confidence in the model predictions but not as much as the true positives. Between these two methods they find for true positives that the latent shift counterfactuals show a significant increase in confidence which is good. > 0.15±0.95 confidence increase using the Latent Shift method (p=0.01). For false positives they find an increase in confidence but it is not significant. > 0.04±1.06 increase which is not significant (p=0.57) **Conclusions:**  Latent Shift's ability to generate counterfactuals is pretty good!  Vanilla autoencoders are sufficient for some pathologies.  StyleGAN and higher quality models should improve performance.  IoU analysis may not be the best fit.  Explainable AI methods can have an impact on the user confidence in the model. (Disclaimer: I am the author of this work) Project Website: https://mlmed.org/gifsplanation/ 
[link]
In this article, the authors provide a framework for training two translation models with large accessible monolingual corpus. In traditional methods, machine translation models always require large parallel corpus to train a good quality model, which is expensive to acquire. However, the massive monolingual data is not fully utilized. The monolingual corpus are typically used in pretraining the NMT decoder rnn and augmenting initial parallel corpus through selfgenerated translations. The authors embed machine translation task into a reinforcement learning framework, in which two agents act as two different native speakers respectively and know little about each other and then they learn to translate by trying to communicate with each other. **The two speakers**, `A` and `B`, obviously know well about their corresponding language respectively, this situation is easily simulated by two welltrained language models for `A` and `B`. Then, speaker `A` tries to tell a sentence $x$ to `B` by translating it into $y$ in `B`'s language. Since they don't know each other, `B` is uncertain about what `A` truly means by saying $y$. However, `B` is capable of evaluate the degree of sensibility of $y$ from his own understanding. Next, `B` informs `A` his sensibility evaluation score and tries to recover what `A` truly means in `A`'s language, i.e. $x'$. And similarly, `A` can also evaluate the degree of sensibility of $x'$ from his own understanding. In general, the very original idea that `A` tried to convey, is passed through a noisy channel to `B`, and then back to `A` through another noisy channel. The former noisy channel is a `AB` translation model and the latter a `BA` translation model in the framework. Think about how the first American learnt Chinese in history and I think it is intuitively similar to the principle in this work.
1 Comments

[link]
## General Framework Extends TREX (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1904.06387&a=muntermulehitch)) so that preferences (rankings) over demonstrations are generated automatically (back to the common IL/IRL setting where we only have access to a set of unlabeled demonstrations). Also derives some theoretical requirements and guarantees for betterthandemonstrator performance. ## Motivations * Preferences over demonstrations may be difficult to obtain in practice. * There is no theoretical understanding of the requirements that lead to outperforming demonstrator. ## Contributions * Theoretical results (with linear reward function) on when betterthandemonstrator performance is possible: 1 the demonstrator must be suboptimal (room for improvement, obviously), 2 the learned reward must be close enough to the reward that the demonstrator is suboptimally optimizing for (be able to accurately capture the intent of the demonstrator), 3 the learned policy (optimal wrt the learned reward) must be close enough to the optimal policy (wrt to the ground truth reward). Obviously if we have 2 and a good enough RL algorithm we should have 3, so it might be interesting to see if one can derive a requirement from only 1 and 2 (and possibly a good enough RL algo). * Theoretical results (with linear reward function) showing that pairwise preferences over demonstrations reduce the error and ambiguity of the reward learning. They show that without rankings two policies might have equal performance under a learned reward (that makes expert's demonstrations optimal) but very different performance under the true reward (that makes the expert optimal everywhere). Indeed, the expert's demonstration may reveal very little information about the reward of (suboptimal or not) unseen regions which may hurt very much the generalizations (even with RL as it would try to generalize to new states under a totally wrong reward). They also show that pairwise preferences over trajectories effectively give halfspace constraints on the feasible reward function domain and thus may decrease exponentially the reward function ambiguity. * Propose a practical way to generate as many ranked demos as desired. ## Additional Assumption Very mild, assumes that a Behavioral Cloning (BC) policy trained on the provided demonstrations is better than a uniform random policy. ## Disturbancebased Reward Extrapolation (DREX) ![](https://i.imgur.com/9g6tOrF.png) ![](https://i.imgur.com/zSRlDcr.png) They also show that the more noise added to the BC policy the lower the performance of the generated trajs. ## Results Pretty much like TREX. 