Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Sequence-to-Sequence Learning as Beam-Search Optimization

Sam Wiseman and Alexander M. Rush

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.LG, cs.NE, stat.ML

**First published:** 2016/06/09 (8 years ago)

**Abstract:** Sequence-to-Sequence (seq2seq) modeling has rapidly become an important
general-purpose NLP tool that has proven effective for many text-generation and
sequence-labeling tasks. Seq2seq builds on deep neural language modeling and
inherits its remarkable accuracy in estimating local, next-word distributions.
In this work, we introduce a model and beam-search training scheme, based on
the work of Daume III and Marcu (2005), that extends seq2seq to learn global
sequence scores. This structured approach avoids classical biases associated
with local training and unifies the training loss with the test-time usage,
while preserving the proven model architecture of seq2seq and its efficient
training approach. We show that our system outperforms a highly-optimized
attention-based seq2seq system and other baselines on three different sequence
to sequence tasks: word ordering, parsing, and machine translation.
more
less

Sam Wiseman and Alexander M. Rush

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.LG, cs.NE, stat.ML

[link]
**Problem Setting:** Sequence to Sequence learning (seq2seq) is one of the most successful techniques in machine learning nowadays. The basic idea is to encode a sequence into a vector (or a sequence of vectors if using attention based encoder) and then use a recurrent decoder to decode the target sequence conditioned on the encoder output. While researchers have explored various architectural changes to this basic encoder-decoder model, the standard way of training such seq2seq models is to maximize the likelihood of each successive target word conditioned on the input sequence and the *gold* history of target words. This is also known as *teacher-forcing* in RNN literature. Such an approach has three major issues: 1. **Exposure Bias:** Since we teacher-force the model with *gold* history during training, the model is never exposed to its errors during training. At test time, we will not have access to *gold* history and we feed the history generated by the model. If it is erroneous, the model does not have any clue about how to rectify it. 2. **Loss-Evaluation Mismatch:** While we evaluate the model using sequence level metrics (such as BLEU for Machine Translation), we are training the model with word level cross entropy loss. 3. **Label bias:** Since the word probabilities are normalized at each time step (by using softmax over the final layer of the decoder), this can result in label bias if we vary the number of possible candidates in each step. More about this later. **Solution:** This paper proposes an alternative training procedure for seq2seq models which attempt to solve all the 3 major issues listed above. The idea is to pose seq2seq learning as beam-search optimization problem. Authors begin by removing the final softmax activation function from the decoder. Now instead of probability distributions, we will get score for next possible word. Then the training procedure is changed as follows: At every time step $t$, they maintain a set $S_t$ of $K$ candidate sequences of length $t$. Now the loss function is defined with the following characteristics: 1. If the *gold* sub-sequence of length $t$ is in set $S_t$ and the score for *gold* sub-sequence exceeds the score of the $K$-th ranked candidate by a margin, the model incurs no loss. Now the candidates for next time-step are chosen in a way similar to regular beam-search with beam-size $K$. 2. If the *gold* sub-sequence of length $t$ is in set $S_t$ and it is the $K$-th ranked candidate, then the loss will push the *gold* sequence up by increasing its score. The candidates for next time-step are chosen in a way similar as first case. 3. If the *gold* sub-sequence of length $t$ is NOT in set $S_t$, then the score of the *gold* sequence is increased to be higher than $K$-th ranked candidate by a margin. In this case, candidates for next step or chosen by only considering *gold* word at time $t$ and getting its top-$K$ successors. 4. Further, since we want the full *gold* sequence to be at top of the beam at the end of the search, when $t=T$, the loss is modified to require the score of *gold* sequence to exceed the score of the *highest* ranked incorrect prediction by a margin. This non-probabilistic training method has several advantages: * The model is trained in a similar way it would be tested, since we use beam-search during training as well as testing. Hence this helps to eliminate exposure bias. * The score based loss can be easily scaled by a mistake-specific cost function. For example, in MT, one could use a cost function which is inversely proportional to BLEU score. So there is no loss-evaluation mismatch. * Each time step can have different set of successor words based on any hard constraints in the problem. Note that the model is non-probabilistic and hence this varying successor function will not introduce any label bias. Refer [this set of slides][1] for an excellent illustration of label bias. Cost of forward-prop grows linearly with respect to beam size $K$. However, GPU implementation should help to reduce this cost. Authors propose a clever way of doing BPTT which makes the back-prop almost same cost as ordinary seq2seq training. **Additional Tricks** 1. Authors pre-train the seq2seq model with regular word level cross-entropy loss and this is crucial since random initialization did not work. 2. Authors use "curriculum beam" strategy in training where they start with beam size of 2 and increase the beam size by 1 for every 2 epochs until it reaches the required beam size. You have to train your model with training beam size of at least test beam size + 1. (i.e $K_{tr} >= K_{te} + 1$). 3. When you use drop-out, you need to be careful to use the same dropout value during back-prop. Authors do this by sharing a single dropout across all sequences in a time step. **Experiments** Authors compare the proposed model against basic seq2seq in word ordering, dependency parsing and MT tasks. The proposed model achieves significant improvement over the strong baseline. **Related Work:** The whole idea of the paper is based on [learning as search optimization (LaSO) framework][2] of Daume III and Marcu (2005). Other notable related work are training seq2seq models using mix of cross-entropy and REINFORCE called [MIXER][3] and [an actor-critic based seq2seq training][4]. Authors compare with MIXER and they do significantly better than MIXER. **My two cents:** This is one of the important research directions in my opinion. While other recent methods attempt to use reinforcement learning to avoid the issues in word-level cross-entropy training, this paper proposes a really simple score based solution which works very well. While most of the language generation research is stuck with probabilistic framework (I am saying this w.r.t Deep NLP research), this paper highlights the benefits on non-probabilistic generation models. I see this as one potential way of avoiding the nasty scalability issues that come with softmax based generative models. [1]: http://www.cs.stanford.edu/~nmramesh/crf [2]: https://www.isi.edu/~marcu/papers/daume05laso.pdf [3]: http://arxiv.org/pdf/1511.06732v7.pdf [4]: https://arxiv.org/pdf/1607.07086v2.pdf |

Learning to learn by gradient descent by gradient descent

Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.NE, cs.LG

**First published:** 2016/06/14 (8 years ago)

**Abstract:** The move from hand-designed features to learned features in machine learning
has been wildly successful. In spite of this, optimization algorithms are still
designed by hand. In this paper we show how the design of an optimization
algorithm can be cast as a learning problem, allowing the algorithm to learn to
exploit structure in the problems of interest in an automatic way. Our learned
algorithms, implemented by LSTMs, outperform generic, hand-designed competitors
on the tasks for which they are trained, and also generalize well to new tasks
with similar structure. We demonstrate this on a number of tasks, including
simple convex problems, training neural networks, and styling images with
neural art.
more
less

Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.NE, cs.LG

[link]
# Very Short The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**. # Short Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$, (the *optimizee*) which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent - where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here). The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc) replace the multiplication of the gradient by $-\alpha_t$ with some sort of weighted sum of the history of gradients. This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: $$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$ Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. A set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer networks's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing: $$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$ Where $\theta^*(f, \phi)$ are the final optimizee parameters. In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated). https://i.imgur.com/HMaCeip.png **The computational graph of the optimization of the optimizer, unrolled across 3 time-steps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.** The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG), on a synthetic dataset, MNIST, CIFAR-10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network. |

Sanity Checks for Saliency Maps

Julius Adebayo and Justin Gilmer and Michael Muelly and Ian Goodfellow and Moritz Hardt and Been Kim

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/10/08 (5 years ago)

**Abstract:** Saliency methods have emerged as a popular tool to highlight features in an
input deemed relevant for the prediction of a learned model. Several saliency
methods have been proposed, often guided by visual appeal on image data. In
this work, we propose an actionable methodology to evaluate what kinds of
explanations a given method can and cannot provide. We find that reliance,
solely, on visual assessment can be misleading. Through extensive experiments
we show that some existing saliency methods are independent both of the model
and of the data generating process. Consequently, methods that fail the
proposed tests are inadequate for tasks that are sensitive to either data or
model, such as, finding outliers in the data, explaining the relationship
between inputs and outputs that the model learned, and debugging the model. We
interpret our findings through an analogy with edge detection in images, a
technique that requires neither training data nor model. Theory in the case of
a linear model and a single-layer convolutional neural network supports our
experimental findings.
more
less

Julius Adebayo and Justin Gilmer and Michael Muelly and Ian Goodfellow and Moritz Hardt and Been Kim

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

[link]
**Idea:** With the growing use of visual explanation systems of machine learning models such as saliency maps, there needs to be a standardized method of verifying if a saliency method is correctly describing the underlying ML model. **Solution:** In this paper two Sanity Checks have been proposed to verify the accuracy and the faithfulness of a saliency method: * *Model parameter randomization test:* In this sanity check the outputs of a saliency method on a trained model is compared to that of the same method on an untrained randomly parameterized model. If these images are similar/identical then this saliency method does not correctly describe the model. In the course of this experiment it is found that certain methods such as the Guided BackProp are constant in their explanations despite alterations in the model. * *Data Randomization Test:* This method explores the relationship of saliency methods to data and their associated labels. In this test, the labels of the training data are randomized thus there should be no definite pattern describing the model (Since the model is as good as randomly guessing an output label). If there is a definite pattern, this shows that the saliency methods are independent of the underlying model/training data labels. In this test as well Guided BackProp did not fare well, implying this saliency method is as good as an edge detector as opposed to a ML explainer. Thus this paper makes a valid argument toward having standardized tests that an interpretation model must satisfy to be deemed accurate or faithful. |

Implicit Neural Representations with Periodic Activation Functions

Sitzmann, Vincent and Martel, Julien N. P. and Bergman, Alexander W. and Lindell, David B. and Wetzstein, Gordon

- 2020 via Local Bibsonomy

Keywords: neural-network, machine-learinng

Sitzmann, Vincent and Martel, Julien N. P. and Bergman, Alexander W. and Lindell, David B. and Wetzstein, Gordon

- 2020 via Local Bibsonomy

Keywords: neural-network, machine-learinng

[link]
[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic] I'm interested in learning more about Neural Radiance Fields (or NERFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinuisodial Representation Networks. In my view, the most complex part of understanding this technique isn't the technique itself, but the particularities of the problem being solved, and the ways it differs from a more traditional ML setup. Typically, the goal of machine learning is to learn a model that extracts and represents properties of a data distribution, and that can generalize to new examples drawn from that distribution. Instead, in this framing, a single network is being used to capture information about a single image, essentially creating a compressed representation of that image that brings with it some nice additional properties. Concretely, the neural network is representing a function that maps inputs of the form (x, y), representing coordinates within the image, to (r, g, b) values, representing the pixel values of the image at that coordinate. If you're able to train an optimal version of such a network, it would mean you have a continuous representation of the image. A good way to think about "continuous," here, is that, you could theoretically ask the model for the color value at pixel (3.5, 2.5), and, given that it's simply a numerical mapping, it could give you a prediction, even though in your discrete "sampling" of pixels, that pixel never appears. Given this problem setting, the central technique proposed by SIRENs is to use sinusoidal non-linearities between the layers. On the face of it, this may seem like a pretty weird choice: non-linearities are generally monotonic, and a sine wave is absolutely not that. The appealing property of sinusoidal activations in this context is: if you take a derivative of a sine curve, what you get is a cosine curve (which is essentially a shifted sine curve), and the same is true in reverse. This means that you can take multiple derivatives of the learned function (where, again, "learned function" is your neural network optimized for this particular image), and have them still be networks of the same underlying format, with shifting constants. This allows SIRENs to use an enhanced version of what would be a typical training procedure for this setting. Simplistically, the way you'd go about training this kind of representation would be to simply give the inputs, and optimize against a loss function that reduced your prediction error in predicting the output values, or, in other words, the error on the f(x, y) function itself. When you have a model structure that makes it easy to take first and second derivatives of the function calculated by the model, you can, as this paper does, decide to train against a loss function of matching, not just the true f(x, y) function (again, the pixel values at coordinates), but also the first and second-derivatives (gradients and Laplacian) of the image at those coordinates. This supervision lets you learn a better underlying representation, since it enforces not just what comes "above the surface" at your sampled pixels, but the dynamics of the true function between those points. One interesting benefit of this procedure of using loss in a first or second derivative space (as pointed out in the paper), is that if you want to merge the interesting parts of multiple images, you can approximate that by training a SIREN on the sum of their gradients, since places where gradients are zero likely don't contain much contrast or interesting content (as an example: a constant color background). The Experiments section goes into a lot of specific applications in boundary-finding problems, which I understand at less depth, and thus won't try to explain. It also briefly mentions trying to learn a prior over the space of image functions (that is, a prior over the set of network weights that define the underlying function of an image); having such a prior is interesting in that it would theoretically let you sample both the implicit image function itself (from the prior), and then also points within that function. |

Towards a Neural Statistician

Harrison Edwards and Amos Storkey

arXiv e-Print archive - 2016 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2016/06/07 (8 years ago)

**Abstract:** An efficient learner is one who reuses what they already know to tackle a new
problem. For a machine learner, this means understanding the similarities
amongst datasets. In order to do this, one must take seriously the idea of
working with datasets, rather than datapoints, as the key objects to model.
Towards this goal, we demonstrate an extension of a variational autoencoder
that can learn a method for computing representations, or statistics, of
datasets in an unsupervised fashion. The network is trained to produce
statistics that encapsulate a generative model for each dataset. Hence the
network enables efficient learning from new datasets for both unsupervised and
supervised tasks. We show that we are able to learn statistics that can be used
for: clustering datasets, transferring generative models to new datasets,
selecting representative samples of datasets and classifying previously unseen
classes.
more
less

Harrison Edwards and Amos Storkey

arXiv e-Print archive - 2016 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
This paper can be thought as proposing a variational autoencoder applied to a form of meta-learning, i.e. where the input is not a single input but a dataset of inputs. For this, in addition to having to learn an approximate inference network over the latent variable $z_i$ for each input $x_i$ in an input dataset $D$, approximate inference is also learned over a latent variable $c$ that is global to the dataset $D$. By using Gaussian distributions for $z_i$ and $c$, the reparametrization trick can be used to train the variational autoencoder. The generative model factorizes as $p(D=(x_1,\dots,x_N), (z_1,\dots,z_N), c) = p(c) \prod_i p(z_i|c) p(x_i|z_i,c)$ and learning is based on the following variational posterior decomposition: $q((z_1,\dots,z_N), c|D=(x_1,\dots,x_N)) = q(c|D) \prod_i q(z_i|x_i,c)$. Moreover, latent variable $z_i$ is decomposed into multiple ($L$) layers $z_i = (z_{i,1}, \dots, z_{i,L})$. Each layer in the generative model is directly connected to the input. The layers are generated from $z_{i,L}$ to $z_{i,1}$, each layer being conditioned on the previous (see Figure 1 *Right* for the graphical model), with the approximate posterior following a similar decomposition. The architecture for the approximate inference network $q(c|D)$ first maps all inputs $x_i\in D$ into a vector representation, then performs mean pooling of these representations to obtain a single vector, followed by a few more layers to produce the parameters of the Gaussian distribution over $c$. Training is performed by stochastic gradient descent, over minibatches of datasets (i.e. multiple sets $D$). The model has multiple applications, explored in the experiments. One is of summarizing a dataset $D$ into a smaller subset $S\in D$. This is done by initializing $S\leftarrow D$ and greedily removing elements of $S$, each time minimizing the KL divergence between $q(c|D)$ and $q(c|S)$ (see the experiments on a synthetic Spatial MNIST problem of section 5.3). Another application is few-shot classification, where very few examples of a number of classes are given, and a new test example $x'$ must be assigned to one of these classes. Classification is performed by treating the small set of examples of each class $k$ as its own dataset $D_k$. Then, test example $x$ is classified into class $k$ for which the KL divergence between $q(c|x')$ and $q(c|D_k)$ is smallest. Positive results are reported when training on OMNIGLOT classes and testing on either the MNIST classes or unseen OMNIGLOT datasets, when compared to a 1-nearest neighbor classifier based on the raw input or on a representation learned by a regular autoencoder. Finally, another application is that of generating new samples from an input dataset of examples. The approximate posterior is used to compute $q(c|D)$. Then, $c$ is assigned to its posterior mean, from which a value for the hidden layers $z$ and finally a sample $x$ can be generated. It is shown that this procedure produces convincing samples that are visually similar from those in the input set $D$. **My two cents** Another really nice example of deep learning applied to a form of meta-learning, i.e. learning a model that is trained to take *new* datasets as input and generalize even if confronted to datasets coming from an unseen data distribution. I'm particularly impressed by the many tasks explored successfully with the same approach: few-shot classification and generative sampling, as well as a form of summarization (though this last probably isn't really meta-learning). Overall, the approach is quite elegant and appealing. The very simple, synthetic experiments of section 5.1 and 5.2 are also interesting. Section 5.2 presents the notion of a *prior-interpolation layer*, which is well motivated but seems to be used only in that section. I wonder how important it is, outside of the specific case of section 5.2. Overall, very excited by this work, which further explores the theme of meta-learning in an interesting way. |

About