Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Learning to learn by gradient descent by gradient descent

Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.NE, cs.LG

**First published:** 2016/06/14 (8 years ago)

**Abstract:** The move from hand-designed features to learned features in machine learning
has been wildly successful. In spite of this, optimization algorithms are still
designed by hand. In this paper we show how the design of an optimization
algorithm can be cast as a learning problem, allowing the algorithm to learn to
exploit structure in the problems of interest in an automatic way. Our learned
algorithms, implemented by LSTMs, outperform generic, hand-designed competitors
on the tasks for which they are trained, and also generalize well to new tasks
with similar structure. We demonstrate this on a number of tasks, including
simple convex problems, training neural networks, and styling images with
neural art.

# Very Short

The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**. This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**.

# Short

Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$ (the *optimizee*) which we wish to minimize with respect to $\theta$. Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent, where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here).

The "vanilla" gradient descent update is $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate. Other optimizers (Adam, RMSProp, etc.) replace the multiplication of the gradient by $-\alpha_t$ with some sort of weighted sum of the history of gradients. This paper proposes instead to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$:

$$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$

where, in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network. To keep the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. a set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$). They train the optimizer network's parameters $\phi$ by "unrolling" $T$ subsequent steps of optimization and minimizing:

$$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))] \approx \frac1T \sum_{t=1}^T f(\theta_t)$$

where $\theta^*(f, \phi)$ are the final optimizee parameters.
In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi} \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated).

https://i.imgur.com/HMaCeip.png

**The computational graph of the optimization of the optimizer, unrolled across 3 time-steps. Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$. The dotted line indicates that we do not backpropagate across this path.**

The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG) on a synthetic dataset, MNIST, CIFAR-10, and Neural Style Transfer. They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network.
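The inner loop (compute $\nabla_{\theta_t} f(\theta_t)$, feed it to the recurrent cell, apply $g_t$) can be sketched in a few lines. This is a toy stand-in, not the paper's implementation: a single-unit vanilla RNN cell with hand-picked weights replaces the learned two-layer LSTM, the optimizee is a simple quadratic, and the names `coordinatewise_rnn_step`, `w_g`, and `w_h` are illustrative.

```python
import numpy as np

def quadratic(theta):
    """Toy optimizee f(theta) = ||theta||^2, with its analytic gradient."""
    return float(theta @ theta), 2.0 * theta

def coordinatewise_rnn_step(grad, h, w_g, w_h, out_scale=0.1):
    """One optimizer step g_t from a tiny per-coordinate recurrent cell.

    Every coordinate runs the *same* cell (shared parameters w_g, w_h),
    mirroring the paper's coordinatewise LSTM, except that the cell here
    is a single-unit vanilla RNN rather than a two-layer LSTM.
    """
    h_new = np.tanh(w_g * grad + w_h * h)  # per-coordinate hidden state h_t
    g = -out_scale * h_new                 # proposed update g_t
    return g, h_new

theta = np.array([1.0, -2.0, 3.0])
h = np.zeros_like(theta)
w_g, w_h = 1.0, 0.5  # hand-picked; the paper *learns* these by meta-training
losses = []
for _ in range(50):
    loss, grad = quadratic(theta)
    losses.append(loss)
    g, h = coordinatewise_rnn_step(grad, h, w_g, w_h)
    theta = theta + g  # theta_{t+1} = theta_t + g_t

print(losses[0], losses[-1])
```

Even this untrained cell decreases the loss because its weights were chosen sensibly by hand; the point of the paper is that $\phi$ (here `w_g`, `w_h`) is learned by backpropagating $\mathcal L(\phi)$ through the unrolled loop.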

Sequence-to-Sequence Learning as Beam-Search Optimization

Sam Wiseman and Alexander M. Rush

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CL, cs.LG, cs.NE, stat.ML

**First published:** 2016/06/09 (8 years ago)

**Abstract:** Sequence-to-Sequence (seq2seq) modeling has rapidly become an important
general-purpose NLP tool that has proven effective for many text-generation and
sequence-labeling tasks. Seq2seq builds on deep neural language modeling and
inherits its remarkable accuracy in estimating local, next-word distributions.
In this work, we introduce a model and beam-search training scheme, based on
the work of Daume III and Marcu (2005), that extends seq2seq to learn global
sequence scores. This structured approach avoids classical biases associated
with local training and unifies the training loss with the test-time usage,
while preserving the proven model architecture of seq2seq and its efficient
training approach. We show that our system outperforms a highly-optimized
attention-based seq2seq system and other baselines on three different sequence
to sequence tasks: word ordering, parsing, and machine translation.

**Problem Setting:** Sequence-to-sequence learning (seq2seq) is one of the most successful techniques in machine learning nowadays. The basic idea is to encode a sequence into a vector (or a sequence of vectors if using an attention-based encoder) and then use a recurrent decoder to decode the target sequence conditioned on the encoder output. While researchers have explored various architectural changes to this basic encoder-decoder model, the standard way of training such seq2seq models is to maximize the likelihood of each successive target word conditioned on the input sequence and the *gold* history of target words. This is also known as *teacher forcing* in the RNN literature. Such an approach has three major issues:

1. **Exposure bias:** Since we teacher-force the model with the *gold* history during training, the model is never exposed to its own errors. At test time, we do not have access to the *gold* history and instead feed back the history generated by the model. If that history is erroneous, the model has no clue how to rectify it.
2. **Loss-evaluation mismatch:** While we evaluate the model using sequence-level metrics (such as BLEU for machine translation), we train it with a word-level cross-entropy loss.
3. **Label bias:** Since the word probabilities are normalized at each time step (by using a softmax over the final layer of the decoder), varying the number of possible candidates at each step can result in label bias. More about this later.

**Solution:** This paper proposes an alternative training procedure for seq2seq models which attempts to solve all three major issues listed above. The idea is to pose seq2seq learning as a beam-search optimization problem. The authors begin by removing the final softmax activation from the decoder: instead of probability distributions, we now get a score for each possible next word.
The training procedure is then changed as follows. At every time step $t$, they maintain a set $S_t$ of $K$ candidate sequences of length $t$, and the loss function is defined with the following characteristics:

1. If the *gold* sub-sequence of length $t$ is in the set $S_t$ and its score exceeds the score of the $K$-th ranked candidate by a margin, the model incurs no loss. The candidates for the next time step are chosen as in regular beam search with beam size $K$.
2. If the *gold* sub-sequence of length $t$ is in the set $S_t$ but is the $K$-th ranked candidate, the loss pushes the *gold* sequence up by increasing its score. The candidates for the next time step are chosen as in the first case.
3. If the *gold* sub-sequence of length $t$ is NOT in the set $S_t$, the score of the *gold* sequence is pushed to be higher than that of the $K$-th ranked candidate by a margin. In this case, the candidates for the next step are chosen by considering only the *gold* word at time $t$ and taking its top-$K$ successors.
4. Further, since we want the full *gold* sequence to be at the top of the beam at the end of the search, when $t = T$ the loss is modified to require the score of the *gold* sequence to exceed the score of the *highest* ranked incorrect prediction by a margin.

This non-probabilistic training method has several advantages:

* The model is trained the same way it is tested, since beam search is used during both training and testing. This helps eliminate exposure bias.
* The score-based loss can easily be scaled by a mistake-specific cost function. For example, in MT one could use a cost function inversely proportional to the BLEU score, so there is no loss-evaluation mismatch.
* Each time step can have a different set of successor words based on any hard constraints of the problem. Since the model is non-probabilistic, this varying successor function does not introduce any label bias.
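The per-step criterion above boils down to a hinge-style margin check against the $K$-th ranked candidate. A minimal numpy sketch (the name `bso_step_loss` and the unit margin are mine, not the paper's; the paper additionally scales the margin by a mistake-specific cost):

```python
import numpy as np

def bso_step_loss(scores, gold_idx, K, margin=1.0):
    """Margin loss for one time step of beam-search optimization (sketch).

    `scores` are unnormalized scores (no softmax, per the paper) for every
    candidate sub-sequence at this step; `gold_idx` indexes the gold
    sub-sequence. The gold must beat the K-th ranked candidate by
    `margin`, otherwise a hinge penalty is paid.
    """
    order = np.argsort(scores)[::-1]       # candidates ranked best-first
    topk = order[:K]
    kth_score = scores[order[K - 1]]
    if gold_idx in topk and scores[gold_idx] - kth_score >= margin:
        return 0.0                          # case 1: gold safely in the beam
    # cases 2-3: push the gold score above the K-th candidate by the margin
    return float(max(0.0, margin - (scores[gold_idx] - kth_score)))

scores = np.array([5.0, 3.0, 2.5, 0.5])     # 4 candidate sub-sequences
print(bso_step_loss(scores, gold_idx=0, K=2))  # gold leads by a wide margin
print(bso_step_loss(scores, gold_idx=2, K=2))  # gold fell off the beam
```

Because the scores are unnormalized, the successor set can vary from step to step without introducing label bias, which is exactly the third advantage listed above.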
Refer to [this set of slides][1] for an excellent illustration of label bias. The cost of the forward pass grows linearly with the beam size $K$; however, a GPU implementation should help reduce this cost, and the authors propose a clever way of doing BPTT which makes the backward pass almost the same cost as in ordinary seq2seq training.

**Additional Tricks**

1. The authors pre-train the seq2seq model with the regular word-level cross-entropy loss; this is crucial, since random initialization did not work.
2. They use a "curriculum beam" strategy in training: they start with a beam size of 2 and increase it by 1 every 2 epochs until it reaches the required size. The training beam size must be at least the test beam size plus 1 (i.e. $K_{tr} \geq K_{te} + 1$).
3. When using dropout, one needs to be careful to use the same dropout value during back-propagation. The authors do this by sharing a single dropout mask across all sequences in a time step.

**Experiments:** The authors compare the proposed model against basic seq2seq on word ordering, dependency parsing, and machine translation. The proposed model achieves significant improvements over the strong baseline.

**Related Work:** The whole idea of the paper is based on the [learning as search optimization (LaSO) framework][2] of Daume III and Marcu (2005). Other notable related works are [MIXER][3], which trains seq2seq models with a mix of cross-entropy and REINFORCE, and [an actor-critic based seq2seq training method][4]. The authors compare with MIXER and do significantly better.

**My two cents:** This is one of the important research directions, in my opinion. While other recent methods attempt to use reinforcement learning to avoid the issues of word-level cross-entropy training, this paper proposes a really simple score-based solution which works very well.
While most language-generation research is stuck within the probabilistic framework (I am saying this w.r.t. deep NLP research), this paper highlights the benefits of non-probabilistic generation models. I see this as one potential way of avoiding the nasty scalability issues that come with softmax-based generative models.

[1]: http://www.cs.stanford.edu/~nmramesh/crf
[2]: https://www.isi.edu/~marcu/papers/daume05laso.pdf
[3]: http://arxiv.org/pdf/1511.06732v7.pdf
[4]: https://arxiv.org/pdf/1607.07086v2.pdf

Sanity Checks for Saliency Maps

Julius Adebayo and Justin Gilmer and Michael Muelly and Ian Goodfellow and Moritz Hardt and Been Kim

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/10/08 (5 years ago)

**Abstract:** Saliency methods have emerged as a popular tool to highlight features in an
input deemed relevant for the prediction of a learned model. Several saliency
methods have been proposed, often guided by visual appeal on image data. In
this work, we propose an actionable methodology to evaluate what kinds of
explanations a given method can and cannot provide. We find that reliance,
solely, on visual assessment can be misleading. Through extensive experiments
we show that some existing saliency methods are independent both of the model
and of the data generating process. Consequently, methods that fail the
proposed tests are inadequate for tasks that are sensitive to either data or
model, such as, finding outliers in the data, explaining the relationship
between inputs and outputs that the model learned, and debugging the model. We
interpret our findings through an analogy with edge detection in images, a
technique that requires neither training data nor model. Theory in the case of
a linear model and a single-layer convolutional neural network supports our
experimental findings.

**Idea:** With the growing use of visual explanation systems for machine learning models, such as saliency maps, there needs to be a standardized method of verifying whether a saliency method correctly describes the underlying ML model.

**Solution:** This paper proposes two sanity checks to verify the accuracy and faithfulness of a saliency method:

* *Model parameter randomization test:* The output of a saliency method on a trained model is compared to that of the same method on an untrained, randomly parameterized model. If the resulting maps are similar or identical, the saliency method does not actually describe the model. In the course of this experiment, it is found that certain methods, such as Guided BackProp, produce near-constant explanations despite alterations to the model.
* *Data randomization test:* This test explores the relationship of saliency methods to the data and its labels. The labels of the training data are randomized, so there should be no definite pattern describing the model (since the model is as good as randomly guessing an output label). If a definite pattern remains, the saliency method is independent of the underlying model and training-data labels. Guided BackProp did not fare well in this test either, implying that this saliency method behaves more like an edge detector than an ML explainer.

Thus, this paper makes a valid argument for having standardized tests that an interpretation method must satisfy to be deemed accurate or faithful.
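The parameter randomization test can be sketched on a toy linear model. Everything below (`gradient_saliency`, the linear scorer, the correlation comparison) is an illustrative stand-in for the paper's CNN experiments and its actual similarity metrics (rank correlation, SSIM):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_saliency(w, x):
    """Gradient saliency of a linear scorer f(x) = w . x is |w|: it depends
    on the model parameters, so randomizing w should change the map."""
    return np.abs(w)

def input_only_saliency(w, x):
    """A method that ignores the model entirely (akin to an edge detector):
    its map is identical before and after randomization, so it fails."""
    return np.abs(x)

def randomization_test(saliency_fn, w, x):
    """Model parameter randomization test: compare the saliency map under
    the trained weights with the map under re-initialized weights."""
    s_trained = saliency_fn(w, x)
    s_random = saliency_fn(rng.normal(size=w.shape), x)
    return float(np.corrcoef(s_trained, s_random)[0, 1])

x = rng.normal(size=16)
w_trained = np.linspace(-1.0, 1.0, 16)  # stand-in for trained weights
c_grad = randomization_test(gradient_saliency, w_trained, x)
c_const = randomization_test(input_only_saliency, w_trained, x)
print(c_grad, c_const)  # the model-dependent method decorrelates; the other stays at 1
```

A method passes this sanity check when its map changes substantially under randomization; a correlation stuck at 1 means the explanation cannot be about the model.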

Implicit Neural Representations with Periodic Activation Functions

Sitzmann, Vincent and Martel, Julien N. P. and Bergman, Alexander W. and Lindell, David B. and Wetzstein, Gordon

- 2020 via Local Bibsonomy

Keywords: neural-network, machine-learning

[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic.]

I'm interested in learning more about Neural Radiance Fields (or NeRFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinusoidal Representation Networks. In my view, the most complex part of understanding this technique isn't the technique itself, but the particularities of the problem being solved and the ways it differs from a more traditional ML setup.

Typically, the goal of machine learning is to learn a model that extracts and represents properties of a data distribution, and that can generalize to new examples drawn from that distribution. Instead, in this framing, a single network is being used to capture information about a single image, essentially creating a compressed representation of that image that brings with it some nice additional properties. Concretely, the neural network represents a function that maps inputs of the form (x, y), representing coordinates within the image, to (r, g, b) values, representing the pixel values of the image at that coordinate. If you're able to train an optimal version of such a network, you have a continuous representation of the image. A good way to think about "continuous" here is that you could theoretically ask the model for the color value at pixel (3.5, 2.5) and, given that it's simply a numerical mapping, it could give you a prediction, even though that pixel never appears in your discrete "sampling" of pixels.

Given this problem setting, the central technique proposed by SIRENs is to use sinusoidal non-linearities between the layers. On the face of it, this may seem like a pretty weird choice: non-linearities are generally monotonic, and a sine wave is absolutely not that.
The appealing property of sinusoidal activations in this context is that the derivative of a sine curve is a cosine curve (which is essentially a shifted sine curve), and the same is true in reverse. This means you can take multiple derivatives of the learned function (where, again, "learned function" is your neural network optimized for this particular image) and have them still be networks of the same underlying form, with shifted constants.

This allows SIRENs to use an enhanced version of what would be a typical training procedure for this setting. Simplistically, you would train this kind of representation by feeding in the inputs and optimizing against a loss function that reduces the error in predicting the output values, i.e. the error on the f(x, y) function itself. When the model structure makes it easy to take first and second derivatives of the function the model computes, you can, as this paper does, train against a loss that matches not just the true f(x, y) function (again, the pixel values at coordinates) but also the first and second derivatives (gradients and Laplacian) of the image at those coordinates. This supervision lets you learn a better underlying representation, since it enforces not just what comes "above the surface" at your sampled pixels, but also the dynamics of the true function between those points.

One interesting benefit of using a loss in first- or second-derivative space (as pointed out in the paper) is that if you want to merge the interesting parts of multiple images, you can approximate that by training a SIREN on the sum of their gradients, since places where gradients are zero likely don't contain much contrast or interesting content (for example, a constant-color background).
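The sine/cosine derivative property is easy to verify numerically. The sketch below builds one SIREN layer (with the paper's $\omega_0 = 30$ frequency factor) and checks its analytic Jacobian against finite differences; the layer sizes and weight values here are arbitrary, not the paper's initialization scheme:

```python
import numpy as np

def siren_layer(x, W, b, w0=30.0):
    """One SIREN layer: sin(w0 * (W x + b)). The w0 = 30 frequency factor
    follows the paper; W and b here are arbitrary toy values."""
    return np.sin(w0 * (W @ x + b))

def siren_layer_grad(x, W, b, w0=30.0):
    """Analytic Jacobian. Since cos(u) = sin(u + pi/2), the derivative of a
    SIREN layer is itself a phase-shifted SIREN layer, which is why the
    gradients of the learned function can themselves be supervised."""
    u = w0 * (W @ x + b)
    return (w0 * np.cos(u))[:, None] * W  # shape (out_dim, in_dim)

rng = np.random.default_rng(1)
W = rng.uniform(-1.0, 1.0, size=(4, 2)) / 2.0
b = rng.uniform(-1.0, 1.0, size=4)
x = rng.normal(size=2)

# finite-difference check that the shifted-sine derivative is correct
eps = 1e-6
J_num = np.stack(
    [(siren_layer(x + eps * e, W, b) - siren_layer(x - eps * e, W, b)) / (2 * eps)
     for e in np.eye(2)],
    axis=1,
)
err = float(np.max(np.abs(J_num - siren_layer_grad(x, W, b))))
print(err)  # tiny: the analytic and numeric Jacobians agree
```

In the paper's training loss, this analytic derivative is what lets gradient and Laplacian supervision be applied directly to the network.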
The Experiments section goes into a lot of specific applications in boundary-finding problems, which I understand at less depth and thus won't try to explain. It also briefly mentions trying to learn a prior over the space of image functions (that is, a prior over the set of network weights that define the underlying function of an image); having such a prior is interesting in that it would theoretically let you sample both the implicit image function itself (from the prior) and then also points within that function.

Density estimation using Real NVP

Laurent Dinh and Jascha Sohl-Dickstein and Samy Bengio

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/05/27 (8 years ago)

**Abstract:** Unsupervised learning of probabilistic models is a central yet challenging
problem in machine learning. Specifically, designing models with tractable
learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving
(real NVP) transformations, a set of powerful invertible and learnable
transformations, resulting in an unsupervised learning algorithm with exact
log-likelihood computation, exact sampling, exact inference of latent
variables, and an interpretable latent space. We demonstrate its ability to
model natural images on four datasets through sampling, log-likelihood
evaluation and latent variable manipulations.

This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{-1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$:

$$p_X(x) = p_Z(z) \left|{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right|^{-1}$$

Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the determinant of the Jacobian of $g$, we can also write:

$$p_X(x) = p_Z(f(x)) \left|{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right|$$

where we've replaced $z$ by its deterministically inferred value $f(x)$. So, the core of the proposed model is a design for bijective functions $g$ (actually, they design the inverse $f$, from which $g$ can be derived by inversion) that are easily invertible and have an easy-to-compute determinant of the Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties, which allows one to construct highly non-linear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible both to compute $p_X(x)$ tractably and to sample from that density, by first sampling $z \sim p_Z$ and then computing $x = g(z)$.

The building blocks for constructing $f$ are the following:

**Coupling layers:** This is perhaps the most important piece.
It computes as its output $b\odot x + (1-b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$), it linearly transforms them as $x_i \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are non-linear transformations (i.e. the outputs of neural networks) that only have access to the masked input (i.e. the non-transformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1-b_i) l(b\odot x)_i)$. See Section 3.3 for more details.

**Alternating masks:** One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still an invertible bijection (since each layer is individually a bijection) with a tractable determinant for its Jacobian (since the Jacobian of a composition of functions is simply the product of each function's Jacobian matrix, and the determinant of a product of square matrices is the product of their determinants). This holds even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layers, each masking a different subset of (half of) the dimensions. For images, they propose masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because after at least 2 layers, every dimension has been transformed at least once.
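The coupling layer's forward map, exact inverse, and log-determinant can be written out directly from the formula above. This minimal numpy sketch uses toy one-layer `l` and `m` networks (their form here is arbitrary; any function of the masked input works):

```python
import numpy as np

def coupling_forward(x, b, l_net, m_net):
    """Real NVP coupling layer: identity on the masked half (b = 1), affine
    transform of the rest, with log-scale and shift computed from the
    masked half only."""
    s = l_net(b * x)                 # log-scale l(b . x), sees masked input only
    m = m_net(b * x)                 # shift m(b . x)
    y = b * x + (1 - b) * (x * np.exp(s) + m)
    log_det = float(np.sum((1 - b) * s))  # Jacobian is triangular
    return y, log_det

def coupling_inverse(y, b, l_net, m_net):
    """Exact inverse: the masked half of y equals the masked half of x, so
    s and m can be recomputed and the affine map undone. Note l and m are
    never inverted, which is why they may be arbitrarily complex."""
    s = l_net(b * y)
    m = m_net(b * y)
    return b * y + (1 - b) * (y - m) * np.exp(-s)

# toy one-layer l and m networks
rng = np.random.default_rng(2)
W_l, W_m = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
l_net = lambda v: 0.5 * np.tanh(W_l @ v)
m_net = lambda v: np.tanh(W_m @ v)

b = np.array([1.0, 1.0, 0.0, 0.0])   # mask half of the dimensions
x = rng.normal(size=4)
y, log_det = coupling_forward(x, b, l_net, m_net)
x_rec = coupling_inverse(y, b, l_net, m_net)
print(float(np.max(np.abs(x - x_rec))))  # inversion is exact up to float error
```

The returned `log_det` is exactly the $\sum_i (1-b_i)\, l(b\odot x)_i$ term from the text, which is what makes the exact log-likelihood computation tractable.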
**Squeezing operations:** A squeezing operation corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of feature maps with spatial resolutions reduced by half (see Figure 3). This exposes multiple scales of resolution to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose a per-channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration.

Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkerboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channel-wise masks. Since the output of each layer chunk must technically be of the same size as the input image, this could become expensive in computation and space when using many layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each chunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b.

Experiments on CIFAR-10 and on 32x32 and 64x64 versions of ImageNet show that the proposed model (coined real-valued non-volume preserving, or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN.

**My Two Cents**

The proposed approach is quite unique and thought-provoking. Most interestingly, it is the only powerful generative model I know of that combines A) a tractable likelihood, B) an efficient, one-pass sampling procedure, and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful!
I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as real-valued observations. First, this implies that its estimate of bits/dimension is an upper bound on what it could be if the uniform sub-pixel noise were integrated out (see Equations 3-4-5 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a non-linear transformation (${\rm logit}(\alpha + (1-\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, Real NVP might be at a disadvantage.

I'm also curious how easy it would be to do conditional inference with Real NVP. One could imagine doing approximate MAP conditional inference by clamping the observed dimensions and doing gradient ascent on the log-likelihood with respect to the values of the remaining dimensions. This could be interesting for image completion, or for structured output prediction with real-valued outputs in general. I also wonder how expensive that would be. In any case, I'm looking forward to seeing interesting applications and variations of this model in the future!
