Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Conditional Generative Adversarial Nets

Mirza, Mehdi and Osindero, Simon

arXiv e-Print archive - 2014 via Local Bibsonomy

Keywords: dblp

# Conditional Generative Adversarial Nets

## Introduction

* Conditional version of [Generative Adversarial Nets (GAN)](https://gist.github.com/shagunsodhani/1f9dc0444142be8bd8a7404a226880eb) where both the generator and the discriminator are conditioned on some data **y** (a class label or data from some other modality).
* [Link to the paper](https://arxiv.org/abs/1411.1784)

## Architecture

* Feed **y** into both the generator and the discriminator as an additional input layer, such that **y** and the input are combined in a joint hidden representation.

## Experiment

### Unimodal Setting

* Conditioning MNIST images on class labels.
* *z* (random noise) and **y** are mapped to ReLU hidden layers of sizes 200 and 1000 respectively, which are combined to obtain a ReLU layer of dimensionality 1200.
* The discriminator maps *x* (input) and **y** to maxout layers, and the joint maxout layer is fed to a sigmoid layer.
* Results do not outperform the state of the art but do provide a proof of concept.

### Multimodal Setting

* Map images (from Flickr) to labels (or user tags) to obtain the one-to-many mapping.
* Extract image and text features using a convolutional model and a language model.
* Generative Model
    * Map noise and convolutional features to a single 200-dimensional representation.
* Discriminator Model
    * Combine the representation of word vectors (corresponding to tags) and images.

## Future Work

* While the results are not so good, they do show the potential of Conditional GANs, especially in the multimodal setting.
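The conditioning used in the unimodal generator can be sketched as a single forward pass. This is only an illustration under stated assumptions, not the authors' implementation: the layer sizes 200, 1000 and 1200 come from the summary, while the noise dimensionality (100), the number of classes (10), the output size (784 MNIST pixels), the weight initialisation, the reading of "combined" as concatenation, and all function names are assumptions made for the example.

```python
import math
import random

def linear(x, w, b):
    """Dense layer: x has n_in entries, w is n_in x n_out, b has n_out entries."""
    return [sum(xi * wi[j] for xi, wi in zip(x, w)) + b[j] for j in range(len(b))]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def init_layer(n_in, n_out, rng):
    # Small Gaussian weights and zero biases (an assumption, not from the paper).
    w = [[rng.gauss(0.0, 0.05) for _ in range(n_out)] for _ in range(n_in)]
    return w, [0.0] * n_out

def init_generator(noise_dim, n_classes, out_dim, rng):
    return {
        "z": init_layer(noise_dim, 200, rng),     # noise branch -> 200 units
        "y": init_layer(n_classes, 1000, rng),    # label branch -> 1000 units
        "out": init_layer(1200, out_dim, rng),    # joint 1200 units -> output
    }

def generator_forward(z, y_onehot, params):
    h_z = relu(linear(z, *params["z"]))
    h_y = relu(linear(y_onehot, *params["y"]))
    joint = h_z + h_y  # list concatenation: 200 + 1000 = 1200 joint ReLU units
    return sigmoid(linear(joint, *params["out"]))

rng = random.Random(0)
params = init_generator(noise_dim=100, n_classes=10, out_dim=784, rng=rng)
z = [rng.uniform(-1.0, 1.0) for _ in range(100)]
y = [1.0 if k == 3 else 0.0 for k in range(10)]  # condition on digit class 3
image = generator_forward(z, y, params)
```

Keeping *z* fixed and changing only **y** changes the output, which is the point of the conditioning: the label selects what is generated, the noise selects which variant.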

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian J. Goodfellow and Mehdi Mirza and Da Xiao and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

**First published:** 2013/12/21 (9 years ago)

**Abstract:** Catastrophic forgetting is a problem faced by many machine learning models
and algorithms. When trained on one task, then trained on a second task, many
machine learning models "forget" how to perform the first task. This is widely
believed to be a serious problem for neural networks. Here, we investigate the
extent to which the catastrophic forgetting problem occurs for modern neural
networks, comparing both established and recent gradient-based training
algorithms and activation functions. We also examine the effect of the
relationship between the first task and the second task on catastrophic
forgetting. We find that it is always best to train using the dropout
algorithm--the dropout algorithm is consistently best at adapting to the new
task, remembering the old task, and has the best tradeoff curve between these
two extremes. We find that different tasks and relationships between tasks
result in very different rankings of activation function performance. This
suggests the choice of activation function should always be cross-validated.

The paper empirically investigates "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained on once it has been retrained on a second task.

An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt on the first task), training on the second task will always end in the global minimum, thus totally "forgetting" the first one.

Neuroscientific evidence (and common sense) suggests that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. Three cases are considered: (i) the two tasks are *functionally identical but the input is presented in a different format*, (ii) the *tasks are similar*, and (iii) the *tasks are dissimilar*. Relevant examples are, respectively, (i) performing the same image classification task starting from two different image representations such as RGB or HSL, (ii) performing semantically similar image classification tasks, such as classifying two similar animals, and (iii) performing text classification followed by image classification.

The problem is investigated by an empirical study covering two methods of training ("SGD" and "dropout") combined with 4 activation functions (logistic sigmoid, ReLU, LWTA, Maxout). A random search is carried out on these parameters. From a practitioner's point of view, it is interesting to note that dropout has been set to 0.5 for hidden units and 0.2 for visible units, since these are reasonably well-known values.

## Why the paper is important

It is apparently the first to provide a systematic empirical analysis of CF. It establishes a framework and baselines to face the problem.

## Key conclusions, takeaways and modelling remarks

* dropout helps in preventing CF
* dropout seems to increase the optimal model size with respect to the model without dropout
* choice of activation function has a less consistent effect than the dropout / no-dropout choice
* the dissimilar task experiment provides a notable exception
* the previous hypothesis that LWTA activation is particularly resistant to CF is rejected (even if it performs best on the new task in the dissimilar task pair, the behaviour is inconsistent)
* choice of activation function should always be cross-validated
* if computational resources are insufficient for cross-validation, the combination of dropout and the maxout activation function is recommended.
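The convex-objective illustration from the summary can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not an experiment from the paper: the quadratic "task 2" objective, the learning rate and the two starting points are all made up, and the objective is chosen strictly convex so that the minimiser is unique.

```python
def gradient_descent(grad, w0, lr=0.1, steps=500):
    """Plain gradient descent from the starting point w0."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical task-2 objective: f(w) = (w[0] - 3)^2 + 2 * (w[1] + 1)^2.
# It is strictly convex with its unique minimum at w = [3, -1].
def task2_grad(w):
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

after_task1 = [-5.0, 4.0]  # stands in for weights learnt on task 1
fresh_init = [0.0, 0.0]    # stands in for a fresh initialisation

w_from_task1 = gradient_descent(task2_grad, after_task1)
w_from_fresh = gradient_descent(task2_grad, fresh_init)
```

Both runs end at the same point, so nothing about the starting weights (and hence about task 1) survives training on task 2. Non-convex networks do not have to behave this way, which is why the question is an empirical one.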

Actions ~ Transformations

Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
    - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
    - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [t/3, t/2] and [t/2, 2t/3] respectively, and estimated via brute force search during training.
    - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
    - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) push incorrect transformations apart until their distance exceeds a chosen margin.
- ACT Dataset
    - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
    - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
    - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories which are the same action under different subjects, objects and scenes.
- Experiments
    - Action recognition on UCF101, HMDB51, ACT.
    - Cross-category generalization on ACT.
- Visualizations
    - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
    - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context.
    - Embedding retrievals based on transformed precondition embeddings.

**Thoughts**

- Modeling action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute force search over (y, z\_p, z\_e) to estimate the action category and the segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass.
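The transformation view of an action described above can be sketched in a few lines. This is a minimal illustration, assuming the standard contrastive form (distance for the correct action, hinge on the margin for incorrect ones); the exact loss in the paper may differ, and the 2-dimensional embeddings, the margin value and all function names are hypothetical.

```python
import math

def matvec(m, v):
    """Apply a transformation matrix (list of rows) to an embedding."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def cosine_distance(u, v):
    # Assumes both embeddings are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contrastive_loss(pre, eff, transform, is_correct, margin=0.5):
    """Pull the transformed precondition towards the effect for the correct
    action; push it away, up to the margin, for an incorrect action."""
    d = cosine_distance(matvec(transform, pre), eff)
    return d if is_correct else max(0.0, margin - d)

def predict_action(pre, eff, transforms):
    """Pick the action whose transformation best maps precondition to effect."""
    dists = [cosine_distance(matvec(t, pre), eff) for t in transforms]
    return min(range(len(transforms)), key=dists.__getitem__)

# Toy embeddings: the "effect" is the "precondition" rotated by 90 degrees.
pre, eff = [1.0, 0.0], [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
rotate90 = [[0.0, -1.0], [1.0, 0.0]]
best = predict_action(pre, eff, [identity, rotate90])
```

In the full model the same scoring would sit inside the brute force search over z\_p and z\_e, since the embeddings themselves depend on where the video is split.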

Gaussian Processes in Machine Learning

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions.

A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions.

A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure:

1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$
2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$
3. Draw a function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$

Applying this procedure to regression means that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$, yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations, with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X}, \pmb{x})^T \pmb{\Sigma}^{-1} \pmb{\Sigma}(\pmb{X}, \pmb{x}')$, where $\pmb{\Sigma}(\pmb{X},\pmb{x})$ is the vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$.

Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$, resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$.

The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observations (no variance at training points): https://i.imgur.com/BWvsB7T.png

From the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and thus provide a way to include prior knowledge, e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$,

1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$,
2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and
3. apply an optimization algorithm.

It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between data-fit and penalty is performed automatically.

Gaussian Processes provide a very flexible way of finding a suitable regression model. However, they have a high computational complexity of $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated.
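The posterior mean and variance at a single test input can be sketched as below. This is an illustration under assumptions rather than code from the tutorial: it uses one-dimensional inputs, a squared-exponential kernel with unit length scale, a zero prior mean by default, and a hand-written Gaussian elimination in place of a linear algebra library; all function names are made up. Setting `noise_var` to a positive value adds $\sigma_n^2$ to the diagonal of the training covariance, as in the noisy case above.

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs (an assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def gp_posterior(x_train, f_train, x_star, noise_var=0.0, mean=lambda x: 0.0):
    """Posterior mean and variance of f at x_star given the training data."""
    n = len(x_train)
    cov = [[sq_exp_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(xi, x_star) for xi in x_train]  # Sigma(X, x*)
    resid = [f - mean(x) for f, x in zip(f_train, x_train)]  # f - m
    alpha = solve(cov, resid)   # Sigma^{-1} (f - m)
    v = solve(cov, k_star)      # Sigma^{-1} Sigma(X, x*)
    post_mean = mean(x_star) + sum(k * a for k, a in zip(k_star, alpha))
    post_var = sq_exp_kernel(x_star, x_star) - sum(k * vi for k, vi in zip(k_star, v))
    return post_mean, post_var

x_train, f_train = [-1.0, 0.0, 1.0], [1.0, 0.0, 1.0]
mean_at_train, var_at_train = gp_posterior(x_train, f_train, 0.0)
mean_far, var_far = gp_posterior(x_train, f_train, 5.0)
```

With noise-free observations the posterior passes through the training targets with (numerically) zero variance, while far from the data the variance returns towards the prior value $k(x, x) = 1$. The two linear solves are where the $\mathcal{O}(n^3)$ cost appears.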

Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton and Nitish Srivastava and Alex Krizhevsky and Ilya Sutskever and Ruslan R. Salakhutdinov

arXiv e-Print archive - 2012 via Local arXiv

Keywords: cs.NE, cs.CV, cs.LG

**First published:** 2012/07/03 (10 years ago)

**Abstract:** When a large feedforward neural network is trained on a small training set,
it typically performs poorly on held-out test data. This "overfitting" is
greatly reduced by randomly omitting half of the feature detectors on each
training case. This prevents complex co-adaptations in which a feature detector
is only helpful in the context of several other specific feature detectors.
Instead, each neuron learns to detect a feature that is generally helpful for
producing the correct answer given the combinatorially large variety of
internal contexts in which it must operate. Random "dropout" gives big
improvements on many benchmark tasks and sets new records for speech and object
recognition.

This paper introduced Dropout, a new layer type. It has a parameter $\alpha \in (0, 1)$. The output dimensionality of a dropout layer is equal to its input dimensionality. During training, each neuron's output is set to 0 with probability $\alpha$. At testing time, the output of all neurons is multiplied by $1 - \alpha$ to compensate for the fact that no output is set to 0 (for the paper's setting of $\alpha = 0.5$ this is the same as multiplying by $\alpha$).

A much better paper, by the same authors but 2 years later, is [Dropout: a simple way to prevent neural networks from overfitting](http://www.shortscience.org/paper?bibtexKey=journals/jmlr/SrivastavaHKSS14).

Dropout can be interpreted as training an ensemble of many networks which share weights. It was notably used by [ImageNet Classification with Deep Convolutional Neural Networks](http://www.shortscience.org/paper?bibtexKey=krizhevsky2012imagenet).
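The training-time and test-time behaviour of such a layer can be sketched as follows. This is a minimal illustration with made-up function names, not the authors' code. It scales test-time outputs by the keep probability $1 - \alpha$, which is what keeps the expected activation unchanged and which coincides with $\alpha$ for the paper's setting of $\alpha = 0.5$.

```python
import random

def dropout_train(activations, drop_prob, rng):
    """Training time: zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]

def dropout_test(activations, drop_prob):
    """Test time: keep every activation, scaled by the keep probability."""
    keep_prob = 1.0 - drop_prob
    return [a * keep_prob for a in activations]

rng = random.Random(0)
hidden = [2.0, 4.0, 6.0, 8.0]
train_out = dropout_train(hidden, drop_prob=0.5, rng=rng)
test_out = dropout_test(hidden, drop_prob=0.5)
```

Averaged over many random masks, the training-time output of a unit matches its scaled test-time output, which is the sense in which the test network approximates the ensemble of thinned networks.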
