arXiv is an e-print service in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance and statistics.

- 1996 1
- 2010 1
- 2011 1
- 2012 8
- 2013 31
- 2014 43
- 2015 134
- 2016 231
- 2017 182
- 2018 141
- 2019 100
- 2020 33
- 2021 10
- 2022 7
- 2023 3

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Ian J. Goodfellow and Mehdi Mirza and Da Xiao and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

**First published:** 2013/12/21 (10 years ago)

**Abstract:** Catastrophic forgetting is a problem faced by many machine learning models
and algorithms. When trained on one task, then trained on a second task, many
machine learning models "forget" how to perform the first task. This is widely
believed to be a serious problem for neural networks. Here, we investigate the
extent to which the catastrophic forgetting problem occurs for modern neural
networks, comparing both established and recent gradient-based training
algorithms and activation functions. We also examine the effect of the
relationship between the first task and the second task on catastrophic
forgetting. We find that it is always best to train using the dropout
algorithm--the dropout algorithm is consistently best at adapting to the new
task, remembering the old task, and has the best tradeoff curve between these
two extremes. We find that different tasks and relationships between tasks
result in very different rankings of activation function performance. This
suggests the choice of activation function should always be cross-validated.
more
less

Ian J. Goodfellow and Mehdi Mirza and Da Xiao and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

[link]
The paper discusses and empirically investigates by empirical testing the effect of "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained to perform if retrained to perform a second task. An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt by doing the first task), the training of the second task will always end in the global minimum, thus totally "forgetting" the first one. Neuroscientific evidence (and common sense) suggest that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. Namely, if (i) the two tasks are *functionally identical but input is presented in a different format* or if (ii) *tasks are similar* and the third case for (iii) *dissimilar tasks*. Relevant examples may be provided respectively by (i) performing the same image classification task starting from two different image representations as RGB or HSL, (ii) performing image classification tasks with semantically similar as classifying two similar animals and (iii) performing a text classification followed by image classification. The problem is investigated by an empirical study covering two methods of training ("SGD" and "dropout") combined with 4 activations functions (logistic sigmoid, RELU, LWTA, Maxout). A random search is carried out on these parameters. From a practitioner's point of view, it is interesting to note that dropout has been set to 0.5 in hidden units and 0.2 in the visible one since this is a reasonably well-known parameter. ## Why the paper is important It is apparently the first to provide a systematic empirical analysis of CF. Establishes a framework and baselines to face the problem. ## Key conclusions, takeaways and modelling remarks * dropout helps in preventing CF * dropout seems to increase the optimal model size with respect to the model without dropout * choice of activation function has a less consistent effect than dropout\no dropout choice * dissimilar task experiment provides a notable exception of then dissimilar task experiment * the previous hypothesis that LWTA activation is particularly resistant to CF is rejected (even if it performs best in the new task in the dissimilar task pair the behaviour is inconsistent) * choice of activation function should always be cross-validated * If computational resources are insufficient for cross-validation the combination dropout + maxout activation function is recommended. |

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and Daan Wierstra and Martin Riedmiller

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

**First published:** 2013/12/19 (10 years ago)

**Abstract:** We present the first deep learning model to successfully learn control
policies directly from high-dimensional sensory input using reinforcement
learning. The model is a convolutional neural network, trained with a variant
of Q-learning, whose input is raw pixels and whose output is a value function
estimating future rewards. We apply our method to seven Atari 2600 games from
the Arcade Learning Environment, with no adjustment of the architecture or
learning algorithm. We find that it outperforms all previous approaches on six
of the games and surpasses a human expert on three of them.
more
less

Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and Daan Wierstra and Martin Riedmiller

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

[link]
* They use an implementation of Q-learning (i.e. reinforcement learning) with CNNs to automatically play Atari games. * The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No hand-engineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would. * The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players. ### How * Deep Q Learning * *This is yet another explanation of deep Q learning, see also [this blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) for longer explanation.* * While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated. * `state1` is the current game state. The agent only sees the pixels of that state. (Example: Screen shows enemy.) * `action1` is an action that the agent chooses. (Example: Shoot!) * `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.) * `state2` is the next game state, after the action was chosen in `state1`. (Example: Screen shows dead enemy.) * One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory. * Direct reward * After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) -> direct reward`. * That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) -> (direct reward of action1, direct reward of action2, direct reward of action3)`. * We can then sample examples from our replay memory. The input per example is the screen. The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1 then our output/label for that example would be `(1, 0, 0)`. * We can then train the model by playing completely randomly for some time, then sample some batches and train using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on. * Indirect reward * Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model e.g. would never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later. * To learn the indirect reward, one simply adds the reward value of highest reward action according to `Q(state2)` to the direct reward. * I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`). (Where `highestReward()` returns the reward of the action with the highest reward according to Q().) * By training to predict `reward + highestReward(Q(screen2))` the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does. * We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q will automatically learn to stack it, e.g. `state3` will be discounted to `gamma^2` at `state1`. * This paper * They use the mentioned Deep Q Learning to train their model Q. * They use a k-th frame technique, i.e. they let the model decide upon an action at (here) every 4th frame. * Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects that onto the rewards of 4 to 18 actions. * The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame. So the network knows more about what things are moving how. * The network architecture is: * 84x84x4 (input) * 16 convs, 8x8, stride 4, ReLU * 32 convs, 4x4, stride 2, ReLU * 256 fully connected neurons, ReLU * <N_actions> fully connected neurons, linear * They use a replay memory of 1 million frames. ### Results * They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders. * Same architecture and hyperparameters for all games. * Rewards were based on score changes in the games, i.e. they used +1 (score increases) and -1 (score decreased). * Optimizer: RMSProp, Batch Size: 32. * Trained for 10 million examples/frames per game. * They had no problems with instability and their average Q value per game increased smoothly. * Their method beats all other state of the art methods. * They managed to beat a human player in games that required not so much "long" term strategies (the less frames the better). * Video: starts at 46:05. https://youtu.be/dV80NAlEins?t=46m05s ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Playing_Atari_with_Deep_Reinforcement_Learning__algorithm.png?raw=true "Algorithm") *The original full algorithm, as shown in the paper.* -------------------- ### Rough chapter-wise notes * (1) Introduction * Problems when using neural nets in reinforcement learning (RL): * Reward signal is often sparse, noise and delayed. * Often assumption that data samples are independent, while they are correlated in RL. * Data distribution can change when the algorithm learns new behaviours. * They use Q-learning with a CNN and stochastic gradient descent. * They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training). * They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE). * They use only the visible pixels as input to the network, i.e. no manual feature extraction. * (2) Background * blablabla, standard deep q learning explanation * (3) Related Work * TD-Backgammon: "Solved" backgammon. Worked similarly to Q-learning and used a multi-layer perceptron. * Attempts to copy TD-Backgammon to other games failed. * Research was focused on linear function approximators as there were problems with non-linear ones diverging. * Recently again interest in using neural nets for reinforcement learning. Some attempts to fix divergence problems with gradient temporal-difference methods. * NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them end-to-end. * HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws. * (4) Deep Reinforcement Learning * They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features. * The network is supposes to run on the raw RGB images. * They use experience replay, i.e. store tuples of (pixels, chosen action, received reward) in a memory and use that during training. * They use Q-learning. * They use an epsilon-greedy policy. * Advantages from using experience replay instead of learning "live" during game playing: * Experiences can be reused many times (more efficient). * Samples are less correlated. * Learned parameters from one batch don't determine as much the distributions of the examples in the next batch. * They save the last N experiences and sample uniformly from them during training. * (4.1) Preprocessing and Model Architecture * Raw Atari images are 210x160 pixels with 128 possible colors. * They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them. * They also convert the images to grayscale. * They use the last 4 frames as input and stack them. * So their network input has shape 84x84x4. * They use one output neuron per possible action. So they can compute the Q-value (expected reward) of each action with one forward pass. * Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs stride 2 ReLU => fc 256, ReLU => fc N actions, linear * 4 to 18 actions/outputs (depends on the game). * Aside from the outputs, the architecture is the same for all games. * (5) Experiments * Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders * They use the same architecture und hyperparameters for all games. * They give a reward of +1 whenever the in-game score increases and -1 whenever it decreases. * They use RMSProp. * Mini batch size was 32. * They train for 10 million frames/examples. * They initialize epsilon (in their epsilon greedy strategy) to 1.0 and decrease it linearly to 0.1 at one million frames. * They let the agent decide upon an action at every 4th in-game frame (3rd in space invaders). * (5.1) Training and stability * They plot the average reward und Q-value per N games to evaluate the agent's training progress, * The average reward increases in a noisy way. * The average Q value increases smoothly. * They did not experience any divergence issues during their training. * (5.2) Visualizating the Value Function * The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames). * (5.3) Main Evaluation * They compare to three other methods that use hand-engineered features and/or use the pixel data combined with significant prior knownledge. * They mostly outperform the other methods. * They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames. |

Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks

Caglar Gulcehre and Kyunghyun Cho and Razvan Pascanu and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.NE, cs.LG, stat.ML

**First published:** 2013/11/07 (10 years ago)

**Abstract:** In this paper we propose and investigate a novel nonlinear unit, called $L_p$
unit, for deep neural networks. The proposed $L_p$ unit receives signals from
several projections of a subset of units in the layer below and computes a
normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$
unit. First, the proposed unit can be understood as a generalization of a
number of conventional pooling operators such as average, root-mean-square and
max pooling widely used in, for instance, convolutional neural networks (CNN),
HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain
degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013)
which achieved the state-of-the-art object recognition results on a number of
benchmark datasets. Secondly, we provide a geometrical interpretation of the
activation function based on which we argue that the $L_p$ unit is more
efficient at representing complex, nonlinear separating boundaries. Each $L_p$
unit defines a superelliptic boundary, with its exact shape defined by the
order $p$. We claim that this makes it possible to model arbitrarily shaped,
curved boundaries more efficiently by combining a few $L_p$ units of different
orders. This insight justifies the need for learning different orders for each
unit in the model. We empirically evaluate the proposed $L_p$ units on a number
of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$
units achieve the state-of-the-art results on a number of benchmark datasets.
Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep
recurrent neural networks (RNN).
more
less

Caglar Gulcehre and Kyunghyun Cho and Razvan Pascanu and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.NE, cs.LG, stat.ML

[link]
#### Problem addressed: A new type of activation function #### Summary: This paper propose a new activation function that computes a Lp norm from multiple projections on an input vector. The p value can be learned from training example, and can also be different for each hidden unit. The intuition is that 1) for different datasets there may exist different optimal p-values, so it make more sense to make p tunable; 2) allowing different unit take different p-values can potentially make the approximation of decision boundaries more efficient and more flexible. The empirical results support these two intuitions, and achieved comparable results on three datasets. #### Novelty: A generalization of pooling but applied through channels, when the data and weight vector dot product plus bias is constrained to non-negative case, the $L_\infty$ is equivalent to maxout unit. #### Drawbacks: Empirical performance is not very impressive, although evidence of supporting the intuition occurs. #### Datasets: MNIST, TFD, Pentomino #### Resources: http://arxiv.org/abs/1311.1780 #### Presenter: Yingbo Zhou |

Bounding the Test Log-Likelihood of Generative Models

Yoshua Bengio and Li Yao and Kyunghyun Cho

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

**First published:** 2013/11/24 (10 years ago)

**Abstract:** Several interesting generative learning algorithms involve a complex
probability distribution over many random variables, involving intractable
normalization constants or latent variable normalization. Some of them may even
not have an analytic expression for the unnormalized probability function and
no tractable approximation. This makes it difficult to estimate the quality of
these models, once they have been trained, or to monitor their quality (e.g.
for early stopping) while training. A previously proposed method is based on
constructing a non-parametric density estimator of the model's probability
function from samples generated by the model. We revisit this idea, propose a
more efficient estimator, and prove that it provides a lower bound on the true
test log-likelihood, and an unbiased estimator as the number of generated
samples goes to infinity, although one that incorporates the effect of poor
mixing. We further propose a biased variant of the estimator that can be used
reliably with a finite number of samples for the purpose of model comparison.
more
less

Yoshua Bengio and Li Yao and Kyunghyun Cho

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

[link]
#### Problem addressed: Evaluation and comparison of generative models #### Summary: This paper improves upon an existing non parametric estimator by sampling from hidden variables instead of features. They present an unbiased estimator and prove it asymptotically converges to true distribution with number of samples. They also prove that the expected value of unbiased estimator is a lower bound on the true distribution. They also present a biased estimator with a different sampling scheme. They empirically validate their estimators using MNIST dataset on different generative models #### Novelty: Sampling from hidden space for non-parametric estimation #### Drawbacks: This method works only for models which have hidden variables. Application for deep networks is not clear. Procedure for sampling from hidden variables is not explicitly mentioned. Assumes that P(x|h) is easily calculated from the model #### Datasets: MNIST #### Resources: paper: http://arxiv.org/pdf/1311.6184v4.pdf #### Presenter: Bhargava U. Kota |

Better Mixing via Deep Representations

Bengio, Yoshua and Mesnil, Grégoire and Dauphin, Yann and Rifai, Salah

International Conference on Machine Learning - 2013 via Local Bibsonomy

Keywords: dblp

Bengio, Yoshua and Mesnil, Grégoire and Dauphin, Yann and Rifai, Salah

International Conference on Machine Learning - 2013 via Local Bibsonomy

Keywords: dblp

[link]
#### Problem addressed: It has been empirically observed that deep representations lead to better mode mixing when sampled using MCMC. The authors present a set of hypotheses as to why this happens and confirm them empirically. #### Summary: The paper claims that deep representations (specially from parametric models) disentangle the factors of variations in the raw feature space. This disentangling leads to better ""mode mixing"" during MCMC sampling. For eg., in faces, the factors of variation could be identity-pose-illumination. If the higher layer learns these features then changing the representation in this space starting from a ""valid"" point would lead to changes in each of these factors directly and hence will produce ""valid"" images, which in the original feature space would be far apart; thus better mode mixing. This hypothesis is explained using 2 additional ones: (a) the manifold structure of the ""valid"" data is flattened in the higher layer space, and (b) the fraction of total volume occupied by high probability (valid) points is larger in the higher layer space. While (a) should lead to better interpolation in higher layer space, (b) should lead to more valid points in a parzen window around any known sample. These are confirmed experimentally. #### Novelty: novel intuitions why deep representations are good for generative modeling. #### Drawbacks: no theoretical justification #### Datasets: MNIST, Toronto Face dataset (TFD) #### Additional remarks: used DBN and Deep CAE for experiments on the datasets #### Presenter: Devansh Arpit |

Estimating or Propagating Gradients Through Stochastic Neurons

Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

**First published:** 2013/05/14 (11 years ago)

**Abstract:** Stochastic neurons can be useful for a number of reasons in deep learning
models, but in many cases they pose a challenging problem: how to estimate the
gradient of a loss function with respect to the input of such stochastic
neurons, i.e., can we "back-propagate" through these stochastic neurons? We
examine this question, existing approaches, and present two novel families of
solutions, applicable in different settings. In particular, it is demonstrated
that a simple biologically plausible formula gives rise to an an unbiased (but
noisy) estimator of the gradient with respect to a binary stochastic neuron
firing probability. Unlike other estimators which view the noise as a small
perturbation in order to estimate gradients by finite differences, this
estimator is unbiased even without assuming that the stochastic perturbation is
small. This estimator is also interesting because it can be applied in very
general settings which do not allow gradient back-propagation, including the
estimation of the gradient with respect to future rewards, as required in
reinforcement learning setups. We also propose an approach to approximating
this unbiased but high-variance estimator by learning to predict it using a
biased estimator. The second approach we propose assumes that an estimator of
the gradient can be back-propagated and it provides an unbiased estimator of
the gradient, but can only work with non-linearities unlike the hard threshold,
but like the rectifier, that are not flat for all of their range. This is
similar to traditional sigmoidal units but has the advantage that for many
inputs, a hard decision (e.g., a 0 output) can be produced, which would be
convenient for conditional computation and achieving sparse representations and
sparse gradients.
more
less

Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

[link]
#### Problem addressed: Gradient estimation for stochastic neurons #### Summary: This paper proposed an unbiased estimator of stochastic units so that one can use gradient based learning. In addition, it also proposed a simple, biased estimator called straight through. #### Novelty: A new approach for estimating gradient for stochastic units. #### Drawbacks: The proposed unbised estimator seems to have large variance, and the biased one seems not performing very well in practice #### Presenter: Yingbo Zhou |

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2013/12/20 (10 years ago)

**Abstract:** How can we perform efficient inference and learning in directed probabilistic
models, in the presence of continuous latent variables with intractable
posterior distributions, and large datasets? We introduce a stochastic
variational inference and learning algorithm that scales to large datasets and,
under some mild differentiability conditions, even works in the intractable
case. Our contributions is two-fold. First, we show that a reparameterization
of the variational lower bound yields a lower bound estimator that can be
straightforwardly optimized using standard stochastic gradient methods. Second,
we show that for i.i.d. datasets with continuous latent variables per
datapoint, posterior inference can be made especially efficient by fitting an
approximate inference model (also called a recognition model) to the
intractable posterior using the proposed lower bound estimator. Theoretical
advantages are reflected in experimental results.
more
less

Diederik P Kingma and Max Welling

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
#### Problem addressed: Variational learning of Bayesian networks #### Summary: This paper present a generic method for learning belief networks, which uses variational lower bound for the likelihood term. #### Novelty: Uses a re-parameterization trick to change random variables to deterministic function plus a noise term, so one can apply normal gradient based learning #### Drawbacks: The resulting model marginal likelihood is still intractible, may not be very good for applications that require the use of actual values of the marginal probablities #### Datasets: MNIST, Frey face #### Additional remarks: Experimentally compared with wake sleep algorithm on logliklihood lower bound as well as estimated marginal likelihood #### Resources: Implementation: https://github.com/y0ast/Variational-Autoencoder #### Presenter: Yingbo Zhou |

Intriguing properties of neural networks

Christian Szegedy and Wojciech Zaremba and Ilya Sutskever and Joan Bruna and Dumitru Erhan and Ian Goodfellow and Rob Fergus

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.CV, cs.LG, cs.NE

**First published:** 2013/12/21 (10 years ago)

**Abstract:** Deep neural networks are highly expressive models that have recently achieved
state of the art performance on speech and visual recognition tasks. While
their expressiveness is the reason they succeed, it also causes them to learn
uninterpretable solutions that could have counter-intuitive properties. In this
paper we report two such properties.
First, we find that there is no distinction between individual high level
units and random linear combinations of high level units, according to various
methods of unit analysis. It suggests that it is the space, rather than the
individual units, that contains of the semantic information in the high layers
of neural networks.
Second, we find that deep neural networks learn input-output mappings that
are fairly discontinuous to a significant extend. We can cause the network to
misclassify an image by applying a certain imperceptible perturbation, which is
found by maximizing the network's prediction error. In addition, the specific
nature of these perturbations is not a random artifact of learning: the same
perturbation can cause a different network, that was trained on a different
subset of the dataset, to misclassify the same input.
more
less

Christian Szegedy and Wojciech Zaremba and Ilya Sutskever and Joan Bruna and Dumitru Erhan and Ian Goodfellow and Rob Fergus

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.CV, cs.LG, cs.NE

[link]
The paper introduces two key properties of deep neural networks: - Semantic meaning of individual units. - Earlier works analyzed learnt semantics by finding images that maximally activate individual units. - Authors observe that there is no difference between individual units and random linear combinations of units. - It is the entire space of activations that contains the bulk of semantic information. - Stability of neural networks to small perturbations in input space. - Networks that generalize well are expected to be robust to small perturbations in the input, i.e. imperceptible noise in the input shouldn't change the predicted class. - Authors find that networks can be made to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. - These 'adversarial examples' generalize well to different architectures trained on different data subsets. ## Strengths - The authors propose a way to make networks more robust to small perturbations by training them with adversarial examples in an adaptive manner, i.e. keep changing the pool of adversarial examples during training. In this regard, they draw a connection with hard-negative mining, and a network trained with adversarial examples performs better than others. - Formal description of how to generate adversarial examples and mathematical analysis of a network's stability to perturbations are useful studies. ## Weaknesses / Notes - Two images that are visually indistinguishable to humans but classified differently by the network is indeed an intriguing observation. - The paper feels a little half-baked in parts, and some ideas could've been presented more clearly. |

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Pierre Sermanet and David Eigen and Xiang Zhang and Michael Mathieu and Rob Fergus and Yann LeCun

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.CV

**First published:** 2013/12/21 (10 years ago)

**Abstract:** We present an integrated framework for using Convolutional Networks for
classification, localization and detection. We show how a multiscale and
sliding window approach can be efficiently implemented within a ConvNet. We
also introduce a novel deep learning approach to localization by learning to
predict object boundaries. Bounding boxes are then accumulated rather than
suppressed in order to increase detection confidence. We show that different
tasks can be learned simultaneously using a single shared network. This
integrated framework is the winner of the localization task of the ImageNet
Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very
competitive results for the detection and classifications tasks. In
post-competition work, we establish a new state of the art for the detection
task. Finally, we release a feature extractor from our best model called
OverFeat.
more
less

Pierre Sermanet and David Eigen and Xiang Zhang and Michael Mathieu and Rob Fergus and Yann LeCun

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.CV

[link]
## Terms * Classification: Assign one label to one image * Localization: Many object, there might be multiple instances of the same class. * Detection: One big dominant object which has to be found and to be classified. Pretty much first classification and then finding a bounding box for the class within the image ## Evaluation * ILSVRC2013, localization task * ILSVRC2013, detection task * ILSVRC2013, classification task ## Implementations * https://github.com/sermanet/OverFeat |

Big Neural Networks Waste Capacity

Dauphin, Yann and Bengio, Yoshua

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Dauphin, Yann and Bengio, Yoshua

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The paper 'Big Neural Networks Waste Capacity' recognizes that adding more layer / parameters does not improve accuracy. When reading this paper, one should bear in mind that it was written well before [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15) or DenseNets. In the experiments, they applied MLPs to SIFT features of ImageNet LSVRC-2010. **Do not read this paper**. Instead, you might want to read the "Deep Residual Learning for Image Recognition". It makes the same point, but clearer and offers a solution to the underfitting problem. ## Criticism I don't understand why they write about k-means. > Assuming minimal error in the human labelling of the dataset, it should be possible to reach errors close to 0%. For ImageNet, the human labeling error is estimated at about 5% (I can't find the source for that, though) > Improvements on ImageNet are thought to be a good proxy for progress in object recognition (Deng et al., 2009). ImageNet images are very different from "typical web images" like the [100 million images Flickr dataset](http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for). |

Network In Network

Min Lin and Qiang Chen and Shuicheng Yan

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.NE, cs.CV, cs.LG

**First published:** 2013/12/16 (10 years ago)

**Abstract:** We propose a novel deep network structure called "Network In Network" (NIN)
to enhance model discriminability for local patches within the receptive field.
The conventional convolutional layer uses linear filters followed by a
nonlinear activation function to scan the input. Instead, we build micro neural
networks with more complex structures to abstract the data within the receptive
field. We instantiate the micro neural network with a multilayer perceptron,
which is a potent function approximator. The feature maps are obtained by
sliding the micro networks over the input in a similar manner as CNN; they are
then fed into the next layer. Deep NIN can be implemented by stacking mutiple
of the above described structure. With enhanced local modeling via the micro
network, we are able to utilize global average pooling over feature maps in the
classification layer, which is easier to interpret and less prone to
overfitting than traditional fully connected layers. We demonstrated the
state-of-the-art classification performances with NIN on CIFAR-10 and
CIFAR-100, and reasonable performances on SVHN and MNIST datasets.
more
less

Min Lin and Qiang Chen and Shuicheng Yan

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.NE, cs.CV, cs.LG

[link]
A paper in the intersection for Computer Vision and Machine Learning. They propose a method (network in network) to reduce parameters. Essentially, it boils down to a pattern of (conv with size > 1) -> (1x1 conv) -> (1x1 conv) -> repeat ## Datasets state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST ## Implementations * [Lasagne](https://github.com/Lasagne/Recipes/blob/master/modelzoo/cifar10_nin.py) |

An empirical analysis of dropout in piecewise linear networks

David Warde-Farley and Ian J. Goodfellow and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

**First published:** 2013/12/21 (10 years ago)

**Abstract:** The recently introduced dropout training criterion for neural networks has
been the subject of much attention due to its simplicity and remarkable
effectiveness as a regularizer, as well as its interpretation as a training
procedure for an exponentially large ensemble of networks that share
parameters. In this work we empirically investigate several questions related
to the efficacy of dropout, specifically as it concerns networks employing the
popular rectified linear activation function. We investigate the quality of the
test time weight-scaling inference procedure by evaluating the geometric
average exactly in small models, as well as compare the performance of the
geometric mean to the arithmetic mean more commonly employed by ensemble
techniques. We explore the effect of tied weights on the ensemble
interpretation by training ensembles of masked networks without tied weights.
Finally, we investigate an alternative criterion based on a biased estimator of
the maximum likelihood ensemble gradient.
more
less

David Warde-Farley and Ian J. Goodfellow and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2013 via Local arXiv

Keywords: stat.ML, cs.LG, cs.NE

[link]
This paper analyses fully connected networks with dropout and ReLU activation functions. |

Visualizing and Understanding Convolutional Networks

Zeiler, Matthew D. and Fergus, Rob

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: cnn, deeplearning

Zeiler, Matthew D. and Fergus, Rob

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: cnn, deeplearning

[link]
The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis. ## Analyzation techniques ### Visualization A multi-layer deconvolutional network is used to project the feature activations back into pixel space, showing what input pattern originally caused a given activation in the feature maps. The idea is to train a network which is given the result of a layer $L_i$ and has to reconstruct the input feature map of $L_i$. This is repeated until the input image is reached. The deconv-net has a special **unpooling layer**: The max-pooling layers have to save where an activation came from and store those to a switch variable, which is used in unpooling. ### Occlusion sensitivity analysis * Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier. * Create an image like this: * Run Occlude(I, x, y) for all $(x, y)$ (possible with stride) * At $(x, y)$, either ... * (d) ... place a pixel which color-encodes the probability of the correct class * (e) ... place a pixel which color-encodes the most probable class The following image from the Zeiler & Fergus paper visualizes this pretty well: If the dogs face is occluded, the probability of the correct class drops a lot: ![Imgur](http://i.imgur.com/Q1Ama2z.png) If the dogs face is occluded, the most likely class suddenly is "tennisball" and no longer "Pomeranian". ![Imgur](http://i.imgur.com/5QYKh7b.png) See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma). ## How visualization helped to construct ZF-Net * "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> Lower filter size from $11 \times 11$ to $7 \times 7$ * "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> Lower stride from 4 to 2 * The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct. ## ZF-Net Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers. Training setup: * **Preprocessing**: Resize smallest dimension to 256, per-pixel mean subtraction per channel, crop $224\text{px} \times 224\text{px}$ region * **Optimization**: Mini-Batch SGD, learning rate $= 10^{-2}$, momentum = $0.9$, 70 epochs * **Resources**: took around 12 days on a single GTX580 GPU The network was evaluated on * ImageNet 2012: 14.8% error * Caltech-101: $86.5 \pm 0.5$ (pretrained on ImageNet) * Caltech-256: $74.2\% \pm 0.3$ (pretrained on ImageNet) ## Minor errors * typo: "goes give" (also: something went wrong with the link there - the whole block is a link) |

Efficient estimation of word representations in vector space

Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey

arXiv preprint arXiv:1301.3781 - 2013 via Local Bibsonomy

Keywords: thema:deepwalk, language, modelling, skipgram

Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey

arXiv preprint arXiv:1301.3781 - 2013 via Local Bibsonomy

Keywords: thema:deepwalk, language, modelling, skipgram

[link]
## Introduction * Introduces techniques to learn word vectors from large text datasets. * Can be used to find similar words (semantically, syntactically, etc). * [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf) * [Link to open source implementation](https://code.google.com/archive/p/word2vec/) ## Model Architecture * Computational complexity defined in terms of a number of parameters accessed during model training. * Proportional to $E*T*Q$ * *E* - Number of training epochs * *T* - Number of words in training set * *Q* - depends on the model ### Feedforward Neural Net Language Model (NNLM) * Probabilistic model with input, projection, hidden and output layer. * Input layer encodes N previous word using 1-of-V encoding (V is vocabulary size). * Input layer projected to projection layer P with dimensionality *N\*D* * Hidden layer (of size *H*) computes the probability distribution over all words. * Complexity per training example $Q =N*D + N*D*H + H*V$ * Can reduce *Q* by using hierarchical softmax and Huffman binary tree (for storing vocabulary). ### Recurrent Neural Net Language Model (RNNLM) * Similar to NNLM minus the projection layer. * Complexity per training example $Q =H*H + H*V$ * Hierarchical softmax and Huffman tree can be used here as well. ## Log-Linear Models * Nonlinear hidden layer causes most of the complexity. * NNLMs can be successfully trained in two steps: * Learn continuous word vectors using simple models. * N-gram NNLM trained over the word vectors. ### Continuous Bag-of-Words Model * Similar to feedforward NNLM. * No nonlinear hidden layer. * Projection layer shared for all words and order of words does not influence projection. * Log-linear classifier uses a window of words to predict the middle word. * $Q = N*D + D*\log_2V$ ### Continuous Skip-gram Model * Similar to Continuous Bag-of-Words but uses the middle world of the window to predict the remaining words in the window. * Distant words are given less weight by sampling fewer distant words. * $Q = C*(D + D*log_2 V$) where *C* is the max distance of the word from the middle word. * Given a *C* and a training data, a random *R* is chosen in range *1 to C*. * For each training word, *R* words from history (previous words) and *R* words from future (next words) are marked as target output and model is trained. ## Results * Skip-gram beats all other models for semantic accuracy tasks (eg - relating Athens with Greece). * Continuous Bag-of-Words Model outperforms other models for semantic accuracy tasks (eg great with greater) - with skip-gram just behind in performance. * Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) for Microsoft Research Sentence Completion Challenge. * Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"). |

Knapsack Constrained Contextual Submodular List Prediction with Application to Multi-document Summarization

Jiaji Zhou and Stephane Ross and Yisong Yue and Debadeepta Dey and J. Andrew Bagnell

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

**First published:** 2013/08/16 (10 years ago)

**Abstract:** We study the problem of predicting a set or list of options under knapsack
constraint. The quality of such lists are evaluated by a submodular reward
function that measures both quality and diversity. Similar to DAgger (Ross et
al., 2010), by a reduction to online learning, we show how to adapt two
sequence prediction models to imitate greedy maximization under knapsack
constraint problems: CONSEQOPT (Dey et al., 2012) and SCP (Ross et al., 2013).
Experiments on extractive multi-document summarization show that our approach
outperforms existing state-of-the-art methods.
more
less

Jiaji Zhou and Stephane Ross and Yisong Yue and Debadeepta Dey and J. Andrew Bagnell

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.LG

[link]
This paper is a direct extension of the work of \cite{journals/corr/1305.2532}. Prior to this work, \cite{journals/corr/1305.2532} proposed an algorithm for learning to predict fixed size lists using either a single greedy policy (SCP) or a list of sequentially applied policies (CONSEQOPT). Given a sub-modular reward function, \cite{journals/corr/1305.2532} provided theoretical guarantees on the regret of the learned policies via a reduction to cost sensitive classification. In practice one can then approximate the cost sensitive classification loss with convex surrogate losses that can be efficiently learned. In this work, the authors extend the framework of \cite{journals/corr/1305.2532} to knapsack problems, wherein each element in the list has an associated length and the sum total lengths may not exceed a fixed budget. (Note that I intentionally use the word "length" to differentiate from the "cost" of cost-sensitive classification in the reduction, which gets a bit confusing in the paper as the authors overload the word "cost" frequently.) They extend the algorithm of \cite{journals/corr/1305.2532} to be sensitive to the length of items added to the list, and extend the analysis of \cite{journals/corr/1305.2532} to provide regret guarantees in this setting. The authors then apply their algorithm to the problem of multi-document summarization, where items correspond to sentences and the budget constraints the total length of the summary. They show an improvement over prior work with their new approach. The paper is reasonably well written and straightforward to follow, with the exception of containing lots of notation and terminology that is sometimes difficult to keep straight (e.g. the definitions of the words "cost" and "weight", and the use of \ell as both cost sensitive losses and the length (cost? weight?) of each item for the budget.) The work itself is a useful extension to prior work that appears to be experimentally sound. Pros: + Novel extension of prior work to incorporate budgetary constraints + Works well experimentally Cons: - Overloading of symbols and words leads to confusion References: \cite{journals/corr/1305.2532} Ross, Stephane, Zhou, Jiaji, Yue, Yisong, Dey, Debadeepta, and Bagnell, .A. Learning policies for contextual submodular prediction. In ICML, 2013. |

Predicting Parameters in Deep Learning

Denil, Misha and Shakibi, Babak and Dinh, Laurent and Ranzato, Marc'Aurelio and de Freitas, Nando

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

Denil, Misha and Shakibi, Babak and Dinh, Laurent and Ranzato, Marc'Aurelio and de Freitas, Nando

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
Motivated by recent attempts to learn very large networks this work proposes an approach for reducing the number of free parameters in neural-network type architectures. The method is based on the intuition that there is typically strong redundancy in the learned parameters (for instance, the first layer filters of of NNs applied to images are smooth): The authors suggest to learn only a subset of the parameter values and to then predicted the remaining ones through some form of interpolation. The proposed approach is evaluated for several architectures (MLP, convolutional NN, reconstruction-ICA) and different vision datasets (MNIST, CIFAR, STL-10). The results suggest that in general it is sufficient to learn fewer than 50% of the parameters without any loss in performance (significantly fewer parameters seem sufficient for MNIST). The method is relatively simple: The authors assume a low-rank decomposition of the weight matrix and then further fix one of the two matrices using prior knowledge about the data (e.g., in the vision case, exploiting the fact that nearby pixels - and weights - tend to be correlated). This can be interpreted as predicting the "unobserved" parameters from the subset of learned filter weights via kernel ridge regression, where the kernel captures prior knowledge about the topology / "smoothness" of the weights. For the situation when such prior knowledge is not available the authors describe a way to learn a suitable kernel from data. The idea of reducing the number of parameters in NN-like architectures through connectivity constraints in itself is of course not novel, and the authors provide a pretty good discussion of related work in section 5. Their method is very closely related to the idea of factorizing weight matrices as is, for instance, commonly done for 3-way RBMs (e.g. ref [22] in the paper), but also occasionally for standard RBMs (e.g. [R1], missing in the paper). The present papers differs from these in that the authors propose to exploit prior knowledge to constrain one of the matrices. As also discussed by the authors, the approach can further be interpreted as a particular type of pooling -- a strategy commonly employed in convolutional neural networks. Another view of the proposed approach is that the filters are represented as a linear combination of basis functions (in the paper, the particular form of the basis functions is determined by the choice of kernel). Such representations have been explored in various forms and to various ends in the computer vision and signal processing literature (see e.g. [R2,R3,R4,R5]). [R4,R5], for instance, represent filters in terms of a linear combination of basis functions that reduce the computational complexity of the filtering process). |

Distributed representations of words and phrases and their compositionality

Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff

Advances in neural information processing systems - 2013 via Local Bibsonomy

Keywords: thema:deepwalk, language, modelling, representation

Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff

Advances in neural information processing systems - 2013 via Local Bibsonomy

Keywords: thema:deepwalk, language, modelling, representation

[link]
The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper): which learns linear word embeddings that are particularly useful for analogical reasoning type tasks. The extensions proposed (namely, negative sampling and sub-sampling of high frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase level embeddings by slightly tweaking the original training algorithm. This paper proposes 3 improvements for the skip-gram model which allows for learning embeddings for words. The first improvement is subsampling frequent word, the second is the use of a simplified version of noise constrastive estimation (NCE) and finally they propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad-hoc. In practice, both the subsampling and negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings. The article does not propose any explanation as to why the negative sampling produces better results than NCE which it is suppose to loosely approximate. In fact it doesn't explain why besides the obvious generalization gain the negative sampling scheme should be preferred to NCE since they achieve similar speeds. |

Learning Semantic Script Knowledge with Event Embeddings

Modi, Ashutosh and Titov, Ivan

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Modi, Ashutosh and Titov, Ivan

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper investigates a model which aims at predicting the order of events; each event is an english sentence. While previous methods relied on a graph representation to infer the right order, the proposed model is made of two stages. The first stage use a continuous representation of a verb frame, where the predicate and its arguments are represented by their word embeddings. A neural network is used to derive this continuous representation in order to capture the compositionality within the verb frame. The second stage uses a large margin extension of PRank. The learning scheme is very interesting: the error made by the ranker is used to update the ranker parameters, but is also back-propagated to update the NN parameters. |

Learning Factored Representations in a Deep Mixture of Experts

Eigen, David and Ranzato, Marc'Aurelio and Sutskever, Ilya

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Eigen, David and Ranzato, Marc'Aurelio and Sutskever, Ilya

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper extends the mixture-of-experts (MoE) model by stacking several blocks of the MoEs to form a deep MoE. In this model, each mixture weight is implemented with a gating network. The mixtures at each block is different. The whole deep MoE is trained jointly using the stochastic gradient descent algorithm. The motivation of the work is to reduce the decoding time by exploiting the structure imposed in the MoE model. The model was evaluated on the MNIST and speech monophone classification tasks. |

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents methods for visualizing the behaviour of an object recognition convolutional neural network. The first method generates a "canonical image" for a given class that the network can recognize. The second generates a saliency map for a given input image and specified class, that illustrates the part of the image (pixels) that influence the most the given class's output probability. This can be used to seed a graphcut segmentation and localize objects of that class in the input image. Finally, a connection between the saliency map method and the work of Zeiler and Fergus on using deconvolutions to visualize deep networks is established. |

Learning States Representations in POMDP

Contardo, Gabriella and Denoyer, Ludovic and Artières, Thierry and Gallinari, Patrick

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Contardo, Gabriella and Denoyer, Ludovic and Artières, Thierry and Gallinari, Patrick

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The authors present a model that learns representations of sequential inputs on random trajectories through the state space, then feed those into a reinforcement learner, to deal with partially observable environments. They apply this to a POMDP mountain car problem, where the velocity of the car is not visible but has to be inferred from successive observations. |

Deep Learning for Detecting Robotic Grasps

Lenz, Ian and Lee, Honglak and Saxena, Ashutosh

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Lenz, Ian and Lee, Honglak and Saxena, Ashutosh

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
this paper uses the common 2-step procedure to first eliminate most of unlikely detection windows (high recall), then use a network with higher capacity for better discrimination (high precision). Deep learning (in the unsupervised sense) helps having features optimized for each of these 2 different tasks, adapt them for different situations (different robotics grippers) and beat hand-designed features for detection of graspable areas, using a mixture of inputs (depth + rgb + xyz). |

Why Size Matters: Feature Coding as Nystrom Sampling

Vinyals, Oriol and Jia, Yangqing and Darrell, Trevor

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Vinyals, Oriol and Jia, Yangqing and Darrell, Trevor

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents a theoretical analysis and empirical validation of a novel view of feature extraction systems based on the idea of Nystrom sampling for kernel methods. The main idea is to analyze the kernel matrix for a feature space defined by an off-the-shelf feature extraction system. In such a system, a bound is identified for the error in representing the "full" dictionary composed of all data points by a Nystrom approximated version (i.e., represented by subsampling the data points randomly). The bound is then extended to show that the approximate kernel matrix obtained using the Nystrom-sampled dictionary is close to the true kernel matrix, and it is argued that the quality of the approximation is a reasonable proxy for the classification error we can expect after training. It is shown that this approximation model qualitatively predicts the monotonic rise in accuracy of feature extraction with larger dictionaries and saturation of performance in experiments. |

Joint Training Deep Boltzmann Machines for Classification

Goodfellow, Ian J. and Courville, Aaron C. and Bengio, Yoshua

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Goodfellow, Ian J. and Courville, Aaron C. and Bengio, Yoshua

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The authors aim to introduce a new method for training deep Boltzmann machines. Inspired by inference procedure they turn the model into two hidden layers autoencoder with recurrent connections. Instead of reconstructing all pixels from all (perhaps corrupted) pixels they reconstruct one subset of pixels from the other (the complement). DBM are usually "pre-trained" in a layer-wise manner using RBMs, a conceivably suboptimal procedure. Here the authors propose to use a deterministic criterion that basically turns the DBM into a RNN. This RNN is trained with a loss that resembles that one of denoising auto-encoders (some inputs at random are missing and the task is to predict their values from the observed ones). |

Zero-Shot Learning Through Cross-Modal Transfer

Socher, Richard and Ganjoo, Milind and Manning, Christopher D. and Ng, Andrew Y.

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

Socher, Richard and Ganjoo, Milind and Manning, Christopher D. and Ng, Andrew Y.

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The paper presents a framework to learn to classify images that can come either from known or unknown classes. This is done by first mapping both images and classes into a joint embedding space. Furthermore, the probability of an image being of an unknown class is estimated using a mixture of Gaussians. Experiments on CIFAR-10 show how performance vary depending on the threshold use to determine if an image is of a known class or not. The model first tries to detect whether an image contains an object from a so-far unseen category. If not, the model relies on a regular, state-of-the art supervised classifier to assign the image to known classes. Otherwise, it attempts to identify what this object is, based on a comparison between the image and each unseen class, in a learned joint image/class representation space. The method relies on pre-trained word representations, extracted from unlabelled text, to represent the classes. Experiments evaluate the compromise between classification accuracy on the seen classes and the unseen classes, as a threshold for identifying an unseen class is varied. |

Unsupervised feature learning by augmenting single images

Dosovitskiy, Alexey and Springenberg, Jost Tobias and Brox, Thomas

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Dosovitskiy, Alexey and Springenberg, Jost Tobias and Brox, Thomas

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The paper presents an approach for learning the filters of a convolutional NN, for an image classification task, without making use of target labels. The algorithm proceeds in two steps: learning a transformation of the original image and then learning a classifier using this new representation. For the first step, patches are sampled from an image collection, each patch will then correspond to a surrogate class and a classifier will be trained to associate transformed versions of the patches to the corresponding class labels using a convolutional net. In a second step, this net is replicated on whole images leading to a transformed representation of the original image. A linear classifier is then trained using this representation as input and the target labels relative to the image collection. Experiments are performed on different image collections and a comparison with several baselines is provided. |

A simple example of Dirichlet process mixture inconsistency for the number of components

Miller, Jeffrey W. and Harrison, Matthew T.

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

Miller, Jeffrey W. and Harrison, Matthew T.

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper addresses one simple but potentially very important point: That Dirichlet process mixture models can be inconsistent in the number of mixture components that they infer. This is important because DPs are nowadays widely used in various types of statistical modeling, for example when building clustering type algorithms. This can have real-world implications, for example when clustering breast cancer data with the aim of identifying distinct disease subtypes. Such subtypes are used in clinical practice to inform treatment, so identifying the correct number of clusters (and hence subtypes) has a very important real-world impact. The paper focuses on proofs concerning two specific cases where the DP turns out to be inconsistent. Both consider the case of the "standard normal DPM", where the likelihood is a univariate normal distribution with unit variance, the mean of which is subject to a normal prior with unit variance. The first proof shows that, if the data are drawn i.i.d. from a zero-mean, unit-variance normal (hence matching the assumed DPM model), $P(T=1 | \text{data})$ does not converge to 1. The second proof takes this further, demonstrating that in fact$ P(T=1 | \text{data}) -> 0 $ |

Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

Fiori, Marcelo and Sprechmann, Pablo and Vogelstein, Joshua T. and Musé, Pablo and Sapiro, Guillermo

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

Fiori, Marcelo and Sprechmann, Pablo and Vogelstein, Joshua T. and Musé, Pablo and Sapiro, Guillermo

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper examines the problem of approximate graph matching (isomorphism). Given graphs G, H with p nodes, represented by respective adjacency matrices A, B, Find a permutation matrix P that best "matches" AP and PB. This paper poses the multimodal graph matching problem as a convex optimization problem, and solves it using augmented Langrangian techniques (viz., ADMM). This is an important problem with application in several fields. Experimental results on synthetic and multiple real world datasets demonstrate effectiveness of the proposed approach. |

The Randomized Dependence Coefficient

López-Paz, David and Hennig, Philipp and Schölkopf, Bernhard

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

López-Paz, David and Hennig, Philipp and Schölkopf, Bernhard

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
The authors propose a non-linear measure of dependence between two random variables. This turns out to be the canonical correlation between random, nonlinear projections of the variables after a copula transformation which renders the marginals of the r.vs invariant to linear transformations. The paper introduces a new method called RDC to measure the statistical dependence between random variables. It combines a copula transform to a variant of kernel CCA using random projections, resulting in a $O(n log n)$ complexity. Results on synthetic and real benchmark data show promising results for feature selection. The RDC is a non-linear dependency estimator that satisfies Renyi's criteria and exploits the very recent FastFood speedup trick (ICML13) \cite{journals/corr/LeSS14}. This is a straightforward recipe: 1) copularize the data, effectively preserving the dependency structure while ignoring the marginals, 2) sample k non-linear features of each datum (inspired from Bochner's theorem) and 3) solve the regular CCA eigenvalue problem on the resulting paired datasets. Ultimately, RDC feels like a copularised variation of kCCA (misleading as this may sound). Its efficiency is illustrated successfully on a set of classical non-linear bivariate dependency scenarios and 12 real datasets via a forward feature selection procedure. |

Generalized Denoising Auto-Encoders as Generative Models

Bengio, Yoshua and Yao, Li and Alain, Guillaume and Vincent, Pascal

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

Bengio, Yoshua and Yao, Li and Alain, Guillaume and Vincent, Pascal

Neural Information Processing Systems Conference - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper continues a recent line of theoretical work that seeks to explain what autoencoders learn about the data-generating distribution. Of practical importance from this work have been ways to sample from autoencoders. Specifically, this paper picks up where \cite{journals/jmlr/AlainB14} left off. That paper was able to show that autoencoders (under a number of conditions) estimate the score (derivative of the log-density) of the data-generating distribution in a way that was proportional to the difference between reconstruction and input. However, it was these conditions that limited this work: it only considered Gaussian corruption, it only applied to continuous inputs, it was proven for only squared error, and was valid only in the limit of small corruption. The current paper connects the autoencoder training procedure to the implicit estimation of the data-generating distribution for arbitrary corruption, arbitrary reconstruction loss, and can handle both discrete and continuous variables for non-infinitesimal corruption noise. Moreover, the paper presents a new training algorithm called "walkback" which estimates the same distribution as the "vanilla" denoising algorithm, but, as experimental evidence suggests, may do so in a more efficient way. |

Wireless Message Dissemination via Selective Relay over Bluetooth (MDSRoB)

Cohen, Joseph Paul

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

Cohen, Joseph Paul

arXiv e-Print archive - 2013 via Local Bibsonomy

Keywords: dblp

[link]
This paper proposes a method to send messages between cell phones over Bluetooth by using the device name field. This allows devices to communicate directly with each other without pairing. |

About