[link]
The paper empirically investigates the effect of "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained to perform after being retrained to perform a second task. An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt by doing the first task), the training of the second task will always end in the global minimum, thus totally "forgetting" the first one. Neuroscientific evidence (and common sense) suggest that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. The paper distinguishes three cases: (i) the two tasks are *functionally identical but input is presented in a different format*, (ii) the *tasks are similar* and (iii) the *tasks are dissimilar*. Relevant examples are, respectively, (i) performing the same image classification task starting from two different image representations such as RGB or HSL, (ii) performing semantically similar image classification tasks such as classifying two similar animals and (iii) performing a text classification followed by an image classification. The problem is investigated by an empirical study covering two training procedures (plain SGD and SGD with dropout) combined with four activation functions (logistic sigmoid, ReLU, LWTA, maxout). A random search is carried out over these parameters. From a practitioner's point of view, it is interesting to note that the dropout probability has been set to 0.5 in the hidden units and 0.2 in the visible ones, since these are reasonably well-known default values (a minimal sketch of this setup follows after the summary).

## Why the paper is important

It is apparently the first to provide a systematic empirical analysis of CF. It establishes a framework and baselines to face the problem.

## Key conclusions, takeaways and modelling remarks

* Dropout helps in preventing CF.
* Dropout seems to increase the optimal model size with respect to the model without dropout.
* The choice of activation function has a less consistent effect than the dropout/no-dropout choice.
* The dissimilar-tasks experiment provides a notable exception to the previous point.
* The previous hypothesis that the LWTA activation is particularly resistant to CF is rejected (even if it performs best on the new task in the dissimilar task pair, its behaviour is inconsistent).
* The choice of activation function should always be cross-validated.
* If computational resources are insufficient for cross-validation, the combination of dropout and the maxout activation function is recommended.
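As a concrete illustration of that dropout configuration, here is a minimal NumPy sketch of a single forward pass with drop probability 0.2 on the visible layer and 0.5 on the hidden layer. The layer sizes, the ReLU nonlinearity and the inverted-dropout rescaling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob):
    """Inverted dropout: zero units at random, rescale survivors at train time."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

# Hypothetical two-layer MLP with the paper's dropout rates:
# drop probability 0.2 on the visible layer (keep 0.8), 0.5 on hidden (keep 0.5).
W1, b1 = rng.normal(size=(784, 256)) * 0.01, np.zeros(256)
W2, b2 = rng.normal(size=(256, 10)) * 0.01, np.zeros(10)

x = rng.random(784)                              # a fake "image"
h = np.maximum(0, dropout(x, 0.8) @ W1 + b1)     # ReLU hidden layer
logits = dropout(h, 0.5) @ W2 + b2
```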
[link]
* They use an implementation of Q-learning (i.e. reinforcement learning) with CNNs to automatically play Atari games.
* The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No hand-engineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would.
* The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players.

### How

* Deep Q Learning
  * *This is yet another explanation of deep Q learning, see also [this blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) for a longer explanation.*
  * While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated.
  * `state1` is the current game state. The agent only sees the pixels of that state. (Example: screen shows an enemy.)
  * `action1` is an action that the agent chooses. (Example: shoot!)
  * `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.)
  * `state2` is the next game state, after the action was chosen in `state1`. (Example: screen shows a dead enemy.)
  * One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory.
* Direct reward
  * After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) -> direct reward`.
  * That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) -> (direct reward of action1, direct reward of action2, direct reward of action3)`.
  * We can then sample examples from our replay memory. The input per example is the screen. The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1, then our output/label for that example would be `(1, 0, 0)`.
  * We can then train the model by playing completely randomly for some time, then sampling some batches and training using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on.
* Indirect reward
  * Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model would e.g. never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later.
  * To learn the indirect reward, one simply adds the reward value of the highest-reward action according to `Q(state2)` to the direct reward.
  * I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`). (Where `highestReward()` returns the reward of the action with the highest reward according to `Q()`.)
  * By training to predict `reward + highestReward(Q(screen2))`, the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does. (A code sketch of this target computation follows at the end of these notes.)
  * We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q will automatically learn to stack it, e.g. `state3` will be discounted by `gamma^2` at `state1`.
* This paper
  * They use the mentioned Deep Q Learning to train their model Q.
  * They use a k-th frame technique, i.e. they let the model decide upon an action at (here) every 4th frame.
  * Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects that onto the rewards of 4 to 18 actions.
  * The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame. So the network knows more about what things are moving how.
  * The network architecture is:
    * 84x84x4 (input)
    * 16 convs, 8x8, stride 4, ReLU
    * 32 convs, 4x4, stride 2, ReLU
    * 256 fully connected neurons, ReLU
    * <N_actions> fully connected neurons, linear
  * They use a replay memory of 1 million frames.

### Results

* They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders.
* Same architecture and hyperparameters for all games.
* Rewards were based on score changes in the games, i.e. they used +1 (score increases) and -1 (score decreases).
* Optimizer: RMSProp, batch size: 32.
* Trained for 10 million examples/frames per game.
* They had no problems with instability and their average Q value per game increased smoothly.
* Their method beats all other state of the art methods.
* They managed to beat a human player in games that did not require long-term strategies (the fewer frames a strategy spans, the better).
* Video: starts at 46:05. https://youtu.be/dV80NAlEins?t=46m05s

![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Playing_Atari_with_Deep_Reinforcement_Learning__algorithm.png?raw=true "Algorithm")

*The original full algorithm, as shown in the paper.*

--------------------

### Rough chapter-wise notes

* (1) Introduction
  * Problems when using neural nets in reinforcement learning (RL):
    * Reward signal is often sparse, noisy and delayed.
    * Often an assumption that data samples are independent, while they are correlated in RL.
    * Data distribution can change when the algorithm learns new behaviours.
  * They use Q-learning with a CNN and stochastic gradient descent.
  * They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training).
  * They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE).
  * They use only the visible pixels as input to the network, i.e. no manual feature extraction.
* (2) Background
  * blablabla, standard deep q learning explanation
* (3) Related Work
  * TD-Backgammon: "Solved" backgammon. Worked similarly to Q-learning and used a multi-layer perceptron.
  * Attempts to copy TD-Backgammon to other games failed.
  * Research was focused on linear function approximators as there were problems with non-linear ones diverging.
  * Recently again interest in using neural nets for reinforcement learning. Some attempts to fix divergence problems with gradient temporal-difference methods.
  * NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them end-to-end.
  * HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws.
* (4) Deep Reinforcement Learning
  * They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features.
  * The network is supposed to run on the raw RGB images.
  * They use experience replay, i.e. they store tuples of (pixels, chosen action, received reward) in a memory and use that during training.
  * They use Q-learning.
  * They use an epsilon-greedy policy.
  * Advantages from using experience replay instead of learning "live" during game playing:
    * Experiences can be reused many times (more efficient).
    * Samples are less correlated.
    * Learned parameters from one batch don't determine as much the distributions of the examples in the next batch.
  * They save the last N experiences and sample uniformly from them during training.
* (4.1) Preprocessing and Model Architecture
  * Raw Atari images are 210x160 pixels with 128 possible colors.
  * They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them.
  * They also convert the images to grayscale.
  * They use the last 4 frames as input and stack them.
  * So their network input has shape 84x84x4.
  * They use one output neuron per possible action. So they can compute the Q-value (expected reward) of each action with one forward pass.
  * Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs, stride 2, ReLU => fc 256, ReLU => fc N actions, linear
  * 4 to 18 actions/outputs (depends on the game).
  * Aside from the outputs, the architecture is the same for all games.
* (5) Experiments
  * Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders
  * They use the same architecture and hyperparameters for all games.
  * They give a reward of +1 whenever the in-game score increases and -1 whenever it decreases.
  * They use RMSProp.
  * Mini batch size was 32.
  * They train for 10 million frames/examples.
  * They initialize epsilon (in their epsilon-greedy strategy) to 1.0 and decrease it linearly to 0.1 at one million frames.
  * They let the agent decide upon an action at every 4th in-game frame (3rd in Space Invaders).
* (5.1) Training and stability
  * They plot the average reward and Q-value per N games to evaluate the agent's training progress.
  * The average reward increases in a noisy way.
  * The average Q value increases smoothly.
  * They did not experience any divergence issues during their training.
* (5.2) Visualizing the Value Function
  * The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames).
* (5.3) Main Evaluation
  * They compare to three other methods that use hand-engineered features and/or use the pixel data combined with significant prior knowledge.
  * They mostly outperform the other methods.
  * They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames.
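The target construction described under "Indirect reward" fits in a few lines. Below is a minimal NumPy sketch; `q_net`, the batch layout and the terminal-state masking are assumptions made for illustration, not the paper's actual code.

```python
import numpy as np

GAMMA = 0.9  # discount factor for indirect rewards

def q_targets(q_net, batch):
    """Assemble Q-learning regression targets for a replay-memory batch.

    q_net(screens) is assumed to return one predicted reward per action,
    shape (batch_size, n_actions); `batch` is assumed to hold arrays
    (screens1, actions, rewards, screens2, terminal).
    """
    screens1, actions, rewards, screens2, terminal = batch
    targets = q_net(screens1).copy()       # keep predictions for unchosen actions
    future = q_net(screens2).max(axis=1)   # highestReward(Q(screen2))
    # Direct reward plus discounted best future reward; no future term
    # when the game ended at state2.
    targets[np.arange(len(actions)), actions] = rewards + GAMMA * future * (1 - terminal)
    return targets  # train q_net on (screens1, targets) with mean squared error
```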
[link]
#### Problem addressed:

A new type of activation function

#### Summary:

This paper proposes a new activation function that computes an $L_p$ norm over multiple projections of an input vector. The p-value can be learned from training examples, and can also be different for each hidden unit. The intuition is that 1) for different datasets there may exist different optimal p-values, so it makes more sense to make p tunable; 2) allowing different units to take different p-values can potentially make the approximation of decision boundaries more efficient and flexible. The empirical results support these two intuitions, and the model achieves comparable results on three datasets.

#### Novelty:

A generalization of pooling, but applied across channels. When the dot product of the data and weight vector plus bias is constrained to be non-negative, the $L_\infty$ case is equivalent to a maxout unit.

#### Drawbacks:

Empirical performance is not very impressive, although evidence supporting the intuition is presented.

#### Datasets:

MNIST, TFD, Pentomino

#### Resources:

http://arxiv.org/abs/1311.1780

#### Presenter:

Yingbo Zhou
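A minimal NumPy sketch of such a unit, assuming k linear filters feed one output; the paper additionally keeps p >= 1 via a careful parameterization, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lp_unit(x, W, b, p):
    """L_p unit: combine k linear projections of the input via an L_p norm.

    x: input vector (d,); W: (k, d) projections belonging to one unit;
    b: (k,) biases; p: the norm order, learnable per unit in the paper.
    """
    z = np.abs(W @ x + b)
    return np.sum(z ** p) ** (1.0 / p)

# As p grows, the unit approaches the maximum projection (maxout-like regime):
x, W, b = rng.random(5), rng.normal(size=(3, 5)), np.zeros(3)
for p in (1.0, 2.0, 100.0):
    print(p, lp_unit(x, W, b, p))
```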
[link]
#### Problem addressed:

Evaluation and comparison of generative models

#### Summary:

This paper improves upon an existing non-parametric estimator by sampling from hidden variables instead of features. The authors present an unbiased estimator and prove it asymptotically converges to the true distribution as the number of samples grows. They also prove that the expected value of the unbiased estimator is a lower bound on the true distribution. They additionally present a biased estimator with a different sampling scheme. They empirically validate their estimators on the MNIST dataset with different generative models.

#### Novelty:

Sampling from the hidden space for non-parametric estimation

#### Drawbacks:

This method works only for models which have hidden variables. The application to deep networks is not clear. The procedure for sampling from the hidden variables is not explicitly described. Assumes that P(x|h) is easily calculated from the model.

#### Datasets:

MNIST

#### Resources:

Paper: http://arxiv.org/pdf/1311.6184v4.pdf

#### Presenter:

Bhargava U. Kota
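In spirit, the hidden-space estimator is a Monte-Carlo average of P(x|h) over sampled hidden configurations. A generic sketch, with placeholder functions since the paper leaves the sampling procedure implicit:

```python
import numpy as np

def estimate_likelihood(x, sample_hiddens, cond_prob, n_samples=1000):
    """Monte-Carlo estimate of p(x) = E_{h ~ model}[ p(x | h) ].

    sample_hiddens(n) should draw n hidden configurations from the model
    (the step the paper leaves implicit) and cond_prob(x, h) should
    evaluate p(x | h), which the method assumes is tractable. Both names
    are placeholders for a concrete generative model.
    """
    hs = sample_hiddens(n_samples)
    return np.mean([cond_prob(x, h) for h in hs])
```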
[link]
#### Problem addressed:

It has been empirically observed that deep representations lead to better mode mixing when sampled using MCMC. The authors present a set of hypotheses as to why this happens and confirm them empirically.

#### Summary:

The paper claims that deep representations (especially from parametric models) disentangle the factors of variation in the raw feature space. This disentangling leads to better "mode mixing" during MCMC sampling. E.g., in faces, the factors of variation could be identity-pose-illumination. If the higher layer learns these features, then changing the representation in this space starting from a "valid" point would lead to changes in each of these factors directly and hence will produce "valid" images, which in the original feature space would be far apart; thus better mode mixing. This hypothesis is explained using two additional ones: (a) the manifold structure of the "valid" data is flattened in the higher-layer space, and (b) the fraction of total volume occupied by high-probability (valid) points is larger in the higher-layer space. While (a) should lead to better interpolation in the higher-layer space, (b) should lead to more valid points in a Parzen window around any known sample. These are confirmed experimentally.

#### Novelty:

Novel intuitions about why deep representations are good for generative modeling.

#### Drawbacks:

No theoretical justification.

#### Datasets:

MNIST, Toronto Face Dataset (TFD)

#### Additional remarks:

Used DBN and deep CAE for experiments on the datasets.

#### Presenter:

Devansh Arpit
[link]
#### Problem addressed:

Gradient estimation for stochastic neurons

#### Summary:

This paper proposes an unbiased gradient estimator for stochastic units so that one can use gradient-based learning. In addition, it also proposes a simple, biased estimator called straight-through.

#### Novelty:

A new approach for estimating gradients for stochastic units.

#### Drawbacks:

The proposed unbiased estimator seems to have large variance, and the biased one does not seem to perform very well in practice.

#### Presenter:

Yingbo Zhou
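A minimal sketch of the straight-through idea for a binary stochastic unit, assuming a sigmoid parameterization: sample a hard 0/1 value on the forward pass, then pass the incoming gradient through unchanged on the backward pass, as if the sampling step were the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_stochastic_forward(a):
    """Forward: h ~ Bernoulli(sigmoid(a)) -- a hard, non-differentiable sample."""
    p = 1.0 / (1.0 + np.exp(-a))
    h = (rng.random(a.shape) < p).astype(a.dtype)
    return h, p

def straight_through_backward(grad_h):
    """Backward: treat the sampling step as if it were the identity and
    pass the gradient through unchanged (biased, but simple)."""
    return grad_h
```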
[link]
#### Problem addressed:

Variational learning of Bayesian networks

#### Summary:

This paper presents a generic method for learning belief networks which uses a variational lower bound on the likelihood term.

#### Novelty:

Uses a re-parameterization trick to rewrite random variables as a deterministic function plus a noise term, so one can apply normal gradient-based learning.

#### Drawbacks:

The resulting model's marginal likelihood is still intractable, so it may not be very good for applications that require the actual values of marginal probabilities.

#### Datasets:

MNIST, Frey Face

#### Additional remarks:

Experimentally compared with the wake-sleep algorithm on the log-likelihood lower bound as well as the estimated marginal likelihood.

#### Resources:

Implementation: https://github.com/y0ast/Variational-Autoencoder

#### Presenter:

Yingbo Zhou
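A few-line NumPy sketch of the re-parameterization trick for the common diagonal-Gaussian case, together with the analytic KL term of the variational lower bound; the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the random variable becomes
    a deterministic function of (mu, log_var) plus independent noise, so
    gradients can flow into the parameters that produced mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Analytic KL(q(z|x) || N(0, I)) term of the variational lower bound."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
```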
[link]
The paper introduces two key properties of deep neural networks:

- Semantic meaning of individual units.
  - Earlier works analyzed learnt semantics by finding images that maximally activate individual units.
  - Authors observe that there is no difference between individual units and random linear combinations of units.
  - It is the entire space of activations that contains the bulk of semantic information.
- Stability of neural networks to small perturbations in input space.
  - Networks that generalize well are expected to be robust to small perturbations in the input, i.e. imperceptible noise in the input shouldn't change the predicted class.
  - Authors find that networks can be made to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error (a simplified sketch of this idea follows below).
  - These 'adversarial examples' generalize well to different architectures trained on different data subsets.

## Strengths

- The authors propose a way to make networks more robust to small perturbations by training them with adversarial examples in an adaptive manner, i.e. keep changing the pool of adversarial examples during training. In this regard, they draw a connection with hard-negative mining, and a network trained with adversarial examples performs better than others.
- Formal description of how to generate adversarial examples and mathematical analysis of a network's stability to perturbations are useful studies.

## Weaknesses / Notes

- Two images that are visually indistinguishable to humans but classified differently by the network is indeed an intriguing observation.
- The paper feels a little half-baked in parts, and some ideas could've been presented more clearly.
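The paper finds its perturbations with box-constrained L-BFGS; as a simplified stand-in, a single gradient step on the loss with respect to the input pixels already conveys the mechanism. The sign step and epsilon below are assumptions of this sketch, not the paper's procedure.

```python
import numpy as np

def adversarial_step(x, grad_loss_wrt_x, epsilon=0.007):
    """One gradient-ascent step on the network's loss w.r.t. the *input*:
    move the pixels in the direction that increases the prediction error,
    keeping the change small enough to be imperceptible."""
    x_adv = x + epsilon * np.sign(grad_loss_wrt_x)
    return np.clip(x_adv, 0.0, 1.0)  # stay a valid image
```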
[link]
## Terms

* Classification: Assign one label to one image. One big dominant object per image.
* Localization: Pretty much first classification and then finding a bounding box for the dominant object within the image.
* Detection: Many objects per image, and there might be multiple instances of the same class; all of them have to be found and classified.

## Evaluation

* ILSVRC2013, localization task
* ILSVRC2013, detection task
* ILSVRC2013, classification task

## Implementations

* https://github.com/sermanet/OverFeat
[link]
The paper 'Big Neural Networks Waste Capacity' observes that adding more layers/parameters does not improve accuracy. When reading this paper, one should bear in mind that it was written well before [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15) or DenseNets. In the experiments, they applied MLPs to SIFT features of ImageNet LSVRC-2010.

**Do not read this paper.** Instead, you might want to read "Deep Residual Learning for Image Recognition". It makes the same point, but more clearly, and offers a solution to the underfitting problem.

## Criticism

I don't understand why they write about k-means.

> Assuming minimal error in the human labelling of the dataset, it should be possible to reach errors close to 0%.

For ImageNet, the human labeling error is estimated at about 5% (I can't find the source for that, though).

> Improvements on ImageNet are thought to be a good proxy for progress in object recognition (Deng et al., 2009).

ImageNet images are very different from "typical web images" like the [100 million images Flickr dataset](http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for).
[link]
A paper at the intersection of computer vision and machine learning. They propose a method (network in network) to reduce the number of parameters. Essentially, it boils down to a pattern of

(conv with size > 1) -> (1x1 conv) -> (1x1 conv) -> repeat

## Datasets

State-of-the-art classification performance with NIN on CIFAR-10 and CIFAR-100, and reasonable performance on SVHN and MNIST.

## Implementations

* [Lasagne](https://github.com/Lasagne/Recipes/blob/master/modelzoo/cifar10_nin.py)
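The pattern above as a hedged PyTorch sketch; the channel counts, kernel size and padding are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

def nin_block(in_ch, out_ch, kernel_size, stride, padding):
    """One network-in-network block: a spatial conv followed by two 1x1
    convs, which act like a tiny MLP applied at every spatial location."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(),  # 1x1 conv = per-pixel MLP layer
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(),
    )

# e.g. a first block for 32x32 RGB input:
block = nin_block(3, 192, kernel_size=5, stride=1, padding=2)
```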
[link]
This paper analyses fully connected networks with dropout and ReLU activation functions.
[link]
The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analysis techniques

### Visualization

A multi-layer deconvolutional network is used to project the feature activations back into pixel space, showing what input pattern originally caused a given activation in the feature maps. The idea is to train a network which is given the result of a layer $L_i$ and has to reconstruct the input feature map of $L_i$. This is repeated until the input image is reached. The deconv-net has a special **unpooling layer**: the max-pooling layers have to record where each activation came from and store that in a switch variable, which is used in unpooling.

### Occlusion sensitivity analysis

* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
  * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride)
  * At $(x, y)$, either ...
    * (d) ... place a pixel which color-encodes the probability of the correct class
    * (e) ... place a pixel which color-encodes the most probable class

(A code sketch of this loop follows at the end of this summary.) The following images from the Zeiler & Fergus paper visualize this pretty well. If the dog's face is occluded, the probability of the correct class drops a lot:

![Imgur](http://i.imgur.com/Q1Ama2z.png)

If the dog's face is occluded, the most likely class suddenly is "tennis ball" and no longer "Pomeranian":

![Imgur](http://i.imgur.com/5QYKh7b.png)

See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).

## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> lower the filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> lower the stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct.

## ZF-Net

Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers.

Training setup:

* **Preprocessing**: Resize smallest dimension to 256, per-pixel mean subtraction per channel, crop $224\text{px} \times 224\text{px}$ region
* **Optimization**: Mini-batch SGD, learning rate $= 10^{-2}$, momentum $= 0.9$, 70 epochs
* **Resources**: took around 12 days on a single GTX 580 GPU

The network was evaluated on:

* ImageNet 2012: 14.8% error
* Caltech-101: $86.5\% \pm 0.5$ (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)
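A minimal sketch of the occlusion loop referenced above; `classify` is a placeholder returning class probabilities, and the square size, stride and gray value are illustrative.

```python
import numpy as np

def occlusion_map(classify, image, true_class, size=32, stride=8, fill=0.5):
    """Slide a gray square over the image and record how the probability of
    the correct class changes. classify(img) is assumed to return a vector
    of class probabilities."""
    h, w = image.shape[:2]
    heat = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i, y in enumerate(range(0, h - size + 1, stride)):
        for j, x in enumerate(range(0, w - size + 1, stride)):
            occluded = image.copy()
            occluded[y:y + size, x:x + size] = fill  # the gray square
            heat[i, j] = classify(occluded)[true_class]
    return heat  # low values = occluding here hurts the correct class
```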
[link]
## Introduction

* Introduces techniques to learn word vectors from large text datasets.
* Can be used to find similar words (semantically, syntactically, etc).
* [Link to the paper](http://arxiv.org/pdf/1301.3781.pdf)
* [Link to open source implementation](https://code.google.com/archive/p/word2vec/)

## Model Architecture

* Computational complexity is defined in terms of the number of parameters accessed during model training.
* Proportional to $E*T*Q$
  * *E* - number of training epochs
  * *T* - number of words in the training set
  * *Q* - depends on the model

### Feedforward Neural Net Language Model (NNLM)

* Probabilistic model with input, projection, hidden and output layers.
* Input layer encodes N previous words using 1-of-V encoding (V is the vocabulary size).
* Input layer is projected to a projection layer P with dimensionality *N\*D*.
* Hidden layer (of size *H*) computes the probability distribution over all words.
* Complexity per training example: $Q = N*D + N*D*H + H*V$
* Can reduce *Q* by using hierarchical softmax and a Huffman binary tree (for storing the vocabulary).

### Recurrent Neural Net Language Model (RNNLM)

* Similar to NNLM minus the projection layer.
* Complexity per training example: $Q = H*H + H*V$
* Hierarchical softmax and Huffman tree can be used here as well.

## Log-Linear Models

* The nonlinear hidden layer causes most of the complexity.
* NNLMs can be successfully trained in two steps:
  * Learn continuous word vectors using simple models.
  * Train an N-gram NNLM over the word vectors.

### Continuous Bag-of-Words Model

* Similar to the feedforward NNLM.
* No nonlinear hidden layer.
* Projection layer is shared for all words and the order of words does not influence the projection.
* A log-linear classifier uses a window of words to predict the middle word.
* $Q = N*D + D*\log_2 V$

### Continuous Skip-gram Model

* Similar to the Continuous Bag-of-Words Model but uses the middle word of the window to predict the remaining words in the window.
* Distant words are given less weight by sampling fewer distant words.
* $Q = C*(D + D*\log_2 V)$ where *C* is the max distance of the words from the middle word.
* Given a *C* and training data, a random *R* is chosen in the range *1 to C*.
* For each training word, *R* words from the history (previous words) and *R* words from the future (next words) are marked as target outputs and the model is trained (see the sketch after this summary).

## Results

* Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens with Greece).
* The Continuous Bag-of-Words Model outperforms the other models on syntactic accuracy tasks (e.g. relating great with greater) - with skip-gram just behind in performance.
* The skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
* The model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman").
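A minimal sketch of the skip-gram pair generation with the random window radius described above (tokenization and other details omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def skipgram_pairs(tokens, C=5):
    """Generate (center, context) training pairs for the skip-gram model.

    For each position a random radius R in 1..C is drawn, so distant
    context words are sampled less often, matching the weighting above."""
    pairs = []
    for i, center in enumerate(tokens):
        R = int(rng.integers(1, C + 1))
        for j in range(max(0, i - R), min(len(tokens), i + R + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split(), C=2))
```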
[link]
This paper is a direct extension of the work of \cite{journals/corr/1305.2532}. Prior to this work, \cite{journals/corr/1305.2532} proposed an algorithm for learning to predict fixed-size lists using either a single greedy policy (SCP) or a list of sequentially applied policies (CONSEQOPT). Given a submodular reward function, \cite{journals/corr/1305.2532} provided theoretical guarantees on the regret of the learned policies via a reduction to cost-sensitive classification. In practice one can then approximate the cost-sensitive classification loss with convex surrogate losses that can be efficiently learned.

In this work, the authors extend the framework of \cite{journals/corr/1305.2532} to knapsack problems, wherein each element in the list has an associated length and the sum of the lengths may not exceed a fixed budget. (Note that I intentionally use the word "length" to differentiate from the "cost" of cost-sensitive classification in the reduction, which gets a bit confusing in the paper as the authors overload the word "cost" frequently.) They extend the algorithm of \cite{journals/corr/1305.2532} to be sensitive to the length of items added to the list, and extend the analysis of \cite{journals/corr/1305.2532} to provide regret guarantees in this setting. (A sketch of the knapsack-constrained list setting follows below.)

The authors then apply their algorithm to the problem of multi-document summarization, where items correspond to sentences and the budget constrains the total length of the summary. They show an improvement over prior work with their new approach.

The paper is reasonably well written and straightforward to follow, with the exception of containing lots of notation and terminology that is sometimes difficult to keep straight (e.g. the definitions of the words "cost" and "weight", and the use of $\ell$ as both the cost-sensitive losses and the length (cost? weight?) of each item for the budget). The work itself is a useful extension of prior work that appears to be experimentally sound.

Pros:

+ Novel extension of prior work to incorporate budgetary constraints
+ Works well experimentally

Cons:

- Overloading of symbols and words leads to confusion

References:

\cite{journals/corr/1305.2532} Ross, Stephane, Zhou, Jiaji, Yue, Yisong, Dey, Debadeepta, and Bagnell, J. A. Learning policies for contextual submodular prediction. In ICML, 2013.
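For intuition, the knapsack-constrained list setting looks as follows when an oracle marginal-benefit score is available. This is only a generic benefit-per-length greedy sketch, not the authors' algorithm, which replaces the oracle with learned cost-sensitive classifiers.

```python
def greedy_knapsack_list(items, benefit, length, budget):
    """Benefit-per-length greedy list construction under a budget.

    benefit(item, chosen) scores the marginal gain of appending `item` to
    the current list (submodular in spirit); length(item) is the item's
    length. Stopping at the first over-budget pick is a simplification.
    """
    chosen, used = [], 0.0
    candidates = set(items)
    while candidates:
        best = max(candidates, key=lambda it: benefit(it, chosen) / length(it))
        if used + length(best) > budget or benefit(best, chosen) <= 0:
            break
        chosen.append(best)
        used += length(best)
        candidates.remove(best)
    return chosen
```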
[link]
Motivated by recent attempts to learn very large networks, this work proposes an approach for reducing the number of free parameters in neural-network-type architectures. The method is based on the intuition that there is typically strong redundancy in the learned parameters (for instance, the first-layer filters of NNs applied to images are smooth): the authors suggest learning only a subset of the parameter values and then predicting the remaining ones through some form of interpolation. The proposed approach is evaluated for several architectures (MLP, convolutional NN, reconstruction-ICA) and different vision datasets (MNIST, CIFAR, STL-10). The results suggest that in general it is sufficient to learn fewer than 50% of the parameters without any loss in performance (significantly fewer parameters seem sufficient for MNIST).

The method is relatively simple: the authors assume a low-rank decomposition of the weight matrix and then fix one of the two matrices using prior knowledge about the data (e.g., in the vision case, exploiting the fact that nearby pixels - and weights - tend to be correlated). This can be interpreted as predicting the "unobserved" parameters from the subset of learned filter weights via kernel ridge regression, where the kernel captures prior knowledge about the topology / "smoothness" of the weights (a sketch of this view follows below). For the situation when such prior knowledge is not available, the authors describe a way to learn a suitable kernel from data.

The idea of reducing the number of parameters in NN-like architectures through connectivity constraints is in itself of course not novel, and the authors provide a pretty good discussion of related work in section 5. Their method is very closely related to the idea of factorizing weight matrices as is, for instance, commonly done for 3-way RBMs (e.g. ref [22] in the paper), but also occasionally for standard RBMs (e.g. [R1], missing in the paper). The present paper differs from these in that the authors propose to exploit prior knowledge to constrain one of the matrices. As also discussed by the authors, the approach can further be interpreted as a particular type of pooling -- a strategy commonly employed in convolutional neural networks. Another view of the proposed approach is that the filters are represented as a linear combination of basis functions (in the paper, the particular form of the basis functions is determined by the choice of kernel). Such representations have been explored in various forms and to various ends in the computer vision and signal processing literature (see e.g. [R2,R3,R4,R5]). [R4,R5], for instance, represent filters in terms of a linear combination of basis functions that reduce the computational complexity of the filtering process.
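A sketch of the kernel-ridge-regression view mentioned above, for a single filter whose weights live on a pixel grid. The squared-exponential kernel and its hyperparameters are illustrative stand-ins for the smoothness prior.

```python
import numpy as np

def predict_weights(coords_all, coords_seen, w_seen, lengthscale=2.0, ridge=1e-6):
    """Predict a full filter from a learned subset of its weights.

    coords_*: pixel coordinates, shape (n, 2), of all / observed filter
    positions; w_seen: the weight values actually learned. The kernel
    encodes the prior that weights at nearby pixels are correlated.
    """
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * lengthscale ** 2))

    K = k(coords_seen, coords_seen) + ridge * np.eye(len(coords_seen))
    alpha = np.linalg.solve(K, w_seen)   # kernel ridge regression weights
    return k(coords_all, coords_seen) @ alpha
```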
[link]
The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation [7] in the paper), which learns linear word embeddings that are particularly useful for analogical reasoning type tasks. The extensions proposed (namely, negative sampling and sub-sampling of high-frequency words) enable extremely fast training of the model on large-scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase-level embeddings by slightly tweaking the original training algorithm.

This paper proposes three improvements for the skip-gram model, which allows for learning embeddings for words. The first improvement is the subsampling of frequent words, the second is the use of a simplified version of noise contrastive estimation (NCE), and finally they propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad hoc. In practice, both the subsampling and the negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.

The article does not propose any explanation as to why negative sampling produces better results than NCE, which it is supposed to loosely approximate. In fact it doesn't explain why, besides the obvious generalization gain, the negative sampling scheme should be preferred to NCE, since they achieve similar speeds.
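Of the three improvements, the subsampling rule is the easiest to state concretely: a token with corpus frequency f is discarded with probability 1 - sqrt(t/f). A small sketch (t = 1e-5 is the threshold suggested in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_word(freq, t=1e-5):
    """Subsampling of frequent words: a token with corpus frequency `freq`
    is discarded with probability 1 - sqrt(t / freq)."""
    p_discard = max(0.0, 1.0 - np.sqrt(t / freq))
    return rng.random() >= p_discard

# A very frequent word ("the", freq ~ 0.07) is dropped most of the time;
# a rare word is always kept.
print(sum(keep_word(0.07) for _ in range(1000)))   # ~ 12 of 1000 kept
print(sum(keep_word(1e-6) for _ in range(1000)))   # 1000 of 1000 kept
```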
[link]
This paper investigates a model which aims at predicting the order of events, where each event is an English sentence. While previous methods relied on a graph representation to infer the right order, the proposed model is made of two stages. The first stage uses a continuous representation of a verb frame, where the predicate and its arguments are represented by their word embeddings. A neural network is used to derive this continuous representation in order to capture the compositionality within the verb frame. The second stage uses a large-margin extension of PRank. The learning scheme is very interesting: the error made by the ranker is used to update the ranker parameters, but is also back-propagated to update the NN parameters.
[link]
This paper extends the mixture-of-experts (MoE) model by stacking several blocks of MoEs to form a deep MoE. In this model, each mixture weight is implemented with a gating network, and the mixtures at each block are different. The whole deep MoE is trained jointly using stochastic gradient descent. The motivation of the work is to reduce the decoding time by exploiting the structure imposed in the MoE model. The model was evaluated on MNIST and a speech monophone classification task.
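A one-block sketch of the gating described above; the shapes and the callable experts are illustrative. Stacking several such blocks, each with its own experts and gate, gives the deep MoE.

```python
import numpy as np

def moe_block(x, experts, gate_W, gate_b):
    """One MoE block: a softmax gating network mixes expert outputs.

    experts: list of callables, each mapping x to an output vector;
    gate_W, gate_b: parameters of this block's gating network.
    """
    logits = gate_W @ x + gate_b
    g = np.exp(logits - logits.max())
    g /= g.sum()                                  # mixture weights
    return sum(gi * e(x) for gi, e in zip(g, experts))
```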
[link]
This paper presents methods for visualizing the behaviour of an object recognition convolutional neural network. The first method generates a "canonical image" for a given class that the network can recognize. The second generates a saliency map for a given input image and specified class, which highlights the parts of the image (pixels) that most influence the given class's output probability. This can be used to seed a graph-cut segmentation and localize objects of that class in the input image. Finally, a connection between the saliency map method and the work of Zeiler and Fergus on using deconvolutions to visualize deep networks is established.
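For the saliency map, a sketch of the reduction step: the gradient of the class score with respect to the input pixels comes from a single backward pass (framework-specific, not shown here), and its magnitude is collapsed over color channels.

```python
import numpy as np

def saliency_map(grad_class_score_wrt_image):
    """Per-pixel saliency from the gradient of the chosen class's score
    w.r.t. the input image, shape (H, W, C): take the maximum absolute
    value over the color channels."""
    return np.abs(grad_class_score_wrt_image).max(axis=-1)
```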
[link]
The authors present a model that learns representations of sequential inputs on random trajectories through the state space and then feeds those into a reinforcement learner, in order to deal with partially observable environments. They apply this to a POMDP mountain car problem, where the velocity of the car is not visible but has to be inferred from successive observations.
[link]
This paper uses the common two-step procedure of first eliminating most of the unlikely detection windows (high recall), then using a network with higher capacity for better discrimination (high precision). Deep learning (in the unsupervised sense) helps to obtain features optimized for each of these two different tasks, to adapt them to different situations (different robotic grippers), and to beat hand-designed features for the detection of graspable areas, using a mixture of inputs (depth + rgb + xyz).
[link]
This paper presents a theoretical analysis and empirical validation of a novel view of feature extraction systems based on the idea of Nystrom sampling for kernel methods. The main idea is to analyze the kernel matrix for a feature space defined by an off-the-shelf feature extraction system. In such a system, a bound is identified for the error in representing the "full" dictionary composed of all data points by a Nystrom approximated version (i.e., represented by subsampling the data points randomly). The bound is then extended to show that the approximate kernel matrix obtained using the Nystrom-sampled dictionary is close to the true kernel matrix, and it is argued that the quality of the approximation is a reasonable proxy for the classification error we can expect after training. It is shown that this approximation model qualitatively predicts the monotonic rise in accuracy of feature extraction with larger dictionaries and saturation of performance in experiments.
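A small sketch of the Nystrom approximation at the heart of the analysis: approximate the full kernel matrix from a random subsample of "landmark" points (the subsampled dictionary). The RBF kernel here is just an example.

```python
import numpy as np

def nystrom_kernel_approx(X, kernel, m, rng=np.random.default_rng(0)):
    """Nystrom approximation K ~ C W^+ C^T of the full kernel matrix,
    built from m randomly sampled landmark points."""
    idx = rng.choice(len(X), size=m, replace=False)
    C = kernel(X, X[idx])          # (n, m) cross-kernel
    W = kernel(X[idx], X[idx])     # (m, m) landmark kernel
    return C @ np.linalg.pinv(W) @ C.T

rbf = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1))
X = np.random.default_rng(1).random((200, 5))
K_hat = nystrom_kernel_approx(X, rbf, m=20)
```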
[link]
The authors aim to introduce a new method for training deep Boltzmann machines. Inspired by the inference procedure, they turn the model into a two-hidden-layer autoencoder with recurrent connections. Instead of reconstructing all pixels from all (perhaps corrupted) pixels, they reconstruct one subset of pixels from the other (the complement). DBMs are usually "pre-trained" in a layer-wise manner using RBMs, a conceivably suboptimal procedure. Here the authors propose to use a deterministic criterion that basically turns the DBM into an RNN. This RNN is trained with a loss that resembles that of denoising auto-encoders (some inputs at random are missing and the task is to predict their values from the observed ones).
[link]
The paper presents a framework to learn to classify images that can come either from known or unknown classes. This is done by first mapping both images and classes into a joint embedding space. Furthermore, the probability of an image being of an unknown class is estimated using a mixture of Gaussians. Experiments on CIFAR-10 show how performance varies depending on the threshold used to determine whether an image is of a known class or not.

The model first tries to detect whether an image contains an object from a so-far unseen category. If not, the model relies on a regular, state-of-the-art supervised classifier to assign the image to known classes. Otherwise, it attempts to identify what this object is, based on a comparison between the image and each unseen class, in a learned joint image/class representation space. The method relies on pre-trained word representations, extracted from unlabelled text, to represent the classes. Experiments evaluate the compromise between classification accuracy on the seen classes and the unseen classes, as the threshold for identifying an unseen class is varied.
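A hedged sketch of what such a thresholded decision rule could look like in the joint embedding space; the Gaussian log-density score, the routing and all names are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def classify_maybe_unseen(img_vec, seen_stats, threshold,
                          seen_classifier, zero_shot_classifier):
    """seen_stats: per-seen-class (mean, inverse covariance) fitted in the
    joint embedding space. If the image's best Gaussian log-density falls
    below `threshold`, treat it as an unseen class and route it to the
    zero-shot classifier; otherwise use the regular supervised classifier."""
    def log_density(mu, inv_cov):
        d = img_vec - mu
        return -0.5 * d @ inv_cov @ d
    best = max(log_density(mu, ic) for mu, ic in seen_stats.values())
    if best < threshold:
        return zero_shot_classifier(img_vec)  # compare to unseen class vectors
    return seen_classifier(img_vec)
```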
[link]
The paper presents an approach for learning the filters of a convolutional NN for an image classification task without making use of target labels. The algorithm proceeds in two steps: learning a transformation of the original image, and then learning a classifier using this new representation. For the first step, patches are sampled from an image collection; each patch will then correspond to a surrogate class, and a classifier will be trained to associate transformed versions of the patches to the corresponding class labels using a convolutional net (see the sketch below). In a second step, this net is replicated on whole images, leading to a transformed representation of the original image. A linear classifier is then trained using this representation as input and the target labels relative to the image collection. Experiments are performed on different image collections and a comparison with several baselines is provided.
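The surrogate-class construction in the first step might look like the following sketch; the transformation set here (random shifts and flips) is a simplified stand-in for the richer augmentations used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_dataset(patches, n_transforms=16):
    """Every sampled patch becomes its own surrogate class; random
    transformations of the patch are that class's training examples."""
    xs, ys = [], []
    for label, patch in enumerate(patches):
        for _ in range(n_transforms):
            t = np.roll(patch, shift=tuple(rng.integers(-3, 4, size=2)),
                        axis=(0, 1))             # random small shift
            if rng.random() < 0.5:
                t = t[:, ::-1]                   # random horizontal flip
            xs.append(t)
            ys.append(label)
    return np.stack(xs), np.array(ys)
```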
[link]
This paper addresses one simple but potentially very important point: That Dirichlet process mixture models can be inconsistent in the number of mixture components that they infer. This is important because DPs are nowadays widely used in various types of statistical modeling, for example when building clustering type algorithms. This can have real-world implications, for example when clustering breast cancer data with the aim of identifying distinct disease subtypes. Such subtypes are used in clinical practice to inform treatment, so identifying the correct number of clusters (and hence subtypes) has a very important real-world impact. The paper focuses on proofs concerning two specific cases where the DP turns out to be inconsistent. Both consider the case of the "standard normal DPM", where the likelihood is a univariate normal distribution with unit variance, the mean of which is subject to a normal prior with unit variance. The first proof shows that, if the data are drawn i.i.d. from a zero-mean, unit-variance normal (hence matching the assumed DPM model), $P(T=1 \mid \text{data})$ does not converge to 1. The second proof takes this further, demonstrating that in fact $P(T=1 \mid \text{data}) \to 0$.
[link]
This paper examines the problem of approximate graph matching (isomorphism). Given graphs G, H with p nodes, represented by respective adjacency matrices A, B, find a permutation matrix P that best "matches" AP and PB. This paper poses the multimodal graph matching problem as a convex optimization problem and solves it using augmented Lagrangian techniques (viz., ADMM). This is an important problem with applications in several fields. Experimental results on synthetic and multiple real-world datasets demonstrate the effectiveness of the proposed approach.
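For contrast with the paper's convex approach, a toy baseline makes the objective concrete: score node pairs by degree similarity, solve the resulting assignment, and measure the residual ||AP - PB||_F. The paper instead relaxes P to doubly stochastic matrices and optimizes with ADMM; this sketch only illustrates the problem setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_by_degree(A, B):
    """Toy heuristic for min_P ||AP - PB||_F^2: build a node-to-node cost
    from degree differences and solve the assignment exactly."""
    cost = np.abs(A.sum(1)[:, None] - B.sum(1)[None, :])
    rows, cols = linear_sum_assignment(cost)
    P = np.zeros_like(A, dtype=float)
    P[rows, cols] = 1.0
    residual = np.linalg.norm(A @ P - P @ B, "fro")
    return P, residual
```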
[link]
The authors propose a non-linear measure of dependence between two random variables. This turns out to be the canonical correlation between random, nonlinear projections of the variables after a copula transformation, which renders the marginals of the r.v.s invariant to linear transformations. The paper introduces a new method called RDC to measure the statistical dependence between random variables. It combines a copula transform with a variant of kernel CCA using random projections, resulting in an $O(n \log n)$ complexity. Results on synthetic and real benchmark data show promise for feature selection.

The RDC is a non-linear dependency estimator that satisfies Renyi's criteria and exploits the very recent FastFood speedup trick (ICML13) \cite{journals/corr/LeSS14}. This is a straightforward recipe: 1) copularize the data, effectively preserving the dependency structure while ignoring the marginals, 2) sample k non-linear features of each datum (inspired by Bochner's theorem) and 3) solve the regular CCA eigenvalue problem on the resulting paired datasets (a sketch of this recipe follows below). Ultimately, RDC feels like a copularised variation of kCCA (misleading as this may sound). Its efficiency is illustrated successfully on a set of classical non-linear bivariate dependency scenarios and 12 real datasets via a forward feature selection procedure.
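The three-step recipe can be sketched compactly; the feature count k, the scale s of the random projections and the use of sine features follow common choices for RDC-style estimators, not necessarily the paper's exact settings.

```python
import numpy as np
from scipy.stats import rankdata

def rdc(x, y, k=20, s=1.0 / 6.0, rng=np.random.default_rng(0)):
    """Randomized Dependence Coefficient sketch for two 1-d samples:
    (1) copula transform via ranks, (2) k random sinusoidal features per
    variable, (3) largest canonical correlation between the feature sets."""
    def features(v):
        u = rankdata(v) / len(v)                    # empirical copula transform
        u = np.column_stack([u, np.ones_like(u)])   # add a bias column
        w = rng.normal(scale=s, size=(2, k))
        return np.sin(u @ w)                        # random nonlinear projections
    fx, fy = features(x), features(y)
    C = np.cov(np.hstack([fx, fy]).T)               # joint covariance matrix
    Cxx, Cyy, Cxy = C[:k, :k], C[k:, k:], C[:k, k:]
    M = np.linalg.pinv(Cxx) @ Cxy @ np.linalg.pinv(Cyy) @ Cxy.T
    return np.sqrt(np.clip(np.linalg.eigvals(M).real.max(), 0, 1))
```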
[link]
This paper continues a recent line of theoretical work that seeks to explain what autoencoders learn about the data-generating distribution. Of practical importance from this work have been ways to sample from autoencoders. Specifically, this paper picks up where \cite{journals/jmlr/AlainB14} left off. That paper was able to show that autoencoders (under a number of conditions) estimate the score (derivative of the log-density) of the data-generating distribution in a way that was proportional to the difference between reconstruction and input. However, it was these conditions that limited this work: it only considered Gaussian corruption, it only applied to continuous inputs, it was proven for only squared error, and was valid only in the limit of small corruption. The current paper connects the autoencoder training procedure to the implicit estimation of the data-generating distribution for arbitrary corruption, arbitrary reconstruction loss, and can handle both discrete and continuous variables for non-infinitesimal corruption noise. Moreover, the paper presents a new training algorithm called "walkback" which estimates the same distribution as the "vanilla" denoising algorithm, but, as experimental evidence suggests, may do so in a more efficient way.
[link]
This paper proposes a method to send messages between cell phones over Bluetooth by using the device name field. This allows devices to communicate directly with each other without pairing.
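A hedged sketch of the receiving side using PyBluez: scan for nearby devices with name lookup (no pairing needed) and treat specially prefixed names as messages. The `MSG:` prefix convention is an assumption for illustration, not from the paper.

```python
import bluetooth  # PyBluez; receiving side only

def poll_name_messages(prefix="MSG:"):
    """Scan for nearby devices and read 'messages' broadcast by peers that
    have renamed themselves; no pairing is required for a name lookup."""
    found = bluetooth.discover_devices(duration=8, lookup_names=True)
    return [(addr, name[len(prefix):]) for addr, name in found
            if name and name.startswith(prefix)]
```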