ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Infinite Dimensional Word Embeddings
Nalisnick, Eric T. and Ravi, Sachin
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper introduces a version of the skipgram word embeddings learning algorithm that can also learn the size (nb. of dimensions) of these embeddings. The method, coined infinite skipgram (iSG), is inspired from my work with Marc-Alexandre Côté on the infinite RBM, in which we describe a mathematical trick for learning the size of a latent representation. This is done by introducing an additional latent variable $z$ representing the number of dimensions effectively involved in the energy function. Moreover, a term penalizing increasing values for $z$ is also incorporated, such that the infinite sum over $z$ is converging.

In this paper, the authors extend the probabilistic model behind skipgram with such a variable $z$, now corresponding to the number of dimensions involved in the dot product between word embeddings. They also propose a few approximations required to allow for an efficient training algorithm. Mainly they optimize an upper bound on the regular skipgram objective (see Section 3.2) and they approximate the computation of the conditional over $z$ for a given word $w$, which requires summing over all possible context words $c$, by summing only over the words observed in the immediate current context of $w$ (thus this sum will very across training example of the same word $w$).

Experiments show that the iSG better learns to exploit different dimensions to model different senses of words, better than the original skipgram model. Quantitatively, the iSG seems to provide better probabilities to context words.

doi.org
sci-hub
scholar.google.com

MagNet: A Two-Pronged Defense against Adversarial Examples
Meng, Dongyu and Chen, Hao
ACM ACM Conference on Computer and Communications Security - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 5 years ago

Meng and Chen propose MagNet, a combination of adversarial example detection and removal. At test time, given a clean or adversarial test image, the proposed defense works as follows: First, the input is passed through one or multiple detectors. If one of these detectors fires, the input is rejected. To this end, the authors consider detection based on the reconstruction error of an auto-encoder or detection based on the divergence between probability predictions (on adversarial vs. clean example). Second, if not rejected, the input is passed through a reformed. The reformer reconstructs the input, e.g., through an auto-encoder, to remove potentially undetected adversarial noise.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

papers.nips.cc
scholar.google.com

Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier Transform
Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil
Neural Information Processing Systems Conference - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 5 years ago

Bafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy.

https://i.imgur.com/yClXLoo.jpg
Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attack on images and other signals.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

papers.nips.cc
scholar.google.com

Variational Policy Search via Trajectory Optimization
Levine, Sergey and Koltun, Vladlen
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper introduces a new approach of how classical policy search can be combined and improved with trajectory optimization methods serving as exploration strategy. An optimization criteria with the goal of finding optimal policy parameters is decomposed with a variational approach. The variational distribution is approximated as Gaussian distribution which allows a solution with the iterative LQR algorithm. The overall algorithm uses expectation maximization to iterate between minimizing the KL divergence of the variational decomposition and maximizing the lower bound with respect to the policy parameters.

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization**

$$W \sim U \left [  - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on a artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and many mini-batch epochs were learned. It might be that batch normalization / dropout might change the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that is has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?

1 Comments