Welcome to ShortScience.org! |
[link]
This is a very techincal paper and I only covered items that interested me * Model * Encoder * 8 layers LSTM * bi-directional only first encoder layer * top 4 layers add input to output (residual network) * Decoder * same as encoder except all layers are just forward direction * encoder state is not passed as a start point to Decoder state * Attention * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer * computed from output of 1st decoder layer * pre-feed to all layers * Training has two steps: ML and RL * ML (cross-entropy) training: * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04] * clipping=5, batch=128 * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps) * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets) * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4 * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$ * mean $r$ computed from $m=15$ samples * SGD, 400K steps, 3 days, no drouput * Prediction (i.e. Decoder) * beam search (3 beams) * A normalized score is computed to every beam that ended (died) * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$ * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sum-log of attention weight of every input word (i.e. after summing over all output words) * Do a second pruning using normalized scores |
[link]
Munoz-Gonzalez et al. propose a multi-class data poisening attack against deep neural networks based on back-gradient optimization. They consider the common poisening formulation stated as follows: $ \max_{D_c} \min_w \mathcal{L}(D_c \cup D_{tr}, w)$ where $D_c$ denotes a set of poisened training samples and $D_{tr}$ the corresponding clea dataset. Here, the loss $\mathcal{L}$ used for training is minimized as the inner optimization problem. As result, as long as learning itself does not have closed-form solutions, e.g., for deep neural networks, the problem is computationally infeasible. To resolve this problem, the authors propose using back-gradient optimization. Then, the gradient with respect to the outer optimization problem can be computed while only computing a limited number of iterations to solve the inner problem, see the paper for detail. In experiments, on spam/malware detection and digit classification, the approach is shown to increase test error of the trained model with only few training examples poisened. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Proposes a two-stage approach for continual learning. An active learning phase and a consolidation phase. The active learning stage optimizes for a specific task that is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phases uses a separate network than the knowledge base, but is not always trained from scratch - authors suggest a heuristic based on task-similarity. Improves EWC by deriving a new online method so parameters don’t increase linearly with the number of tasks. Desiderata for a continual learning solution: - A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks. - It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance. - It should be scalable, that is, the method should be trainable on a large number of tasks. - It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant. - Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries. Experiments: - Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset. - Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pac-man”). - 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017). |
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent code. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space. https://i.imgur.com/XBhHJuY.png Figure 1: Illustration of natural adversarial examples in comparison ot regular, FGSM adversarial examples. https://i.imgur.com/HT2StGI.png Figure 2: Generative model (GAN) together with the required inverter. In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the random space (with increasing norm) until an adversarial perturbation is found. Here, the inverted from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |