ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Progress & Compress: A scalable framework for continual learning
Jonathan Schwarz and Jelena Luketina and Wojciech M. Czarnecki and Agnieszka Grabska-Barwinska and Yee Whye Teh and Razvan Pascanu and Raia Hadsell
arXiv e-Print archive - 2018 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by devin132 4 years ago

Proposes a two-stage approach for continual learning. An active learning phase and a consolidation phase. The active learning stage optimizes for a specific task that is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phases uses a separate network than the knowledge base, but is not always trained from scratch - authors suggest a heuristic based on task-similarity. Improves EWC by deriving a new online method so parameters don’t increase linearly with the number of tasks.

Desiderata for a continual learning solution:

- A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks.

- It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance.

- It should be scalable, that is, the method should be trainable on a large number of tasks.

- It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant.

- Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries.

Experiments:

- Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset.
- Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pac-man”).
- 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017).

openreview.net
scholar.google.com

Generating Natural Adversarial Examples
Zhao, Zhengli and Dua, Dheeru and Singh, Sameer
International Conference on Learning Representations - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 4 years ago

Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent code. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space.

https://i.imgur.com/XBhHJuY.png
Figure 1: Illustration of natural adversarial examples in comparison ot regular, FGSM adversarial examples.

https://i.imgur.com/HT2StGI.png
Figure 2: Generative model (GAN) together with the required inverter.

In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the random space (with increasing norm) until an adversarial perturbation is found. Here, the inverted from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper).

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

doi.org
sci-hub
scholar.google.com

Towards Poisoning of Deep Learning Algorithms with Back-gradient Optimization
Luis Muñoz-González and Battista Biggio and Ambra Demontis and Andrea Paudice and Vasin Wongrassamee and Emil C. Lupu and Fabio Roli
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security - AISec '17 - 2017 via Local CrossRef
Keywords:

[link] Summary by David Stutz 4 years ago

Munoz-Gonzalez et al. propose a multi-class data poisening attack against deep neural networks based on back-gradient optimization. They consider the common poisening formulation stated as follows:

$ \max_{D_c} \min_w \mathcal{L}(D_c \cup D_{tr}, w)$

where $D_c$ denotes a set of poisened training samples and $D_{tr}$ the corresponding clea dataset. Here, the loss $\mathcal{L}$ used for training is minimized as the inner optimization problem. As result, as long as learning itself does not have closed-form solutions, e.g., for deep neural networks, the problem is computationally infeasible. To resolve this problem, the authors propose using back-gradient optimization. Then, the gradient with respect to the outer optimization problem can be computed while only computing a limited number of iterations to solve the inner problem, see the paper for detail. In experiments, on spam/malware detection and digit classification, the approach is shown to increase test error of the trained model with only few training examples poisened.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

arxiv.org
arxiv-vanity.com
scholar.google.com

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI, cs.LG
more

[link] Summary by Udibr 7 years ago

This is a very techincal paper and I only covered items that interested me
* Model
  * Encoder
    * 8 layers LSTM 
    * bi-directional only first encoder layer
    * top 4 layers add input to output (residual network)
  * Decoder
    * same as encoder except all layers are just forward direction
  * encoder state is not passed as a start point to Decoder state
  * Attention
    * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer
    * computed from output of 1st decoder layer
    * pre-feed to all layers
* Training has two steps: ML and RL
  * ML (cross-entropy) training:
    * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04]
    * clipping=5, batch=128
    * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps)
    * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days
    * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
  * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) 
    * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4
    * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$
    * mean $r$ computed from $m=15$ samples
    * SGD, 400K steps, 3 days, no drouput
* Prediction (i.e. Decoder)
  * beam search (3 beams)
  * A normalized score is computed to every beam that ended (died)
    * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$
    * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sum-log of attention weight of every input word (i.e. after summing over all output words)
    * Do a second pruning using normalized scores

arxiv.org
arxiv-vanity.com
scholar.google.com

Understanding deep learning requires rethinking generalization
Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Martin Thoma 7 years ago

This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)