Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Self-Normalizing Neural Networks

Günter Klambauer and Thomas Unterthiner and Andreas Mayr and Sepp Hochreiter

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2017/06/08 (7 years ago)

**Abstract:** Deep Learning has revolutionized vision via convolutional neural networks
(CNNs) and natural language processing via recurrent neural networks (RNNs).
However, success stories of Deep Learning with standard feed-forward neural
networks (FNNs) are rare. FNNs that perform well are typically shallow and,
therefore cannot exploit many levels of abstract representations. We introduce
self-normalizing neural networks (SNNs) to enable high-level abstract
representations. While batch normalization requires explicit normalization,
neuron activations of SNNs automatically converge towards zero mean and unit
variance. The activation function of SNNs are "scaled exponential linear units"
(SELUs), which induce self-normalizing properties. Using the Banach fixed-point
theorem, we prove that activations close to zero mean and unit variance that
are propagated through many network layers will converge towards zero mean and
unit variance -- even under the presence of noise and perturbations. This
convergence property of SNNs allows to (1) train deep networks with many
layers, (2) employ strong regularization, and (3) to make learning highly
robust. Furthermore, for activations not close to unit variance, we prove an
upper and lower bound on the variance, thus, vanishing and exploding gradients
are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning
repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with
standard FNNs and other machine learning methods such as random forests and
support vector machines. SNNs significantly outperformed all competing FNN
methods at 121 UCI tasks, outperformed all competing methods at the Tox21
dataset, and set a new record at an astronomy data set. The winning SNN
architectures are often very deep. Implementations are available at:
github.com/bioinf-jku/SNNs.

* _Objective:_ Design a feed-forward (fully connected) neural network that can be trained even with very deep architectures.
* _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [Tox21](https://tripod.nih.gov/tox21/challenge/) and [UCI tasks](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits).
* _Code:_ [here](https://github.com/bioinf-jku/SNNs)

## Inner-workings:

They introduce a new activation function, the Scaled Exponential Linear Unit (SELU), which has the nice property of making neuron activations converge to a fixed point with zero mean and unit variance. They also prove upper and lower bounds on the variance and mean under very mild conditions, which basically means that there will be no exploding or vanishing gradients.

The activation function is:

[![screen shot 2017-06-14 at 11 38 27 am](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)

with specific values of alpha and lambda chosen to ensure the previous properties. The implementation, written here with NumPy, is:

```python
import numpy as np

def selu(x):
    # Fixed constants from the paper that give the self-normalizing property.
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * np.exp(x) - alpha)
```

They also introduce a new dropout (alpha-dropout) to compensate for the fact that

[![screen shot 2017-06-14 at 11 44 42 am](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)

## Results:

Batch norm becomes obsolete and they are also able to train deeper architectures. This makes SNNs a good replacement for shallow architectures on tasks where random forests or SVMs used to give the best results. They outperform most other techniques on small datasets.

[![screen shot 2017-06-14 at 11 36 30 am](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)

Might become a new standard for fully-connected activations in the future.
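The self-normalizing behaviour described in this summary can be checked numerically. The following is a minimal sketch, not the authors' code: the helper `propagate`, the layer width, the batch size, and the choice of random weights drawn with variance `1/width` are all assumptions made for illustration. It pushes standard-normal inputs through a stack of random fully connected SELU layers and records the mean and variance of the activations after each layer.

```python
import numpy as np

# SELU constants as quoted in the summary.
ALPHA = 1.6732632423543772848170429916717
SCALE = 1.0507009873554804934193349852946


def selu(x):
    # Clamp the argument of exp so the unused branch cannot overflow.
    negative_branch = ALPHA * np.exp(np.minimum(x, 0.0)) - ALPHA
    return SCALE * np.where(x >= 0.0, x, negative_branch)


def propagate(depth=20, width=1024, batch=512, seed=0):
    """Return (mean, variance) of the activations after each random SELU layer.

    Hypothetical helper for illustration: weights are i.i.d. Gaussian with
    zero mean and variance 1/width, and there are no biases.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    stats = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        x = selu(x @ w)
        stats.append((float(x.mean()), float(x.var())))
    return stats


if __name__ == "__main__":
    for layer, (mean, var) in enumerate(propagate(), start=1):
        print(f"layer {layer:2d}: mean={mean:+.3f} var={var:.3f}")
```

Under these assumptions the printed statistics should stay close to zero mean and unit variance at every depth, without any explicit normalization step.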

Biologically inspired protection of deep networks from adversarial attacks

Aran Nayebi and Surya Ganguli

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.LG, q-bio.NC

**First published:** 2017/03/27 (7 years ago)

**Abstract:** Inspired by biophysical principles underlying nonlinear dendritic computation
in neural circuits, we develop a scheme to train deep neural networks to make
them robust to adversarial attacks. Our scheme generates highly nonlinear,
saturated neural networks that achieve state of the art performance on gradient
based adversarial examples on MNIST, despite never being exposed to
adversarially chosen examples during training. Moreover, these networks exhibit
unprecedented robustness to targeted, iterative schemes for generating
adversarial examples, including second-order methods. We further identify
principles governing how these networks achieve their robustness, drawing on
methods from information geometry. We find these networks progressively create
highly flat and compressed internal representations that are sensitive to very
few input dimensions, while still solving the task. Moreover, they employ
highly kurtotic weight distributions, also found in the brain, and we
demonstrate how such kurtosis can protect even linear classifiers from
adversarial attack.

Nayebi and Ganguli propose saturating neural networks as a defense against adversarial examples. The main observation driving this paper can be stated as follows: neural networks are essentially based on linear sums of neurons (e.g. fully connected layers, convolutional layers) which are then activated; by injecting a small amount of noise per neuron it is possible to shift the final sum by large values, thereby propagating the noise through the network and fooling the network into misclassifying an example. To limit the impact of these adversarial examples, the network should be trained in a manner that drives many neurons into a saturated regime, where, so the argument goes, noise has less impact. The authors also give a biological motivation, which I won't go into in detail here.

Letting $\psi$ be the activation function used, e.g. sigmoid or ReLU, a regularizer is added to drive neurons into saturation. In particular, a penalty

$\lambda \sum_l \sum_i \psi_c(h_i^l)$

is added to the loss. Here, $l$ indexes the layer and $i$ the unit in the layer; $h_i^l$ then describes the input to the non-linearity computed for unit $i$ in layer $l$. $\psi_c$ is the complementary function defined as

$\psi_c(z) = \inf_{z': \psi'(z') = 0} |z - z'|$

It defines the distance of the point $z$ to the nearest saturated point $z'$ where $\psi'(z') = 0$. For ReLU activations, the complementary function is the ReLU function itself; for sigmoid activations, the complementary function is $\sigma_c(z) = |\sigma(z)(1 - \sigma(z))|$.

In experiments, Nayebi and Ganguli show that training with the additional penalty yields networks with higher robustness against adversarial examples compared to adversarial training (i.e. training on adversarial examples). They also provide some insight, showing e.g. the activation and weight distributions of layers, which illustrate that large parts of the neurons are indeed saturated. For details, see the paper.

I also want to point to a comment on the paper written by Brendel and Bethge [1] questioning the effectiveness of the proposed defense strategy. They discuss a variant of the fast gradient sign method (FGSM) with stabilized gradients which is able to fool saturated networks.

[1] W. Brendel, M. Bethge. Comment on “Biologically inspired protection of deep networks from adversarial attacks”, https://arxiv.org/abs/1704.01547.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
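The saturation penalty $\lambda \sum_l \sum_i \psi_c(h_i^l)$ from this summary can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `complementary` and `saturation_penalty` are hypothetical, the pre-activations are passed in as one array per layer for a single example, and the two complementary functions are the ones quoted in the summary rather than anything taken from the authors' code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def complementary(z, kind):
    """Complementary function psi_c as stated in the summary (hypothetical helper)."""
    if kind == "relu":
        # For ReLU the summary gives psi_c as the ReLU function itself.
        return np.maximum(z, 0.0)
    if kind == "sigmoid":
        # For sigmoid the summary gives |sigma(z) * (1 - sigma(z))|.
        s = sigmoid(z)
        return np.abs(s * (1.0 - s))
    raise ValueError(f"unsupported activation: {kind}")


def saturation_penalty(pre_activations, lam, kind):
    """lam * sum over layers l and units i of psi_c(h_i^l)."""
    return lam * sum(float(np.sum(complementary(h, kind))) for h in pre_activations)


if __name__ == "__main__":
    # Two layers of pre-activations for one example.
    layers = [np.array([0.0, 0.0]), np.array([10.0])]
    print(saturation_penalty(layers, lam=2.0, kind="sigmoid"))
```

In training, this scalar would be added to the task loss, so units sitting near the steep part of the sigmoid (or in the active region of a ReLU) are penalized and pushed toward saturation.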

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (as rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\ 4 & 5 & 1 \\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using an SVD and reshaping the matrices by removing all but the largest singular value:

$$ W = \left[\begin{array}{c} -0.656 \\ -0.652 \\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.64 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\ 0.386 \\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\ 4.33 & 4.08 & 2.09 \\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices.

The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that the error is not increased. The multiplicative approach is discussed in contrast to an additive, gradient descent based approach where small corrections are iteratively applied. The additive approach reduces to the multiplicative one by setting the learning rate ($\eta$) for each element of $H$ to the ratio of that element to the corresponding element of $W^T W H$, i.e. $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$.

### Still a draft
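The multiplicative update rule described in this summary can be sketched as follows for the squared-error objective $\|V - WH\|$. This is a minimal sketch rather than a reference implementation: the function name `nmf_multiplicative`, the random initialization, the iteration count, and the small `eps` added to the denominators to avoid division by zero are all assumptions made for illustration.

```python
import numpy as np


def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative V (m x n) into W (m x r) and H (r x n).

    Hypothetical helper: applies the multiplicative updates
        H <- H * (W^T V) / (W^T W H)
        W <- W * (V H^T) / (W H H^T)
    starting from random non-negative matrices.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled element-wise, so signs never change.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors


if __name__ == "__main__":
    # The customer-by-movie rating matrix from the example above.
    V = np.array([[5.0, 4.0, 1.0],
                  [4.0, 5.0, 1.0],
                  [2.0, 1.0, 5.0]])
    W, H, errors = nmf_multiplicative(V, r=1)
    print(np.round(W @ H, 2))
    print("reconstruction error:", errors[-1])
```

Because every update multiplies by a non-negative ratio, $W$ and $H$ stay non-negative throughout, and the recorded error should never increase from one iteration to the next. The scaling of $W$ and $H$ individually is not unique, so only the product $WH$ is comparable with the factors quoted in the text.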

Intriguing properties of neural networks

Christian Szegedy and Wojciech Zaremba and Ilya Sutskever and Joan Bruna and Dumitru Erhan and Ian Goodfellow and Rob Fergus

arXiv e-Print archive - 2013 via Local arXiv

Keywords: cs.CV, cs.LG, cs.NE

**First published:** 2013/12/21 (10 years ago)

**Abstract:** Deep neural networks are highly expressive models that have recently achieved
state of the art performance on speech and visual recognition tasks. While
their expressiveness is the reason they succeed, it also causes them to learn
uninterpretable solutions that could have counter-intuitive properties. In this
paper we report two such properties.
First, we find that there is no distinction between individual high level
units and random linear combinations of high level units, according to various
methods of unit analysis. It suggests that it is the space, rather than the
individual units, that contains of the semantic information in the high layers
of neural networks.
Second, we find that deep neural networks learn input-output mappings that
are fairly discontinuous to a significant extend. We can cause the network to
misclassify an image by applying a certain imperceptible perturbation, which is
found by maximizing the network's prediction error. In addition, the specific
nature of these perturbations is not a random artifact of learning: the same
perturbation can cause a different network, that was trained on a different
subset of the dataset, to misclassify the same input.

The paper introduces two key properties of deep neural networks:

- Semantic meaning of individual units.
    - Earlier works analyzed learnt semantics by finding images that maximally activate individual units.
    - Authors observe that there is no difference between individual units and random linear combinations of units.
    - It is the entire space of activations that contains the bulk of semantic information.
- Stability of neural networks to small perturbations in input space.
    - Networks that generalize well are expected to be robust to small perturbations in the input, i.e. imperceptible noise in the input shouldn't change the predicted class.
    - Authors find that networks can be made to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error.
    - These 'adversarial examples' generalize well to different architectures trained on different data subsets.

## Strengths

- The authors propose a way to make networks more robust to small perturbations by training them with adversarial examples in an adaptive manner, i.e. keep changing the pool of adversarial examples during training. In this regard, they draw a connection with hard-negative mining, and a network trained with adversarial examples performs better than others.
- Formal description of how to generate adversarial examples and mathematical analysis of a network's stability to perturbations are useful studies.

## Weaknesses / Notes

- Two images that are visually indistinguishable to humans but classified differently by the network is indeed an intriguing observation.
- The paper feels a little half-baked in parts, and some ideas could've been presented more clearly.
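The idea of finding a perturbation by maximizing the prediction error can be illustrated on a toy model. The sketch below is not the procedure from the paper, which uses a box-constrained L-BFGS optimization on deep networks; it swaps in a single gradient-sign step on a hand-picked logistic regression classifier. The weights, the input, the step size `eps`, and the function names are all hypothetical values chosen so the effect is visible.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy(w, b, x, y):
    """Binary cross-entropy of a logistic regression model on one example."""
    p = sigmoid(w @ x + b)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))


def gradient_sign_perturbation(w, b, x, y, eps):
    """Move x by eps per coordinate in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # gradient of the cross-entropy with respect to x
    return x + eps * np.sign(grad_x)


if __name__ == "__main__":
    w = np.array([2.0, -1.0, 0.5])  # toy weights, chosen by hand
    b = 0.0
    x = np.array([0.2, 0.1, 0.1])   # correctly classified as class 1
    y = 1
    x_adv = gradient_sign_perturbation(w, b, x, y, eps=0.15)
    print("clean prediction:      ", sigmoid(w @ x + b))
    print("adversarial prediction:", sigmoid(w @ x_adv + b))
```

With these toy numbers, a change of at most 0.15 per input coordinate is enough to push the predicted probability across 0.5 and flip the class, which mirrors on a small scale the instability the summary describes.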

Automatic chemical design using a data-driven continuous representation of molecules

Gómez-Bombarelli, Rafael and Duvenaud, David and Hernández-Lobato, José Miguel and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Alán

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

I'll admit that I found this paper a bit of a letdown to read, relative to expectations rooted in its high citation count, and my general excitement and interest to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of how the authors' approach works.

The method proposed is basically a very straightforward Variational Auto Encoder, or VAE. It takes in a textual SMILES string representation of a molecular structure, uses an encoder to map that into a continuous vector representation, a decoder to map the vector representation back into a SMILES string, and an auxiliary predictor to predict properties of a molecule given the continuous representation. So, the training loss is a combination of the reconstruction loss (log probability of the true molecule under the distribution produced by the decoder) and the semi-supervised predictive loss. The hope with this model is that it would allow you to sample from a space of potential molecules by starting from an existing molecule, and then optimizing the vector representation of that molecule to make it score higher on whatever property you want to optimize for.

https://i.imgur.com/WzZsCOB.png

The authors acknowledge that, in this setup, you're just producing a probability distribution over characters, and that the continuous vectors sampled from the latent space might not actually map to valid SMILES strings, and beyond that may well not correspond to chemically valid molecules. Empirically, they said that the proportion of valid generated molecules ranged between 1 and 70%. But they argue that it'd be too difficult to enforce those constraints, and instead just sample from the model and run the results through a hand-designed filter for molecular validity.

In my view, this is the central weakness of the method proposed in this paper: they seem not to have tackled the question of either chemical viability or even syntactic correctness of the produced molecules. I found it difficult to nail down from the paper what the ultimate percentage of valid molecules was from points in latent space that were off of the training data. A table reports "percentage of 5000 randomly-selected latent points that decode to valid molecules after 1000 attempts," but I'm confused by what the 1000 attempts means here: does that mean we draw 1000 samples from the distribution given by the decoder, and see if *any* of those samples are valid? That would be a strange metric, if so, and perhaps it means something different, but it's hard to tell.

https://i.imgur.com/9sy0MXB.png

This paper made me really curious to see whether a GAN could do better in this space, since it would presumably be better at the task of incentivizing syntactic correctness of produced strings (given that any deviation from correctness could be signal for the discriminator), but it might also lead to issues around mode collapse, and when I last checked the literature, GANs on text data in particular were still not great.
