ShortScience.org - Making Science Accessible!


Understanding deep learning requires rethinking generalization
Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

Summary by Martin Thoma 7 years ago

This paper asks what, and how exactly, CNNs learn, given that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels, achieving a training error of 0 while the test error is, as expected, no better than random guessing. $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension, Rademacher complexity, and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.
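
A minimal sketch of how such a construction can work, written from my reading of the idea rather than taken from the paper: project every sample onto a direction $a$ ($d$ parameters), place $n$ biases between the sorted projections, and solve a triangular system for $n$ output weights. The names `fit_two_layer_relu` and `predict` and the random data are assumptions made for illustration.

```python
import random

def fit_two_layer_relu(samples, labels, seed=0):
    """Fit c(x) = sum_j w[j] * max(<a, x> - b[j], 0) exactly to the labels."""
    rng = random.Random(seed)
    d = len(samples[0])
    # d parameters: a random direction; distinct samples get distinct
    # projections with probability 1.
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    proj = [sum(ai * xi for ai, xi in zip(a, x)) for x in samples]
    order = sorted(range(len(samples)), key=lambda i: proj[i])
    z = [proj[i] for i in order]
    y = [labels[i] for i in order]
    # n parameters: interleaved biases b[0] < z[0] < b[1] < z[1] < ...
    b = [z[0] - 1.0] + [(z[j - 1] + z[j]) / 2.0 for j in range(1, len(z))]
    # n parameters: A[i][j] = max(z[i] - b[j], 0) is lower triangular with a
    # positive diagonal, so A w = y is solved by forward substitution.
    w = []
    for i in range(len(z)):
        partial = sum(w[j] * (z[i] - b[j]) for j in range(i))
        w.append((y[i] - partial) / (z[i] - b[i]))
    return a, b, w

def predict(params, x):
    a, b, w = params
    s = sum(ai * xi for ai, xi in zip(a, x))
    return sum(wj * max(s - bj, 0.0) for wj, bj in zip(w, b))

# Hypothetical data: 10 random samples in 3 dimensions with random +-1 labels.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(10)]
Y = [rng.choice([-1.0, 1.0]) for _ in range(10)]
params = fit_two_layer_relu(X, Y)
print(max(abs(predict(params, x) - y) for x, y in zip(X, Y)))
```

The labels play no role in choosing $a$ and $b$, which is why any labeling, including a random one, can be fitted.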

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m$ should be 0, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, and not only on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)


Near-optimal probabilistic RNA-seq quantification
Nicolas L Bray and Harold Pimentel and Páll Melsted and Lior Pachter
Nature Biotechnology - 2016 via Local CrossRef

Summary by Geneviève Boucher 6 years ago

This paper from 2016 introduced a new k-mer based method to estimate isoform abundance from RNA-Seq data called kallisto.  The method provided a significant improvement in speed and memory usage compared to the previously used methods while yielding similar accuracy.   In fact, kallisto is able to quantify expression in a matter of minutes instead of hours.

The standard (previous) methods for quantifying expression rely on mapping, i.e. on the alignment of the sequenced reads of a transcriptome to a reference genome. Reads are assigned to a position in the genome, and the gene or isoform expression values are derived by counting the number of reads overlapping the features of interest.

The idea behind kallisto is to rely on a pseudoalignment, which does not attempt to identify the positions of the reads in the transcripts, only the potential transcripts of origin. Thus, it avoids aligning each read to a reference genome. In fact, kallisto only uses the transcriptome sequences (not the whole genome) in its first step, which is the generation of the kallisto index. Kallisto builds a colored de Bruijn graph (T-DBG) from all the k-mers found in the transcriptome. Each node of the graph corresponds to a k-mer (a short sequence of k nucleotides) and retains, in the form of a color, the information about the transcripts in which it can be found. Linear stretches having the same coloring in the graph correspond to transcripts. Once the T-DBG is built, kallisto stores a hash table mapping each k-mer to its transcript(s) of origin along with the position within the transcript(s). This step is done only once and depends on a provided annotation file (containing the sequences of all the transcripts in the transcriptome).
  
Then, for a given sequenced sample, kallisto decomposes each read into its k-mers and uses those k-mers to find a path covering in the T-DBG. This path covering of the transcriptome graph, where a path corresponds to a transcript, generates k-compatibility classes for each k-mer, i.e. sets of potential transcripts of origin on the nodes. The potential transcripts of origin for a read can be obtained by intersecting the k-compatibility classes of its k-mers. To make the pseudoalignment faster, kallisto skips redundant k-mers, since neighboring k-mers often belong to the same transcripts. Figure 1, from the paper, summarizes these different steps.

https://i.imgur.com/eNH2kuO.png

**Figure 1**. Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ... ) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers. [From Bray et al. Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016.]
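
The core of the pseudoalignment idea (hash k-mers, then intersect their k-compatibility classes) can be sketched with a plain dictionary. This is a toy illustration and not kallisto's implementation: it builds no de Bruijn graph, does no k-mer skipping, ignores the reverse strand, and treats a read with an unknown k-mer as incompatible with everything. The sequences and the names `build_index` and `pseudoalign` are made up for the example.

```python
from collections import defaultdict

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the k-compatibility classes of all k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        kmer_class = index.get(read[i:i + k], set())
        if compatible is None:
            compatible = set(kmer_class)
        else:
            compatible &= kmer_class
        if not compatible:
            break  # no transcript can explain every k-mer seen so far
    return compatible or set()

# Hypothetical transcriptome: t1 and t2 share a prefix, t3 is unrelated.
transcripts = {
    "t1": "ACGTACGGTTCA",
    "t2": "ACGTACGGAATC",
    "t3": "TTGCAAGGCTAA",
}
index = build_index(transcripts, k=5)
print(pseudoalign("CGTACGG", index, k=5))  # read from the shared region
print(pseudoalign("ACGGTTC", index, k=5))  # read reaching into t1 only
```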

Then, kallisto optimizes the following RNA-Seq likelihood function using the expectation-maximization (EM) algorithm.  

$$L(\alpha) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \frac{\alpha_t}{l_t} = \prod_{e \in E}\left(  \sum_{t \in e} \frac{\alpha_t}{l_t} \right )^{c_e}$$

In this function, $F$ is the set of fragments (or reads), $T$ is the set of transcripts, $l_t$ is the (effective) length of transcript $t$, and $y_{f,t}$ is a compatibility matrix defined as 1 if fragment $f$ is compatible with $t$ and 0 otherwise. On the right-hand side, $E$ is the set of equivalence classes and $c_e$ is the number of reads in equivalence class $e$. The parameters $\alpha_t$ are the probabilities of selecting reads from a transcript $t$. These $\alpha_t$ are the parameters of interest since they represent the isoform abundances or relative expressions.

To make things faster, the compatibility matrix is collapsed (factorized) into equivalence classes. An equivalence class consists of all the reads compatible with the same subset of transcripts. The EM algorithm is applied to equivalence classes (not to reads). Each $\alpha_t$ is optimized to maximize the likelihood of the transcript abundances given the observed equivalence classes. The speed of the method makes it possible to evaluate the uncertainty of the abundance estimates for each RNA-Seq sample using a bootstrap technique. For a given sample containing $N$ reads, a bootstrap sample is generated by sampling $N$ counts from a multinomial distribution over the equivalence classes derived from the original sample. The EM algorithm is applied to those sampled equivalence class counts to estimate transcript abundances. The bootstrap information is then used in downstream analyses such as determining which genes are differentially expressed.
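
Collapsing reads into equivalence classes and drawing a bootstrap sample over them can be sketched as follows. This is a simplified illustration, assuming each read has already been reduced to its set of compatible transcripts. `collapse` and `bootstrap_counts` are hypothetical names, and the multinomial draw is emulated with `random.choices`.

```python
import random
from collections import Counter

def collapse(read_classes):
    """Count how many reads share each set of compatible transcripts."""
    return Counter(frozenset(c) for c in read_classes)

def bootstrap_counts(class_counts, rng):
    """Resample N reads with replacement: one multinomial draw over classes."""
    classes = list(class_counts)
    weights = [class_counts[c] for c in classes]
    n_reads = sum(weights)
    return Counter(rng.choices(classes, weights=weights, k=n_reads))

# Hypothetical compatibility sets for four reads.
reads = [{"t1", "t2"}, {"t1", "t2"}, {"t1"}, {"t3"}]
counts = collapse(reads)
resampled = bootstrap_counts(counts, random.Random(0))
print(dict(counts))
print(dict(resampled))
```

Because only the class counts are resampled, each bootstrap replicate avoids repeating the pseudoalignment and only needs another run of the EM algorithm.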

Practically, we can illustrate the different steps involved in kallisto using a small example.  Starting from a tiny genome with 3 transcripts, assume that the RNA-Seq experiment produced 4 reads as depicted in the image below.

https://i.imgur.com/5JDpQO8.png

The first step is to build the T-DBG and the kallisto index. All transcript sequences are decomposed into k-mers (here k=5) to construct the colored de Bruijn graph. Not all nodes are represented in the following drawing. The idea is that each different transcript will lead to a different path in the graph. The strand is not taken into account: kallisto is strand-agnostic.

https://i.imgur.com/4oW72z0.png

Once the index is built, the four reads of the sequenced sample can be analysed.  They are decomposed into k-mers (k=5 here too) and the pre-built index is used to determine the k-compatibility class of each k-mer. Then, the k-compatibility class of each read is computed. For example, for read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2.

https://i.imgur.com/woektCH.png

This is done for the four reads, enabling the construction of the compatibility matrix $y_{f,t}$, which is part of the RNA-Seq likelihood function. In this equation, the $\alpha_t$ are the parameters that we want to estimate.

https://i.imgur.com/Hp5QJvH.png

Because the EM algorithm is too slow to be applied to millions of reads, the compatibility matrix $y_{f,t}$ is factorized into equivalence classes and a count is computed for each class (how many reads are represented by this equivalence class). The EM algorithm uses this collapsed information to maximize the equivalent RNA-Seq likelihood function and optimize the $\alpha_t$.

https://i.imgur.com/qzsEq8A.png

The EM algorithm stops when, for every transcript $t$ with $\alpha_t N > 0.01$, the value of $\alpha_t N$ changes by less than 1% between iterations, where $N$ is the total number of reads.
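
A minimal sketch of an EM loop over equivalence-class counts, using the stopping rule described above. It is a textbook EM update for the collapsed likelihood, not kallisto's code. The function name `em_abundances`, the uniform initialization and the toy counts and lengths are assumptions.

```python
def em_abundances(class_counts, eff_lengths, max_iter=10000):
    """Estimate alpha_t from {frozenset(transcripts): read count}."""
    transcripts = sorted(eff_lengths)
    n_reads = sum(class_counts.values())
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(max_iter):
        assigned = {t: 0.0 for t in transcripts}
        for eq_class, count in class_counts.items():
            # E-step: split the class count in proportion to alpha_t / l_t.
            denom = sum(alpha[t] / eff_lengths[t] for t in eq_class)
            if denom == 0.0:
                continue
            for t in eq_class:
                assigned[t] += count * (alpha[t] / eff_lengths[t]) / denom
        # M-step: alpha_t is the fraction of reads assigned to t.
        new_alpha = {t: assigned[t] / n_reads for t in transcripts}
        # Stop when every transcript with alpha_t * N > 0.01 moved by < 1%.
        converged = all(
            abs(new_alpha[t] - alpha[t]) < 0.01 * alpha[t]
            for t in transcripts
            if new_alpha[t] * n_reads > 0.01
        )
        alpha = new_alpha
        if converged:
            break
    return alpha

# Hypothetical counts for three transcripts of equal effective length.
counts = {
    frozenset({"t1", "t2"}): 2,
    frozenset({"t1"}): 1,
    frozenset({"t3"}): 1,
}
lengths = {"t1": 100.0, "t2": 100.0, "t3": 100.0}
print(em_abundances(counts, lengths))
```

In this toy input the reads shared by t1 and t2 drift towards t1, because t1 is also supported by a read of its own.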


Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

Summary by Joseph Paul Cohen 7 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix in which three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1 \\
4 & 5 & 1 \\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First, let's do this without any non-negativity constraint, using an SVD and keeping only the largest singular value:


$$
W = \left[\begin{array}{c}
-0.656 \\
-0.652 \\
-0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.65 & -6.26 & -3.20
\end{array}\right]
$$
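
A rank-1 factor pair of this kind can be recomputed with a few lines of power iteration. The sketch below is a simple stand-in for a full SVD routine, not necessarily how the numbers above were obtained, and `rank_one_svd` is a hypothetical helper name. The sign of singular vectors is arbitrary, so the factors may come out with flipped signs while the product $WH$ stays the same.

```python
import math

def rank_one_svd(V, iters=200):
    """Leading singular triplet (u, sigma, v) via power iteration on V^T V."""
    m, n = len(V), len(V[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(V[i][j] * v[j] for j in range(n)) for i in range(m)]  # V v
        w = [sum(V[i][j] * u[i] for i in range(m)) for j in range(n)]  # V^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(V[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

V = [[5, 4, 1],
     [4, 5, 1],
     [2, 1, 5]]
u, sigma, v = rank_one_svd(V)
W = [[ui] for ui in u]              # m x 1 factor
H = [[sigma * vj for vj in v]]      # 1 x n factor
print(W)
print(H)
```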

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c}
0.388 \\
0.386 \\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\
4.33 & 4.08 & 2.09 \\
2.52 & 2.37 & 1.21
\end{array}\right]
$$


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is contrasted with an additive, gradient-descent-based approach in which small corrections are applied iteratively. The multiplicative rule can be obtained from the additive one by choosing a separate learning rate for each element, $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$, i.e. the ratio of the current element of $H$ to the corresponding element of $W^T W H$.
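
A minimal sketch of the multiplicative updates for the squared-error objective, in plain Python lists so that it runs without extra libraries. The updates are $H \leftarrow H \odot (W^T V) / (W^T W H)$ and $W \leftarrow W \odot (V H^T) / (W H H^T)$. The helper names, the random initialization, the iteration count and the small `eps` guard against division by zero are my assumptions, not details from the paper.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, r, iters=500, seed=0, eps=1e-12):
    """Multiplicative updates; W and H stay non-negative by construction."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(r)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(m)]
    return W, H

def squared_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

V = [[5, 4, 1],
     [4, 5, 1],
     [2, 1, 5]]
W, H = nmf(V, r=1)
print(matmul(W, H))
print(squared_error(V, W, H))
```

Each element is only ever multiplied by a non-negative ratio, so no projection or clipping step is needed to keep the factors non-negative. The scale of $W$ and $H$ is not unique; only their product is comparable between runs.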



### Still a draft


Sharp Minima Can Generalize For Deep Nets
Dinh, Laurent and Pascanu, Razvan and Bengio, Samy and Bengio, Yoshua
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

Summary by David Stutz 4 years ago

Dinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varying flatness around the obtained minima, as illustrated in Figure 1. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones.

https://i.imgur.com/gHfolEV.jpg
Figure 1: Illustration of the influence of parameterization on the flatness of the obtained minima.
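
The effect of the parameterization can be reproduced on a toy model. For a ReLU unit, rescaling the weights as $(w_1, w_2) \to (\alpha w_1, w_2 / \alpha)$ with $\alpha > 0$ leaves the prediction function unchanged but changes the curvature of the loss. The sketch below uses a one-hidden-unit model, made-up data, and a finite-difference second derivative as a crude sharpness proxy. None of these choices come from the paper.

```python
def relu(z):
    return max(z, 0.0)

def predict(w1, w2, x):
    return w2 * relu(w1 * x)

def loss(w1, w2, data):
    return sum((predict(w1, w2, x) - y) ** 2 for x, y in data)

def curvature(f, w, h=1e-3):
    """Central finite-difference estimate of the second derivative at w."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)

# Hypothetical data fitted exactly by w1 = w2 = 1.
data = [(1.0, 1.0), (2.0, 2.0)]
w1, w2, alpha = 1.0, 1.0, 10.0
v1, v2 = alpha * w1, w2 / alpha  # same function, different parameters

before = curvature(lambda w: loss(w1, w, data), w2)
after = curvature(lambda w: loss(v1, w, data), v2)
print(predict(w1, w2, 3.0), predict(v1, v2, 3.0))
print(before, after)
```

Both parameter settings give the same predictions and the same (zero) loss, yet the curvature along the second weight grows by a factor of $\alpha^2$, so this minimum can be made to look arbitrarily sharp.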

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


Net2Net: Accelerating Learning via Knowledge Transfer
Chen, Tianqi and Goodfellow, Ian J. and Shlens, Jonathon
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Hugo Larochelle 8 years ago

This paper presents an approach to initialize a neural network from the parameters of a smaller, previously trained neural network. This is effectively done by increasing the size (in width and/or depth) of the previously trained neural network in such a way that the function represented by the network doesn't change (i.e. the output of the larger neural network is still the same). The motivation here is that initializing larger neural networks in this way accelerates their training, since at initialization the neural network will already be quite good.

In a nutshell, neural networks are made wider by adding several copies (selected randomly) of the same hidden units to the hidden layer, for each hidden layer. To ensure that the neural network output remains the same, each incoming connection weight must also be divided by the number of replicas that unit is connected to in the previous layer. If not training using dropout, it is also recommended to add some noise to this initialization, in order to break its initial symmetry (though this will actually break the property that the network's output is the same). As for making a deeper network, layers are added by initializing them to be the identity function. For ReLU units, this is achieved using an identity matrix as the connection weight matrix. For units based on sigmoid or tanh activations, unfortunately it isn't possible to add such identity layers.
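
Both operations can be checked on a tiny fully connected ReLU network. The sketch below is an illustration of the idea under simplifying assumptions (no biases, no noise, one hidden layer to start with) and not the authors' code. `widen` and `identity_layer` are hypothetical helper names.

```python
import random
from collections import Counter

def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(layers, x):
    """ReLU after every weight matrix except the last one."""
    *hidden, last = layers
    for W in hidden:
        x = relu(matvec(W, x))
    return matvec(last, x)

def widen(W_in, W_out, new_width, rng):
    """Replicate random hidden units; split outgoing weights among copies."""
    old_width = len(W_in)
    extra = [rng.randrange(old_width) for _ in range(new_width - old_width)]
    mapping = list(range(old_width)) + extra
    copies = Counter(mapping)
    new_in = [list(W_in[g]) for g in mapping]
    new_out = [[row[g] / copies[g] for g in mapping] for row in W_out]
    return new_in, new_out

def identity_layer(width):
    """Extra layer that ReLU leaves unchanged on non-negative inputs."""
    return [[1.0 if i == j else 0.0 for j in range(width)]
            for i in range(width)]

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 -> 4
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 -> 2
x = [0.5, -1.0, 2.0]

W1_wide, W2_wide = widen(W1, W2, 7, rng)
print(forward([W1, W2], x))
print(forward([W1_wide, W2_wide], x))
print(forward([W1_wide, identity_layer(7), W2_wide], x))
```

The identity layer works here because the hidden activations are already non-negative after the first ReLU, so applying ReLU again changes nothing.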

In their experiments on ImageNet, the authors show that this initialization allows them to train larger networks faster than if trained from random initialization. More importantly, they were able to outperform their previous validation set ImageNet accuracy by initializing a very large network from their best Inception network.