ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Understanding deep learning requires rethinking generalization
Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Martin Thoma 7 years ago

This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)

papers.nips.cc
scholar.google.com

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So 

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First lets do this without any non-negative constraint using an SVD reshaping matrices based on removing eigenvalues:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\
 -0.652 \\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\
4.33 & 4.08 & 2.09 \\\
2.52 & 2.37 & 1.21 \\\
\end{array}\right]
$$


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$.



### Still a draft

dx.doi.org
sci-hub
scholar.google.com

Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image
Bogo, Federica and Kanazawa, Angjoo and Lassner, Christoph and Gehler, Peter V. and Romero, Javier and Black, Michael J.
European Conference on Computer Vision - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Kirill Pevzner 6 years ago

Problem
----------
Given an unconstrained image, estimate:
1. 3d pose of human skeleton 
2. 3d body mesh


Contributions
-----------
1. full body mesh extraction from image
2. improvement of state of the art


Datasets
-------------
1. Leeds Sports
2. HumanEva
3. Human3.6M


Approach
----------------
Consider the problem both bottom-up and top-down.
1. Bottom-up: DeepCut cnn model to fit joints 2d positions onto the image.
2. top-down: A skinned multi-person linear model (SMPL) is fitted and projected onto 2d joint positions and image.

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
arxiv-vanity.com
scholar.google.com

Progressive Neural Networks
Andrei A. Rusu and Neil C. Rabinowitz and Guillaume Desjardins and Hubert Soyer and James Kirkpatrick and Koray Kavukcuoglu and Razvan Pascanu and Raia Hadsell
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Denny Britz 7 years ago

TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in finetuning). ProgNNs train a neural neural on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches.

#### Key Points

- Finetuning is a destructive process that forgets previous knowledge. We don't want that.
- Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. Parameters of those connections are learned, but network 2 and network 1 are frozen during training of network 3.
- Downside: # of Parameters grows quadratically with the number of tasks. Paper discussed some approaches to address the problem, but not sure how well these work in practice.
- Metric: AUC (Average score per episode during training) as opposed to final score. Transfer score = Relative performance compared with single net baseline.
- Authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network.
- Experiment 1: Variations of Pong game. Baseline that finetunes only final layer fails to learn. ProgNN beats other baselines and APS shows re-use of knowledge.
- Experiment 2: Different Atari games. ProgNets result in positive Transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning final layers fails again. ProgNN beats other approaches.
- Experiment 3: Labyrinth, 3D Maze. Pretty much same result as other experiments.

#### Notes

- It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? Network are trained on different tasks, so the layer representations, or even numbers of layers, may be completely different. And Once you introduce lateral connections from all layers to all other layers the approach no longer scales.
- Old tasks cannot learn from new tasks. Unlike humans.
- Gating or residuals for lateral connection could make sense to allow to network to "easily" re-use previously learned knowledge.
- Why use AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain.
- Scary that finetuning the final layer only fails in most experiments. That's a very commonly used approach in non-RL domains.
- Someone should try this on non-RL tasks.
- What happens to training time and optimization difficult as you add more columns? Seems prohibitively expensive.