ShortScience.org - Making Science Accessible!

Improving neural networks by preventing co-adaptation of feature detectors
Geoffrey E. Hinton and Nitish Srivastava and Alex Krizhevsky and Ilya Sutskever and Ruslan R. Salakhutdinov
arXiv e-Print archive - 2012 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG

[link] Summary by Martin Thoma 7 years ago

This paper introduced Dropout, a new layer type. It has a parameter $\alpha \in (0, 1)$. The output dimensionality of a dropout layer is equal to its input dimensionality. During training, each neuron's output is set to 0 with probability $\alpha$. At test time, no output is set to 0; instead, the output of every neuron is multiplied by the keep probability $1 - \alpha$ so that its expected value matches training. (For $\alpha = 0.5$, the value used for hidden units in the paper, this is the same as multiplying by $\alpha$.)
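
A minimal NumPy sketch of this behaviour may help. The function names are hypothetical, and the test-time factor used here is the keep probability $1 - \alpha$, which coincides with $\alpha$ when $\alpha = 0.5$.

```python
import numpy as np

def dropout_train(x, alpha, rng):
    """Training pass: zero each activation independently with probability alpha."""
    keep_mask = rng.random(x.shape) >= alpha  # True where the unit is kept
    return x * keep_mask

def dropout_test(x, alpha):
    """Test pass: keep every activation, scaled by the keep probability."""
    return x * (1.0 - alpha)

rng = np.random.default_rng(0)
x = np.ones((10000, 8))   # toy activations, all equal to 1
alpha = 0.5               # drop probability

train_out = dropout_train(x, alpha, rng)
test_out = dropout_test(x, alpha)

# The average activation seen during training is close to the scaled test output.
print(train_out.mean(), test_out.mean())
```

The output shape equals the input shape in both passes, and the scaling keeps the expected activation the same between training and testing.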

A much better paper, by the same authors but 2 years later, is [Dropout: a simple way to prevent neural networks from overfitting](http://www.shortscience.org/paper?bibtexKey=journals/jmlr/SrivastavaHKSS14).

Dropout can be interpreted as training an ensemble of many networks, which share weights.

It was notably used by [ImageNet Classification with Deep Convolutional Neural Networks](http://www.shortscience.org/paper?bibtexKey=krizhevsky2012imagenet).

Nerfies: Deformable Neural Radiance Fields
Keunhong Park and Utkarsh Sinha and Jonathan T. Barron and Sofien Bouaziz and Dan B Goldman and Steven M. Seitz and Ricardo Martin-Brualla
arXiv e-Print archive - 2020 via Local arXiv
Keywords: cs.CV, cs.GR

[link] Summary by CodyWild 2 years ago

This summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first! 

The idea of a NERF is to learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), from a given viewing angle (theta, phi). With such a function, you can generate predictions of what images taken from certain angles would look like by sampling along a viewing ray, and integrating the combined hue and opacity into an aggregated view. This prediction can then be compared to a true image taken from that direction, and gradients passed backwards into the prediction model.
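
The "sampling along a viewing ray, and integrating" step can be sketched as a discrete alpha-compositing sum. This is a simplified illustration in NumPy, not the authors' code: the function name `render_ray` and the hand-written sample values are assumptions, and in a real NERF the colors and densities would come from the trained network.

```python
import numpy as np

def render_ray(rgb, sigma, deltas):
    """Composite samples along one ray into a single pixel color.

    rgb:    (N, 3) predicted color at each sample point
    sigma:  (N,)   predicted density at each sample point
    deltas: (N,)   distance covered by each sample along the ray
    """
    # Opacity contributed by each segment of the ray.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unblocked.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights

# Toy ray: empty space, then a dense red surface, then green hidden behind it.
rgb = np.array([[0.0, 0.0, 1.0],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
sigma = np.array([0.0, 50.0, 50.0])
deltas = np.array([0.5, 0.5, 0.5])

color, weights = render_ray(rgb, sigma, deltas)
print(color)
```

Because every step is differentiable, a loss between `color` and the true pixel can be backpropagated into whatever model produced `rgb` and `sigma`.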

An important assumption of this model is that the scene being photographed is static; specifically, that every point in space is always inhabited by the same part of the 3D object, regardless of what angle it's viewed from. This is a reasonable assumption for photos of inanimate objects, or of humans in highly controlled lab settings, but it is often not true for humans when you, say, ask them to take a selfie video of themselves. Even if they're trying to keep roughly still, there will be slight shifts in the location and position of their head between frames, and the authors of this paper show that this can lead to strange artifacts if you naively try to train a NERF from the images (including a particularly odd one where it hallucinates tiny copies of the image in the air surrounding the face). 

https://i.imgur.com/IUVh6uM.png

The fix proposed by this paper is to apply a learnable deformation field to each image, where the notion is to deform each view into being in one canonical position (fixed per network, since, again, one network corresponds to a single scene). This means that, along with learning the parameters of the NERF itself, you're also learning what deformation to apply to each training image to get it into this canonical position. This is done by parametrizing the deformation in a particular way, and then having that deformation be conditioned by a latent vector that's trained similarly to how you'd train an embedding (one learned vector per image example). The parametrization of the deformation is honestly a little bit over my head, given my lack of grounding in 3D modeling, but my general sense is that it applies some constraints and regularization to ensure that the learned deformations are realistic, insofar as humans are mostly rigid (one patch of skin on my forehead generally doesn't move except in concordance with the rest of my forehead), but with some possibility for elasticity (skin can stretch if I, say, smile). The authors also include an annealing scheme whereby, early in training, the model focuses on learning coarse (large-scale) deformations, and later in training, it's allowed to also learn weights for more precise deformations. This is to hopefully match macro-scale shifts before adding the noise of precise changes.
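
One way such a coarse-to-fine schedule could look is a positional encoding whose high-frequency bands are faded in as training progresses. The sketch below is an assumption made for illustration, since the summary does not spell out the schedule: the cosine-eased window, the function names, and the parameter `alpha` (a progress value growing from 0 to `num_bands`) are all hypothetical.

```python
import numpy as np

def band_weights(alpha, num_bands):
    """Weight in [0, 1] for each frequency band; band j turns on as alpha passes j."""
    j = np.arange(num_bands)
    return 0.5 * (1.0 - np.cos(np.pi * np.clip(alpha - j, 0.0, 1.0)))

def windowed_encoding(x, alpha, num_bands):
    """Sin/cos encoding of coordinates x, with fine bands masked early in training."""
    freqs = 2.0 ** np.arange(num_bands)          # coarse to fine frequencies
    angles = x[..., None] * freqs * np.pi        # shape (..., dims, num_bands)
    w = band_weights(alpha, num_bands)
    return np.concatenate([w * np.sin(angles), w * np.cos(angles)], axis=-1)

points = np.zeros((5, 3))                        # five toy 3D points
early = windowed_encoding(points, alpha=1.5, num_bands=4)
late = windowed_encoding(points, alpha=4.0, num_bands=4)
print(band_weights(1.5, 4))                      # only the coarsest bands are active
```

With a small `alpha` the deformation network only sees low-frequency inputs and can therefore only express smooth, large-scale motion; raising `alpha` gradually exposes the finer bands.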

This addition of a learned deformation is most of the contribution of this method: with it applied, they show that they're able to learn realistic NERFs from selfies, which they term "NERFIES". They mention a few pieces of concurrent work that try to solve the same problem of non-static human subjects in different ways. I haven't had a chance to read those, so I can't really comment on how NERFIES stacks up to alternate approaches, but it appears to be at least one empirically convincing solution to the problem it's aiming at.

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using a truncated SVD that keeps only the largest singular value:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\\
 -0.652 \\\\
 -0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20\\\\
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\\
4.33 & 4.08 & 2.09 \\\\
2.52 & 2.37 & 1.21
\end{array}\right]
$$
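
The unconstrained rank-1 factorization can be reproduced with a few lines of NumPy. This is a sketch for checking the example, with variable names chosen here for illustration. It also shows why the two factorizations can agree: flipping the sign of both factors leaves the product $WH$ unchanged, and for this all-positive $V$ the flipped factors contain no negative entries.

```python
import numpy as np

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

# Full SVD, then keep only the largest singular value (rank r = 1).
U, s, Vt = np.linalg.svd(V)
W = U[:, :1]               # m x 1
H = s[0] * Vt[:1, :]       # 1 x n

# The sign returned by the SVD is arbitrary; flip both factors together
# so that they are non-negative. The product W @ H is unaffected.
if W.sum() < 0:
    W, H = -W, -H

approx = W @ H
error = np.linalg.norm(V - approx)   # Frobenius norm of the residual
print(np.round(approx, 2))
print(error)
```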


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient descent based approach where small corrections are iteratively applied. The multiplicative rule can be obtained from the additive one by choosing a separate learning rate ($\eta$) for each element of $H$, equal to the ratio of that element of $H$ to the corresponding element of $W^T W H$.
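
The multiplicative update for the squared-error objective can be sketched in NumPy as follows. The function name, the iteration count, the random initialization, and the small `eps` guard against division by zero are assumptions made for this illustration.

```python
import numpy as np

def nmf_multiplicative(V, r, num_iters=200, eps=1e-9, seed=0):
    """Factor a non-negative V (m x n) into W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps    # random non-negative starting point
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(num_iters):
        # Each element is rescaled by a non-negative ratio, so it never
        # changes sign and the factors stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

W, H, errors = nmf_multiplicative(V, r=1)
print(np.round(W @ H, 2))
print(errors[0], errors[-1])
```

Unlike an additive step, no projection or clipping is needed to keep the factors non-negative, and there is no learning rate to tune.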



### Still a draft

Neural Turing Machines
Graves, Alex and Wayne, Greg and Danihelka, Ivo
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 8 years ago

TLDR; The authors propose Neural Turing Machines (NTMs). An NTM consists of a memory bank and a controller network. The controller network (LSTM or MLP in this paper) controls read/write heads by focusing their attention softly, using a distribution over all memory addresses. It can learn the parameters for two addressing mechanisms: content-based addressing ("find similar items") and location-based addressing. NTMs can be trained end-to-end using gradient descent. The authors evaluate NTMs on program generation tasks and compare their performance against that of LSTMs. Tasks include copying, recall, prediction, and sorting binary vectors. While both LSTMs and NTMs seem to perform well on training data, only NTMs are able to generalize to longer sequences.
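
Content-based addressing can be sketched as a softmax over the similarity between a key and every memory row. This is a minimal NumPy illustration rather than the authors' implementation: the function names are hypothetical, and in an NTM the key and the sharpness `beta` would be emitted by the controller instead of being set by hand.

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Soft attention over memory rows.

    memory: (N, M) memory bank with N addresses
    key:    (M,)   query vector
    beta:   scalar > 0, larger values give a sharper focus
    """
    eps = 1e-8
    # Cosine similarity between the key and each memory row.
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sim
    scores = scores - scores.max()     # for numerical stability
    w = np.exp(scores)
    return w / w.sum()                 # distribution over all addresses

def read(memory, weights):
    """A read is the weighted average of all memory rows."""
    return weights @ memory

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
key = np.array([0.0, 0.0, 1.0])

weights = content_addressing(memory, key, beta=10.0)
read_vector = read(memory, weights)
print(np.round(weights, 3))
```

Because the weights are a smooth function of the key and the memory, the whole read operation is differentiable, which is what allows end-to-end training with gradient descent.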


#### Key Observations

- Controller network tried with LSTM or MLP. Which one works better is task-dependent, but LSTM "cache" can be a bottleneck.
- Controller size, number of read/write heads, and memory size are hyperparameters.
- Monitoring the memory addressing shows that the NTM actually learns meaningful programs.
- The number of LSTM parameters grows quadratically with hidden unit size due to the recurrent connections. This is not the case for NTMs, leading to models with fewer parameters.
- Example problems are very small, typically using sequences of 8-bit vectors.


#### Notes/Questions

- At what length do NTMs stop working? Would've liked to see where results get significantly worse.
- Can we automatically transform fuzzy NTM programs into deterministic ones?

Rethinking Pre-training and Self-training
Zoph, Barret and Ghiasi, Golnaz and Lin, Tsung-Yi and Cui, Yin and Liu, Hanxiao and Cubuk, Ekin D. and Le, Quoc V.
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 3 years ago

Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result.

The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights of that model (except for the final prediction head) and using them as initialization for training on your downstream task. Self-training also uses an external dataset, but doesn't use that external data's labels. It works by

1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on

2) Using that model to make label predictions (for the label set of your downstream task), for the external dataset

3) Retraining a model from scratch with the combined set of human labels and predicted labels from step (2)
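
The three steps above can be sketched on toy data. Everything here is an assumption made for illustration: the paper works with image models, whereas this sketch uses synthetic 2D points and a nearest-centroid classifier (hypothetical helpers `fit_centroids` and `predict`) so that it stays self-contained.

```python
import numpy as np

def fit_centroids(X, y, num_classes):
    """'Train' a model: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, X):
    """Assign each point to the class with the nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
class_means = np.array([[-2.0, 0.0], [2.0, 0.0]])

def sample(n):
    """Draw n points from two Gaussian blobs, with their true labels."""
    y = rng.integers(0, 2, size=n)
    X = class_means[y] + rng.normal(size=(n, 2))
    return X, y

X_labeled, y_labeled = sample(20)      # small labeled downstream dataset
X_external, _ = sample(500)            # external data, labels discarded

# 1) Train a teacher on the labeled downstream data.
teacher = fit_centroids(X_labeled, y_labeled, num_classes=2)

# 2) Use the teacher to predict labels for the external dataset.
pseudo_labels = predict(teacher, X_external)

# 3) Retrain from scratch on human labels plus predicted labels.
X_all = np.concatenate([X_labeled, X_external])
y_all = np.concatenate([y_labeled, pseudo_labels])
student = fit_centroids(X_all, y_all, num_classes=2)
```

Note that the external dataset's own labels never enter the procedure; only its inputs are used.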

https://i.imgur.com/HaJTuyo.png

This intuitively feels like cheating, something that shouldn't quite work, and yet the authors find that it equals or outperforms pre-training and self-supervised learning in the setting they examined (transferring from ImageNet as an external dataset to COCO as a downstream task, and using data augmentations on COCO). They particularly find this to be the case when they're using stronger data augmentations, and when more labeled COCO data is available to train with from the pretrained starting point. They also find that self-training outperforms self-supervised (e.g. contrastive) learning in similar settings. They further demonstrate that self-training and pre-training can stack: you can get marginal value from one, even if you're already using the other. They do acknowledge that, because it requires training a model on your dataset twice rather than reusing an existing model directly, their approach is more computationally costly than the pretrained-ImageNet alternative.

This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which I believe has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works.

The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a half-baked after-the-fact justification.

One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pre-training do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. Their best guess is that it demonstrates the value of task-specificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels.