Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (6 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |

When Does Contrastive Visual Representation Learning Work?

Elijah Cole and Xuan Yang and Kimberly Wilber and Oisin Mac Aodha and Serge Belongie

arXiv e-Print archive - 2021 via Local arXiv

Keywords: cs.CV, cs.LG

**First published:** 2023/02/01 (just now)

**Abstract:** Recent self-supervised representation learning techniques have largely closed
the gap between supervised and unsupervised learning on ImageNet
classification. While the particulars of pretraining on ImageNet are now
relatively well understood, the field still lacks widely accepted best
practices for replicating this success on other datasets. As a first step in
this direction, we study contrastive self-supervised learning on four diverse
large-scale datasets. By looking through the lenses of data quantity, data
domain, data quality, and task granularity, we provide new insights into the
necessary conditions for successful self-supervised learning. Our key findings
include observations such as: (i) the benefit of additional pretraining data
beyond 500k images is modest, (ii) adding pretraining images from another
domain does not lead to more general representations, (iii) corrupted
pretraining images have a disparate impact on supervised and self-supervised
pretraining, and (iv) contrastive learning lags far behind supervised learning
on fine-grained visual classification tasks.
more
less

Elijah Cole and Xuan Yang and Kimberly Wilber and Oisin Mac Aodha and Serge Belongie

arXiv e-Print archive - 2021 via Local arXiv

Keywords: cs.CV, cs.LG

[link]
This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here. The general goal of this paper is to better understand the contours of when self-supervised representation learning is valuable for vision (and specifically when it can compete with supervised learning), and when it doesn't. In general, the results are all using ResNet backbones, with SimCLR SSL, on image classification datasets. Some bullet-point takeaways: - The SSL models being tested here seem to roughly saturate at unsupervised dataset sizes of around 500K; the comparative jump from dataset sizes of 500K to 1M is fairly small. - Once you have a supervised dataset of around 50K or more, the benefit of SSL pretraining starts to diminish, and it converges to being more similar to just supervised learning on that numbrer of labeled images. On the flip side, it's only possible to get close to "good" fully supervised performance by using 100K images or more on top of a SSL baseline. - Even within image classification datasets, it's much better to do SSL representation on the same dataset as the one you'll use for downstream training; trying to transfer representations to different datasets leads to meaningfully worse results. Interestingly, this is even true when you add out-of-domain (i.e. other-dataset) data to an existing in-domain dataset: a dataset of 250K in-dataset images does better than a 500K dataset of images from mixed datasets, and does notably better than a 1M dataset of mixed images. In this case, adding more out-of-domain images seems to have just degraded performance - SSL seems to perform more closely to SL on a course label set; when the label set gets more granular, the task gets harder overall, but, more specifically, the gap between SSL and SL grows - When the authors tried different forms of dataset corruption, SSL was much more robust to adding salt-and-pepper noise than it was to removing high-frequency information in the form of reducing the images to a lower resolution. |

Contrastive Learning with Adversarial Examples

Ho, Chih-Hui and Vasconcelos, Nuno

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

Ho, Chih-Hui and Vasconcelos, Nuno

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

[link]
Contrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using a strong augmentation, tries to learn a way to perturb images that adversarially increases contrastive loss. As with adversarial training in normal supervised setting, the thinking here is that examples which push loss up the highest are the hardest and thus most informative for the network to learn from While the concept of this paper made some sense, I found the notation and the explanation of mechanics a bit confusing, particularly when it came to choice to frame a contrastive loss as a cross-entropy loss, with the "weights" of the dot product in the the cross-entropy loss being, in fact, the projection by the learned encoder of various of the examples in the batch. https://i.imgur.com/iQXPeXk.png This notion of the learned representations being "weights" is just odd and counter-intuitive, and the process of trying to wrap my mind around it isn't one I totally succeeded at. I think the point of using this frame is because it provides an easy analogue to the Fast Gradient Sign Method of normal supervised learning adversarial examples, even though it has the weird effect that, as the authors say "your weights vary by batch...rather than being consistent across training," Notational weirdness aside, my understanding is that the method of this paper: - Runs a forward pass of normal contrastive loss (framed as cross-entropy loss) which takes augmentations p and q and runs both forward through an encoder. - Calculates a delta to apply to each input image in the q that will increase the loss most, taken over all the images in the p set - I think the delta is per-image in q, and is just aggregated over all images in p, but I'm not fully confident of this, as a result of notational confusion. It could also be one delta applied for all all images in q. - Calculate the loss that results when you run forward the adversarially generated q against the normal p - Train a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part https://i.imgur.com/UWtJpVx.png The authors show a small but relatively consistent improvement to performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial. |

Gaussian Processes in Machine Learning

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between data-fit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated. |

Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier Transform

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Bafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy. https://i.imgur.com/yClXLoo.jpg Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attack on images and other signals. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

About