Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Supervised Contrastive Learning

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2024/02/24 (just now)

**Abstract:** Contrastive learning applied to self-supervised representation learning has
seen a resurgence in recent years, leading to state of the art performance in
the unsupervised training of deep image models. Modern batch contrastive
approaches subsume or significantly outperform traditional contrastive losses
such as triplet, max-margin and the N-pairs loss. In this work, we extend the
self-supervised batch contrastive approach to the fully-supervised setting,
allowing us to effectively leverage label information. Clusters of points
belonging to the same class are pulled together in embedding space, while
simultaneously pushing apart clusters of samples from different classes. We
analyze two possible versions of the supervised contrastive (SupCon) loss,
identifying the best-performing formulation of the loss. On ResNet-200, we
achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above
the best number reported for this architecture. We show consistent
outperformance over cross-entropy on other datasets and two ResNet variants.
The loss shows benefits for robustness to natural corruptions and is more
stable to hyperparameter settings such as optimizers and data augmentations. In
reduced data settings, it outperforms cross-entropy significantly. Our loss
function is simple to implement, and reference TensorFlow code is released at
https://t.ly/supcon.
more
less

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
This was a really cool-to-me paper that asked whether contrastive losses, of the kind that have found widespread success in semi-supervised domains, can add value in a supervised setting as well. In a semi-supervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the representation of that example away from other examples in the batch, which are negatives in the sense that they are assumed to not be related to the anchor image. This paper investigates whether this same structure - of training representations of positives to be close relative to negatives - could be expanded to the supervised setting, where "positives" wouldn't just mean augmented versions of a single image, but augmented versions of other images belonging to the same class. This would ideally combine the advantages of self-supervised contrastive loss - that explicitly incentivizes invariance to augmentation-based changes - with the advantages of a supervised signal, which allows the representation to learn that it should also see instances of the same class as close to one another. https://i.imgur.com/pzKXEkQ.png To evaluate the performance of this as a loss function, the authors first train the representation - either with their novel supervised contrastive loss SupCon, or with a control cross-entropy loss - and then train a linear regression with cross-entropy on top of that learned representation. (Just because, structurally, a contrastive loss doesn't lead to assigning probabilities to particular classes, even if it is supervised in the sense of capturing information relevant to classification in the representation) The authors investigate two versions of this contrastive loss, which differ, as shown below, in terms of the relative position of the sum and the log operation, and show that the L_out version dramatically outperforms (and I mean dramatically, with a top-one accuracy of 78.7 vs 67.4%). https://i.imgur.com/X5F1DDV.png The authors suggest that the L_out version is superior in terms of training dynamics, and while I didn't fully follow their explanation, I believe it had to do with L_out version doing its normalization outside of the log, which meant it actually functioned as a multiplicative normalizer, as opposed to happening inside the log, where it would have just become an additive (or, really, subtractive) constant in the gradient term. Due to this stronger normalization, the authors positive the L_out loss was less noisy and more stable. Overall, the authors show that SupCon consistently (if not dramatically) outperforms cross-entropy when it comes to final accuracy. They also show that it is comparable in transfer performance to a self-supervised contrastive loss. One interesting extension to this work, which I'd enjoy seeing more explored in the future, is how the performance of this sort of loss scales with the number of different augmentations that performed of each element in the batch (this work uses two different augmentations, but there's no reason this number couldn't be higher, which would presumably give additional useful signal and robustness?) |

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (7 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |

Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning

Ren, Zhongzheng and Yeh, Raymond A. and Schwing, Alexander G.

- 2020 via Local Bibsonomy

Keywords: dataset, semi-supervised, machine-learning, data, 2020

Ren, Zhongzheng and Yeh, Raymond A. and Schwing, Alexander G.

- 2020 via Local Bibsonomy

Keywords: dataset, semi-supervised, machine-learning, data, 2020

[link]
This paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly, when the unsupervised component of the loss for each example is just added together directly. Instead, it tries to learn weights for each specific data example, through a meta-learning-esque process. The form of semi-supervised learning being discussed here is label-based consistency loss, where a labeled image is augmented and run through the current version of the model, and the model is optimized to try to induce the same loss for the augmented image as the unaugmented one. The premise of the authors argument for learning per-example weights is that, ideally, you would enforce consistency loss less on examples where a model was unconfident in its label prediction for an unlabeled example. As a way to solve this, the authors suggest learning a vector of parameters - one for each example in the dataset - where element i in the vector is a weight for element i of the dataset, in the summed-up unsupervised loss. They do this via a two-step process, where first they optimize the parameters of the network given the example weights, and then the optimize the example weights themselves. To optimize example weights, they calculate a gradient of those weights on the post-training validation loss, which requires backpropogating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced better validation loss). This requires calculating the inverse Hessian (second derivative matrix of the loss), which is, generally speaking, a quite costly operation for huge-parameter nets. To lessen this cost, they pretend that only the final layer of weights in the network are being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably those were the only ones we have enough information to upweight or downweight. With this model, the authors achieve modest improvements - performance comparable to or within-error-bounds better than the current state of the art, FixMatch. Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach: - They don't seem to have benchmarked against the simpler baseline of some inverse of using Dropout-estimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example" - If the presumed need for this is the lack of certainty of the model, that's a non-stationary problem that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target - Despite using techniques rooted in meta-learning, it doesn't seem like this models learns anything generalizable - it's learning index-based weights on specific examples, which doesn't give it anything useful it can do with some new data point it finds that it wasn't specifically trained on Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem like it was worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme) |

Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?

Yan, Shen and Zheng, Yu and Ao, Wei and Zeng, Xiao and Zhang, Mi

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

Yan, Shen and Zheng, Yu and Ao, Wei and Zeng, Xiao and Zhang, Mi

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

[link]
This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsupervised pretraining is conceptually simple - a variational autoencoder - even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or, GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but have the general sense that GIN architectures are simpler; calculating a weighted sum of current central node features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector. First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing end-to-end supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to high-performing networks being naturally concentrated within the space. https://i.imgur.com/SavZnce.png Looking into their representation as an initialization point, they demonstrate that their initializations do lead to lower long-term regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training. https://i.imgur.com/4DG7lZd.png The authors argue that this is because the representations learned by the supervised methods are "biased towards weight-free operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weight-free architectures would overperform early in the training of each individual network, but my understanding was that the dataset used here is a labeled static dataset of architectures and accuracies, so the within-training-run dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit that comes from using these pretrained representations, even if I don't follow the intuition behind it fully. |

Gaussian Processes in Machine Learning

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between data-fit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated. |

About