Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

The Pitfalls of Simplicity Bias in Neural Networks

Shah, Harshay and Tamuly, Kaustav and Raghunathan, Aditi and Jain, Prateek and Netrapalli, Praneeth

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

Shah, Harshay and Tamuly, Kaustav and Raghunathan, Aditi and Jain, Prateek and Netrapalli, Praneeth

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

[link]
This is an interesting paper that makes a fairly radical claim, and I haven't fully decided whether what they find is an interesting-but-rare corner case, or a more fundamental weakness in the design of neural nets. The claim is: neural nets prefer learning simple features, even if there exist complex features that are equally or more predictive, and even if that means learning a classifier with a smaller margin - where margin means "the distance between the decision boundary and the nearest-by data". A large-margin classifier is preferable in machine learning because the larger the margin, the larger the perturbation that would have to be made - by an adversary, or just by the random nature of the test set - to trigger misclassification. https://i.imgur.com/PJ6QB6h.png This paper defines simplicity and complexity in a few ways. In their simulated datasets, a feature is simpler when the decision boundary along that axis requires fewer piecewise linear segments to separate datapoints. (In the example above, note that having multiple alternating blocks still allows for linear separation, but with a higher piecewise linear requirement). In their datasets that concatenate MNIST and CIFAR images, the MNIST component represents the simple feature. The authors then test which models use which features by training a model with access to all of the features - simple and complex - and then testing examples where one set of features is sampled in alignment with the label, and one set of features is sampled randomly. If the features being sampled randomly are being used by the model, perturbing them like this should decrease the test performance of the model. For the simulated datasets, a fully connected network was used; for the MNIST/CIFAR concatenation, a variety of different image classification convolutional architectures were tried. The paper finds that neural networks will prefer to use the simpler feature to the complete exclusion of more complex features, even if the complex feature is slightly more predictive (can achieve 100 vs 95% separation). The authors go on to argue that what they call this Extreme Simplicity Bias, or Extreme SB, might actually explain some of the observed pathologies in neural nets, like relying on spurious features or being subject to adversarial perturbations. They claim that spurious features - like background color or texture - will tend to be simpler, and that their theory explains networks' reliance on them. Additionally, relying completely or predominantly on single features means that a perturbation along just that feature can substantially hurt performance, as opposed to a network using multiple features, all of which must be perturbed to hurt performance an equivalent amount. As I mentioned earlier, I feel like I'd need more evidence before I was strongly convinced by the claims made in this paper, but they are interestingly provocative. On a broader level, I think a lot of the difficulties in articulating why we expect simpler features to perform well come from an imprecision in thinking in language around the idea - we think of complex features as inherently brittle and high-dimensional, but this paper makes me wonder how well our existing definitions of simplicity actually match those intuitions. |

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (7 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |

Supervised Contrastive Learning

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2024/02/27 (just now)

**Abstract:** Contrastive learning applied to self-supervised representation learning has
seen a resurgence in recent years, leading to state of the art performance in
the unsupervised training of deep image models. Modern batch contrastive
approaches subsume or significantly outperform traditional contrastive losses
such as triplet, max-margin and the N-pairs loss. In this work, we extend the
self-supervised batch contrastive approach to the fully-supervised setting,
allowing us to effectively leverage label information. Clusters of points
belonging to the same class are pulled together in embedding space, while
simultaneously pushing apart clusters of samples from different classes. We
analyze two possible versions of the supervised contrastive (SupCon) loss,
identifying the best-performing formulation of the loss. On ResNet-200, we
achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above
the best number reported for this architecture. We show consistent
outperformance over cross-entropy on other datasets and two ResNet variants.
The loss shows benefits for robustness to natural corruptions and is more
stable to hyperparameter settings such as optimizers and data augmentations. In
reduced data settings, it outperforms cross-entropy significantly. Our loss
function is simple to implement, and reference TensorFlow code is released at
https://t.ly/supcon.
more
less

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
This was a really cool-to-me paper that asked whether contrastive losses, of the kind that have found widespread success in semi-supervised domains, can add value in a supervised setting as well. In a semi-supervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the representation of that example away from other examples in the batch, which are negatives in the sense that they are assumed to not be related to the anchor image. This paper investigates whether this same structure - of training representations of positives to be close relative to negatives - could be expanded to the supervised setting, where "positives" wouldn't just mean augmented versions of a single image, but augmented versions of other images belonging to the same class. This would ideally combine the advantages of self-supervised contrastive loss - that explicitly incentivizes invariance to augmentation-based changes - with the advantages of a supervised signal, which allows the representation to learn that it should also see instances of the same class as close to one another. https://i.imgur.com/pzKXEkQ.png To evaluate the performance of this as a loss function, the authors first train the representation - either with their novel supervised contrastive loss SupCon, or with a control cross-entropy loss - and then train a linear regression with cross-entropy on top of that learned representation. (Just because, structurally, a contrastive loss doesn't lead to assigning probabilities to particular classes, even if it is supervised in the sense of capturing information relevant to classification in the representation) The authors investigate two versions of this contrastive loss, which differ, as shown below, in terms of the relative position of the sum and the log operation, and show that the L_out version dramatically outperforms (and I mean dramatically, with a top-one accuracy of 78.7 vs 67.4%). https://i.imgur.com/X5F1DDV.png The authors suggest that the L_out version is superior in terms of training dynamics, and while I didn't fully follow their explanation, I believe it had to do with L_out version doing its normalization outside of the log, which meant it actually functioned as a multiplicative normalizer, as opposed to happening inside the log, where it would have just become an additive (or, really, subtractive) constant in the gradient term. Due to this stronger normalization, the authors positive the L_out loss was less noisy and more stable. Overall, the authors show that SupCon consistently (if not dramatically) outperforms cross-entropy when it comes to final accuracy. They also show that it is comparable in transfer performance to a self-supervised contrastive loss. One interesting extension to this work, which I'd enjoy seeing more explored in the future, is how the performance of this sort of loss scales with the number of different augmentations that performed of each element in the batch (this work uses two different augmentations, but there's no reason this number couldn't be higher, which would presumably give additional useful signal and robustness?) |

LSTM: A Search Space Odyssey

Greff, Klaus and Srivastava, Rupesh Kumar and Koutník, Jan and Steunebrink, Bas R. and Schmidhuber, Jürgen

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Greff, Klaus and Srivastava, Rupesh Kumar and Koutník, Jan and Steunebrink, Bas R. and Schmidhuber, Jürgen

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

[link]
This paper presents an extensive evaluation of variants of LSTM networks. Specifically, they start from what they consider to be the vanilla architecture and, from it, also consider 8 variants which correspond to small modifications on the vanilla case. The vanilla architecture is the one described in Graves & Schmidhuber (2005) \cite{journals/nn/GravesS05}, and the variants consider removing single parts of it (input,forget,output gates or activation functions), coupling the input and forget gate (which is inspired from GRU) or having full recurrence between all gates (which comes from the original LSTM formulation). In their experimental setup, they consider 3 datasets: TIMIT (speech recognition), IAM Online Handwriting Database (character recognition) and JSB Chorales (polyphonic music modeling). For each, they tune the hyper-parameters of each of the 9 architectures, using random search based on 200 samples. Then, they keep the 20 best hyper-parameters and use the statistics of those as a basis for comparing the architectures. #### My two cents This was a very useful ready. I'd make it a required read for anyone that wants to start using LSTMs. First, I found the initial historical description of the developments surrounding LSTMs very interesting and clarifying. But more importantly, it presents a really useful picture of LSTMs that can both serve as a good basis for starting to use LSTMs and also an insightful (backed with data) exposition of the importance of each part in the LSTM. The analysis based on an fANOVA (which I didn't know about until now) is quite neat. Perhaps the most surprising observation is that momentum actually doesn't seem to help that much. Investigating second order interaction between hyper-parameters was a smart thing to do (showing that tuning the learning rate and hidden layer jointly might not be that important, which is a useful insight).The illustrations in Figure 4, layout out the estimated relationship (with uncertainty) between learning rate / hidden layer size / input noise variance and performance / training time is also full of useful information. I wont repeat here the main observations of the paper, which are laid out clearly in the conclusion (section 6). Additionally, my personal take-away point is that, in an LSTM implementation, it might still be useful to support the removal peepholes or having coupled input and forget gates, since they both yielded the ultimate best test set performance on at least one of the datasets (I'm assuming it was also best on the validation set, though this might not be the case...) The fANOVE analysis makes it clear that the learning rate is the most critical hyper-parameter to tune (can be "make or break"). That said, this is already well known. And the fact that it explains so much of the variance might reflect a bias of the analysis towards a situation where the learning rate isn't tuned as well as it could be in practice (this is afterall THE hyper-parameter that neural net researcher spend the most time tuning in practice). So, as future work, this suggests perhaps doing another round of the same analysis (which is otherwise really neatly setup), where more effort is always put on tuning the learning rate, individually for each of the other hyper-parameters. In other words, we'd try to ignore the regions of hyper-parameter space that correspond to bad learning rates, in order to "marginalize out" its effect. This would thus explore the perhaps more realistic setup that assumes one always tunes the learning rate as best as possible. Also, considering a less aggressive gradient clipping into the hyper-parameter search would be interesting since, as the authors admit, clipping within [-1,1] might have been too much and could explain why it didn't help Otherwise, a really great and useful read! |

ImageNet-trained {CNN}s are biased towards texture; increasing shape bias improves accuracy and robustness

Geirhos, Robert and Rubisch, Patricia and Michaelis, Claudio and Bethge, Matthias and Wichmann, Felix A. and Brendel, Wieland

International Conference on Learning Representations - 2019 via Local Bibsonomy

Keywords: deep-learning, machine-learning, stable, foundations, robustness, theory

Geirhos, Robert and Rubisch, Patricia and Michaelis, Claudio and Bethge, Matthias and Wichmann, Felix A. and Brendel, Wieland

International Conference on Learning Representations - 2019 via Local Bibsonomy

Keywords: deep-learning, machine-learning, stable, foundations, robustness, theory

[link]
Geirhos et al. show that state-of-the-art convolutional neural networks put too much importance on texture information. This claim is confirmed in a controlled study comparing convolutional neural network and human performance on variants of ImageNet image with removed texture (silhouettes) or on edges. Additionally, networks only considering local information can perform nearly as well as other networks. To avoid this bias, they propose a stylized ImageNet variant where textured are replaced randomly, forcing the network to put more weight on global shape information. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

About