Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (7 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.
more
less

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

[link]
This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs. ## Key contributions * Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data. * Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks * The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4. ## What I learned * Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels. * We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought. ## Funny > deep neural nets remain mysterious for many reasons > Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. ## See also * [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg) |

Deep Residual Learning for Image Recognition

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR-10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) |

Meta-learners' learning dynamics are unlike learners'

Neil C. Rabinowitz

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2019/05/03 (4 years ago)

**Abstract:** Meta-learning is a tool that allows us to build sample-efficient learning
systems. Here we show that, once meta-trained, LSTM Meta-Learners aren't just
faster learners than their sample-inefficient deep learning (DL) and
reinforcement learning (RL) brethren, but that they actually pursue
fundamentally different learning trajectories. We study their learning dynamics
on three sets of structured tasks for which the corresponding learning dynamics
of DL and RL systems have been previously described: linear regression (Saxe et
al., 2013), nonlinear regression (Rahaman et al., 2018; Xu et al., 2018), and
contextual bandits (Schaul et al., 2019). In each case, while
sample-inefficient DL and RL Learners uncover the task structure in a staggered
manner, meta-trained LSTM Meta-Learners uncover almost all task structure
concurrently, congruent with the patterns expected from Bayes-optimal inference
algorithms. This has implications for research areas wherever the learning
behaviour itself is of interest, such as safety, curriculum design, and
human-in-the-loop machine learning.
more
less

Neil C. Rabinowitz

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
Meta learning, or, the idea of training models on some distribution of tasks, with the hope that they can then learn more quickly on new tasks because they have “learned how to learn” similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there’s an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc), and have the outer loop that updates some structural part of your model be based on improving generalization error on each task within the distribution. It’s been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are “in distribution” relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested with how much better or faster they’re learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there’s only the one on this paper) goes about this by studying simple learning tasks where it’s easier for us to introspect into what each model is learning over time. https://i.imgur.com/ceycq46.png In the first test, he looks at linear regression in a simple setting: where for each individual “task” data is generated according a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where higher eigenvalues on the factors, or “modes”, represent that those factors represent larger-scale patterns that explain more variance, and lower eigenvalues are smaller scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large scale (higher eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, also learn large scale and small scale modes at the same rate. Similar analysis was performed and similar results found for nonlinear regression, where instead of PCA-style components, the function generating data were decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, where the meta learner learned them all at the same rate. The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in the world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior correspondant with the task distribution it was trained on. This is, at a high level, what I think we all sort of thought was happening with meta learning, but it’s pretty neat to see it laid out in a small enough problem where we can actually validate against an analytic model. A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose to use the specific eigenvalue-focused metrics of progression on task learning that he did. They seem reasonable, but I’d have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress. (A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn’t really that analogous to the other tests) |

Mask R-CNN

He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool. #### Methodology Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class. ##### Major Changes and intutions **Mask prediction** Mask prediction segmentation predicts a binary mask for each RoI using fully convolution - and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask. Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation **RoIAlign** RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of quantization of the RoI boundaries or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). **Backbone architecture** Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures - ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of single-scale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) **Training Objective** The training objective looks like this ![](https://i.imgur.com/snUq73Q.png) Lmask is the addition from Faster RCNN. The method to calculate was mentioned above #### Observation Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper |

Supervised Contrastive Learning

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2024/02/28 (just now)

**Abstract:** Contrastive learning applied to self-supervised representation learning has
seen a resurgence in recent years, leading to state of the art performance in
the unsupervised training of deep image models. Modern batch contrastive
approaches subsume or significantly outperform traditional contrastive losses
such as triplet, max-margin and the N-pairs loss. In this work, we extend the
self-supervised batch contrastive approach to the fully-supervised setting,
allowing us to effectively leverage label information. Clusters of points
belonging to the same class are pulled together in embedding space, while
simultaneously pushing apart clusters of samples from different classes. We
analyze two possible versions of the supervised contrastive (SupCon) loss,
identifying the best-performing formulation of the loss. On ResNet-200, we
achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above
the best number reported for this architecture. We show consistent
outperformance over cross-entropy on other datasets and two ResNet variants.
The loss shows benefits for robustness to natural corruptions and is more
stable to hyperparameter settings such as optimizers and data augmentations. In
reduced data settings, it outperforms cross-entropy significantly. Our loss
function is simple to implement, and reference TensorFlow code is released at
https://t.ly/supcon.
more
less

Prannay Khosla and Piotr Teterwak and Chen Wang and Aaron Sarna and Yonglong Tian and Phillip Isola and Aaron Maschinot and Ce Liu and Dilip Krishnan

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
This was a really cool-to-me paper that asked whether contrastive losses, of the kind that have found widespread success in semi-supervised domains, can add value in a supervised setting as well. In a semi-supervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the representation of that example away from other examples in the batch, which are negatives in the sense that they are assumed to not be related to the anchor image. This paper investigates whether this same structure - of training representations of positives to be close relative to negatives - could be expanded to the supervised setting, where "positives" wouldn't just mean augmented versions of a single image, but augmented versions of other images belonging to the same class. This would ideally combine the advantages of self-supervised contrastive loss - that explicitly incentivizes invariance to augmentation-based changes - with the advantages of a supervised signal, which allows the representation to learn that it should also see instances of the same class as close to one another. https://i.imgur.com/pzKXEkQ.png To evaluate the performance of this as a loss function, the authors first train the representation - either with their novel supervised contrastive loss SupCon, or with a control cross-entropy loss - and then train a linear regression with cross-entropy on top of that learned representation. (Just because, structurally, a contrastive loss doesn't lead to assigning probabilities to particular classes, even if it is supervised in the sense of capturing information relevant to classification in the representation) The authors investigate two versions of this contrastive loss, which differ, as shown below, in terms of the relative position of the sum and the log operation, and show that the L_out version dramatically outperforms (and I mean dramatically, with a top-one accuracy of 78.7 vs 67.4%). https://i.imgur.com/X5F1DDV.png The authors suggest that the L_out version is superior in terms of training dynamics, and while I didn't fully follow their explanation, I believe it had to do with L_out version doing its normalization outside of the log, which meant it actually functioned as a multiplicative normalizer, as opposed to happening inside the log, where it would have just become an additive (or, really, subtractive) constant in the gradient term. Due to this stronger normalization, the authors positive the L_out loss was less noisy and more stable. Overall, the authors show that SupCon consistently (if not dramatically) outperforms cross-entropy when it comes to final accuracy. They also show that it is comparable in transfer performance to a self-supervised contrastive loss. One interesting extension to this work, which I'd enjoy seeing more explored in the future, is how the performance of this sort of loss scales with the number of different augmentations that performed of each element in the batch (this work uses two different augmentations, but there's no reason this number couldn't be higher, which would presumably give additional useful signal and robustness?) |

About