Summary by CodyWild 3 years ago
This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here.
The general goal of this paper is to better understand the contours of when self-supervised representation learning (SSL) is valuable for vision (and specifically when it can compete with supervised learning, SL), and when it isn't. In general, the results all use ResNet backbones, with SimCLR-style SSL pretraining, on image classification datasets (a rough sketch of this pretrain-then-fine-tune setup follows the list below). Some bullet-point takeaways:
- The SSL models tested here seem to roughly saturate at unlabeled pretraining dataset sizes of around 500K images; the additional gain from going from 500K to 1M images is fairly small.
- Once you have a labeled dataset of around 50K images or more, the benefit of SSL pretraining starts to diminish, and performance converges toward that of plain supervised learning on that number of labeled images. On the flip side, it's only possible to get close to "good" fully supervised performance by fine-tuning on 100K labeled images or more on top of an SSL baseline.
- Even within image classification datasets, it's much better to do SSL representation learning on the same dataset you'll use for downstream training; trying to transfer representations to different datasets leads to meaningfully worse results. Interestingly, this is true even when you add out-of-domain (i.e. other-dataset) data to an existing in-domain dataset: a pretraining set of 250K in-dataset images does better than a 500K set of images from mixed datasets, and notably better than a 1M set of mixed images. In this case, adding more out-of-domain images seems to have just degraded performance.
- SSL performs most similarly to SL on a coarse label set; as the label set gets more granular, the task gets harder overall, and, more specifically, the gap between SSL and SL grows.
- When the authors tried different forms of dataset corruption, SSL was much more robust to adding salt-and-pepper noise than it was to removing high-frequency information by reducing the images to a lower resolution (both corruptions are sketched in the second code block below).
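
To make the setup concrete, here's a minimal PyTorch sketch of a SimCLR-style pretrain-then-fine-tune pipeline like the one these experiments revolve around. This is not the paper's code: the batch size, the random tensors standing in for augmented views, the projection-head dimensions, and the 10-class linear head are all illustrative placeholders.

```python
# Hypothetical sketch, not the paper's code: batch size, "augmented view" tensors,
# projection-head sizes, and the 10-class head are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent contrastive loss over a batch of paired embeddings."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)            # (2N, d)
    sim = z @ z.t() / temperature                                 # (2N, 2N) similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                    # ignore self-similarity
    # Row i's positive is its other augmented view: i+N for the first half, i-N for the second.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# ResNet backbone with the classification head removed, plus a projection head
# that is only used during pretraining.
backbone = resnet50(weights=None)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()
proj_head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 128))

# --- One self-supervised pretraining step (random tensors stand in for two augmented views) ---
opt = torch.optim.SGD(list(backbone.parameters()) + list(proj_head.parameters()), lr=0.1)
view1, view2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
opt.zero_grad()
loss = nt_xent_loss(proj_head(backbone(view1)), proj_head(backbone(view2)))
loss.backward()
opt.step()

# --- Supervised fine-tuning on a labeled subset ---
# Sweeping the size of this labeled subset (e.g. 1K, 10K, 50K, 100K images) is how the
# crossover points described in the bullets above would be probed.
classifier = nn.Linear(feat_dim, 10)       # 10 classes as a placeholder
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
ft_loss = F.cross_entropy(classifier(backbone(images)), labels)
```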
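
And for the last bullet, a hedged sketch of the two corruption types being compared: salt-and-pepper noise versus removing high-frequency content by downsampling and re-upsampling. The 5% noise rate and the 32-pixel intermediate resolution are my own assumptions, not the paper's exact settings.

```python
# Hypothetical corruption functions; the noise rate and target resolution are assumptions.
import torch
import torch.nn.functional as F

def salt_and_pepper(img, prob=0.05):
    """Set a random fraction of pixels to 0 (pepper) or 1 (salt), shared across channels."""
    mask = torch.rand(img.shape[-2:], device=img.device)
    out = img.clone()
    out[:, mask < prob / 2] = 0.0        # pepper
    out[:, mask > 1 - prob / 2] = 1.0    # salt
    return out

def reduce_resolution(img, size=32):
    """Downsample then upsample back, discarding high-frequency detail."""
    small = F.interpolate(img.unsqueeze(0), size=size, mode='bilinear', align_corners=False)
    return F.interpolate(small, size=img.shape[-2:], mode='bilinear', align_corners=False).squeeze(0)

img = torch.rand(3, 224, 224)            # placeholder image with values in [0, 1]
noisy, low_res = salt_and_pepper(img), reduce_resolution(img)
```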