			Summary by CodyWild 4 years ago
						
			
		  	
The premise of contrastive loss is that we want to pull together the representations of objects that are similar, and push the representations of dissimilar objects farther apart. However, in an unlabeled setting, we don't have class labels to tell us which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us. So we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that anchor that it can be considered a "positive", or target, similarity-wise. Examples of these transformations are data augmentations performed on a frame, or choosing temporally adjacent frames in a video sequence (which, since the real world evolves smoothly, are assumed to be similar).
Anyhow, all of this is well and good, except for the fact that, especially in an image classification setting like CIFAR or ImageNet, sampling randomly from the other images in a given batch doesn't give you a set of things that are entirely "negatives" in the sense of being dissimilar to the anchor image. It is true that most of the objects you get by sampling randomly are negatives (especially in a many-class setting), but some of them will be other samples from the same class. By treating all of those as negatives, we penalize the model for having representations of them that are close to our anchor representation, even though, for many downstream tasks, we'd probably prefer elements of the same class to have more similar representations. However, the whole premise of the unsupervised setting is that we don't have class labels, so we don't know, for a given sample from the batch (of things that aren't specifically transformations of the anchor), whether it's an actual negative or secretly a positive (i.e. of the same class). That's true, but this paper argues that, even if you can't identify which specific elements in a batch are secret positives, you can try to account for them in aggregate, provided you have a reasonably good estimate of the overall class probabilities, which tells you how many positives to expect in a given batch.
Given that, they reformulate the loss to be "debiased". They do this by taking the expectation over negatives in the denominator, which in practice is estimated with samples from the full p(x) rather than from the distribution over negatives alone, and correcting it so that it is a better estimate of the expectation under the actual distribution over negatives.
https://i.imgur.com/URN4RBF.png
They accomplish this by writing out the full p(x) as a weighted combination of the distributions over positives and negatives (where "negative" here means "every class that doesn't match the anchor"), as shown above, and noticing that you can represent the negative part of the distribution by taking the full distribution and subtracting out the positive distribution (which we have an estimator for by construction, with our transformations), weighted by the prior over how frequent the positives are in our overall distribution.
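Written out, the rearrangement described here looks roughly like the following. The symbols are my own shorthand and may not match the paper's figures exactly: \tau^+ is the prior probability that a random sample shares the anchor's class, \tau^- = 1 - \tau^+, and p_x^+ and p_x^- are the positive and negative distributions for anchor x.

```latex
% Full data distribution as a mixture of positives and negatives for anchor x
p(x') = \tau^{+}\, p_x^{+}(x') + \tau^{-}\, p_x^{-}(x'), \qquad \tau^{-} = 1 - \tau^{+}

% Solve for the negative distribution, which we cannot sample from directly
p_x^{-}(x') = \frac{p(x') - \tau^{+}\, p_x^{+}(x')}{\tau^{-}}
```

Both terms on the right-hand side of the second line can be sampled without labels: p(x') by drawing random batch elements, and p_x^+(x') by applying transformations to the anchor.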
https://i.imgur.com/5IgGIhu.png
In practice, this means estimating the similarity between the anchor and positives (which we already have in the numerator, but which we can also calculate with more augmentations/positive samples to get a better estimate) and subtracting a weighted version of it from the similarity over the sampled "negative" examples. Intuitively, we keep the part where we penalize similarity with negatives (by adding magnitude to the denominator), but reduce that penalty in proportion to how much of that "similarity with negatives" we think is actually similarity with other positives in the batch, which we would like to keep around.
https://i.imgur.com/kUGoemA.png
https://i.imgur.com/5Gitdi7.png
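To make the weighted subtraction concrete, here is a minimal NumPy sketch for a single anchor. It is an illustration under my own assumptions rather than the authors' implementation: the function name, the argument names, the use of a single positive, the default temperature of 0.5, and the lower clamp at exp(-1 / temperature) (the smallest value an exponentiated cosine similarity can take) are all choices made for this example.

```python
import numpy as np

def debiased_contrastive_loss(anchor, positive, candidates, tau_plus, temperature=0.5):
    """Sketch of a debiased contrastive loss for one anchor.

    anchor:     (d,)   L2-normalized embedding of the anchor
    positive:   (d,)   L2-normalized embedding of a transformed anchor
    candidates: (N, d) L2-normalized embeddings sampled from the full batch
                       (mostly negatives, possibly some hidden positives)
    tau_plus:   assumed prior probability that a random sample shares
                the anchor's class (hypothetical, must be supplied)
    """
    pos = np.exp(anchor @ positive / temperature)    # similarity to the positive
    neg = np.exp(candidates @ anchor / temperature)  # similarities to batch samples, shape (N,)
    n = neg.shape[0]

    # Subtract the expected contribution of hidden positives, then rescale
    # by the negative prior (1 - tau_plus).
    g = (neg.mean() - tau_plus * pos) / (1.0 - tau_plus)

    # Keep the estimate at or above the smallest value a true average
    # of exponentiated cosine similarities could take.
    g = max(g, np.exp(-1.0 / temperature))

    return -np.log(pos / (pos + n * g))


# Illustrative usage with random unit vectors (not real embeddings).
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
a = unit(rng.normal(size=8))
p = unit(a + 0.05 * rng.normal(size=8))
c = unit(rng.normal(size=(16, 8)))
print(debiased_contrastive_loss(a, p, c, tau_plus=0.1))
```

Setting tau_plus to 0 recovers the usual contrastive loss, with the plain sum over batch samples in the denominator. A positive tau_plus shrinks the denominator whenever the anchor is more similar to its positive than to the average batch sample.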
In terms of experimental results, my read is that this is most useful on problems, like CIFAR10 and STL10, that don't have many classes (they each, per their names, have 10). The results there are meaningfully stronger than for the 200-class ImageNet. That makes pretty good intuitive sense, since you would expect the "secret positives in our random sample of images" bias problem to be a lot more acute in a setting where we've got a 1-in-10 chance of sampling a same-class image, compared to a 1-in-200 chance.
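As a rough back-of-the-envelope check on that intuition, the snippet below computes how many same-class images you would expect among the other samples in a batch. The batch size of 256 and the assumption of a uniform class prior with independent sampling are mine, chosen only for illustration; the function name is hypothetical.

```python
def expected_hidden_positives(batch_size, num_classes):
    """Expected number of same-class images among the other batch elements,
    assuming uniform class frequencies and independent sampling."""
    return (batch_size - 1) / num_classes

# Hypothetical batch size of 256; class counts taken from the text above.
for num_classes in (10, 200):
    print(num_classes, expected_hidden_positives(256, num_classes))
```

Under these assumptions, a 10-class problem has roughly 25 hidden positives per anchor in a batch of 256, versus about 1 for a 200-class problem, so there is much more bias for the correction to remove in the 10-class case.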
			