[link]
Summary by CodyWild 4 years ago
This paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly, when the unsupervised component of the loss for each example is just added together directly. Instead, it tries to learn weights for each specific data example, through a meta-learning-esque process.
The form of semi-supervised learning being discussed here is label-based consistency loss, where a labeled image is augmented and run through the current version of the model, and the model is optimized to try to induce the same loss for the augmented image as the unaugmented one. The premise of the authors argument for learning per-example weights is that, ideally, you would enforce consistency loss less on examples where a model was unconfident in its label prediction for an unlabeled example.
As a way to solve this, the authors suggest learning a vector of parameters - one for each example in the dataset - where element i in the vector is a weight for element i of the dataset, in the summed-up unsupervised loss. They do this via a two-step process, where first they optimize the parameters of the network given the example weights, and then the optimize the example weights themselves. To optimize example weights, they calculate a gradient of those weights on the post-training validation loss, which requires backpropogating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced better validation loss). This requires calculating the inverse Hessian (second derivative matrix of the loss), which is, generally speaking, a quite costly operation for huge-parameter nets. To lessen this cost, they pretend that only the final layer of weights in the network are being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably those were the only ones we have enough information to upweight or downweight. With this model, the authors achieve modest improvements - performance comparable to or within-error-bounds better than the current state of the art, FixMatch.
Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach:
- They don't seem to have benchmarked against the simpler baseline of some inverse of using Dropout-estimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example"
- If the presumed need for this is the lack of certainty of the model, that's a non-stationary problem that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target
- Despite using techniques rooted in meta-learning, it doesn't seem like this models learns anything generalizable - it's learning index-based weights on specific examples, which doesn't give it anything useful it can do with some new data point it finds that it wasn't specifically trained on
Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem like it was worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme)
more
less