This paper sets out to do an exhaustive empirical evaluation of how effectively you can reduce the number of training steps needed to train a model by increasing the batch size. This is an interesting question because, on modern hardware, computational cost scales very slowly with additional datapoints added to a batch, so your per-example cost is lower the larger the batch you do your gradient calculation over. In the most ideal world, we might imagine a perfect trade-off relationship between batch size and training steps. As a simplistic example, if your model only needed to see each observation in the dataset once to reach some threshold of accuracy, and you could trade off batch size against training steps without bound, then you could just take one large step based on the whole dataset (in which case you'd be doing not Stochastic Gradient Descent, but plain Batch Gradient Descent). However, there's reason to suspect this won't be possible; for one thing, taking multiple noisier steps seems to be better for optimization than taking one single large step.

https://i.imgur.com/uwCfBJR.png

The paper runs a large-scale evaluation of what this behavior looks like over a range of datasets. The authors set a target test error rate and then measure how many training steps are necessary to reach that error rate for a given batch size (a minimal sketch of this measurement protocol is given below). For fairness, they tune hyperparameters separately for each batch size. They find that, matching some theoretical predictions, at small to medium batch sizes an increase in batch size pays off 1:1 in fewer needed training steps. As batch size increases further, the trade-off curve diverges from 1:1 and eventually goes flat, meaning that even if you increase your batch size more, you can no longer go any lower in training steps. This seems to me connected to the idea that a noisy, multi-step search process is useful in the non-convex landscapes that neural net optimizers work in.

https://i.imgur.com/ycigYVX.png

A few other notes from the paper:

- Different model architectures can extend 1:1 scaling to higher batch sizes, and thus plateau at a lower number of training steps
- Momentum also has the effect of plateauing at a lower number of needed training steps
- It's been previously suggested that you need to scale the optimal learning rate linearly, or with the square root of the batch size, to maintain best performance. The authors find that the best learning rate does differ across batch sizes, but that it isn't well approximated by either a linear or a square-root relationship (both heuristics are sketched below)
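To make the measurement protocol concrete, here is a minimal, self-contained sketch of the "steps to reach a target error" measurement, run on a toy logistic-regression problem. The synthetic data, the fixed learning rate, and the function names are my own illustration, not the paper's setup: the paper works with real datasets and re-tunes hyperparameters (including the learning rate) at every batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy synthetic binary-classification problem (a stand-in for the paper's real workloads).
d = 20
w_true = rng.normal(size=d)
X_test = rng.normal(size=(5000, d))
y_test = (X_test @ w_true > 0).astype(float)

def test_error(w):
    # Fraction of held-out points whose predicted sign is wrong.
    return np.mean(((X_test @ w) > 0) != y_test)

def steps_to_target(batch_size, lr, target_error, max_steps=20_000, eval_every=50):
    """Run SGD on logistic regression; return the first step at which the
    test error reaches the target, or None if it never does."""
    w = np.zeros(d)
    for step in range(1, max_steps + 1):
        X = rng.normal(size=(batch_size, d))      # draw a fresh batch of this size
        y = (X @ w_true > 0).astype(float)
        logits = np.clip(X @ w, -30, 30)          # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))         # sigmoid predictions
        grad = X.T @ (p - y) / batch_size         # logistic-loss gradient
        w -= lr * grad                            # one SGD update
        if step % eval_every == 0 and test_error(w) <= target_error:
            return step
    return None

# Under perfect 1:1 scaling, doubling the batch size halves the steps needed;
# the paper's finding is that this holds only up to a point, after which the
# curve flattens out. (Learning rate is held fixed here purely to keep the sketch short.)
for bs in [8, 32, 128, 512]:
    print(bs, steps_to_target(bs, lr=0.5, target_error=0.05))
```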
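For the learning-rate note, here is a small sketch of the two previously proposed scaling heuristics, linear and square-root scaling relative to a reference batch size. The base values are made up for illustration; the paper's finding is that neither rule matches the learning rates that actually perform best across batch sizes.

```python
import math

def linear_scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    # Linear scaling rule: learning rate grows proportionally with batch size.
    return base_lr * (batch_size / base_batch)

def sqrt_scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    # Square-root scaling rule: learning rate grows with the square root of batch size.
    return base_lr * math.sqrt(batch_size / base_batch)

for bs in [256, 512, 1024, 2048, 4096]:
    print(bs, round(linear_scaled_lr(bs), 4), round(sqrt_scaled_lr(bs), 4))
```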