[link]
Often the best learning rate for a DNN is sensitive to the batch size and hence needs significant re-tuning when scaling up the batch size for large-scale training. Theory suggests that when you scale the batch size by a factor of $k$ (as in multi-GPU training), the learning rate should be scaled by $\sqrt{k}$ to keep the variance of the gradient estimator constant (remember that the variance of an estimator is inversely proportional to the sample size?). But in practice, linear learning rate scaling (i.e. scale the learning rate by $k$) with a gradual warmup scheme often works better. This paper proposes a slight modification to the existing learning rate scheduling schemes, called LEGW (Linear Epoch Gradual Warmup), which helps bridge the gap between the theory and practice of large-batch training. The authors observe that in order to make square-root scaling work well in practice, one should also scale the warmup period (in terms of epochs) by a factor of $k$. In other words, if you view the learning rate as a function of training time in epochs, scale the period of that function by $k$ while scaling its amplitude by $\sqrt{k}$ whenever the batch size is scaled by $k$. The authors consider various learning rate schedules such as exponential decay, polynomial decay and multi-step LR decay, and find that square-root scaling with the LEGW scheme often leads to little or no loss in performance as the batch size is scaled up. In fact, SGD with LEGW can be used with no tuning and still work as well as Adam. Thus, with this approach, one can tune the learning rate for a small batch size and extrapolate it to larger batch sizes while making use of parallel hardware.
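To make the scaling rule concrete, here is a minimal sketch of an LEGW-style schedule. The function name `legw_lr`, the default numbers, and the polynomial-decay tail after warmup are illustrative assumptions, not taken from the paper; only the scaling rule (peak learning rate scaled by $\sqrt{k}$, warmup epochs scaled by $k$) follows the summary above.

```python
import math

def legw_lr(epoch: float,
            base_lr: float = 0.1,            # tuned at the base batch size
            base_warmup_epochs: float = 1.0, # warmup tuned at the base batch size
            base_batch_size: int = 256,
            batch_size: int = 2048,
            total_epochs: int = 90) -> float:
    """Learning rate at a (fractional) epoch under sqrt scaling + LEGW warmup."""
    k = batch_size / base_batch_size
    peak_lr = base_lr * math.sqrt(k)          # amplitude scaled by sqrt(k)
    warmup_epochs = base_warmup_epochs * k    # warmup period scaled by k

    if epoch < warmup_epochs:
        # linear warmup from 0 up to the scaled peak learning rate
        return peak_lr * epoch / warmup_epochs

    # any decay schedule can follow the warmup; a polynomial decay is shown here
    progress = min(1.0, (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1e-8))
    return peak_lr * (1.0 - progress) ** 2

# e.g. with k = 8, the peak LR is base_lr * sqrt(8) and warmup lasts 8x as many epochs
print(legw_lr(epoch=4.0), legw_lr(epoch=40.0))
```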
[link]
**TL;DR:** This paper summarizes practical tips for training a Transformer model for the MT task, though some of the tips are task-agnostic. The hyper-parameters considered include the number of GPUs, batch size, learning rate schedule, warmup steps, checkpoint averaging and maximum sequence length.

**Framework used for the experiments:** [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor)

The effects of the most important hyper-parameters are as follows:

**Early stopping:** Papers usually don't report the stopping criterion except in vague terms (such as the number of days trained). The authors observe that with a large dataset, even a very large model almost never converges and keeps improving by small amounts. So keep training your model for as long as your GPU budget allows.

**Data preprocessing:** Most neural architectures these days use sub-word units instead of words. It's better to create the sub-word vocabulary from a sufficiently large dataset. It's also advisable to filter the dataset by *max_sequence_length* and store the result (e.g. as TFRecords) before training, rather than re-filtering every epoch, to save precious CPU time (a minimal filtering sketch appears at the end of this summary).

**Batch size:** Computational throughput (tokens processed per unit time) increases sub-linearly with batch size, so beyond a certain point a larger batch buys little extra throughput. From a convergence point of view, however, increasing the batch size usually leads to faster and better convergence. So try using the maximum batch size, whether training on a single GPU or many. Keep in mind that, due to random batching, you may suddenly run out of memory even after days of training, so leave some memory headroom when increasing the batch size.

**Dataset size:** The experiments reinforce the fact that with BIG models, more data is better. When comparing datasets of different sizes, train the models for long enough, because the effect of dataset size usually only shows up after long training runs.

**Model size:** After a few days of training, a bigger model with a smaller batch size outperforms a smaller model with a larger batch size. For debugging, use the smallest model, by the way!

**Maximum sequence length:** Decreasing *max_sequence_length* excludes more examples from the dataset while allowing bigger batch sizes, so it's a trade-off. Keeping more examples often offsets the gains from larger batches when training for long enough, but even this gain plateaus beyond a sufficient sequence length, since very long sentences are outliers and contribute little to performance.

**Learning rate and warmup steps:** The usual advice of using a not-so-high, not-so-low learning rate applies here. Using a large number of warmup steps often offsets the damage caused by a large learning rate, as does gradient clipping.

**Number of GPUs:** For the fastest convergence, use as many GPUs as available; there is no noticeable variation in final performance. There is much debate about scaling the learning rate when going from a single GPU to multiple GPUs, but the authors report no significant variation when keeping the same learning rate, independent of the batch size (which grows with more GPUs).

**Checkpoint averaging:** Averaging the last n (=10) model checkpoints, saved at 1-hour or 30-minute intervals, almost always leads to better performance.
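A minimal, framework-agnostic sketch of checkpoint averaging is below; this is not the Tensor2Tensor utility, and `average_checkpoints` plus the commented-out `load_checkpoint`/`save_checkpoint` helpers are hypothetical names. The idea is simply an element-wise mean of each parameter across the last n saved checkpoints.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping parameter name -> np.ndarray."""
    assert checkpoints, "need at least one checkpoint"
    averaged = {}
    for name in checkpoints[0]:
        # stack the same parameter across checkpoints and take the element-wise mean
        averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return averaged

# Hypothetical usage: average the 10 most recent hourly checkpoints, assuming
# load_checkpoint/save_checkpoint are your own I/O helpers.
# averaged = average_checkpoints([load_checkpoint(p) for p in last_10_paths])
# save_checkpoint(averaged, "averaged_model.npz")
```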
This is similar to the Averaged SGD used in AWD-LSTM (*Merity et al.*).
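Finally, the data-preprocessing tip above (filter by *max_sequence_length* once, before training, rather than every epoch) might look like the sketch below. The function name `filter_by_length`, the plain-text two-file format, and the whitespace tokenizer are placeholders for illustration; the paper's pipeline filters on sub-word units and stores TFRecords via Tensor2Tensor.

```python
def filter_by_length(src_path, tgt_path, out_prefix,
                     max_sequence_length=70, tokenize=str.split):
    """Keep only sentence pairs whose tokenized lengths fit the budget,
    writing the filtered corpus once so training never re-filters it."""
    kept = 0
    with open(src_path) as fs, open(tgt_path) as ft, \
         open(out_prefix + ".src", "w") as out_src, \
         open(out_prefix + ".tgt", "w") as out_tgt:
        for src, tgt in zip(fs, ft):
            # whitespace token counts are a stand-in for sub-word lengths
            if len(tokenize(src)) <= max_sequence_length and \
               len(tokenize(tgt)) <= max_sequence_length:
                out_src.write(src)
                out_tgt.write(tgt)
                kept += 1
    return kept  # number of sentence pairs kept after filtering
```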