Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers
Jianbo Ye, Xin Lu, Zhe Lin, James Z. Wang
arXiv e-Print archive - 2018 via Local arXiv
Keywords:
cs.LG
First published: 2018/02/01
Abstract: Model pruning has become a useful technique that improves the computational
efficiency of deep learning, making it possible to deploy solutions on
resource-limited scenarios. A widely-used practice in relevant work assumes
that a smaller-norm parameter or feature plays a less informative role at the
inference time. In this paper, we propose a channel pruning technique for
accelerating the computations of deep convolutional neural networks (CNNs),
which does not critically rely on this assumption. Instead, it focuses on
direct simplification of the channel-to-channel computation graph of a CNN
without the need of performing a computational difficult and not always useful
task of making high-dimensional tensors of CNN structured sparse. Our approach
takes two stages: the first being to adopt an end-to-end stochastic training
method that eventually forces the outputs of some channels being constant, and
the second being to prune those constant channels from the original neural
network by adjusting the biases of their impacting layers such that the
resulting compact model can be quickly fine-tuned. Our approach is
mathematically appealing from an optimization perspective and easy to
reproduce. We experimented our approach through several image learning
benchmarks and demonstrate its interesting aspects and the competitive
performance.
The central argument of the paper is that pruning deep neural networks by removing the smallest-norm weights is not always wise. The authors provide two examples to show that regularisation in this form is unsatisfactory.
## **Pruning via batchnorm**
As an alternative to the traditional approach of removing small weights, the authors propose pruning filters using regularisation on the gamma term used to scale the result of batch normalization.
Consider a convolutional layer with batchnorm applied:
```
out = max{ gamma * BN( convolve(W,x) ) + beta, 0 }
```
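Imposing a sparsity-inducing regularisation on the gamma term drives some gammas to zero. When gamma is zero for a channel, the batchnorm output no longer depends on the input, so the resulting image is constant almost everywhere (except for padding): only the additive beta survives, and the channel outputs `max{ beta, 0 }`. The authors train the network with this regularisation on gamma and, after convergence, remove any constant filters before fine-tuning the model with further training.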
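To make the constant-channel claim concrete, here is a minimal sketch in plain Python. It applies the formula above to a single channel, treating the convolution output as a flat list of numbers. The function name `batchnorm_relu_channel`, the `eps` value and the sample inputs are illustrative assumptions, not taken from the paper.

```python
import math

def batchnorm_relu_channel(conv_out, gamma, beta, eps=1e-5):
    """Compute max(gamma * BN(z) + beta, 0) for one channel.

    conv_out is a flat list of pre-activations for that channel
    (hypothetical data; eps is an assumed numerical-stability constant).
    """
    mean = sum(conv_out) / len(conv_out)
    var = sum((z - mean) ** 2 for z in conv_out) / len(conv_out)
    normalised = [(z - mean) / math.sqrt(var + eps) for z in conv_out]
    return [max(gamma * n + beta, 0.0) for n in normalised]

conv_out = [0.3, -1.2, 2.5, 0.7]

# With a non-zero gamma the channel still depends on its input.
active = batchnorm_relu_channel(conv_out, gamma=1.0, beta=0.5)

# With gamma == 0 every position collapses to max(beta, 0).
pruned = batchnorm_relu_channel(conv_out, gamma=0.0, beta=0.5)
print(pruned)  # [0.5, 0.5, 0.5, 0.5]
```

Because the pruned channel carries no information about the input, its contribution to the next layer can in principle be absorbed into that layer's bias, which is what the abstract describes as adjusting the biases of the impacted layers.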
The general algorithm is as follows:
- **Compute the sparse penalty for each layer.** This essentially corresponds to determining the memory footprint of each channel of the layer. We refer to the penalty as lambda.
- **Rescale the gammas.** Choose some alpha in {0.001, 0.01, 0.1, 1}, multiply the gamma term of each layer by alpha, and multiply the weights of the successive convolutional layers by `1/alpha`.
- **Train the network using ISTA regularisation on gamma.** Train the network using SGD, applying the ISTA penalty to each layer with strength `rho * lambda`, where rho is another hyperparameter and lambda is the sparse penalty calculated in step 1.
- **Remove constant filters.**
- **Scale back.** Multiply gamma by `1/alpha` and the weights of the successive convolutional layers by `alpha` to undo the rescaling from step 2.
- **Finetune.** Retrain the pruned network for a small number of epochs.
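The ISTA training step in the list above can be sketched as a gradient step on each gamma followed by soft-thresholding, which is the standard proximal operator for an L1 penalty. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the learning rate `lr` and the sample gradients are all hypothetical, and the threshold is taken to be `lr * rho * lam`.

```python
def soft_threshold(x, eta):
    """Proximal operator of eta * |x|: shrink x towards zero by eta."""
    magnitude = max(abs(x) - eta, 0.0)
    return magnitude if x >= 0 else -magnitude

def ista_step(gammas, grads, lr, rho, lam):
    """One SGD step on each gamma, then soft-thresholding.

    lr, rho and lam are assumed scalars; lam stands in for the
    per-layer sparse penalty (lambda) from step 1.
    """
    eta = lr * rho * lam
    return [soft_threshold(g - lr * dg, eta) for g, dg in zip(gammas, grads)]

gammas = [0.8, 0.02, -0.5]   # hypothetical batchnorm scales for three channels
grads = [0.1, 0.1, -0.2]     # hypothetical loss gradients w.r.t. each gamma

updated = ista_step(gammas, grads, lr=0.1, rho=0.5, lam=1.0)
# approximately [0.74, 0.0, -0.43]: the small gamma is set exactly to zero

# Channels whose gamma is exactly zero are candidates for removal.
prunable = [i for i, g in enumerate(updated) if g == 0.0]
print(prunable)  # [1]
```

Unlike plain weight decay, the thresholding step sets small gammas to exactly zero instead of merely shrinking them, which is what makes the constant channels easy to identify in the removal step.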