Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers on ShortScience.org

Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers
Jianbo Ye and Xin Lu and Zhe Lin and James Z. Wang
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG

Summary by jwturner 6 years ago

The central argument of the paper is that pruning deep neural networks by removing the smallest weights is not always wise. The authors give two examples showing why regularisation that relies on this small-norm criterion is unsatisfactory.

## **Pruning via batchnorm**
As an alternative to the traditional approach of removing small weights, the authors propose pruning filters using regularisation on the gamma term used to scale the result of batch normalization.

Consider a convolutional layer with batchnorm applied:
```
out = max{ gamma * BN( convolve(W, x) ) + beta, 0 }
```

Regularising the gamma term drives it towards zero for some channels. Once gamma is zero, only the additive beta remains, so the channel's output is the constant `max{ beta, 0 }` almost everywhere (except at padding). The authors train the network with regularisation on the gamma terms and, after convergence, remove any constant filters before fine-tuning the model with further training.
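To make the constant-channel claim concrete, here is a minimal sketch in plain Python. It assumes a single channel whose pre-activations `z` stand in for `convolve(W, x)`, and it ignores padding. The helper names `batchnorm` and `channel_output` are illustrative and do not come from the paper.

```python
import math

def batchnorm(values, eps=1e-5):
    """Normalise a list of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def channel_output(pre_activations, gamma, beta):
    """out = max(gamma * BN(z) + beta, 0), with z standing in for convolve(W, x)."""
    return [max(gamma * z + beta, 0.0) for z in batchnorm(pre_activations)]

# Hypothetical pre-activations for one channel (illustrative values only).
z = [0.3, -1.2, 2.5, 0.7]

active = channel_output(z, gamma=1.0, beta=0.5)  # output still varies with the input
pruned = channel_output(z, gamma=0.0, beta=0.5)  # every entry collapses to max(beta, 0)
```

With `gamma = 0.0` every entry of `pruned` equals `max(beta, 0)`, so the channel carries no information about the input and can be removed, with its constant contribution handled separately as described in the paper.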

The general algorithm is as follows:
- **Compute the sparse penalty for each layer.** This essentially corresponds to determining the memory footprint of each channel of the layer. We refer to the penalty as lambda.
- **Rescale the gammas.** Choose some alpha in {0.001, 0.01, 0.1, 1} and use it to scale the gamma term of each layer, applying `1/alpha` to the weights of the successive convolutional layers.
- **Train the network using ISTA regularisation on gamma.** Train the network using SGD, applying the ISTA penalty to each layer with strength `rho * lambda`, where rho is another hyperparameter and lambda is the sparse penalty calculated in step 1.
- **Remove constant filters.**
- **Scale back.** Multiply gamma by `1 / alpha` and the weights of the successive convolutional layers by `alpha` to undo the rescaling from step 2.
- **Finetune.** Retrain the pruned network for a small number of epochs.
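
The rescaling and ISTA steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it treats each gamma as a plain float, uses the standard soft-thresholding form of the ISTA proximal step, and the names `soft_threshold`, `ista_step`, `rescale`, `lr` and `lam` are hypothetical.

```python
def soft_threshold(x, eta):
    """Proximal operator of eta * |x|: shrink x towards zero by eta."""
    magnitude = max(abs(x) - eta, 0.0)
    return magnitude if x >= 0 else -magnitude

def ista_step(gammas, grads, lr, rho, lam):
    """One ISTA update: a gradient step on the loss, then soft-thresholding."""
    eta = lr * rho * lam  # threshold grows with the penalty rho * lambda
    return [soft_threshold(g - lr * dg, eta) for g, dg in zip(gammas, grads)]

def rescale(gammas, next_weights, alpha):
    """Scale gammas by alpha and the next layer's weights by 1 / alpha."""
    return [alpha * g for g in gammas], [w / alpha for w in next_weights]

# Hypothetical gammas and loss gradients for a three-channel layer.
gammas = [0.8, 0.02, -0.5]
grads = [0.1, 0.0, -0.2]

updated = ista_step(gammas, grads, lr=0.1, rho=1.0, lam=0.5)
kept = [i for i, g in enumerate(updated) if g != 0.0]  # channels that survive

# Scaling back uses the same helper with 1 / alpha.
scaled_g, scaled_w = rescale([0.8], [2.0], alpha=0.1)
restored_g, restored_w = rescale(scaled_g, scaled_w, alpha=1 / 0.1)
```

In this toy run the second gamma is smaller than the threshold `lr * rho * lam`, so the update sets it exactly to zero and that channel becomes a candidate for removal. Unlike a plain L1 term added to the loss, the proximal step produces exact zeros rather than merely small values.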

Please check the paper for details on how to prune constant channels.
