Once for All: Train One Network and Specialize it for Efficient Deployment on ShortScience.org

arxiv.org
scholar.google.com

Once for All: Train One Network and Specialize it for Efficient Deployment
Cai, Han and Gan, Chuang and Han, Song
arXiv e-Print archive - 2019 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

[link] Summary by ameroyer 4 years ago

**Summary**: The goal of this work is to propose a "Once-for-all” (OFA) network: a large network which is trained such that its subnetworks (subsets of the network with smaller width, convolutional kernel sizes, shallower units) are also trained towards the target task. This allows to adapt the architecture to a given budget at inference time while preserving performance.

**Elastic Parameters.**
The goal is to train a large architecture that contains several well-trained subnetworks with different architecture configurations (in terms of depth, width, kernel size, and resolution). One of the key difficulties is to ensure that each subnetwork reaches high-accuracy even though it is not trained independently but as part of a larger architecture.
This work considers standard CNN architectures (decreasing spatial resolution and increasing number of feature maps), which can be decomposed into units (A unit is a block of layers such that the first layer has stride 2, and the remaining ones have stride 1). The parameters of these units (depth, kernel size, input resolution, width) are denoted as *elastic parameters* in the sense that they can take different values, which defines different subnetworks, which still share the convolutional parameters.

**Progressive Shrinking.**
Additionally, the authors consider a curriculum-style training process which they call *progressive shrinking*. First, they train the model with the maximum depth, $D$, kernel size, $K$, and width, $W$, which yields convolutional parameters . Then they progressively fine-tune this weight, with an additional distillation term from the largest network, while considering different values for the elastic parameters, in the following order:
* Elastic kernel size: Training for a kernel size $k < K$ is done by taking a linear transformation the center $k \times k$ patch in the full $K \times K$ kernels that are in . The linear transformation is useful to model the fact that different scales might be useful for different tasks.
* Elastic depth: To train for depth $d < D$, simply skip the last $D-d$ layers of the unit (rather than looking at every subset of dlayers)
* Elastic width: For a width $w < W$. First, the channels are reorganized by importance (decreasing order of the $L1$-norm of their weights), then use only the top wchannels
* Elastic resolution: Simply train with different image resolutions / resizing: This is actually used for all training processes.

**Experiments.**
Having trained the Once-for-all (OFA) network, the goal is now to find the adequate architecture configuration, given a specific task/budget constraints. To do this automatically, they propose to train a small performance predictive model. They randomly sample 16K subnetworks from OFA, evaluate their accuracy on a validation set, and learn to predict accuracy based on architecture and input image resolution. (Note: It seems that this predictor is then used to perform a cheap evolutionary search, given latency constraints, to find the best architecture config but the whole process is not entirely clear to me. Compared to a proper neural architecture search, however it should be inexpensive).

The main experiments are on ImageNet, using MobileNetv3 as the base full architecture, with the goal of applying the model across different platforms with different inference budget constraints. Overall, the proposed model achieves comparable or higher accuracies for reduced search time, compared to neural architecture search baselines. More precisely their model has a fixed training cost (the OFA network) and a small search cost (find best config based on target latency), which is still lower than doing exhaustive neural architecture search. Furthermore, progressive shrinking does have a significant positive impact on the subnetworks accuracy (+4%).

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private