A common practice in deep learning is to design the network architecture first, "freeze" it, and then train the parameters. The paper points out a potential dilemma with this approach: more complex networks may have greater representational power but can be harder to train. To address this issue, the paper proposes training the network in a hybrid fashion, where simpler and more complex components are combined via a weighted average, and the weight is updated over the course of training so that the more complex components are gradually introduced while the fast training of the simpler ones is exploited early on.
Concretely, the authors propose blending any two architectural components as optimisation progresses: the initial component, e.g. a rectifier, is gradually switched off in favour of another. The authors claim that this strategy leads to fast convergence and present supporting experimental results.
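The blending rule described above can be sketched as a convex combination with an annealed mixing weight. The sketch below is illustrative only: the choice of components (identity vs. ReLU) and the linear annealing schedule are assumptions for exposition, not the paper's exact design.

```python
import numpy as np

def blended_activation(x, alpha):
    """Convex combination of a simple and a more complex component.

    alpha = 1.0 -> purely the "simple" component (here: identity, a stand-in)
    alpha = 0.0 -> purely the "complex" component (here: ReLU, a stand-in)
    """
    simple = x                      # hypothetical simple component
    complex_ = np.maximum(x, 0.0)   # hypothetical complex component
    return alpha * simple + (1.0 - alpha) * complex_

def alpha_schedule(step, total_steps):
    """Linearly anneal the blending weight from 1 to 0 over training
    (one possible schedule; the paper may use a different update rule)."""
    return max(0.0, 1.0 - step / total_steps)
```

Early in training the network behaves like the simpler architecture; as `alpha` decays, the more complex component takes over smoothly rather than being swapped in abruptly.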