Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
arXiv e-Print archive, 2017
Keywords: cs.LG, cs.CL, cs.NE, stat.ML
First published: 2017/01/23
Abstract: The capacity of a neural network to absorb information is limited by its
number of parameters. Conditional computation, where parts of the network are
active on a per-example basis, has been proposed in theory as a way of
dramatically increasing model capacity without a proportional increase in
computation. In practice, however, there are significant algorithmic and
performance challenges. In this work, we address these challenges and finally
realize the promise of conditional computation, achieving greater than 1000x
improvements in model capacity with only minor losses in computational
efficiency on modern GPU clusters. We introduce a Sparsely-Gated
Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward
sub-networks. A trainable gating network determines a sparse combination of
these experts to use for each example. We apply the MoE to the tasks of
language modeling and machine translation, where model capacity is critical for
absorbing the vast quantities of knowledge available in the training corpora.
We present model architectures in which a MoE with up to 137 billion parameters
is applied convolutionally between stacked LSTM layers. On large language
modeling and machine translation benchmarks, these models achieve significantly
better results than state-of-the-art at lower computational cost.
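To make the mechanism in the abstract concrete, the following is a minimal sketch of a sparsely-gated mixture-of-experts layer with top-k gating: a trainable gating network scores the experts, only the k highest-scoring experts run on each example, and their outputs are mixed with softmax weights. This is an illustrative PyTorch sketch, not the paper's implementation; the layer sizes, number of experts, and k are assumed values, and the paper's noisy gating and load-balancing losses are omitted.

    # Illustrative sketch of a sparsely-gated MoE layer with top-k gating.
    # Hyperparameters (d_model, d_hidden, num_experts, k) are assumptions, not the paper's settings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        def __init__(self, d_model=512, d_hidden=1024, num_experts=16, k=2):
            super().__init__()
            self.k = k
            # Each expert is a simple two-layer feed-forward sub-network.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(num_experts)
            ])
            # Trainable gating network producing one logit per expert.
            self.gate = nn.Linear(d_model, num_experts)

        def forward(self, x):
            # x: (batch, d_model). Keep only the top-k experts per example.
            logits = self.gate(x)                              # (batch, num_experts)
            topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # (batch, k)
            # Softmax over the selected logits gives the sparse mixture weights.
            weights = F.softmax(topk_vals, dim=-1)             # (batch, k)

            out = torch.zeros_like(x)
            # Dense loop over experts for clarity; an efficient implementation
            # would dispatch only the routed examples to each expert.
            for i, expert in enumerate(self.experts):
                mask = (topk_idx == i)                         # (batch, k) bool
                if mask.any():
                    rows = mask.any(dim=-1)                    # examples routed to expert i
                    w = (weights * mask).sum(dim=-1)[rows]     # this expert's weight per example
                    out[rows] += w.unsqueeze(-1) * expert(x[rows])
            return out

    if __name__ == "__main__":
        layer = SparseMoE()
        y = layer(torch.randn(8, 512))
        print(y.shape)  # torch.Size([8, 512])

Because only k of the experts are evaluated per example, total parameter count can grow with the number of experts while per-example computation stays roughly constant, which is the source of the capacity gains claimed in the abstract.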