First published: 2013/11/07

Abstract: In this paper we propose and investigate a novel nonlinear unit, called $L_p$
unit, for deep neural networks. The proposed $L_p$ unit receives signals from
several projections of a subset of units in the layer below and computes a
normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$
unit. First, the proposed unit can be understood as a generalization of a
number of conventional pooling operators such as average, root-mean-square and
max pooling widely used in, for instance, convolutional neural networks (CNN),
HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain
degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013)
which achieved the state-of-the-art object recognition results on a number of
benchmark datasets. Secondly, we provide a geometrical interpretation of the
activation function based on which we argue that the $L_p$ unit is more
efficient at representing complex, nonlinear separating boundaries. Each $L_p$
unit defines a superelliptic boundary, with its exact shape defined by the
order $p$. We claim that this makes it possible to model arbitrarily shaped,
curved boundaries more efficiently by combining a few $L_p$ units of different
orders. This insight justifies the need for learning different orders for each
unit in the model. We empirically evaluate the proposed $L_p$ units on a number
of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$
units achieve the state-of-the-art results on a number of benchmark datasets.
Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep
recurrent neural networks (RNN).
#### Problem addressed:
A new type of activation function
This paper proposes a new activation function that computes an $L_p$ norm over multiple projections of an input vector. The order $p$ can be learned from the training examples and can differ for each hidden unit. The intuition is twofold: 1) different datasets may have different optimal values of $p$, so it makes more sense to make $p$ tunable; 2) allowing different units to take different values of $p$ can potentially make the approximation of decision boundaries more efficient and more flexible. The empirical results support these two intuitions, and the method achieves comparable results on three datasets.
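As a rough illustration of the unit described above, the following minimal NumPy sketch computes a normalized $L_p$ norm over several linear projections of an input. The function name `lp_unit`, the argument shapes, and the example values are assumptions made for illustration, not taken from the paper; in the paper $p$ is learned per unit, whereas here it is passed in as a fixed number.

```python
import numpy as np

def lp_unit(x, W, b, p):
    """Normalized L_p norm over N linear projections of x (illustrative sketch).

    x: (d,) input vector
    W: (N, d) projection weights, b: (N,) biases
    p: order of the norm, assumed >= 1 (fixed here, learned in the paper)
    """
    z = W @ x + b                                  # N projections of the input
    return np.mean(np.abs(z) ** p) ** (1.0 / p)   # (1/N * sum |z_i|^p)^(1/p)

# Hypothetical example values.
x = np.array([1.0, -2.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.zeros(3)

avg = lp_unit(x, W, b, p=1.0)   # p = 1: mean of absolute projections
rms = lp_unit(x, W, b, p=2.0)   # p = 2: root-mean-square of projections
```

With $p = 1$ the unit reduces to average pooling of the absolute projections and with $p = 2$ to root-mean-square pooling, which is the sense in which it generalizes conventional pooling operators.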
The unit is a generalization of pooling, but applied across channels. When each projection (the dot product of the data and weight vector, plus the bias) is constrained to be non-negative, the $L_\infty$ unit is equivalent to a maxout unit.
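The link to maxout can be checked numerically. The sketch below, with hypothetical names and values, evaluates the normalized $L_p$ norm of a fixed set of non-negative projections for increasing $p$ and compares it with their maximum, which is what a maxout unit would output. Factoring out the largest magnitude is only a numerical-stability choice made here, not something stated in the paper.

```python
import numpy as np

def normalized_lp(z, p):
    """Normalized L_p norm of projections z, with max|z| factored out for stability."""
    a = np.abs(z)
    m = a.max()
    if m == 0.0:
        return 0.0
    return m * np.mean((a / m) ** p) ** (1.0 / p)

# Hypothetical non-negative projections w_i^T x + b_i.
z = np.array([0.5, 2.0, 1.0])
maxout = z.max()                      # a maxout unit outputs the largest projection

# Gap between the L_p unit and maxout shrinks as p grows.
gaps = [abs(normalized_lp(z, p) - maxout) for p in (2.0, 20.0, 200.0)]

# With a negative projection the two differ: L_inf tracks max |z_i|, maxout tracks max z_i.
z_neg = np.array([-3.0, 2.0, 1.0])
lp_large = normalized_lp(z_neg, 200.0)   # close to 3, while z_neg.max() is 2
```

The last two lines show why the non-negativity constraint matters: the norm depends on absolute values, so a large negative projection dominates the $L_p$ unit but not maxout.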
The empirical performance is not very impressive, although there is some evidence supporting the intuition.
Datasets: MNIST, TFD, Pentomino