First published: 2016/05/24
Abstract: How can we efficiently propagate uncertainty in a latent state representation
with recurrent neural networks? This paper introduces stochastic recurrent
neural networks which glue a deterministic recurrent neural network and a state
space model together to form a stochastic and sequential neural generative
model. The clear separation of deterministic and stochastic layers allows a
structured variational inference network to track the factorization of the
model's posterior distribution. By retaining both the nonlinear recursive
structure of a recurrent neural network and averaging over the uncertainty in a
latent path, like a state space model, we improve the state of the art results
on the Blizzard and TIMIT speech modeling data sets by a large margin, while
achieving comparable performances to competing methods on polyphonic music
modeling.

This paper is based on an intriguing idea: combining state-space models (SSMs) with recurrent neural networks (RNNs). Such a combination is very much needed. For sequences with distinct structure and high variability, probabilistic modelling is a hard problem. Handcrafted, parameterised feature representations are widely used to ease it, and it is customary to build the probabilistic model on top of features extracted from the signal (for example, computing a short-time Fourier transform and _then_ modelling those features probabilistically).
But machines should be able to learn the representation as well, and that is the story of the paper. The paper calls the model a neural network with stochastic layers, but one can safely say that it is a state-space model with a deterministic neural network layer that is tied to both the hidden variables and the observations. Still, because of the complicated nonlinearities, it is not easy to understand what is going on.
One thing I found missing from the paper is the exact form of the nonlinearity used for the NN layer, which matters to me because I am looking at it from the perspective of probabilistic modelling. The omission is probably due to space constraints.
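
To make the structure described above concrete, here is a minimal sketch of ancestral sampling from a model of this shape: a deterministic recurrent layer `d` feeds both the latent state `z` and the observation `x`. It is only my reading of the structure, not the paper's implementation. The `tanh` nonlinearity, the Gaussian transition and emission, and all names (`init_params`, `sample_sequence`, the `W_*` matrices, `obs_std`, the input sequence `u`) are assumptions made for illustration.

```python
import numpy as np


def init_params(u_dim, d_dim, z_dim, x_dim, rng, scale=0.1):
    """Random weights for the toy model (hypothetical parameterisation)."""
    shapes = {
        "W_dd": (d_dim, d_dim), "W_du": (d_dim, u_dim),      # deterministic layer
        "W_zz": (z_dim, z_dim), "W_zd": (z_dim, d_dim),      # latent transition
        "W_mu": (z_dim, z_dim), "W_logvar": (z_dim, z_dim),  # transition mean / log-variance
        "W_xz": (x_dim, z_dim), "W_xd": (x_dim, d_dim),      # emission
    }
    params = {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}
    params["obs_std"] = 0.1  # fixed observation noise, an assumption
    return params


def sample_sequence(u, params, rng):
    """Sample x_1..x_T given inputs u_1..u_T by running the model forward."""
    d = np.zeros(params["W_dd"].shape[0])
    z = np.zeros(params["W_zz"].shape[0])
    xs = []
    for u_t in u:
        # Deterministic layer: d_t depends on d_{t-1} and u_t (tanh is assumed).
        d = np.tanh(params["W_dd"] @ d + params["W_du"] @ u_t)
        # Stochastic layer: z_t ~ N(mu, sigma^2), both computed from z_{t-1} and d_t.
        h = np.tanh(params["W_zz"] @ z + params["W_zd"] @ d)
        mu = params["W_mu"] @ h
        sigma = np.exp(0.5 * (params["W_logvar"] @ h))
        z = mu + sigma * rng.standard_normal(mu.shape)
        # Emission: x_t is Gaussian around a linear function of z_t and d_t.
        x_mean = params["W_xz"] @ z + params["W_xd"] @ d
        xs.append(x_mean + params["obs_std"] * rng.standard_normal(x_mean.shape))
    return np.stack(xs)


rng = np.random.default_rng(0)
params = init_params(u_dim=2, d_dim=8, z_dim=4, x_dim=3, rng=rng)
u = rng.standard_normal((20, 2))  # 20 time steps of 2-dimensional inputs
x = sample_sequence(u, params, rng)
print(x.shape)  # (20, 3)
```

Removing the `d` terms from the transition and emission lines leaves an ordinary nonlinear state-space model, which is why I read the paper's model as an SSM with an added deterministic layer.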