Yoshua Bengio’s visionary work on probabilistic language modelling had a huge impact on the field of Natural Language Processing. Although it was published nearly 20 years ago, it is still relevant to modern NLP solutions. In fact, subsequent works built on it to advance the state of the art in NLP, although it took more than 10 years for the paper to receive significant attention in the field.

The authors address the fundamental problem that, in conventional approaches such as n-gram models, the number of free parameters grows exponentially with the length of the modelled word sequence. This problem, known as the “curse of dimensionality”, limits the generalization performance of such models. For instance, modelling the joint distribution of 10 consecutive words over a vocabulary of 100,000 words involves potentially 100,000^10 − 1 = 10^50 − 1 free parameters. The proposed probabilistic model, in contrast, scales linearly with the size of the vocabulary. It comprises two steps: 1) converting each word into a real-valued feature vector of size m, and 2) expressing the joint probability of the word sequence as a function of these feature vectors (the concatenated input has size (n−1)·m). The first step is implemented as a trainable matrix of size |V|×m; the joint probability function is implemented as two densely connected layers (tanh and softmax activations) with a skip connection from the input of the first layer to the input of the second layer. The hidden layer has a weight matrix of size (n−1)m×h and the output layer one of size ((n−1)m+h)×|V|, which, together with the embedding matrix and the biases, gives a total of |V|(1+nm+h)+h(1+(n−1)m) trainable variables. The output of the model is then mixed with the output of a smoothed trigram model. (A minimal sketch of this architecture is given at the end of this review.)

As the optimization objective, the authors employ the log-likelihood of the next word, regularized with weight decay on the weights of the dense layers (the biases and the |V|×m matrix of real-valued word features are excluded). The goal of the optimization is to find the parameters that minimize the perplexity of the training dataset. Eventually we learn both the distributed representations of the words and the probability function of a sequence expressed in terms of those representations.

The proposed model improved the out-of-sample perplexity by 24% on the Brown corpus and by 8% on the AP News dataset compared to the state-of-the-art smoothed trigram models. The best performing architecture used h=100 and m=30, without a skip connection but with trigram mixing. As the primary drawback of the approach, the authors mention significant speed limitations and propose shallow networks, time-delay networks and recurrent neural networks to mitigate the problem and improve performance. They also consider several promising directions for future research, such as decomposing the network into sub-networks, representing the conditional probability as a tree, introducing a priori knowledge, interpretable word embeddings, and adjusting the gradient propagation path.

As a critique, the training did not explicitly target word semantics, although the authors claim that their embedding takes advantage of the inherently learned similarity (close words are expected to have similar feature vectors). A better approach could be, for example, a Siamese network with an objective that minimizes a distance (e.g. cosine or Euclidean) between “similar” words; this would also decrease the number of trainable variables, as well as exclude the risky non-regularized word-embedding matrix (a sketch of this idea also follows below).
Secondly, as the authors themselves mention, the model could benefit significantly from recurrent architectures such as RNNs, LSTMs and GRUs, as well as from incorporating prior knowledge. Finally, even though the model scales linearly with n and |V|, this scaling still significantly limits its application to large vocabularies and long contexts.
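
For concreteness, below is a minimal PyTorch sketch of the architecture as described above. It is not the authors’ original implementation; the sizes V, n, m and h are illustrative, and the optional direct (skip) connection corresponds to the paper’s direct word-feature-to-output connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    """Feed-forward neural probabilistic language model (sketch)."""
    def __init__(self, vocab_size, n, m, h, direct_connections=True):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)         # |V| x m word-feature matrix
        self.hidden = nn.Linear((n - 1) * m, h)      # tanh layer: (n-1)m -> h
        self.out = nn.Linear(h, vocab_size)          # softmax layer: h -> |V|
        # Optional skip connection from the word features straight to the output.
        self.direct = (nn.Linear((n - 1) * m, vocab_size, bias=False)
                       if direct_connections else None)

    def forward(self, context):                      # context: (batch, n-1) word indices
        x = self.C(context).flatten(1)               # (batch, (n-1)m)
        logits = self.out(torch.tanh(self.hidden(x)))
        if self.direct is not None:
            logits = logits + self.direct(x)
        return F.log_softmax(logits, dim=-1)         # log P(w_t | previous n-1 words)

# Sanity check of the parameter count quoted above: |V|(1+nm+h) + h(1+(n-1)m).
V, n, m, h = 1000, 5, 30, 100                        # illustrative sizes
model = NPLM(V, n, m, h)
assert (sum(p.numel() for p in model.parameters())
        == V * (1 + n * m + h) + h * (1 + (n - 1) * m))
```

Training would then maximize the log-likelihood of the next word given these log-probabilities (e.g. with a negative log-likelihood loss), with weight decay applied only to the dense-layer weights, and the resulting distribution interpolated with a smoothed trigram model as in the paper.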
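The Siamese alternative raised in the critique could be sketched as follows. This is the reviewer’s proposal, not something evaluated in the paper: the shared embedding matrix is trained directly to pull “similar” words together under a cosine distance, assuming word pairs labelled as similar or dissimilar are available from some external resource (e.g. a thesaurus).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedding(nn.Module):
    """Shared embedding tower; both words of a pair pass through the same matrix C."""
    def __init__(self, vocab_size, m):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)         # |V| x m, the only parameters

    def forward(self, w1, w2):                       # w1, w2: (batch,) word indices
        return F.cosine_similarity(self.C(w1), self.C(w2), dim=-1)

def contrastive_loss(sim, label, margin=0.5):
    """label = 1 for 'similar' pairs, 0 for dissimilar ones."""
    pull = label * (1.0 - sim)                                  # drive similar pairs towards sim = 1
    push = (1.0 - label) * torch.clamp(sim - margin, min=0.0)   # push dissimilar pairs below the margin
    return (pull + push).mean()
```

Such an objective trains only the |V|×m matrix, far fewer parameters than the full language model, and regularizes the embedding directly, at the cost of requiring labelled word pairs.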