Summary by Tim Miller 6 years ago
This work extends sequence-to-sequence models for machine translation by using syntactic information on the source-language side. The paper looks at the translation task with English as the source language and Japanese as the target. The dataset is the ASPEC corpus of scientific paper abstracts, which appear to exist in both English and Japanese (see the note on the data below). The trees for the source (English) side are generated by running the ENJU parser on the English data, resulting in binary trees; only the bracketing information is used (no phrase category information).
Given that setup, the method is an extension of seq2seq translation models where the encoding of the source language is augmented with a Tree-LSTM. They deviate from a standard Tree-LSTM by first running a sequential LSTM across the tokens and using its hidden states as the leaves of the tree, instead of the token embeddings themselves. Once they have the encoding from the tree, it is concatenated with the standard encoding from the sequential LSTM. At decoding time, the attention for output token $y_j$ is computed over all source tree nodes $i$, which include the $n$ input token nodes and the $n-1$ phrasal nodes, as the similarity between the hidden state $s_j$ and the encoding at node $i$, passed through a softmax. Another deviation from standard practice (I believe) is that the hidden state $s_j$ in the decoder is a function of the previous output token $y_{j-1}$, the previous time step's hidden state $s_{j-1}$, and the previous time step's attention-modulated hidden state $\tilde{s}_{j-1}$.
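To make the architecture concrete, here is a minimal PyTorch-style sketch of how I read this (my own reconstruction, not the authors' C++ implementation). The module names, the single linear layer computing all Tree-LSTM gates, the zero-initialized leaf memory cells, and the dot-product attention are assumptions on my part; the sketch also skips exactly how the tree and sequential encodings are combined to initialize the decoder.

```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    """Combines two children's (h, c) pairs into a parent (h, c).
    Phrasal nodes receive no input word, only their children's states."""
    def __init__(self, hidden_size):
        super().__init__()
        # Input, left-forget, right-forget, output, and update gates,
        # all computed from the concatenated child hidden states.
        self.gates = nn.Linear(2 * hidden_size, 5 * hidden_size)

    def forward(self, h_l, c_l, h_r, c_r):
        i, f_l, f_r, o, u = self.gates(torch.cat([h_l, h_r], dim=-1)).chunk(5, dim=-1)
        c = (torch.sigmoid(i) * torch.tanh(u)
             + torch.sigmoid(f_l) * c_l + torch.sigmoid(f_r) * c_r)
        return torch.sigmoid(o) * torch.tanh(c), c


class TreeEncoder(nn.Module):
    """Sequential LSTM over the tokens first; its hidden states become the
    leaves, and phrasal nodes are composed bottom-up with the Tree-LSTM cell."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.seq_lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.tree_cell = BinaryTreeLSTMCell(hidden_size)

    def forward(self, token_ids, tree):
        # token_ids: (1, n); tree: nested pairs of leaf indices, e.g. ((0, 1), (2, (3, 4)))
        leaf_h, _ = self.seq_lstm(self.embed(token_ids))     # (1, n, hidden)
        nodes = []                                           # all n + (n - 1) hidden states

        def compose(node):
            if isinstance(node, int):                        # leaf: sequential LSTM state
                h = leaf_h[0, node]
                c = torch.zeros_like(h)                      # assumption: zero leaf memory cell
            else:
                (h_l, c_l), (h_r, c_r) = compose(node[0]), compose(node[1])
                h, c = self.tree_cell(h_l, c_l, h_r, c_r)
            nodes.append(h)
            return h, c

        root_h, root_c = compose(tree)
        return torch.stack(nodes), (root_h, root_c)          # (2n-1, hidden), root state


class Decoder(nn.Module):
    """One decoding step with input feeding: s_j depends on y_{j-1}, s_{j-1},
    and the previous attention-modulated state s~_{j-1}."""
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.cell = nn.LSTMCell(embed_size + hidden_size, hidden_size)
        self.attn_out = nn.Linear(2 * hidden_size, hidden_size)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def step(self, y_prev, s_prev, c_prev, s_tilde_prev, node_states):
        x = torch.cat([self.embed(y_prev), s_tilde_prev], dim=-1)
        s_j, c_j = self.cell(x, (s_prev, c_prev))
        # Attention over all 2n-1 tree nodes (token leaves and phrasal nodes):
        # dot-product similarity with s_j, then softmax.
        alpha = torch.softmax(node_states @ s_j, dim=0)
        d_j = alpha @ node_states
        s_tilde = torch.tanh(self.attn_out(torch.cat([s_j, d_j], dim=-1)))
        return self.proj(s_tilde), s_j, c_j, s_tilde
```

Here `node_states` from the encoder is what the decoder attends over at every step, and feeding $\tilde{s}_{j-1}$ back into the decoder cell is the deviation noted above.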
The authors introduce an additional trick for improving decoding performance on long sentences, since they say standard length normalization did not work. Their method is to estimate a probability distribution over output length given input length, and to add the log of this probability, evaluated at the current output length, as an additional penalty term in the scoring function.
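As a rough sketch of how such a penalty could be wired into beam-search scoring (the exact estimator and any interpolation weight aren't given in the summary, so the relative-frequency estimate and the unweighted log term below are my assumptions):

```python
import math
from collections import Counter, defaultdict

def estimate_length_model(pairs):
    """Relative-frequency estimate of P(output length | input length)
    from (source_tokens, target_tokens) training pairs."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[len(src)][len(tgt)] += 1
    return {n: {m: c / sum(cnt.values()) for m, c in cnt.items()}
            for n, cnt in counts.items()}

def length_penalized_score(model_log_prob, src_len, out_len, length_model, floor=1e-12):
    """Hypothesis score used when comparing beam candidates: the model's
    log-probability plus log P(out_len | src_len)."""
    p_len = length_model.get(src_len, {}).get(out_len, floor)
    return model_log_prob + math.log(p_len)
```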
They evaluate using RIBES (a metric I'm not familiar with) and BLEU scores, and show better performance than other NMT and SMT methods, and performance similar to the best-performing (non-neural) tree-to-sequence model.
Implementation: They seem to have a custom implementation in C++ rather than using a DNN library. Their implementation takes one day to run one epoch of training on the full training set. They do not say how many epochs they train for.
Note on data: We have looked at this data a bit for a project I'm working on, and the English sentences look like translations from Japanese. A large proportion of the sentences are written in passive form with the structure "X was Yed", e.g., "the data was processed", "the cells were cultured." This looks to me like they translated subject-dropped Japanese sentences, which would have the same word order but are not actually passive! So that raises for me the question of how representative the source-side inputs are of natural English.