[link]
TLDR; The authors train a word-level NMT where UNK tokens in both source and target sentence are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalized that doesn't produce unknown words. The best system achieves a new state of the art BLEU score of 19.9 in WMT'15 English to Czech translation. #### Key Points - Source Sentence: Final hidden state of character-RNN is used as word representation. - Source Sentence: Character RNNs always initialized with 0 state to allow efficient pre-training - Target: Produce word-level sentence including UNK first and then run the char-RNNs - Target: Two ways to initialize char-RNN: With same hidden state as word-RNN (same-path), or with its own representation (separate-path) - Authors find that attention mechanism is critical for pure character-based NMT models #### Notes - Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win?
Your comment:
|
[link]
# Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models ## Introduction * The paper presents a novel open vocabulary NMT(Neural Machine Translation) system that translates mostly at word level and falls back to character level models for rare words. * Advantages: * Faster and easier to train as compared to character models. * Does not produce unknown words in the translations which need to be removed using *unk replacement* techniques. * [Link to the paper](https://arxiv.org/abs/1604.00788) ## Unk Replacement Technique * Most NMT operate on constrained vocabulary and represent unknown words with *unk* token. * A post-processing step replaces *unk* tokens with actual words using alignment information. * Disadvantages: * These systems treat words as independent entities while they are morphologically related. * Difficult to capture things like name translation. ## Proposed Architecture ### Word-level NMT * Deep LSTM encoder-decoder. * Global attention mechanism and bilinear attention scoring function. * Similar to regular NMT system except in the way unknown words are handled. ### Character-level NMT * Deep LSTM model used to generate on-the-fly representation of rare words (using final hidden state from the top layer). * Advantages: * Simplified architecture. * Efficiency through precomputation - representations for rare sources words can be computed at once before each mini-batch. * The model can be trained easily in an end-to-end fashion. #### Hidden-state Initialization * For source representation, layers of the LSTM are initialized with zero hidden states and cell values. * For target representation, the same strategy is followed except for the hidden state of the first layer where one of the following approaches are used: * **same-path** target generation approach * Use the context vector just before softmax (of word-level NMT). * **seperate-path** target generation approach * Learn a new weight matrix **W** that will be used to generate the context vector. ### Training Objective * *J = J<sub>w</sub> + αJ<sub>c</sub>* * *J* - total loss * *J<sub>w</sub>* - loss in a regular word-level NMT * *αJ<sub>c</sub>* - loss in the character-level NMT ### Word Character Generation Strategy * The final hidden state from character-level decoder could be interpreted as the representation of *unk* token but this approach would not be efficient. * Instead, *unk* is fed to the word-level decoder as it is so as to decouple the execution for the character-level model as soon the word-level model finishes. * During testing, a beam search decoder is run at the word level to find the best translation using the word NMT alone. * Next, a character-level encoder is used to generate the words in place of *unk* to minimise the combined loss. ## Experiments ### Data * WMT’15 translation task from English into Czech with newstest2013 (3000 sentences) as dev set and newstest2015 (2656 sentences) as a test set. ### Metrics * Case-sensitive NIST BLEU. * chrF3 ### Models * Purely word based * Purely character based * Hybrid (proposed model) ### Observations * Hybrid model surpasses all the other systems (neural/non-neural) and establishes a new state-of-the-art result for English-Czech translation in WMT’15 with 19.9 BLEU. * Character-level models, when used as a replacement for the standard unk replacement technique in NMT, yields an improvement of up to +7.9 BLEU points. * Attention is very important for character-based models as the non-attentional character models perform poorly. * Character models with shorter time-step backpropagation perform inferior as compared to ones with longer backpropagation. * Separate-path strategy outperforms same-path strategy. ### Rare word embeddings * Obtain representations for rare words. * Compare the Spearman correlation between similarity scores assigned by humans and by the model. * Outperforms the recursive neural network model (which also uses a morphological analyser) on this task. |