Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Antonio Valerio Miceli Barone
arXiv e-Print archive - 2016 via Local arXiv
Keywords:
cs.CL, cs.LG, cs.NE
First published: 2016/08/09
Abstract: Current approaches to learning vector representations of text that are
compatible between different languages usually require some amount of parallel
text, aligned at word, sentence or at least document level. We hypothesize
however, that different natural languages share enough semantic structure that
it should be possible, in principle, to learn compatible vector representations
just by analyzing the monolingual distribution of words.
In order to evaluate this hypothesis, we propose a scheme to map word vectors
trained on a source language to vectors semantically compatible with word
vectors trained on a target language using an adversarial autoencoder.
We present preliminary qualitative results and discuss possible future
developments of this technique, such as applications to cross-lingual sentence
representations.
This is a simple unsupervised method for learning word-level translation
between embeddings of two different languages.
That's right -- unsupervised.
The basic motivating hypothesis is that there should be an isomorphism between
the "semantic spaces" of different languages:
> we hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes, the monolingual corpora, in principle without any form of explicit alignment.
If you squint a bit, this premise supports a more aggressive claim: there should
be a nonlinear mapping (learnable by an MLP) between *word embedding spaces*
that gets us the same result.
The author uses the adversarial autoencoder framework (AAE; Makhzani et al.,
2015) to enforce a cross-lingual semantic mapping between word embedding
spaces. The basic setup for adversarial training between a source and a target
language:
1. Sample a batch of words from the source language according to the language's
word frequency distribution.
2. Sample a batch of words from the target language according to its word
frequency distribution. (No relationship between the two samples is enforced
here.)
3. Feed the word embeddings corresponding to the source words through an
*encoder* MLP. This corresponds to the standard "generator" in a GAN setup.
4. Pass the generator output to a *discriminator* MLP along with the
target-language word embeddings.
5. Also pass the generator output to a *decoder*, which maps it back to the
source embedding space.
6. Update weights with a combination of the GAN loss and the reconstruction
loss. (A sketch of this training loop in code follows.)
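For concreteness, here is a minimal PyTorch sketch of that training loop. It is an interpretation of the steps above, not the author's code: the MLP sizes, optimizers, loss weighting, and the names `src_emb`, `tgt_emb`, `src_freq`, `tgt_freq` (pre-trained embedding matrices and unigram frequency vectors for each language) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, HIDDEN = 300, 500        # embedding / hidden sizes (assumed)

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, d_out))

encoder = mlp(DIM, DIM)        # source space -> target space (the "generator")
decoder = mlp(DIM, DIM)        # target space -> back to source space
discriminator = mlp(DIM, 1)    # real target vector vs. encoded source vector

opt_g = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def sample(emb, freq, n=128):
    """Draw a batch of embeddings according to the language's word-frequency distribution."""
    return emb[torch.multinomial(freq, n, replacement=True)]

def train_step(src_emb, src_freq, tgt_emb, tgt_freq, lambda_rec=1.0):
    x_src = sample(src_emb, src_freq)   # step 1: source batch
    x_tgt = sample(tgt_emb, tgt_freq)   # step 2: target batch (unpaired)

    # Discriminator update (step 4): target embeddings are "real",
    # encoded source embeddings are "fake".
    with torch.no_grad():
        fake = encoder(x_src)           # step 3
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(x_tgt), torch.ones(len(x_tgt), 1))
              + F.binary_cross_entropy_with_logits(discriminator(fake), torch.zeros(len(fake), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Encoder/decoder update (steps 3, 5, 6): fool the discriminator
    # while remaining able to reconstruct the source embedding.
    code = encoder(x_src)
    gan_loss = F.binary_cross_entropy_with_logits(discriminator(code), torch.ones(len(code), 1))
    rec_loss = F.mse_loss(decoder(code), x_src)       # step 5
    g_loss = gan_loss + lambda_rec * rec_loss         # step 6
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The reconstruction term is what makes this an adversarial *autoencoder* rather than a plain GAN: it forces the learned mapping to stay roughly invertible, so the encoder can't collapse large parts of the source vocabulary onto a few target vectors.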
### Does it work?
We don't really know. The paper is unfortunately short on evaluation --- we
just see a few examples of success and failure from a trained model. One easy
evaluation would be to plot lexical-mapping accuracy against the corpus
frequency of the source word (sketched below). I would bet this would reveal
that the model hasn't done much more than learn to align a small set of
high-frequency words.
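Given a small gold dictionary (used only for evaluation), that frequency-bucketed check could look roughly like this. The names `gold` (a source-index to target-index dict), `src_emb`, `tgt_emb`, and `encoder` are assumed as in the earlier sketch, as is the convention that embedding rows are ordered by corpus frequency.

```python
import torch
import torch.nn.functional as F

def precision_at_1_by_frequency(gold, src_emb, tgt_emb, encoder, bucket_size=1000):
    """Nearest-neighbour translation accuracy, bucketed by source-word frequency rank."""
    buckets = {}
    with torch.no_grad():
        mapped = F.normalize(encoder(src_emb), dim=1)   # source words mapped into target space
        tgt = F.normalize(tgt_emb, dim=1)
        for src_idx, tgt_idx in gold.items():
            pred = torch.argmax(tgt @ mapped[src_idx]).item()   # cosine nearest neighbour
            b = src_idx // bucket_size     # bucket by frequency rank of the source word
            hits, total = buckets.get(b, (0, 0))
            buckets[b] = (hits + int(pred == tgt_idx), total + 1)
    return {b: hits / total for b, (hits, total) in buckets.items()}
```

If the bet above is right, precision would look respectable in the first bucket or two and fall off quickly after that.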