This paper compares four ways of training cross-lingual embeddings and evaluates them on downstream tasks to investigate which kinds of supervision suit which tasks. This is important to understand because the methods require different amounts and quality of supervision. If cross-lingual embeddings only work when word-level alignments are available, they won't help with low-resource languages. On the other hand, if methods trained only on comparable corpora are effective for some downstream tasks, then there is some hope for low-resource languages.
The four methods are:
1) BiSkip, a skip-gram-like method that uses word-aligned corpora: it substitutes each word with its aligned word in the target language and predicts the source-language words in the context. This requires sentence-aligned corpora, and I believe the word alignments are derived automatically from them.
2) BiCVM, a model that composes sentence embeddings from their component word embeddings and optimizes a loss that minimizes the difference between the embeddings of aligned sentences. This of course requires aligned sentences as well.
3) BiCCA, a projection-based method that takes two independently trained monolingual sets of word vectors and learns projections into a shared space. A translation dictionary identifies the subset of words that align, and CCA learns a mapping that respects those alignments. The projection can then be applied to all words in each vocabulary, not only the dictionary entries. This method does not require parallel or comparable corpora.
4) A method that uses comparable corpora (similar documents in each language, à la Wikipedia) to create pseudo-documents by randomly sampling words from each language's document. Once these pseudo-documents are created, they are used to train word2vec skip-gram.
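To make the fourth method concrete, here is a minimal sketch of how a pseudo-document might be built from one pair of comparable documents. It assumes the simplest possible variant (concatenate the two token lists and shuffle them); the paper's exact sampling scheme may differ, and the function name `make_pseudo_document` and the toy sentences are hypothetical.

```python
import random
from typing import List, Optional


def make_pseudo_document(
    doc_a: List[str], doc_b: List[str], seed: Optional[int] = None
) -> List[str]:
    """Mix the tokens of two comparable documents into one pseudo-document.

    Assumption: a plain merge-and-shuffle, so words from both languages
    end up in each other's context windows. This is an illustration, not
    necessarily the procedure used in the paper.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    merged = list(doc_a) + list(doc_b)
    rng.shuffle(merged)  # in-place random permutation of all tokens
    return merged


# Toy comparable "documents" (hypothetical data)
en = "the cat sat on the mat".split()
de = "die katze sass auf der matte".split()

pseudo = make_pseudo_document(en, de, seed=0)
```

The resulting token lists, one per document pair, would then be fed to an ordinary monolingual skip-gram trainer, which is what makes this method so cheap in terms of supervision.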
These methods are evaluated on four NLP tasks: monolingual word similarity, cross-lingual dictionary induction, cross-lingual document classification, and cross-lingual dependency parsing.
Of the models that require sentence alignments, BiSkip usually beats BiCVM. BiSkip is best for most of the semantic tasks. BiCCA is as good as or better than the alternatives for dependency parsing, suggesting that cheaper methods might be adequate for syntactic tasks. One major caveat is that the Chinese dependency parsing results are terrible, which suggests that this style of training for dependency parsers probably only works when the languages have similar structure. So the benefits for parsing low-resource languages may be minimal, even though the supervision required to create the embeddings is low.