Summary by Tim Miller 5 years ago
Read this because it was cited by Zhang et al. 2017 and the title looked interesting. The setting is machine translation where you have parallel sentence pairs in one domain (Europarl) but need to translate in another (biomed). They quantify how much of the performance loss is due to vocabulary differences between the two domains. First, an oracle is created that uses both domains for training. Second, an OOV oracle is created that removes words their mining approach could not possibly find, to see what the upper limit of their approach is.
Their approach, then, uses non-parallel in-domain texts to create word similarity matrices for new terms. They compute a context-based similarity matrix first. This involves creating a feature vector for each word based on the contexts in which it appears, and then computing the similarity between all word pairs. Then they create an orthography-based similarity matrix, using the character n-grams within each word as its feature vector and again computing the similarity between all word pairs. They sum these two matrices to get a combined similarity matrix. They build a bipartite graph over existing word pairs, where each word in the source language is connected to its translation words, with edges weighted by unigram translation probability. This graph is reduced to single pairs (one edge per word) with the Hungarian algorithm, and they use CCA to estimate projections from these training pairs (one projection for each language). These projections are then applied to _all_ words to get new word representations. They then explore a few ways to integrate the resulting scores with existing MT systems, and find that they don't get an improvement just by treating them as ordinary scores; they also need to add features indicating when a score is real and when it is mined.