First published: 2014/09/26

Abstract: Top-performing deep architectures are trained on massive amounts of labeled
data. In the absence of labeled data for a certain task, domain adaptation
often provides an attractive option given that labeled data of similar nature
but from a different domain (e.g. synthetic images) are available. Here, we
propose a new approach to domain adaptation in deep architectures that can be
trained on large amount of labeled data from the source domain and large amount
of unlabeled data from the target domain (no labeled target-domain data is
necessary).
As the training progresses, the approach promotes the emergence of "deep"
features that are (i) discriminative for the main learning task on the source
domain and (ii) invariant with respect to the shift between the domains. We
show that this adaptation behaviour can be achieved in almost any feed-forward
model by augmenting it with few standard layers and a simple new gradient
reversal layer. The resulting augmented architecture can be trained using
standard backpropagation.
Overall, the approach can be implemented with little effort using any of the
deep-learning packages. The method performs very well in a series of image
classification experiments, achieving adaptation effect in the presence of big
domain shifts and outperforming previous state-of-the-art on Office datasets.
The goal of this method is to create a feature representation $f$ of an input $x$ that is invariant to the domain $d$ the input comes from. The feature vector $f$ is obtained from $x$ using an encoder network (i.e. $f = G_f(x)$).
This is difficult because the input $x$ is correlated with $d$, which can lead the model to extract features that capture differences between domains instead of differences between classes. Here I will recast the problem differently from the paper:
**Problem:** Given a conditional probability $p(x|d=0)$ that may be different from $p(x|d=1)$:
$$p(x|d=0) \stackrel{?}{\ne} p(x|d=1)$$
we would like the distributions of the extracted features to be equal across domains:
$$p(G_f(x) |d=0) = p(G_f(x)|d=1)$$
that is:
$$p(f|d=0) = p(f|d=1)$$
Of course, this is an issue if some class label $y$ is correlated with $d$: removing domain information may then hurt the classifier, which may no longer be able to predict $y$ as well as before.
https://i.imgur.com/WR2ujRl.png
The paper proposes attaching a domain classifier network to the feature vector through a gradient reversal layer. This layer acts as the identity in the forward pass and simply flips the sign of the gradient in the backward pass. Here is an example in [Theano](https://github.com/Theano/Theano):
```
import theano

class ReverseGradient(theano.gof.Op):
    ...
    def grad(self, input, output_gradients):
        # Backward pass: pass the incoming gradient through with its sign flipped.
        return [-output_gradients[0]]
```
You then train this domain network as if you want it to correctly predict the domain (adding its error to your loss function). As the domain network learns new ways to correctly predict the domain, its gradients are flipped before they reach the encoder, so the domain information in the feature vector $f$ is removed.
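To make the sign flip concrete, here is a minimal pure-Python sketch for a toy scalar model, not the paper's network. Everything in it is an assumption made for illustration: the one-weight encoder `w_f`, the one-weight domain classifier `w_d`, the scale `lam`, and the function name `domain_branch_grads`. It only shows that the domain classifier receives the ordinary gradient of the domain loss while the encoder receives the same gradient with its sign flipped.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_branch_grads(x, d, w_f, w_d, lam=1.0):
    """Domain loss and gradients for one sample of a toy scalar model.

    x: input, d: domain label (0.0 or 1.0)
    w_f: encoder weight (hypothetical), w_d: domain classifier weight (hypothetical)
    lam: scale applied to the reversed gradient
    """
    f = w_f * x                     # encoder: f = G_f(x)
    z = w_d * f                     # reversal layer is the identity going forward
    p = sigmoid(z)                  # predicted probability that d = 1
    loss = -(d * math.log(p) + (1.0 - d) * math.log(1.0 - p))

    dz = p - d                      # d(loss)/dz for sigmoid + cross-entropy
    grad_w_d = dz * f               # domain classifier: ordinary gradient
    grad_f = dz * w_d               # gradient arriving at the reversal layer
    grad_w_f = (-lam * grad_f) * x  # reversal layer: flip the sign, scale by lam
    return loss, grad_w_d, grad_w_f


if __name__ == "__main__":
    loss, g_d, g_f = domain_branch_grads(x=0.7, d=1.0, w_f=0.5, w_d=-0.3, lam=1.0)
    print(loss, g_d, g_f)
```

With plain gradient descent on both weights, `w_d` moves to reduce the domain loss while `w_f` moves to increase it, which is the adversarial behaviour described above.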
There are two major hyperparameters of the method. The first is the number of dimensions at the bottleneck, although this is tied to your network architecture. The second is a scalar multiplier on the reversed gradient, which lets you increase or decrease its effect on the embedding.
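The scalar on the gradient does not have to be constant. As a sketch, here is a schedule that ramps it smoothly from 0 to 1 over training, which I believe matches the one used in the paper; treat the function name `grl_lambda`, the `progress` argument (fraction of training completed, in [0, 1]) and the default `gamma = 10` as assumptions.

```python
import math


def grl_lambda(progress, gamma=10.0):
    """Gradient-reversal scale as a function of training progress in [0, 1].

    Starts at 0 (the domain classifier has no effect on the encoder early on)
    and saturates close to 1 towards the end of training.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


if __name__ == "__main__":
    for step in range(0, 11, 2):
        p = step / 10.0
        print(p, round(grl_lambda(p), 4))
```

Starting near zero keeps the noisy early gradients of the untrained domain classifier from disturbing the encoder.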
I like the interpretation of minimising the classification loss under the constraint that the domain-conditional distributions of the internal representation (the learned features, marginalised over all $x$ from each domain) match each other. This could, though, be better formulated as a soft constraint (the optimisation problem devised in the paper):
$$ \min_{\theta_f,\theta_y} -\mathbf{E}_{x,y} [\log p_{\theta_f,\theta_y}(y|x)] + \mathcal{D}(p_{\theta_f}(f|d=0)\,||\,p_{\theta_f}(f|d=1)) $$
where the first term is the standard probabilistic loss, regularised by the distance between the internal distributions. Since the domain label and the image datapoint come in pairs, we can always marginalise out the datapoint and obtain $p(f|d)=\mathbf{E}_{p(x|d)}[p(f|x)]$. In our case, $p(f|x)$ is deterministic. The original authors use an "adversarial"-like methodology that introduces a discriminator for domain classification, where a possible choice of the distance measure $\mathcal{D}$ is the Jensen-Shannon divergence. The adversarial training makes it possible to train the feature extractor like a generator, matching the conditionals $p(f|d=0)$ and $p(f|d=1)$ through sampling.
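To give a feel for the distance term, here is a small sketch that computes the Jensen-Shannon divergence between two discrete distributions, for example histograms of a quantised feature under each domain. The discretisation and the function names `kl` and `js` are assumptions for illustration; the adversarial method never computes this quantity explicitly.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    source_hist = [0.7, 0.2, 0.1]   # stand-in for p(f | d=0), made-up numbers
    target_hist = [0.1, 0.2, 0.7]   # stand-in for p(f | d=1), made-up numbers
    print(js(source_hist, target_hist))
    print(js(source_hist, source_hist))
```

The divergence is zero exactly when the two feature distributions coincide and is bounded above by $\log 2$, reached when they do not overlap at all.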
_Objective:_ Build a network easily trainable by back-propagation to perform unsupervised domain adaptation while at the same time learning a good embedding for both source and target domains.
_Dataset:_ [SVHN](http://ufldl.stanford.edu/housenumbers/), [MNIST](http://yann.lecun.com/exdb/mnist/), [USPS](https://www.otexts.org/1577), [CIFAR](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [STL](https://cs.stanford.edu/%7Eacoates/stl10/).
#### Architecture:
Very similar to RevGrad, with some differences.
It is essentially a shared encoder followed by two heads: a classifier and a reconstructor.
[![screen shot 2017-05-22 at 6 11 22 pm](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)
The two losses are:
* the usual cross-entropy with softmax for the classifier
* the pixel-wise squared loss for reconstruction
These are then combined using a trade-off hyperparameter between classification and reconstruction.
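As a rough sketch of how the two losses could be combined, the snippet below uses a convex combination with weight `lam`. That exact form, the default value of `lam` and all function names are assumptions for illustration; the summary only says that a trade-off hyperparameter is used.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def squared_error(x, x_hat):
    """Pixel-wise squared reconstruction loss, summed over pixels."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def combined_loss(logits, label, x_target, x_recon, lam=0.5):
    """Trade-off between classification (weight lam) and reconstruction (1 - lam)."""
    return (lam * softmax_cross_entropy(logits, label)
            + (1.0 - lam) * squared_error(x_target, x_recon))


if __name__ == "__main__":
    print(combined_loss([2.0, 0.5, -1.0], 0, [0.0, 1.0], [0.1, 0.8], lam=0.7))
```

Setting `lam` close to 1 recovers a plain classifier, while values close to 0 turn the network into an autoencoder.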
They also use data augmentation to generate additional training data during the supervised training, using only geometrical deformations: translation, rotation, skewing, and scaling.
In addition, they use denoising: the network reconstructs clean inputs from their noisy counterparts (zero-masked noise and Gaussian noise).
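The two corruption types used for denoising can be sketched as follows, treating an image as a flat list of pixel values. The function names, the `drop_prob` and `sigma` parameters and their values are assumptions for illustration, not settings taken from the paper.

```python
import random


def zero_mask(x, drop_prob, rng):
    """Zero-masked noise: set each pixel to 0 independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def gaussian_noise(x, sigma, rng):
    """Additive Gaussian noise with zero mean and standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]


if __name__ == "__main__":
    rng = random.Random(0)                 # seeded for reproducibility
    clean = [0.2, 0.9, 0.5, 0.0, 1.0]      # made-up pixel values
    print(zero_mask(clean, drop_prob=0.3, rng=rng))
    print(gaussian_noise(clean, sigma=0.1, rng=rng))
```

During training, the corrupted version is fed to the encoder and the reconstruction loss is computed against the clean input.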
#### Results:
Outperformed the state of the art on most tasks at the time of publication; it has since been outperformed on most tasks by Generate To Adapt.