#### Problem addressed:
Fully visible Bayesian network learning
This work extends the original NADE paper. As opposed to using a single fixed, randomly chosen variable ordering for the fully visible Bayesian network (FVBN), they train over all factorially many orderings by optimizing a stochastic version of the objective, which is an unbiased estimator of the loss over orderings. The resulting model supports any type of conditional inference easily; in addition, since it is trained on all orderings, ensembles of NADE models come essentially for free. Training proceeds by masking out the variables one wants to predict and maximizing the likelihood of those missing variables given the observed ones, which makes the model very similar to a denoising autoencoder with Bernoulli-type noise on the input. One drawback of this masking is that the model cannot distinguish a masked-out variable from a variable whose value is 0. To overcome this, they supply the mask as an additional input to the network and show that this is an important ingredient for the model to work.
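The masked training step above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the network sizes, weight names, and the loss rescaling are toy assumptions; the key points are that a random subset of variables is hidden, the mask is concatenated to the input, and only the hidden variables are scored.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 16                          # input dimension, hidden units (toy sizes)
W = rng.normal(0, 0.1, (H, 2 * D))    # encoder sees [masked x, mask]
b = np.zeros(H)
V = rng.normal(0, 0.1, (D, H))        # decoder predicts Bernoulli means
c = np.zeros(D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_nll(x):
    """One stochastic evaluation of the order-agnostic objective:
    hide a random subset of variables and score the model's
    predictions for the hidden ones. Feeding the mask as an extra
    input lets the network tell a masked variable from a true 0."""
    d = rng.integers(1, D)              # how many variables stay observed
    observed = rng.permutation(D)[:d]
    mask = np.zeros(D)
    mask[observed] = 1.0
    h = sigmoid(W @ np.concatenate([x * mask, mask]) + b)
    p = sigmoid(V @ h + c)
    hidden = mask == 0
    # mean NLL over the masked-out variables (rescaling simplified here)
    return -np.mean(x[hidden] * np.log(p[hidden])
                    + (1 - x[hidden]) * np.log(1 - p[hidden]))

x = rng.integers(0, 2, D).astype(float)
loss = masked_nll(x)
```

Averaging this loss over many sampled masks approximates training on all orderings at once.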
Proposed order-agnostic NADE, which overcomes several drawbacks of the original NADE.
Inference at test time is somewhat expensive.
UCI, binary MNIST
#### Additional remarks:
The first author provides an implementation on his website.
TLDR; The authors present Paragraph Vector, which learns fixed-length, semantically meaningful vector representations for text of any length (sentences, paragraphs, documents, etc.). The algorithm works by training a word vector model with an additional paragraph embedding vector as an input. This paragraph embedding is fixed for each paragraph, but varies across paragraphs. Similar to word2vec, PV comes in two flavors:
- A Distributed Memory Model (PV-DM) that predicts the next word based on the paragraph and preceding words
- A Distributed Bag of Words model (PV-DBOW) that predicts context words for a given paragraph
A notable property of PV is that inference (embedding a new paragraph) requires training a new vector by gradient descent, which can be slow. The learned embeddings can be used as input to other models. In their experiments the authors train both variants and concatenate the resulting vectors. The authors evaluate PV on classification and information-retrieval tasks and achieve new state-of-the-art results.
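The PV-DM forward pass can be sketched as follows. This is a toy NumPy sketch, not the paper's implementation: vocabulary size, embedding dimension, and variable names are assumptions, and the real model uses hierarchical softmax rather than the full softmax shown here. It illustrates the core idea that the paragraph vector is concatenated with the context word vectors before predicting the next word.

```python
import numpy as np

rng = np.random.default_rng(1)

V_words, n_para, dim, window = 100, 10, 8, 3   # toy sizes (assumed)
word_emb = rng.normal(0, 0.1, (V_words, dim))
para_emb = rng.normal(0, 0.1, (n_para, dim))
# output layer over the vocabulary; its input is the concatenation of
# the paragraph vector and the `window` preceding word vectors
U = rng.normal(0, 0.1, (V_words, dim * (window + 1)))

def pv_dm_probs(para_id, context_word_ids):
    """PV-DM forward pass: concatenate (not average) the paragraph
    vector with the context word vectors, then softmax over the
    vocabulary to predict the next word. Concatenation preserves
    word-order information."""
    h = np.concatenate([para_emb[para_id]] +
                       [word_emb[w] for w in context_word_ids])
    logits = U @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = pv_dm_probs(0, [3, 17, 42])
```

At inference time, `word_emb` and `U` would be frozen and a fresh row of `para_emb` optimized by gradient descent for the new paragraph, which is why embedding unseen text is slow.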
#### Data Sets / Results
Stanford Sentiment Treebank Polar error: 12.2%
Stanford Sentiment Treebank Fine-Grained error: 51.3%
IMDB Polar error: 7.42%
Query-based search result retrieval (internal) error: 3.82%
#### Key Points
- Authors use 400-dimensional paragraph and word embeddings. The window size is a hyperparameter chosen on the validation set; values from 5-12 seem to work well. On IMDB, varying the window size changed the error by ~0.7%.
- PV-DM performs well on its own, but concatenating PV-DM and PV-DBOW consistently leads to (small) improvements.
- When training the PV-DM model, use concatenation instead of averaging to combine word and paragraph vectors (this preserves word-order information).
- Hierarchical Softmax is used to deal with large vocabularies.
- For final classification, authors use LR or MLP, depending on the task (see below)
- IMDB training (25k documents, 230 words average length) takes 30 minutes on a 16-core machine, presumably on CPU.
#### Notes / Questions
- How did the authors choose the final classification model? Did they cross-validate this? The authors mention that the NN performs better than LR on the IMDB data, but they don't show how large the gap is. Does PV perhaps perform significantly worse with a simpler model?
- I wonder if we can train hierarchical representations of words, sentences, paragraphs, and documents, keeping the vectors fixed at each layer, and predict sequences using RNNs.
- I wonder how PV compares to an attention-based RNN autoencoder approach. When training PV you are in a way attending to specific parts of the paragraph to predict the missing parts.
This paper extends supervised embedding models by combining them multiplicatively,
i.e. $f'(x,y) = G(x,y)\,f(x,y)$.
It considers two types of models, dot product in the *embedding* space and kernel density estimation in the *embedding* space, where the kernel is restricted to factorize as
$k((x,y),(x',y')) = k(x-x')\,k(y-y')$.
It proposes an iterative algorithm that alternates between updating the parameters of $f$ and those of $G$.
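The alternating scheme can be sketched as follows. This is a toy NumPy sketch under assumed specifics (both $f$ and $G$ as dot-product embedding models, a squared-error loss against synthetic target scores, plain SGD), not the paper's algorithm: it only illustrates the alternation of holding $G$ fixed while stepping $f$'s parameters and vice versa.

```python
import numpy as np

rng = np.random.default_rng(2)

n_x, n_y, dim = 20, 20, 5
# two dot-product embedding models: f(x,y) = U[x].V[y], G(x,y) = A[x].B[y]
U = rng.normal(0, 0.5, (n_x, dim)); V = rng.normal(0, 0.5, (n_y, dim))
A = rng.normal(0, 0.5, (n_x, dim)); B = rng.normal(0, 0.5, (n_y, dim))

# toy supervision: target scores for observed (x, y) pairs (assumed setup)
pairs = [(rng.integers(n_x), rng.integers(n_y)) for _ in range(200)]
targets = {p: rng.normal() for p in pairs}

def score(x, y):
    return (A[x] @ B[y]) * (U[x] @ V[y])     # f'(x,y) = G(x,y) f(x,y)

def loss():
    return np.mean([(score(x, y) - targets[(x, y)]) ** 2 for x, y in pairs])

lr = 0.02
initial = loss()
for it in range(50):
    update_f = it % 2 == 0      # alternate: even passes step f, odd passes step G
    for x, y in pairs:
        err = score(x, y) - targets[(x, y)]
        if update_f:            # hold G fixed, descend on f's embeddings
            g = A[x] @ B[y]
            gu, gv = err * g * V[y], err * g * U[x]
            U[x] -= lr * gu; V[y] -= lr * gv
        else:                   # hold f fixed, descend on G's embeddings
            f = U[x] @ V[y]
            ga, gb = err * f * B[y], err * f * A[x]
            A[x] -= lr * ga; B[y] -= lr * gb

final = loss()
```

Freezing one factor turns each half-step into an ordinary supervised-embedding update, which is what makes the alternation tractable.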