ShortScience.org - Making Science Accessible!


Distributed representations of words and phrases and their compositionality
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff
Advances in neural information processing systems - 2013 via Local Bibsonomy
Keywords: thema:deepwalk, language, modelling, representation

Summary by NIPS Conference Reviews 10 years ago

The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al. (citation [7] in the paper), which learns linear word embeddings that are particularly useful for analogical reasoning tasks. The extensions proposed (namely, negative sampling and subsampling of high-frequency words) enable extremely fast training of the model on large-scale datasets. This also results in significantly improved performance compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase-level embeddings by slightly tweaking the original training algorithm.

This paper proposes three improvements to the skip-gram model, which learns embeddings for words. The first is subsampling of frequent words, the second is a simplified version of noise contrastive estimation (NCE), and the third is a method to learn embeddings for idiomatic phrases. In all three cases the improvements are somewhat ad hoc. In practice, both subsampling and negative sampling substantially improve generalization on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.

The article does not explain why negative sampling produces better results than NCE, which it is supposed to loosely approximate. Nor does it explain why, besides the obvious generalization gain, negative sampling should be preferred to NCE, since they achieve similar speeds.
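
As a rough illustration of the two training tricks discussed above, the sketch below computes the paper's discard probability for frequent words, $1 - \sqrt{t / f(w)}$, and the negative-sampling loss for a single (center, context) pair. The function names, the toy vectors, and the threshold value are illustrative assumptions, not the authors' code.

```python
import math

def discard_probability(word_freq, t=1e-5):
    """Probability of dropping a word whose relative corpus frequency is word_freq."""
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, context, negatives):
    """Negated objective: -(log s(c.o) + sum_k log s(-c.n_k)), s = sigmoid."""
    loss = -log_sigmoid(dot(center, context))
    for neg in negatives:
        loss -= log_sigmoid(-dot(center, neg))
    return loss

# Toy 3-dimensional vectors (made up for illustration).
center = [0.5, -0.2, 0.1]
context = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.1, -0.4]]

print(discard_probability(0.05))   # very frequent word: dropped most of the time
print(discard_probability(1e-6))   # rare word: never dropped
print(negative_sampling_loss(center, context, negatives))
```

In the paper the negatives are drawn from the unigram distribution raised to the 3/4 power; here they are simply passed in.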


Deep Reinforcement Learning for Dialogue Generation
Li, Jiwei and Monroe, Will and Ritter, Alan and Jurafsky, Dan and Galley, Michel and Gao, Jianfeng
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Abhishek Das 8 years ago

This paper combines several existing ideas for building neural conversational agents, with the aim of discouraging generic and repetitive responses.

Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering: measured by how likely a response is to be followed by one of a hand-picked list of dull replies (a lower log-likelihood of the dull replies gives a higher reward).
2. Information flow: consecutive responses from the same agent (person) should carry different information, measured as the negative log of the cosine similarity between them (lower similarity gives a higher reward).
3. Semantic coherence: mutual information between source and target (the response should make sense with respect to the query), $P(a|q) + P(q|a)$, where $a$ is the answer and $q$ is the question.
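
The three rewards can be sketched as plain functions. This assumes the model's log-likelihoods and turn representations are already computed and passed in as numbers and vectors; the function names and the mixing weights are hypothetical. Semantic coherence is written with log-probabilities here, which is an assumption about how the expression above is evaluated in practice.

```python
import math

def ease_of_answering(dull_log_likelihoods):
    """Negative mean log-likelihood of the dull replies given the response.

    The less likely the dull replies, the higher the reward.
    """
    return -sum(dull_log_likelihoods) / len(dull_log_likelihoods)

def information_flow(h_prev, h_curr):
    """Negative log cosine similarity between two consecutive turns of one agent."""
    dot = sum(a * b for a, b in zip(h_prev, h_curr))
    norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_curr))
    cos = max(dot / norm, 1e-8)  # clamp so the log is defined (an added safeguard)
    return -math.log(cos)

def semantic_coherence(log_p_a_given_q, log_p_q_given_a):
    """Forward plus backward log-likelihood of the answer/question pair."""
    return log_p_a_given_q + log_p_q_given_a

def total_reward(r1, r2, r3, weights=(0.25, 0.25, 0.5)):
    # The weights are illustrative defaults, not taken from the summary.
    return weights[0] * r1 + weights[1] * r2 + weights[2] * r3

r1 = ease_of_answering([-6.0, -4.0])            # 5.0
r2 = information_flow([1.0, 0.0], [1.0, 1.0])   # -log(cos 45 degrees)
r3 = semantic_coherence(-2.0, -3.0)             # -5.0
print(total_reward(r1, r2, r3))
```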

The model is pre-trained with the usual supervised objective function, taking the concatenation of the two previous utterances as the source. Then there are two stages of policy gradient training, first with just a mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model. The rewards for these samples are averaged, and gradients are computed with MLE for the first L tokens of the response and with policy gradients for the remaining T-L tokens. L is gradually annealed to zero, moving towards just the long-term reward.
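
A minimal sketch of the mixed objective described above, assuming per-token log-probabilities are already available and that the reference and sampled responses have the same length T. The function names, the scalar baseline, and the step-wise annealing schedule are illustrative assumptions; the summary only states that L is gradually annealed to zero.

```python
def mixed_loss(ref_logps, sampled_logps, reward, baseline, L):
    """Surrogate loss for one response of length T.

    ref_logps[t]     : log-prob of the reference token at step t (MLE part)
    sampled_logps[t] : log-prob of the sampled token at step t (policy-gradient part)
    """
    mle_term = -sum(ref_logps[:L])
    # REINFORCE surrogate: its gradient is -(reward - baseline) * grad log pi.
    pg_term = -(reward - baseline) * sum(sampled_logps[L:])
    return mle_term + pg_term

def annealed_L(initial_L, step, steps_per_decrement):
    """Shrink L by one every `steps_per_decrement` updates, never below zero."""
    return max(0, initial_L - step // steps_per_decrement)

ref = [-0.1, -0.2, -0.3, -0.4]
sampled = [-0.5, -0.6, -0.7, -0.8]
for step in (0, 100, 200, 400):
    L = annealed_L(initial_L=4, step=step, steps_per_decrement=100)
    print(step, L, mixed_loss(ref, sampled, reward=1.0, baseline=0.2, L=L))
```

With `L` equal to the response length the loss is pure MLE, and with `L = 0` it is a pure policy-gradient surrogate.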

Evaluation is based on length of dialogue, diversity (distinct unigrams and bigrams) and human studies on:

1. Which of two outputs has better quality (single turn)
2. Which of two outputs is easier to respond to, and
3. Which of two conversations has better quality (multi turn).
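
The diversity measure mentioned above can be sketched as follows, assuming whitespace tokenization and normalization by the total number of generated tokens. The function name and the example responses are made up for illustration.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of generated tokens."""
    ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0

generic = ["i don't know", "i don't know", "i don't know"]
varied = ["where are you from", "i grew up in boston", "what is it like there"]
print(distinct_n(generic, 1), distinct_n(varied, 1))
print(distinct_n(generic, 2), distinct_n(varied, 2))
```

A model that keeps repeating the same reply scores low on both counts, which is the behaviour the metric is meant to expose.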

## Strengths

- Interesting results
- Avoids generic responses
- 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties and using those to fine-tune a pre-trained network using policy gradients is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.


Are Disentangled Representations Helpful for Abstract Visual Reasoning?
van Steenkiste, Sjoerd and Locatello, Francesco and Schmidhuber, Jürgen and Bachem, Olivier
arXiv e-Print archive - 2019 via Local Bibsonomy
Keywords: dblp

Summary by CodyWild 6 years ago

Arguably, the central achievement of the deep learning era is multi-layer neural networks' ability to learn useful intermediate feature representations using a supervised learning signal. In a supervised task, it's easy to define what makes a feature representation useful: the fact that it's easier for a subsequent layer to use to make the final class prediction. When we want to learn features in an unsupervised way, things get a bit trickier. There's the obvious problem of what kinds of problem structures and architectures work to extract representations at all. But there's also a deeper problem: when we ask for a good feature representation, outside of the context of any given task, what are we asking for? Are there some inherent aspects of a representation that can be analyzed without ground truth labels to tell you whether the representations you've learned are good or not?

The notion of "disentangled" features is one answer to that question: it suggests that a representation is good when the underlying "factors of variation" (things that are independently variable in the underlying generative process of the data) are captured in independent dimensions of the feature representation. That is, if your representation is a ten-dimensional vector, and it just so happens that there are ten independent factors along which datapoints differ (color, shape, rotation, etc), you'd ideally want each dimension to correspond to exactly one factor.

This criterion has an elegance to it, and it has previously been shown useful in predicting when the representations learned by a model will be useful in predicting the values of the factors of variation. This paper goes one step further, and tests the value of disentangled representations for solving a visual reasoning task that involves the factors of variation, but doesn't just involve predicting them. In particular, the authors use learned representations to solve a task patterned on a human IQ test, where some factors stay fixed across a row in a grid, and some vary, and the model needs to generate the image that "fits the pattern".

https://i.imgur.com/O1aZzcN.png

To test the value of disentanglement, they looked at a few canonical metrics of disentanglement, including scores that represent "how many factors are captured in each dimension" and "how many dimensions is a factor spread across". They measured the correlation of these metrics with task performance, and compared that with the correlation between simple autoencoder reconstruction error and performance.
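
To make the two quoted notions concrete, here is a sketch in the spirit of the DCI disentanglement and completeness scores, computed from a matrix of importances (how much each latent dimension matters for predicting each factor). The importance matrices are made up, the function names are hypothetical, and whether this matches the exact metric implementations used in the paper is an assumption. For simplicity it takes a plain mean over dimensions or factors.

```python
import math

def _one_minus_entropy(weights):
    """1 - normalized entropy of a non-negative weight vector (1 = concentrated)."""
    total = sum(weights)
    if total == 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(weights))

def disentanglement(importance):
    """Mean over dimensions: does each dimension capture few factors?"""
    return sum(_one_minus_entropy(row) for row in importance) / len(importance)

def completeness(importance):
    """Mean over factors: is each factor spread over few dimensions?"""
    columns = list(zip(*importance))
    return sum(_one_minus_entropy(col) for col in columns) / len(columns)

# importance[i][j]: importance of latent dimension i for factor j (toy values)
one_to_one = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
entangled = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(disentanglement(one_to_one), completeness(one_to_one))
print(disentanglement(entangled), completeness(entangled))
```

Scores like these, computed per trained representation, are what would then be correlated with downstream task performance.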

They found that at early stages of training on top of the representations, the disentanglement metrics were more predictive of performance than reconstruction accuracy. This distinction went away as the model learning on top of the representations had more time to train. It makes reasonable sense that you'd mostly see value for disentangled features in a low-data regime, since after long enough the fine-tuning network can learn its own features regardless. But, this paper does appear to contribute to evidence that disentangled features are predictive of task performance, at least when that task directly involves manipulation of specific, known, underlying factors of variation.


Learning to Diagnose with LSTM Recurrent Neural Networks
Lipton, Zachary Chase and Kale, David C. and Elkan, Charles and Wetzel, Randall C.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Tiago Vinhoza 8 years ago

#### Goal
+ Predict 128 diagnoses for intensive pediatric care patients.

#### Dataset:

+ Children's Hospital LA.
+ Episode is a multivariate time series that describes the stay of one patient in the intensive care unit.

Dataset properties | Value
---------|----------
Number of episodes | 10,401
Duration of episodes | From 12h to several months
Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output.

+ Resampling and missing values:
    + Irregularly sampled time series are resampled to an hourly rate.
    + The mean measurement within each hour window is taken.
    + Forward- and back-filling are used to fill gaps created by the resampling.
    + When a variable's time series is missing entirely: imputation with a clinically *normal* value defined by domain experts.
    + This paper is followed by [Modeling Missing Data in Clinical Time Series with RNNs](http://www.shortscience.org/paper?bibtexKey=journals/corr/LiptonKW16) from the same research group.
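
The preprocessing steps can be sketched in plain Python for a single variable. The function name, the `(hours_since_admission, value)` input format, and the example "normal" value are assumptions made for illustration; the paper's actual pipeline is not shown in this summary.

```python
def resample_hourly(measurements, n_hours, normal_value):
    """Resample irregular (time_in_hours, value) pairs to one value per hour."""
    # 1. Mean of all measurements falling inside each one-hour window.
    buckets = [[] for _ in range(n_hours)]
    for t, value in measurements:
        if 0 <= t < n_hours:
            buckets[int(t)].append(value)
    series = [sum(b) / len(b) if b else None for b in buckets]

    # 2. Variable never measured: impute the clinically normal value everywhere.
    if all(v is None for v in series):
        return [normal_value] * n_hours

    # 3. Forward-fill, then back-fill the gap before the first measurement.
    for i in range(1, n_hours):
        if series[i] is None:
            series[i] = series[i - 1]
    for i in range(n_hours - 2, -1, -1):
        if series[i] is None:
            series[i] = series[i + 1]
    return series

heart_rate = [(1.2, 120.0), (1.7, 124.0), (4.5, 110.0)]  # made-up readings
print(resample_hourly(heart_rate, n_hours=6, normal_value=100.0))
print(resample_hourly([], n_hours=3, normal_value=100.0))
```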

+ Labels:
    + Each episode is associated with 0 or more diagnoses (in-house taxonomy, ICD-9 based).
    + Dataset contains 429 diagnoses. The paper focuses on the 128 most frequent diagnoses that appear 50 or more times in the dataset.

#### Architecture:

+ LSTM with Target Replication:

![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_target.png?raw=true "Target Replication")

+ Loss function:
    + For the model with target replication, an output y is generated at every sequence step. The loss function is then a convex combination of the final loss (log-loss in this paper) and the average of the losses over all steps, where T is the number of sequence steps and alpha is a hyperparameter.

![Loss function](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_loss.png?raw=true "Loss function")
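
A small sketch of the target-replication loss described above, in plain Python for one episode. It assumes the label vector is the same at every step and that `alpha` weights the per-step average while `1 - alpha` weights the final step; that assignment, the function names, and the toy numbers are assumptions made for illustration.

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    """Multilabel log loss averaged over labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def target_replication_loss(y_true, step_preds, alpha):
    """alpha * mean loss over all T steps + (1 - alpha) * loss at the final step."""
    step_losses = [log_loss(y_true, pred) for pred in step_preds]
    mean_loss = sum(step_losses) / len(step_losses)
    return alpha * mean_loss + (1.0 - alpha) * step_losses[-1]

y = [1, 0, 1]  # three diagnoses, two present
preds = [
    [0.5, 0.5, 0.5],  # early in the stay: uninformed
    [0.7, 0.3, 0.6],
    [0.9, 0.1, 0.8],  # final step
]
print(target_replication_loss(y, preds, alpha=0.0))  # final-step loss only
print(target_replication_loss(y, preds, alpha=0.5))
```

Setting `alpha` to zero recovers ordinary training on the final output only.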

#### Experiments and Results:

*Methodology*:
+ Split dataset: 80% training, 10% validation, 10% test.
+ LSTM trained for 100 epochs via stochastic gradient descent (with momentum).
+ L2 regularization: 1e-6, chosen on the validation set.

+ LSTM: 2 hidden layers with 64 cells or 128 cells (and 50% dropout).
+ Multiple combinations: target replication / auxiliary target variables (trained using the other 301 diagnoses and other clinical information as targets). Inferences are made only for the 128 major diagnoses.

+ Baselines for comparison:
    + Logistic Regression - L2 regularized
    + MLP with 3 hidden layers - ReLU - dropout 50%.
    + Baselines tested on the raw time series and on a feature-engineered version made by domain experts.

*Metrics*:
+ Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes.
+ Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes.
+ Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model.
    + The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient.
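
The difference between micro and macro averaging, and the precision-at-10 ceiling, can be illustrated for F1 with a few lines of Python. The counts and scores below are made up, and the function names are hypothetical.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(counts):
    """Pool TP/FP/FN over all classes, then compute one F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute F1 per class, then take the unweighted mean."""
    return sum(f1(*c) for c in counts) / len(counts)

def precision_at_k(scores, true_labels, k=10):
    """Fraction of the k highest-scored labels that are correct."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if i in true_labels) / k

# (TP, FP, FN) per class: one frequent class, one rare class
counts = [(90, 10, 10), (1, 4, 4)]
print(micro_f1(counts), macro_f1(counts))

# A patient with 2 true diagnoses can score at most 2/10 at k=10.
scores = [0.9, 0.8] + [0.1] * 18
print(precision_at_k(scores, true_labels={0, 1}, k=10))
```

Micro averaging is dominated by the frequent class, while macro averaging is pulled down by the rare one. The same counting argument, with 2.281 true diagnoses per patient on average, gives the 0.2281 ceiling quoted above.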

*Results*:

![All Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_allresults.png?raw=true "Performance metrics across all labels")

*Results for selected diagnoses*:

![Results for Selected Diseases](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_selected.png?raw=true "Performance for selected diagnoses")

#### Discussion:

+ Auxiliary outputs improve performance at the expense of increased training time. Some of the remaining 301 labels are so unbalanced that the model spends an entire epoch only to learn that one of the target variables can take values other than 0.

+ Real-Time Predictions: In the future, the authors expect that the proposed solution could be used to make continuously updated real-time alerts and diagnoses.


Variational Dropout Sparsifies Deep Neural Networks
Dmitry Molchanov and Arsenii Ashukha and Dmitry Vetrov
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG

Summary by Gavin Gray 9 years ago

The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma] which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity you can say that the weight is effectively being removed, and in their implementation this is what they do.

There are some important details differing from the [original algorithm][kingma] on per-weight variational dropout. For both methods we have the following initialization for each dense layer:

```
theta = initialize weight matrix with shape (number of input units, number of hidden units)
log_alpha = initialize zero matrix with shape (number of input units, number of hidden units)
b = biases initialized to zero with length the number of hidden units
```

Where `log_alpha` is going to parameterise the variational posterior variance.

In the original paper the algorithm was the following:

```
mean = dot(input, theta) + b # standard dense layer
# marginal variance over activations (eq. 10 in [original paper][kingma])
variance = dot(input^2, theta^2 * exp(log_alpha)) 
# sample from marginal distribution by scaling Normal 
activations = mean + sqrt(variance)*unit_normal(number of output units) 
```

The final step is a standard [reparameterization trick][shakir], but since it is a marginal distribution this is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]).

The sparsifying algorithm starts with an alternative parameterisation for `log_alpha`:

```
log_sigma2 = matrix filled with negative constant (default -8) with size (number of input units, number of hidden units)
log_alpha = log_sigma2 - log(theta^2)
log_alpha = log_alpha clipped to the range [-8, 8]
```

The authors discuss this in section 4.1: the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight, with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha`, the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counter-intuitive trick, so much so I can't quite believe it works.

They then define a mask removing contributions to units where the noise variance has gone too high:

```
clip_mask = matrix with the shape of log_alpha, equal to 1 where log_alpha is greater than thresh (default 3) and 0 elsewhere
```

The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper.
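
Putting the pieces together, here is a pure-Python sketch of one forward pass of the sparsifying layer for a single input vector. It follows the pseudocode above; the function name, the toy sizes, and the use of `random.gauss` for the unit Normal are assumptions, and a real implementation would use a tensor library and learn `theta` and `log_sigma2` by gradient descent.

```python
import math
import random

def sparse_vd_forward(x, theta, log_sigma2, b, thresh=3.0, rng=random):
    """One stochastic forward pass of a dense layer with per-weight noise."""
    n_in, n_out = len(theta), len(b)
    out = []
    for j in range(n_out):
        mean, variance = b[j], 0.0
        for i in range(n_in):
            w = theta[i][j]
            if w == 0.0:
                continue  # weight already removed
            # log_alpha = log_sigma2 - log(theta^2), clipped to [-8, 8]
            log_alpha = log_sigma2[i][j] - math.log(w * w)
            log_alpha = min(8.0, max(-8.0, log_alpha))
            if log_alpha > thresh:
                continue  # clip mask: treat the weight as zero
            mean += x[i] * w
            variance += x[i] ** 2 * w ** 2 * math.exp(log_alpha)
        # local reparameterization: sample the activation, not the weights
        out.append(mean + math.sqrt(variance) * rng.gauss(0.0, 1.0))
    return out

x = [1.0, 2.0]
theta = [[0.5, 1e-4], [-0.3, 0.8]]          # theta[0][1] is tiny
log_sigma2 = [[-8.0, -8.0], [-8.0, -8.0]]   # initial value from the text
b = [0.0, 0.0]
print(sparse_vd_forward(x, theta, log_sigma2, b, rng=random.Random(0)))
```

In this toy example the tiny weight `theta[0][1]` ends up with a clipped `log_alpha` of 8, which is above the threshold, so it contributes nothing to the second output unit.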

The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point.
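
For reference, the sparsifying approximation can be written as a scalar function of `log_alpha`. The constants below are believed to be the fitted values reported in the paper and should be checked against it before being relied on; the function names are hypothetical.

```python
import math

K1, K2, K3 = 0.63576, 1.87320, 1.48695  # fitted constants (verify against the paper)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def approx_kl(log_alpha):
    """Approximate KL term for one weight; tends to 0 as log_alpha grows."""
    neg_kl = (K1 * sigmoid(K2 + K3 * log_alpha)
              - 0.5 * math.log1p(math.exp(-log_alpha))
              - K1)
    return -neg_kl

for log_alpha in (-8.0, 0.0, 3.0, 8.0):
    print(log_alpha, approx_kl(log_alpha))
```

The printed values shrink towards zero as `log_alpha` increases, so pushing a weight's noise variance up is cheap under this penalty, which is what allows weights to be pruned.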

[kingma]: https://arxiv.org/abs/1506.02557
[shakir]: http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
[fast]: http://proceedings.mlr.press/v28/wang13a.html