ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia
Daniel Hewlett and Alexandre Lacoste and Llion Jones and Illia Polosukhin and Andrew Fandrianto and Jay Han and Matthew Kelcey and David Berthelot
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* Large scale natural language understanding task - predict text values given a knowledge base.
* Accompanied by a large dataset generated using Wikipedia
* [Link to the paper](http://www.aclweb.org/anthology/P/P16/P16-1145.pdf)

#### Dataset

* WikiReading dataset built using Wikidata and Wikipedia.
* Wikidata consists of statements of the form (property, value) about different items
* 80M statements, 16M items and 884 properties.
* These statements are grouped by items to get (item, property, answer) tuples where the answer is a set of values.
* Items are further replaced by their Wikipedia documents to generate 18.58M statements of the form (document, property, answer).
* Task is to predict answer given document and property.
* Properties are divided into 2 classes:
* **Categorical properties** - properties with a small number of possible answers. Eg gender.
* **Relational properties** - properties with unique answers. Eg date of birth.
* This classification is done on the basis of the entropy of answer distribution.
* Properties with entropy less than 0.7 are classified as categorical properties.
* Answer distribution has a small number of very high-frequency answers (head) and a large number of answers with very small frequency (tail).
* 30% of the answers do not appear in the training set and must be inferred from the document.

#### Models

##### Answer Classification

* Consider WikiReading as classification task and treat each answer as a class label.

###### Baseline

* Linear model over Bag of Words (BoW) features.
* Two BoW vectors computed - one for the document and other for the property. These are concatenated into a single feature vector.

###### Neural Networks Method

* Encode property and document into a joint representation which is fed into a softmax layer.

* **Average Embeddings BoW**

* Average the BoW embeddings for documents and property and concatenate to get joint representation.

* **Paragraph Vectors**

* As a variant of the previous method, encode document as a paragraph vector.

* **LSTM Reader**

* LSTM reads the property and document sequence, word-by-word, and uses the final state as joint representation.

* **Attentive Reader**

* Use attention mechanism to focus on relevant parts of the document for a given property.

* **Memory Networks**

* Maps a property p and list of sentences x1, x2, ...xn in a joint representation by attention over the sentences in the document.

##### Answer Extraction

* For relational properties, it makes more sense to model the problem as information extraction than classification.

* **RNNLabeler**

* Use an RNN to read the sequence of words and estimate if a given word is part of the answer.

* **Basic SeqToSeq (Sequence to Sequence)**

* Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words.

* **Placeholder SeqToSeq**

* Extends Basic SeqToSeq to handle OOV (Out of Vocabulary) words by adding placeholders to the vocabulary.
* OOV words in the document and answer are replaced by placeholders so that input and output sentences are a mixture of words and placeholders only.

* **Basic Character SeqToSeq**

* Property encoder RNN reads the property, character-by-character and transforms it into a fixed length vector.
* This becomes the initial hidden state for the second layer of a 2-layer document encoder RNN.
* Final state of this RNN is used by answer decoder RNN to generate answer as a character sequence.

* **Character SeqToSeq with pretraining**

* Train a character-level language model on input character sequence from the training set and use the weights to initiate the first layer of encoder and decoder.

#### Experiments

* Evaluation metric is F1 score (harmonic mean of precision and accuracy).
* All models perform well on categorical properties with neural models outperforming others.
* In the case of relational properties, SeqToSeq models have a clear edge.
* SeqToSeq models also show a great deal of balance between relational and categorical properties.
* Language model pretraining enhances the performance of character SeqToSeq approach.
* Results demonstrate that end-to-end SeqToSeq models are most promising for WikiReading like tasks.

dx.doi.org
sci-hub
scholar.google.com

An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database
Jimmy K. Eng and Ashley L. McCormack and John R. Yates
Journal of the American Society for Mass Spectrometry - 1994 via Local CrossRef
Keywords:

[link] Summary by siegfried gessulat 8 years ago

In proteomics, a popular method to identify peptides is mass spectrometry. An experimental tandem mass spectrum consists of mass peaks that stem from fragmenting a peptide to fragment ions. 
To identify the peptide, that produced the experimental spectrum, all peptides that could produce the seen fragment ions must be analysed. Unfortunately, the space of possible peptides to search against is large.  The paper simplifies this problem by searching against a peptide database.

The approach describes a 4-step method:

- Data reduction: only the 200 most abundant peaks of the experimental spectrum are kept
- Search: peptides with a mass similar to the experimental spectrum's precursor are selected
- Scoring: the score for each selected peptide is based on the number of predicted fragment ions seen in the experimental spectrum.
- Cross-correlation: For the top 500 scoring peptides, theoretical spectra is constructed. These spectra are are evaluated by cross-correlation with the experimental spectrum. The top scoring peptide is considered the identified peptide.

Since a database of known peptides is needed for this approach, it can not be used for de-novo peptide identification, meaning identifying peptides for species where no sequenced genome or proteome is available.

arxiv.org
arxiv-vanity.com
scholar.google.com

On orthogonality and learning recurrent networks with long term dependencies
Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE
more

[link] Summary by Raza Habib 8 years ago

## Headline Summary
The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the trade-off between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality.  They find that a hard orthogonality constraint can be too restrictive and perseveres gradients at the cost of slowing learning.

### Why orthogonal matrices are likely to help with exploding gradients

For a  neural network  with N hidden layers we can write the gradient of the loss wrt to the weight matrix between hidden states $W_i$ as: 

$ \frac{\partial L}{\partial W_i} =  \frac{\partial a_i}{\partial w_i} \prod_j^N (\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i}) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial w_i} \prod_j^N (D_i W_{j+1}) \frac{\partial L}{\partial a_{N+1}}$

where L is the loss function and D is the jacobian of the activation function. It's in this long chain of products that the gradient tends to either vanish or explode. This problem is particularly bad in recurrent nets where the chain is the length of the unrolled computation graph.

The norm of the derivative $\|\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i} \|= \|D_i W_{j+1}\| \leq  \|D_i \| \| W_{j+1}\| \leq \lambda_D \lambda_W$ 

where $\lambda_{D/W}$ is the largest singular value of the given matrix. Its clear that if $\lambda$ is less than or greater than 1, that the norm of the gradient will grow or decay exponentially with length of the chain of products given above.

Constraining the weight matrices to be orthogonal ensures that the singular values are 1 and so perseveres the gradient norm.

### How to enforce the orthogonality constraints

The set of all orthogonal matrices from a continuos manifold known as the stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then to perform gradient optimisation on this manifold. The orthogonality of the matrix can be preserved by a Caley transformation outlined in the paper. 

In order to allow the weight  matrix to deviate slightly from the stiefel manifold, the authors took the SVD of $W = USV^T $ and parameterised the singular values. They used geodesic gradient descent to maintain the orthogonality of $U$ and $V$. The singular values were restricted to between $ 1 ^+_- m$ where  $m$ is a margin by parameterisation with a sigmoid.

### Experiment Summary

They experiment on a number of synthetic and real data-sets with elman and factorial RNNs. The synthetic tasks are the hochrieter memory and addition tasks, classifying sequential mnist and the real task is a language modelling task using the Penn Tree Bank dataset.

In the synthetic memory tasks they found that pure othogonalisation hindered performance but that a small margin away from orthogonality was better than no orthogonal constraint.

On the mnist tasks, the LSTM outperformed all other versions of simple RNNs but the simple RNNs with orthogonal constraints did quite well. Orthonal initialisation outperformed identity and glorot initialisation.

On the language modelling task with short sentences, orthogonal constraints were detrimental but on the longer sentences beneficial. They did not compare to LSTM. They also found that orthogonal initialisation was almost as good as orthogonal constraints during learning suggesting that preserving gradient flow is most important early in learning.

### My Two Cents
This paper explores an interesting approach to learning long term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacities of an LSTM so for the time being it doesn't seem that there are any major practical take aways.

arxiv.org
arxiv-vanity.com
scholar.google.com

Lexical Features in Coreference Resolution: To be Used With Caution
Nafise Sadat Moosavi and Michael Strube
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Tim Miller 7 years ago

Kind of a response/deeper dive into the durret/klein "easy victories" paper. Suggests that a) lexical features they used ("easy victories") are very prone to overfitting. They first show that several state of the art systems that use lexical features, trained on CoNLL data, perform poorly on wikiref, which was annotated using the same guidelines. Meanwhile the stanford sieve system performs about the same on both.
Then they show that a high percentage of gold standard linked headwords in the test set have been seen in the training set, and that a much lower percentage of errors are in the training set, implying that lexical features just allow you to memorize what kinds of things can be linked.
They suggest development of robust features, including using embeddings as lexical features, using lexical representations only for context, and on the evaluation side, using test sets that are different domains than the training set.

dx.doi.org
sci-hub
scholar.google.com

Inferring Road Maps from Global Positioning System Traces
James Biagioni and Jakob Eriksson
Transportation Research Record: Journal of the Transportation Research Board - 2012 via Local CrossRef
Keywords:

[link] Summary by Martin Thoma 7 years ago

[https://www.cs.uic.edu/~jakob/papers/biagioni-trr12.pdf](https://www.cs.uic.edu/~jakob/papers/biagioni-trr12.pdf) is a super nice survey where I almost feel bad to summarize it. Almost.

It is about road map inference in the presence of traces of geo coordinates. So you have many vehicles which log their positions while driving. From this data, you want to infer the latest map.

Problems:
* The map changes: What was a valid map a week ago, might not be anymore (due to construction work, breaking roads, new roads)
* Sensor errors: GPS is not very accurate

## Contributions

* overview over the literature up to 2012 on map generation
* method for the automatic evaluation of generated maps
* an evaluation of three reference algorithms including code
* a [118-h trace data set](https://www.cs.uic.edu/bin/view/Bits/Software) and ground truth map

## How Map Inference Works

* **Preprocess Geo-Traces**: check for unreasonable speed, too extreme acceleration, too abrupt changes
* Inference:
* k-Means: Reduce candidates to centroid. Works on 3D - 2 coordinates and direction
* trace merging: Merge edges directly, without reduction
* kernel density estimation

## Quantitative Evaluation

1. Start at one location in both, the ground truth map and the generated map
2. Follow streets from both and sample points in a fixed range (e.g. each 50cm). Only go in directions that go away from the start
3. Try to match both, ground truth and generated map points. Only match if points are close enough (e.g. a distance of 10cm if this is an acceptable uncertainty). Then use a classification score (e.g. F1)