ShortScience.org - Making Science Accessible!

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

Summary by Joseph Paul Cohen 8 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$, so:

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix in which three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1 \\
4 & 5 & 1 \\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First, let's do this without any non-negativity constraint, using an SVD and truncating the factors by dropping all but the largest singular value:


$$
W = \left[\begin{array}{c}
-0.656 \\
-0.652 \\
-0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.48 & -6.26 & -3.20
\end{array}\right]
$$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c}
0.388 \\
0.386 \\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\
4.33 & 4.08 & 2.09 \\
2.52 & 2.37 & 1.21
\end{array}\right]
$$
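The unconstrained $r = 1$ factorization can be reproduced with a few lines of NumPy. This is a minimal sketch, assuming the singular value is folded into $H$ (the function name `truncated_svd` is made up for this example, and the sign of the factors may differ between SVD implementations):

```python
import numpy as np

# Example ratings matrix from the text (customers x movies).
V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

def truncated_svd(V, r):
    """Return (W, H) with inner dimension r from a truncated SVD of V."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    W = U[:, :r]                  # m x r: leading left singular vectors
    H = s[:r, None] * Vt[:r, :]   # r x n: singular values folded into H (a choice)
    return W, H

W, H = truncated_svd(V, r=1)
error = np.linalg.norm(V - W @ H)  # Frobenius norm of the residual
print(W, H, W @ H, error, sep="\n")
```

Nothing in this construction keeps the entries of $W$ and $H$ non-negative, which is the property the constrained factorization adds.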


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is contrasted with an additive, gradient-descent-based approach in which small corrections are applied iteratively. The multiplicative rule can be recovered from the additive one by setting the learning rate ($\eta$) separately for each element of $H$ to the ratio of that element to the corresponding element of $W^T W H$, i.e. $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$.
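A minimal sketch of the multiplicative updates for the squared-error objective $\|V - WH\|^2$ is shown below. The function name, the iteration count, the random seed, and the small `eps` added to the denominators for numerical safety are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=500, eps=1e-9, seed=0):
    """Factor non-negative V (m x n) into W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random non-negative starting point.
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Each element is rescaled by a non-negative ratio, so W and H
        # stay non-negative without any explicit projection step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H = nmf_multiplicative(V, r=1)
print(W, H, np.linalg.norm(V - W @ H), sep="\n")
```

The scale of $W$ and $H$ is not unique (multiplying $W$ by a constant and dividing $H$ by it gives the same product), so the individual factors may differ from the values quoted above even when the reconstruction matches.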



### Still a draft


WikiQA: A Challenge Dataset for Open-Domain Question Answering
Yang, Yi and Yih, Wen-tau and Meek, Christopher
Empirical Methods in Natural Language Processing (EMNLP) - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Shagun Sodhani 7 years ago

#### Introduction

* Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering.
* [Link to the paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/)

#### Dataset

* 3047 questions sampled from Bing query logs.
* Each question associated with a Wikipedia page.
* All sentences in the summary paragraph of the page become the candidate answers.
* Only about one-third of the questions have a correct answer in the candidate answer set.
* Answers were crowdsourced through an MTurk-like platform.
* Answer sentences are associated with *answer phrases* (shortest substring of a sentence that answers the question) though this annotation is not used in the experiments reported by the paper.

#### Other Datasets

* [QASent dataset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf)
    * Uses questions from the TREC-QA dataset (questions from both query logs and human editors) and selects sentences which share at least one non-stopword with the question.
    * Lexical overlap makes QA task easier.
    * Does not support evaluating for *answer triggering* (detecting if the correct answer even exists in the candidate sentences).

#### Experiments

##### Baseline Systems

* **Word Count** - Counts the number of non-stopwords common to question and answer sentences.
* **Weighted Word Count** - Re-weight word counts by the IDF values of the question words.
* **[LCLR](https://www.microsoft.com/en-us/research/publication/question-answering-using-enhanced-lexical-semantic-models/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
* **Paragraph Vectors** - Considers cosine similarity between question vector and sentence vector.
* **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling.
* **PV-Cnt** and **CNN-Cnt** - Logistic regression classifier combining PV (and CNN) models and Word Count models.
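The two word-count baselines above can be sketched in a few lines. This is only an illustration: the toy stopword list, the regex tokenizer, the smoothed IDF formula, and the choice to compute IDF over the candidate sentences are all assumptions of this example rather than details from the paper.

```python
import math
import re
from collections import Counter

# Toy stopword list, an assumption for this sketch.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "how", "what", "who"}

def content_words(text):
    """Lowercased set of non-stopword tokens in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def word_count(question, sentence):
    """Number of non-stopwords shared by the question and the sentence."""
    return len(content_words(question) & content_words(sentence))

def idf_table(sentences):
    """Smoothed IDF per word (the smoothing is an assumption)."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in content_words(s))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}

def weighted_word_count(question, sentence, idf):
    """Shared non-stopwords, each weighted by its IDF value."""
    shared = content_words(question) & content_words(sentence)
    return sum(idf.get(w, 0.0) for w in shared)

question = "How long is the Nile river?"
candidates = [
    "The Nile is a major river in Africa.",
    "It is about 6650 km long.",
    "Cairo is the capital of Egypt.",
]
idf = idf_table(candidates)
ranked = sorted(candidates,
                key=lambda s: weighted_word_count(question, s, idf),
                reverse=True)
print(ranked)
```

The example also shows why lexical overlap alone is a weak signal: the sentence sharing the most words with the question is ranked first even though a different sentence contains the answer.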

##### Metrics

* MAP and MRR for answer selection problem.
* Precision, recall and F1 scores for answer triggering problem.
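For reference, these metrics can be computed as in the sketch below. Each question is represented by the 0/1 relevance labels of its candidate sentences in ranked order; skipping questions that have no correct candidate when averaging is an assumption of this example, and the function names are made up for illustration.

```python
def average_precision(labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this correct answer
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_over_questions(metric, ranked_label_lists):
    """Mean of `metric` over questions with at least one correct answer."""
    scores = [metric(labels) for labels in ranked_label_lists if any(labels)]
    return sum(scores) / len(scores) if scores else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

rankings = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
print("MAP:", mean_over_questions(average_precision, rankings))
print("MRR:", mean_over_questions(reciprocal_rank, rankings))
```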

#### Observations

* CNN-Cnt outperforms all other models on both tasks.
* Three additional features, namely the length of the question (QLen), the length of sentence (SLen), and the class of the question (QClass) are added to track question hardness and sentence comprehensiveness.
* Adding QLen improves performance significantly, adding SLen improves it marginally, and adding QClass degrades it marginally.
* For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
* Note: The dataset is too small to train end-to-end networks.


Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Joseph Paul Cohen 8 years ago

This method improves on the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}.

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input, so the images don't have to be scaled before being fed as an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each output gradient back to the input value that was selected as the max (the argmax) of its sub-window; where RoIs overlap, the gradients arriving at the same input value are summed.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}
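The RoI max pooling quoted above, and the argmax-based gradient routing, can be sketched for a single-channel feature map as follows. This is an illustrative NumPy version under simplifying assumptions (one channel, one RoI given in feature-map cells as `(top, left, h, w)`, floor-based sub-window boundaries); the function names are made up for this example.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w RoI of a 2-D feature map into an out_h x out_w grid."""
    top, left, h, w = roi
    window = feature_map[top:top + h, left:left + w]
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    argmax = np.empty((out_h, out_w, 2), dtype=int)
    # Sub-window boundaries of approximate size h/H x w/W.
    ys = np.floor(np.arange(out_h + 1) * h / out_h).astype(int)
    xs = np.floor(np.arange(out_w + 1) * w / out_w).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            y0, y1 = ys[i], max(ys[i + 1], ys[i] + 1)  # at least one row
            x0, x1 = xs[j], max(xs[j + 1], xs[j] + 1)  # at least one column
            cell = window[y0:y1, x0:x1]
            k = np.unravel_index(np.argmax(cell), cell.shape)
            out[i, j] = cell[k]
            # Remember where the max came from, in feature-map coordinates.
            argmax[i, j] = (top + y0 + k[0], left + x0 + k[1])
    return out, argmax

def roi_max_pool_backward(grad_out, argmax, input_shape):
    """Route each output gradient to the input cell that produced the max."""
    grad_in = np.zeros(input_shape, dtype=grad_out.dtype)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            y, x = argmax[i, j]
            grad_in[y, x] += grad_out[i, j]  # accumulate, do not overwrite
    return grad_in

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled, idx = roi_max_pool(fmap, roi=(1, 1, 4, 4), out_h=2, out_w=2)
grad = roi_max_pool_backward(np.ones_like(pooled), idx, fmap.shape)
print(pooled)
print(grad)
```

With several RoIs over the same feature map, calling the backward function once per RoI and adding the results gives the summed gradient for shared input cells.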


Convolutional Neural Networks for Sentence Classification
Kim, Yoon
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

Summary by Shagun Sodhani 7 years ago

#### Introduction

* The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks.
* [Link to the paper](https://arxiv.org/abs/1408.5882)
* [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier)

#### Architecture

* Pad input sentences so that they are of the same length.
* Map words in the padded sentence using word embeddings (which may be either randomly initialized or initialized as word2vec embeddings) to obtain a matrix corresponding to the sentence.
* Apply convolution layer with multiple filter widths and feature maps.
* Apply max-over-time pooling operation over the feature map.
* Concatenate the pooling results from the different filters and feed them to a fully-connected layer with softmax activation.
* Softmax outputs a probability distribution over the labels.
* Use dropout for regularisation.
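The forward pass described by the steps above can be sketched in NumPy as follows. This is a minimal inference-time illustration with random weights: dropout and training are omitted, padding is assumed to use token id 0, and all names and sizes (`sentence_cnn_forward`, vocabulary of 20, 4 feature maps per width) are assumptions of this example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def sentence_cnn_forward(token_ids, E, filters, biases, W_out, b_out):
    """token_ids: padded sentence; E: (vocab, d) embedding matrix;
    filters: list of (n_maps, width, d) arrays, one per filter width."""
    X = E[token_ids]                                   # (length, d) sentence matrix
    pooled = []
    for F, b in zip(filters, biases):
        width = F.shape[1]
        # All windows of `width` consecutive words: (n_windows, width, d).
        windows = np.stack([X[i:i + width] for i in range(len(X) - width + 1)])
        fmap = relu(np.einsum("nwd,mwd->nm", windows, F) + b)  # (n_windows, n_maps)
        pooled.append(fmap.max(axis=0))                # max-over-time pooling
    features = np.concatenate(pooled)                  # one value per feature map
    return softmax(W_out @ features + b_out)           # distribution over labels

rng = np.random.default_rng(0)
vocab, d, n_maps, n_classes = 20, 8, 4, 2
widths = (3, 4, 5)
E = rng.normal(size=(vocab, d))
filters = [rng.normal(size=(n_maps, w, d)) for w in widths]
biases = [np.zeros(n_maps) for _ in widths]
W_out = rng.normal(size=(n_classes, n_maps * len(widths)))
b_out = np.zeros(n_classes)

token_ids = np.array([5, 3, 9, 1, 0, 0, 0])            # padded with id 0
probs = sentence_cnn_forward(token_ids, E, filters, biases, W_out, b_out)
print(probs)
```

Because of max-over-time pooling, the size of the feature vector depends only on the number of feature maps, not on the sentence length.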

#### Hyperparameters

* ReLU activation for convolution layers
* Filter windows of 3, 4, and 5 with 100 feature maps each.
* Dropout - 0.5
* $l_2$-norm constraint of 3 on the weight vectors
* Batch size - 50
* Adadelta update rule.

#### Variants

* CNN-rand
    * Randomly initialized word vectors.
* CNN-static
    * Uses pre-trained vectors from word2vec and does not update the word vectors.
* CNN-non-static
    * Same as CNN-static but updates word vectors during training.
* CNN-multichannel
    * Uses two sets of word vectors (channels).
    * One set is updated during training and the other is kept static.

#### Datasets

* Sentiment analysis datasets for Movie Reviews, Customer Reviews etc.
* Classification data for questions.
* Maximum number of classes for any dataset - 6

#### Strengths

* Good results on benchmarks despite being a simple architecture.
* Word vectors obtained from the non-static channel are more meaningful representations.

#### Weaknesses

* Small data with few labels.
* Results are not very detailed or exhaustive.


Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron C. and Salakhutdinov, Ruslan and Zemel, Richard S. and Bengio, Yoshua
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Denny Britz 8 years ago

TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image. In order to find the correspondence between words and image patches, the RNN uses a lower convolutional layer as its input (before pooling). The authors propose both a "hard" attention (trained using sampling methods) and "soft" attention (trained end-to-end) mechanism, and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attention-based models achieve state-of-the-art results on Flickr8k, Flickr30k and MS COCO.

#### Key Points

- To find the correspondence between words and image regions, attend over the features of a lower convolutional layer.
- Two attention mechanisms: soft and hard. Depending on the evaluation metric (BLEU vs. METEOR), one or the other performs better.
- The largest dataset (MS COCO) takes 3 days to train on a Titan Black GPU. The image encoder is the Oxford VGG network.
- Soft attention is the same as for seq2seq models.
- Attention weights are visualized by upsampling and applying a Gaussian filter.
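A single step of soft attention over image regions can be sketched as below. The additive scoring function (a small tanh layer followed by a dot product) and all names and sizes are assumptions of this example; the summary only states that soft attention matches the seq2seq formulation, so this should be read as one plausible instance rather than the authors' exact model.

```python
import numpy as np

def soft_attention(a, h_prev, W_a, W_h, v):
    """a: (L, D) region features from a conv layer; h_prev: (n,) decoder state.
    Returns the context vector (D,) and the attention weights (L,)."""
    scores = np.tanh(a @ W_a + h_prev @ W_h) @ v   # one score per region, (L,)
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over regions
    context = alpha @ a                            # weighted average of regions
    return context, alpha

rng = np.random.default_rng(0)
L, D, n, k = 6, 5, 4, 3        # regions, feature size, state size, hidden size
a = rng.normal(size=(L, D))
h_prev = rng.normal(size=n)
W_a = rng.normal(size=(D, k))
W_h = rng.normal(size=(n, k))
v = rng.normal(size=k)

context, alpha = soft_attention(a, h_prev, W_a, W_h, v)
print(alpha)     # weights that could be upsampled onto the image for visualization
print(context)
```

Every operation here is differentiable, which is why the soft variant can be trained end-to-end, whereas hard attention samples a single region and needs sampling-based training.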

#### Notes/Questions

- Would've liked to see an explanation of when/how soft vs. hard attention does better.
- What is the computational overhead of using the attention mechanism? Is it significant?