Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Exploring the Limits of Language Modeling

Józefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp

#### This nice paper looks amazing at first sight since it brings a mixture of:

- Fancy models
- A state-of-the-art training procedure (a 32-GPU distributed training effort that takes 21 days to reach the best result)
- A significant improvement on the theoretical metric (perplexity drops from 51.3 to 30 for a single model, and from 41.0 to 23.7 for an ensemble)
- A benchmark on a somewhat industry-scale data set (a vocabulary of 793471 words and 0.8B words of training data) rather than a purely research one

#### However, I also want to add some criticism:

- As [1] mentions, perplexity is a somewhat confusing metric: a large perplexity reduction may not reflect the real improvement and tends to have an "exaggerating" effect.
- This paper only reports the language model improvement. However, LMs are usually embedded in a more complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the paper reported results from integrating its LMs into some end-to-end products. Since the authors work for the Google Brain team, this is not too stringent a requirement.
- As far as I know, the data set used in this paper comes from news stories [2], which are more formal than spoken language. In real applications, we usually face less formal data (such as search engine queries and speech recognition input). It remains an open question how the best model in this paper would perform in a more realistic scenario. Again, for the Google Brain team, it should not be a big obstacle to integrate it with an existing system by replacing or complementing the existing LMs.

Although I have posted some personal criticism, I still appreciate this nice paper and recommend it as a "must-read" for people in NLP and related fields, since I think it provides a unifying and comprehensive survey-style perspective that helps us grasp the latest state-of-the-art language model technology efficiently.

References:

- [1] http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
- [2] http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41880.pdf
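One way to read the perplexity criticism is to convert the reported numbers to per-word cross-entropy, using the standard identity perplexity = exp(cross-entropy). The sketch below is only an illustration: the function names are hypothetical, and the choice of natural logarithms is an assumption (the relative reduction is the same in any base).

```python
import math

def cross_entropy_nats(perplexity):
    """Per-word cross-entropy (in nats) implied by a perplexity value."""
    return math.log(perplexity)

def reductions(before, after):
    """Return (relative perplexity reduction, relative cross-entropy reduction)."""
    ppl_reduction = (before - after) / before
    ce_before = cross_entropy_nats(before)
    ce_after = cross_entropy_nats(after)
    ce_reduction = (ce_before - ce_after) / ce_before
    return ppl_reduction, ce_reduction

# Perplexities quoted in the summary above.
results = {
    "single model": (51.3, 30.0),
    "ensemble": (41.0, 23.7),
}

for name, (before, after) in results.items():
    ppl_red, ce_red = reductions(before, after)
    print(f"{name}: perplexity -{ppl_red:.1%}, cross-entropy -{ce_red:.1%}")
```

The relative drop in cross-entropy is much smaller than the relative drop in perplexity, which is one plausible sense in which perplexity "exaggerates" an improvement.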

Algorithms for Non-negative Matrix Factorization

Lee, Daniel D. and Seung, H. Sebastian

Neural Information Processing Systems Conference - 2000 via Local Bibsonomy

Keywords: dblp

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\ 4 & 5 & 1 \\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using an SVD and truncating the factors by dropping the smaller singular values:

$$ W = \left[\begin{array}{c} -0.656 \\ -0.652 \\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\ 0.386 \\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\ 4.33 & 4.08 & 2.09 \\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations, such as similar customers and similar movies.

`TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ starting from random initial matrices.

The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that the error is not increased. The multiplicative approach is discussed in contrast to an additive, gradient descent based approach where small corrections are iteratively applied. The additive update reduces to the multiplicative one when the learning rate ($\eta$) for each element is set to the ratio of that element of $H$ to the corresponding element of $W^T W H$.

### Still a draft
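The multiplicative updates described above can be sketched with NumPy as follows. This is a minimal sketch for the Frobenius-norm objective, not the authors' code: the function names, the iteration count, the random seed, and the small `eps` added to the denominators to avoid division by zero are all assumptions made for illustration. A truncated SVD is included so the two $r = 1$ reconstruction errors can be compared on the example matrix $V$.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-12, seed=0):
    """Factor V ~ W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random non-negative starting point (assumed initialization).
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each entry is rescaled by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius error
    return W, H, errors

def svd_rank_r(V, r):
    """Rank-r factors from a truncated SVD (no sign constraint)."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U[:, :r], np.diag(s[:r]) @ Vt[:r, :]

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

W, H, errors = nmf_multiplicative(V, r=1)
W_svd, H_svd = svd_rank_r(V, r=1)
err_nmf = errors[-1]
err_svd = np.linalg.norm(V - W_svd @ H_svd)
print("NMF error:", err_nmf)
print("SVD error:", err_svd)
```

The scale of $W$ and $H$ is not unique (any positive rescaling of $W$ can be absorbed into $H$), so the individual factors from this sketch need not match the numbers quoted above even when the product $WH$ does.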

Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation.

Matthias Hein and Maksym Andriushchenko

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Hein and Andriushchenko give an intuitive bound on the robustness of neural networks based on the local Lipschitz constant. By robustness, the authors refer to a small $\epsilon$-ball around each sample; this ball is supposed to describe the region where the neural network predicts a constant class. This means that adversarial attacks have to find perturbations large enough to leave these robust areas. Larger $\epsilon$-balls imply higher robustness to adversarial examples.

When considering a single example $x$ and a classifier $f = (f_1, \ldots, f_K)^T$ (i.e. in a multi-class setting), the bound can be stated as follows. For $q$ and $p$ such that $\frac{1}{q} + \frac{1}{p} = 1$ and $c$ being the class predicted for $x$, it holds that $c = \arg\max_j f_j(x + \delta)$ for all $\delta$ with

$$\|\delta\|_p \leq \max_{R > 0}\min \left\{\min_{j \neq c} \frac{f_c(x) - f_j(x)}{\max_{y \in B_p(x, R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q}, R\right\}$$

Here, $B_p(x, R)$ describes the $R$-ball around $x$ measured using the $p$-norm. Based on the local Lipschitz constant (in the denominator), the bound essentially measures how far we can deviate from the sample $x$ (measured in the $p$-norm) until $f_j(x) > f_c(x)$ for some $j \neq c$. The higher the local Lipschitz constant, the smaller the allowed deviations, i.e. adversarial examples are easier to find. Note that the bound also depends on the confidence, i.e. the margin by which $f_c(x)$ exceeds all other $f_j(x)$.

In the remainder of the paper, the authors also provide bounds for simple classifiers including linear classifiers, kernel methods and two-layer perceptrons (i.e. one hidden layer). For the latter, they also propose a new type of regularization called cross-Lipschitz regularization:

$$P(f) = \frac{1}{nK^2} \sum_{i = 1}^n \sum_{l,m = 1}^K \|\nabla f_l(x_i) - \nabla f_m(x_i)\|_2^2$$

This regularization term is intended to reduce the Lipschitz constant locally around training examples. They show experimental results using this regularization on MNIST and CIFAR; see the paper for details.

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
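For a linear classifier $f_j(x) = w_j^T x + b_j$ the bound has a closed form: the gradients are the constant vectors $w_j$, so the inner maximum does not depend on $R$, and $R$ can be taken arbitrarily large. The sketch below illustrates that special case only; the function names, the toy weights and the choice $p = q = 2$ are assumptions made for the example, and the rows of `W` are assumed to be distinct.

```python
import numpy as np

def linear_robustness_bound(W, b, x, q=2):
    """Predicted class c and the radius min_j (f_c - f_j) / ||w_c - w_j||_q."""
    scores = W @ x + b
    c = int(np.argmax(scores))
    radii = [
        (scores[c] - scores[j]) / np.linalg.norm(W[c] - W[j], ord=q)
        for j in range(W.shape[0]) if j != c
    ]
    return c, float(min(radii))

def cross_lipschitz_linear(W):
    """Cross-Lipschitz penalty for a linear model.

    The gradient of f_l is w_l at every x_i, so the average over the n
    training points drops out and only the 1/K^2 double sum remains.
    """
    K = W.shape[0]
    diffs = W[:, None, :] - W[None, :, :]   # shape (K, K, d)
    return float((diffs ** 2).sum() / K ** 2)

# Hypothetical 3-class classifier on 2-dimensional inputs.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([2.0, 0.5])

c, radius = linear_robustness_bound(W, b, x, q=2)
penalty = cross_lipschitz_linear(W)
print("predicted class:", c, "certified L2 radius:", radius)
print("cross-Lipschitz penalty:", penalty)
```

Any perturbation with 2-norm below `radius` leaves the predicted class unchanged in this toy setting; scaling all rows of `W` towards each other lowers the penalty and, for fixed margins, enlarges the radius.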

Online Meta-Learning

Finn, Chelsea and Rajeswaran, Aravind and Kakade, Sham M. and Levine, Sergey

International Conference on Machine Learning - 2019 via Local Bibsonomy

Keywords: dblp

## Introduction

Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning.

* Meta-learning: past experience is used to acquire a prior over model parameters or a learning procedure. It typically studies a setting where a set of meta-training tasks is made available together upfront.
* Online learning: a sequential setting where tasks are revealed one after another, but which aims to attain zero-shot generalization without any task-specific adaptation.

We argue that neither setting is ideal for studying continual lifelong learning. Meta-learning deals with learning to learn, but neglects the sequential and non-stationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task.

## Online Learning

Online learning focuses on regret minimization. The most standard notion of regret compares the learner's cumulative loss to that of the best fixed model in hindsight:

https://i.imgur.com/pbZG4kK.png

One way to minimize regret is with Follow the Leader (FTL):

https://i.imgur.com/NCs73vG.png

## Online Meta-learning

Setting: let $U_t$ be the update procedure for task $t$, e.g. in MAML:

https://i.imgur.com/Q4I4HkD.png

The overall protocol for the setting is as follows:

1. At round $t$, the agent chooses a model defined by $w_t$.
2. The world simultaneously chooses a task defined by $f_t$.
3. The agent obtains access to the update procedure $U_t$, and uses it to update the parameters as $\tilde w_t = U_t(w_t)$.
4. The agent incurs loss $f_t(\tilde w_t)$. Advance to round $t + 1$.

The goal for the agent is to minimize regret over the rounds. Achieving sublinear regret means the agent keeps improving and converges to the performance upper bound (joint training on all tasks).

## Algorithm and Analysis

Follow the meta-leader (FTML):

https://i.imgur.com/qWb9g8Q.png

FTML's regret is sublinear (under some assumptions).
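The protocol above can be sketched on a toy problem. The formulas in this summary are only available as images, so the sketch assumes the usual forms: $U_t(w) = w - \alpha \nabla f_t(w)$ for the MAML-style update and $w_{t+1} = \arg\min_w \sum_{k \le t} f_k(U_k(w))$ for FTML. Everything else is hypothetical and chosen so the meta-leader has a closed form: one-dimensional quadratic tasks $f_t(w) = \frac{1}{2}(w - c_t)^2$, the step size `ALPHA`, and the uniformly sampled task centers.

```python
import numpy as np

ALPHA = 0.5  # hypothetical inner-loop step size of the MAML-style update

def loss(w, c):
    """Toy task loss f_t(w) = 0.5 * (w - c_t)^2."""
    return 0.5 * (w - c) ** 2

def update(w, c, alpha=ALPHA):
    """U_t(w): one gradient step on task t starting from w."""
    return w - alpha * (w - c)

def run_ftml(centers, w0=0.0):
    """Play the online meta-learning protocol and return the per-round losses."""
    w, running_sum, losses = w0, 0.0, []
    for t, c in enumerate(centers, start=1):
        w_tilde = update(w, c)            # step 3: adapt with U_t
        losses.append(loss(w_tilde, c))   # step 4: incur f_t(U_t(w_t))
        # Meta-leader: argmin_w sum_k f_k(U_k(w)). For these quadratics the
        # objective is 0.5 * (1 - ALPHA)^2 * sum_k (w - c_k)^2, which is
        # minimized by the mean of the centers seen so far.
        running_sum += c
        w = running_sum / t
    return np.array(losses)

def regret(centers, losses):
    """Regret against the best fixed initialization in hindsight."""
    w_star = float(np.mean(centers))
    best = loss(update(w_star, centers), centers)
    return float(losses.sum() - best.sum())

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=2000)
losses = run_ftml(centers)
avg_regret = regret(centers, losses) / len(centers)
print("average regret per round:", avg_regret)
```

In this toy family the average regret per round shrinks as the number of rounds grows, which is what sublinear regret means; the general result in the paper relies on assumptions that this sketch does not check.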

Bayesian dark knowledge

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, allowing one to sample efficiently from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.

The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (the student) to replicate the behavior and performance of a much larger model (the teacher), essentially by training the student to match the outputs of the teacher.

The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD.

Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.
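The alternation between an SGLD step and a distillation step can be sketched on a deliberately tiny problem. This is a toy illustration under stated assumptions, not the paper's setup: the teacher is a single parameter (the mean of a unit-variance Gaussian with a Gaussian prior), the student is a single number trained with a squared loss to match the teacher's predictive mean, and the step size, batch size, burn-in and function name are all hypothetical. The analytic posterior mean is available here, so the student can be checked against it.

```python
import numpy as np

def sgld_distill(x, n_steps=20000, burn_in=1000, batch=10,
                 eps=1e-3, prior_var=100.0, seed=0):
    """Alternate one SGLD step on the teacher with one student update."""
    rng = np.random.default_rng(seed)
    N = len(x)
    theta = 0.0     # teacher parameter, sampled by SGLD
    student = 0.0   # student estimate of the posterior predictive mean
    k = 0
    for t in range(n_steps):
        # SGLD: minibatch gradient of the log posterior plus Gaussian noise.
        idx = rng.integers(0, N, size=batch)
        grad_prior = -theta / prior_var
        grad_lik = (N / batch) * np.sum(x[idx] - theta)
        theta += 0.5 * eps * (grad_prior + grad_lik) + np.sqrt(eps) * rng.normal()
        if t >= burn_in:
            # Distillation: gradient step on 0.5 * (student - theta)^2 with
            # step size 1/k, i.e. a running average of the teacher samples.
            k += 1
            student += (theta - student) / k
    return theta, student

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100)

theta_last, student = sgld_distill(x)
# Conjugate Gaussian posterior mean for unit noise variance and prior N(0, 100).
posterior_mean = x.sum() / (len(x) + 1.0 / 100.0)
print("student:", student, "analytic posterior mean:", posterior_mean)
```

The student never stores the ensemble of teacher samples; it only sees the current teacher at each step, which is the memory saving the online scheme is after.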
