Welcome to ShortScience.org! |
[link]
The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer’s input change. This phenomenon is called internal covariate shift (ICS). #### What is BN? Normalize each (scalar) feature independently with respect to the mean and variance of the mini batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. The BN consists of making normalization part of the model architecture. #### What do we gain? According to the author, the use of BN provides a great speed up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model which allows to use less dropout or less L2 normalization. Furthermore, since the distribution of the inputs is normalized, it also allows to use sigmoids as activation functions without the saturation problem. #### What follows? This seems to be specially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformation that scale up or down the activations in certain directions (eigenvectors). It seems that this regularization would be specially useful in this context since this would allow the gradient to flow more easily. When we unroll the RNNs, we usually have ultra deep networks. #### Like * Simple idea that seems to improve training. * Makes training faster. * Simple to implement. Probably. * You can be less careful with initialization. #### Dislike * Does not work with stochastic gradient descent (minibatch size = 1). * This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied. * Results on ensemble of networks for ImageNet makes it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model). |
[link]
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick allows to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportions parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions). One can think of the derivation as proceeding in 3 steps: 1. Deriving an estimator for gradients a sample from a 1-dimensional density $f(x)$ that is such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable: $\frac{\partial \hat{x}}{\partial \theta} = - \frac{1}{f(\hat{x})}\int_{t=-\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$ where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule. 2. Noticing that one can sample from a $D$-dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d|{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d|{\bf x}_{<d})$ are themselves 1-dimensional mixtures (i.e. with differentiable densities and tractable CDFs) 3. Using the 1-dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22). **My two cents** This is a really surprising and neat result. The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to read about whether that can be successfully done in practice. The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to non-diagonal covariances is doable. That said, ancestral sampling with non-diagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involves a matrix inverse. Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself as its application could go beyond that case. This is probably why the paper also derived a sampling-based estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10. Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gave a gradient that is consistent with what the reparameterization trick provides.
3 Comments
|
[link]
DRL has lot of disadvantages like large data requirement, slow learning, difficult interpretation, difficult transfer, no causality, analogical reasoning done at a statistical level not at a abstract level etc. This can be overcome by adding a symbolic front end on top of DL layer before feeding it to RL agent. Symbolic front end gives advantage of smaller state space generalization, flexible predicate length and easier combination of predicate expressions. DL avoids manual creation of features unlike symbolic reasoning. Hence DL along with symbolic reasoning might be the way to progress for AGI. State space reduction in symbolic reasoning is carried out by using object interactions(object positions and object types) for state representation. Although certain assumptions are made in the process such as objects of same type behave similarly etc, one can better understand causal relations in terms of actions, object interactions and reward by using symbolic reasoning. Broadly, pipeline consists of (1)CNN layer - Raw pixels to representation (2)Salient pixel identification - Pixels that have activations in CNN above a certain threshold (3)Identify objects of similar kind by using activation spectra of salient pixels (4)Identify similar objects in consecutive time steps to track object motion using spatial closeness(as objects can move only by a small distance in consecutive frames) and similar neighbors(different type of objects can be placed close to each other and spatial closeness alone cannot identify similar objects) (4)Building symbolic interactions by using relative object positions for all pairs of objects located within a certain maximal distance. Relative object position is necessary to capture object dynamics. Maximal distance threshold is required to make the learning quicker eventhough it may reach a locally optimal policy (4)RL agent uses object interactions as states in Q-Learning update. Instead of using all object interactions in a frame as one state, number of states are further reduced by considering interactions between two types to be independent of other types and doing a Q-Learning update separately for each type pair. Intuitive explanation for doing so is to look at a frame as a set of independent object type interactions. Action choice at a state is then the one that maximizes sum of Q values across all type pairs. Results claim that using DRL with symbolic reasoning, transfer in policies can be observed by first training on evenly spaced grid world and using it for randomly spaced grid world with a performance close to 70% contrary to DQN that achieves 50% even after training for 1000 epochs with epoch length of 100. |
[link]
Mu and Gilmer introduce MNIST-C, an MNIST-based corruption benchmark for out-of-distribution evaluation. The benchmark includes various corruption types including random noise (shot and impulse noise), blur (glass and motion blur), (affine) transformations, “striping” or occluding parts of the image, using Canny images or simulating fog. These corruptions are also shown in Figure 1. The transformations have been chosen to be semantically invariant, meaning that the true class of the image does not change. This is important for evaluation as model’s can easily be tested whether they still predict the correct labels on the corrupted images. https://i.imgur.com/Y6LgAM4.jpg Figure 1: Examples of the used corruption types included in MNIST-C. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets. The core of it is the minimization problem $$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$ with * $\lambda \in [0, \infty[$ is a hyper parameter which defines how strong the model is regularized * $u$ denoting a user, $u_*$ are all user factors $x_u$ combined * $i$ denoting an item, $y_*$ are all item factors $y_i$ combined * $x_u \in \mathbb{R}^n$ is the latent user factor (embedding); $n$ is another hyper parameter. $n=50$ seems to be a reasonable choice. * $y_i \in \mathbb{R}^n$ is the latent item factor (embedding) * $r_{ui}$ defines the "intensity"; higher values mean user $u$ interacted more with item $i$ * $p_{ui} = \begin{cases}1 & \text{if } r_{ui} >0\\0 &\text{otherwise}\end{cases}$ * $c_{ui} := 1 + \alpha r_{ui}$ where $\alpha \in [0, \infty[$ is a hyper parameter; $\alpha =40$ seems to be reasonable In contrast, the standard matrix factoriation optimization function looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)): $$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$ where * $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$ They use alternating least squares (ALS) to train this model. The prediction then is the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176)) |