Welcome to ShortScience.org!
[link]
Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the first-thing-to-try:

1. Build a recurrent generator model which samples from its softmax outputs at each timestep.
2. Pass sampled sequences to a recurrent discriminator model which distinguishes between sampled sequences and real-data sequences.
3. Train the discriminator under the standard GAN loss.
4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator.

Sounds hacky, right? We're learning a generator with a high-variance model-free reinforcement learning algorithm, in a very seriously non-stationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.)

There's just one trick in this paper on top of that setup: for non-terminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: under a stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$, and discriminator $D(\tau)$,

$$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$

The rewards are estimated via Monte Carlo — i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.

---

Yep, so it turns out that this sort of works, with a big caveat:

## The big caveat

Graph from appendix:

![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%202016-09-27%20um%2021.34.44.png?raw=1)

SeqGANs don't work without supervised pretraining. Makes sense — with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1).

I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.
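To make the Monte Carlo reward estimate above concrete, here is a minimal sketch of my own (not the paper's code): the uniform toy policy, the parity-based toy discriminator, and all constants are placeholders standing in for the recurrent generator and discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, T, N_ROLLOUTS = 10, 20, 16  # illustrative vocabulary size, sequence length, rollout count

def sample_step(prefix):
    """Toy stochastic policy: uniform over tokens (stand-in for the softmax RNN generator)."""
    return int(rng.integers(VOCAB))

def discriminator(seq):
    """Toy discriminator score in [0, 1] (stand-in for the recurrent discriminator D)."""
    return float(np.mean(np.asarray(seq) % 2))

def intermediate_reward(prefix, n_rollouts=N_ROLLOUTS):
    """Estimate r_t = E_{tau_{t+1:T} ~ pi}[ D(tau_{1:T}) ] by averaging N rollouts."""
    scores = []
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < T:          # complete the sequence stochastically from state s_t
            seq.append(sample_step(seq))
        scores.append(discriminator(seq))
    return float(np.mean(scores))

prefix = [sample_step([]) for _ in range(5)]   # a partial sequence (state s_t)
print(intermediate_reward(prefix))
```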
[link]
$\bf Summary:$ The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels) $\times$ (number of filters) $\times$ (filter size $\times$ filter size). The paper proposes two strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the filter budget is given, i.e., they do not tinker with the number of filters. Decreasing the number of parameters would normally hurt accuracy; to compensate, the authors propose to downsample late in the network. The results are quite impressive: compared to AlexNet, they achieve a 50x reduction in model size while preserving its accuracy. Their model can be further compressed with existing methods like Deep Compression, which are orthogonal to this paper's approach, giving a total reduction of around 510x while still preserving AlexNet-level accuracy.

$\bf Question$: The impact on running times (especially in the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least be *no worse* than the baseline models?
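To make the parameter arithmetic concrete, here is a small back-of-the-envelope sketch of my own (not from the paper); the channel and filter counts are illustrative, and the last line only mimics the spirit of the squeeze-then-expand idea rather than reproducing an exact Fire module.

```python
def conv_params(in_channels, num_filters, k):
    """Parameters of a conv layer (ignoring biases): in_channels * num_filters * k * k."""
    return in_channels * num_filters * k * k

# Baseline: 256 input channels, 256 filters of size 3x3.
baseline = conv_params(256, 256, 3)                            # 589,824

# Strategy (i): replace half of the 3x3 filters with 1x1 filters.
mixed = conv_params(256, 128, 1) + conv_params(256, 128, 3)    # 327,680

# Strategy (ii): additionally reduce the number of input channels seen by the
# expensive filters with a cheap 1x1 "squeeze" layer (here down to 32 channels).
squeezed = (conv_params(256, 32, 1)
            + conv_params(32, 128, 1)
            + conv_params(32, 128, 3))                         # 49,152

print(baseline, mixed, squeezed)
```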
[link]
A study of how scan orders influence mixing time in Gibbs sampling. This paper compares the mixing rates of Gibbs sampling using either systematic scan or random updates. The basic contributions are two. First, in Section 2, a set of cases where systematic scan is polynomially faster than random updates; together with a previously known case where it can be slower, this contradicts a conjecture that the speeds of systematic and random updates are similar. Second, in Theorem 1, a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other).

In more detail: following a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. They then provide Theorem 1, which gives weaker bounds that they verify at least under some conditions. The theorem compares random scan to a lazy version of systematic scan and obtains bounds in terms of various other quantities, such as the minimum probability or the minimum holding probability.

MCMC is at the heart of many applications of modern machine learning and statistics, so it is important to understand its computational and theoretical performance under various conditions. The present paper focuses on comparing systematic-scan Gibbs sampling to random-scan Gibbs. It does so first through the construction of several examples which challenge the dominant intuitions about mixing times, and then by developing theoretical bounds which are much wider than previously conjectured.
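For readers unfamiliar with the two schedules being compared, here is a minimal sketch of my own (not from the paper) of systematic-scan versus random-scan Gibbs sampling on a toy bivariate Gaussian target; the correlation value and iteration counts are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9                      # correlation of a bivariate standard Gaussian target
sd = np.sqrt(1.0 - rho ** 2)   # conditional std: x_i | x_j ~ N(rho * x_j, 1 - rho^2)

def update(x, i):
    """Resample coordinate i from its exact conditional given the other coordinate."""
    x[i] = rho * x[1 - i] + sd * rng.standard_normal()

def systematic_scan(n_sweeps):
    x, trace = np.zeros(2), []
    for _ in range(n_sweeps):
        for i in (0, 1):       # fixed update order within each sweep
            update(x, i)
        trace.append(x.copy())
    return np.array(trace)

def random_scan(n_updates):
    x, trace = np.zeros(2), []
    for _ in range(n_updates):
        update(x, int(rng.integers(2)))   # coordinate chosen uniformly at random
        trace.append(x.copy())
    return np.array(trace)

print(systematic_scan(1000).mean(axis=0), random_scan(2000).mean(axis=0))
```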
[link]
Residual Networks (ResNets) have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.

* **Pros (+):** The change does not negatively impact model accuracy (for an equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.
* **Cons (-):** Increased number of parameters, hence the need to change the unit depth to match the "equivalent" ResNet.

---

# Proposed Architecture

## RevNet

This paper proposes to incorporate ideas from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired by *additive coupling* [1, 2]:

$$
\begin{array}{r|r}
\texttt{RevNet Block} & \texttt{Inverse Transformation}\\
\hline
\mathbf{input}\ x & \mathbf{input}\ y \\
x_1, x_2 = \mbox{split}(x) & y_1, y_2 = \mbox{split}(y)\\
y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2 - \mathcal{G}(y_1) \\
y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2)
\end{array}
$$

where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolution, ReLU and Batch Normalization layers, analogous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider splitting the input tensor across the channel dimension, as in [1, 2].

Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However, the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture.

## Backpropagation

### Standard

The backpropagation algorithm is derived from the chain rule and is used to compute the gradients of the loss with respect to the parameters of a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $\overline{\theta_{n}} = \partial L / \partial \theta_n$ (where, for any $x$, we write $\bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: in order to compute the gradients for the $n$-th block, backpropagation requires the input and output activations of this block, $y_{n - 1}$ and $y_{n}$, which have been stored during the forward pass, and the derivative of the loss with respect to the output, $\overline{y_{n}}$, which has been computed in the backpropagation step of the layer above; hence the name backpropagation.

### RevNet

Since activations are not stored in RevNet, the algorithm needs to be slightly modified, as described in the right column of **Table 1**.
In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e., $\overline{y_{n - 1}} = (\overline{y_{n - 1, 1}}, \overline{y_{n - 1, 2}})$, using the fact that:

$$
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
$$

Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities, we can obtain the gradients with respect to the parameters of this block, $\theta_n$.

$$
\begin{array}{|c|l|l|}
\hline
 & \mathbf{ResNet} & \mathbf{RevNet} \\
\hline
\mathbf{Block} & y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) & y_{n - 1, 1}, y_{n - 1, 2} = \mbox{split}(y_{n - 1})\\
&& y_{n, 1} = y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
&& y_{n, 2} = y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
&& y_{n} = (y_{n, 1}, y_{n, 2})\\
\hline
\mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\
\hline
\mathbf{Backprop} & \mathbf{in:}\ y_{n - 1}, y_{n}, \overline{y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n}}\\
& \overline{\theta_n} = \overline{y_n}\ \frac{\partial y_n}{\partial \theta_n} & \texttt{# recover activations} \\
& \overline{y_{n - 1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n - 1}} & y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\
& \mathbf{out:}\ \overline{\theta_n}, \overline{y_{n - 1}} & y_{n - 1, 2} = y_{n, 2} - \mathcal{G}(y_{n, 1})\\
&& y_{n - 1, 1} = y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
&& \texttt{# gradients wrt. inputs} \\
&& \overline{y_{n - 1, 1}} = \overline{y_{n, 1}} + \overline{y_{n, 2}}\ \frac{\partial \mathcal G}{\partial y_{n, 1}} \\
&& \overline{y_{n - 1, 2}} = \overline{y_{n, 1}}\ \frac{\partial \mathcal F}{\partial y_{n, 2}} + \overline{y_{n, 2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n, 2}}\ \frac{\partial \mathcal G}{\partial y_{n, 1}} \right) \\
&& \texttt{# gradients wrt. parameters} \\
&& \overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}}\ \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
&& \overline{\theta_{n, \mathcal F}} = \overline{y_{n, 1}}\ \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}}\ \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}}\ \frac{\partial \mathcal G}{\partial y_{n, 1}}\\
&& \mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n - 1}}, y_{n - 1}\\
\hline
\end{array}
$$

**Table 1:** Backpropagation in the standard case and for reversible blocks.

---

## Experiments

**Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computation. Compared to other methods that focus on reducing the memory requirements of deep networks, RevNet provides the best trade-off: no activations have to be stored, so the spatial complexity is $O(1)$, while the computational complexity remains linear in the number of layers, i.e., $O(L)$. One small disadvantage is that RevNets introduce additional parameters, as each block is composed of two residual functions, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two.
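The reconstruction step described above, recovering $y_{n-1}$ from $y_n$ instead of storing it, is what makes the $O(1)$ activation storage possible. Below is a minimal NumPy sketch of my own (not the authors' implementation) checking that a reversible block can be inverted exactly; the toy $\mathcal F$ and $\mathcal G$ are arbitrary stand-ins for the residual sub-networks and need not be invertible themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual functions F and G (stand-ins for the conv/ReLU/BN sub-networks).
W_f = rng.standard_normal((8, 8))
W_g = rng.standard_normal((8, 8))
F = lambda h: np.maximum(W_f @ h, 0.0)   # any deterministic function works here
G = lambda h: np.maximum(W_g @ h, 0.0)

def rev_forward(x):
    x1, x2 = np.split(x, 2)              # split along the channel dimension
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return np.concatenate([y1, y2])

def rev_inverse(y):
    y1, y2 = np.split(y, 2)
    x2 = y2 - G(y1)                      # undo the second coupling
    x1 = y1 - F(x2)                      # then the first
    return np.concatenate([x1, x2])

x = rng.standard_normal(16)
assert np.allclose(rev_inverse(rev_forward(x)), x)   # activations are exactly recoverable
```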
**Results.** In the experiments section, the authors compare ResNet architectures to their RevNet "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performance** to their ResNet counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of the floating-point errors that can occur when reconstructing activations rather than storing them; these errors turn out to be of small magnitude and do not seem to negatively impact the model.

To summarize, reversible networks seem like a very promising direction for efficiently training very deep networks under memory budget constraints.

---

## References

* [1] NICE: Non-linear Independent Components Estimation, Dinh et al., ICLR 2015
* [2] Density estimation using Real NVP, Dinh et al., ICLR 2017
[link]
Ilyas et al. present a follow-up work to their paper on the trade-off between accuracy and robustness. Specifically, given a feature $f(x)$ computed from input $x$, the feature is considered predictive if $\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$; similarly, a predictive feature is robust if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} y f(x + \delta)\right] \geq \gamma$. This means a feature is considered robust if its worst-case correlation with the label exceeds some threshold $\gamma$; here, the worst case is taken over a pre-defined set of allowed perturbations $\Delta(x)$ relative to the input $x$. Obviously, there also exist predictive features that are not robust according to the above definition.

In the paper, Ilyas et al. present two simple algorithms for obtaining adapted datasets which contain only robust or only non-robust features. The main idea of these algorithms is that an adversarially trained model only utilizes robust features, while a standard model utilizes both robust and non-robust features. Based on these datasets, they show that non-robust, predictive features are sufficient to obtain high accuracy; similarly, training a normal model on a robust dataset also leads to reasonable accuracy but additionally increases robustness. Experiments were done on CIFAR-10. These observations are supported by a theoretical toy dataset consisting of two overlapping Gaussians; I refer to the paper for details.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
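As a toy illustration in the spirit of the paper's Gaussian example (my own sketch, not the authors' construction): for a linear feature $f(x) = w^\top x$ and an $\ell_\infty$ perturbation set of radius $\epsilon$, the worst case has the closed form $\inf_{\|\delta\|_\infty \leq \epsilon} y\, w^\top (x + \delta) = y\, w^\top x - \epsilon \|w\|_1$, so both quantities can be estimated directly; all constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 10_000, 20, 0.1

# Toy data: labels y in {-1, +1}, inputs x = y * mu + Gaussian noise.
y = rng.choice([-1.0, 1.0], size=n)
mu = np.full(d, 0.05)                       # weakly informative mean shift
x = y[:, None] * mu + rng.standard_normal((n, d))

w = mu / np.linalg.norm(mu)                 # a linear feature f(x) = w^T x

# Predictiveness: empirical estimate of E[y f(x)].
rho_hat = np.mean(y * (x @ w))

# Robustness under an l_inf ball of radius eps, using the closed-form worst case.
gamma_hat = np.mean(y * (x @ w) - eps * np.abs(w).sum())

# The feature can be predictive (rho_hat > 0) yet not robust (gamma_hat < 0).
print(rho_hat, gamma_hat)
```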