#### Introduction

* The paper presents gradient-based techniques for visualising image classification models.
* [Link to the paper](https://arxiv.org/abs/1312.6034)

#### Experimental Setup

* A single deep ConvNet trained on the ILSVRC-2013 dataset (1.2M training images and 1000 classes).
* Weight layer configuration: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.

#### Class Model Visualisation

* Given a learnt ConvNet and a class of interest, start with the zero image and perform optimisation by backpropagating with respect to the input image (keeping the ConvNet weights constant).
* Add the mean image (of the training set) to the resulting image.
* The paper uses unnormalised class scores so that the optimisation focuses on increasing the score of the target class rather than on decreasing the scores of the other classes.

#### Image-Specific Class Saliency Visualisation

* Given an image, a class of interest, and a trained ConvNet, rank the pixels of the input image by their influence on the class score.
* The derivative of the class score with respect to the image gives an estimate of the importance of different pixels for the class.
* The magnitude of the derivative also indicates how much each pixel needs to be changed to improve the class score.

##### Class Saliency Extraction

* Compute the derivative of the class score with respect to the input image (see the sketch at the end of this summary).
* This results in one saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of the derivative across all colour channels.

##### Weakly Supervised Object Localisation

* The saliency map for an image provides a rough encoding of the location of the object of the class of interest.
* Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation.
* Colour continuity cues are needed because saliency maps might capture only the most dominant part of the object in the image.
* This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

#### Relation to Deconvolutional Networks

* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in the way ReLU neurons are treated: in DeconvNet, the sign indicator (for the derivative of ReLU) is computed on the output reconstruction, while in this paper it is computed on the layer input.
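As a rough illustration of the class saliency extraction step, here is a minimal PyTorch sketch; the function name, the assumed input shape, and the use of a generic classifier are assumptions for illustration, not the paper's code.

```python
import torch

def class_saliency(model, image, target_class):
    """Saliency map: max over colour channels of the magnitude of the
    derivative of the (unnormalised) class score w.r.t. the input image."""
    image = image.detach().clone().requires_grad_(True)  # image: (1, 3, H, W)
    score = model(image)[0, target_class]                # class score (logit), not the softmax output
    score.backward()
    saliency = image.grad.abs().max(dim=1).values        # (1, H, W): max magnitude across channels
    return saliency

# Usage (assumed): any ConvNet classifier in eval mode and a preprocessed image tensor,
# e.g. saliency = class_saliency(convnet, image, target_class=243)
```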
---
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (rows) are associated with three movies (columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using an SVD and keeping only the largest singular value (absorbed into the factors):

$$ W = \left[\begin{array}{c} -0.656 \\\\ -0.652 \\\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$.)

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct the matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error, why is a non-negativity constraint useful? In both factorizations it is easy to observe patterns such as similar customers and similar movies, but the non-negative factors also admit a direct, parts-based reading: each entry of $W$ can be interpreted as how strongly a customer expresses a latent taste component and each entry of $H$ as how strongly a movie exhibits it, whereas the mixed signs of the SVD factors have no such additive interpretation.

#### Paper Contribution

This paper discusses two approaches for iteratively building non-negative $W$ and $H$ from random initial matrices. The first is a multiplicative update rule in which each element of $W$ and $H$ is rescaled at every iteration by a factor chosen so that the reconstruction error never increases. The multiplicative approach is discussed in contrast to an additive gradient-descent approach in which small corrections are applied iteratively. The multiplicative rule can be recovered from the additive one by choosing a per-element learning rate, e.g. $\eta_{a\mu} = H_{a\mu} / (W^T W H)_{a\mu}$ for the update of $H$, i.e. the ratio of the current element of $H$ to the corresponding element of $W^T W H$.
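As a concrete sketch of the multiplicative updates (the Frobenius-norm version) applied to the toy rating matrix above; the iteration count, initialisation, and small epsilon are arbitrary choices for illustration, not from the paper:

```python
import numpy as np

V = np.array([[5, 4, 1],
              [4, 5, 1],
              [2, 1, 5]], dtype=float)

rng = np.random.default_rng(0)
m, n, r = V.shape[0], V.shape[1], 1
W = rng.random((m, r))
H = rng.random((r, n))

eps = 1e-9  # avoid division by zero
for _ in range(500):
    # Multiplicative updates for the objective ||V - WH||^2:
    # each element is only rescaled, so non-negativity is preserved automatically.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.round(W @ H, 2))  # approximately the rank-1 reconstruction shown above
```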
---
* They describe a model that upscales low-resolution images to their high-resolution equivalents ("Single Image Super-Resolution").
* Their model uses a deeper architecture than previous models and has a residual component.

### How

* Their model is a fully convolutional neural network.
* Input of the model: the image to upscale, *already upscaled to the desired size* (but still blurry).
* Output of the model: the upscaled image (without the blurriness).
* They use 20 layers of padded 3x3 convolutions of size 64xHxW with ReLU activations. (No pooling.)
* They have a residual component, i.e. the model only learns and outputs the *change* that has to be applied/added to the blurry input image (instead of outputting the full image). That change is applied to the blurry input image before using the loss function on it. (Note that this is a bit different from the currently used "residual learning".)
* They use an MSE between the "correct" upscaling and the generated upscaled image (input image + residual).
* They use SGD, starting with a learning rate of 0.1 and decaying it 3 times by a factor of 10.
* They use a weight decay of 0.0001.
* During training they use a special gradient clipping adapted to the learning rate. Usually gradient clipping restricts the gradient values to `[-t, t]` (`t` is a hyperparameter). Their gradient clipping restricts the values to `[-t/lr, t/lr]` (where `lr` is the learning rate). (See the sketch at the end of this summary.)
* They argue that this special gradient clipping allows the use of significantly higher learning rates.
* They train their model on multiple scales, e.g. 2x, 3x and 4x upscaling. (Not really clear how. They probably feed their upscaled image again into the network or something like that?)

### Results

* Higher-accuracy upscaling than all previous methods.
* Handles upscaling factors above 2x well.
* The residual network learns significantly faster than the non-residual network.

*Figure: Architecture of the model.*

*Figure: Super-resolution quality of their model (top; bottom is a competing model).*
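A rough PyTorch sketch of the residual output and the learning-rate-adjusted gradient clipping described above; the function name, argument names, and hyperparameter values are assumptions for illustration, not the authors' code:

```python
import torch

def train_step(model, optimizer, blurry, target, t=0.01, lr=0.1):
    """One training step: the network predicts only the residual, and gradients
    are clipped to [-t/lr, t/lr] instead of a fixed [-t, t]."""
    optimizer.zero_grad()
    residual = model(blurry)                                         # predicted high-frequency detail
    loss = torch.nn.functional.mse_loss(blurry + residual, target)   # MSE on (input + residual) vs. ground truth
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), t / lr)      # larger clip range for smaller learning rates
    optimizer.step()
    return loss.item()
```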
---
Bafna et al. show that iterative hard thresholding yields $L_0$-robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse in the Fourier basis and the corruption $e$ is assumed to be sparse in the signal domain. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy.

![Algorithm 1](https://i.imgur.com/yClXLoo.jpg)

Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attacks on images and other signals.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
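To give a feel for the underlying idea, here is a generic alternating hard-thresholding sketch in NumPy. This is only an illustration under assumptions (the function names, the alternating scheme, and the parameters `k` and `t` are chosen for this sketch), not the paper's exact Algorithm 1:

```python
import numpy as np

def topk(z, k):
    """Hard thresholding: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(z)
    if k > 0:
        idx = np.argpartition(np.abs(z), -k)[-k:]
        out[idx] = z[idx]
    return out

def recover(y, k, t, num_iters=20):
    """Alternating hard-thresholding sketch for y = x + e, with x k-sparse in the
    Fourier basis and the corruption e t-sparse in the standard basis."""
    e_hat = np.zeros_like(y)
    for _ in range(num_iters):
        # Given the current corruption estimate, keep the top-k Fourier coefficients.
        x_hat = np.real(np.fft.ifft(topk(np.fft.fft(y - e_hat), k)))
        # Given the current signal estimate, keep the top-t residual entries as the corruption.
        e_hat = topk(y - x_hat, t)
    return x_hat
```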
---
Residual Networks (ResNets) have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which the activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage.

* **Pros (+):** The change does not negatively impact model accuracy (for an equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.
* **Cons (-):** Increased number of parameters, hence the unit depth needs to be changed to match the "equivalent" ResNet.

---

# Proposed Architecture

## RevNet

This paper proposes to incorporate ideas from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired by *additive coupling* [1, 2]:

$
\begin{array}{r|r}
\texttt{RevNet Block} & \texttt{Inverse Transformation}\\
\hline
\mathbf{input }\ x & \mathbf{input }\ y \\
x_1, x_2 = \mbox{split}(x) & y_1, y_2 = \mbox{split}(y)\\
y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2 - \mathcal{G}(y_1) \\
y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2)
\end{array}
$

where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolution, ReLU and Batch Normalization layers, analogous to the ones in a standard ResNet block, although the operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider splitting the input tensor across the channel dimension, as in [1, 2]. (A minimal code sketch of the block and its inverse is given below, before **Table 1**.)

Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However, the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture.

## Backpropagation

### Standard

The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters of a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $\overline{\theta_{n}} = \partial L / \partial \theta_n$ (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: in order to compute the gradients for the $n$-th block, backpropagation requires the input and output activations of this block, $y_{n - 1}$ and $y_{n}$, which have been stored during the forward pass, and the derivative of the loss with respect to the output, $\overline{y_{n}}$, which has been computed in the backpropagation step of the layer above; hence the name backpropagation.

### RevNet

Since activations are not stored in RevNet, the algorithm needs to be slightly modified, as described in the right column of **Table 1**.
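Before walking through Table 1, here is a minimal PyTorch sketch of the reversible block's forward and inverse transformations; the single 3x3 convolutions standing in for $\mathcal F$ and $\mathcal G$, as well as the class name, are simplifications for illustration, not the paper's residual functions:

```python
import torch

class ReversibleBlock(torch.nn.Module):
    """Additive-coupling reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.F = torch.nn.Conv2d(half, half, kernel_size=3, padding=1)
        self.G = torch.nn.Conv2d(half, half, kernel_size=3, padding=1)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)   # split across the channel dimension
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = torch.chunk(y, 2, dim=1)
        x2 = y2 - self.G(y1)                # recover the inputs without any stored activations
        x1 = y1 - self.F(x2)
        return torch.cat([x1, x2], dim=1)

# Sanity check: block.inverse(block(x)) == x, up to floating-point error.
```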
In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n - 1}} = (\overline{y_{n - 1, 1}}, \overline{y_{n - 1, 2}})$, using the fact that:

$
\begin{align}
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
\end{align}
$

Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities, we can obtain the gradients with respect to the parameters of this block, $\theta_n$.

$
\begin{array}{|c|l|l|}
\hline
& \mathbf{ResNet} & \mathbf{RevNet} \\
\hline
\mathbf{Block} & y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) & y_{n - 1, 1}, y_{n - 1, 2} = \mbox{split}(y_{n - 1})\\
&& y_{n, 1} = y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
&& y_{n, 2} = y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
&& y_{n} = (y_{n, 1}, y_{n, 2})\\
\hline
\mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\
\hline
\mathbf{Backprop} & \mathbf{in:}\ y_{n - 1}, y_{n}, \overline{y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n}}\\
& \overline{\theta_n} = \overline{y_n}\ \frac{\partial y_n}{\partial \theta_n} & \texttt{# recover activations} \\
& \overline{y_{n - 1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n - 1}} & y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\
& \mathbf{out:}\ \overline{\theta_n}, \overline{y_{n - 1}} & y_{n - 1, 2} = y_{n, 2} - \mathcal{G}(y_{n, 1})\\
&& y_{n - 1, 1} = y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
&& \texttt{# gradients wrt. inputs} \\
&& \overline{y_{n - 1, 1}} = \overline{y_{n, 1}} + \overline{y_{n, 2}}\ \frac{\partial \mathcal G}{\partial y_{n, 1}} \\
&& \overline{y_{n - 1, 2}} = \overline{y_{n, 1}}\ \frac{\partial \mathcal F}{\partial y_{n, 2}} + \overline{y_{n, 2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n, 2}} \frac{\partial \mathcal G}{\partial y_{n, 1}} \right) \\
&& \texttt{# gradients wrt. parameters} \\
&& \overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}}\ \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
&& \overline{\theta_{n, \mathcal F}} = \overline{y_{n, 1}}\ \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}}\ \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n, 1}}\\
&& \mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n - 1}}, y_{n - 1}\\
\hline
\end{array}
$

**Table 1:** Backpropagation in the standard case and for reversible blocks.

---

## Experiments

**Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computation. Compared to other methods that focus on reducing the memory requirements of deep networks, RevNet provides the best trade-off: no activations have to be stored, so the spatial complexity is $O(1)$, while the computational complexity is linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNet introduces additional parameters, as each block is composed of two residual functions, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two.
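As a rough illustration of the per-block backward pass in the right column of Table 1, here is a PyTorch sketch that recovers the activations and then reuses automatic differentiation for the local gradients. It is an illustration of the idea under assumed names and shapes, not the authors' implementation:

```python
import torch

def reversible_block_backward(F, G, y1, y2, y1_bar, y2_bar):
    """Backward pass for one reversible block without stored activations:
    (1) recover the inputs from the outputs via the inverse transformation,
    (2) rebuild the block's local graph and backpropagate the incoming gradients."""
    with torch.no_grad():                 # step 1: x2 = y2 - G(y1), x1 = y1 - F(x2)
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
    x1 = x1.detach().requires_grad_(True)
    x2 = x2.detach().requires_grad_(True)
    with torch.enable_grad():             # step 2: recompute the block forward pass locally
        z1 = x1 + F(x2)                   # z1 == y1 and z2 == y2, up to numerical error
        z2 = x2 + G(z1)
    # Accumulates parameter gradients in F and G, and input gradients in x1.grad, x2.grad.
    torch.autograd.backward([z1, z2], [y1_bar, y2_bar])
    return (x1.detach(), x2.detach()), (x1.grad, x2.grad)
```

In a full implementation, the recovered activations and their gradients returned here would then be passed to the preceding block, exactly as in Table 1.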
**Results.** In the experiments section, the authors compare ResNet architectures to their RevNet "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performance** to their ResNet counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of the floating-point errors that might occur when reconstructing activations rather than storing them; however, these errors are of small magnitude and do not seem to negatively impact the model.

To summarize, reversible networks seem like a very promising direction for efficiently training very deep networks under memory budget constraints.

---

## References

* [1] NICE: Non-linear Independent Components Estimation, Dinh et al., ICLR 2015
* [2] Density estimation using Real NVP, Dinh et al., ICLR 2017