Welcome to ShortScience.org! |
[link]
Chen et al. propose a gradient-based black-box attack to compute adversarial examples. Specifically, they follow the general idea of [1] where the following objective is optimized: $\min_x \|x – x_0\|_2 + c \max\{\max_{i\neq t}\{z_i\} – z_t, - \kappa\}$. Here, $x$ is the adversarial example based on training sample $x_0$. The second part expresses that $x$ is supposed to be misclassified, i.e. the logit $z_i$ for some $i \neq t$ distinct form the true label $t$ is supposed to be larger that the logit $z_t$ corresponding to the true label. This is optimized subject to the constraint that $x$ is a valid image. The attack proposed in [1] assumes a white-box setting were we have access to the logits and the gradients (basically requiring access to the full model). Chen et al., in contrast want to design a black-box attacks. Therefore, they make the following changes: - Instead of using logits $z_i$, the probability distribution $f_i$ (i.e. the actual output of the network) is used. - Gradients are approximated by finite differences. Personally, I find that the first point does violate a strict black-box setting. As company, for example, I would prefer not to give away the full probability distribution but just the final decision (or the decision plus a confidence score). Then, however, the proposed method is not applicable anymore. Anyway, the changed objective looks as follows: $\min_x \|x – x_0\|_2 + c \max\{\max_{i\neq t}\{\log f_i\} – \log f_t, - \kappa\}$ where, according to the authors, the logarithm is essential for optimization. One remaining problem is efficient optimization with finite differences. To this end, they propose a randomized/stochastic coordinate descent algorithm. In particular, in each step, a ranodm pixel is chosen and a local update is performed by calculating the gradient on this pixel using finite differences and performing an ADAM step. [1] N. Carlini, D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium of Security and Privacy, 2017. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Ulyanov et al. utilize untrained neural networks as regularizer/prior for various image restoration tasks such as denoising, inpainting and super-resolution. In particualr, the standard formulation of such tasks, i.e. $x^\ast = \arg\min_x E(x, x_0) + R(x)$ where $x_0$ is the input image and $E$ a task-dependent data term, is rephrased as follows: $\theta^\ast = \arg\min_\theta E(f_\theta(z); x_0)$ and $x^\ast = f_{\theta^\ast}(z)$ for a fixed but random $z$. Here, the regularizer $R$ is essentially replaced by an untrained neural network $f_\theta$ – usually in the form of a convolutional encoder. The authors argue that the regualizer is effectively $R(x) = 0$ if the image can be generated by the encoder from the fixed code $z$ and $R(x) = \infty$ if not. However, this argument does not necessarily provide any insights on why this approach works (as demonstrated in the paper). A main question addressed in the paper is why the network $f_\theta$ can be used as a prior – regarding the assumption that high-capacity networks can essentially fit any image (including random noise). In my opinion, the authors do not give a convincing answer to this question. Essentially, they argue that random noise is just harder to fit (i.e. it takes longer). Therefore, limiting the number of iterations is enough as regularization. Personally I would argue that this observation is mainly due to prior knowledge put into the encoder architecture and the idea that natural images (or any images with some structure) are easily embedded into low-dimensional latent spaced compared to fully I.i.d. random noise. They provide experiments on a range of tasks including denoising, image inpainting, super-resolution and neural network “inversion”. Figure 1 shows some results for image inpainting that I found quite convincing. For the remaining experiments I refer to the paper. https://i.imgur.com/BVQsaup.png Figure 1: Qualitative results for image inpainting. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Hein and Andriushchenko give a intuitive bound on the robustness of neural networks based on the local Lipschitz constant. With robustness, the authors refer a small $\epsilon$-ball around each sample; this ball is supposed to describe the region where the neural network predicts a constant class. This means that adversarial examples have to compute changes large enough to leave these robust areas. Larger $\epsilon$-balls imply higher robustness to adversarial examples. When considering a single example $x$, and a classifier $f = (f_1, \ldots, f_K)^T$ (i.e. in a multi-class setting), the bound can be stated as follows. For $q$ and $p$ such that $\frac{1}{q} + \frac{1}{p} = 1$ and $c$ being the class predicted for $x$, the it holds $x = \arg\max_j f_j(x + \delta)$ for all $\delta$ with $\|\delta\|_p \leq \max_{R > 0}\min \left\{\min_{j \neq c} \frac{f_c(x) – f_j(x)}{\max_{y \in B_p(x, R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q}, R\right\}$. Here, $B_p(x, R)$ describes the $R$-ball around $x$ measured using the $p$-norm. Based on the local Lipschitz constant (in the denominator), the bound essentially measures how far we can deviate from the sample $x$ (measured in the $p$-norm) until $f_j(x) > f_c(x)$ for some $j \neq c$. The higher the local Lipschitz constant, the smaller deviations are allowed, i.e. adversarial examples are easier to find. Note that the bound also depends on the confidence, i.e. the edge $f_c(x)$ has in comparison to all other $f_j(x)$. In the remaining paper, the authors also provide bounds for simple classifiers including linear classifiers, kernel methods and two-layer perceptrons (i.e. one hidden layer). For the latter, they also propose a new type of regularization called cross-Lipschitz regularization: $P(f) = \frac{1}{nK^2} \sum_{i = 1}^n \sum_{l,m = 1}^K \|\nabla f_l(x_i) - \nabla f_m(x_i)\|_2^2$. This regularization term is intended to reduce the Lipschitz constant locally around training examples. They show experimental results using this regularization on MNIST and CIFAR, see the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by: - Sending a copy of the current model to each device/client - Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device - Sending that gradient back to the central server - Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by: - Dividing the data on a given device into batches of size B - Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server - Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. |
[link]
This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an end-to-end trained neural network architecture. Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters $\bf{u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling. While previous work has focused on using rank pooling on hand-designed and fixed representations, this paper proposes to use ConvNet features (pre-trained on ImageNet) for the representation and backpropagate through rank pooling to fine-tune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21). Finally, since its gradient requires inverting a matrix (corresponding to a hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rank-one matrices (see Lemma 3) or to simply use an approximation based on using a diagonal hessian. In experiments on two small scale video activity recognition datasets (UCF-Sports and Hollywood2), the authors show that fine-tuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling. **My two cents** This paper was eye opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought provoking approach to pooling representations in a way that preserves some organizational information about the original representations. I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guaranties on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability. I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available. Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).
2 Comments
|