ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality
Xingjun Ma and Bo Li and Yisen Wang and Sarah M. Erfani and Sudanthi Wijewickrema and Grant Schoenebeck and Dawn Song and Michael E. Houle and James Bailey
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.CR, cs.CV
more

[link] Summary by David Stutz 7 years ago

Ma et al. detect adversarial examples based on their estimated intrinsic dimensionality. I want to note that this work is also similar to [1] – in both publications, local intrinsic dimensionality is used to analyze adversarial examples. Specifically, the intrinsic dimensionality of a sample is estimated based on the radii $r_i(x)$ of the $k$-nearest neighbors around a sample $x$:

$- \left(\frac{1}{k} \sum_{i = 1}^k \log \frac{r_i(x)}{r_k(x)}\right)^{-1}$.

For details regarding the original, theoretical formulation of local intrinsic dimensionality I refer to the paper. In experiments, the authors show that adversarial examples exhibit a significant higher intrinsic dimensionality than training samples or randomly perturbed examples. This observation allows detection of adversarial examples. A proper interpretation of this finding is, however, missing. It would be interesting to investigate what this finding implies about the properties of adversarial examples.

arxiv.org
arxiv-vanity.com
scholar.google.com

Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning
Nicolas Papernot and Patrick McDaniel
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, stat.ML
more

[link] Summary by David Stutz 6 years ago

Papernot and McDaniel introduce deep k-nearest neighbors where nearest neighbors are found at each intermediate layer in order to improve interpretbaility and robustness. Personally, I really appreciated reading this paper; thus, I will not only discuss the actually proposed method but also highlight some ideas from their thorough survey and experimental results.

First, Papernot and McDaniel provide a quite thorough survey of relevant work in three disciplines: confidence, interpretability and robustness. To the best of my knowledge, this is one of few papers that explicitly make the connection of these three disciplines. Especially the work on confidence is interesting in the light of robustness as Papernot and McDaniel also frequently distinguish between in-distribution and out-distribution samples. Here, it is commonly known that deep neural networks are over-confidence when moving away from the data distribution.

The deep k-nearest neighbor approach is described in Algorithm 1 and summarized in the following. For a trained model and a training set of labeled samples, they first find k nearest neighbors for each intermediate layer of the network. The layer nonconformity with a specific label $j$, referred to as $\alpha$ in Algorithm 1, is computed as the number of labels that in the set of nearest neighbors that do not share this label. By comparing these nonconformity values to a set of reference values (computing over a set of labeled calibration data), the prediction can be refined. In particular, the probability for label $j$ can be computed as the fraction of reference nonconformity values that are higher than the computed one. See Algorthm 1 or the paper for details.

https://i.imgur.com/RA6q1VI.png
https://i.imgur.com/CkRf8ex.png
Algorithm 1: The deep k-nearest neighbor algorithm and an illustration.

Finally, they provide experimental results – again considering the three disciplines of confidence/credibility, interpretability and robustness. The main take-aways are that the resulting confidences are more reliable on out-of-distribution samples, which also include adversarial examples. Additioanlly, the nearest neighbor allow very basic interpretation of the predictions.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

openreview.net
scholar.google.com

Towards Robust, Locally Linear Deep Networks
Lee, Guang-He and Alvarez-Melis, David and Jaakkola, Tommi S.
International Conference on Learning Representations - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 5 years ago

Lee et al. propose a regularizer to increase the size of linear regions of rectified deep networks around training and test points. Specifically, they assume piece-wise linear networks, in its most simplistic form consisting of linear layers (fully connected layers, convolutional layers) and ReLU activation functions. In these networks, linear regions are determined by activation patterns, i.e., a pattern indicating which neurons have value greater than zero. Then, the goal is to compute, and later to increase, the size $\epsilon$ such that the $L_p$-ball of radius $\epsilon$ around a sample $x$, denoted $B_{\epsilon,p}(x)$ is contained within one linear region (corresponding to one activation pattern). Formally, letting $S(x)$ denote the set of feasible inputs $x$ for a given activation pattern, the task is to determine

$\hat{\epsilon}_{x,p} = \max_{\epsilon \geq 0, B_{\epsilon,p}(x) \subset S(x)} \epsilon$.

For $p = 1, 2, \infty$, the authors show how $\hat{\epsilon}_{x,p}$ can be computed efficiently. For $p = 2$, for example, it results in

$\hat{\epsilon}_{x,p} = \min_{(i,j) \in I} \frac{|z_j^i|}{\|\nabla_x z_j^i\|_2}$.

Here, $z_j^i$ corresponds to the $j$th neuron in the $i$th layer of a multi-layer perceptron with ReLU activations; and $I$ contains all the indices of hidden neurons. This analytical form can then used to add a regularizer to encourage the network to learn larger linear regions:

$\min_\theta \sum_{(x,y) \in D} \left[\mathcal{L}(f_\theta(x), y) - \lambda \min_{(i,j) \in I} \frac{|z_j^i|}{\|\nabla_x z_j^i\|_2}\right]$

where $f_\theta$ is the neural network with paramters $\theta$. In the remainder of the paper, the authors propose a relaxed version of this training procedure that resembles a max-margin formulation and discuss efficient computation of the involved derivatives $\nabla_x z_j^i$ without too many additional forward/backward passes.

https://i.imgur.com/jSc9zbw.jpg
Figure 1: Visualization of locally linear regions for three different models on toy 2D data.

On toy data and datasets such as MNIST and CalTech-256, it is shown that the training procedure is effective in the sense that larger linear regions around training and test points are learned. For example, on a 2D toy dataset, Figure 1 visualizes the linear regions for the optimal regularizer as well as the proposed relaxed version.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

openaccess.thecvf.com
sci-hub
scholar.google.com

Deep High-Resolution Representation Learning for Human Pose Estimation
Sun, Ke and Xiao, Bin and Liu, Dong and Wang, Jingdong
Conference and Computer Vision and Pattern Recognition - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by Oleksandr Bailo 6 years ago

This paper is a top-down (i.e. requires person detection separately) pose estimation method with a focus on improving high-resolution representations (features) to make keypoint detection easier.

During the training stage, this method utilizes annotated bounding boxes of person class to extract ground truth images and keypoints. The data augmentations include random rotation, random scale, flipping, and [half body augmentations](http://presentations.cocodataset.org/ECCV18/COCO18-Keypoints-Megvii.pdf) (feeding upper or lower part of the body separately). Heatmap learning is performed in a typical for this task approach of applying L2 loss between predicted keypoint locations and ground truth locations (generated by applying 2D Gaussian with std = 1).

During the inference stage, pre-trained object detector is used to provide bounding boxes. The final heatmap is obtained by averaging heatmaps obtained from the original and flipped images. The pixel location of the keypoint is determined by $argmax$ heatmap value with a quarter offset in the direction to the second-highest heatmap value.

While the pipeline described in this paper is a common practice for pose estimation methods, this method can achieve better results by proposing a network design to extract better representations. This is done through having several parallel sub-networks of different resolutions (next one is half the size of the previous one) while repeatedly fusing branches between each other:
https://raw.githubusercontent.com/leoxiaobin/deep-high-resolution-net.pytorch/master/figures/hrnet.png

The fusion process varies depending on the scale of the sub-network and its location in relation to others:
https://i.imgur.com/mGDn7pT.png

arxiv.org
scholar.google.com

Communication-Efficient Learning of Deep Networks from Decentralized Data
McMahan, H. Brendan and Moore, Eider and Ramage, Daniel and Hampson, Seth and Arcas, Blaise Agüera y
- 2016 via Local Bibsonomy
Keywords: distributed, deep_learning, hpc

[link] Summary by CodyWild 4 years ago

Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server.  

Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by: 

- Sending a copy of the current model to each device/client
- Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device
- Sending that gradient back to the central server
- Averaging those gradients and applying them all at once to a central model

The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. 

Specifically, their algorithm, FedAvg, works by: 

- Dividing the data on a given device into batches of size B
- Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server
- Repeating this for E epochs

Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. 

The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits). 

https://i.imgur.com/xq9vi8S.png

They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases.