[link]
This paper deals with an important problem: making a deep classification system explainable. After the (continuing) success of deep networks, researchers are trying to open the black box, and this work is one of the foremost efforts in that direction. The authors use the strength of one deep learning method (a vision-language model) to explain the decisions of another deep learning model (an image classifier). The approach jointly predicts a class label and explains in natural language why it predicted that label.

The paper starts with an important distinction between two basic schools of *explanation* systems: the *introspection* explanation system and the *justification* explanation system. An introspection system looks into the model to get an explanation (e.g., "This is a Western Grebe because filter 2 has a high activation..."). A justification system, on the other hand, justifies the decision by producing a sentence detailing how the visual evidence is compatible with the system output (e.g., "This is a Western Grebe because it has red eyes..."). The paper focuses on *justification* explanation systems and proposes a novel one. The authors argue that, unlike a description of an image or a sentence defining a class (not necessarily in the presence of an image), a visual explanation, conditioned on an input image, explains why the image is classified as a certain category while mentioning only image-relevant features.

The broad outline of the approach is given in Fig. 2 of the paper: https://i.imgur.com/tta2qDp.png. The first stage is a deep convolutional network for classification which generates a softmax distribution over the classes. As the task is fine-grained bird species classification, it uses a compact bilinear feature representation known to work well for fine-grained classification tasks. The second stage is a stacked LSTM which generates natural language sentences, i.e., explanations justifying the decision of the first stage. The first LSTM of the stack receives the previously generated word. The second LSTM receives the output of the first LSTM along with the image features and the predicted label distribution from the classification network, and produces output words until an "end-of-sentence" token is generated. The intuition behind conditioning on the predicted label distribution is that it informs the explanation model which words and attributes are more likely to occur in the description.

Two kinds of losses are used for the second stage, i.e., the language model. The first is termed the *Relevance Loss*, the typical sentence generation loss seen in the literature: the sum of cross-entropy losses of the generated words with respect to the ground truth words. Its role is to optimize the alignment between generated and ground truth sentences. However, this loss is not very effective at producing sentences that include class-discriminative information, because class specificity is a global sentence property. This is illustrated with the following example: *whereas the sentence "This is an all black bird with a bright red eye" is class specific to a "Bronzed Cowbird", words and phrases in the sentence, such as "black" or "red eye", are less class discriminative on their own.* As a result, a cross-entropy loss on individual words is less effective at capturing global sentence properties, of which class specificity is an example.
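In symbols (my notation, not necessarily the paper's): for a ground-truth explanation $w_1, \dots, w_T$ of image $I$ with predicted class distribution $C$, the relevance loss is the usual per-word cross entropy,

$$\mathcal{L}_{rel} = -\sum_{t=1}^{T} \log p(w_t \mid w_{1:t-1}, I, C).$$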
The authors address this issue by proposing an additional loss, termed the *Discriminative Loss*, which is based on a reinforcement learning paradigm. Before computing the loss, a sentence is sampled. The sentence is passed through an LSTM-based classification network whose task is to recover the ground truth category $C$ given only the sampled sentence. The reward for this operation is simply the probability of the ground truth category $C$ given only the sentence. The intuition is that for the model to obtain a large reward, the generated sentence must include enough information to classify the original image properly. The *Discriminative Loss* is the expectation of the negative of this reward, and a weighted linear combination of the two losses is optimized during training.

My experience in reinforcement learning is limited. However, I must say I did not quite get why sampling of the sentences is required (which called for the special backpropagation algorithm). If the idea is to check whether a generated sentence can be used to recover the ground truth category, could the last internal state of one of the stacked LSTMs not be used instead? It would have been better to get some more intuition behind the sampling operation. Another thing which (is fairly obvious but still I felt) is missing is the loss used in the fine-grained classification network.

The experimentation is rigorous. The proposed method is compared with four different baseline and ablation models - description, definition, explanation-label, and explanation-discriminative - with different permutations and combinations of the two types of losses, class prediction information, etc. The evaluation metrics also measure different qualities of the generated explanations, specifically image relevance and class relevance. To measure image relevance, METEOR/CIDEr scores of the generated sentences against the ground truth (image-based) explanations are computed. To measure class relevance, CIDEr scores against class definition sentences (not necessarily based on images from the dataset) are computed. The proposed approach consistently outperforms all the baseline and ablation methods.

I would specifically mention one experiment where the effect of class conditioning is studied (end of Sec 5.2). The finding is quite interesting, as it shows that providing correct versus incorrect class information has a drastic effect on the generated explanations. Giving incorrect class information makes the explanation model hallucinate colors or attributes which are not present in the image but are specific to the (incorrect) class. This raises the question of whether it is worth providing the class information when the classifier is poor in the first place. But I think the answer lies in the observation that row 5 (with class prediction information) in Table 1 is always better than row 4 (no class prediction information). Since row 5 is better than row 4, the classifier must be reasonable, which in turn suggests that end-to-end training can improve all the stages of a pipeline and ultimately the overall performance of the system too!

In summary, the paper is a very good first step towards explaining intelligent systems and should encourage a lot more effort in this direction.
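On the sampling question above: my understanding (not spelled out in the paper) is that the sampled sentence is a discrete object, so the reward cannot be backpropagated through it directly; the REINFORCE/score-function trick instead weights the log-probability of the sampled sentence by its reward. A minimal sketch of that estimator, with every name (`generator`, `sentence_classifier`) hypothetical:

```python
import torch

# Hypothetical pieces: `generator.sample` draws an explanation token-by-token
# and returns the summed log-probability of the sampled words;
# `sentence_classifier` maps a sampled sentence to a class distribution.
def discriminative_loss(generator, sentence_classifier,
                        image_feats, label_dist, true_class):
    # Sampling a sentence is discrete, hence non-differentiable on its own.
    tokens, log_prob = generator.sample(image_feats, label_dist)
    # Reward: probability of the ground-truth class given only the sentence.
    with torch.no_grad():
        reward = sentence_classifier(tokens)[true_class]
    # Score-function estimator: grad E[R] = E[R * grad log p(sentence)],
    # so minimising -R * log_prob follows the reward gradient in expectation.
    return -reward * log_prob

# Full objective (weighted combination, weight `lambda_` is mine):
# loss = relevance_loss + lambda_ * discriminative_loss(...)
```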
[link]
#### Introduction
* The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems (in an unsupervised setting).
* [Link to the paper](https://arxiv.org/abs/1603.08023)
#### Evaluation Metrics Considered
##### Word Based Similarity Metric
###### BLEU
* Analyses the co-occurrences of n-grams in the ground truth and the proposed responses.
* BLEU-N: N-gram precision for the entire dataset.
* Brevity penalty added to avoid bias towards short sentences.
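For reference, the standard BLEU-N definition (not specific to this paper): with modified n-gram precisions $p_n$, uniform weights $w_n = 1/N$, reference length $r$ and candidate length $c$,

$$\text{BLEU-N} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad BP = \min\!\left(1,\, e^{1 - r/c}\right).$$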
###### METEOR
* Creates an explicit alignment between candidate and target response (using WordNet synonyms, stemmed tokens, etc.).
* Compute the harmonic mean of precision and recall between proposed and ground truth.
###### ROUGE
* F-measure based on Longest Common Subsequence (LCS) between candidate and target response.
##### Embedding Based Metric
###### Greedy Matching
* Each token in the actual response is greedily matched with a token in the predicted response based on cosine similarity of word embeddings (and vice-versa).
* The total score is then averaged over all words.
###### Embedding Average
* Calculate sentence level embedding by averaging word level embeddings
* Compare sentence level embeddings between candidate and target sentences.
###### Vector Extrema
* For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
* Idea is that by taking the maxima along each dimension, we can ignore the common words (which will be pulled towards the origin in the vector space).
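A minimal numpy sketch of all three embedding-based metrics, assuming `emb` is a dict from token to pre-trained word vector (all names mine, not the paper's):

```python
import numpy as np

def _cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_one_way(cand, ref, emb):
    # For each reference token, take the best cosine match in the candidate,
    # then average over reference tokens.
    return np.mean([max(_cos(emb[r], emb[c]) for c in cand) for r in ref])

def greedy_matching(cand, ref, emb):
    # Symmetrised: average of both matching directions.
    return 0.5 * (greedy_one_way(cand, ref, emb) + greedy_one_way(ref, cand, emb))

def embedding_average(cand, ref, emb):
    def avg(tokens):
        v = np.mean([emb[t] for t in tokens], axis=0)
        return v / np.linalg.norm(v)
    return avg(cand) @ avg(ref)          # cosine of the two sentence vectors

def vector_extrema(cand, ref, emb):
    def extrema(tokens):
        m = np.stack([emb[t] for t in tokens])       # (n_tokens, dim)
        idx = np.argmax(np.abs(m), axis=0)           # most extreme value per dim
        v = m[idx, np.arange(m.shape[1])]
        return v / np.linalg.norm(v)
    return extrema(cand) @ extrema(ref)
```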
#### Dialogue Models Considered
##### Retrieval Models
###### TF-IDF
* Compute the TF-IDF vectors for each context and response in the corpus.
* C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response with the highest score.
* R-TFIDF computes the cosine similarity between the input context and each response directly.
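A rough sketch of the two retrieval variants with scikit-learn (function names mine; `contexts[i]` and `responses[i]` are paired strings from the corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def r_tfidf(input_context, contexts, responses):
    vec = TfidfVectorizer().fit(contexts + responses)
    q = vec.transform([input_context])
    # R-TFIDF: score the input context against each response directly.
    sims = cosine_similarity(q, vec.transform(responses))
    return responses[sims.argmax()]

def c_tfidf(input_context, contexts, responses):
    vec = TfidfVectorizer().fit(contexts + responses)
    q = vec.transform([input_context])
    # C-TFIDF: find the most similar *context*, return its paired response.
    sims = cosine_similarity(q, vec.transform(contexts))
    return responses[sims.argmax()]
```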
###### Dual Encoder
* Two RNNs which respectively compute the vector representation of the input context and response.
* Then calculate the probability that the given response is the ground truth response for that context.
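If I recall the dual encoder of Lowe et al. correctly, that probability is a bilinear match between the context encoding $c$ and the response encoding $r$, $p(\text{match} \mid c, r) = \sigma(c^\top M r)$, with the matrix $M$ learned jointly with the two RNNs.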
##### Generative Models
###### LSTM language model
* LSTM model trained to predict the next word in the (context, response) pair.
* Given a context, model encodes it with the LSTM and generates a response using a greedy beam search procedure.
###### Hierarchical Recurrent Encoder-Decoder (HRED)
* Uses a hierarchy of encoders.
* Each utterance in the context passes through an 'utterance-level' encoder, and the outputs of these encoders are passed through a 'context-level' encoder whose final state conditions the decoder.
* Better handling of long-term dependencies as compared to the conventional Encoder-Decoder.
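A minimal PyTorch sketch of that two-level encoder hierarchy (shapes, layer sizes, and class names are my assumptions):

```python
import torch.nn as nn

class HREDEncoder(nn.Module):
    """Sketch: utterance-level GRU over tokens, context-level GRU over
    the resulting utterance vectors."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.utterance_rnn = nn.GRU(dim, dim, batch_first=True)
        self.context_rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, dialogue):              # (n_utterances, max_len) token ids
        _, h = self.utterance_rnn(self.embed(dialogue))
        utt_vecs = h[-1].unsqueeze(0)          # (1, n_utterances, dim)
        _, ctx = self.context_rnn(utt_vecs)
        return ctx[-1]                         # (1, dim): conditions the decoder
```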
#### Observations
* Human survey to determine the correlation between human judgement on the quality of responses, and the score assigned by each metric.
* Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
* Best performing metric:
* Using word-overlaps - BLEU-2 score
* Using word embeddings - vector average
* Embedding-based metrics would benefit from a weighting of word saliency.
* BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.
[link]
* DCGANs are just a different architecture of GANs.
* In GANs a Generator network (G) generates images. A discriminator network (D) learns to differentiate between real images from the training set and images generated by G.
* DCGANs essentially collapse the Laplacian pyramid technique (many pairs of G and D that progressively upscale an image) into a single pair of G and D.
### How
* Their D: a convolutional network. No linear layers. No pooling layers; strided convolutions instead. LeakyReLUs.
* Their G: starts with a 100d noise vector, which a linear layer projects to 1024x4x4 values. Fractionally strided convolutions (which move by 0.5 per step) then upscale to 512x8x8, and so on up to Cx32x32 or Cx64x64. The last layer is a convolution to 3x32x32/3x64x64 with a Tanh activation.
* The fractionally strided convolutions do essentially the same job as the progressive upscaling in the Laplacian pyramid, so it's basically one Laplacian pyramid inside a single network, with all upscalers trained jointly - leading to higher quality images.
* They use Adam as their optimizer. To reduce instability they lowered the learning rate to 0.0002 (from 0.001) and the momentum/beta1 to 0.5 (from 0.9).

*Architecture of G using fractionally strided convolutions to progressively upscale the image.*
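A minimal PyTorch sketch of such a generator; the exact channel progression and kernel sizes here are my assumptions, not the paper's table:

```python
import torch.nn as nn

def up_block(c_in, c_out):
    # Fractionally strided ("transposed") convolution: doubles spatial size.
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

generator = nn.Sequential(
    # Project the 100-d noise vector (shaped N x 100 x 1 x 1) to 1024x4x4.
    nn.ConvTranspose2d(100, 1024, kernel_size=4, stride=1, padding=0),
    nn.BatchNorm2d(1024),
    nn.ReLU(),
    up_block(1024, 512),   # 8x8
    up_block(512, 256),    # 16x16
    up_block(256, 128),    # 32x32
    nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),  # 3x64x64
    nn.Tanh(),
)
```

With the optimizer settings noted above, training would use something like `torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))`.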
### Results
* High quality images. Still with distortions and errors, but at first glance they look realistic.
* Smooth interpolations between generated images are possible (by interpolating between the noise vectors and feeding these interpolations into G).
* The features extracted by D seem to have some potential for unsupervised learning.
* There seems to be some potential for vector arithmetic (using the initial noise vectors), similar to the vector arithmetic with word vectors. E.g. to generate men with sunglasses via `vector(men) + vector(sunglasses)`.
")
*Generated images, bedrooms.*
")
*Generated images, faces.*
### Rough chapter-wise notes
* Introduction
* For unsupervised learning, they propose to train a GAN and then reuse the weights of D.
* GANs have traditionally been hard to train.
* Approach and model architecture
* They use for D a convnet without linear layers, without pooling layers (only strides), with LeakyReLUs and Batch Normalization.
* They use for G ReLUs (hidden layers) and Tanh (output).
* Details of adversarial training
* They trained on LSUN, Imagenet-1k and a custom dataset of faces.
* Minibatch size was 128.
* LeakyReLU alpha 0.2.
* They used Adam with a learning rate of 0.0002 and momentum of 0.5.
* They note that a higher momentum led to oscillations.
* LSUN
* 3M images of bedrooms.
* They use an autoencoder based technique to filter out 0.25M near duplicate images.
* Faces
* They downloaded 3M images of 10k people.
* They extracted 350k faces with OpenCV.
* Empirical validation of DCGANs capabilities
* Classifying CIFAR-10 using GANs as a feature extractor
* They train a pair of G and D on Imagenet-1k.
* D's top layer has `512*4*4` features.
* They train an SVM on these features to classify the images of CIFAR-10.
* They achieve a score of 82.8%, better than unsupervised K-Means based methods, but worse than Exemplar CNNs.
* Classifying SVHN digits using GANs as a feature extractor
* They reuse the same pipeline (D trained on CIFAR-10, SVM) for the StreetView House Numbers dataset.
* They use 1000 SVHN images (with the features from D) to train the SVM.
* They achieve 22.48% test error.
* Investigating and visualizing the internals of the networks
* Walking in the latent space
* They perform walks in the latent space (i.e. interpolate between input noise vectors and generate several images along the interpolation).
* They argue that this might be a good way to detect overfitting/memorizations as those might lead to very sudden (not smooth) transitions.
* Visualizing the discriminator features
* They use guided backpropagation to visualize what the feature maps in D have learned (i.e. to which images they react).
* They can show that their LSUN-bedroom GAN seems to have learned in an unsupervised way what beds and windows look like.
* Forgetting to draw certain objects
* They manually annotated the locations of objects in some generated bedroom images.
* Based on these annotations they estimated which feature maps were mostly responsible for generating the objects.
* They deactivated these feature maps and regenerated the images.
* That decreased the appearance of these objects. However, it's not as simple as one feature map deactivation leading to one object disappearing: they deactivated quite a lot of feature maps (200), and the objects were often still partially visible or replaced by artefacts/errors.
* Vector arithmetic on face samples
* Wordvectors can be used to perform semantic arithmetic (e.g. `king - man + woman = queen`).
* The unsupervised representations seem to be useable in a similar fashion.
* E.g. they generated images via G, then picked several images that showed men with glasses and averaged those images' noise vectors. They did the same with men without glasses and with women without glasses. Then they computed `men with glasses - men without glasses + women without glasses` on these average vectors to get `women with glasses` (see the sketch below).
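A tiny sketch of that arithmetic, reusing the hypothetical `generator` from the sketch above (`men_with_glasses` etc. would be hand-picked lists of 100-d noise vectors):

```python
import torch

def avg_z(zs):
    # Average the noise vectors of several hand-picked samples.
    return torch.stack(zs).mean(dim=0)

z_new = (avg_z(men_with_glasses)
         - avg_z(men_without_glasses)
         + avg_z(women_without_glasses))
image = generator(z_new.view(1, 100, 1, 1))  # hopefully: a woman with glasses
```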
[link]
This work expands on prior techniques for designing models that can both be stored using fewer parameters and executed using fewer operations and less memory, both of which are key desiderata for making trained machine learning models usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of depthwise and pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, the authors continue to use this basic depthwise infrastructure, but add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically and naively, one could just use lower-dimensional internal representations to match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), so having layers contain only the number of dimensions needed for the manifold would leave too few dimensions after the ReLU's information loss. And yet the network needs non-linearities somewhere in order to learn complex, non-linear functions. So the authors suggest a method that mostly uses smaller-dimensional representations internally while still maintaining ReLUs and the complexity the network needs:

https://i.imgur.com/pN4d9Wi.png

- A lower-dimensional input is "projected up" into a higher-dimensional representation
- A ReLU is applied to this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection links the lower-dimensional representations at the beginning and end of the expansion

This way, the network keeps its non-linearity, but some of its higher-dimensional layers are replaced with lower-dimensional linear ones (a sketch of the block follows below).
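A minimal PyTorch sketch of the block described above; the expansion factor, kernel sizes, and use of ReLU6 are my assumptions about the concrete instantiation:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted-residual linear bottleneck: expand, non-linearity,
    depthwise conv, then a *linear* projection back down."""
    def __init__(self, dim, expansion=6):
        super().__init__()
        hidden = dim * expansion
        self.block = nn.Sequential(
            # 1) Project up: 1x1 conv into the higher-dimensional space.
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 2) Depthwise conv: "pull information from a spatial range".
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) Project down with no ReLU, avoiding information loss.
            nn.Conv2d(hidden, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        # 4) Residual connection between the two low-dimensional ends.
        return x + self.block(x)
```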
[link]
I admit it - the title of the paper pulled me in, existing as it does in the chain of weirdly insider-meme papers starting with Vaswani’s 2017 “Attention Is All You Need”. That paper has been hugely influential, and the domain of machine translation as a whole has begun to move away from processing (or encoding) source sentences with recurrent architectures and towards processing them with self-attention architectures. (Self-attention is a little too nuanced to go into in full depth here, but the basic idea is: instead of summarizing varying-length sequences by feeding each timestep into a recurrent loop and building up hidden states, generate a query, and weight the contribution of each timestep to each “hidden state” based on the dot product between that query and each timestep’s representation.) There has been an overall move in recent years away from recurrence being the accepted default for sequence data, and towards attention and (often dilated) convolution taking up more space. I find this an interesting set of developments, and had hopes that this paper would address that arc.

Unfortunately, however, the title is quite out of sync with the actual focus of the paper. Instead of addressing the contribution of attention mechanisms vs recurrence, or even directly addressing any of the particular ideas posed in the “Attention Is All You Need” paper, this paper (“You May Not Need Attention”, YMNNA hereafter) takes aim at a more fundamental structural feature of translation models: the encoder/decoder structure.

The basic idea of an encoder/decoder approach, in a translation paradigm, is that you process the entire source sentence before you start generating the tokens of the predicted, other-language target sentence. Initially, this worked by running an RNN over the full sentence and using the final hidden state of that RNN as a compressed representation of the full sentence. More recently, the norm has been to use multiple layers of RNN, to represent the source sentence via the hidden states at each timestep (so: as many hidden states as you have input tokens), and then, at each step of the decoding process, to calculate an attention-weighted average over all of those hidden states. But fundamentally, both of these structures share the fact that some kind of global representation is calculated and made available to the decoder before it starts predicting words in the output sentence.

This makes sense for a few reasons. First, and most obviously, languages aren’t naturally aligned with one another, in the sense of one word in language X corresponding to one word in language Y; it’s not possible to predict a word in the target sentence if its corresponding source token has not yet been processed. Second, there can be contextual information from the sentence as a whole that disambiguates between different senses of a word, which may have different translations - think Teddy Bear vs Teddy Roosevelt.

However, this paper poses the question: how well can you do if you throw away this structure and build a model that continually emits tokens of the target sequence as it reads in the source sentence? Using a recurrent model, the YMNNA model takes, at each timestep, the new source token, the previous target token, and the prior hidden state from the last timestep of the RNN, and uses these to predict a token (sketched below).
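A schematic of that eager step in PyTorch (entirely my paraphrase of the idea; sizes and names are made up):

```python
import torch
import torch.nn as nn

class EagerDecoderStep(nn.Module):
    """One timestep: consume a source token, condition on the previous
    target token and hidden state, emit a distribution over target tokens."""
    def __init__(self, src_vocab, tgt_vocab, dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.rnn = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tok, prev_tgt_tok, state):
        # Concatenate the newly read source token and the previous target token.
        x = torch.cat([self.src_embed(src_tok),
                       self.tgt_embed(prev_tgt_tok)], dim=-1)
        h, c = self.rnn(x, state)
        # The output vocabulary includes an epsilon/buffer token (see below).
        return self.out(h), (h, c)
```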
However, that problem mentioned earlier - of languages not natively being aligned such that you have the necessary information to predict a word by the time you get to its point in the target sequence - hasn’t gone away, and is still alive and kicking. This paper solves it in a pretty unsatisfying way: by relying on an external tool, fast-align, that does the work of guessing which source tokens correspond to which target tokens, and inserting buffer tokens into the target, so that you don’t need to predict a word until it has already been seen by the source-reading RNN; until then you just predict the buffer (a sketch of this preprocessing follows below). This is fine and clever as a practical heuristic, but it really does make their comparisons against models that do alignment and translation jointly feel a little weak.

https://i.imgur.com/Gitpxi7.png

An additional heuristic that makes the overall narrative of the paper less compelling is the fact that, in order to get comparable performance to their baselines, they padded the target sequences with between 3 and 5 buffer tokens, meaning that the models learned they could process the first 3-5 tokens of the sentence before needing to start emitting the target. Again, there’s nothing necessarily wrong with this, but since they consume a portion of the sentence before emitting translations, it does make for a less stark comparison with the “read the whole sentence” encoder/decoder framework.

A few other frustrations, and notes from the paper’s results section:

- As mentioned earlier, the authors don’t actually compare their work against the “Attention Is All You Need” paper, but instead to a 2014 paper. This is confusing both in terms of using an old baseline for SOTA, and in terms of their title implicitly arguing they are refuting a paper they didn’t compare to.
- Comparing against their old baseline, their eager translation model performs worse on all sentences less than 60 tokens in length (which make up the vast majority of sentences), and only beats the baseline on sentences > 60 tokens in length.
- Additionally, they note as a sort of throwaway line that their model took almost three times as long to train as the baseline, with the same number of parameters, simply because it took so much longer to converge.

Being charitable, there is some argument that an eager translation framework performs well on long sentences, and can do so while only keeping a single hidden state in memory, rather than keeping the hidden states for each source sequence element around, as attention-based decoders require. Overall, however, I found this paper a frustrating let-down that used too many heuristics and hacks to be a compelling comparison to prior work.
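For concreteness, a toy version of the buffer-insertion preprocessing; this is my guess at the mechanics, with fast-align only supplying the `aligned_src_positions`:

```python
EPS = "<eps>"  # buffer token the model predicts while "waiting"

def insert_buffers(target_tokens, aligned_src_positions):
    """Delay each target token until its aligned source token has been read.
    aligned_src_positions[i] is the 0-based source index that target token i
    depends on (e.g. from a fast-align run); the eager model reads one source
    token and emits one target token per step."""
    out = []
    for tok, src_pos in zip(target_tokens, aligned_src_positions):
        while len(out) < src_pos:   # pad until the aligned source word is seen
            out.append(EPS)
        out.append(tok)
    return out

# Reordering example: English target aligned to French "le chat noir",
# with "black" aligned to source position 2 and "cat" to position 1:
# insert_buffers(["the", "black", "cat"], [0, 2, 1])
# -> ['the', '<eps>', 'black', 'cat']
```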