Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier Transform

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Bafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy. https://i.imgur.com/yClXLoo.jpg Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attack on images and other signals. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Implicit Neural Representations with Periodic Activation Functions

Sitzmann, Vincent and Martel, Julien N. P. and Bergman, Alexander W. and Lindell, David B. and Wetzstein, Gordon

- 2020 via Local Bibsonomy

Keywords: neural-network, machine-learinng

Sitzmann, Vincent and Martel, Julien N. P. and Bergman, Alexander W. and Lindell, David B. and Wetzstein, Gordon

- 2020 via Local Bibsonomy

Keywords: neural-network, machine-learinng

[link]
[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic] I'm interested in learning more about Neural Radiance Fields (or NERFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinuisodial Representation Networks. In my view, the most complex part of understanding this technique isn't the technique itself, but the particularities of the problem being solved, and the ways it differs from a more traditional ML setup. Typically, the goal of machine learning is to learn a model that extracts and represents properties of a data distribution, and that can generalize to new examples drawn from that distribution. Instead, in this framing, a single network is being used to capture information about a single image, essentially creating a compressed representation of that image that brings with it some nice additional properties. Concretely, the neural network is representing a function that maps inputs of the form (x, y), representing coordinates within the image, to (r, g, b) values, representing the pixel values of the image at that coordinate. If you're able to train an optimal version of such a network, it would mean you have a continuous representation of the image. A good way to think about "continuous," here, is that, you could theoretically ask the model for the color value at pixel (3.5, 2.5), and, given that it's simply a numerical mapping, it could give you a prediction, even though in your discrete "sampling" of pixels, that pixel never appears. Given this problem setting, the central technique proposed by SIRENs is to use sinusoidal non-linearities between the layers. On the face of it, this may seem like a pretty weird choice: non-linearities are generally monotonic, and a sine wave is absolutely not that. The appealing property of sinusoidal activations in this context is: if you take a derivative of a sine curve, what you get is a cosine curve (which is essentially a shifted sine curve), and the same is true in reverse. This means that you can take multiple derivatives of the learned function (where, again, "learned function" is your neural network optimized for this particular image), and have them still be networks of the same underlying format, with shifting constants. This allows SIRENs to use an enhanced version of what would be a typical training procedure for this setting. Simplistically, the way you'd go about training this kind of representation would be to simply give the inputs, and optimize against a loss function that reduced your prediction error in predicting the output values, or, in other words, the error on the f(x, y) function itself. When you have a model structure that makes it easy to take first and second derivatives of the function calculated by the model, you can, as this paper does, decide to train against a loss function of matching, not just the true f(x, y) function (again, the pixel values at coordinates), but also the first and second-derivatives (gradients and Laplacian) of the image at those coordinates. This supervision lets you learn a better underlying representation, since it enforces not just what comes "above the surface" at your sampled pixels, but the dynamics of the true function between those points. One interesting benefit of this procedure of using loss in a first or second derivative space (as pointed out in the paper), is that if you want to merge the interesting parts of multiple images, you can approximate that by training a SIREN on the sum of their gradients, since places where gradients are zero likely don't contain much contrast or interesting content (as an example: a constant color background). The Experiments section goes into a lot of specific applications in boundary-finding problems, which I understand at less depth, and thus won't try to explain. It also briefly mentions trying to learn a prior over the space of image functions (that is, a prior over the set of network weights that define the underlying function of an image); having such a prior is interesting in that it would theoretically let you sample both the implicit image function itself (from the prior), and then also points within that function. |

Communication-Efficient Learning of Deep Networks from Decentralized Data

McMahan, H. Brendan and Moore, Eider and Ramage, Daniel and Hampson, Seth and Arcas, Blaise Agüera y

- 2016 via Local Bibsonomy

Keywords: distributed, deep_learning, hpc

McMahan, H. Brendan and Moore, Eider and Ramage, Daniel and Hampson, Seth and Arcas, Blaise Agüera y

- 2016 via Local Bibsonomy

Keywords: distributed, deep_learning, hpc

[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by: - Sending a copy of the current model to each device/client - Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device - Sending that gradient back to the central server - Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by: - Dividing the data on a given device into batches of size B - Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server - Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. |

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Kumar, Aviral and Fu, Justin and Soh, Matthew and Tucker, George and Levine, Sergey

Neural Information Processing Systems Conference - 2019 via Local Bibsonomy

Keywords: dblp

Kumar, Aviral and Fu, Justin and Soh, Matthew and Tucker, George and Levine, Sergey

Neural Information Processing Systems Conference - 2019 via Local Bibsonomy

Keywords: dblp

[link]
Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely form a fixed batch of data, $B$, without any interactions with the environments. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to place some probability mass on every non negligible action. The difference is illustrated in the picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png The behavior policy is in both images the dotted red line, the left image shows the policy matching where the algorithm is constrained to the purple choices, while the right image shows the support matching. **Theoretical Contribution:** The paper analysis formally how the use of out-of-distribution actions to compute the target in the Bellman equation influences the back-propagated error. Firstly a distribution constrained backup operator is defined as $T^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} V(s')]$ and $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$ which considers only policies $\pi \in \Pi$. It is possible that the optimal policy $\pi^*$ is not contained in the policy set $\Pi$, thus there is a suboptimallity constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) ]\vert $ which captures how far $\pi^{*}$ is from $\Pi$. Letting $P^{\pi_i}$ be the transition-matrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch and $\pi_1, \dots, \pi_k \in \Pi $. The error analysis relies upon a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$ a concentrability coefficient is defined as: $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k).$ The concentrability takes values between 1 und $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$ and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining this Kumar et a. get a bound of the Bellman error for distribution constrained value iteration with the constrained Bellman operator $T^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s)] \leq \frac{\gamma}{(1-\gamma^2)} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)] + \frac{1-\gamma}{\gamma}\alpha(\Pi) ] ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$ and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to use support sets to construct $\Pi$, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This amounts to the set of all policies that place probability on all non-negligible actions of the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded. **Algorithm**: The algorithm has an actor critic style, where the Q-value to update the policy is taken to be the minimum over the ensemble. The support constraint to place at least some probability mass on every non negligible action from the batch is enforced via sampled MMD. The proposed algorithm is a member of the policy regularized algorithms as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [min_{j = 1 \dots, k} Q_j(s,a)] s.t. \mathbb{E}_{s \sim B}[MMD(D(s), \pi(\cdot \vert s))] \leq \epsilon$ The Bellman target to update the Q-functions is computed as the convex combination of minimum and maximum of the ensemble. **Experiments** The experiments use the Mujoco environments Halfcheetah, Walker, Hopper and Ant. Three scenarios of batch collection, always consisting of 1Mio. samples, are considered: - completely random behavior policy - partially trained behavior policy - optimal policy as behavior policy The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations show further that the choice of MMD is crucial as it is sometimes on par and sometimes substantially better than choosing KL-divergence. |

Off-Policy Deep Reinforcement Learning without Exploration

Fujimoto, Scott and Meger, David and Precup, Doina

International Conference on Machine Learning - 2019 via Local Bibsonomy

Keywords: dblp

Fujimoto, Scott and Meger, David and Precup, Doina

International Conference on Machine Learning - 2019 via Local Bibsonomy

Keywords: dblp

[link]
Interacting with the environment comes sometimes at a high cost, for example in high stake scenarios like health care or teaching. Thus instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance from a behavior policy. The authors show that several so called off-policy algorithms, like DQN and DDPG fail dramatically in this pure off-policy setting. They attribute this to the extrapolation error, which occurs in the update of a value estimate $Q(s,a)$, where the target policy selects an unfamiliar action $\pi(s')$ such that $(s', \pi(s'))$ is unlikely or not present in $B$. Extrapolation error is caused by the mismatch between the true state-action visitation distribution of the current policy and the state-action distribution in $B$ due to: - state-action pairs (s,a) missing in $B$, resulting in arbitrarily bad estimates of $Q_{\theta}(s, a)$ without sufficient data close to (s,a). - the finiteness of the batch of transition tuples $B$, leading to a biased estimate of the transition dynamics in the Bellman operator $T^{\pi}Q(s,a) \approx \mathbb{E}_{\boldsymbol{s' \sim B}}\left[r + \gamma Q(s', \pi(s')) \right]$ - transitions are sampled uniformly from $B$, resulting in a loss weighted w.r.t the frequency of data in the batch: $\frac{1}{\vert B \vert} \sum_{\boldsymbol{(s, a, r, s') \sim B}} \Vert r + \gamma Q(s', \pi(s')) - Q(s, a)\Vert^2$ The proposed algorithm Batch-Constrained deep Q-learning (BCQ) aims to choose actions that: 1. minimize distance of taken actions to actions in the batch 2. lead to states contained in the buffer 3. maximizes the value function, where 1. is prioritized over the other two goals to mitigate the extrapolation error. Their proposed algorithm (for continuous environments) consists informally of the following steps that are repeated at each time $t$: 1. update generator model of the state conditional marginal likelihood $P_B^G(a \vert s)$ 2. sample n actions form the generator model 3. perturb each of the sampled actions to lie in a range $\left[-\Phi, \Phi \right]$ 4. act according to the argmax of respective Q-values of perturbed actions 5. update value function The experiments considers Mujoco tasks with four scenarios of batch data creation: - 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.5)$ added to the action.This aims for a diverse set of states and actions. - 1 million time steps from training a DDPG agent with an exploration noise $\mathcal{N}(0,0.1)$ added to the actions as behavior policy. The batch-RL agent and the behavior DDPG are trained concurrently from the same buffer. - 1 million transitions from rolling out a already trained DDPG agent - 100k transitions from a behavior policy that acts with probability 0.3 randomly and follows otherwise an expert demonstration with added exploration noise $\mathcal{N}(0,0.3)$ I like the fourth choice of behavior policy the most as this captures high stake scenarios like education or medicine the closest, in which training data would be acquired by human experts that are by the nature of humans not optimal but significantly better than learning from scratch. The proposed BCQ algorithm is the only algorithm that is successful across all experiments. It matches or outperforms the behavior policy. Evaluation of the value estimates showcases unstable and diverging value estimates for all algorithms but BCQ that exhibits a stable value function. The paper outlines a very important issue that needs to be tackled in order to use reinforcement learning in real world applications. |

About