[link]
This paper posits that one of the central problems stopping multi-task RL - that is, single models trained to perform multiple tasks well - from reaching better performance is the inability to balance model resources and capacity between the different tasks the model is being asked to learn. Empirically, prior to this paper, multi-task RL could reach roughly 50% of human performance on Atari and DeepMind Lab tasks. The fact that this is lower than human performance is actually somewhat less salient than the fact that it's quite a lot lower than single-task RL - that is, than how well a single model trained to perform only that task can do.

When learning an RL model across multiple tasks, the reward structures of the different tasks can vary dramatically. Some can have high-magnitude, sparse rewards; some can have low-magnitude rewards throughout. If a model learns it can gain what it thinks is genuinely more reward by getting better at a game with an average reward of 2500 than at one with an average reward of 15, it will put more capacity into solving the former task. Even if you apply normalization strategies like reward clipping (which treats all rewards as a binary signal, regardless of magnitude, and just seeks to increase the frequency of rewards), that doesn't deal with some environments having more frequent rewards than others, and thus more total reward when summed over timesteps.

The authors try to solve this problem by applying a specific kind of normalization, called PopArt normalization, to the problem. PopArt normalization (don't worry about the name) works by adaptively normalizing both the target and the model's estimate of the target, at every step. In the actor-critic case this paper works with, the target and estimate being normalized are, respectively, 1) the aggregated rewards of the trajectories from state S onward, and 2) the value estimate at state S. If your value function is perfect, these two things should be equivalent, so you optimize your value function to be closer to the true rewards under your policy. You then update your policy to increase the probability of actions with higher advantage (the expected reward of taking that action, relative to the baseline value V(S) of that state).

The "adaptive" part refers to correcting for the fact that when you're estimating, say, a value function to predict the total future reward of following a policy from a state, V(S) will be strongly non-stationary, since by improving your policy you are directly optimizing to increase that value. This is done by calculating "scale" and "shift" parameters from recent data. The other part of the PopArt algorithm works by actually updating the estimate the model is producing, so it stays normalized alongside the continually re-normalized target.

https://i.imgur.com/FedXTfB.png

It does this by taking the new and old versions of the scale (sigma) and shift (mu) parameters (which are used to normalize the target) and updating the weights and biases of the last layer, such that the estimator's output moves along with the movement in the target.

Using this toolkit, the paper proposes learning one *policy* that is shared over all tasks, but keeping separate value estimation functions for each task. It then normalizes each task's values independently, meaning that each task ends up contributing equal weight to the gradient updates of the model (both for the value and policy updates).
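As a rough illustration, here is a minimal numpy sketch of the core PopArt mechanic just described: a single linear value head producing a *normalized* value, with running statistics of the targets used as the shift/scale, and the last layer's weights and bias rescaled so the unnormalized estimate is preserved. The class name, step size, and running-moment bookkeeping are simplifying assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of PopArt normalization for a single value head.
import numpy as np

class PopArtValueHead:
    def __init__(self, feat_dim, beta=3e-4):
        self.w = np.zeros(feat_dim)   # last-layer weights
        self.b = 0.0                  # last-layer bias
        self.mu, self.nu = 0.0, 1.0   # running mean and mean-of-squares of targets
        self.beta = beta              # step size for the statistics (assumed value)

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, targets):
        # "ART": adaptively rescale targets; "POP": preserve outputs precisely.
        mu_old, sigma_old = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(targets)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(targets ** 2)
        # Rescale the last layer so the *unnormalized* estimate is unchanged
        # even though the normalization statistics just moved.
        self.w *= sigma_old / self.sigma
        self.b = (sigma_old * self.b + mu_old - self.mu) / self.sigma

    def normalized_value(self, h):
        return self.w @ h + self.b

    def value(self, h):
        # unnormalized value estimate actually used by the agent
        return self.sigma * self.normalized_value(h) + self.mu

    def td_update(self, h, target, lr=1e-2):
        # regress the normalized output toward the normalized target
        err = (target - self.mu) / self.sigma - self.normalized_value(h)
        self.w += lr * err * h
        self.b += lr * err
```

In the multi-task setting described here, there would be one such set of statistics (and one value output) per task, while the single shared policy is trained on the correspondingly normalized per-task advantages.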
In doing this, the authors find dramatically improved performance on both Atari and DeepMind Lab tasks, relative to prior IMPALA work.

https://i.imgur.com/nnDcjNm.png

https://i.imgur.com/Z6JClo3.png |
[link]
This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets. The core of it is the minimization problem

$$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

with

* $\lambda \in [0, \infty[$ being a hyperparameter which defines how strongly the model is regularized
* $u$ denoting a user; $x_*$ are all user factors $x_u$ combined
* $i$ denoting an item; $y_*$ are all item factors $y_i$ combined
* $x_u \in \mathbb{R}^n$ is the latent user factor (embedding); $n$ is another hyperparameter. $n=50$ seems to be a reasonable choice.
* $y_i \in \mathbb{R}^n$ is the latent item factor (embedding)
* $r_{ui}$ defines the "intensity"; higher values mean user $u$ interacted more with item $i$
* $p_{ui} = \begin{cases}1 & \text{if } r_{ui} > 0\\0 &\text{otherwise}\end{cases}$
* $c_{ui} := 1 + \alpha r_{ui}$, where $\alpha \in [0, \infty[$ is a hyperparameter; $\alpha = 40$ seems to be reasonable

In contrast, the standard matrix factorization optimization function looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)):

$$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

where

* $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$

They use alternating least squares (ALS) to train this model; a sketch is given below. The prediction then is the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176)). |
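Here is a minimal, unoptimized numpy sketch of the ALS updates for this objective (the paper derives a much faster formulation; the function and variable names such as `als_implicit`, `R`, and `lam` are my own, following the notation above):

```python
# Alternating least squares for the implicit-feedback objective above.
import numpy as np

def als_implicit(R, n=50, alpha=40.0, lam=0.1, iterations=15, seed=0):
    rng = np.random.default_rng(seed)
    num_users, num_items = R.shape
    X = rng.normal(scale=0.01, size=(num_users, n))   # user factors x_u
    Y = rng.normal(scale=0.01, size=(num_items, n))   # item factors y_i
    C = 1.0 + alpha * R                                # confidences c_ui
    P = (R > 0).astype(float)                          # preferences p_ui

    def solve(F, C_rows, P_rows):
        # one regularized least-squares solve per row:
        # (F^T C F + lam I)^-1 F^T C p
        out = np.empty((C_rows.shape[0], F.shape[1]))
        for j in range(C_rows.shape[0]):
            Cj = np.diag(C_rows[j])
            A = F.T @ Cj @ F + lam * np.eye(F.shape[1])
            b = F.T @ Cj @ P_rows[j]
            out[j] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)        # update user factors with item factors fixed
        Y = solve(X, C.T, P.T)    # update item factors with user factors fixed
    return X, Y

# prediction for user u over all items: scores = X[u] @ Y.T
```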
[link]
Lee et al. propose a variant of adversarial training where a generator is trained simultaneously to generate adversarial perturbations. This approach follows the idea that it is possible to "learn" how to generate adversarial perturbations (as in [1]). In this case, the authors use the gradient of the classifier with respect to the input as a hint for the generator. Both generator and classifier are then trained in an adversarial setting (analogously to generative adversarial networks); see the paper for details.

[1] Omid Poursaeed, Isay Katsman, Bicheng Gao, Serge Belongie. Generative Adversarial Perturbations. ArXiv, abs/1712.02328, 2017. |
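To make the alternating setup concrete, here is a minimal PyTorch-style sketch of the general idea (generator fed the input gradient as a hint, classifier trained on the generated perturbations); the architecture, losses, and perturbation bound are illustrative assumptions, not the paper's exact method.

```python
# Sketch: generator of adversarial perturbations trained jointly with a classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerturbationGenerator(nn.Module):
    # Takes the image and the classifier's input gradient ("hint") and
    # outputs a bounded perturbation (L_inf bound is an assumption).
    def __init__(self, channels=3, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, grad_hint):
        return self.eps * torch.tanh(self.net(torch.cat([x, grad_hint], dim=1)))

def training_step(classifier, generator, opt_c, opt_g, x, y):
    # 1) gradient hint: d(loss)/d(input) for the current classifier
    x_req = x.clone().requires_grad_(True)
    loss_clean = F.cross_entropy(classifier(x_req), y)
    grad_hint = torch.autograd.grad(loss_clean, x_req)[0].detach()

    # 2) generator tries to maximize the classifier's loss on perturbed inputs
    x_adv = x + generator(x, grad_hint)
    opt_g.zero_grad()
    (-F.cross_entropy(classifier(x_adv), y)).backward()
    opt_g.step()

    # 3) classifier is trained on clean plus generated adversarial examples
    x_adv = (x + generator(x, grad_hint)).detach()
    opt_c.zero_grad()
    (F.cross_entropy(classifier(x), y) + F.cross_entropy(classifier(x_adv), y)).backward()
    opt_c.step()
```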
[link]
* They describe a method that applies the style of a source image to a target image.
  * Example: Let a normal photo look like a van Gogh painting.
  * Example: Let a normal car look more like a specific luxury car.
* Their method builds upon the well known artistic style paper and uses a new MRF prior.
* The prior leads to locally more plausible patterns (e.g. fewer artifacts).

### How

* They reuse the content loss from the artistic style paper.
  * The content loss was calculated by feeding the source and target image through a network (here: VGG19) and then estimating the squared euclidean distance between one or more hidden layer activations.
  * They use layer `relu4_2` for the distance measurement.
* They replace the original style loss with an MRF-based style loss (see the sketch after this summary).
  * Step 1: Extract `k x k` sized overlapping patches from the source image.
  * Step 2: Perform step (1) analogously for the target image.
  * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`).
  * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`)
  * Step 5: For each patch of `r_s`, find the best matching patch in `r_t` (based on normalized cross correlation).
  * Step 6: Calculate the sum of squared errors (based on euclidean distances) between each patch in `r_s` and its best match (according to step 5).
* They add a regularizer loss (also sketched below).
  * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners).
  * It is based on the raw pixel values of the last synthesized image.
  * For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both.
  * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> x-gradient^2 + y-gradient^2`).
* Their whole optimization problem is then roughly `image = argmin_image MRF-style-loss + alpha1 * content-loss + alpha2 * regularizer-loss`.
* In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization).
* In practice, they sample patches from the style image under several different rotations and scalings.

### Results

* In comparison to the original artistic style paper:
  * Fewer artifacts.
  * Their method tends to preserve style better, but content worse.
  * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse.

![Non-photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Non-photorealistic example images")

*Non-photorealistic example images. Their method vs. the one from the original artistic style paper.*

![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images")

*Photorealistic example images. Their method vs. the one from the original artistic style paper.* |
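Below is a rough numpy sketch of two of the ingredients above: the smoothness regularizer (sum of squared x/y pixel gradients) and the patch-matching step of the MRF style loss (nearest neighbour under normalized cross correlation, computed here as cosine similarity of flattened patches). Plain arrays stand in for the VGG19 feature maps, and all names are illustrative.

```python
# Sketch of the smoothness regularizer and the MRF-style patch matching.
import numpy as np

def smoothness_loss(img):
    # img: (H, W, C) synthesized image; squared forward differences in x and y
    dx = img[:, 1:, :] - img[:, :-1, :]
    dy = img[1:, :, :] - img[:-1, :, :]
    return np.sum(dx ** 2) + np.sum(dy ** 2)

def extract_patches(feat, k=3, stride=1):
    # feat: (H, W, C) feature map -> (N, k*k*C) flattened overlapping patches
    H, W, C = feat.shape
    patches = [feat[i:i + k, j:j + k, :].ravel()
               for i in range(0, H - k + 1, stride)
               for j in range(0, W - k + 1, stride)]
    return np.stack(patches)

def mrf_style_loss(r_s, r_t, k=3):
    # Steps 1-6 above: for each patch of r_s, find its best match in r_t under
    # normalized cross correlation, then sum the squared euclidean distances.
    P_s = extract_patches(r_s, k)
    P_t = extract_patches(r_t, k)
    s_n = P_s / (np.linalg.norm(P_s, axis=1, keepdims=True) + 1e-8)
    t_n = P_t / (np.linalg.norm(P_t, axis=1, keepdims=True) + 1e-8)
    best = np.argmax(s_n @ t_n.T, axis=1)   # best-matching r_t patch per r_s patch
    return np.sum((P_s - P_t[best]) ** 2)
```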
[link]
The prediction gradient is just $\frac{\partial \mathbf{y}}{\partial w}$ where $\mathbf{y}$ is the output before the loss function. |
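For a scalar-output model, this quantity can be read off directly from autograd; a tiny PyTorch illustration (the linear model is purely for illustration, not from the paper):

```python
# dy/dw for the pre-loss output y of a linear model.
import torch

w = torch.randn(3, requires_grad=True)  # weights
x = torch.randn(3)                      # input
y = w @ x                               # model output *before* any loss
y.backward()                            # computes dy/dw
print(w.grad)                           # equals x for this linear model
```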