Welcome to ShortScience.org! 
[link]
A central problem in the domain of reinforcement learning is how to incentivize exploration and diversity of experience, since RL agents can typically only learn from states they go to, and it can often be the case that states with high reward don't have an obvious trail of highreward states leading to them, meaning that algorithms that are naively optimizing for reward will be relatively unlikely to discover them. One potential way to promote exploration is to train an ensemble of agents, and have part of your loss that incentivizes diversity in the behavior of those agents. Intuitively, an incentive for diversity will push policies away from one another, which will force them to behave differently, and thus reach a wider range of different states. Current work in diversity tends to penalize, for each agent, the average pairwise distance between it and every other agent. The authors of this paper have two critiques with this approach: 1. Pure pairwise distance doesn't fully capture the notion of diversity we might want, since if you have policies that are clustered together into multiple clusters, average pairwise distance can be increased by moving the clusters farther apart, without having to break up the clusters themselves 2. Having the diversity term be calculated for each policy independently can lead to "cycling" behavior, where policies move in ways that increase distance at each step, but don't do so when each agent's independent steps are taken into account As an alternative, they propose calculating the pairwise Kernel function similarity between all policies (where each policy is represented as the average action probabilities it returns across a sample of states), and calculating the determinate of that matrix. The authors claim that this represents a better measure of full population diversity. I can't fully connect the dots on this intuitively, but what I think they're saying is: the determinant of the kernel matrix tells you something about the effective dimensionality spanned by the different policies. In the same way that, if a matrix is lowrank, it tells you that some vectors within it can be nearly represented by linear combinations of others, a low value of the Kernel determinant means that some policies can be sufficiently represented by combinations of other policies, such that it isn't really adding diversity value to the ensemble. https://i.imgur.com/CmlGsNP.png Another contribution of this paper is to propose an interesting banditbased way of determining when to incentivize diversity vs focus on pure reward. The diversity term in the loss is governed by a lambda parameter, and the paper's model sets up Thompson sampling to determine what the value of the parameter should be at each training iteration. This bandit setup works by starting out uncertain over whether to include diversity in the loss, and building a model of whether reward tends to increase during steps where diversity is used. Over time, if diversity consistently doesn't produce benefits, the sampler will tend more towards excluding it from the loss. This is just a really odd, different idea, that I've never heard of before in the context of hyperparameter scheduling during training. I will say that I'm somewhat confused about the window of time that the bandit uses for calculating rewards. Is it a set of trajectories used in a single training batch, or longer than that? Questions aside, they do so fairly convincingly that the adaptive parameter schedule outperforms over a fixed schedule, though they don't test it against simpler forms of annealing, so I'm less clear on whether it would outperform those. Overall, I still have some confusions about the method proposed by this paper, but it seems to be approaching the question of exploration from an interesting direction, and I'd enjoy trying to understand it further in future.
1 Comments

[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:  Sending a copy of the current model to each device/client  Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device  Sending that gradient back to the central server  Averaging those gradients and applying them all at once to a central model The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be average with other clients models. Specifically, their algorithm, FedAvg, works by:  Dividing the data on a given device into batches of size B  Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server  Repeating this for E epochs Conceptually, this should work perfectly well in the world where data from each batch is IID  independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors experiment empirically ask the question whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily nonIID (in their case: where different "devices" had nonoverlapping sets of digits). https://i.imgur.com/xq9vi8S.png They show that, in both the IID and nonIID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communciation than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. 
[link]
Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely form a fixed batch of data, $B$, without any interactions with the environments. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to place some probability mass on every non negligible action. The difference is illustrated in the picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png The behavior policy is in both images the dotted red line, the left image shows the policy matching where the algorithm is constrained to the purple choices, while the right image shows the support matching. **Theoretical Contribution:** The paper analysis formally how the use of outofdistribution actions to compute the target in the Bellman equation influences the backpropagated error. Firstly a distribution constrained backup operator is defined as $T^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} V(s')]$ and $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$ which considers only policies $\pi \in \Pi$. It is possible that the optimal policy $\pi^*$ is not contained in the policy set $\Pi$, thus there is a suboptimallity constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a)  \mathcal{T}Q^{*}(s,a) ]\vert $ which captures how far $\pi^{*}$ is from $\Pi$. Letting $P^{\pi_i}$ be the transitionmatrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch and $\pi_1, \dots, \pi_k \in \Pi $. The error analysis relies upon a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$ a concentrability coefficient is defined as: $C(\Pi) = (1\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k1}c(k).$ The concentrability takes values between 1 und $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$ and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining this Kumar et a. get a bound of the Bellman error for distribution constrained value iteration with the constrained Bellman operator $T^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s) V^{*}(s)] \leq \frac{\gamma}{(1\gamma^2)} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)] + \frac{1\gamma}{\gamma}\alpha(\Pi) ] ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL tradeoff between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$ and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to use support sets to construct $\Pi$, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This amounts to the set of all policies that place probability on all nonnegligible actions of the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded. **Algorithm**: The algorithm has an actor critic style, where the Qvalue to update the policy is taken to be the minimum over the ensemble. The support constraint to place at least some probability mass on every non negligible action from the batch is enforced via sampled MMD. The proposed algorithm is a member of the policy regularized algorithms as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [min_{j = 1 \dots, k} Q_j(s,a)] s.t. \mathbb{E}_{s \sim B}[MMD(D(s), \pi(\cdot \vert s))] \leq \epsilon$ The Bellman target to update the Qfunctions is computed as the convex combination of minimum and maximum of the ensemble. **Experiments** The experiments use the Mujoco environments Halfcheetah, Walker, Hopper and Ant. Three scenarios of batch collection, always consisting of 1Mio. samples, are considered:  completely random behavior policy  partially trained behavior policy  optimal policy as behavior policy The experiments confirm that BEAR outperforms other offpolicy methods like BCQ or KLcontrol. The ablations show further that the choice of MMD is crucial as it is sometimes on par and sometimes substantially better than choosing KLdivergence. 
[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noisefree observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{1}(\pmb{\mathrm{f}}  \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}')  \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noisefree observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between datafit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to nonGaussian likelihoods remains complicated. 
[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $W$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So $$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$ Let's consider an example matrix where of three customers (as rows) are associated with three movies (the columns) by a rating value. $$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$ We can decompose this into two matrices with $r = 1$. First lets do this without any nonnegative constraint using an SVD reshaping matrices based on removing eigenvalues: $$ W = \left[\begin{array}{c c c} 0.656 \\\ 0.652 \\\ 0.379 \end{array}\right], H = \left[\begin{array}{c c c} 6.48 & 6.26 & 3.20\\\\ \end{array}\right] $$ We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$): $$ W = \left[\begin{array}{c c c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \\\\ \end{array}\right] $$ Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. $$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\ 4.33 & 4.08 & 2.09 \\\ 2.52 & 2.37 & 1.21 \\\ \end{array}\right] $$ If they both yield the same reconstruction error then why is a nonnegativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better` #### Paper Contribution This paper discusses two approaches for iteratively creating a nonnegative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. The multiplicative approach is discussed in contrast to an additive gradient decent based approach where small corrections are iteratively applied. The multiplicative approach can be reduced to this by setting the learning rate ($\eta$) to a ratio that represents the magnitude of the element in $H$ to the scaling factor of $W$ on $H$. ### Still a draft 