In many policy gradient algorithms, the parameters are updated in an online fashion: we collect trajectories from the current policy, use them to compute the gradient of the long-term cumulative reward with respect to the policy parameters, and update the policy parameters using this gradient. These samples are not used again after the policy has been updated. The main reason is that reusing them requires **importance sampling**, which suffers from high variance and can make learning unstable. This paper proposes an extension of **Asynchronous Advantage Actor Critic (A3C)** that incorporates offline data (the trajectories collected using previous policies).

**Incorporating offline data in Policy Gradient**

The offline data is incorporated using importance sampling. Specifically, let $J(\theta)$ denote the total reward obtained by the policy $\pi_\theta$ (written $\pi$ below), and let $\mu$ be the behavior policy that generated the data. Then, using the Policy Gradient Theorem,

$$ \nabla_{\theta} J(\theta) \propto \mathbb{E}_{x_t \sim \beta_\mu, a_t \sim \mu}[\rho_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] $$

where $\rho_t = \frac{\pi(a_t | x_t)}{\mu(a_t | x_t)}$ is called the importance sampling term and $\beta_\mu$ is the stationary distribution of states under the policy $\mu$.

**Estimating $Q^{\pi}(x_t, a_t)$ in the above equation:**

The authors used a *retrace-$\lambda$* approach to estimate $Q^{\pi}$. Specifically, the action-values were computed using the following recursive equation:

$$ Q^{\text{ret}}(x_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}\left(Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1})\right) + \gamma V(x_{t+1}) $$

where $\bar{\rho}_t = \min\{c, \rho_t\}$ is the importance sampling term truncated at a constant $c$. $Q$ and $V$ in the above equation are the current estimates of the action-value and state-value respectively. To estimate $Q$, the authors used an architecture similar to A3C, except that the final layer outputs $Q$-values instead of state-values $V$.
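The recursion for $Q^{\text{ret}}$ is easiest to read when unrolled backwards over a stored trajectory segment. The following is a minimal sketch, not the authors' implementation: the function name `retrace_targets`, its argument names, and the convention that the correction term is zero past the end of the segment are all assumptions made for illustration.

```python
def retrace_targets(rewards, q_taken, values, rhos, bootstrap_value,
                    gamma=0.99, c=1.0):
    """Compute Q^ret(x_t, a_t) for every step of one trajectory segment.

    rewards[t] : r_t
    q_taken[t] : Q(x_t, a_t), critic estimate for the action actually taken
    values[t]  : V(x_t), critic estimate of the state value
    rhos[t]    : pi(a_t | x_t) / mu(a_t | x_t)
    bootstrap_value : V of the state after the last step (0.0 if terminal)
    """
    T = len(rewards)
    targets = [0.0] * T
    # Assumption: there is no stored action after the segment ends, so the
    # correction term rho_bar * (Q^ret - Q) is taken to be zero there.
    correction = 0.0
    next_value = bootstrap_value
    for t in reversed(range(T)):
        # Q^ret_t = r_t + gamma * rho_bar_{t+1} * (Q^ret_{t+1} - Q_{t+1})
        #               + gamma * V(x_{t+1})
        targets[t] = rewards[t] + gamma * correction + gamma * next_value
        rho_bar = min(c, rhos[t])  # truncated importance weight
        correction = rho_bar * (targets[t] - q_taken[t])
        next_value = values[t]
    return targets
```

Two limiting cases help as a sanity check. When every $\rho_t$ is zero the targets reduce to one-step targets $r_t + \gamma V(x_{t+1})$, and when every $\rho_t = 1$ with $c \geq 1$ and $Q(x_t, a_t) = V(x_t)$ they reduce to the discounted Monte Carlo return.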
To train $Q$, the authors used $Q^{\text{ret}}$ as the target.

**Reducing the variance caused by importance sampling in the policy gradient equation:**

The authors used a technique called *importance weight truncation with bias correction* to keep the variance of the policy gradient estimate bounded. Specifically, they use the following identity:

$$ \begin{array}{rcl} && \mathbb{E}_{x_t \sim \beta_\mu, a_t \sim \mu}[\rho_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] \\ &=& \mathbb{E}_{x_t \sim \beta_\mu}\Big[ \mathbb{E}_{a_t \sim \mu}[\bar{\rho}_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] \\ &+& \mathbb{E}_{a\sim \pi}\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_{\theta} \log\pi_{\theta}(a | x_t) Q^{\pi}(x_t, a)\right] \Big] \end{array} $$

where $\rho_t(a) = \frac{\pi(a | x_t)}{\mu(a | x_t)}$ and $[x]_+ = \max(0, x)$. Note that the variance of both terms on the right-hand side is bounded: the weight in the first term is at most $c$, and the weight in the second term is at most $1$ and is nonzero only when $\rho_t(a) > c$.

**Results:**

The authors showed that, by using the offline data, they were able to match the performance of the best DQN agent with less data and the same amount of computation.

**Continuous task:**

The authors used a stochastic dueling architecture for tasks with continuous action spaces, while reusing the innovations developed for the discrete case.
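Returning to importance weight truncation with bias correction: the identity can be checked numerically at a single state with a small discrete action set. The sketch below is purely illustrative and makes several assumptions: the function name `split_importance_weighted_expectation` is hypothetical, the per-action scalar `f[a]` stands in for $\nabla_{\theta} \log \pi(a | x_t) Q^{\pi}(x_t, a)$, and the expectations are computed exactly by summing over actions rather than by sampling.

```python
def split_importance_weighted_expectation(pi, mu, f, c=10.0):
    """Return (full, truncated, correction) for one state.

    pi[a], mu[a] : target and behavior action probabilities (mu[a] > 0 assumed)
    f[a]         : scalar stand-in for grad log pi(a|x) * Q(x, a)
    c            : truncation threshold for the importance weight
    """
    full = truncated = correction = 0.0
    for p, m, fa in zip(pi, mu, f):
        rho = p / m
        # E_{a ~ mu}[rho * f]: the untruncated, high-variance estimator.
        full += m * rho * fa
        # E_{a ~ mu}[min(c, rho) * f]: weight bounded by c.
        truncated += m * min(c, rho) * fa
        # E_{a ~ pi}[[(rho - c) / rho]_+ * f]: weight bounded by 1,
        # active only for actions whose importance weight exceeds c.
        if rho > c:
            correction += p * (rho - c) / rho * fa
    return full, truncated, correction


# Hypothetical numbers: the first action is rare under mu but likely under pi.
full, truncated, correction = split_importance_weighted_expectation(
    pi=[0.9, 0.05, 0.05], mu=[0.05, 0.5, 0.45], f=[2.0, -1.0, 0.5], c=10.0)
assert abs(full - (truncated + correction)) < 1e-9
```

In this example only the first action has $\rho_t(a) > c$, so it is the only one that contributes to the correction term. In practice the first term would be estimated from replayed samples and the second from the current policy and the learned $Q$.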