[link]
- explore RL systems trained from (non-expert) human preferences between pairs of trajectory segments;
- run experiments on RL tasks, namely **Atari** and **MuJoCo**, and show the effectiveness of this approach;
- advantages mentioned:
    - no need for access to the reward function;
    - feedback needed on less than 1% of the agent's interactions -> reduces the cost of human oversight;
    - can learn complex novel behaviors.

## Introduction

**Challenges**

- goals are complex, poorly defined, or hard to specify;
- hand-specified reward functions -> behaviors that optimize the reward function without achieving the goals;

> This difficulty underlies recent concerns about misalignment between reward values and the objectives of RL systems.

**Alternatives**

- inverse RL: extract a reward function from demonstrations of desired tasks;
- imitation learning: clone the demonstrated behavior;
  *con: not applicable to behaviors that are hard for humans to demonstrate*
- use human feedback as a reward function;
  *con: requires thousands of hours of experience, prohibitively expensive*

**Basic idea**

https://i.imgur.com/3R7tO7R.png

**Contributions**

1. solve tasks for which we can only recognize, but not demonstrate, the desired behavior;
2. allow agents to be trained by non-experts;
3. scale to larger problems;
4. be economical with user feedback.

**Related work**

two lines of work: (1) RL from human ratings or rankings; (2) the general problem of RL from preferences rather than absolute reward values.

*closely related papers:*
(1) [Active preference learning-based reinforcement learning](https://arxiv.org/abs/1208.0984);
(2) [Programming by feedback](http://proceedings.mlr.press/v32/schoenauer14.pdf);
(3) [A Bayesian approach for policy learning from trajectory preference queries](https://papers.nips.cc/paper/4805-a-bayesian-approach-for-policy-learning-from-trajectory-preference-queries).

*diffs with (1)-(2):* a) they elicit preferences over whole trajectories rather than short clips; b) this work changes the training procedure to cope with nonlinear reward models and modern deep RL.

*diffs with (3):* a) (3) fits the reward function by Bayesian inference; b) (3) produces trajectories using the MAP estimate of the target policy instead of RL -> its experiments involve 'synthetic' human feedback drawn from the Bayesian model.

## Preliminaries and Method

**Agent goal** produce trajectories that are preferred by the human, while making as few queries to the human as possible.

**Workflow** at each point in time, maintain two deep NNs:
- policy *pi*: O -> A;
- reward estimate *r\_hat*: O x A -> R.

*Update procedure (asynchronous):*
1. policy *pi* => env => trajectories *tau* = {*tau^1*, ..., *tau^i*}; then update *pi* with a traditional RL algorithm to maximize the sum of predicted rewards *r\_t* = *r\_hat*(*o\_t*, *a\_t*);
2. select segment pairs *sigma* = (*sigma^1*, *sigma^2*) from *tau*; *sigma* => human comparison => labeled data;
3. update *r\_hat* on the labeled data by supervised learning.

> step 1 => trajectories *tau* => step 2 => human comparisons => step 3 => parameters for *r\_hat* => step 1 => ...

**Policy optimization (step 1)**

*subtlety:* the reward function *r\_hat* is non-stationary -> need methods robust to changes in the reward function: A2C for Atari, TRPO for MuJoCo. Use parameter settings that work well for traditional RL tasks; only adjust the entropy bonus for TRPO (to remedy inadequate exploration); normalize predicted rewards to zero mean and constant standard deviation.

**Preference elicitation (step 2)**

show the human clips of trajectory segments 1 to 2 seconds long.

*data structure:* triples (*sigma^1*, *sigma^2*, *mu*), where *mu* is a distribution over {1, 2}:
- one segment preferred over the other -> *mu* puts all mass on that choice;
- equally preferable -> *mu* is uniform;
- incomparable -> the triple is not stored.

**Fitting the reward function (step 3)**

*assumption:* the human's probability of preferring a segment *sigma^i* depends exponentially on the value of the latent reward summed over the length of the clip.

https://i.imgur.com/2ViIWcL.png

(no discounting of reward <- the human is assumed indifferent about when things happen within a trajectory segment; discounting could be considered.)

https://i.imgur.com/cpdTZW6.png

*modifications:*
1. ensemble of predictors -> normalize each base predictor independently and then average the results;
2. hold out 1/e of the data as a validation set; employ *l\_2* regularization and tune the regularization coefficient so that validation loss stays 1.1~1.5x the training loss; use dropout in some domains;
3. assume a 10% chance that the human responds uniformly at random.

(a code sketch of this fitting step appears at the end of this note)

**Selecting queries**

*pipeline:* sample trajectory segments of length k -> predict the preference with each base reward predictor in the ensemble -> select the pairs with the highest variance across ensemble members.

*future work:* query based on the expected value of information of the query.

*Related articles:*
1. [APRIL: Active Preference-learning based Reinforcement Learning](https://arxiv.org/abs/1208.0984)
2. [Active reinforcement learning: Observing rewards at a cost](http://www.filmnips.com/wp-content/uploads/2016/11/FILM-NIPS2016_paper_30.pdf)

> At each time-step, the agent chooses both an action and whether to observe the reward in the next time-step. If the agent chooses to observe the reward, then it pays the “query cost” c > 0. The agent’s objective is to maximize total reward minus total query cost.
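A minimal sketch of the reward-fitting step (step 3), assuming a small PyTorch reward model; the names (`RewardNet`, `preference_loss`) and the tensor layout are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Latent reward estimate r_hat(o, a): one scalar per timestep."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(r_hat, seg1, seg2, mu, eps=0.1):
    """Cross-entropy loss under the assumed preference model:
    P[sigma^1 > sigma^2] = exp(sum r_hat(sigma^1)) / (exp(sum r_hat(sigma^1)) + exp(sum r_hat(sigma^2))).
    seg1/seg2 are (obs, act) tensors of shape (batch, T, dim); mu is the label
    distribution over {1, 2} of shape (batch, 2); eps models the assumed 10%
    chance that the human responds uniformly at random."""
    sum1 = r_hat(*seg1).sum(dim=-1)                 # summed predicted reward over clip 1
    sum2 = r_hat(*seg2).sum(dim=-1)                 # summed predicted reward over clip 2
    p1 = torch.softmax(torch.stack([sum1, sum2], dim=-1), dim=-1)[..., 0]
    p1 = (1 - eps) * p1 + eps * 0.5                 # mix with uniform responses
    return -(mu[..., 0] * torch.log(p1) + mu[..., 1] * torch.log(1 - p1)).mean()
```

Each member of the predictor ensemble would be a separate `RewardNet` trained on the same triples, with outputs normalized independently before averaging.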
[link]
- *issue:* RL on real systems -> sparse and slow data sampling;
- *solution:* pre-train the agent with an EGAN;
- *performance:* ~20% improvement in training time at the beginning of learning compared to no pre-training; ~5% improvement and smaller variations compared to GAN pre-training.

## Introduction

5G telecom systems -> must fulfill ultra-low latency, high robustness, quick response to changed capacity needs, and dynamic allocation of functionality.

*Problems:*
1. exploration has an impact on service quality in real-time service systems;
2. sparse and slow data sampling -> extended training duration.

## Enhanced GAN

**Formulas**

the training data for RL tasks:

$$x = [x_1, x_2] = [(s_t,a),(s_{t+1},r)]$$

the generated data:

$$G(z) = [G_1(z), G_2(z)] = [(s'_t,a'),(s'_{t+1},r')]$$

the value function for the GAN:

$$V(D,G) = \mathbb{E}_{z \sim p_z(z)}[\log(1-D(G(z)))] + \lambda D_{KL}(P||Q)$$

where the regularization term $D_{KL}$ has the following form:

$$D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

**EGAN structure**

https://i.imgur.com/FhPxamJ.png

**Algorithm**

https://i.imgur.com/RzOGmNy.png

The enhancer is fed with training data *D\_r(s\_t, a)* and *D\_r(s\_{t+1}, r)* and trained by supervised learning. After the GAN generates synthetic data *D\_t(s\_t, a, s\_{t+1}, r)*, the enhancer strengthens the dependency between *D\_t(s\_t, a)* and *D\_t(s\_{t+1}, r)* and updates the weights of the GAN (see the sketch after this note).

## Results

two lines of experiments on the CartPole environment with PG agents:
1. comparing the learning curves of agents with no pre-training, GAN pre-training, and EGAN pre-training. => Result: EGAN > GAN > no pre-training
2. comparing the learning curves of agents with EGAN pre-training for various numbers of pre-training episodes (500, 2000, 5000). => Result: 5000 > 2000 ~= 500
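A minimal PyTorch sketch of the enhancer idea: the network shapes, the names `Enhancer` / `enhancer_consistency_loss`, and the way the consistency term feeds back into the generator are one plausible reading of the algorithm above, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Enhancer(nn.Module):
    """Supervised model of the transition data: (s_t, a) -> (s_{t+1}, r),
    trained on real transitions D_r before being applied to generated data."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),
        )

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :-1], out[..., -1]          # predicted s_{t+1}, predicted r

def enhancer_consistency_loss(enhancer, fake_s, fake_a, fake_next_s, fake_r):
    """One reading of 'enhance the dependency between D_t(s_t, a) and D_t(s_{t+1}, r)':
    penalize generated (s'_{t+1}, r') that disagree with the enhancer's prediction
    from (s'_t, a'), and backpropagate this term into the generator's weights."""
    pred_next_s, pred_r = enhancer(fake_s, fake_a)
    return F.mse_loss(fake_next_s, pred_next_s) + F.mse_loss(fake_r, pred_r)
```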
[link]
Main idea: planning is 'trying things in your head' using an internal model of the world.

#### Diagram

https://i.imgur.com/vDAobu1.png

#### Generic algorithm

- steps 1-3: standard reinforcement learning agent
- step 4: learning of domain knowledge - the action model
- step 5: RL from hypothetical, model-generated experiences - planning

https://i.imgur.com/G6dZ00F.png

(a Dyna-Q style sketch of these steps appears at the end of this note)

#### Action model

- input: state, action; output: immediate resulting state and reward
- search control: how to select the hypothetical state and action

## Potential problems

1. reliance on supervised learning
2. hierarchical planning
3. ambiguous and hidden state
4. ensuring variety in action
5. taskability
6. incorporation of prior knowledge
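A compact tabular Dyna-Q sketch of the generic algorithm above (Dyna-Q being one standard instantiation of the Dyna architecture), assuming an old-style discrete Gym environment whose `reset()` returns the state; hyperparameters and the random search-control rule are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(float)            # steps 1-3: standard RL value estimates Q[(s, a)]
    model = {}                        # step 4: action model, model[(s, a)] = (r, s')

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done, _ = env.step(a)
            # direct RL update from real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in range(n_actions)) - Q[(s, a)])
            # learn the (deterministic) action model from the observed transition
            model[(s, a)] = (r, s2)
            # step 5: planning - RL updates from hypothetical, model-generated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))   # simple search control
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in range(n_actions)) - Q[(ps, pa)])
            s = s2
    return Q
```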
[link]
- this paper: develop a framework to replay important transitions more frequently -> learn more efficiently
- prior work: uniformly sample a replay memory to get experience transitions
- evaluation: DQN + PER outperforms DQN on 41 out of 49 Atari games

## Introduction

**issues with online RL** (solution: experience replay):
1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later

**key idea:** more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error

**issues with prioritization:**
1. loss of diversity -> alleviate with stochastic prioritization
2. introduced bias -> correct with importance sampling

## Prioritized Replay

**criterion:**
- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate

**stochastic sampling:**

$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used.

*two variants:*
1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (a small positive constant to avoid zero probability)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust, as it is insensitive to outliers**

https://i.imgur.com/T8je5wj.png

**importance sampling:**

IS weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$$

- weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max_i w_i}$
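A minimal numpy sketch of the proportional variant with the IS weights above; this is a simple O(N) version (the paper uses a sum-tree for efficiency), and the class name and default constants are illustrative:

```python
import numpy as np

class ProportionalReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # new transitions get the current max priority so they are replayed at least once
        p = max(self.priorities, default=1.0)
        self.data.append(transition); self.priorities.append(p)
        if len(self.data) > self.capacity:
            self.data.pop(0); self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        P = p / p.sum()                                   # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=P)
        N = len(self.data)
        w = (N * P[idx]) ** (-beta)                       # w_i = (1/N * 1/P(i))^beta
        w /= w.max()                                      # normalize by 1/max w_i (over the batch)
        return idx, [self.data[i] for i in idx], w

    def update_priorities(self, idx, td_errors):
        # proportional variant: p_i = |delta_i| + epsilon
        for i, d in zip(idx, td_errors):
            self.priorities[i] = abs(d) + self.eps
```

In the Q-learning update, the returned weights `w` multiply the TD errors ($w_i \delta_i$ instead of $\delta_i$), and the new absolute TD errors are written back via `update_priorities`.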
[link]
GAN - derives backprop signals through a **competitive process** involving a pair of networks.

Aim: provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts; point to remaining challenges in theory and applications.

## Introduction

- how it is achieved: implicitly modelling high-dimensional distributions of data
- the generator receives **no direct access to real images**, only the error signal from the discriminator
- the discriminator receives both synthetic samples and samples drawn from the real images
- G: G(z) -> R^|x|, where z \in R^|z| is a sample from the latent space and x \in R^|x| is an image
- D: D(x) -> (0, 1); in practice, D may not be trained until the generator is optimal

https://i.imgur.com/wOwSXhy.png

## Preliminaries

- the objective functions J_G(theta_G; theta_D) and J_D(theta_D; theta_G) are **co-dependent** as they are iteratively updated
- difficulty: it is hard to construct likelihood functions for high-dimensional, real-world image data
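A minimal PyTorch sketch of the alternating, co-dependent updates of D and G described above; `G`, `D` (assumed to end in a sigmoid), the optimizers, and `z_dim` are placeholders, and the non-saturating generator loss is a common choice rather than anything specific to this overview:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # update D(theta_D; theta_G): classify real vs. synthetic samples
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                              # G only supplies samples here, no gradient
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # update G(theta_G; theta_D): G never sees real images, only D's error signal
    z = torch.randn(batch, z_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)    # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```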