Deep Reinforcement Learning from Human Preferences.

Paul F. Christiano and Jan Leike and Tom Brown and Miljan Martic and Shane Legg and Dario Amodei

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

- explores RL systems trained with (non-expert) human preferences between pairs of trajectory segments;
- runs experiments on RL tasks, namely **Atari** and **MuJoCo**, and shows the effectiveness of this approach;
- advantages mentioned:
  - no need for access to the reward function;
  - feedback needed on less than 1% of the agent's interactions with the environment -> reduces the cost of human oversight;
  - can learn complex novel behaviors.

## Introduction

**Challenges**

- goals are complex, poorly defined, or hard to specify;
- reward function -> behaviors that optimize the reward function without achieving the goals;

> This difficulty underlines recent concerns about misalignment between reward values and the objectives of RL systems.

**Alternatives**

- inverse RL: extract a reward function from demonstrations of desired tasks;
- imitation learning: clone the demonstrated behavior;

  *con: not applicable to behaviors that are hard for humans to demonstrate*
- use human feedback as a reward function;

  *con: requires thousands of hours of experience, prohibitively expensive*

**Basic idea**

https://i.imgur.com/3R7tO7R.png

**Contributions**

1. solve tasks for which we can only recognize but not demonstrate the desired behaviors;
2. allow agents to be trained by non-expert users;
3. scale to larger problems;
4. economical with user feedback.

**Related work**

two lines of work: (1) RL from human ratings or rankings; (2) the general problem of RL from preferences rather than absolute reward values.

*closely related papers:*

(1) [Active preference learning-based reinforcement learning](https://arxiv.org/abs/1208.0984);
(2) [Programming by feedback](http://proceedings.mlr.press/v32/schoenauer14.pdf);
(3) [A Bayesian approach for policy learning from trajectory preference queries](https://papers.nips.cc/paper/4805-a-bayesian-approach-for-policy-learning-from-trajectory-preference-queries).

*diffs with (1)-(2)*: a) (1)-(2) elicit preferences over whole trajectories, whereas this paper uses short clips; b) this paper changes the training procedure to cope with nonlinear reward models and modern deep RL.

*diffs with (3)*: a) (3) fits the reward function by Bayesian inference; b) (3) produces trajectories using the MAP estimate of the target policy instead of RL; its experiments involve 'synthetic' human feedback drawn from the Bayesian model.

## Preliminaries and Method

**Agent goal**

to produce trajectories which are preferred by the human, while making as few queries as possible to the human.

**Work flow**

at each point in time the method maintains two deep NNs - policy *pi*: O -> A; reward estimate *r\_hat*: O x A -> R.

*Update procedure (asynchronous):*

1. policy *pi* => env => trajectories *tau* = {*tau^1*,..., *tau^i*}. Then update *pi* by a traditional RL algorithm to maximize the sum of predicted rewards *r\_t* = *r\_hat*(*o\_t*, *a\_t*);
2. select segment pairs *sigma* = (*sigma^1*, *sigma^2*) from *tau*. *sigma* => human comparison => labeled data;
3. update *r\_hat* with labeled data by supervised learning.

> step 1 => trajectory *tau* => step 2 => human comparison => step 3 => parameters for *r\_hat* => step 1 => ....

**Policy optimization (step 1)**

*subtlety:* the reward function *r\_hat* is non-stationary -> prefer methods that are robust to changes in the reward function.

A2C => Atari, TRPO => MuJoCo.

use parameter settings that work well for traditional RL tasks; only adjust the entropy bonus for TRPO (to compensate for inadequate exploration); normalize rewards to zero mean and constant std.

**Preference elicitation (step 2)**

clips of trajectory segments, 1 to 2 seconds long.

*data struct:* triples (*sigma^1*, *sigma^2*, *mu*), *mu* - distribution over {1, 2}.

one segment preferable to the other -> *mu* puts all mass on that choice; equally preferable -> *mu* uniform; incomparable -> the triple is not saved.

**Fitting the reward function (step 3)**

*assumption:* the human's probability of preferring a segment *sigma^i* depends exponentially on the value of the latent reward summed over the length of the clip.

https://i.imgur.com/2ViIWcL.png

(no discounting of reward <- the human is assumed indifferent about when things happen in the trajectory segment; discounting could be considered.)

https://i.imgur.com/cpdTZW6.png

*modifications:*

1. ensemble of predictors -> independently normalize the base predictors and then average the results;
2. validation set ratio 1/e; employ *l\_2* regularization, and tune the regularization coefficient -> validation loss = 1.1~1.5 x training loss; dropout in some domains;
3. assume a 10% chance that the human responds uniformly at random.

**Selecting queries**

*pipeline:* sample trajectory segments of length k -> predict the preference with each base reward predictor in the ensemble -> select the pairs with the highest variance across ensemble members

*future work:* query based on the expected value of information of the query.

*Related articles:*

1. [APRIL: Active Preference-learning based Reinforcement Learning](https://arxiv.org/abs/1208.0984)
2. [Active reinforcement learning: Observing rewards at a cost](http://www.filmnips.com/wp-content/uploads/2016/11/FILM-NIPS2016_paper_30.pdf)

> At each time-step, the agent chooses both an action and whether to observe the reward in the next time-step. If the agent chooses to observe the reward, then it pays the “query cost” c > 0. The agent’s objective is to maximize total reward minus total query cost.
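
The reward-fitting assumption (preference probability exponential in the summed latent reward), the 10% random-response correction, and the variance-based query selection can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the array shapes, and the use of plain per-step reward arrays in place of a neural reward predictor are all assumptions made here.

```python
import numpy as np

def preference_prob(r1, r2):
    """P[sigma^1 preferred over sigma^2] from per-step predicted rewards.

    r1, r2: 1-D arrays of r_hat(o_t, a_t) along each clip (undiscounted sum).
    """
    s1, s2 = np.sum(r1), np.sum(r2)
    # exp(s1) / (exp(s1) + exp(s2)), rewritten as a sigmoid of the difference.
    return 1.0 / (1.0 + np.exp(s2 - s1))

def preference_loss(r1, r2, mu, random_rate=0.1):
    """Cross-entropy between the predicted preference and the label mu.

    mu = (mu1, mu2) is the distribution over {1, 2} stored in the triple.
    random_rate models the assumed chance of a uniformly random human answer.
    """
    p1 = preference_prob(r1, r2)
    p1 = (1.0 - random_rate) * p1 + random_rate * 0.5
    return -(mu[0] * np.log(p1) + mu[1] * np.log(1.0 - p1))

def select_queries(ensemble_probs, n_queries):
    """Indices of candidate pairs with the highest variance across the ensemble.

    ensemble_probs: array of shape (n_members, n_pairs), one predicted
    preference probability per ensemble member and candidate pair.
    """
    variance = np.var(ensemble_probs, axis=0)
    return np.argsort(variance)[::-1][:n_queries]
```

Mixing the prediction with a uniform answer keeps the loss bounded even when the predictor is confidently wrong about a label, which is one plausible reading of why the 10% assumption helps.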

Enhanced Experience Replay Generation for Efficient Reinforcement Learning

Vincent Huang and Tobias Ley and Martha Vlachou-Konchylaki and Wenfeng Hu

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.AI

**First published:** 2017/05/23 (6 years ago)

**Abstract:** Applying deep reinforcement learning (RL) on real systems suffers from slow
data sampling. We propose an enhanced generative adversarial network (EGAN) to
initialize an RL agent in order to achieve faster learning. The EGAN utilizes
the relation between states and actions to enhance the quality of data samples
generated by a GAN. Pre-training the agent with the EGAN shows a steeper
learning curve with a 20% improvement of training time in the beginning of
learning, compared to no pre-training, and an improvement compared to training
with GAN by about 5% with smaller variations. For real time systems with sparse
and slow data sampling the EGAN could be used to speed up the early phases of
the training process.

- *issue:* RL on real systems -> sparse and slow data sampling;
- *solution:* pre-train the agent with the EGAN;
- *performance:* ~20% improvement of training time in the beginning of learning compared to no pre-training; ~5% improvement and smaller variations compared to GAN pre-training.

## Introduction

5G telecom systems -> need to fulfill ultra-low latency, high robustness, quick response to changed capacity needs, and dynamic allocation of functionality.

*Problems:*

1. exploration has an impact on the service quality in real-time service systems;
2. sparse and slow data sampling -> extended training duration.

## Enhanced GAN

**Formulas**

the training data for RL tasks:

$$x = [x_1, x_2] = [(s_t,a),(s_{t+1},r)]$$

the generated data:

$$G(z) = [G_1(z), G_2(z)] = [(s'_t,a'),(s'_{t+1},r')] $$

the value function for GAN:

$$V(D,G) = \mathbb{E}_{z \sim p_z(z)}[\log(1-D(G(z)))] + \lambda D_{KL}(P||Q)$$

where the regularization term $D_{KL}$ has the following form:

$$D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

**EGAN structure**

https://i.imgur.com/FhPxamJ.png

**Algorithm**

https://i.imgur.com/RzOGmNy.png

The enhancer is fed with training data *D\_r(s\_t, a)* and *D\_r(s\_{t+1}, r)*, and trained by supervised learning. After the GAN generates synthetic data *D\_t(s\_t, a, s\_{t+1}, r)*, the enhancer can strengthen the dependency between *D\_t(s\_t, a)* and *D\_t(s\_{t+1}, r)* and update the weights of the GAN.

## Results

two sets of experiments on the CartPole environment with PG agents:

1. one comparing the learning curves of agents with no pre-training, GAN pre-training and EGAN pre-training. => Result: EGAN > GAN > no pre-training
2. one comparing the learning curves of agents with EGAN pre-training for various numbers of episodes (500, 2000, 5000). => Result: 5000 > 2000 ~= 500
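
To make the regularized value function concrete, the sketch below evaluates the discrete KL term and the objective exactly as written in the formulas above. The notes do not say how *P* and *Q* are built from the real and generated samples, so here they are simply passed in as discrete distributions; the function names and that interface are assumptions for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P||Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalise in case the inputs are counts
    mask = p > 0                      # terms with P(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def regularized_value(d_fake, p, q, lam=1.0):
    """Batch estimate of E_z[log(1 - D(G(z)))] + lambda * D_KL(P||Q).

    d_fake: discriminator outputs D(G(z)) in (0, 1) for a batch of samples.
    lam:    weight of the KL regularizer (lambda in the formula).
    """
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(1.0 - d_fake)) + lam * kl_divergence(p, q))
```

When *P* and *Q* coincide the KL term is zero and the value reduces to the usual generator term, so the regularizer only penalizes mismatch between the two distributions.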

Dyna, an Integrated Architecture for Learning, Planning, and Reacting

Sutton, Richard S.

SIGART Bulletin - 1991 via Local Bibsonomy

Keywords: dblp

Main idea: planning is 'trying things in your head' using an internal model of the world

#### Diagram

https://i.imgur.com/vDAobu1.png

#### Generic algorithm

- step 1-3: standard reinforcement learning agent
- step 4: learning of domain knowledge - action model
- step 5: RL from hypothetical, model-generated experiences - planning

https://i.imgur.com/G6dZ00F.png

#### Action model

- input: state, action; output: immediate resulting state and reward
- search control: how to select hypothetical state and action

## Potential problems

1. reliance on supervised learning
2. hierarchical planning
3. ambiguous and hidden state
4. ensuring variety in action
5. taskability
6. incorporation of prior knowledge
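
The five steps of the generic algorithm can be sketched with a tabular Q-learning agent, a common instantiation usually called Dyna-Q. The sketch assumes a deterministic world, a lookup-table action model, and uniformly random search control; the `chain_step` environment and all parameter values are hypothetical and only serve to make the example runnable.

```python
import random
from collections import defaultdict

def chain_step(state, action):
    """Hypothetical 5-state chain: reward 1 for reaching state 4, else 0."""
    next_state = max(0, min(4, state + action))
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done

def dyna_q(step, start_state, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(state, action)]
    model = {}              # action model: (state, action) -> (next_state, reward, done)

    def greedy(state):
        best = max(q[(state, a)] for a in actions)
        return rng.choice([a for a in actions if q[(state, a)] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # Steps 1-3: act in the real world and learn from the real transition.
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)
            # Step 4: learn the action model (here: memorise the transition).
            model[(s, a)] = (s2, r, done)
            # Step 5: planning, i.e. RL on hypothetical model-generated experience.
            for _ in range(planning_steps):
                (ps, pa), (ps2, pr, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q

q_values = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

Setting `planning_steps=0` turns this into plain Q-learning, which makes the contribution of the model-generated updates easy to isolate in an experiment.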

Prioritized Experience Replay

Tom Schaul and John Quan and Ioannis Antonoglou and David Silver

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

**First published:** 2015/11/18 (8 years ago)

**Abstract:** Experience replay lets online reinforcement learning agents remember and
reuse experiences from the past. In prior work, experience transitions were
uniformly sampled from a replay memory. However, this approach simply replays
transitions at the same frequency that they were originally experienced,
regardless of their significance. In this paper we develop a framework for
prioritizing experience, so as to replay important transitions more frequently,
and therefore learn more efficiently. We use prioritized experience replay in
Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved
human-level performance across many Atari games. DQN with prioritized
experience replay achieves a new state-of-the-art, outperforming DQN with
uniform replay on 41 out of 49 games.

- this paper: develops a framework to replay important transitions more frequently -> learn more efficiently
- prior work: uniformly sample a replay memory to get experience transitions
- evaluation: DQN + PER outperforms DQN on 41 out of 49 Atari games

## Introduction

**issues with online RL:** (solution: experience replay)

1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later

**key idea:** more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error

**issues with prioritization:**

1. loss of diversity -> alleviate with stochastic prioritization
2. introduces bias -> correct with importance sampling

## Prioritized Replay

**criterion:**

- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate

**stochastic sampling:**

$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used (*alpha* = 0 is uniform sampling).

*two variants:*

1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (small positive constant to avoid zero probability)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust as it is insensitive to outliers**

https://i.imgur.com/T8je5wj.png

**importance sampling:**

IS weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta $$

- weights can be folded into the Q-learning update by using $w_i*\delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max w_i}$
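
The sampling probabilities and importance-sampling weights above translate directly into NumPy. The sketch below covers both priority variants with a plain array of TD errors; it leaves out the sum-tree data structure needed for large replay memories, normalizes the weights by the maximum within the sampled batch (an assumption), and uses illustrative default values for *alpha* and *beta*.

```python
import numpy as np

def proportional_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(i) with p_i = |TD error_i| + eps."""
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def rank_based_probs(td_errors, alpha=0.6):
    """P(i) with p_i = 1 / rank(i), rank 1 being the largest |TD error|."""
    abs_err = np.abs(np.asarray(td_errors, dtype=float))
    order = np.argsort(-abs_err)
    ranks = np.empty(len(abs_err), dtype=float)
    ranks[order] = np.arange(1, len(abs_err) + 1)
    scaled = (1.0 / ranks) ** alpha
    return scaled / scaled.sum()

def sample_batch(probs, batch_size, beta=0.4, rng=None):
    """Sample transition indices and their normalised IS weights."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (1.0 / (len(probs) * probs[idx])) ** beta  # w_i = (1/N * 1/P(i))^beta
    weights /= weights.max()                             # scale so that max w_i = 1
    return idx, weights
```

A training loop would multiply each sampled transition's TD error by its weight before the Q-learning update, then write the new TD errors back as priorities.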

Generative Adversarial Networks

Ian J. Goodfellow and Jean Pouget-Abadie and Mehdi Mirza and Bing Xu and David Warde-Farley and Sherjil Ozair and Aaron Courville and Yoshua Bengio

arXiv e-Print archive - 2014 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2014/06/10 (9 years ago)

**Abstract:** We propose a new framework for estimating generative models via an
adversarial process, in which we simultaneously train two models: a generative
model G that captures the data distribution, and a discriminative model D that
estimates the probability that a sample came from the training data rather than
G. The training procedure for G is to maximize the probability of D making a
mistake. This framework corresponds to a minimax two-player game. In the space
of arbitrary functions G and D, a unique solution exists, with G recovering the
training data distribution and D equal to 1/2 everywhere. In the case where G
and D are defined by multilayer perceptrons, the entire system can be trained
with backpropagation. There is no need for any Markov chains or unrolled
approximate inference networks during either training or generation of samples.
Experiments demonstrate the potential of the framework through qualitative and
quantitative evaluation of the generated samples.

GAN - derive backprop signals through a **competitive process** involving a pair of networks;

Aim: provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts; point to remaining challenges in theory and applications.

## Introduction

- How to achieve: implicitly modelling high-dimensional distributions of data
- generator receives **no direct access to real images**, only the error signal from the discriminator
- discriminator receives both the synthetic samples and samples drawn from the real images
- G: G(z) -> R^|x|, where z \in R^|z| is a sample from latent space, x \in R^|x| is an image
- D: D(x) -> (0, 1). may not be trained in practice until the generator is optimal

https://i.imgur.com/wOwSXhy.png

## Preliminaries

- objective functions J_G(theta_G;theta_D) and J_D(theta_D;theta_G) are **co-dependent** as they are iteratively updated
- difficulty: hard to construct likelihood functions for high-dimensional, real-world image data
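
The co-dependence of J_G and J_D can be made concrete with the standard minimax value function from the original GAN paper, V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]. The notes do not spell out the two objectives, so the mapping used below (J_D = -V, J_G = the second term of V) is an assumption, and the inputs are plain arrays of discriminator outputs rather than real networks.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """J_D = -(E_x[log D(x)] + E_z[log(1 - D(G(z)))]), minimised by D."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def generator_loss(d_fake):
    """J_G = E_z[log(1 - D(G(z)))], minimised by G (minimax form)."""
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(1.0 - d_fake)))
```

Both losses depend on `d_fake`, which is produced by D evaluated on G's output, so a parameter update to either network changes the other's objective. At the solution described in the abstract, where D equals 1/2 everywhere, J_D evaluates to log 4 and J_G to -log 2.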
