[link]
- explore RL systems trained from (non-expert) human preferences between pairs of trajectory segments;
- run experiments on RL tasks, namely **Atari** and **MuJoCo**, and show the effectiveness of this approach;
- advantages mentioned:
    - no need for access to the reward function;
    - feedback needed on less than 1% of the agent's interactions -> reduces the cost of human oversight;
    - can learn complex novel behaviors.

## Introduction

**Challenges**

- goals are complex, poorly defined, or hard to specify;
- hand-specified reward functions -> behaviors that optimize the reward function without achieving the goals;

> This difficulty underlies recent concerns about misalignment between reward values and the objectives of RL systems.

**Alternatives**

- inverse RL: extract a reward function from demonstrations of desired tasks;
- imitation learning: clone the demonstrated behavior;
  *con: not applicable to behaviors that are hard for humans to demonstrate*
- use human feedback as a reward function;
  *con: requires thousands of hours of experience, prohibitively expensive*

**Basic idea**

https://i.imgur.com/3R7tO7R.png

**Contributions**

1. solve tasks for which we can only recognize, but not demonstrate, the desired behavior;
2. allow agents to be trained by non-experts;
3. scale to larger problems;
4. be economical with user feedback.

**Related work**

two lines of work: (1) RL from human ratings or rankings; (2) the general problem of RL from preferences rather than absolute reward values.

*closely related papers:*
(1) [Active preference learning-based reinforcement learning](https://arxiv.org/abs/1208.0984);
(2) [Programming by feedback](http://proceedings.mlr.press/v32/schoenauer14.pdf);
(3) [A Bayesian approach for policy learning from trajectory preference queries](https://papers.nips.cc/paper/4805-a-bayesian-approach-for-policy-learning-from-trajectory-preference-queries).

*diffs with (1)-(2):* a) they elicit preferences over whole trajectories rather than short clips; b) this work changes the training procedure to cope with nonlinear reward models and modern deep RL.

*diffs with (3):* a) (3) fits the reward function by Bayesian inference; b) (3) produces trajectories using the MAP estimate of the target policy instead of RL -> its experiments involve 'synthetic' human feedback drawn from the Bayesian model.

## Preliminaries and Method

**Agent goal** produce trajectories that are preferred by the human, while making as few queries to the human as possible.

**Workflow** at each point in time, maintain two deep NNs:
- policy *pi*: O -> A;
- reward estimate *r\_hat*: O x A -> R.

*Update procedure (asynchronous):*
1. policy *pi* => env => trajectories *tau* = {*tau^1*, ..., *tau^i*}; then update *pi* with a traditional RL algorithm to maximize the sum of predicted rewards *r\_t* = *r\_hat*(*o\_t*, *a\_t*);
2. select segment pairs *sigma* = (*sigma^1*, *sigma^2*) from *tau*; *sigma* => human comparison => labeled data;
3. update *r\_hat* on the labeled data by supervised learning.

> step 1 => trajectories *tau* => step 2 => human comparisons => step 3 => parameters for *r\_hat* => step 1 => ...

**Policy optimization (step 1)**

*subtlety:* the reward function *r\_hat* is non-stationary -> need methods robust to changes in the reward function: A2C for Atari, TRPO for MuJoCo. Use parameter settings that work well for traditional RL tasks; only adjust the entropy bonus for TRPO (to remedy inadequate exploration); normalize predicted rewards to zero mean and constant standard deviation.

**Preference elicitation (step 2)**

show the human clips of trajectory segments 1 to 2 seconds long.

*data structure:* triples (*sigma^1*, *sigma^2*, *mu*), where *mu* is a distribution over {1, 2}:
- one segment preferred over the other -> *mu* puts all mass on that choice;
- equally preferable -> *mu* is uniform;
- incomparable -> the triple is not stored.

**Fitting the reward function (step 3)**

*assumption:* the human's probability of preferring a segment *sigma^i* depends exponentially on the value of the latent reward summed over the length of the clip.

https://i.imgur.com/2ViIWcL.png

(no discounting of reward <- the human is assumed indifferent about when things happen within a trajectory segment; discounting could be considered.)

https://i.imgur.com/cpdTZW6.png

*modifications:*
1. ensemble of predictors -> normalize each base predictor independently and then average the results;
2. hold out 1/e of the data as a validation set; employ *l\_2* regularization and tune the regularization coefficient so that validation loss stays 1.1~1.5x the training loss; use dropout in some domains;
3. assume a 10% chance that the human responds uniformly at random.

(a code sketch of this fitting step appears at the end of this note)

**Selecting queries**

*pipeline:* sample trajectory segments of length k -> predict the preference with each base reward predictor in the ensemble -> select the pairs with the highest variance across ensemble members.

*future work:* query based on the expected value of information of the query.

*Related articles:*
1. [APRIL: Active Preference-learning based Reinforcement Learning](https://arxiv.org/abs/1208.0984)
2. [Active reinforcement learning: Observing rewards at a cost](http://www.filmnips.com/wp-content/uploads/2016/11/FILM-NIPS2016_paper_30.pdf)

> At each time-step, the agent chooses both an action and whether to observe the reward in the next time-step. If the agent chooses to observe the reward, then it pays the “query cost” c > 0. The agent’s objective is to maximize total reward minus total query cost.
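A minimal sketch of the reward-fitting step (step 3), assuming a small PyTorch reward model; the names (`RewardNet`, `preference_loss`) and the tensor layout are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Latent reward estimate r_hat(o, a): one scalar per timestep."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(r_hat, seg1, seg2, mu, eps=0.1):
    """Cross-entropy loss under the assumed preference model:
    P[sigma^1 > sigma^2] = exp(sum r_hat(sigma^1)) / (exp(sum r_hat(sigma^1)) + exp(sum r_hat(sigma^2))).
    seg1/seg2 are (obs, act) tensors of shape (batch, T, dim); mu is the label
    distribution over {1, 2} of shape (batch, 2); eps models the assumed 10%
    chance that the human responds uniformly at random."""
    sum1 = r_hat(*seg1).sum(dim=-1)                 # summed predicted reward over clip 1
    sum2 = r_hat(*seg2).sum(dim=-1)                 # summed predicted reward over clip 2
    p1 = torch.softmax(torch.stack([sum1, sum2], dim=-1), dim=-1)[..., 0]
    p1 = (1 - eps) * p1 + eps * 0.5                 # mix with uniform responses
    return -(mu[..., 0] * torch.log(p1) + mu[..., 1] * torch.log(1 - p1)).mean()
```

Each member of the predictor ensemble would be a separate `RewardNet` trained on the same triples, with outputs normalized independently before averaging.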
[link]
- *issue:* RL on real systems -> sparse and slow data sampling;
- *solution:* pre-train the agent with an EGAN;
- *performance:* ~20% improvement in training time at the beginning of learning compared to no pre-training; ~5% improvement and smaller variations compared to GAN pre-training.

## Introduction

5G telecom systems -> must fulfill ultra-low latency, high robustness, quick response to changed capacity needs, and dynamic allocation of functionality.

*Problems:*
1. exploration has an impact on service quality in real-time service systems;
2. sparse and slow data sampling -> extended training duration.

## Enhanced GAN

**Formulas**

the training data for RL tasks:

$$x = [x_1, x_2] = [(s_t,a),(s_{t+1},r)]$$

the generated data:

$$G(z) = [G_1(z), G_2(z)] = [(s'_t,a'),(s'_{t+1},r')]$$

the value function for the GAN:

$$V(D,G) = \mathbb{E}_{z \sim p_z(z)}[\log(1-D(G(z)))] + \lambda D_{KL}(P||Q)$$

where the regularization term $D_{KL}$ has the following form:

$$D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

**EGAN structure**

https://i.imgur.com/FhPxamJ.png

**Algorithm**

https://i.imgur.com/RzOGmNy.png

The enhancer is fed with training data *D\_r(s\_t, a)* and *D\_r(s\_{t+1}, r)* and trained by supervised learning. After the GAN generates synthetic data *D\_t(s\_t, a, s\_{t+1}, r)*, the enhancer strengthens the dependency between *D\_t(s\_t, a)* and *D\_t(s\_{t+1}, r)* and updates the weights of the GAN (see the sketch after this note).

## Results

two lines of experiments on the CartPole environment with PG agents:
1. comparing the learning curves of agents with no pre-training, GAN pre-training, and EGAN pre-training. => Result: EGAN > GAN > no pre-training
2. comparing the learning curves of agents with EGAN pre-training for various numbers of pre-training episodes (500, 2000, 5000). => Result: 5000 > 2000 ~= 500
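A minimal PyTorch sketch of the enhancer idea: the network shapes, the names `Enhancer` / `enhancer_consistency_loss`, and the way the consistency term feeds back into the generator are one plausible reading of the algorithm above, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Enhancer(nn.Module):
    """Supervised model of the transition data: (s_t, a) -> (s_{t+1}, r),
    trained on real transitions D_r before being applied to generated data."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),
        )

    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :-1], out[..., -1]          # predicted s_{t+1}, predicted r

def enhancer_consistency_loss(enhancer, fake_s, fake_a, fake_next_s, fake_r):
    """One reading of 'enhance the dependency between D_t(s_t, a) and D_t(s_{t+1}, r)':
    penalize generated (s'_{t+1}, r') that disagree with the enhancer's prediction
    from (s'_t, a'), and backpropagate this term into the generator's weights."""
    pred_next_s, pred_r = enhancer(fake_s, fake_a)
    return F.mse_loss(fake_next_s, pred_next_s) + F.mse_loss(fake_r, pred_r)
```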
[link]
Main idea: planning is 'trying things in your head' using an internal model of the world.

#### Diagram

https://i.imgur.com/vDAobu1.png

#### Generic algorithm

- steps 1-3: standard reinforcement learning agent
- step 4: learning of domain knowledge - the action model
- step 5: RL from hypothetical, model-generated experiences - planning

https://i.imgur.com/G6dZ00F.png

(a Dyna-Q style sketch of these steps appears at the end of this note)

#### Action model

- input: state, action; output: immediate resulting state and reward
- search control: how to select the hypothetical state and action

## Potential problems

1. reliance on supervised learning
2. hierarchical planning
3. ambiguous and hidden state
4. ensuring variety in action
5. taskability
6. incorporation of prior knowledge
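A compact tabular Dyna-Q sketch of the generic algorithm above (Dyna-Q being one standard instantiation of the Dyna architecture), assuming an old-style discrete Gym environment whose `reset()` returns the state; hyperparameters and the random search-control rule are illustrative:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(float)            # steps 1-3: standard RL value estimates Q[(s, a)]
    model = {}                        # step 4: action model, model[(s, a)] = (r, s')

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s2, r, done, _ = env.step(a)
            # direct RL update from real experience
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in range(n_actions)) - Q[(s, a)])
            # learn the (deterministic) action model from the observed transition
            model[(s, a)] = (r, s2)
            # step 5: planning - RL updates from hypothetical, model-generated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))   # simple search control
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in range(n_actions)) - Q[(ps, pa)])
            s = s2
    return Q
```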
[link]
- this paper: develop a framework to replay important transitions more frequently -> learn more efficiently
- prior work: uniformly sample a replay memory to get experience transitions
- evaluation: DQN + PER outperforms DQN on 41 out of 49 Atari games

## Introduction

**issues with online RL** (solution: experience replay):
1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later

**key idea:** more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error

**issues with prioritization:**
1. loss of diversity -> alleviate with stochastic prioritization
2. introduced bias -> correct with importance sampling

## Prioritized Replay

**criterion:**
- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate

**stochastic sampling:**

$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used.

*two variants:*
1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (a small positive constant to avoid zero probability)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust, as it is insensitive to outliers**

https://i.imgur.com/T8je5wj.png

**importance sampling:**

IS weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$$

- weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max_i w_i}$
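A minimal numpy sketch of the proportional variant with the IS weights above; this is a simple O(N) version (the paper uses a sum-tree for efficiency), and the class name and default constants are illustrative:

```python
import numpy as np

class ProportionalReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # new transitions get the current max priority so they are replayed at least once
        p = max(self.priorities, default=1.0)
        self.data.append(transition); self.priorities.append(p)
        if len(self.data) > self.capacity:
            self.data.pop(0); self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        P = p / p.sum()                                   # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=P)
        N = len(self.data)
        w = (N * P[idx]) ** (-beta)                       # w_i = (1/N * 1/P(i))^beta
        w /= w.max()                                      # normalize by 1/max w_i (over the batch)
        return idx, [self.data[i] for i in idx], w

    def update_priorities(self, idx, td_errors):
        # proportional variant: p_i = |delta_i| + epsilon
        for i, d in zip(idx, td_errors):
            self.priorities[i] = abs(d) + self.eps
```

In the Q-learning update, the returned weights `w` multiply the TD errors ($w_i \delta_i$ instead of $\delta_i$), and the new absolute TD errors are written back via `update_priorities`.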
[link]
GAN - derives backprop signals through a **competitive process** involving a pair of networks.

Aim: provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts; point to remaining challenges in theory and applications.

## Introduction

- how it is achieved: implicitly modelling high-dimensional distributions of data
- the generator receives **no direct access to real images**, only the error signal from the discriminator
- the discriminator receives both synthetic samples and samples drawn from the real images
- G: G(z) -> R^|x|, where z \in R^|z| is a sample from the latent space and x \in R^|x| is an image
- D: D(x) -> (0, 1); in practice, D may not be trained until the generator is optimal

https://i.imgur.com/wOwSXhy.png

## Preliminaries

- the objective functions J_G(theta_G; theta_D) and J_D(theta_D; theta_G) are **co-dependent** as they are iteratively updated
- difficulty: it is hard to construct likelihood functions for high-dimensional, real-world image data
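A minimal PyTorch sketch of the alternating, co-dependent updates of D and G described above; `G`, `D` (assumed to end in a sigmoid), the optimizers, and `z_dim` are placeholders, and the non-saturating generator loss is a common choice rather than anything specific to this overview:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # update D(theta_D; theta_G): classify real vs. synthetic samples
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                              # G only supplies samples here, no gradient
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # update G(theta_G; theta_D): G never sees real images, only D's error signal
    z = torch.randn(batch, z_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)    # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```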