[link]
## General Framework

*Wording: "car" = the autonomous car, "driver" = the other car it is interacting with.*

Builds a model of an **autonomous car's influence over the behavior of an interacting driver** (human or simulated) that the autonomous car can leverage to plan more efficiently. The driver is modeled as the policy that maximizes a defined objective. In brief, a **linear reward function is learned off-line with IRL on human demonstrations**, and the modeled policy takes the actions that maximize the driver's return under this reward function (computed with **Model Predictive Control (MPC)**). They show that a car planning with this driver model learns ways to **influence the driver either towards specified behaviors** or in ways that **achieve higher payoff**. These results also hold when interacting with human drivers, **who are only loosely approximated by this driver model**.

I believe the **key to this success lies in a learned reward that generalizes well** (a linear function of a few, clever hand-designed features) to new states, as well as the **use of MPC**: by **focusing on modeling goals, it captures a reasonable driver policy** (since the learned reward is accurate) and does not have to deal with concurrent learning dynamics. On the other hand, IL would try to match behavior instead of goals and might not generalize as well to new (unseen) situations. Additionally, MPC makes it possible to coordinate the car-driver interactions over **extended time horizons** (through planning). **This shows that leveraging an influence model is promising for communication emergence.**

*Parallel with: promoting influence like Jaques et al. is more effective than hoping agents figure it out by themselves like MADDPG.*

## Motivations

Previous methods use simplistic influence models ("the driver will keep a constant velocity"), yet the car's behavior influences the driver whether the car is aware of it or not. This simplistic model only leads to "defensive behaviors" where the car avoids disturbing the driver, therefore does not interact with it, and yields suboptimal strategies. Additionally, a simplistic interaction model seems to lead exclusively to functional actions instead of communicative ones.

## Assumptions

* **Sequential decision making**: the car acts first, which forces a driver response, which makes the world transition to a new state (could be alleviated by considering influence across time steps: the car's action at time t influences the driver's action at time t+1).
* Approximates the **human as an optimal planner** (but with limited recursive induction) without uncertainty (deterministic model), etc.
* The car uses **1st-order recursive induction** ("my action influences the driver, but the driver believes that its action won't influence me, i.e. that my action is constant no matter what it does"). *I see this as assuming a "defensive" driver and an "aggressive" car.*
* **Fully observable** environment (not sure how difficult it would be to waive this).
* Model Predictive Control (MPC) with L-BFGS requires access to a **differentiable transition function**.

## Method

Model the interaction as a dynamical system where the car's action influences both the next state and the driver's action. Uses Theano to backprop through the transition function, and the implicit function theorem to derive a symbolic expression of the gradient. A minimal sketch of this nested planning appears after the results below.

## Results

* The car leverages the driver model to **influence** simulated drivers (the car has a **perfect model**) in **specified ways** (by modifying the car's reward function).
* The car leverages the driver model to **influence** simulated drivers (the car has a **perfect model**) in order to be **more efficient w.r.t. its own objective**.
* The car leverages the driver model to **influence** human drivers (the car has an **imperfect model**) in **specified ways**.

*Colors in the figures: the autonomous car is yellow, the driver's car is red.*
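To make the nested planning concrete, here is a minimal sketch. The paper differentiates through the driver's best response (Theano plus the implicit function theorem); the toy version below instead uses generic numerical optimizers from SciPy, and every feature, weight value, and function name here is an illustrative assumption, not the paper's actual model.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical hand-designed features; in the paper the driver reward is linear
# in a few such features, with weights learned offline via IRL.
def features(state, u_car, u_driver):
    car_pos, driver_pos = state[:2], state[2:]
    next_car, next_driver = car_pos + u_car, driver_pos + u_driver
    return np.array([
        -np.sum((next_car - next_driver) ** 2),  # collision avoidance
        -np.sum(u_driver ** 2),                  # control effort
        -np.abs(next_driver[1]),                 # stay in lane (y = 0)
    ])

theta_driver = np.array([1.0, 0.1, 0.5])  # assumed IRL weights (placeholder values)

def driver_reward(state, u_car, u_driver):
    return theta_driver @ features(state, u_car, u_driver)

def driver_best_response(state, u_car):
    """Driver model: maximize the learned reward, taking the car's action as
    fixed (1st-order recursive induction)."""
    res = minimize(lambda u: -driver_reward(state, u_car, u),
                   x0=np.zeros(2), method="L-BFGS-B")
    return res.x

def car_reward(state, u_car, u_driver):
    # The car's own objective: make progress along x while avoiding the driver.
    car_pos, driver_pos = state[:2], state[2:]
    next_car, next_driver = car_pos + u_car, driver_pos + u_driver
    return next_car[0] - 10.0 * np.exp(-np.sum((next_car - next_driver) ** 2))

def plan_car_action(state):
    """Single-step MPC: pick the car action that maximizes the car's reward
    while *anticipating* the driver's best response to that action."""
    def neg_payoff(u_car):
        u_driver = driver_best_response(state, u_car)
        return -car_reward(state, u_car, u_driver)
    res = minimize(neg_payoff, x0=np.zeros(2), method="Nelder-Mead")
    return res.x

state = np.array([0.0, 0.0, 5.0, 1.0])  # [car_x, car_y, driver_x, driver_y]
print("planned car action:", plan_car_action(state))
```

The derivative-free outer optimizer is only there because this sketch cannot differentiate through the inner `minimize` call; that is exactly what the implicit-function-theorem trick in the paper provides.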
[link]
## General Framework

Really **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so, they improve DAgger's bound of $\epsilon_{class} u T$, where $u$ is the difference in cost-to-go (between the expert and one error followed by the expert policy), to $\epsilon_{class} T$, where $\epsilon_{class}$ is the error due to the lack of expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go, they remove the cost-to-go contribution to the bound (the difference in performance of the learned policy vs. the expert policy) and thus obtain a tighter bound.

In the paper they use the word "regret" for two distinct concepts, which is confusing to me: one for the no-regret online-learning meta-approach to IL (similar to DAgger), and another because AggreVaTe aims at minimizing the cost-to-go difference with the expert (cost-to-go difference: the sum of the costs I endured because I did not behave like the expert once = *regret*), whereas DAgger aims at minimizing the error rate w.r.t. the expert. Additionally, the paper extends the view of imitation learning as an online learning procedure to reinforcement learning.

## Assumptions

**Interactive**: you can re-query the expert and thus reach $\epsilon T$ bounds instead of the $\epsilon T^2$ of non-interactive methods (Behavioral Cloning), which suffer from compounding errors. Additionally, one also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss w.r.t. the expert, for example), since the cost-to-go is computed under the expert policy (it would always yield 0 cost).

## Other methods

**SEARN**: also reasons about the **cost-to-go, but under the current policy** instead of the expert's (even if you can use the expert's in practice, in which case it becomes really similar to AggreVaTe). SEARN uses **stochastic policies** and can be seen as an AggreVaTe variant where stochastic mixing is used to solve the online learning problem instead of **Regularized-Follow-The-Leader (R-FTL)**.

## AggreVaTe -- IL with cost-to-go

Pretty much like DAgger, but one has to use a no-regret online learning algorithm to do **cost-sensitive** instead of regular classification. In the paper, they use the R-FTL algorithm and train the policy on all previous iterations. Indeed, using R-FTL with a strongly convex loss (like the squared error) and a stable batch learner (like stochastic gradient descent) ensures the no-regret property.

In practice (to deal with infinite policy classes and knowing the cost of only a few actions per state), they reduce cost-sensitive classification to an **argmax regression problem**: train a model to match the cost given state-action (and time, if we want nonstationary policies) using the collected datapoints and the (strongly convex) squared-error loss, then take the argmin of this model to know which action minimizes the cost-to-go (cost-sensitive classification). This is close to what we do in **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD-error (instead of full-rollout costs-to-go of the expert), then argmax your critic to get your policy. Similarly to DQN, the way you explore the actions whose cost-to-go you compute is important (in this paper they use uniform exploration). A minimal sketch of this reduction follows.
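Here is a minimal sketch of the argmax-regression reduction under toy assumptions (discrete actions, a generic squared-error SGD regressor); the helper names `encode`, `fit_on_aggregate`, and `policy` are illustrative, not from the paper.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical aggregated dataset: each entry is
# (state_features, explored_action_index, observed cost-to-go under the expert).
N_ACTIONS = 4

def encode(state, action):
    one_hot = np.zeros(N_ACTIONS)
    one_hot[action] = 1.0
    return np.concatenate([state, one_hot])

# Squared-error regression of the cost-to-go: a strongly convex loss with a
# stable SGD learner, which is what the R-FTL no-regret argument relies on.
cost_model = SGDRegressor(loss="squared_error")

def fit_on_aggregate(dataset):
    X = np.stack([encode(s, a) for s, a, _ in dataset])
    y = np.array([c for _, _, c in dataset])
    cost_model.partial_fit(X, y)   # one pass over the whole aggregate

def policy(state):
    # Cost-sensitive classification via argmin of the regressed cost-to-go.
    costs = [cost_model.predict(encode(state, a)[None, :])[0]
             for a in range(N_ACTIONS)]
    return int(np.argmin(costs))

# Usage with fake data standing in for one AggreVaTe-style iteration.
dataset = [(np.random.randn(6), np.random.randint(N_ACTIONS), np.random.rand())
           for _ in range(256)]
fit_on_aggregate(dataset)
print(policy(np.random.randn(6)))
```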
**Limitations**

If the policy class is not expressive enough to match the expert's performance, this algorithm may fail to learn a reasonable policy. Example: the task is to go from point A to point B; there exists a narrow shortcut and a safer but longer road. The expert can handle both roads, so it prefers taking the shortcut. Even if the learned policy class can handle the safer road, it will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, ignoring the fact that they cannot be achieved by any policy of the learned policy class.

## RL via No-Regret Policy Iteration -- NRPI

NRPI does not require an expert policy anymore, only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's, since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the equivalent argmax-regression problem is really similar to Q-learning (using sampled costs-to-go from rollouts instead of Bellman errors), but **the costs-to-go** in the aggregated dataset correspond to **outdated policies!** (in contrast, DQN's data is comprised of rewards instead of costs-to-go). Yet, since R-FTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations, and the policies as well as the costs-to-go converge. A self-contained sketch of this loop follows the summary below.

The performance of NRPI is strongly limited by the quality of the exploration distribution. Yet if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for non-interactive, state-only IL (if you have access to a reward).

## General limitations

Both methods are much less sample efficient than DAgger, as they require costs-to-go: one full rollout for one data point.

## Broad contribution

Seeing iterative learning methods such as Q-learning in the light of online learning methods is insightful and yields better bounds and a better understanding of why some methods might work. It provides a good tool to analyze the dynamics that interleave learning and execution (optimizing and collecting data) for the purpose of generalization. For example, the bound for NRPI can seem quite counter-intuitive to someone familiar with the on-policy/off-policy distinction: NRPI optimizes a policy w.r.t. the **costs-to-go of other policies**, yet R-FTL tells us that it converges towards what we want. Additionally, it may give a practical advantage for stability, as the policy is optimized with larger batches and thus has to be good across many states and many cost-to-go formulations.
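As referenced above, here is a self-contained toy sketch of the NRPI loop: sample states from an exploration distribution, explore actions uniformly, estimate the cost-to-go of the *current* policy with a full rollout, aggregate, and refit the cost regressor on everything collected so far. The environment, dimensions, and names are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

N_ACTIONS, STATE_DIM, HORIZON = 4, 6, 10
cost_model = SGDRegressor(loss="squared_error")
fitted = False

def encode(s, a):
    one_hot = np.zeros(N_ACTIONS); one_hot[a] = 1.0
    return np.concatenate([s, one_hot])

def policy(s):
    if not fitted:                      # before the first fit, act randomly
        return np.random.randint(N_ACTIONS)
    costs = [cost_model.predict(encode(s, a)[None, :])[0] for a in range(N_ACTIONS)]
    return int(np.argmin(costs))

def env_step(s, a):                     # placeholder dynamics and cost
    s_next = s + 0.1 * np.random.randn(STATE_DIM)
    return s_next, float(np.sum(s_next ** 2))

def rollout_cost_to_go(s, first_action):
    """Cost-to-go of the *current* policy after forcing `first_action` in s
    (one full rollout per data point, hence the sample inefficiency)."""
    total, a = 0.0, first_action
    for _ in range(HORIZON):
        s, c = env_step(s, a)
        total += c
        a = policy(s)
    return total

aggregate = []  # costs-to-go collected under past, now outdated, policies
for iteration in range(5):                       # NRPI outer loop
    for _ in range(64):
        s = np.random.randn(STATE_DIM)           # sample from the exploration distribution
        a = np.random.randint(N_ACTIONS)         # uniform action exploration
        aggregate.append((s, a, rollout_cost_to_go(s, a)))
    X = np.stack([encode(s, a) for s, a, _ in aggregate])
    y = np.array([c for _, _, c in aggregate])
    cost_model.partial_fit(X, y)                 # SGD pass over the full aggregate (approx. R-FTL refit)
    fitted = True
```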
[link]
## General Framework

The imitation learning problem is here cast into a classification problem: label the state with the corresponding expert action. With this view, structured prediction (predict the next label knowing your previous prediction) becomes a degenerate IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-go (Q-value) between the expert action followed by the expert policy and any action followed by the expert policy is small (which does not hold if one mistake makes you fall off a cliff, for example!).

## Problem definition

Three problems are highlighted in this paper: **1 - interaction between policy and distribution**, **2 - violation of the i.i.d. assumption**, and **3 - distributional shift and the resulting compounding error of BC**. The paper implies that **1** implies **2**, which implies **3**, but they eventually only address **3** with the algorithm they propose (DAgger).

1. **Interaction between policy and distribution**: the fact that the learned policy appears in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex**, even for convex loss functions. **Yet this also implies non-i.i.d. samples!** Indeed, even if one could directly sample from the state-action distribution (e.g. having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency would occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample, so my next sample depends on my previous sample (since it conditioned my policy update).
2. **Violation of the i.i.d. assumption**: we just discussed that 1. implies non-i.i.d. samples across updates (batches), yet there is another form of dependency (inside a batch this time): in practice we collect samples using rollouts, and within a trajectory s' depends on (s, a), so we have a dependency because we collected the **samples sequentially**.
3. **Distribution shift and compounding error of Behavioral Cloning (supervised learning)**: during training, you assume i.i.d. data and you do not model that the next state you will have to classify (act upon) is influenced by your previous action. But at execution time samples are not i.i.d.: if you make a mistake, you reach a state that you may have never seen (distribution shift) and you cannot recover (compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (cf. Levine's class example with the funambulist and the worst-case scenario: as soon as you make a mistake, you cannot recover and you keep making mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses: they propose a worst-case scenario linear in both** $\epsilon$ **and** $T$.

**Additional assumption compared to BC**: during training you are allowed to query the expert as much as you want.

## Other methods

* **Forward Training**: trains a **non-stationary policy** (one policy for each time step).
Each policy is trained to match the expert on the distribution of states encountered at that time step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, and you cannot stop until you have trained all of them.
* **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, therefore iteratively accounting for the distribution drift; eventually, the distribution should converge. This yields near-linear regret in $T$ and $\epsilon$. **Problem**: at execution time, we sample a policy from the mixture, which can perform poorly on the given state, especially if it corresponds to a different distribution (an early iteration, for example).

## DAgger -- Dataset Aggregation

**Algo**

$\beta_i$ controls the proportion of the collected state distribution that comes from the expert. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in states that are irrelevant. In practice, using $\beta_i = I(i=0)$ works well. Intuitively, by training on the transitions generated by its own rollouts, DAgger reduces the distribution shift. A minimal sketch of the DAgger loop follows this summary.

**Link to No-Regret Online Learning**

DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an "extra trick" that does not exist as such in FTL algorithms.

*Regret*: the difference between what I did and the best thing that I could have done.

Each iteration of DAgger can be seen as a step of online learning (treat mini-batches of trajectories under a single policy as a single online-learning example) where the online algorithm used is FTL, but it could actually be any no-regret algorithm. Indeed, using the no-regret view, the authors show that using a no-regret algorithm (regret decays with $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yields, for $N$ big enough (on the order of $T$), a worst-case performance linear in $\epsilon$, $u$ and $T$, where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and a difference in the cost-to-go (Q-value) of $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and the expert is able to recover from a mistake ($u$ is bounded), then the use of a **no-regret online learning algorithm** will yield a policy that makes at most (proportionally to) $\epsilon T u$ errors **on its own state distribution (test distribution)**.

**Requirements**

* An error rate of $\epsilon$ on the training distribution.
* A strongly convex loss function between expert and learner labels (the 0-1 loss is not, but the binary cross-entropy usually used for classification is), so that FTL is indeed a no-regret algorithm. If the loss is not strongly convex, FTL has no guarantees and we must use another no-regret method.
* The difference in the cost-to-go (Q-value) between the expert action followed by the expert policy and any action followed by the expert policy must be small (the expert must be able to recover: this does not account for falling off a cliff, for example!).
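As referenced above, a minimal, self-contained sketch of the DAgger loop under toy assumptions: a placeholder environment, a synthetic queryable expert, discrete actions, and a decision tree standing in for any supervised learner. Only the structure of the loop reflects the paper; all names and numbers are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

STATE_DIM, HORIZON, N_ITER = 4, 50, 10

def env_step(state, action):              # placeholder dynamics
    return state + 0.1 * np.random.randn(STATE_DIM) + 0.05 * action

def expert_action(state):                 # interactive expert we can query at will
    return int(state.sum() > 0)

learner = DecisionTreeClassifier(max_depth=5)
states, labels = [], []                   # aggregated dataset D

for i in range(N_ITER):
    beta = 1.0 if i == 0 else 0.0         # beta_i = I(i == 0), as suggested in practice
    state = np.random.randn(STATE_DIM)
    for _ in range(HORIZON):
        # Roll out the mixture policy to decide which states get visited...
        if np.random.rand() < beta:
            action = expert_action(state)
        else:
            action = int(learner.predict(state[None, :])[0])
        # ...but always label every visited state with the expert's action.
        states.append(state)
        labels.append(expert_action(state))
        state = env_step(state, action)
    # Follow-the-Leader step: retrain on the whole aggregated dataset.
    learner.fit(np.array(states), np.array(labels))
```

The FTL retraining step could be swapped for any other no-regret online learner without changing the rest of the loop, which is exactly the meta-algorithm view discussed above.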
[link]
## Summary

The broad goal of this paper is to understand how a neural network learns the underlying distribution of the input data, and which properties of the network describe its generalization power. Previous literature tries to use statistical measures like Rademacher complexity, uniform stability and VC dimension to explain the generalization error of the model. These methods explain generalization in terms of the number of parameters in the model along with the applied regularization. The experiments performed in [Section 2] of the paper show that the learning capacity of a CNN cannot be sufficiently explained by traditional statistical learning theory. Even the effect of different regularization strategies in CNNs is shown to be potentially unrelated to the generalization error, which contradicts the theory behind VC dimension. The experiments of the paper show that the model is able to learn some underlying patterns for random labels and inputs with different amounts of Gaussian noise. When the authors gradually increase the noise in the inputs, the generalization error gradually increases while the training error is still able to reach zero. The authors conclude that big networks are able to completely memorize the whole dataset. (A toy version of this randomization test is sketched after the references.)

## Personal Thoughts

1) Firstly, we need a new theory to explain why and how a CNN memorizes the inputs and generalizes to new data. Since the paper shows that regularization doesn't have much effect on generalization for big networks, maybe the network is actually memorizing the whole input space. But the memorization is very strategic, in the sense that only the inputs where no simple underlying features are found (e.g. noise) are completely memorized, unlike inputs with a stronger signal where patterns can be found. This may explain the discrepancy in the number of training steps between 'true labels' and noisy inputs in [Figure 1 a.]. My very general understanding of the Information Bottleneck hypothesis [4] is that networks compress noisy input data as much as possible while preserving important information. A network takes more time to compress noise than strong signals in images. This may give some intuition about the learning process taking place.

2) A CNN is highly non-linear, with millions of parameters, and has a very complex loss landscape. There might be multiple minima, and we need a theory to explain which of these minima gives the highest generalization. Unfortunately, the working of SGD is still a black box and is very difficult to characterize. There are many interesting phenomena, like adversarial attacks, the effect of the optimizer on the weights found (Im et al., 2016), and the actual understanding of non-linearity in CNNs (Goodfellow & Vinyals, 2015), that all point to lapses in our overall understanding of very high dimensional manifolds. This requires rigorous experimentation to study and understand the effects of the network architecture, the optimizer, and the actual input to the network (Keskar et al., 2017) independently on generalization.

## References

1. Im, Daniel Jiwoong, et al. "An empirical analysis of the optimization of deep network loss surfaces." arXiv (2016).
2. Goodfellow, Ian J., and Oriol Vinyals. "Qualitatively characterizing neural network optimization problems." CoRR abs/1412.6544 (2015).
3. Keskar, Nitish Shirish, et al. "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." arXiv abs/1609.04836 (2017).
4. Information Bottleneck hypothesis (video): https://www.youtube.com/watch?v=XL07WEc2TRI
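As referenced in the summary, here is a toy version of the randomization test: fit an over-parameterized network on inputs that are pure noise with uniformly random labels and watch the training error go to zero. The paper does this with CIFAR-10 and large CNNs; the tiny MLP and synthetic data below are only an illustrative stand-in.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the paper's randomization test: Gaussian-noise
# "images" with labels drawn uniformly at random, i.e. zero learnable signal.
rng = np.random.default_rng(0)
n, d, n_classes = 256, 32, 10

X = rng.normal(size=(n, d))               # inputs of pure noise
y_random = rng.integers(0, n_classes, n)  # random labels

net = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                    max_iter=5000, tol=1e-6, random_state=0)
net.fit(X, y_random)

print("training accuracy on random labels:", net.score(X, y_random))
# Expect close to 1.0: the network simply memorizes the data, so low training
# error alone cannot explain generalization.
```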
[link]
Is the market efficient? This is perhaps the most prevalent question in all of finance. While this paper does not aim to answer that question, it does frame it in an information-theoretic context. Mainly, Maymin shows that at least the weak form of the efficient market hypothesis (EMH) holds if and only if P = NP.

First, he defines what an efficient market means: "The weakest form of the EMH states that future prices cannot be predicted by analyzing prices from the past. Therefore, technical analysis cannot work in the long run, though of course any strategy could make money randomly."

Consider $n$ past historical price changes, each either UP or DOWN, and strategies that take one of three positions on the market: neutral, long, or short. Checking whether there exists a strategy that makes money in a statistically significant way requires checking all $3^n$ possible strategies. Verifying that a given strategy is profitable beyond random chance can be done with a linear $O(n)$ pass over the historical data. Thus the problem of finding a profitable strategy is in NP. (A brute-force sketch of this search is given after the takeaway.)

If P = NP, then computing a profitable strategy can be done efficiently in polynomial time, since a trader can check each possible strategy in polynomial time. We can then trade based on our results, making the market efficient as well. If the market is efficient, it becomes impossible for a trader to predict future prices based on historical data, as the current price has all publicly available information "priced in". A future price would be a random fluctuation of the current price.

Does an efficient market imply P = NP? Consider an example 3-SAT instance:

$$(a \lor b \lor \lnot c) \land (a \lor \lnot b \lor d)$$

The NP-complete problem 3-SAT can be encoded into the market using order-cancels-order (OCO) orders[^1], where each variable is a security and negated variables are sell orders. Place these two OCOs in the market:

$$\text{OCO (buy A, buy B, or sell C)}$$
$$\text{OCO (buy A, sell B, or buy D)}$$

After some constant time, any outstanding order is cancelled and all positions are liquidated. If the market is efficient, then there exists a way to execute these two OCO orders such that an overall profit is guaranteed. In other words, an efficient market allows us to solve an arbitrary 3-SAT problem in polynomial time.

### **Takeaway**: The author links market efficiency with computational complexity.

[^1]: An order-cancels-order is a type of order on two different securities that automatically cancels the other order when one is filled. There is no chance that both orders can be filled.
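To make the complexity claim concrete, here is a brute-force sketch that enumerates the $3^n$ candidate strategies and verifies each with an $O(n)$ pass. It interprets a strategy as one position per historical step and replaces the statistical-significance test with a simple profit threshold; both are simplifying assumptions of mine, not the paper's exact formulation.

```python
from itertools import product

def profit(strategy, price_changes):
    # O(n) verification: position (+1 long, -1 short, 0 neutral) times price move.
    return sum(pos * change for pos, change in zip(strategy, price_changes))

def find_profitable_strategy(price_changes, threshold):
    n = len(price_changes)
    for strategy in product((0, 1, -1), repeat=n):   # all 3**n candidates
        if profit(strategy, price_changes) >= threshold:
            return strategy
    return None

# Usage on a tiny history (+1 = UP, -1 = DOWN); the search is exponential in n,
# so it is only feasible for small n unless P = NP gives us something better.
history = [+1, -1, -1, +1, +1]
print(find_profitable_strategy(history, threshold=5))
```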