Chris Murray's profile - ShortScience.org

doi.acm.org
sci-hub
scholar.google.com

How to use expert advice
Cesa-Bianchi, Nicolò and Freund, Yoav and Helmbold, David P. and Haussler, David and Schapire, Robert E. and Warmuth, Manfred K.
Symposium on the Theory of Computing - 1993 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

The authors consider online learning of binary values, where each period each of $N$ experts makes a prediction (in $[0,1]$), and the learner must make predictions such that, in hindsight, the learner didn't do much worse than the best expert. The loss function the authors use is $|\hat{Y}_t - Y_t|$, the the total loss is the sum over all $t$.

The authors first solution to this problem is an algorithm called MM.  It uses a complex recursively-defined function v which takes exponential time to compute, but which (roughly) computes the anticipated future total loss of each expert for the rest of the game and uses that.  The bound they give for MM is in terms of this $v$ function, so it isn't easily interpreted.  This is analyzed in tremendous detail and under various assumptions.

The authors then give a more familiar (and implementable) multiplicative weight update algorithm, where a certain non-negative weight is placed on each expert and our prediction at any time is the weighted average of the experts' predictions.  After every period, each expert's weight is multiplied by some function of the loss it incurred that period and a learning rate. They show how, when the weight updates are done using the right function, the algorithm has a nice Bayesian interpretation. This paper is dense (at 59 pages) and so filled with proofs it feels like reading an appendix.

dx.doi.org
sci-hub
scholar.google.com

The Robustness of the p-Norm Algorithms
Gentile, Claudio
Machine Learning Journal - 2003 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

This paper describes a class of algorithms for classification or regression in the on-line setting.  That is, the data is a bunch of pairs $(X_t,Y_t)$ (where X may be a vector), and these data items arrive in some order: the algorithm must predict each $\hat{Y}_t$ using only the $X_t$ and previously seen pairs. In the regression setting, each mis-prediction has a loss that is like $(Y_t - \hat{Y}_t)^2$, and in the classification setting $Y_t$ is always 0 or 1 and the loss is $| Y_t - \hat{Y}_t |$.

Roughly, the algorithm makes linear predictions using some internal weight vector $(\hat{y} = w  * X)$, and does a gradient-descent like weight update. However, it tries to keep the q-norm (q can be any number) of the weight vector "small", preventing the weights themselves from becoming too large. The algorithm is actually simple, and the weight update takes advantage of link functions, which the author defines.  The majority of the paper is focused on deriving loss bounds, showing that the loss incurred by this algorithm isn't much worse than that incurred by the best weight vector, chosen in hindsight.  Typical readers will be interested in the first few pages, as the latter part of the paper is mainly technical proofs.

dx.doi.org
sci-hub
scholar.google.com

A Potential Reduction Algorithm with User-Specified Phase I-Phase II Balance for Solving a Linear Program from an Infeasible Warm Start
Freund, Robert M.
SIAM Journal on Optimization - 1995 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

This is a fantastic paper: it presents the problem well, the algorithms are intuitive and clearly presented, and the proofs are not overly long.  This describes the online allocation problem: roughly how to use the hypothesis of N "experts" to create a hypothesis that isn't much worse than the best expert (alternatively, how to allocate wealth every period among N traders, conditioned on only their past performance, such that at the end you performed almost as well as the best trader chosen in hindsight).  This paper presents the hedge algorithm, which works with general bounded summable loss functions and has only one parameter (a learning rate) to tune: it works by simply decreasing the weight on each expert every period by a factor like (learning_rate)^(loss), where 0 < learning_rate <= 1.  The final bound, where L(hedge) is the loss of the hedge algorithm, L* is the loss of the best expert (the one with minimum loss), and $0<B<=1$ is the learning rate, is:

$$L(hedge) <= \[  ln(1/B) \* L\* + ln(N)  \] / (1 - B)$$

This paper also describes boosting and relates it to the hedge algorithm, though in my opinion the description given of the boosting problem and adaboost isn't nearly as good as the explanation of the online allocation problem and Hedge.  In boosting, we have a weak learner, which can learn a function $X \rightarrow Y$ given some examples, but possibly with high error rate (the precise definition of a weak is, of course, in the paper).  In boosting, a "master algorithm" has some set of labelled examples $X_i \rightarrow Y_i$  (X may be multi-dimensional).  it calls the weak learner many times, giving it different distributions over these examples and looking at the hypotheses created by the weak learner each time.  It then combines these hypothesis into a "master" hypothesis that is guaranteed to be "good". The paper continues with several extentions of boosting to other domains like regression.

doi.acm.org
sci-hub
scholar.google.com

Context-sensitive Learning Methods for Text Categorization
Cohen, William W. and Singer, Yoram
Special Interest Group on Information Retrieval - 1996 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

This paper focuses on empirical results of various algorithms for text classification: my interest in it was its description of the sleeping experts algorithm.  Roughly, in the experts model, there are many "experts" making a prediction on every sample, and we must learn which experts to trust: i.e. we will make a prediction based on the weighted predictions of the experts, and the weights might depend on the experts' past peformance.  In "sleeping experts", some of the experts make no predictions on some samples. This paper has a good description, with pseudocode, of the sleeping expert algorithm, but does not derive any bounds.  The paper also describes Ripper, an algorithm for learning rule sets.

dx.doi.org
sci-hub
scholar.google.com

Learning Boolean Functions in an Infinite Attribute Space
Blum, Avrim
Machine Learning Journal - 1992 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

This paper develops algorithms and mistake bounds for learning non-noisy boolean functions.  It considers the case when there is a very large number of possible attributes, but each example (and the concept) depends on only a few attributes, so we want ot avoid having to ever explicitly list or consider all possible attributes.  The concept classes it deals with learning are boolean formulas. The paper is a good cite but (for my work) probably not directly applicable.

doi.acm.org
sci-hub
scholar.google.com

Universal Portfolios With and Without Transaction Costs
Blum, Avrim and Kalai, Adam
Conference on Learning Theory - 1997 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

The authors give a MUCH simpler description of constantly rebalanced portfolios (CRPs) and of their performance bound than Cover (above).  They also begin with a nice example of when a CRP has positive return even though neither stock does.  They motivate the problem of performing as well as the best CRP by describing how one can perform nearly as well as the best individual stock (within $1/N$ of the terminal wealth) by simply investing equal-weight initially and holding each stock.  They use the same strategy as Cover for performing nearly as well as the best CRP: spread wealth evenly across CRPs and let it grow within each CRP, and get the same bound: the wealth of the strategy is less than the wealth of the best CRP by no more than $(T+1)^{(N-1)}$ over $T$ periods, with $N$ assets.  They also consider a case of transaction costs to derive a (slightly looser) bound.

dx.doi.org
sci-hub
scholar.google.com

Universal portfolios with side information
Cover, Thomas M. and Ordentlich, Erik
IEEE Transactions on Information Theory - 1996 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

This paper is overly mathematical for what it accomplished (117 equations) but is somewhat interesting.  Cover considers CRPs (constantly rebalanced portfolios), where a certain portion of wealth is "fixed" in each stock and the portfolio is rebalanced every period: CRPs can earn positive return even if none of the stock's do. Finding the best CRP represents an optimization over a simplex (since we optimize over the fraction to invest in each stock), while the best stock is just a corner of the simplex, so the best CRP does at least as well as the best stock.  He considers the strategy of initially investing equal money in each CRP (i.e. over the continuous space of CRPs) and then letting the wealth in each CRP grow: doing this, he derives a bound that the terminal wealth of this strategy will only lag the terminal wealth of the best CRP chosen in hindsight by a factor  $(T+1)^{(N-1)}$, when there are $T$ periods and $N$ assets.  Thus the strategy earns the same asymptotic growth as the best CRP.  In practice I found that this technique isn't useful and the bound is very very weak (even over 50 years of data).  However, the idea of buy-and-hold having good asymptotic performance and comparing well to the best portfolio chosen in hindsight is a good one.

doi.acm.org
sci-hub
scholar.google.com

Algorithms for portfolio management based on the Newton method
Agarwal, Amit and Hazan, Elad and Kale, Satyen and Schapire, Robert E.
International Conference on Machine Learning - 2006 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

The authors present an algorithm for investment (picking which stocks to buy).  This online algorithm chooses an investment in N stocks (a point on the N-dimensional simplex) every period.  It works by optimizing, every period, some approximation to the current lossdefining a portfolio loss minus the two-norm of the investment vector.  This algorithm can be easily implemented and the authors claim is has optimal regret with respect to the best CRP (constantly rebalanced portfolio) chosen in hindsight (I leave technical details of the algorithm to those who want to read the paper).  The authors have experiments indicating that the algorithms does better than buy-and-hold on randomly selected S&P stocks.  My personal experiments with this algorithm (using code from one of the authors) indicated that the regret bounds, though optimal, are far too weak to be economically meaningful (the terminal wealth of ONS may lag the best CRP by a factor exponential in $22 \cdot N \cdot \sqrt{T}$, where $N$ is the number of stocks and T the number of periods) and the algorithm performed roughly as well as buy-and-hold on monthly data.

dx.doi.org
sci-hub
scholar.google.com

Regret to the Best vs. Regret to the Average
Even-Dar, Eyal and Kearns, Michael and Mansour, Yishay and Wortman, Jennifer
Conference on Learning Theory - 2007 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

Without a doubt this is one of the better papers I've read on this topic: it's well written and their proofs (except the last technical one) are relatively straightforward.  This also cites other useful papers on expert learning (especially the first two references).

The authors work in the experts allocation framework, where each period each of N experts has a gain in [0,1], and the algorithm must choose a weighting w on the probability simplex and will receive gain w \cdot g, where g is a vector of the experts' gains.  The algorithms total gain is the sum of its gains over T time steps.  The authors define regret to the best (average, worst) as the amount by which the algorithm's total gain lags that of the best (average, worst) expert. They point out that, while traditional expert weighting algorithm have regret $O(T^{0.5})$ to the best, they also have regret $\Omega(T^{0.5})$ to the average and worst expert.  They show that any algorithm with $O(T^0.5)$ regret to best has $\Omega(T^{0.5})$ regret to worst (more detailed bounds are provided) They develop an algorithm with $O((N\cdot T)^{0.5} \cdot log(T))$ regret to best and constant regret to average: thus giving up a little regret to the best expert for much better regret to the average. The first of these, algorithms, BestAverage, basically runs an exponential weights algorithm, resetting whenever the best expert diverges sufficiently from the average.

dx.doi.org
sci-hub
scholar.google.com

Reinforcement Learning: A Survey
Kaelbling, Leslie Pack and Littman, Michael L. and 0001, Andrew W. Moore
Journal of Artificial Intelligence Research - 1996 via Local Bibsonomy
Keywords: dblp

[link] Summary by Chris Murray 10 years ago

Great paper! Summarizes unsupervised reinforcement learning techniques, both with a model and model free. Include TD learning, Q learning, exploration vs. exploitation tradeoff, and other details. Not difficult to read for a technical audience. Explanations are clear while avoiding unnecessary detail and the paper has copious references. Granted I'm biased since I took one of the author's courses

Chris Murray

sciscore: 2.2