Mathematics of Operations Research
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


MATHEMATICS OF OPERATIONS RESEARCH
Vol. 32, No. 2, May 2007, pp. 374-394
DOI: 10.1287/moor.1060.0240
This Article
Right arrow Full Text (PDF)
Right arrow References
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Denardo, E. V.
Right arrow Articles by Rothblum, U. G.
Right arrow Search for Related Content

Risk-Sensitive and Risk-Neutral Multiarmed Bandits

Eric V. Denardo, Haechurl Park, Uriel G. Rothblum

Center for System Science, Yale University, P.O. Box 208267, New Haven, Connecticut 06520
Department of Business Administration, Chung-Ang University, 221 Heukseok-Dong, Dongjak-gu, Seoul 156-756, Korea
Faculty of Industrial Engineering and Management, Technion—Israel Institute of Technology, Haifa 32000, Israel

eric.denardo{at}yale.edu
hpark{at}cau.ac.kr
rothblum{at}ie.technion.ac.il

For the multiarmed bandit, the classic result is probabilistic: each state of each bandit (Markov chain with rewards) has an index that is determined by an optimal stopping time for that state’s bandit, and expected discounted income is maximized by playing at each epoch a bandit whose current state has the largest index. Our approach is analytic, not probabilistic. It uses pairwise comparison in place of stopping times. A simple recursion assigns to each state of each bandit a utility and an amplification of future utility that depend solely on the data for that state’s bandit. These utilities and amplifications determine whether or not one state dominates another. We show that it is optimal to play at each epoch any bandit whose current state is not dominated by the current states of the other bandits. We obtain this result by a coherent analysis that encompasses three models—one with risk-averse exponential utility, one with risk-seeking exponential utility, and one with linear utility and either stopping or discounting. We also show that the risk-seeking case and a model of Nash [Nash, P. 1980. A generalized bandit problem. J. Roy. Statist. Soc. B 42 165–169] are equivalent to each other.

Key Words: multiarmed bandits; exponential utility; risk-sensitive Markov decision processes; optimal stopping
History: Received: October 14, 2004; revision received: June 24, 2006;





HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2007 by INFORMS.