Partially Observed Markov Decision Process Multiarmed Bandits—Structural Results
Vikram Krishnamurthy,
Bo Wahlberg
Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
Automatic Control and ACCESS, School of Electrical Engineering, KTH, SE-100 44 Stockholm, Sweden
vikramk{at}ece.ubc.ca, http://www.ece.ubc.ca/
vikramk
bo.wahlberg{at}ee.kth.se
This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm on each process, thereby considerably simplifying the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that for the case of totally positive of order 2 (TP2) transition probability matrices and monotone likelihood ratio (MLR) ordered observation probabilities, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
Key Words: multiarmed bandits; partially observed Markov decision process; monotone policies; likelihood ratio ordering; opportunistic scheduling; stochastic approximation algorithm
History: Received: July 24, 2008;
revision received: December 3, 2008;
Copyright © 2009 by INFORMS.