On the Empirical State-Action Frequencies in Markov Decision Processes Under General Policies
Shie Mannor,
John N. Tsitsiklis
Department of Electrical and Computer Engineering, McGill University, 3480 University Street, Montreal, Québec, Canada H3A 2A7
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
shie{at}ece.mcgill.ca, www.ece.mcgill.ca/~shie/
jnt{at}mit.edu, web.mit.edu/~jnt/www/home.html
We consider the empirical state-action frequencies and the empirical reward in weakly communicating finite-state Markov decision processes under general policies. We define a certain polytope and establish that every element of this polytope is the limit of the empirical frequency vector, under some policy, in a strong sense. Furthermore, we show that the probability of exceeding a given distance between the empirical frequency vector and the polytope decays exponentially with time under every policy. We provide similar results for vector-valued empirical rewards.
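The objects in the abstract can be made concrete with a small simulation. The sketch below is illustrative, not the authors' construction. It runs a history-dependent (non-stationary) policy on a hypothetical two-state, two-action MDP and forms the empirical state-action frequency vector x_T(s, a) = N_T(s, a) / T. It then does two things. First, it checks the identity that the time-averaged vector reward equals the linear map Σ_{s,a} x_T(s, a) r(s, a). Second, it measures how far x_T is from satisfying flow-balance constraints. This assumes the polytope is the usual set of nonnegative frequencies summing to one with Σ_a x(s', a) = Σ_{s,a} x(s, a) P(s' | s, a). The paper's exact definition for weakly communicating MDPs may differ in detail. The residual is a constraint violation, not the Euclidean distance to the polytope, which would require solving a quadratic program. All function names, the transition matrix `P`, the reward `r`, and the policy are hypothetical.

```python
import random

def simulate(P, r, policy, s0, T, seed=0):
    """Run a (possibly history-dependent) policy for T steps.

    Returns the empirical state-action frequencies x_T(s, a) = N_T(s, a) / T
    and the directly computed time-average of the vector-valued reward r(s, a).
    """
    rng = random.Random(seed)
    n_s, n_a, dim = len(P), len(P[0]), len(r[0][0])
    counts = [[0] * n_a for _ in range(n_s)]
    reward_sum = [0.0] * dim
    history = []  # full past (state, action) sequence visible to the policy
    s = s0
    for _ in range(T):
        a = policy(history, s, rng)
        counts[s][a] += 1
        for k in range(dim):
            reward_sum[k] += r[s][a][k]
        history.append((s, a))
        s = rng.choices(range(n_s), weights=P[s][a])[0]  # next state ~ P(. | s, a)
    x = [[c / T for c in row] for row in counts]
    return x, [v / T for v in reward_sum]

def reward_from_frequencies(x, r):
    """Average vector reward expressed as a linear function of x."""
    dim = len(r[0][0])
    return [sum(x[s][a] * r[s][a][k]
                for s in range(len(x)) for a in range(len(x[0])))
            for k in range(dim)]

def balance_residual(P, x):
    """Max over s' of |sum_a x(s', a) - sum_{s,a} x(s, a) P(s' | s, a)|.

    Assumed flow-balance form of the polytope; this is a constraint
    violation, not the exact distance to the polytope.
    """
    n_s, n_a = len(P), len(P[0])
    worst = 0.0
    for sp in range(n_s):
        outflow = sum(x[sp])
        inflow = sum(x[s][a] * P[s][a][sp] for s in range(n_s) for a in range(n_a))
        worst = max(worst, abs(outflow - inflow))
    return worst

def alternating_policy(history, s, rng):
    # Hypothetical non-stationary policy: in state 0 alternate actions over
    # time; in state 1 pick an action uniformly at random.
    if s == 0:
        return len(history) % 2
    return rng.randrange(2)

# Hypothetical MDP: P[s][a] is the next-state distribution, r[s][a] a 2-d reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [2.0, -1.0]]]

x, avg_direct = simulate(P, r, alternating_policy, s0=0, T=20000)
print("balance residual:", balance_residual(P, x))
print("reward via x:", reward_from_frequencies(x, r), "direct:", avg_direct)
```

The balance residual shrinks as T grows. It is driven by a martingale term of order 1/√T plus a boundary term of order 1/T, which is consistent with the abstract's claim that large deviations from the polytope become exponentially unlikely. Because the reward identity is exact, statements about x_T carry over directly to vector-valued empirical rewards.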
Key Words: Markov decision processes; state-action frequencies; large deviations; empirical measure
History: Received: March 6, 2003;
revision received: April 6, 2004;
Copyright © 2005 by INFORMS.