Implementation of gradient estimation to a constrained Markov decision problem

TitleImplementation of gradient estimation to a constrained Markov decision problem
Publication TypeConference Paper
Year of Publication2003
AuthorsKrishnamurthy, V., K. Martin, and F. Vasquez Abad
Conference NameDecision and Control, 2003. Proceedings. 42nd IEEE Conference on
Pagination4841 - 4846 Vol.5
Date Publisheddec.
Keywordsapproximation theory, augmented Lagrangian, bias correction, constrained Markov decision process, decision theory, finite horizon derivative estimators, gradient estimation, gradient methods, Markov chain, Markov processes, nonlinear constraints, nonlinear optimization method, online learning, optimisation, perturbation analysis method, perturbation realization factors, perturbation techniques, primal/dual method, stochastic approximation

Consider the problem of a constrained Markov Decision Process (MDP). Under a parameterization of the control strategies, the problem can be transformed into a non-linear optimization problem with non-linear constraints. Both the cost and the constraints are stationary averages. We assume that the transition probabilities of the underlying Markov chain are unknown: only the values of the control variables are known, as well as the instantaneous values of the cost and the constraints, so no analytical expression for the stationary averages is available. To find the solution to the optimization problem, a stochastic version of a primal/dual method with an augmented Lagrangian is used. The updating scheme uses a "measure valued" estimator of the gradients that can be interpreted in terms of a finite horizon version of the Perturbation Analysis (PA) method known as the "perturbation realization factors". Most finite horizon derivative estimators are consistent as the sample size grows, so it is common to assume that large enough samples can be observed so as to make the bias negligible. This paper deals with the actual implementations of the gradient estimators in finite horizon with small sample sizes, so that the iterates of the stochastic approximation can be performed very often, as would be required for on-line learning. We identify the asymptotic bias of the stochastic approximation for the constrained optimization method, and by so doing we propose several means to correct it. As is very common with these problems, the bias correction introduces a conflict between precision and speed: the smaller the bias, the slower the reaction time. In the sequel, we present the theoretical basis for the study of bias and learning rate. Our experimental results indicate that smoothing at a faster time scale may not be necessary at all, only at a slower time scale. We include results where the algorithms have to track changes in the environment.


a place of mind, The University of British Columbia

Electrical and Computer Engineering
2332 Main Mall
Vancouver, BC Canada V6T 1Z4
Tel +1.604.822.2872
Fax +1.604.822.5949

Emergency Procedures | Accessibility | Contact UBC | © Copyright 2021 The University of British Columbia