
Is there any convergence proof of the Policy Gradient algorithm with "general" value/Q-function approximation? The seminal papers (Sutton1999 & Tsitsiklis1999) prove the theorem under a compatibility assumption (i.e. the Q-function approximation is linear w.r.t. the policy's features). Later improvements such as DPG (Silver2014) also rely on similar assumptions.
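
For reference, the compatibility condition in (Sutton1999) requires, roughly, that the critic $f_w$ uses the policy's score function as its features,

$$\nabla_w f_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s),$$

and that $w$ is a (local) minimizer of the mean-squared error $\mathbb{E}_{s,a}\!\left[\big(Q^\pi(s,a) - f_w(s,a)\big)^2\right]$. Under these two conditions, $f_w$ can replace the true $Q^\pi$ in the policy gradient theorem without biasing the gradient.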

Yet in practice this compatibility assumption is not satisfied: the policy network and the Q-function network have their own, independent sets of parameters.
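
To make that concrete, a typical deep actor-critic setup looks something like the sketch below (PyTorch; the architecture, sizes, and learning rates are placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, purely for illustration.
obs_dim, act_dim, hidden = 4, 2, 64

# Policy network pi_theta: its own parameter vector theta.
policy = nn.Sequential(
    nn.Linear(obs_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, act_dim),
)

# Q-function network Q_w: a completely separate parameter vector w,
# not constrained to use grad_theta log pi_theta as features, so the
# compatibility condition above is not enforced.
q_net = nn.Sequential(
    nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
    nn.Linear(hidden, 1),
)

# Independent optimizers for the two parameter sets.
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```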

Hence I wonder to what extent these methods are supported by theoretical guarantees.

Thanks,

(Sutton1999): Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al., 1999
(Silver2014): Deterministic Policy Gradient Algorithms, Silver et al., 2014
(Tsitsiklis1999): Actor-Critic Algorithms, Konda & Tsitsiklis, 1999
