Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, achieved by introducing a learned baseline.
== Actor ==
The actor uses a policy function \pi(a|s), while the critic estimates either the
value function V(s), the action-value Q-function Q(s,a), the advantage function A(s,a), or any combination thereof. The actor is a parameterized function \pi_\theta, where \theta are the parameters of the actor. The actor takes as argument the state of the environment s and produces a
probability distribution \pi_\theta(\cdot | s). If the action space is discrete, then \sum_{a} \pi_\theta(a | s) = 1; if the action space is continuous, then \int_{a} \pi_\theta(a | s)\, da = 1. The goal of policy optimization is to improve the actor, that is, to find some \theta that maximizes the expected episodic reward J(\theta): J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right], where \gamma is the discount factor, r_t is the reward at step t, and T is the time horizon (which can be infinite). The goal of a policy gradient method is to optimize J(\theta) by gradient ascent on the policy gradient \nabla_\theta J(\theta).
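For concreteness, the following is a minimal sketch of a discrete-action actor as a small neural network. The network sizes, the use of PyTorch, and the Categorical distribution are illustrative assumptions, not part of the definition above.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Actor(nn.Module):
    """A minimal actor pi_theta(. | s) for a discrete action space (sizes are illustrative)."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, n_actions),  # unnormalized logits
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # A softmax over the logits guarantees sum_a pi_theta(a | s) = 1.
        return torch.distributions.Categorical(logits=self.net(state))

actor = Actor(obs_dim=4, n_actions=2)
dist = actor(torch.zeros(4))          # pi_theta(. | s) for one state s
action = dist.sample()                # a ~ pi_theta(. | s)
log_prob = dist.log_prob(action)      # ln pi_theta(a | s), used in the gradient estimators below
</syntaxhighlight>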
As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0\leq j \leq T} \nabla_\theta\ln\pi_\theta(A_j| S_j) \cdot \Psi_j \Big|S_0 = s_0 \right]

where \Psi_j is a linear sum of the following terms:

• \sum_{0 \leq i\leq T} (\gamma^i R_i).
• \gamma^j\sum_{j \leq i\leq T} (\gamma^{i-j} R_i): the REINFORCE algorithm.
• \gamma^j \sum_{j \leq i\leq T} (\gamma^{i-j} R_i) - b(S_j): the REINFORCE with baseline algorithm. Here b is an arbitrary function.
• \gamma^j \left(R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_{j})\right): TD(1) learning.
• \gamma^j Q^{\pi_\theta}(S_j, A_j).
• \gamma^j A^{\pi_\theta}(S_j, A_j): Advantage Actor-Critic (A2C).
• \gamma^j \left(R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_{j})\right): TD(2) learning.
• \gamma^j \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_{j})\right): TD(n) learning.
• \gamma^j \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} \left(\sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_{j})\right): TD(λ) learning, also known as GAE (generalized advantage estimation). This is obtained as an exponentially decaying sum of the TD(n) learning terms (a minimal sketch of computing this estimate follows this list).
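The following is a minimal sketch of how the GAE term can be computed for one recorded trajectory, using the equivalent backward recursion over one-step TD errors rather than the explicit infinite sum. The function name, the use of NumPy, and the default values of \gamma and \lambda are illustrative assumptions; the leading \gamma^j factor of the estimator is omitted, as is often done in practice.

<syntaxhighlight lang="python">
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards:    R_0, ..., R_{T-1}
    values:     V_phi(S_0), ..., V_phi(S_{T-1}) from the critic
    last_value: V_phi(S_T), the bootstrap value for the final state
    """
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    # Backward recursion A_t = delta_t + gamma * lambda * A_{t+1}, where
    # delta_t = R_t + gamma * V(S_{t+1}) - V(S_t); this reproduces the
    # exponentially weighted sum of TD(n) terms above.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example with a short 3-step trajectory.
adv = gae_advantages(rewards=np.array([1.0, 0.0, 1.0]),
                     values=np.array([0.5, 0.4, 0.3]),
                     last_value=0.0)
</syntaxhighlight>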
== Critic ==
In the unbiased estimators given above, certain functions such as V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta} appear. These are approximated by the
critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms. For example, if the critic is estimating the state-value function V^{\pi_\theta}(s), then it can be learned by any value function approximation method. Let the critic be a function approximator V_\phi(s) with parameters \phi. The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error: \delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i). The critic parameters are updated by gradient descent on the squared TD error: \phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}\delta_i^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i), where \alpha is the learning rate. Note that the gradient is taken with respect to the \phi in V_\phi(S_i) only, since the \phi in \gamma V_\phi(S_{i+1}) constitutes a moving target, and the gradient is not taken with respect to that. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
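The following is a minimal sketch of this semi-gradient update with automatic differentiation, where detaching the bootstrap target plays the role of "stopping the gradient". The network architecture, the use of PyTorch, and the hyperparameters are illustrative assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# A minimal state-value critic V_phi(s); the sizes are illustrative.
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(critic.parameters(), lr=1e-3)

def td_update(state, reward, next_state, gamma=0.99):
    # Bootstrap target R_i + gamma * V_phi(S_{i+1}); .detach() "stops the
    # gradient" so the moving target is treated as a constant.
    target = (reward + gamma * critic(next_state)).detach()
    delta = target - critic(state)        # TD(1) error delta_i
    loss = 0.5 * delta.pow(2).mean()      # (1/2) * delta_i^2
    optimizer.zero_grad()
    loss.backward()                       # gradient flows only through V_phi(S_i)
    optimizer.step()

# Example: one update from a single transition (S_i, R_i, S_{i+1}).
td_update(torch.zeros(4), torch.tensor(1.0), torch.ones(4))
</syntaxhighlight>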
Similarly, if the critic is estimating the action-value function Q^{\pi_\theta}, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by \phi and denoted Q_\phi(s, a). The temporal difference error is calculated as \delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i), and the critic is then updated by \phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i). The advantage critic can be trained by learning both a Q-function Q_\phi(s,a) and a state-value function V_\phi(s), then setting A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s). However, it is more common to train just a state-value function V_\phi(s) and estimate the advantage by A_\phi(S_i, A_i) \approx \sum_{j=0}^{n-1} \gamma^{j} R_{i+j} + \gamma^{n} V_\phi(S_{i+n}) - V_\phi(S_i), where n is a positive integer. The higher n is, the lower the bias in the advantage estimate, but at the price of higher variance.
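As an illustration of the SARSA update above, here is a minimal sketch using a linear function approximator, for which \nabla_\phi Q_\phi(s, a) is simply the feature vector. The feature map, the dimensions, and the hyperparameters are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

# Linear Q-critic: Q_phi(s, a) = phi . x(s, a), where x is a feature map.
def features(state, action, n_actions=2):
    x = np.zeros(len(state) * n_actions)
    x[action * len(state):(action + 1) * len(state)] = state
    return x

def sarsa_update(phi, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """One semi-gradient SARSA step: phi <- phi + alpha * delta_i * grad_phi Q_phi(S_i, A_i)."""
    q = phi @ features(state, action)
    q_next = phi @ features(next_state, next_action)   # bootstrap target (no gradient)
    delta = reward + gamma * q_next - q                # TD error delta_i
    # For a linear critic, grad_phi Q_phi(s, a) is just the feature vector x(s, a).
    return phi + alpha * delta * features(state, action)

# Example: one update with 4-dimensional states and 2 actions.
phi = np.zeros(8)
phi = sarsa_update(phi, np.ones(4), 0, reward=1.0,
                   next_state=np.zeros(4), next_action=1)
</syntaxhighlight>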
Generalized advantage estimation (GAE) introduces a hyperparameter \lambda that smoothly interpolates between Monte Carlo returns (\lambda = 1, high variance, no bias) and 1-step TD learning (\lambda = 0, low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of n-step returns, with \lambda being the decay strength.

== Variants ==