Even though there are tons of blogs, articles, and papers on policy gradient algorithms, I find that most of them focus either on the math or on the high-level ideas of RL, and few provide a comprehensive overview of the topic in one place. This blog is an attempt to fill that gap.
I plan to cover the following topics:
- RL formulation
- Policy gradient derivation
- Policy gradient algorithms
  - REINFORCE
  - Actor-critic methods
  - Proximal policy optimization (PPO)
  - Group relative policy optimization (GRPO)
For the basics of key RL concepts, please refer to the excellent OpenAI Spinning Up notes. I would highly recommend going through them before reading this blog.
We assume our policy (in this blog, an LLM) is parameterized by $\theta$ and written as $\pi_\theta(a \mid s)$, where $a$ is the action and $s$ is the state. In the case of LLMs, $s$ is the input prompt together with the tokens generated so far, and $a$ is the next token to generate.
The probability of a complete trajectory $\tau$, where

$$\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T),$$

is given by:

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
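To make this concrete for the LLM setting, here is a minimal PyTorch sketch computing the policy's contribution to $\log \pi_\theta(\tau)$ as a sum of per-token log-probabilities. The helper name `trajectory_log_prob` and the tensor shapes are my own illustrative choices, not any library's API:

```python
import torch
import torch.nn.functional as F

def trajectory_log_prob(logits: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Sum of log pi_theta(a_t | s_t) over a generated sequence.

    logits:  (T, vocab_size) - the policy's next-token logits at each step t
    actions: (T,)            - the token id actually sampled at each step t

    The environment terms log p(s_1) and log p(s_{t+1} | s_t, a_t) are omitted:
    for LLMs the "dynamics" are deterministic (the next state is just the prompt
    with the sampled token appended), so they carry no gradient signal anyway.
    """
    log_probs = F.log_softmax(logits, dim=-1)                          # (T, vocab_size)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (T,)
    return chosen.sum()                                                # scalar
```

In practice, `logits` would come from the LLM's forward pass over the prompt plus the generated tokens.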
We use the policy $\pi$ to sample trajectories in the world; we have no model or real understanding of how the world actually works, and our goal is simply to maximize the expected reward $r(\tau)$.
The expected reward for the policy is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} r(s_t, a_t)\right] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t}, a_{i,t})$$

where the outer sum over $i$ runs over the $N$ sampled trajectories and the inner sum over $t$ runs over the steps of each trajectory.
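As a sketch, this Monte Carlo estimate is just a mean of per-trajectory reward sums; assuming the per-step rewards have already been collected as tensors (the helper name is illustrative):

```python
import torch

def estimate_objective(per_step_rewards: list[torch.Tensor]) -> torch.Tensor:
    """Monte Carlo estimate of J(theta) from N sampled trajectories.

    per_step_rewards[i] holds the rewards r(s_t, a_t) along trajectory i.
    Returns (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t}).
    """
    return torch.stack([r.sum() for r in per_step_rewards]).mean()
```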
Writing the objective back in expectation form over whole trajectories, with $r(\tau)$ denoting the total reward of a trajectory:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Policy Gradient Derivation
Our goal is to find the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$, i.e., $\nabla_\theta J(\theta)$. We want to update our parameters $\theta$ in the direction that increases $J(\theta)$.
Let's compute the gradient:
$$\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

Swap the gradient and the integral (assuming validity):

$$= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Now, we use the log-derivative trick. Recall that $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$. Rearranging this gives $\nabla_x f(x) = f(x)\, \nabla_x \log f(x)$. Applying this to our policy $\pi_\theta(\tau)$:

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$
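As a quick numerical sanity check of the identity, here is a toy example with the arbitrary positive function $f(x) = e^{-x^2}$ (chosen purely for illustration):

```python
import numpy as np

def f(x):
    return np.exp(-x**2)            # any positive, differentiable f works

def grad_f(x):
    return -2 * x * np.exp(-x**2)   # analytic gradient of f

def grad_log_f(x):
    return -2 * x                   # analytic gradient of log f(x) = -x**2

x = 0.7
lhs = grad_f(x)                     # nabla_x f(x)
rhs = f(x) * grad_log_f(x)          # f(x) * nabla_x log f(x)
assert np.isclose(lhs, rhs)         # the two sides agree, as the identity promises
```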
Substituting this back into our gradient expression:

$$\nabla_\theta J(\theta) = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$

This integral is simply the expectation of $\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$ over trajectories sampled from $\pi_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] \tag{1}$$
Now, let's expand the term $\nabla_\theta \log \pi_\theta(\tau)$. We previously defined:

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Taking the logarithm:

$$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$
Now, take the gradient with respect to $\theta$:

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left( \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right)$$
The terms $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, a_t)$ represent the initial state distribution and the environment dynamics, respectively; neither depends on our policy parameters $\theta$, so their gradients are zero.
$$\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Substitute this back into equation (1) to get the Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right]$$

Here, $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$ is the total reward for the entire trajectory $\tau$.
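In an autodiff framework we rarely assemble this gradient by hand. Instead we build a surrogate "loss" whose gradient equals $-\nabla_\theta J(\theta)$ for the sampled trajectory, and let backpropagation do the rest. A minimal PyTorch sketch (the function name is mine):

```python
import torch

def reinforce_surrogate_loss(log_probs: torch.Tensor, trajectory_reward: float) -> torch.Tensor:
    """Surrogate loss for a single trajectory.

    log_probs: (T,) tensor of log pi_theta(a_t | s_t), still attached to the
    computation graph. Differentiating -(sum_t log pi_theta(a_t | s_t)) * r(tau)
    gives -(sum_t grad log pi_theta(a_t | s_t)) * r(tau), i.e. minus the policy
    gradient term, so minimizing this loss performs gradient ascent on J(theta).
    """
    return -(log_probs.sum() * trajectory_reward)
```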
REINFORCE
The REINFORCE algorithm directly implements the Policy Gradient Theorem using Monte Carlo sampling.
We estimate the expectation using a batch of N sampled trajectories:
$$\nabla_\theta J(\theta) \approx \hat{g} = \frac{1}{N} \sum_{i=1}^{N} \left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r(\tau_i)\right]$$

where $\tau_i = (s_{i,1}, a_{i,1}, \ldots, s_{i,T}, a_{i,T})$ is the $i$-th sampled trajectory and $r(\tau_i)$ is its total reward.
The parameters θ are then updated using gradient ascent:
$$\theta \leftarrow \theta + \alpha\, \hat{g}$$

where $\alpha$ is the learning rate.
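Putting the estimator and the update together, here is a compact REINFORCE loop. It is a sketch only: it assumes a Gymnasium-style environment interface (`reset()`/`step()`), a small categorical policy network, and placeholder names (`PolicyNet`, `reinforce_update`, `env`) chosen for illustration:

```python
import torch
from torch.distributions import Categorical

class PolicyNet(torch.nn.Module):
    """A small categorical policy pi_theta(a | s) over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, n_trajectories: int = 16) -> None:
    """One ascent step theta <- theta + alpha * g_hat, implemented by minimizing -J."""
    losses = []
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # One trajectory's term: (sum_t log pi_theta(a_t | s_t)) * r(tau), negated as a loss.
        losses.append(-torch.stack(log_probs).sum() * sum(rewards))
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()   # averages g_hat over the N sampled trajectories
    optimizer.step()
```

A driver would repeatedly call `reinforce_update(policy, optimizer, env)` with something like `optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)`, where the learning rate plays the role of $\alpha$ above.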
Bias and Variance of REINFORCE
Understanding the properties of the REINFORCE gradient estimator $\hat{g}$ is crucial.
Bias
Vanilla policy gradient estimators (like REINFORCE as presented here) are unbiased. This means their expected value equals the true gradient of the objective function:
$$\mathbb{E}[\hat{g}] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right] = \nabla_\theta J(\theta)$$
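To spell this out: each of the $N$ trajectories is sampled i.i.d. from $\pi_\theta$, so by linearity of expectation every term of the sample average has the same expectation, and

$$\mathbb{E}[\hat{g}] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\tau_i \sim \pi_\theta}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r(\tau_i)\right] = \frac{1}{N} \cdot N \cdot \nabla_\theta J(\theta) = \nabla_\theta J(\theta).$$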
This unbiasedness ensures that, on average over many samples, the gradient estimates point in the correct direction to improve the policy, even though individual estimates can be noisy.
Variance
REINFORCE suffers from high variance. This variance primarily arises because the gradient estimator uses the total cumulative reward $r(\tau)$ of the entire trajectory as a scaling factor for the gradients of the log-probabilities of all actions within that trajectory:

$$\hat{g}_\tau = \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)$$
Because every action's gradient contribution is scaled by the same, often highly variable, total reward signal $r(\tau)$, regardless of when the action occurred or its specific contribution to future rewards, the resulting gradient estimates $\hat{g}$ become very noisy. This requires many trajectories ($N$ large) to get a reliable estimate and can cause unstable updates, especially in environments with long trajectories or sparse/delayed rewards. Subsequent methods aim to reduce this variance.
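To see this noise concretely, here is a tiny made-up experiment: a one-step "trajectory" (a 3-armed bandit) with a noisy reward, where every constant is arbitrary and purely illustrative. The batch estimates $\hat{g}$ are right on average but scatter widely from batch to batch:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)   # a trivial 3-action policy

def grad_estimate(batch_size: int) -> torch.Tensor:
    """REINFORCE gradient estimate g_hat from one batch of 1-step trajectories."""
    dist = Categorical(logits=logits)
    actions = dist.sample((batch_size,))
    rewards = actions.float() + 5.0 * torch.randn(batch_size)   # noisy total reward r(tau)
    loss = -(dist.log_prob(actions) * rewards).mean()
    grad, = torch.autograd.grad(loss, logits)
    return -grad                                                # ascent direction

estimates = torch.stack([grad_estimate(batch_size=16) for _ in range(200)])
print("mean of g_hat:", estimates.mean(dim=0))   # close to the true gradient (unbiased)
print("std  of g_hat:", estimates.std(dim=0))    # large spread relative to the mean: high variance
```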
References
- https://yugeten.github.io/posts/2025/01/ppogrpo/
- Lilian Weng's blog
- OpenAI Spinning Up