Policy gradient algorithms

March 28, 2025

Even though there are tons of blogs, articles, and papers on policy gradient algorithms, I find that most of them focus either on the math or on the high-level ideas of RL, and few provide a comprehensive overview of the topic in one place. This blog is an attempt to fill that gap.

I plan to cover the following topics:

  • RL formulation
  • Policy gradient derivation
  • Policy gradient algorithms
    • REINFORCE
    • Actor-critic methods
    • Proximal policy optimization (PPO)
    • Group relative policy optimization (GRPO)

RL formulation

For the basics of key RL concepts, please refer to the excellent OpenAI Spinning Up notes. I would highly recommend going through them before reading this blog.

We assume our policy (in this blog, an LLM) is parameterized by $\theta$ and is represented as $\pi_{\theta}(a \mid s)$, where $a$ is the action and $s$ is the state. In the case of LLMs, $s$ is the input prompt and $a$ is the next token generated.
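To make this concrete, here is a minimal sketch of what $\pi_{\theta}(a \mid s)$ looks like in practice, assuming the Hugging Face transformers library and GPT-2 as a stand-in LLM: the policy's action distribution is just the softmax over the model's next-token logits.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in policy pi_theta; s = the prompt, a = the next token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

s = "The capital of France is"
inputs = tokenizer(s, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# pi_theta(a | s): distribution over the next token given the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

a = torch.multinomial(next_token_probs, num_samples=1)   # sample an action
print(tokenizer.decode(a), next_token_probs[a].item())
```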

The probability of a complete trajectory $\tau$, where:

$$\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)$$

is given by:

$$\pi_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$
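To make the factorization concrete, here is a tiny sketch that evaluates this product directly for a hypothetical tabular MDP (the `p0`, `policy`, and `dynamics` dictionaries below are made up for illustration):

```python
# Hypothetical toy MDP (two states, two actions) just to evaluate the product above.
p0 = {"s0": 1.0}                                               # p(s_1)
policy = {("s0", "a0"): 0.7, ("s0", "a1"): 0.3,
          ("s1", "a0"): 0.4, ("s1", "a1"): 0.6}                # pi_theta(a | s)
dynamics = {("s0", "a0", "s1"): 1.0, ("s0", "a1", "s0"): 1.0,
            ("s1", "a0", "s0"): 1.0, ("s1", "a1", "s1"): 1.0}  # p(s' | s, a)

def trajectory_prob(traj):
    """traj = [(s_1, a_1), (s_2, a_2), ...]; returns pi_theta(tau)."""
    prob = p0[traj[0][0]]
    for t, (s, a) in enumerate(traj):
        prob *= policy[(s, a)]
        if t + 1 < len(traj):                # transition factor p(s_{t+1} | s_t, a_t)
            prob *= dynamics[(s, a, traj[t + 1][0])]
    return prob

tau = [("s0", "a0"), ("s1", "a1")]
print(trajectory_prob(tau))                  # 1.0 * 0.7 * 1.0 * 0.6
```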

Using the policy $\pi$ we sample trajectories in the world; we have no real model or understanding of how the world actually works, and our goal is simply to maximize the expected reward $r(\tau)$.

The expected reward for the policy is:

$$J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}}\left[\sum_{t} r(s_{t}, a_{t})\right] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t}, a_{i,t})$$

where the outer sum over $i$ is over sampled trajectories and the inner sum over $t$ is over time steps within a trajectory.
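In code, this sample-based estimate of $J(\theta)$ is just an average of per-trajectory reward sums. A minimal sketch, assuming a hypothetical `sample_trajectory()` helper that rolls out the current policy once and returns its per-step rewards:

```python
import numpy as np

def estimate_objective(sample_trajectory, num_trajectories=1000):
    """Monte Carlo estimate of J(theta) = E_tau[ sum_t r(s_t, a_t) ].

    `sample_trajectory` is a hypothetical callable that rolls out the current
    policy once and returns a list/array of per-step rewards r(s_t, a_t).
    """
    returns = [np.sum(sample_trajectory()) for _ in range(num_trajectories)]
    return np.mean(returns)
```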

Writing the objective in its expectation (integral) form:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[r(\tau)] = \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$

Policy Gradient Derivation

Our goal is to find the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$, i.e., $\nabla_{\theta} J(\theta)$. We want to update our parameters $\theta$ in the direction that increases $J(\theta)$.

Let's compute the gradient:

$$\nabla_{\theta} J(\theta) = \nabla_{\theta} \int \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$

Swap the gradient and the integral (assuming validity):

$$= \int \nabla_{\theta} \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$

Now, we use the log-derivative trick. Recall that $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$. Rearranging this gives $\nabla_x f(x) = f(x) \nabla_x \log f(x)$. Applying this to our policy $\pi_{\theta}(\tau)$:

$$\nabla_{\theta} \pi_{\theta}(\tau) = \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau)$$

Substituting this back into our gradient expression:

$$\nabla_{\theta} J(\theta) = \int \pi_{\theta}(\tau)\, \nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau)\, d\tau$$

This integral is simply the expectation of $\nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau)$ over trajectories sampled from $\pi_{\theta}(\tau)$:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} [ \nabla_{\theta} \log \pi_{\theta}(\tau)\, r(\tau) ] \quad \quad (1)$$

Now, let's expand the term $\nabla_{\theta} \log \pi_{\theta}(\tau)$. We previously defined:

$$\pi_{\theta}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$

Taking the logarithm:

$$\log \pi_{\theta}(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$

Now, take the gradient with respect to $\theta$:

$$\nabla_{\theta} \log \pi_{\theta}(\tau) = \nabla_{\theta} \left( \log p(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right)$$

The terms $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, a_t)$ represent the initial state distribution and the environment dynamics, which do not depend on our policy parameters $\theta$. Therefore, their gradients are zero.

$$\nabla_{\theta} \log \pi_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$$

Substitute this back into equation (1) to get the Policy Gradient Theorem:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) r(\tau) \right]$$

Here, $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$ is the total reward for the entire trajectory $\tau$.
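In an autodiff framework we rarely compute this gradient by hand. Instead, we build a surrogate "loss" whose gradient equals the estimator above: the sum of log-probabilities of the taken actions, scaled by the trajectory reward (and negated, since optimizers minimize). Here is a minimal PyTorch sketch, assuming `policy` is any `nn.Module` that maps states to action logits (a hypothetical interface, not a specific library API):

```python
import torch

def policy_gradient_surrogate_loss(policy, states, actions, trajectory_reward):
    """Surrogate loss whose gradient is -(sum_t grad log pi(a_t|s_t)) * r(tau).

    policy: an nn.Module mapping a batch of states to action logits (assumption).
    states: tensor of shape (T, state_dim); actions: tensor of shape (T,).
    trajectory_reward: scalar r(tau) = sum_t r(s_t, a_t), treated as a constant.
    """
    logits = policy(states)                                  # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)            # log pi_theta(. | s_t)
    taken = log_probs[torch.arange(len(actions)), actions]   # log pi_theta(a_t | s_t)
    # Negative sign because optimizers minimize; the reward is a fixed scalar weight.
    return -(taken.sum() * trajectory_reward)

# Calling loss.backward() then accumulates exactly the policy-gradient estimator
# (up to sign) into the .grad fields of policy.parameters().
```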


REINFORCE

The REINFORCE algorithm directly implements the Policy Gradient Theorem using Monte Carlo sampling.

We estimate the expectation using a batch of $N$ sampled trajectories:

$$\nabla_{\theta} J(\theta) \approx \hat{g} = \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{i,t} \mid s_{i,t}) \right) r(\tau_i) \right]$$

where $\tau_i = (s_{i,1}, a_{i,1}, \ldots, s_{i,T}, a_{i,T})$ is the $i$-th sampled trajectory and $r(\tau_i)$ is its total reward.

The parameters $\theta$ are then updated using gradient ascent:

$$\theta \leftarrow \theta + \alpha \hat{g}$$

where $\alpha$ is the learning rate.
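Putting it all together, here is a compact REINFORCE sketch on a hypothetical toy problem: a one-step "contextual bandit" with random contexts, a categorical policy, and the gradient-ascent update implemented as gradient descent on the negated surrogate loss. This is a sketch of the algorithm described above, not a tuned implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy task: 4-dimensional random "states", 3 actions, one step per episode.
# The (unknown to the agent) reward prefers action 0 when state[0] > 0, else action 2.
def sample_batch(policy, batch_size=32):
    states = torch.randn(batch_size, 4)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = torch.where(states[:, 0] > 0,
                          (actions == 0).float(),
                          (actions == 2).float())
    return dist.log_prob(actions), rewards

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(500):
    log_probs, rewards = sample_batch(policy)
    # REINFORCE estimator: (1/N) sum_i log pi(a_i|s_i) * r(tau_i); negated for descent.
    loss = -(log_probs * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step:3d}  avg reward {rewards.mean().item():.3f}")
```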

Bias and Variance of REINFORCE

Understanding the properties of the REINFORCE gradient estimator $\hat{g}$ is crucial.

Bias

Vanilla policy gradient estimators (like REINFORCE as presented here) are unbiased. This means their expected value equals the true gradient of the objective function:

$$\mathbb{E}[\hat{g}] = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) r(\tau) \right] = \nabla_{\theta} J(\theta)$$

This unbiasedness ensures that, on average over many samples, the gradient estimates point in the correct direction to improve the policy, even though individual estimates can be noisy.

Variance

REINFORCE suffers from high variance. This variance primarily arises because the gradient estimator uses the total cumulative reward $r(\tau)$ from the entire trajectory as a scaling factor for the gradients of the log-probabilities of all actions within that trajectory:

$$\hat{g}_{\tau} = \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right) r(\tau)$$

Because every action's gradient contribution is scaled by the same, often highly variable, total reward signal $r(\tau)$, regardless of when the action occurred or how much it actually contributed to future rewards, the resulting gradient estimates $\hat{g}$ are very noisy. Getting a reliable estimate therefore requires many trajectories ($N$), and updates can be unstable, especially in environments with long trajectories or sparse/delayed rewards. Subsequent methods aim to reduce this variance.
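To see both properties concretely, here is a small sketch on a hypothetical two-armed bandit with a softmax policy, chosen because the true gradient can be computed analytically. It compares the mean of many single-trajectory REINFORCE estimates against the exact gradient (unbiasedness) and shows how large their spread is (variance):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1])          # logits of a 2-armed softmax policy (made up)
rewards = np.array([1.0, 5.0])         # deterministic reward of each arm (made up)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

p = pi(theta)

# Exact gradient of J(theta) = sum_a pi(a) r(a), using d pi_a / d theta_k = pi_a (delta_ak - pi_k).
exact_grad = np.array([
    np.sum(rewards * p * ((np.arange(2) == k).astype(float) - p[k]))
    for k in range(2)
])

# Single-trajectory REINFORCE estimates: g_hat = r(a) * grad_theta log pi(a).
samples = rng.choice(2, size=100_000, p=p)
one_hot = np.eye(2)[samples]
g_hats = rewards[samples][:, None] * (one_hot - p)   # shape (100000, 2)

print("exact grad:     ", exact_grad)
print("mean of g_hats: ", g_hats.mean(axis=0))       # close to exact (unbiased)
print("std  of g_hats: ", g_hats.std(axis=0))        # large relative to the mean (high variance)
```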


References

  1. https://yugeten.github.io/posts/2025/01/ppogrpo/
  2. Lilian Weng's blog
  3. OpenAI Spinning Up