Even though there are tons of blogs, articles, and papers on policy gradient algorithms, I find that most of them focus either on the math or on the high-level ideas of RL, and few provide a comprehensive overview of the topic in one place. This blog is an attempt to fill that gap.
I plan to cover the following topics:
- RL formulation
- Policy gradient derivation
- Policy gradient algorithms
  - REINFORCE
  - Actor-critic methods
  - Proximal policy optimization (PPO)
  - Group relative policy optimization (GRPO)
For the basics of key RL concepts, please refer to the excellent OpenAI Spinning Up notes. I would highly recommend going through them before reading this blog.
We assume our policy (in this blog, an LLM) is parameterized by $\theta$ and written as $\pi_\theta(a \mid s)$, where $a$ is the action and $s$ is the state. In the case of LLMs, $s$ is the input prompt together with the tokens generated so far, and $a$ is the next token to generate.
The probability of a complete trajectory $\tau$, where

$$\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T),$$

is given by:

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
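To make this concrete for the LLM setting, here is a minimal PyTorch sketch computing the policy's contribution to $\log \pi_\theta(\tau)$ as a sum of per-token log-probabilities. The helper name `trajectory_log_prob` and the tensor shapes are my own illustrative choices, not any library's API:

```python
import torch
import torch.nn.functional as F

def trajectory_log_prob(logits: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Sum of log pi_theta(a_t | s_t) over a generated sequence.

    logits:  (T, vocab_size) - the policy's next-token logits at each step t
    actions: (T,)            - the token id actually sampled at each step t

    The environment terms log p(s_1) and log p(s_{t+1} | s_t, a_t) are omitted:
    for LLMs the "dynamics" are deterministic (the next state is just the prompt
    with the sampled token appended), so they carry no gradient signal anyway.
    """
    log_probs = F.log_softmax(logits, dim=-1)                          # (T, vocab_size)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (T,)
    return chosen.sum()                                                # scalar
```

In practice, `logits` would come from the LLM's forward pass over the prompt plus the generated tokens.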
We use the policy $\pi$ to sample trajectories in the world; we have no model or real understanding of how the world actually works, and our goal is simply to maximize the expected reward $r(\tau)$.
The expected reward for the policy is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} r(s_t, a_t)\right] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t}, a_{i,t})$$

where the outer sum over $i$ runs over the $N$ sampled trajectories and the inner sum over $t$ runs over the steps of each trajectory.
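As a sketch, this Monte Carlo estimate is just a mean of per-trajectory reward sums; assuming the per-step rewards have already been collected as tensors (the helper name is illustrative):

```python
import torch

def estimate_objective(per_step_rewards: list[torch.Tensor]) -> torch.Tensor:
    """Monte Carlo estimate of J(theta) from N sampled trajectories.

    per_step_rewards[i] holds the rewards r(s_t, a_t) along trajectory i.
    Returns (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t}).
    """
    return torch.stack([r.sum() for r in per_step_rewards]).mean()
```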
Writing the objective back in expectation form over whole trajectories, with $r(\tau)$ denoting the total reward of a trajectory:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Policy Gradient Derivation
Our goal is to find the gradient of the objective function $J(\theta)$ with respect to the policy parameters $\theta$, i.e., $\nabla_\theta J(\theta)$. We want to update our parameters $\theta$ in the direction that increases $J(\theta)$.
Let's compute the gradient:
$$\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau)\, r(\tau)\, d\tau$$

Swap the gradient and the integral (assuming validity):

$$= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Now, we use the log-derivative trick. Recall that $\nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}$. Rearranging this gives $\nabla_x f(x) = f(x)\, \nabla_x \log f(x)$. Applying this to our policy $\pi_\theta(\tau)$:

$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$$
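As a quick numerical sanity check of the identity, here is a toy example with the arbitrary positive function $f(x) = e^{-x^2}$ (chosen purely for illustration):

```python
import numpy as np

def f(x):
    return np.exp(-x**2)            # any positive, differentiable f works

def grad_f(x):
    return -2 * x * np.exp(-x**2)   # analytic gradient of f

def grad_log_f(x):
    return -2 * x                   # analytic gradient of log f(x) = -x**2

x = 0.7
lhs = grad_f(x)                     # nabla_x f(x)
rhs = f(x) * grad_log_f(x)          # f(x) * nabla_x log f(x)
assert np.isclose(lhs, rhs)         # the two sides agree, as the identity promises
```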
Substituting this back into our gradient expression:

$$\nabla_\theta J(\theta) = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$

This integral is simply the expectation of $\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)$ over trajectories sampled from $\pi_\theta(\tau)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] \tag{1}$$
Now, let's expand the term $\nabla_\theta \log \pi_\theta(\tau)$. We previously defined:

$$\pi_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Taking the logarithm:

$$\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$
Now, take the gradient with respect to $\theta$:

$$\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left( \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t) \right)$$
The terms $\log p(s_1)$ and $\log p(s_{t+1} \mid s_t, a_t)$ represent the initial state distribution and the environment dynamics, respectively; neither depends on our policy parameters $\theta$, so their gradients are zero.
$$\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Substitute this back into equation (1) to get the Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right]$$

Here, $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$ is the total reward for the entire trajectory $\tau$.
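In an autodiff framework we rarely assemble this gradient by hand. Instead we build a surrogate "loss" whose gradient equals $-\nabla_\theta J(\theta)$ for the sampled trajectory, and let backpropagation do the rest. A minimal PyTorch sketch (the function name is mine):

```python
import torch

def reinforce_surrogate_loss(log_probs: torch.Tensor, trajectory_reward: float) -> torch.Tensor:
    """Surrogate loss for a single trajectory.

    log_probs: (T,) tensor of log pi_theta(a_t | s_t), still attached to the
    computation graph. Differentiating -(sum_t log pi_theta(a_t | s_t)) * r(tau)
    gives -(sum_t grad log pi_theta(a_t | s_t)) * r(tau), i.e. minus the policy
    gradient term, so minimizing this loss performs gradient ascent on J(theta).
    """
    return -(log_probs.sum() * trajectory_reward)
```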
REINFORCE
The REINFORCE algorithm directly implements the Policy Gradient Theorem using Monte Carlo sampling.
We estimate the expectation using a batch of N sampled trajectories:
$$\nabla_\theta J(\theta) \approx \hat{g} = \frac{1}{N} \sum_{i=1}^{N} \left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r(\tau_i)\right]$$

where $\tau_i = (s_{i,1}, a_{i,1}, \ldots, s_{i,T}, a_{i,T})$ is the $i$-th sampled trajectory and $r(\tau_i)$ is its total reward.
The parameters θ are then updated using gradient ascent:
$$\theta \leftarrow \theta + \alpha\, \hat{g}$$

where $\alpha$ is the learning rate.
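Putting the estimator and the update together, here is a compact REINFORCE loop. It is a sketch only: it assumes a Gymnasium-style environment interface (`reset()`/`step()`), a small categorical policy network, and placeholder names (`PolicyNet`, `reinforce_update`, `env`) chosen for illustration:

```python
import torch
from torch.distributions import Categorical

class PolicyNet(torch.nn.Module):
    """A small categorical policy pi_theta(a | s) over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, n_trajectories: int = 16) -> None:
    """One ascent step theta <- theta + alpha * g_hat, implemented by minimizing -J."""
    losses = []
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # One trajectory's term: (sum_t log pi_theta(a_t | s_t)) * r(tau), negated as a loss.
        losses.append(-torch.stack(log_probs).sum() * sum(rewards))
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()   # averages g_hat over the N sampled trajectories
    optimizer.step()
```

A driver would repeatedly call `reinforce_update(policy, optimizer, env)` with something like `optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)`, where the learning rate plays the role of $\alpha$ above.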
Bias and Variance of REINFORCE
Understanding the properties of the REINFORCE gradient estimator $\hat{g}$ is crucial.
Bias
Vanilla policy gradient estimators (like REINFORCE as presented here) are unbiased. This means their expected value equals the true gradient of the objective function:
$$\mathbb{E}[\hat{g}] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right] = \nabla_\theta J(\theta)$$
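To spell this out: each of the $N$ trajectories is sampled i.i.d. from $\pi_\theta$, so by linearity of expectation every term of the sample average has the same expectation, and

$$\mathbb{E}[\hat{g}] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\tau_i \sim \pi_\theta}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) r(\tau_i)\right] = \frac{1}{N} \cdot N \cdot \nabla_\theta J(\theta) = \nabla_\theta J(\theta).$$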
This unbiasedness ensures that, on average over many samples, the gradient estimates point in the correct direction to improve the policy, even though individual estimates can be noisy.
Variance
REINFORCE suffers from high variance. This variance primarily arises because the gradient estimator uses the total cumulative reward $r(\tau)$ of the entire trajectory as a scaling factor for the gradients of the log-probabilities of all actions within that trajectory:

$$\hat{g}_\tau = \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) r(\tau)$$
Because every action's gradient contribution is scaled by the same, often highly variable, total reward signal $r(\tau)$, regardless of when the action occurred or its specific contribution to future rewards, the resulting gradient estimates $\hat{g}$ become very noisy. This requires many trajectories ($N$ large) to get a reliable estimate and can cause unstable updates, especially in environments with long trajectories or sparse/delayed rewards. Subsequent methods aim to reduce this variance.
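To see this noise concretely, here is a tiny made-up experiment: a one-step "trajectory" (a 3-armed bandit) with a noisy reward, where every constant is arbitrary and purely illustrative. The batch estimates $\hat{g}$ are right on average but scatter widely from batch to batch:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)   # a trivial 3-action policy

def grad_estimate(batch_size: int) -> torch.Tensor:
    """REINFORCE gradient estimate g_hat from one batch of 1-step trajectories."""
    dist = Categorical(logits=logits)
    actions = dist.sample((batch_size,))
    rewards = actions.float() + 5.0 * torch.randn(batch_size)   # noisy total reward r(tau)
    loss = -(dist.log_prob(actions) * rewards).mean()
    grad, = torch.autograd.grad(loss, logits)
    return -grad                                                # ascent direction

estimates = torch.stack([grad_estimate(batch_size=16) for _ in range(200)])
print("mean of g_hat:", estimates.mean(dim=0))   # close to the true gradient (unbiased)
print("std  of g_hat:", estimates.std(dim=0))    # large spread relative to the mean: high variance
```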
References
- https://yugeten.github.io/posts/2025/01/ppogrpo/
- Lilian Weng's blog
- OpenAI Spinning Up