Understand Policy Gradient by Building Cross Entropy from Scratch


Reinforcement learning (RL) can do amazing stuff. Most recently, ChatGPT was fine-tuned on human feedback with PPO, a variant of a class of reinforcement learning algorithms called Policy Gradient (PG). Understanding RL, especially policy gradient, can be non-trivial, particularly if you like grasping intuitions like I do. In this post, I will walk through a thread of thoughts that really helped me understand PG, starting from the more familiar supervised learning setting.

TL;DR

  • We will start by designing a simple supervised training procedure for a binary classification robot, rewarding it +1 for correct answers
  • We will formulate the objective for the procedure
  • We will derive the gradient ascent formulation for the procedure (which will turn out to be the same procedure as gradient descent with Cross Entropy)
  • We will compare our procedure to RL settings and relate our gradient ascent to policy gradient

Who should read this?

  • My goal is to provide a friendly, intuitive aid to understanding PG. It is helpful if you have a general understanding of the RL problem setting and know at a high level what PG is.
  • I hope to help you better understand the relationship between RL with PG and supervised ML. So it is helpful if you know how to train a supervised ML algorithm with a cross entropy loss function.

Why this post?

Policy Gradient

In an RL problem, an agent interacts with an environment to learn a policy. The policy tells the agent what to do in different states to maximize reward.

The idea of PG seems straightforward.

  • The policy that guides agent behavior at time t is π_θ(a_t|s_t).
  • This is some kind of function (often a neural network) with parameter θ.
  • It takes in information about the state s_t and spits out a probability distribution over the action to take, a_t.
  • Then it receives a reward r(s_t, a_t).
  • When we have a history of many such cycles of action and reward, we can update the parameter θ in order to maximize the expected reward produced by actions generated from π_θ.

How do we do the update? Through…gradient! We update the model producing π_θ with the following gradient.
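In its usual (REINFORCE-style) form, with N sampled trajectories of length T and learning rate α (my notation), this is roughly:

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} [ ( Σ_{t=1}^{T} ∇_θ log π_θ(a_{i,t}|s_{i,t}) ) · ( Σ_{t=1}^{T} r(s_{i,t}, a_{i,t}) ) ]

θ ← θ + α · ∇_θ J(θ)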

Something Feels Off

This looks very familiar. When we train a neural network model in good old supervised learning, we also update the model parameters by doing the operation in the second line, aka gradient descent (technically, in the PG case, since we are maximizing an objective, it's gradient ascent).

But this also feels very different. If you look at its derivation process, you can see it takes a bit of effort to derive this equation. That's very different from the more intuitive way we do supervised learning: feed an input into the neural network, get an output, compare it with the target and calculate a loss function, hit the backprop button, and we are done!

Also, for me, the log term always seems to come out of nowhere. Although the same online course linked above walks through how we get to the log term, the process seems to be just a bunch of math that's correct but lacking motivation.

What exactly is the difference from supervised learning? It turns out that diving into this question provides a great way of understanding PG. Furthermore, it is a good reminder of the nature of some familiar things in supervised learning that we do every day.

Let's Build Cross Entropy from Scratch

If we are handed the loss functions used in supervised ML, they "make sense" immediately. But it takes more effort to understand where they come from. For example, the good old mean squared loss intuitively makes sense: it just minimizes the distance between the prediction and the target. But there are so many distance metrics; why squared distance? You have to look deeper to see that mean squared error is the byproduct of doing maximum likelihood while assuming the underlying population distribution is normal.
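As a quick sketch of that claim (with a model f_θ(x) and a fixed noise variance σ², both notation of my choosing), maximizing the Gaussian likelihood of the targets is the same as minimizing squared error:

argmax_θ Π_{i} N(y_i ; f_θ(x_i), σ²) = argmax_θ Σ_{i} [ −(y_i − f_θ(x_i))² / (2σ²) ] = argmin_θ Σ_{i} (y_i − f_θ(x_i))²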

Same with another good old loss function that we use every day: Cross Entropy. While there are many good interpretations of what Cross Entropy does, let's try to build it in the most rudimentary manner.

Let's Train a Classification Bot!

Say you want to train a robot to classify dog and cat images. It's intuitive to train it by rewarding the right answers and punishing it (or not rewarding it) for wrong answers. Here's how it's done:

  • You give the robot an image. Let's call it s. This image is sampled from a population distribution D_s.

  • The robot will give you an answer: either it thinks it's a dog image (action a_dog) or a cat image (action a_cat).
  • The robot has its own prediction, given the image, of the probability of the image being a dog or a cat: π_θ(a|s), a distribution over (a_dog, a_cat). For example, π_θ(a|s) = (0.9, 0.1) means it thinks there's a 0.9 probability it's a dog, and a 0.1 probability it's a cat.

  • But each time, the robot will only give you a definite answer. It either says "it's a dog" (a_dog) or "it's a cat" (a_cat). Every time it gives you a response, the response (action) is randomly sampled from the distribution produced by π_θ(a|s): a ~ π_θ(a|s).

  • Then you will reward the robot (maybe give it a treat?) with a reward of 1 when the robot answers correctly (r(s,a) = 1). There's no reward (0 reward) when the answer is wrong (r(s,a) = 0).

This process is what I had in mind when first learning about supervised ML. Reward it when it's correct. Punish it (or, in the training process we designed, simply give no reward) when it's wrong. Probably the most intuitive way to train something.
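Here is a minimal numpy sketch of this loop, with a made-up logistic-regression "robot" on toy 2-D features (the data, the function names, and the learning rate are all my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_probs(theta, s):
    """pi_theta(a|s): probabilities of (dog, cat) from a tiny logistic model."""
    p_dog = 1.0 / (1.0 + np.exp(-s @ theta))
    return np.array([p_dog, 1.0 - p_dog])

theta = np.zeros(2)   # model parameters
alpha = 0.5           # learning rate

for step in range(2000):
    # A toy "image": 2-D features; label 0 = dog, 1 = cat.
    label = rng.integers(0, 2)
    s = rng.normal(loc=(1.0 if label == 0 else -1.0), scale=1.0, size=2)

    probs = policy_probs(theta, s)
    a = rng.choice(2, p=probs)        # the robot commits to one answer: a ~ pi_theta(a|s)
    r = 1.0 if a == label else 0.0    # reward 1 for a correct answer, 0 otherwise

    # Gradient-ascent step on r * log pi_theta(a|s).
    # For this logistic policy: d/d_theta log p_dog = (1 - p_dog) * s
    #                           d/d_theta log p_cat = -p_dog * s
    grad_log_pi = (1.0 - probs[0]) * s if a == 0 else -probs[0] * s
    theta += alpha * r * grad_log_pi
```

Notice that with a 0/1 reward, the update only ever increases the log-probability of answers that happened to be correct, which is exactly the flavor of the cross entropy update we will arrive at below.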

Maximizing Objective

What's our objective for the robot? We want its response to be correct as often as possible. More precisely, we want to find the optimal parameter θ*, which produces a π_θ(a|s) such that, over all possible occurrences of s (sampled from the image population distribution D_s) and a (sampled from the distribution produced by the model π_θ(a|s) given s), we get the maximum average reward, weighted by the probability of each occurrence of (s, a).

In other words, we are maximizing the objective function J(θ) defined as
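J(θ) = E_{s ~ D_s, a ~ π_θ(a|s)} [ r(s, a) ]        (1)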

Gradient of Objective

Now that we have an objective function, we can try to maximize it through…gradient ascent! Namely, we can optimize the function by iteratively doing
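θ ← θ + α · ∇_θ J(θ)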

But how should we calculate the gradient, i.e. the derivative of J with respect to θ? It's a bit tricky in this case since

  • The function that we want to take the derivative of is an expectation.
  • If the expectation were not over a distribution dependent on θ, then by linearity of expectation, we could just take the derivative of what's inside the expectation and leave the expectation there. However, in this case, the expectation is over (s, a) ~ (D_s, π_θ(a|s)), which depends on θ. So the derivative is not obvious.
  • Another way of thinking about this is that J(θ) changes value as the frequency changes for how often we sample (s, a) from a distribution partially determined by θ. We want more frequent occurrences of s = dog image paired with a = a_dog (and similarly for cat). How can we capture changes of θ in this direction when we do gradient ascent?

In addition, ideally, we want the gradient to be in the form
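∇_θ J(θ) = E_{s ~ D_s, a ~ π_θ(a|s)} [ f(θ, s, a, r) ]        (2)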

This is because you train the bot from samples of interactions between the robot and you. Every sample consists of a triplet (s, a, r). We can therefore approximate this gradient by taking the average of f(θ, s, a, r) over N samples that we collect (by the law of large numbers, i.e. we do Stochastic Gradient Ascent):
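∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} f(θ, s_i, a_i, r_i)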

We can then do gradient ascent by doing
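θ ← θ + α · (1/N) Σ_{i=1}^{N} f(θ, s_i, a_i, r_i)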

Now let's find f.

Finding Gradient

To recap, we want to start from (1) to get (2) for some f(θ,s,a,r).

Let's first rewrite (1) with the definition of expectation:
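J(θ) = ∫∫ P(s, a) · r(s, a) ds da

where P(s, a) is the joint probability of the pair (s, a).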

This is basically the integration of the reward weighted by probability over all possible (s, a) pairs.

Now, what exactly is the joint probability P(s, a) for one pair (s, a) to appear? We can decompose it into the probability of the image sample s appearing, and the probability that the robot randomly selects action a.

Since the robot randomly selects action a from its internal prediction model π_θ(a|s), we have
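P(s, a) = P(s) · π_θ(a|s)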

Out of all the terms inside the integral, only π_θ(a|s) depends on θ. The other terms are constant with respect to θ. So we can move the gradient operation inside the integral sign, next to this term, and get
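∇_θ J(θ) = ∇_θ ∫∫ P(s) π_θ(a|s) r(s, a) ds da = ∫∫ P(s) [ ∇_θ π_θ(a|s) ] r(s, a) ds da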

Note that we can also write the following. Nothing big here: just multiplying the original left-hand side by 1, written as a fraction, and rearranging the terms.
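∇_θ π_θ(a|s) = π_θ(a|s) · ( ∇_θ π_θ(a|s) / π_θ(a|s) )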

Replacing it back, and rearranging a bit, we get
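∇_θ J(θ) = ∫∫ P(s) π_θ(a|s) · ( ∇_θ π_θ(a|s) / π_θ(a|s) ) · r(s, a) ds da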

P(s)·π_θ(a|s) looks familiar. It's the P(s, a) that we decomposed earlier! Putting it back, we have
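∇_θ J(θ) = ∫∫ P(s, a) · ( ∇_θ π_θ(a|s) / π_θ(a|s) ) · r(s, a) ds da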

Now that we have an integral and P(s, a), we can…fit it back into the definition of expectation!
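∇_θ J(θ) = E_{s ~ D_s, a ~ π_θ(a|s)} [ ( ∇_θ π_θ(a|s) / π_θ(a|s) ) · r(s, a) ]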

This is exactly the form we want to get in (2), where f is the term inside the brackets!

You may have wondered why we rewrote the gradient of π_θ(a|s) as the clunky fraction earlier. The idea was to create a π_θ(a|s) term (which we lost earlier by taking the derivative of it), so we could produce a P(s, a) term again and turn the integration back into an expectation!

Building Cross Entropy

Now it's magic time.
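∇_θ log π_θ(a|s) = ∇_θ π_θ(a|s) / π_θ(a|s)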

Don't believe me? Work from the right-hand side to the left-hand side with the chain rule. ([Optional] Side thought: so if you are also confused about the motivation of the log term in the Policy Gradient formula, it is a side product of simplifying a clunky equation, with the intention of extracting a π_θ(a|s) term to transform things back into an expectation.)

So we can simplify the gradient of J(θ) a bit:
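∇_θ J(θ) = E_{s ~ D_s, a ~ π_θ(a|s)} [ ∇_θ log π_θ(a|s) · r(s, a) ]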

So each time we have a batch of (s, a) samples, we can do gradient ascent by
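θ ← θ + α · (1/N) Σ_{i=1}^{N} ∇_θ log π_θ(a_i|s_i) · r(s_i, a_i)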

To bring it to a more familiar form, moving the gradient sign outside of the summation, we have
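θ ← θ + α · ∇_θ [ (1/N) Σ_{i=1}^{N} r(s_i, a_i) · log π_θ(a_i|s_i) ]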

We will also invert the sign by doing
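θ ← θ − α · ∇_θ [ −(1/N) Σ_{i=1}^{N} r(s_i, a_i) · log π_θ(a_i|s_i) ]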

Does it ring a bell? Let's compare it with what we do when doing gradient descent on cross entropy loss.

Recall that cross entropy loss is
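L(θ) = −(1/N) Σ_{i=1}^{N} [ y_{i,1} · log ŷ_{i,1} + y_{i,2} · log ŷ_{i,2} ]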

where y_i is the true label, a one-hot vector (y_{i,1}, y_{i,2}) that describes whether the image is a cat or a dog (either (0,1) or (1,0)), and ŷ_i is the prediction from the model, a vector (ŷ_{i,1}, ŷ_{i,2}) whose two entries sum up to one.

When we do gradient descent on this loss function, we calculate the cross entropy loss for the batch and hit the backprop button:
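θ ← θ − α · ∇_θ [ −(1/N) Σ_{i=1}^{N} ( y_{i,1} · log ŷ_{i,1} + y_{i,2} · log ŷ_{i,2} ) ]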

The differences between this expression and the gradient ascent expression we derived earlier are small.
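Roughly, the correspondence (in the notation used here) is:

  • the image s_i plays the role of the input x_i
  • the reward r(s_i, a_i) plays the role of the label entry y_{i,c} for the response c the robot actually gave
  • the policy output π_θ(a_i|s_i) plays the role of the predicted probability ŷ_{i,c}
  • cross entropy sums over both classes weighted by the one-hot label, while our expression keeps only the single sampled response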

To bring the relationship into words, it basically means: on sample (x_i, y_i),

  • The model makes a prediction (ŷ_{i,1}, ŷ_{i,2}) given x_i
  • The model randomly samples a response from the predicted distribution
  • We reward response 1 with y_{i,1} and response 2 with y_{i,2}
  • Since when the label is class 1 we have y_{i,1} = 1 and y_{i,2} = 0, we reward the model with 1 when it correctly responds with class 1 and give 0 reward when it incorrectly responds with class 2. Similarly for class 2.

And that's exactly what we've been doing!

So to recap,

  • We designed a simple training setting where we reward the robot with 1 point when it answers correctly and 0 points when it answers incorrectly
  • We summarized what we want to achieve in an objective function that describes the reward the robot gets, weighted by the chance of each response
  • We found the gradient ascent procedure to maximize this objective function
  • And we got… the exact procedure that we use when training a model by calculating the Cross Entropy loss first and backpropping through it!

Coming Back to Reinforcement Learning

Now let's turn our focus back to the Reinforcement Learning setting. What are the differences between RL and the supervised ML setting?

Multiple Timesteps

The first difference is that RL usually involves multiple states and multiple timesteps. In our setting, the robot starts with the input image, i.e. state s. After the robot gives you back an answer based on its prediction and collects the reward, the interaction between the robot and you is done.

In contrast, in RL problems, agents often interact with the environment over multiple timesteps, and they might transition to other states after the initial state.

The objective function then becomes
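J(θ) = E_{(s_1, a_1, …, s_T, a_T) ~ p_θ} [ Σ_{t=1}^{T} r(s_t, a_t) ]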

Putting it in words, we maximize the average sum of rewards over all timesteps, over all possible sequences (trajectories) of states and actions, weighted by the probability of each trajectory occurring when actions are decided by parameter θ.

Note that p_θ is the joint distribution of a sequence of states and actions when actions are decided by the agent's model parameter θ. At each timestep, the agent's action is decided by π_θ(a_t|s_t), where π_θ is a model parameterized by θ. p_θ is a high-level abstraction of how probable it is for a sequence of states and actions to occur when the agent makes decisions based on π_θ (i.e. p_θ is a placeholder for how often, theoretically, the agent takes on that trajectory, while π_θ(a|s) is the probability that the agent will take an action at a specific timestep). We don't actually know the value of p_θ easily, so later we will rewrite it with the model output π_θ(a|s), which we do know.

Let's compare it with the objective we had earlier:
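J(θ) = E_{s ~ D_s, a ~ π_θ(a|s)} [ r(s, a) ]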

The main differences are

  • We calculate expectation over a sequence of s and a instead of just one pair.
  • We maximize the sum of rewards from all timesteps in the trajectory instead of only the one-timestep reward from the image and answer.

Comparing the Gradient Formula

We can do similar manipulations on this objective to derive the gradient that we can use to update θ.

To recap, our goal is to find the gradient of J(θ) in the following form, for some f:
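∇_θ J(θ) = E_{(s_1, a_1, …, s_T, a_T) ~ p_θ} [ f(θ, s_1, a_1, r_1, …, s_T, a_T, r_T) ]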

When we obtain a batch of sampled sequences s_1, a_1, r_1, …, s_T, a_T, r_T, we can then update θ via Stochastic Gradient Ascent:
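θ ← θ + α · (1/N) Σ_{i=1}^{N} f(θ, s_{i,1}, a_{i,1}, r_{i,1}, …, s_{i,T}, a_{i,T}, r_{i,T})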

To simplify things, let's denote the whole sequence of states and actions with one variable, τ = (s_1, a_1, …, s_T, a_T).

So we hope to maximize the following objective function
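J(θ) = E_{τ ~ p_θ(τ)} [ Σ_{t=1}^{T} r(s_t, a_t) ]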

We can do the same manipulations that we did before:

  • Write the expectation in terms of an integration

  • Take the derivative with respect to θ of the only term involving θ: p_θ(τ)

  • Rewrite the gradient of p_θ(τ) as a product of p_θ(τ) and something else, to recover the form that defines an expectation

So we obtain
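∇_θ J(θ) = E_{τ ~ p_θ(τ)} [ ∇_θ log p_θ(τ) · Σ_{t=1}^{T} r(s_t, a_t) ]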

Voila! It is exactly what we wanted to find. To put it in words, it means we update θ in the direction of the gradient of the log probability of sampled trajectories τ under the actions determined by θ, weighted by the total reward along each sampled τ. This is exactly the formulation of Policy Gradient.

If we extend the cross entropy analogy from earlier, the sum of rewards is basically the label for the trajectory, and p_θ(τ) is how likely τ is to happen under the model's prediction. The training process encourages the model to predict distributions similar to the distribution of rewards over different trajectories τ. (This is in fact a mathematically accurate statement [correct me if I'm wrong]. If you know about KL-Divergence, compare what's being taken the gradient of to KL-Divergence.)

We can do some more manipulations with conditional probabilities and the definition of p_θ(τ). This process is well explained in this video (around 9:27). We finally obtain the following, which rewrites p_θ(τ) in terms of π_θ(a_t|s_t), whose value we actually know:
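∇_θ J(θ) = E_{τ ~ p_θ(τ)} [ ( Σ_{t=1}^{T} ∇_θ log π_θ(a_t|s_t) ) · ( Σ_{t=1}^{T} r(s_t, a_t) ) ]
         ≈ (1/N) Σ_{i=1}^{N} [ ( Σ_{t=1}^{T} ∇_θ log π_θ(a_{i,t}|s_{i,t}) ) · ( Σ_{t=1}^{T} r(s_{i,t}, a_{i,t}) ) ]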

Note that when T = 1 (a single timestep), this is the same as the gradient that we obtained in our setting before. In other words, supervised ML is a special case of RL where there's only one timestep, and the reward is non-stochastic (see the next section).

Another Difference: Estimating Rewards

Another difference between RL and supervised ML is how much we can trust the rewards. In supervised ML, the rewards are ground-truth labels that come with the image samples. We are usually 100% sure that the rewards are correct, and our robot will adjust its behavior toward those labels.

However, in RL problems, the rewards can be more stochastic (imagine that when you play a game, you could be in the same place twice but get different scores). So we have to estimate the reward for a particular state-action pair from historical rewards as we interact with the environment.

[Optional] Side thought: I was also wondering if there's a middle territory between supervised ML (where labels/rewards are 100% trustworthy) and RL (where rewards are more stochastic). It seems like when labels are noisy (contain some wrong labels), we are kind of in the middle? So would pseudo-labeling methods share some flavor with RL problems? Let me know your thoughts.

Technically, in the long run we should have enough historical rewards to understand the average reward behavior, but in the short run, a small sample size might produce unstable, biased estimates of it.

Worse, since agent behavior is updated with the rewards collected, if we collect low-quality rewards, we might fall into and get stuck at a bad policy. And it could take a long time to get out of there and back on the right track.

This is one of the challenges in RL that is still an ongoing research area. Manipulations of the rewards and variants of policy gradient such as TRPO and PPO are designed to address this issue better, and they have become more commonly used than vanilla PG.

[Optional] Another Side Thought: Comparison to Sequential Supervised ML

One difference between our supervised ML setting and RL is that RL often involves multiple timesteps. I immediately had the question: how does RL then differ from training a sequential model like a Transformer or an LSTM?

The answer to this question definitely depends on the exact loss design used to train your favorite sequential model.

For now, let's say you train a sequence model f(x_1, x_2, …, x_T) to predict y_1, y_2, …, y_T. For example, in a machine translation task, the x's could be the words of an input English sentence, and the y's are the words of the output French sentence (each x_t, y_t is a one-hot vector representation of a word).

We calculate the loss function on each sample by taking the sum of the cross entropy between each output word prediction and the true label. We then average it over a batch of samples and do backprop as follows:
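θ ← θ − α · ∇_θ [ −(1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} Σ_{c} y_{i,t,c} · log ŷ_{i,t,c} ]

(where c indexes the output classes, i.e. the vocabulary, in my notation)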

Putting this back into the Policy Gradient formulation, to me this is the same as calculating the gradient of the objective function as
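∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} ∇_θ log π_θ(a_{i,t}|s_{i,t}) · r(s_{i,t}, a_{i,t})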

The difference between this formulation and PG's formulation is that we are not multiplying the sum of the log probabilities of all timesteps' predictions by the sum of the rewards from all timesteps. Instead, we take the pairwise product of the log probability and the reward at each timestep, and sum those up.

This removes a lot of cross terms and thus greatly reduces the variance of the gradients, which might be part of what makes training a Transformer/LSTM in a supervised setting easier than training an RL algorithm (in addition to the non-stochastic rewards in the supervised setting).

A technique for reducing the variance of PG is introduced in this video: change the sum of rewards over all timesteps in PG to the reward-to-go (i.e. the sum from t' = t to t' = T). This shares a similar flavor with the difference between PG and training a Transformer/LSTM in a supervised setting. While the reward-to-go method makes the agent evaluate each state by its possible future rewards, could we say that supervised sequential training makes the model focus only on the current timestep's correctness?

Also, I tried to work backwards from this gradient expression to find the original J(θ) that results in it, so we could more directly interpret the objective of supervised sequential training. But I got stuck halfway. Let me know if you have any thoughts.

Acknowledgment

The connection between policy gradient and cross entropy is not my own original idea. Thanks to this post for giving me the idea of expanding it to more fundamentally understand what cross entropy and policy gradient are doing.

Tags: Cross Entropy Machine Learning Policy Gradient Reinforcement Learning Supervised Learning
