How Does PPO With Clipping Work?


In Reinforcement Learning, Proximal Policy Optimization (PPO) is often cited as the go-to example of a policy-based approach, in contrast to DQN (a value-based approach) and the large family of actor-critic methods, which includes TD3 and SAC.

I recall that when I was learning it for the first time, I was left unconvinced. Many teachers adopt a somewhat hand-wavy approach. I don't buy that, and neither should you.

In this article, I will attempt to explain how PPO works, supporting the math with both intuition and code. You can try different scenarios, and see for yourself that it works not just in principle but also in practice, and that there is no cherry picking.

Why Bother?

PPO and other SOTA models can be implemented in minutes using stable-baselines3 (sb3). Anyone following the documentation can get it running, without any knowledge of the underlying model.

However, whether you are a practitioner or a theorist, fundamentals do matter. If you simply treat PPO (or any model for that matter) as a black box, how do you expect your users to have faith in what you deliver?

I will do a detailed code walkthrough later this month, writing a wrapper such that any environment, be it one from Gymnasium or your own, works with any sb3 model, regardless of whether the space is ‘Discrete' or ‘Box'. (Last month, I showed how Monte Carlo, SARSA, and Q-learning can be derived from TD(λ), all with a single set of code.)

But enough about what lies ahead; let's be here, right now!

Predecessor to PPO

Vanilla policy gradient is the most basic of the policy-based methods, where the policy is learnt and updated directly rather than being derived from some value function. The drawback is that it suffers from high variance in the policy updates, which is problematic for convergence, especially in environments with sparse rewards.
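For reference, the vanilla policy gradient ascends the gradient of the log-probability of each action taken, weighted by the return (or an advantage estimate):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]
$$

Because the weight is estimated from sampled trajectories, this gradient can swing wildly from batch to batch, which is exactly the variance problem mentioned above.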

Math of TRPO

TRPO (Trust Region Policy Optimization) ensures that the new policy (where ‘new' refers to after an update) does not deviate too far from the old policy. This is done by asserting a constraint that the KL divergence of the new policy with respect to the old policy does not exceed some threshold δ.

Objective of TRPO. Equation typed by author, with reference to OpenAI Spinning Up.
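In the notation of OpenAI Spinning Up, the TRPO update can be written as:

$$
\theta_{k+1} = \arg\max_{\theta}\; \mathcal{L}(\theta_k, \theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta \,\|\, \theta_k) \le \delta,
$$

$$
\text{where} \quad \mathcal{L}(\theta_k, \theta) = \mathbb{E}_{s,a \sim \pi_{\theta_k}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a)\right].
$$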

Notice that each policy π_θ is, itself, a distribution. D_KL with a ‘hat' above denotes the (weighted) average of the KL divergence over states visited under the old policy. The KL divergence itself is an average of the log of probability ratios, weighted according to the first distribution.

Formula for KL divergence between two discrete distributions p and q. For continuous distributions, we would have probability densities and an integral in place of the summation. Equation by author.
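In standard form, for discrete distributions p and q:

$$
D_{KL}(p \,\|\, q) = \sum_{x} p(x)\, \log \frac{p(x)}{q(x)}
$$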

The objective function is the surrogate advantage, which is a probability ratio (the action probability under the new policy divided by that under the old policy) multiplied by the advantage. The advantage is the expected return in excess of some baseline, which can simply be a moving average of the corresponding state value.
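Concretely, the advantage of an action measures how much better it is than the policy's average behaviour at that state:

$$
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
$$

where Q^π is the action value and V^π is the state value acting as the baseline.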

We have a constrained optimization problem, which is solved using Lagrange multipliers and the conjugate gradient method. The objective and constraint are linearized using a Taylor expansion, hence the solution is approximate. To ensure the constraint is met, a backtracking line search is employed to adjust the step size of the policy update.

(This is where I draw the line and stop going deeper. The above is math corresponding to a Level 7xx course, which practitioners can skip; knowing the general direction described here is good enough, in my opinion.)

Intuition of TRPO

  • If an action is advantageous, i.e. A > 0, we want to increase its probability. It should be selected more often after updating the policy.
  • If an action is disadvantageous, i.e. A < 0, we want to reduce its probability. L is negative, and we want it to be less negative so as to maximize the objective.
  • A small KL divergence means that the probability of each action under the old and new policies stays close. At the extreme, where the two policies are identical, each log ratio is log(1) = 0, and the KL divergence is zero. (See the short sketch after this list for a numerical illustration.)
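To make the last point concrete, here is a minimal sketch (not part of the article's main code) that computes the KL divergence between the action distributions of an old and a new policy at a single state:

import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) for two discrete distributions, following the formula above
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

pi_old  = [0.2, 0.5, 0.3]      # action probabilities under the old policy
pi_near = [0.22, 0.48, 0.30]   # a new policy that stays close
pi_far  = [0.6, 0.2, 0.2]      # a new policy that moved a lot

print(kl_divergence(pi_old, pi_near))  # small, close to zero
print(kl_divergence(pi_old, pi_far))   # considerably larger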

Moving to PPO

Solving the constrained optimization problem involves the backtracking line search (i.e. repeated computations with smaller steps each time, until the constraint is satisfied) as well as computing the Hessian matrix, which comprises second-order partial derivatives. Can we simplify the process?

John Schulman, the first author of the 2015 TRPO paper, introduced an improved model in 2017, again as first author. That model is PPO, and it remains widely used even today (2023).

Math of PPO

I will be showing less math here, so stay with me! There are different variants of PPO, and we will talk about the ‘clip' version here, which is also the version used by sb3.

Instead of imposing a constraint on the KL divergence, the objective function is modified such that it does not benefit from large changes in the policy. We still try to increase the probability of an advantageous action (and reduce the probability of a disadvantageous action), but the effect of the probability ratio going much higher (or much lower) than 1 is bounded by the clipping.

A value of 0.2 is commonly chosen for ϵ. This value is also used in the original paper.

Objective of PPO. Equation typed by author, with reference to OpenAI Spinning Up.
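In the notation of OpenAI Spinning Up, with r(θ) denoting the probability ratio π_θ(a|s) / π_θ_old(a|s), the clipped objective is:

$$
L^{CLIP}(s, a, \theta_{old}, \theta) = \min\!\Big( r(\theta)\, A^{\pi_{\theta_{old}}}(s, a),\;\; \operatorname{clip}\big(r(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A^{\pi_{\theta_{old}}}(s, a) \Big)
$$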

Figure 1 of the paper (on page 3) shows that the factor of (1 + ϵ) or (1 – ϵ), depending on whether the advantage is positive or negative, sets an upper bound on the objective function in either case. This is better visualized with the following:

Simplified equation, depending on value of advantage A.
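Writing it out for each sign of the advantage, with r(θ) again denoting the probability ratio, the clipped objective reduces to:

$$
L^{CLIP} = \begin{cases}
\min\!\big(r(\theta),\, 1+\epsilon\big)\, A & \text{if } A \ge 0, \\
\max\!\big(r(\theta),\, 1-\epsilon\big)\, A & \text{if } A < 0.
\end{cases}
$$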

Do not get confused by the ‘min' operator. We want to maximise the objective L, making it as high (as positive) as possible. The ‘min' operator is simply what allows (1 + ϵ) or (1 – ϵ) to take effect and form the ceiling.
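As a quick sanity check, take ϵ = 0.2 and A = 2:

$$
r = 1.1:\ \min(2.2,\, 2.2) = 2.2, \qquad
r = 1.5:\ \min(3.0,\, 2.4) = 2.4, \qquad
r = 2.0:\ \min(4.0,\, 2.4) = 2.4
$$

Once the ratio exceeds 1 + ϵ = 1.2, the objective is stuck at (1 + ϵ)A, so there is no incentive to push the policy any further in that direction.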

Intuition of PPO

  • If A > 0, our objective function L is greater (more positive) when the probability ratio is above 1, i.e. the new policy selects the advantageous action more often. However, there is an upper bound due to (1 + ϵ).
  • If A < 0, our objective function L is greater (less negative) when the probability ratio is below 1, i.e. the new policy selects the disadvantageous action less often. However, there is an upper bound due to (1 – ϵ).
  • The gradient update can easily be performed using the usual PyTorch or TensorFlow machinery.

Code

Let's first convince everyone that the gradient update works, without any clipping involved.

1 Layer

We set up a simple neural network with one layer that takes in a state of dimension 1 and outputs action probabilities over 3 possible choices.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

torch.manual_seed(0)

class PolicyNetwork(nn.Module):
    def __init__(self, n_in=1, n_out=3):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Linear(n_in, n_out)   # single linear layer: state -> action logits
        self.softmax = nn.Softmax(dim=1)   # convert logits to probabilities

    def forward(self, x):
        x = self.fc(x)
        return self.softmax(x)             # probabilities sum to 1 along dim 1

You can have a multi-dimensional state, or even an image, and the concept remains the same: for any given state, the neural network gives you probabilities that sum to 1.
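As a quick usage check (a small sketch, not part of the article's main listing), you can confirm that the output is a valid probability distribution:

policy_check = PolicyNetwork()                 # throwaway instance, just for this check
probs = policy_check(torch.tensor([[1.]]))
print(probs)             # three probabilities for the single input state
print(probs.sum(dim=1))  # sums to 1 (up to floating-point error)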

Without Clip

Now, let's prescribe a state. For this exercise, it can be any constant. In practice, the state is obtained from observations of the environment and is entirely independent of the parameters θ, i.e. it is treated as a constant when computing the gradients.

The advantage can be computed using the generalized advantage estimation (see equation 11 of the same paper cited above, by Schulman et al. 2017 if interested). It combines information from multiple time horizons in an attempt to balance the trade-off between bias and variance, similar to TD(λ).
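For completeness, here is a minimal sketch of how GAE could be computed from a trajectory of rewards and value estimates; the function name and arguments (rewards, values, gamma, lam) are my own placeholders, and it is not used in the toy example below, where the advantage is simply prescribed.

def gae(rewards, values, gamma=0.99, lam=0.95):
    # Generalized advantage estimation over a single trajectory.
    # rewards: r_0 ... r_{T-1}; values: V(s_0) ... V(s_T) (includes the bootstrap value).
    advantages = []
    adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        adv = delta + gamma * lam * adv                         # discounted sum of TD errors
        advantages.append(adv)
    return list(reversed(advantages))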

policy = PolicyNetwork()
optimizer = optim.Adam(policy.parameters(), lr=0.01)

state = torch.tensor([[1.]])

pi = policy(state)
print("Before: ", pi)

advantages = torch.tensor([[-2., 0., 0.]], requires_grad=True)  # negative advantage for the first action only
loss = -torch.sum(pi * advantages)  # negative sign: we maximise the objective via gradient descent

optimizer.zero_grad()
loss.backward()
optimizer.step()

pi = policy(state)
print("After: ", pi)

For this exercise, we arbitrarily prescribe a negative advantage for the first action, simulating that the first action was taken and resulted in a negative advantage. This lets us verify whether the parameters are updated such that the probability of the first action decreases.

Output of the code cell above, where the first action leading to a negative advantage was taken

Indeed, after the gradient update, the policy outputs a lower probability for the first action, and a higher probability for the other actions.

With Clip

Now, we will implement the clipping. To first check for correctness, a single update will be performed. This time, suppose that the second action is taken, and the advantage is estimated to be 1.8.

Notice that a negative sign is added to the sum-product to get the loss (upon which backpropagation will be performed). This is because we want to maximise the objective function while performing gradient descent.

epsilon = 0.2

old_policy = PolicyNetwork()
old_policy.load_state_dict(policy.state_dict())  # snapshot of the current policy, acting as pi_old

pi_old = old_policy(state)
r = pi / pi_old  # probability ratio of the new policy to the old policy

advantages = torch.tensor([[0., 1.8, 0.]], requires_grad=True)

clipped_ratio = torch.clamp(r, 1 - epsilon, 1 + epsilon)               # clip the ratio to [1 - eps, 1 + eps]
ppo_objective = torch.min(r * advantages, clipped_ratio * advantages)  # PPO-clip surrogate
loss = -torch.sum(ppo_objective)

print("Probabilities before update:", policy(state))

optimizer.zero_grad()
loss.backward()
optimizer.step()

print("Probabilities after update:", policy(state))

Output of the code cell above, where the second action leading to a positive advantage was taken

So far, so good. When an action leading to a positive advantage was taken, updating the policy increases the probability of that action. Now, we will consider the net effect of many iterations, where each of the different actions is taken multiple times.

You might say that, in reality, the advantages are noisy, and using a constant is unfair. And you are right! Let's add random noise via np.random.randn().

num_iterations = 500
advantage_dict = {
    0: [-2., 0., 0.],
    1: [0., 1.8, 0.],
    2: [0., 0., 0.1],
}
np.random.seed(2023)

for i in range(num_iterations):
    pi = policy(state)
    pi_old = old_policy(state)
    r = pi / pi_old
    old_policy.load_state_dict(policy.state_dict())

    # add noise to the advantage of the action taken (the non-zero entry), keeping
    # the length-3 vector so it lines up element-wise with the probability ratio r
    noisy_advantage = [
        x + np.random.randn() if x != 0 else 0. for x in advantage_dict[i % 3]
    ]
    advantage = torch.tensor([noisy_advantage], requires_grad=True)
    ppo_objective = torch.min(
        r * advantage, 
        torch.clamp(r, 1 - epsilon, 1 + epsilon) * advantage
    )
    loss = -torch.sum(ppo_objective)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("After: ", pi)

After 500 iterations, with noise included, we have the following action probabilities under the given state. (Of course, we can easily extend this environment to consider different states.)

Output of the code cell above, where actions are taken repeatedly, with noise added

We see that the net effect of all the updates leads to the probability of the second action increasing, while the probabilities of the other actions decrease. (We started with [0.2149, 0.5257, 0.2594].)

This is in spite of the fact that the last action has a positive mean advantage. It is a result of the gradient depending not just on the sign, but also on the magnitude, of the advantage. Each time the last action is taken, its probability increases after the update, but this is offset by the effect of the updates when the second action is taken.

Finally, let's see the effect of clipping in action. Repeat the cells above (where epsilon ϵ was set to 0.2), but this time set it to a smaller value, say 0.02.

Output when the code is re-run with a smaller value of epsilon

Excellent! We see that the clipping indeed works! By using a smaller ϵ, the policy has changed to a smaller extent, albeit in the same desired direction.

Conclusion

In this article, we looked at how PPO came about. The math is kept to just the bare essentials – enough for practitioners to present the ideas to other stakeholders without drowning in detailed derivations. We obtained the key intuitions from the math, to understand the principles of how PPO works.

In addition, you now have the code to prove that the underlying components of PPO really deliver on their promise. We saw that the most advantageous action has its probability increased, even with noise added. Furthermore, we saw that tightening the clipping causes the resulting policy to change to a smaller extent.
