Policy Gradient Theorem


This is the first post in what will (hopefully) be a small series of blog posts documenting my understanding of various concepts and algorithms in reinforcement learning utilizing policy gradients. This is for my own understanding, so of course, there might be some mistakes. Feel free to notify me of any. I will be using MATLAB because I can (this is a free country, after all), and because MATLAB is great.

In case you do not care about the underlying theory and just want to mess around with RL, check out MATLAB’s Reinforcement Learning Toolbox which will make your life much easier (assuming your life revolves around RL). MATLAB’s RL Toolbox has some tools which allow you to easily build custom training loops to test your own RL algorithms without having to write everything from scratch. A good example of REINFORCE implemented with MATLAB’s tools is provided here. If you do not want to write your own algorithm and you just want to train a specific agent to do something, the RL Toolbox also has plenty of prebuilt agents for you to choose from, and plenty of examples on how to use them.

In this post, I will walk through the derivation of the policy gradient theorem. But before I begin, I would like to give credit to a few excellent resources that helped me along the way.

Of course, Sutton & Barto’s textbook is my first recommended resource. Specifically, chapter 13 is what you should read for information about policy gradient methods. Chapter 10 is also very helpful as it talks about function approximation methods which are used in conjunction with policy gradient methods in modern RL algorithms.

Personally, I found the derivation of the policy gradient theorem in the textbook to be a bit difficult to follow because a few mathematical details are glossed over. However, Lillian Weng has a blog post with an excellent derivation of the policy gradient theorem that includes an explanation of the minor mathematical details which I found to be very helpful.

OpenAI also has a post with a much simpler derivation of the policy gradient theorem which I think is the easiest to understand. This is the derivation that I will follow.

Finally, Janis Klaise has a great blog post with a simple implementation and clear explanation of the REINFORCE algorithm. I will be using a similar implementation.


Suppose we are trying to solve the discrete cart-pole problem. We want to find a policy that will push the cart either left or right depending on the current state. Mathematically, this means we want a function that will accept as input the current state, and will output the probability of pushing the cart left or right. Because the cart-pole environment is simple, we do not need a neural network. We can define the policy as

πθ(leftst)=11+ez(θ,st)probability of pushing left given the current state, stπθ(rightst)=1πθ(leftst)probability of pushing right given the current state, st\begin{aligned} \pi_{\theta} \left(\text{left} \vert s_t\right) &= \frac{1}{1 + e^{-z\left(\theta,s_t\right)}} &\quad \text{probability of pushing left given the current state, } s_t\\ \pi_{\theta} \left(\text{right} \vert s_t\right) &= 1 - \pi_{\theta} \left(\text{left} \vert s_t\right) &\quad \text{probability of pushing right given the current state, } s_t \end{aligned}


z(θ,st)=θ1st1+θ2st2+θ3st3+θ4st4z\left(\theta,s_t\right) = \theta_1 s_{t_1} + \theta_2 s_{t_2} + \theta_3 s_{t_3} + \theta_4 s_{t_4}

and st1s_{t_1}is the position of the cart, st2s_{t_2}is the velocity of the cart, st3s_{t_3}is the angle of the pole, and st4s_{t_4}is the angular velocity of the pole. The θi\theta_i values are the weights we want to tune so that the best action has the higher probability of being chosen for each state. Notice that the policy is chosen such that the output is always between 00 and 11, and the sum of the outputs for each action is 11.

Next, we define a trajectory τθ\tau_\theta as

τθ=(s0,a0,s1,a1,,sT+1)\tau_\theta = \left(s_0, a_0, s_1, a_1, \ldots, s_{T + 1}\right)

This is the sequence of states and actions that the agent follows in an episode under a certain policy πθ\pi_\theta.

Next, we need to determine how well the agent performed in each episode. To do this, we will use the finite-horizon discounted return which is defined as

G(τθ)=t=0TγtrtG\left(\tau_\theta\right) = \sum_{t = 0}^{T} \gamma^t r_t

This is the sum of the discounted rewards that the agent receives in the current episode by following policy πθ\pi_\theta. Note that I am assuming a finite horizon because the cart-pole environment is episodic. But, according to OpenAI, the derivation for the infinite-horizon case (T=)\left(T = \infty\right) is identical.

We can then define a policy performance metric for how good the policy is as

J(πθ)=E[G(τθ)]J\left(\pi_\theta\right) = \mathbb{E} \left[G\left(\tau_\theta\right)\right]

This is the expected return at the end of each episode by following policy πθ\pi_\theta. If we randomly initialize θ\theta, run a few episodes, and store the return at the end of each episode, we will most likely find that the average of the returns is low, that is, the policy is bad. However, from calculus, we know that the gradient of a function always points in the direction of the greatest rate of increase. If we can calculate the gradient of J(πθ)J\left(\pi_\theta\right) with respect to the policy parameters, θ\theta, we can update θ\theta according to

θ=θ+αθJ(πθ)\theta = \theta + \alpha \nabla_\theta J\left(\pi_\theta\right)

where α\alpha is the step size, or learning rate. This should result in a policy that will give us a higher return at the end of each episode. This is known as gradient ascent. You might be used to gradient descent from more common machine learning methods in which you want to minimize an error. Here, we want to maximize the return.

But how do we calculate θJ(πθ)\nabla_\theta J\left(\pi_\theta\right)? This can be done using the policy gradient theorem which we will now derive.

We begin by taking the gradient of both sides of the policy performance metric:

θJ(πθ)=θE[G(τθ)]\nabla_\theta J\left(\pi_\theta\right) = \nabla_\theta \mathbb{E} \left[G\left(\tau_\theta\right)\right]

We then expand the expectation. We know an expectation is the sum of the different values we expect to receive multiplied by their respective probability of occurring. In this case, the values are the returns we expect to receive at the end of each episode while following a certain trajectory, and the probabilities are the probabilities of those trajectories occurring under the current policy, so

θE[G(τθ)]=θτP(τθ)G(τ)\nabla_\theta \mathbb{E} \left[G\left(\tau_\theta\right)\right] = \nabla_\theta \int_\tau P \left(\tau \vert \theta\right) G\left(\tau\right)

We can then bring the gradient under the integral:

θτP(τθ)G(τ)=τθP(τθ)G(τ)\nabla_\theta \int_\tau P \left(\tau \vert \theta\right) G\left(\tau\right) = \int_\tau \nabla_\theta P \left(\tau \vert \theta\right) G\left(\tau\right)

I am not sure when this can or cannot be done, but according to OpenAI, it can be done in this case, so I will take their word for it. Now, we recall that

logx=1x\nabla \log x = \frac{1}{x}

Usually, the numerator is 11 because the derivative of the argument of the logarithm, xx, with respect to xx is 11, but in this case, we have

θlogP(τθ)=θP(τθ)P(τθ)\nabla_\theta \log P \left(\tau \vert \theta\right) = \frac{ \nabla_\theta P \left(\tau \vert \theta\right)}{ P \left(\tau \vert \theta\right)}

because the argument of the logarithm is a function of θ\theta. We can then rewrite the integral as

τθP(τθ)G(τ)=τP(τθ)θlogP(τθ)G(τ)\int_\tau \nabla_\theta P \left(\tau \vert \theta\right) G\left(\tau\right) = \int_\tau P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau\right)


P(τθ)θlogP(τθ)=θP(τθ)P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) = \nabla_\theta P \left(\tau \vert \theta\right)

If you look closely, you will see that we have now rewritten the integral as a new expectation over the product of the gradient of the log-probability of the trajectory and the expected return:

τP(τθ)θlogP(τθ)G(τ)=E[θlogP(τθ)G(τθ)]\int_\tau P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau\right) = \mathbb{E} \left[\nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau_\theta\right)\right]

It seems like we just made everything more complicated, but after this next step, the reasoning for doing all this should be more apparent. First, we will expand P(τθ)P \left(\tau \vert \theta\right) as

P(τθ)=ρ0(s0)t=0TP(st+1st,at)πθ(atst)P \left(\tau \vert \theta\right) = \rho_0 \left(s_0\right) \prod_{t= 0}^T P \left(s_{t + 1} \vert s_t, a_t \right) \pi_\theta \left(a_t \vert s_t \right)

This says that the probability of following trajectory τ\tau under the current policy is the product of the probability of starting in state s0s_0, the probability taking action a0a_0 under the current policy, and the probability of transitioning to state s1s_1 after taking action a0a_0, and so on. The log of this is

log(ρ0(s0)t=0TP(st+1st,at)πθ(atst))=log(ρ0(s0)P(s1s0,a0)πθ(a0s0)P(sT+1sT,aT)πθ(aTsT))=logρ0(s0)+logP(s1s0,a0)+logπθ(a0s0)++logP(sT+1sT,aT)+logπθ(aTsT)=logρ0(s0)+t=0T(logP(st+1st,at)+logπθ(atst))\begin{aligned} \log \left(\rho_0 \left(s_0\right) \prod_{t= 0}^T P \left(s_{t + 1} \vert s_t, a_t \right) \pi_\theta \left(a_t \vert s_t \right)\right) &= \log \left( \rho_0 \left(s_0\right) P \left(s_{1} \vert s_0, a_0 \right) \pi_\theta \left(a_0 \vert s_0 \right) \ldots P \left(s_{T + 1} \vert s_T, a_T \right) \pi_\theta \left(a_T \vert s_T \right)\right) \\ &= \log \rho_0 \left(s_0\right) + \log P \left(s_{1} \vert s_0, a_0 \right) + \log \pi_\theta \left(a_0 \vert s_0 \right) + \cdots + \log P \left(s_{T + 1} \vert s_T, a_T \right) + \log \pi_\theta \left(a_T \vert s_T \right) \\ &= \log \rho_0 \left(s_0\right) + \sum_{t = 0}^T \left( \log P \left(s_{t + 1} \vert s_t, a_t \right) + \log \pi_\theta \left(a_t \vert s_t \right) \right) \end{aligned}

We then take the gradient of this expression and note that only πθ\pi_\theta depends on θ\theta so that

θ(logρ0(s0)+t=0T(logP(st+1st,at)+logπθ(atst)))=θt=0Tlogπθ(atst)=t=0Tθlogπθ(atst)\begin{aligned} \nabla_\theta \left( \log \rho_0 \left(s_0\right) + \sum_{t = 0}^T \left( \log P \left(s_{t + 1} \vert s_t, a_t \right) + \log \pi_\theta \left(a_t \vert s_t \right) \right) \right) &= \nabla_\theta \sum_{t = 0}^T \log \pi_\theta \left(a_t \vert s_t \right) \\ &=\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \end{aligned}

Finally, we rewrite the expectation as

θJ(πθ)=E[t=0Tθlogπθ(atst)G(τθ)]\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right)G\left(\tau_\theta\right)\right]

This the policy gradient theorem. However, we need to make one last modification. Notice that the sum of the gradient of the log-probabilities is multiplied by the total return which is the sum of all discounted rewards obtained during the episode. This does not make much sense since only rewards that come after an action is taken should have an effect. So we modify the final result to be

θJ(πθ)=E[t=0Tθlogπθ(atst)t=tTγtrt]\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t'} r_{t'}\right]

Why this works mathematically is something I will leave to the geniuses over at OpenAI. They provide what I expect to be a great proof here. I started reading it but stopped when I realized it took almost four average mouse scrolls to get to the bottom of the page and saw nothing but equations. But I trust them.

We are now able to approximate the gradient of J(πθ)J\left(\pi_\theta\right) simply by sampling a few trajectories and plugging the appropriate terms into the above expectation. In my next post, we will implement REINFORCE with the cart-pole environment using the policy defined above which should help make everything more concrete.