## Welcome

This is the first post in what will (hopefully) be a small series of blog posts documenting my understanding of various concepts and algorithms in reinforcement learning utilizing policy gradients. This is for my own understanding, so of course, there might be some mistakes. Feel free to notify me of any. I will be using MATLAB because I can (this is a free country, after all), and because MATLAB is great.

In case you do not care about the underlying theory and just want to mess around with RL, check out MATLAB’s Reinforcement Learning Toolbox which will make your life much easier (assuming your life revolves around RL). MATLAB’s RL Toolbox has some tools which allow you to easily build custom training loops to test your own RL algorithms without having to write everything from scratch. A good example of REINFORCE implemented with MATLAB’s tools is provided here. If you do not want to write your own algorithm and you just want to train a specific agent to do something, the RL Toolbox also has plenty of prebuilt agents for you to choose from, and plenty of examples on how to use them.

In this post, I will walk through the derivation of the policy gradient theorem. But before I begin, I would like to give credit to a few excellent resources that helped me along the way.

Of course, Sutton & Barto’s textbook is my first recommended resource. Specifically, chapter 13 is what you should read for information about policy gradient methods. Chapter 10 is also very helpful as it talks about function approximation methods which are used in conjunction with policy gradient methods in modern RL algorithms.

Personally, I found the derivation of the policy gradient theorem in the textbook to be a bit difficult to follow because a few mathematical details are glossed over. However, Lillian Weng has a blog post with an excellent derivation of the policy gradient theorem that includes an explanation of the minor mathematical details which I found to be very helpful.

OpenAI also has a post with a much simpler derivation of the policy gradient theorem which I think is the easiest to understand. This is the derivation that I will follow.

Finally, Janis Klaise has a great blog post with a simple implementation and clear explanation of the REINFORCE algorithm. I will be using a similar implementation.

## Derivation

Suppose we are trying to solve the discrete cart-pole problem. We want to find a policy that will push the cart either left or right depending on the current state. Mathematically, this means we want a function that will accept as input the current state, and will output the probability of pushing the cart left or right. Because the cart-pole environment is simple, we do not need a neural network. We can define the policy as

\begin{aligned} \pi_{\theta} \left(\text{left} \vert s_t\right) &= \frac{1}{1 + e^{-z\left(\theta,s_t\right)}} &\quad \text{probability of pushing left given the current state, } s_t\\ \pi_{\theta} \left(\text{right} \vert s_t\right) &= 1 - \pi_{\theta} \left(\text{left} \vert s_t\right) &\quad \text{probability of pushing right given the current state, } s_t \end{aligned}

where

$z\left(\theta,s_t\right) = \theta_1 s_{t_1} + \theta_2 s_{t_2} + \theta_3 s_{t_3} + \theta_4 s_{t_4}$

and $s_{t_1}$is the position of the cart, $s_{t_2}$is the velocity of the cart, $s_{t_3}$is the angle of the pole, and $s_{t_4}$is the angular velocity of the pole. The $\theta_i$ values are the weights we want to tune so that the best action has the higher probability of being chosen for each state. Notice that the policy is chosen such that the output is always between $0$ and $1$, and the sum of the outputs for each action is $1$.

Next, we define a trajectory $\tau_\theta$ as

$\tau_\theta = \left(s_0, a_0, s_1, a_1, \ldots, s_{T + 1}\right)$

This is the sequence of states and actions that the agent follows in an episode under a certain policy $\pi_\theta$.

Next, we need to determine how well the agent performed in each episode. To do this, we will use the finite-horizon discounted return which is defined as

$G\left(\tau_\theta\right) = \sum_{t = 0}^{T} \gamma^t r_t$

This is the sum of the discounted rewards that the agent receives in the current episode by following policy $\pi_\theta$. Note that I am assuming a finite horizon because the cart-pole environment is episodic. But, according to OpenAI, the derivation for the infinite-horizon case $\left(T = \infty\right)$ is identical.

We can then define a policy performance metric for how good the policy is as

$J\left(\pi_\theta\right) = \mathbb{E} \left[G\left(\tau_\theta\right)\right]$

This is the expected return at the end of each episode by following policy $\pi_\theta$. If we randomly initialize $\theta$, run a few episodes, and store the return at the end of each episode, we will most likely find that the average of the returns is low, that is, the policy is bad. However, from calculus, we know that the gradient of a function always points in the direction of the greatest rate of increase. If we can calculate the gradient of $J\left(\pi_\theta\right)$ with respect to the policy parameters, $\theta$, we can update $\theta$ according to

$\theta = \theta + \alpha \nabla_\theta J\left(\pi_\theta\right)$

where $\alpha$ is the step size, or learning rate. This should result in a policy that will give us a higher return at the end of each episode. This is known as gradient ascent. You might be used to gradient descent from more common machine learning methods in which you want to minimize an error. Here, we want to maximize the return.

But how do we calculate $\nabla_\theta J\left(\pi_\theta\right)$? This can be done using the policy gradient theorem which we will now derive.

We begin by taking the gradient of both sides of the policy performance metric:

$\nabla_\theta J\left(\pi_\theta\right) = \nabla_\theta \mathbb{E} \left[G\left(\tau_\theta\right)\right]$

We then expand the expectation. We know an expectation is the sum of the different values we expect to receive multiplied by their respective probability of occurring. In this case, the values are the returns we expect to receive at the end of each episode while following a certain trajectory, and the probabilities are the probabilities of those trajectories occurring under the current policy, so

$\nabla_\theta \mathbb{E} \left[G\left(\tau_\theta\right)\right] = \nabla_\theta \int_\tau P \left(\tau \vert \theta\right) G\left(\tau\right)$

We can then bring the gradient under the integral:

$\nabla_\theta \int_\tau P \left(\tau \vert \theta\right) G\left(\tau\right) = \int_\tau \nabla_\theta P \left(\tau \vert \theta\right) G\left(\tau\right)$

I am not sure when this can or cannot be done, but according to OpenAI, it can be done in this case, so I will take their word for it. Now, we recall that

$\nabla \log x = \frac{1}{x}$

Usually, the numerator is $1$ because the derivative of the argument of the logarithm, $x$, with respect to $x$ is $1$, but in this case, we have

$\nabla_\theta \log P \left(\tau \vert \theta\right) = \frac{ \nabla_\theta P \left(\tau \vert \theta\right)}{ P \left(\tau \vert \theta\right)}$

because the argument of the logarithm is a function of $\theta$. We can then rewrite the integral as

$\int_\tau \nabla_\theta P \left(\tau \vert \theta\right) G\left(\tau\right) = \int_\tau P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau\right)$

because

$P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) = \nabla_\theta P \left(\tau \vert \theta\right)$

If you look closely, you will see that we have now rewritten the integral as a new expectation over the product of the gradient of the log-probability of the trajectory and the expected return:

$\int_\tau P \left(\tau \vert \theta\right) \nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau\right) = \mathbb{E} \left[\nabla_\theta \log P \left(\tau \vert \theta\right) G\left(\tau_\theta\right)\right]$

It seems like we just made everything more complicated, but after this next step, the reasoning for doing all this should be more apparent. First, we will expand $P \left(\tau \vert \theta\right)$ as

$P \left(\tau \vert \theta\right) = \rho_0 \left(s_0\right) \prod_{t= 0}^T P \left(s_{t + 1} \vert s_t, a_t \right) \pi_\theta \left(a_t \vert s_t \right)$

This says that the probability of following trajectory $\tau$ under the current policy is the product of the probability of starting in state $s_0$, the probability taking action $a_0$ under the current policy, and the probability of transitioning to state $s_1$ after taking action $a_0$, and so on. The log of this is

\begin{aligned} \log \left(\rho_0 \left(s_0\right) \prod_{t= 0}^T P \left(s_{t + 1} \vert s_t, a_t \right) \pi_\theta \left(a_t \vert s_t \right)\right) &= \log \left( \rho_0 \left(s_0\right) P \left(s_{1} \vert s_0, a_0 \right) \pi_\theta \left(a_0 \vert s_0 \right) \ldots P \left(s_{T + 1} \vert s_T, a_T \right) \pi_\theta \left(a_T \vert s_T \right)\right) \\ &= \log \rho_0 \left(s_0\right) + \log P \left(s_{1} \vert s_0, a_0 \right) + \log \pi_\theta \left(a_0 \vert s_0 \right) + \cdots + \log P \left(s_{T + 1} \vert s_T, a_T \right) + \log \pi_\theta \left(a_T \vert s_T \right) \\ &= \log \rho_0 \left(s_0\right) + \sum_{t = 0}^T \left( \log P \left(s_{t + 1} \vert s_t, a_t \right) + \log \pi_\theta \left(a_t \vert s_t \right) \right) \end{aligned}

We then take the gradient of this expression and note that only $\pi_\theta$ depends on $\theta$ so that

\begin{aligned} \nabla_\theta \left( \log \rho_0 \left(s_0\right) + \sum_{t = 0}^T \left( \log P \left(s_{t + 1} \vert s_t, a_t \right) + \log \pi_\theta \left(a_t \vert s_t \right) \right) \right) &= \nabla_\theta \sum_{t = 0}^T \log \pi_\theta \left(a_t \vert s_t \right) \\ &=\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \end{aligned}

Finally, we rewrite the expectation as

$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right)G\left(\tau_\theta\right)\right]$

This the policy gradient theorem. However, we need to make one last modification. Notice that the sum of the gradient of the log-probabilities is multiplied by the total return which is the sum of all discounted rewards obtained during the episode. This does not make much sense since only rewards that come after an action is taken should have an effect. So we modify the final result to be

$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t'} r_{t'}\right]$

Why this works mathematically is something I will leave to the geniuses over at OpenAI. They provide what I expect to be a great proof here. I started reading it but stopped when I realized it took almost four average mouse scrolls to get to the bottom of the page and saw nothing but equations. But I trust them.

We are now able to approximate the gradient of $J\left(\pi_\theta\right)$ simply by sampling a few trajectories and plugging the appropriate terms into the above expectation. In my next post, we will implement REINFORCE with the cart-pole environment using the policy defined above which should help make everything more concrete.