In my last post, I derived the policy gradient theorem, which gives us a gradient estimate we can use to shift our policy in a direction that yields higher returns. In this post, we will apply that theorem with the simplest policy gradient algorithm: REINFORCE.
Implementation
At a high level, the REINFORCE algorithm works as follows:

Sample $N$ trajectories. Note that in my implementation below, I assume $N = 1$ for simplicity.

For each trajectory, calculate the sum of the gradients of the log-probabilities of each action taken, each multiplied by the finite-horizon discounted return from the current time step onward.

Add all of the sums and divide by $N$ to get the gradient estimate:
$\nabla_\theta J\left(\pi_\theta\right) \approx \frac{1}{N} \sum_{k = 1}^N \sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t' - t} r_{t'}$

Update the policy parameters, $\theta$, according to
$\theta = \theta + \alpha \nabla_\theta J\left(\pi_\theta\right)$
Now we will apply this to cartpole. Recall the policy we defined in my last post:
$\begin{aligned} \pi_{\theta} \left(\text{left} \vert s_t\right) &= \frac{1}{1 + e^{-z\left(\theta,s_t\right)}} &\quad \text{probability of pushing left given the current state, } s_t\\ \pi_{\theta} \left(\text{right} \vert s_t\right) &= 1 - \pi_{\theta} \left(\text{left} \vert s_t\right) &\quad \text{probability of pushing right given the current state, } s_t \end{aligned}$
First, we need to calculate the gradients of the logprobability of this policy:
$\begin{aligned} \nabla_\theta \log \pi_\theta \left(\text{left} \vert s_t \right) &= \nabla_\theta \left(\log \frac{1}{1 + e^{-z\left(\theta,s_t\right)}}\right) \\ &= \nabla_\theta \left(\log \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1}\right) \\ &= \nabla_\theta \left(\log e^{z\left(\theta,s_t\right)} - \log \left(e^{z\left(\theta,s_t\right)} + 1\right)\right) \\ &= \nabla_\theta \left(z\left(\theta,s_t\right) - \log \left(e^{z\left(\theta,s_t\right)} + 1\right)\right) \\ &= s_t - s_t \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1} \\ &= s_t \left(1 - \pi_\theta \left(\text{left} \vert s_t \right)\right) \\ \\ \nabla_\theta \log \pi_\theta \left(\text{right} \vert s_t \right) &= \nabla_\theta \log \left(1 - \pi_\theta \left(\text{left} \vert s_t \right) \right) \\ &= \nabla_\theta \log \left(\frac{e^{-z\left(\theta,s_t\right)}}{1 + e^{-z\left(\theta,s_t\right)}}\right) \\ &= \nabla_\theta \log \left(\frac{1}{e^{z\left(\theta,s_t\right)} + 1}\right) \\ &= -\nabla_\theta \log \left(e^{z\left(\theta,s_t\right)} + 1\right) \\ &= -s_t \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1} \\ &= -s_t \pi_\theta \left(\text{left} \vert s_t \right) \end{aligned}$
Note that the gradient of $z\left(\theta,s_t\right)$ is simply $s_t$ because $z\left(\theta,s_t\right)$ can be rewritten as
$z\left(\theta,s_t\right) = \theta^T s_t$
where $\theta$ and $s_t$ are $4 \times 1$ column vectors. In this case, we are able to directly calculate the gradient of the policy because we defined it as the simple expressions above, but if we were using a neural network, we would need to rely on automatic differentiation to calculate the gradient. An example of this can be found here.
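Before plugging these expressions into the algorithm, it is worth sanity-checking the closed-form gradients against finite differences, since a single flipped sign will silently push the policy the wrong way. Below is a small Python/NumPy sketch of such a check (the function names `pi_left` and `grad_log_pi` are mine, not part of the MATLAB implementation below):

```python
import numpy as np

def pi_left(theta, s):
    # Probability of pushing left: 1 / (1 + exp(-theta^T s))
    return 1.0 / (1.0 + np.exp(-theta @ s))

def grad_log_pi(theta, s, action):
    # Closed-form grad-log-probabilities derived above:
    #   left:  s * (1 - pi_left),  right: -s * pi_left
    p = pi_left(theta, s)
    return s * (1.0 - p) if action == "left" else -s * p

rng = np.random.default_rng(0)
theta = rng.standard_normal(4)
s = rng.standard_normal(4)
eps = 1e-6

for action in ("left", "right"):
    def log_prob(th):
        p = pi_left(th, s)
        return np.log(p) if action == "left" else np.log(1.0 - p)
    # Central finite differences along each coordinate of theta.
    numeric = np.array([(log_prob(theta + eps * e) - log_prob(theta - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
    assert np.allclose(numeric, grad_log_pi(theta, s, action), atol=1e-5)
print("gradients check out")
```

If the analytic and numerical gradients disagree, one of the derivation steps above has a sign or algebra error.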
Below is the complete REINFORCE algorithm implemented from scratch in MATLAB.
% REINFORCE applied to the discrete cartpole environment
close all; clear; clc
% Seed the random number generator for reproducibility. I will be
% initializing the weights to zero, so it does not really matter in this
% case, but I included it for completeness.
rng(0)
% Create the environment.
env = rlPredefinedEnv('CartPoleDiscrete');
% Get the dimension of the observations, in this case, 4. This is not to be
% confused with the number of observations per time step, which is 1.
dimObs = env.getObservationInfo.Dimension(1);
% Get the dimension of the actions, in this case, 1. This is not to be
% confused with the number of actions to choose from, which is 2.
dimAct = env.getActionInfo.Dimension(1);
% Initialize the policy parameters to zero. They can also be initialized
% randomly from a normal distribution, or whatever else you would like, but
% I like zero.
theta = zeros(dimObs,1);
% Define the policy. This takes a 4x1 observation vector as input and
% outputs the probability of picking the "left" action. The "right" action
% is 1 - probability(left), so we do not need to define it explicitly.
pi_theta_left = @(theta,s) 1 / (1 + exp(-theta' * s(:)));
% Define the gradients of the log of the policy. This outputs a 2x1 cell
% array of the grad-log-probabilities of the actions.
grad_pi_theta = @(theta,s) {s * (1 - pi_theta_left(theta,s)); -s * pi_theta_left(theta,s)};
% Set the maximum number of episodes for which to train.
maxEpisodes = 1000;
% Set the maximum number of time steps per episode. When the agent gets
% good at balancing the pole, it may never fail, so we need a way to
% terminate the episode after a certain time.
maxStepsPerEpisode = 500;
% Set the learning rate.
learningRate = 0.001;
% Set the discount factor.
discountFactor = 0.99;
% Initialize a vector to store the total time steps that the agent lasts
% per episode.
totalStepsPerEpisode = zeros(1,maxEpisodes);
% Train the agent.
for episode = 1:maxEpisodes
    % Reset the environment.
    observation = env.reset;
    % Preinitialize a variable to store the trajectory of the current
    % episode.
    trajectory = cell(1,maxStepsPerEpisode);
    % Simulate an episode.
    for stepCt = 1:maxStepsPerEpisode
        % Sample an action.
        action = randsample(env.getActionInfo.Elements,1,true,...
            [pi_theta_left(theta,observation) 1 - pi_theta_left(theta,observation)]);
        % Apply the action and get the next observation, reward, and
        % whether the episode is over.
        [nextObservation,reward,isDone] = step(env,action);
        % Store the experience in the trajectory.
        trajectory{stepCt} = {observation action reward};
        % Update the environment.
        observation = nextObservation;
        if isDone
            break
        end
    end
    % Update the list of total steps per episode.
    totalStepsPerEpisode(episode) = stepCt;
    % Initialize the gradient estimate.
    gradientEstimate = 0;
    % Calculate the gradient estimate.
    for t = 1:stepCt
        % Get the observation and action at the current time step.
        observation = trajectory{t}{1};
        action = trajectory{t}{2};
        % Compute the discounted return from the current time step onward.
        discountedReturn = 0;
        for k = t:stepCt
            discountedReturn = discountedReturn + discountFactor^(k - t) * trajectory{k}{3};
        end
        % Compute the grad-log-probabilities.
        gradLogProbs = grad_pi_theta(theta,observation);
        % Get the grad-log-probability corresponding to the action taken.
        gradLogProb = gradLogProbs{action == env.getActionInfo.Elements};
        % Update the gradient estimate.
        gradientEstimate = gradientEstimate + gradLogProb * discountedReturn;
    end
    % Update the policy parameters.
    theta = theta + learningRate * gradientEstimate;
end
% Plot Total Steps per Episode.
plot(1:episode,totalStepsPerEpisode(1:episode))
xlabel('Episode')
ylabel('Total Steps')
title('Total Steps per Episode')
grid on
grid minor
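As an aside, the inner loop above recomputes the discounted return from scratch at every time step, which is O(T²) per episode. The same reward-to-go values can be computed in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. A small Python sketch of the idea (the function name is mine; it is not part of the MATLAB code above):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    # Reward-to-go for every time step via the backward recursion
    # G_t = r_t + gamma * G_{t+1}, with G past the final step equal to 0.
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Same values as the double loop: sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}
# e.g. for rewards [1, 1, 1], G_0 = 1 + 0.99 + 0.99^2 = 2.9701
print(discounted_returns([1.0, 1.0, 1.0], 0.99))
```

For cartpole episodes capped at 500 steps the savings are negligible, but the backward pass is the standard trick for longer horizons.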
Results
Below are the training results for the above algorithm:
We can see that the agent does indeed learn, which is pretty impressive given the relatively few lines of code needed to implement the algorithm. However, the training is not very stable, especially around episode 500. This can be improved by reducing the variance of the returns, and we will address this in my next post. But before that, I want to discuss a few other observations I had.
First, it is interesting to see the effect that the initialization of the weights has on the training. In my code above, I initialized all the weights to be zero. I tried to initialize them randomly from a normal distribution instead, and below are the training results:
We can see that with random weight initialization, the agent trains much slower, although the end result appears to be the same: a trained agent with very high variance.
The second thing I would like to discuss is a question that I, and potentially others studying policy gradients, have had. Recall that we estimate the gradient according to
$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t'} r_{t'}\right]$
which requires a summation over all time steps of the episode. This means we can only do one policy update per episode. However, I noticed that some sources, including Sutton & Barto, do one policy update per time step:
I tried to look into any mathematical reasoning for why this is equivalent, but I could not find anything. Hado van Hasselt of DeepMind seems to briefly talk about this in his lecture at 58:45, and it seems like either approach is valid. Below is the modified training loop with a policy update at every time step:
% Calculate the gradient estimate.
for t = 1:stepCt
    % Get the observation and action at the current time step.
    observation = trajectory{t}{1};
    action = trajectory{t}{2};
    % Compute the discounted return from the current time step onward.
    discountedReturn = 0;
    for k = t:stepCt
        discountedReturn = discountedReturn + discountFactor^(k - t) * trajectory{k}{3};
    end
    % Compute the grad-log-probabilities.
    gradLogProbs = grad_pi_theta(theta,observation);
    % Get the grad-log-probability corresponding to the action taken.
    gradLogProb = gradLogProbs{action == env.getActionInfo.Elements};
    % Compute the gradient estimate.
    gradientEstimate = gradLogProb * discountedReturn;
    % Update the policy parameters.
    theta = theta + learningRate * discountFactor^(t - 1) * gradientEstimate;
end
It is very similar to the first implementation. The main difference is that I avoid the summation over all time steps and update $\theta$ at every step of the for-loop instead of after it. Below are the training results:
The results are clearly different from those of the first implementation, but the agent still trains nonetheless, so it seems like either approach is acceptable, although I am not sure which one is preferred. Feel free to share any information you may have on this in the comments.
That is all for this post. As mentioned earlier, I will discuss a way to reduce the variance of the rewards and improve the training in my next post. À plus tard!
Updates
1 - 11/30/2020
I looked more into why some sources use a single batch update while others use an update at every time step. It seems that it depends on how $J\left(\pi_\theta\right)$ is defined. In OpenAI’s derivation of the policy gradient theorem, and the one I followed, we have
$J\left(\pi_\theta\right) = \mathbb{E} \left[G\left(\tau_\theta\right)\right]$
But in Sutton & Barto, we have
$J\left(\pi_\theta\right) = \mathbb{E} \left[V\left(s_0\right)\right]$
where $V\left(s_0\right)$ is the value function evaluated at the initial state $s_0$. This results in $\nabla_\theta J\left(\pi_\theta\right)$ being approximated according to
$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\nabla_\theta \log \pi_\theta\left(a_t \vert s_t\right) G_t\right]$
You can find the complete proof in their textbook under section 13.2 The Policy Gradient Theorem.
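For what it is worth, the two forms can be lined up term by term. Writing out the discounting in the Sutton & Barto formulation explicitly (my own restatement, so take it with a grain of salt):

$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \gamma^t \nabla_\theta \log \pi_\theta\left(a_t \vert s_t\right) G_t\right], \qquad G_t = \sum_{t' = t}^T \gamma^{t' - t} r_{t'}$

Since $\gamma^t G_t = \sum_{t' = t}^T \gamma^{t'} r_{t'}$, this is the same expectation as the OpenAI form above; the $\gamma^t$ factor is exactly the `discountFactor^(t - 1)` term in the per-step update loop (with MATLAB's 1-based indexing).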
2 - 11/30/2020
Pages 99-104 of these notes contain a detailed proof of the policy gradient theorem and an explanation for the difference in the formulation of the expectation. I will stick to the form that OpenAI uses.