REINFORCE

In my last post, I derived the policy gradient theorem which allows us to approximate a gradient that we can use to shift our policy in a direction that will yield higher returns. In this post, we will apply that theorem with the simplest policy gradient algorithm - REINFORCE.

Implementation

On a high level, the REINFORCE algorithm works as follows:

  1. Sample $N$ trajectories. Note that in my implementation below, I assume $N = 1$ for simplicity (a sketch of the general $N$-trajectory estimate follows this list).

  2. For each trajectory, multiply the gradient of the log-probability of each action taken by the finite-horizon discounted return from that time step onward, and sum these products over all time steps.

  3. Add all of the per-trajectory sums and divide by $N$ to get the gradient estimate:

    $$\nabla_\theta J\left(\pi_\theta\right) \approx \frac{1}{N} \sum_{k = 1}^N \sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t' - t} r_{t'}$$

  4. Update the policy parameters, $\theta$, according to

    $$\theta = \theta + \alpha \nabla_\theta J\left(\pi_\theta\right)$$
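
To make the batching in steps 1 and 3 concrete, here is a minimal sketch of the general $N$-trajectory estimate. It reuses the variables (theta, env, discountFactor, learningRate) defined in the full implementation further below, and sampleTrajectory and gradLogPi are hypothetical helpers standing in for the rollout and grad-log-probability code shown there:

% Sketch of a single update using N trajectories (steps 1-4 above).
% sampleTrajectory and gradLogPi are hypothetical helpers; rewards is
% assumed to be a 1-by-T numeric vector, states a cell array of states.
N = 10;                                 % number of trajectories per update
gradientEstimate = zeros(size(theta));
for n = 1:N
    % Step 1: sample a trajectory with the current policy.
    [states, actions, rewards] = sampleTrajectory(env, theta);
    T = numel(rewards);
    for t = 1:T
        % Step 2: discounted return from time step t onward, times the
        % grad-log-probability of the action taken at time step t.
        discountedReturn = sum(discountFactor.^(0:T - t) .* rewards(t:T));
        gradientEstimate = gradientEstimate + ...
            gradLogPi(theta, states{t}, actions(t)) * discountedReturn;
    end
end
% Step 3: average over the N trajectories.
gradientEstimate = gradientEstimate / N;
% Step 4: take a gradient ascent step.
theta = theta + learningRate * gradientEstimate;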

Now we will apply this to cart-pole. Recall the policy we defined in my last post:

$$
\begin{aligned}
\pi_{\theta} \left(\text{left} \vert s_t\right) &= \frac{1}{1 + e^{-z\left(\theta,s_t\right)}} &\quad \text{probability of pushing left given the current state, } s_t \\
\pi_{\theta} \left(\text{right} \vert s_t\right) &= 1 - \pi_{\theta} \left(\text{left} \vert s_t\right) &\quad \text{probability of pushing right given the current state, } s_t
\end{aligned}
$$

First, we need to calculate the gradients of the log-probability of this policy:

$$
\begin{aligned}
\nabla_\theta \log \pi_\theta \left(\text{left} \vert s_t \right) &= \nabla_\theta \left(\log \frac{1}{1 + e^{-z\left(\theta,s_t\right)}}\right) \\
&= \nabla_\theta \left(\log \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1}\right) \\
&= \nabla_\theta \left(\log e^{z\left(\theta,s_t\right)} - \log \left(e^{z\left(\theta,s_t\right)} + 1\right)\right) \\
&= \nabla_\theta \left(z\left(\theta,s_t\right) - \log \left(e^{z\left(\theta,s_t\right)} + 1\right)\right) \\
&= s_t - s_t \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1} \\
&= s_t \left(1 - \pi_\theta \left(\text{left} \vert s_t \right)\right) \\ \\
\nabla_\theta \log \pi_\theta \left(\text{right} \vert s_t \right) &= \nabla_\theta \log \left(1 - \pi_\theta \left(\text{left} \vert s_t \right) \right) \\
&= \nabla_\theta \log \left(\frac{e^{-z\left(\theta,s_t\right)}}{1 + e^{-z\left(\theta,s_t\right)}}\right) \\
&= \nabla_\theta \log \left(\frac{1}{e^{z\left(\theta,s_t\right)} + 1}\right) \\
&= -\nabla_\theta \log \left(e^{z\left(\theta,s_t\right)} + 1\right) \\
&= -s_t \frac{e^{z\left(\theta,s_t\right)}}{e^{z\left(\theta,s_t\right)} + 1} \\
&= -s_t \pi_\theta \left(\text{left} \vert s_t \right)
\end{aligned}
$$

Note that the gradient of $z\left(\theta,s_t\right)$ is simply $s_t$ because $z\left(\theta,s_t\right)$ can be rewritten as

$$z\left(\theta,s_t\right) = \theta^T s_t$$

where $\theta$ and $s_t$ are $4 \times 1$ column vectors. In this case, we are able to calculate the gradient of the policy directly because we defined it with the simple expressions above, but if we were using a neural network, we would need to rely on automatic differentiation to compute the gradient. An example of this can be found here.
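
One quick way to sanity-check the derivation is to compare the analytic grad-log-probability for the "left" action against a central finite-difference approximation at a random state. The snippet below is a standalone sketch; the function and variable names are illustrative and not part of the implementation that follows:

% Sanity check: analytic grad-log-probability vs. central finite differences.
z = @(theta,s) theta' * s(:);
pi_left = @(theta,s) 1 / (1 + exp(-z(theta,s)));
grad_log_pi_left = @(theta,s) s(:) * (1 - pi_left(theta,s));

theta = randn(4,1);     % arbitrary policy parameters
s = randn(4,1);         % arbitrary state
h = 1e-6;               % finite-difference step size
fdGrad = zeros(4,1);
for i = 1:4
    e = zeros(4,1); e(i) = h;
    fdGrad(i) = (log(pi_left(theta + e,s)) - log(pi_left(theta - e,s))) / (2 * h);
end
maxAbsError = max(abs(fdGrad - grad_log_pi_left(theta,s)))  % should be on the order of 1e-9 or smaller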

Below is the complete REINFORCE algorithm implemented from scratch in MATLAB.

% REINFORCE applied to the discrete cart-pole environment
close all; clear; clc

% Seed the random number generator for reproducibility. I will be
% initializing the weights to be zero, so it does not really matter in this
% case, but I included it for completeness.
rng(0)

% Create the environment.
env = rlPredefinedEnv('CartPole-Discrete');

% Get the dimension of the observations, in this case, 4. This is not to be
% confused with the number of observations per time step, which is 1.
dimObs = env.getObservationInfo.Dimension(1);

% Get the dimension of the actions, in this case, 1. This is not to be
% confused with the number of actions to choose from, which is 2.
dimAct = env.getActionInfo.Dimension(1);

% Initialize the policy parameters to zero. They can also be initialized
% randomly from a normal distribution, or whatever else you would like, but
% I like zero.
theta = zeros(dimObs,1);

% Define the policy. This takes a 4x1 observation vector as input and
% outputs the probability of picking the "left" action. The "right" action 
% is 1 - probability(left), so we do not need to define it explicitly.
pi_theta_left = @(theta,s) 1 / (1 + exp(-theta' * s(:)));

% Define the gradients of the log of the policy. This outputs a 2x1 cell
% array of the grad-log-probabilities of the actions.
grad_pi_theta = @(theta,s) {s * (1 - pi_theta_left(theta,s)); -s * pi_theta_left(theta,s)};

% Set the maximum number of episodes for which to train.
maxEpisodes = 1000;

% Set the maximum number of time steps per episode. When the agent gets
% good at balancing the pole, it may never fail, so we need a way to
% terminate the episode after a certain time.
maxStepsPerEpisode = 500;

% Set the learning rate.
learningRate = 0.001;

% Set the discount factor.
discountFactor = 0.99;

% Initialize a vector to store the total time steps that the agent lasts
% per episode.
totalStepsPerEpisode = zeros(1,maxEpisodes);

% Train the agent.
for episode = 1:maxEpisodes
    % Reset the environment.
    observation = env.reset;
    
    % Pre-initialize a variable to store the trajectory of the current
    % episode.
    trajectory = cell(1,maxStepsPerEpisode);
    
    % Simulate an episode.
    for stepCt = 1:maxStepsPerEpisode
        % Sample an action.
        action = randsample(env.getActionInfo.Elements,1,true,...
            [pi_theta_left(theta,observation) 1 - pi_theta_left(theta,observation)]);
        
        % Apply the action and get the next observation, reward, and
        % whether the episode is over.
        [nextObservation,reward,isDone] = step(env,action);
        
        % Store the experience in the trajectory.
        trajectory{stepCt} = {observation action reward};
        
        % Advance to the next observation.
        observation = nextObservation;
        
        if isDone
            break
        end
    end
    
    % Update the list of total steps per episode.
    totalStepsPerEpisode(episode) = stepCt;
    
    % Initialize the gradient estimate.
    gradientEstimate = 0;
    
    % Calculate the gradient estimate.
    for t = 1:stepCt
        % Get the observation and action at the current time step.
        observation = trajectory{t}{1};
        action = trajectory{t}{2};
        
        % Compute the finite-horizon discounted return from time step t
        % onward, using the reward stored at each later time step.
        discountedReturn = 0;
        for k = t:stepCt
            discountedReturn = discountedReturn + discountFactor^(k - t) * trajectory{k}{3};
        end
        
        % Compute the grad-log-probabilities.
        gradLogProbs = grad_pi_theta(theta,observation);
        
        % Get the grad-log-probability corresponding to the action
        % taken.
        gradLogProb = gradLogProbs{action == env.getActionInfo.Elements};
        
        % Update the gradient estimate.
        gradientEstimate = gradientEstimate + gradLogProb * discountedReturn;
    end
    
    % Update the policy parameters.
    theta = theta + learningRate * gradientEstimate;
end

% Plot Total Steps per Episode.
plot(1:episode,totalStepsPerEpisode(1:episode))
xlabel('Episode')
ylabel('Total Steps')
title('Total Steps per Episode')
grid on
grid minor

Results

Below are the training results for the above algorithm:

We can see that the agent does indeed learn, which is pretty impressive given the relatively few lines of code needed to implement the algorithm. However, the training is not very stable, especially around episode 500. This can be improved by reducing the variance in the rewards, and we will address this in my next post. But before that, I want to discuss a few other observations I had.

First, it is interesting to see the effect that the initialization of the weights has on the training. In my code above, I initialized all the weights to be zero. I tried to initialize them randomly from a normal distribution instead, and below are the training results:

With random weight initialization, the agent trains much more slowly, although the end result appears to be the same: a trained agent whose performance still has very high variance.
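
For reference, this is a one-line change to the setup above; the scale factor below is just an illustrative guess, not necessarily what produced the plot:

% Initialize the policy parameters randomly from a normal distribution
% instead of zeros. The 0.1 scale is an arbitrary, illustrative choice.
theta = 0.1 * randn(dimObs,1);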

The second thing I would like to discuss is a question that I, and potentially others studying policy gradients, have had. Recall that we estimate the gradient according to

$$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\sum_{t = 0}^T \nabla_\theta \log \pi_\theta \left(a_t \vert s_t \right) \sum_{t' = t}^T \gamma^{t'} r_{t'}\right]$$

which requires a summation over all time steps of the episode. This means we can only do one policy update per episode. However, I noticed that some sources, including Sutton & Barto, do one policy update per time step:

REINFORCE algorithm with discounted rewards – where does gamma^t in the update come from? - Data Science Stack Exchange

I looked for a mathematical justification for why these are equivalent, but I could not find anything definitive. Hado van Hasselt of DeepMind briefly touches on this in his lecture at 58:45, and it seems like either approach is valid. Below is the modified training loop with a policy update at every time step:

% Calculate the gradient estimate.
for t = 1:stepCt
    % Get the observation and action at the current time step.
    observation = trajectory{t}{1};
    action = trajectory{t}{2};
    
    % Compute the finite-horizon discounted return from time step t
    % onward, using the reward stored at each later time step.
    discountedReturn = 0;
    for k = t:stepCt
        discountedReturn = discountedReturn + discountFactor^(k - t) * trajectory{k}{3};
    end
    
    % Compute the grad-log-probabilities.
    gradLogProbs = grad_pi_theta(theta,observation);
    
    % Get the grad-log-probability corresponding to the action
    % taken.
    gradLogProb = gradLogProbs{action == env.getActionInfo.Elements};
    
    % Compute the gradient estimate.
    gradientEstimate = gradLogProb * discountedReturn;
    
    % Update the policy parameters.
    theta = theta + learningRate * discountFactor^(t - 1) * gradientEstimate;
end

It is very similar to the first implementation. The main difference is that I avoid the summation over all time steps and update $\theta$ at every step of the for-loop instead of after the for-loop. Below are the training results:

The results are clearly different from those of the first implementation, but the agent still trains, so it seems that either approach is acceptable, although I am not sure which one is preferred. Feel free to share any information you may have on this in the comments.

That is all for this post. As mentioned earlier, I will discuss a way to reduce the variance of the rewards and improve the training in my next post. À plus tard!

Updates

1 - 11/30/2020

I looked more into why some sources use a single batch update while others update at every time step. It seems that it depends on how $J\left(\pi_\theta\right)$ is defined. In OpenAI’s derivation of the policy gradient theorem, which is the one I followed, we have

$$J\left(\pi_\theta\right) = \mathbb{E} \left[G\left(\tau_\theta\right)\right]$$

But in Sutton & Barto, we have

$$J\left(\pi_\theta\right) = \mathbb{E} \left[V\left(s_0\right)\right]$$

where $V\left(s_0\right)$ is the value function evaluated at the initial state $s_0$. This results in $\nabla_\theta J\left(\pi_\theta\right)$ being approximated according to

$$\nabla_\theta J\left(\pi_\theta\right) = \mathbb{E} \left[\nabla_\theta \log \pi_\theta\left(a_t \vert s_t\right) G_t\right]$$

You can find the complete proof in their textbook under section 13.2 The Policy Gradient Theorem.
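
For reference, the per-time-step update in their pseudocode, as I understand it, includes an extra $\gamma^t$ factor, which is where the discountFactor^(t - 1) term in the modified loop above comes from (the exponent is t - 1 because MATLAB indexes time from 1):

$$\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta\left(a_t \vert s_t\right)$$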

2 - 11/30/2020

Pages 99 - 104 of these notes contain a detailed proof of the policy gradient theorem and an explanation for the difference in the formulation of the expectation. I will stick to the form that OpenAI uses.


