In my last post, we introduced a function approximator to estimate the value function of each state. With this, we were able to subtract a state-dependent baseline from each return which helped reduce the variance. But that method required us to sample an entire trajectory before we could perform any policy updates. In this post, we will look at a method that will allow us to perform policy updates online during the episode.


The main change we will make is that instead of sampling an entire trajectory, we will sample NN steps of an episode and estimate the return using the approximate value function. Mathematically, this means that instead of sampling an entire trajectory and computing the total return as

Gt=t=0TγtrtG_t = \sum_{t = 0}^T \gamma^t r_t

we will sample NN steps of a trajectory and approximate the total return as

Gtt=0N1γtrt+γNV(sN,w)G_t \approx \sum_{t = 0}^{N - 1} \gamma^t r_t + \gamma^N V\left(s_{N},w\right)

We can do this because the value function evaluated at a state provides us with the return we expect to receive by following the same policy after leaving that state until the end of the episode. So instead of collecting every reward after time t+N1t + N - 1 to calculate the return, we simply bootstrap using the value function. This is where the name “actor-critic” comes from - the actor receives a state and outputs probabilities of actions, and the critic receives a state and outputs the value of the state. These work together to improve the policy.

As before, we will also use the value function as a baseline to help reduce the variance, so instead of using GtG_t in our policy update, we will use

At=GtV(st,w)A_t = G_t - V\left(s_{t},w\right)

where AtA_t is the advantage. In my last post, I denoted it with δ\delta because I was following Sutton & Barto’s notation, but it seems that AtA_t is what the cool kids call it these days. The idea of the advantage function also makes sense intuitively. It will update the policy only if the action we took is better or worse than expected. For example, if we take an action and get a reward of 11, but the state value is also 11, then we should not update the policy since otherwise we will be increasing or decreasing the likelihood of an action which we know to be bad or good. And this is reflected in the calculation because we would have At=11=0A_t = 1 - 1 = 0.

Using bootstrapping will allow us to update the policy online during the episode and reduce the variance, but it will also increase the bias. This should also allow for faster training since the agent will be learning as it performs. One big benefit of this method is that it can also be applied to continuous environments. The cart-pole environment is episodic, so we are able to sample entire trajectories and compute the returns. But we cannot do this with continuous episodes, so bootstrapping will help us get around that.


Below is the Actor-Critic code:

% Actor-Critic applied to the discrete cart-pole environment
close all; clear; clc

% Seed the random number generator for reproducibility. I will be
% initializing the weights to be zero, so it does not really matter in this
% case, but I included it for completion.

% Create the environment.
env = rlPredefinedEnv('CartPole-Discrete');

% Get the dimension of the observations, in this case, 4. This is not to be
% confused with the number of observations per time step, which is 1.
dimObs = env.getObservationInfo.Dimension(1);

% Get the dimension of the actions, in this case, 1. This is not to be
% confused with the number of actions to choose from, which is 2.
dimAct = env.getActionInfo.Dimension(1);

% Initialize the policy and value function parameters to zero. They can
% also be initialized randomly from a normal distribution, or whatever else
% you would like, but I like zero.
theta_policy = zeros(dimObs,1);
theta_valueFcn = zeros(dimObs,1);

% Define the policy. This takes a 4x1 observation vector as input and
% outputs the probability of picking the "left" action. The "right" action
% is 1 - probability(left), so we do not need to define it explicitly.
pi_theta_left = @(theta,s) 1 / (1 + exp(-theta' * s(:)));

% Define the gradients of the log of the policy. This outputs a 2x1 cell
% array of the grad-log-probabilities of the actions.
grad_pi_theta = @(theta,s) {s * (1 - pi_theta_left(theta,s)); -s * pi_theta_left(theta,s)};

% Set the maximum number of episodes for which to train.
maxEpisodes = 500;

% Set the maximum number of time steps per episode. When the agent gets
% good at balancing the pole, it may never fail, so we need a way to
% terminate the episode after a certain time.
maxStepsPerEpisode = 500;

% Set the learning rates.
learningRate_policy = 0.001;
learningRate_valueFcn = 0.01;

% Set the discount factor.
discountFactor = 0.99;

% Set the number of time steps to look ahead.
N = 50;

% Initialize a vector to store the total time steps that the agent lasts
% per episode.
totalStepsPerEpisode = zeros(1,maxStepsPerEpisode);

% Train the agent.
for episode = 1:maxEpisodes
    % Reset the environment.
    observation = env.reset;
    % Initialize a variable to store the trajectory of the current episode.
    trajectory = cell(1,maxStepsPerEpisode);
    % Simulate an episode.
    for stepCt = 1:maxStepsPerEpisode
        % Sample an action.
        action = randsample(env.getActionInfo.Elements,1,true,...
            [pi_theta_left(theta_policy,observation) 1 - pi_theta_left(theta_policy,observation)]);
        % Apply the action and get the next observation, reward, and
        % whether the episode is over.
        [nextObservation,reward,isDone] = step(env,action);
        % Store the experience in the trajectory.
        trajectory{stepCt} = {observation action reward};
        % Update if there are enough samples.
        if stepCt > N
            % Initialize the gradient estimates.
            gradientEstimate_policy = 0;
            gradientEstimate_valueFcn = 0;
            % Calculate the gradient estimates.
            for t = stepCt - N:stepCt
                % Get the observation, action, and reward at the current 
                % time step.
                observation = trajectory{t}{1};
                action = trajectory{t}{2};
                reward = trajectory{t}{3};
                % Compute the discounted reward up until t + N.
                discountedReward = 0;
                for k = t:stepCt
                    discountedReward = discountedReward + discountFactor^(k - t - 1) * reward;
                % Compute the total discounted return with boostrapping.
                discountedReturn = discountedReward + discountFactor^N * theta_valueFcn' * nextObservation * ~isDone;
                % Compute the grad-log-probabilities.
                gradLogProbs = grad_pi_theta(theta_policy,observation);
                % Get the grad-log-probability corresponding to the action
                % taken.
                gradLogProb = gradLogProbs{action == env.getActionInfo.Elements};
                % Calculate the advantage.
                advantage = discountedReturn - theta_valueFcn' * observation;
                % Update the gradient estimates.
                gradientEstimate_policy = gradientEstimate_policy + gradLogProb * advantage;
                gradientEstimate_valueFcn = gradientEstimate_valueFcn + advantage * observation;
            % Update the policy and value function parameters.
            theta_policy = theta_policy + learningRate_policy * discountFactor^(stepCt - N - 1) * gradientEstimate_policy;
            theta_valueFcn = theta_valueFcn + learningRate_valueFcn * gradientEstimate_valueFcn / N;
        % Update the environment.
        observation = nextObservation;
        if isDone
    % Update the list of total steps per episode.
    totalStepsPerEpisode(episode) = stepCt;

% Plot Total Steps per Episode.
ylabel('Total Steps')
title('Total Steps per Episode')
grid on
grid minor


‌Below are the training results for 500 episodes:

You can see that the training is quicker and more stable than the previous methods. The downside, however, is that we are continuously updating the policy with a nested for-loop during the episode, so it takes longer to run an episode and finish training. This might be noticeable for cart-pole which is a relatively simple environment that does not require much training. For more difficult environments, the tradeoff might be worth it. But this is assuming that I implemented the algorithm correctly and there is indeed a tradeoff. If not, oops. Please let me know of any bugs or mistakes in the comments.

I also ran the algorithm for 1000 episodes just to see what would happen. Below are the results:

Sure enough, the agent seems to “un-learn” after a while. The reason I say “sure enough” is because I have seen this happen before with more advanced algorithms that are implemented by people who are smarter than me. So I am willing to give myself the benefit of the doubt on this one and say that the disaster you see around episode 550 and onward is not my fault.


1 - 12/2/2020

I realized that I made a mistake in my original implementation. I calculated the discounted return as

discountedReturn = discountedReward + discountFactor^N * theta_valueFcn' * nextObservation * ~isDone;

‌but I believe it should be

discountedReturn = discountedReward + discountFactor^(stepCt - t + 1) * theta_valueFcn' * nextObservation * ~isDone;

The difference is that the discount factor exponent is adjusted based on the current time step in the updated code. Below are the training results with this modification. I also set N = 64 instead of 50 and learningRate_valueFcn = 0.001 instead of 0.01.

It trains faster than the first implementation, but it is also unstable. These algorithms are difficult to debug, so I am not sure if there are any more bugs or if this is just a matter of hyperparameter tuning. Let me know in the comments if you see any other issues with my code.

2 - 12/5/2020

I found another, rather serious, bug. My original code that involves the calculation of the discounted reward is

% Calculate the gradient estimates.
for t = stepCt - N:stepCt
    % Get the observation, action, and reward at the current
    % time step.
    observation = trajectory{t}{1};
    action = trajectory{t}{2};
    reward = trajectory{t}{3};
    % Compute the discounted reward up until t + N.
    discountedReward = 0;
    for k = t:stepCt
        discountedReward = discountedReward + discountFactor^(k - t - 1) * reward;

Notice that the discounted reward calculation uses the same value in reward rather than updating it at each time step k. It might help if I change it to

% Calculate the gradient estimates.
for t = stepCt - N:stepCt
    % Get the observation and action at the current time step.
    observation = trajectory{t}{1};
    action = trajectory{t}{2};
    % Compute the discounted reward.
    discountedReward = 0;
    for k = t:stepCt
        % Get the reward at the current time step, k.
        reward = trajectory{k}{3};
        % Update the discounted reward.
        discountedReward = discountedReward + discountFactor^(k - t) * reward;

The training results are below:

It looks somewhat better, but the very fact that I have updated my code twice now and the agent has successfully trained in each revision shows how difficult it can be to debug these algorithms.