Monte Carlo Methods


A Baby Robot's Guide To Reinforcement Learning

All images by author

Introduction

Once again we're off to the casino, and this time it's situated in sunny Monte Carlo, made famous by its appearance in the classic movie _Madagascar 3: Europe's Most Wanted_ (although there's a slight chance that it was already famous).

In our last visit to a casino we looked at the multi-armed bandit and used this as a way to visualise the problem of how to choose the best action when confronted with many possible actions.

In terms of Reinforcement Learning the bandit problem can be thought of as representing a single state and the actions available within that state. Monte Carlo methods extend this idea to cover multiple, interrelated states.

Additionally, in the previous problems we've looked at, we've always been given a full model of the environment. This model defines both the transition probabilities, that describe the chances of moving from one state to the next, and the reward received for making this transition.

In Monte Carlo methods this isn't the case. No model is given and instead the agent must discover the properties of the environment through exploration, gathering information as it moves from one state to the next. In other words, Monte Carlo methods learn from experience.


The examples in this article make use of the custom Baby Robot Gym Environment and all of the related code for this article can be found on Github.

Additionally, an interactive version of this article can be found in notebook form, where you can actually run all of the code snippets described below.

All of the previous articles in this series can be found here: A Baby Robot's Guide To Reinforcement Learning.

And, for a quick recap of the theory and terminology used in this article, check out State Values and Policy Evaluation in 5 minutes.


Monte Carlo Prediction

In the prediction problem we want to find how good it is to be in a particular state of the environment. This "goodness" is represented by the state value, which is defined as the expected reward that can be obtained when starting in that state and then following the current policy for all subsequent states.

When we have full knowledge about the environment, and know the transition probabilities and rewards, we can simply use Dynamic Programming to iteratively calculate the value for each state.

In practice, it's unlikely that a system's transition probabilities are known in advance. Therefore, to estimate how likely it is to move from one state to another, it's possible to observe multiple episodes and then take the average. This approach, of taking random samples to calculate estimates, is known as Monte Carlo Sampling.
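As a minimal illustration of this idea (independent of Baby Robot's environment), the snippet below estimates the expected value of a die roll simply by averaging random samples; the estimate gets closer to the true value of 3.5 as the number of samples grows:

import numpy as np

# estimate the expected value of a die roll by Monte Carlo sampling
rng = np.random.default_rng(seed=42)

for num_samples in (10, 100, 10_000):
    samples = rng.integers(1, 7, size=num_samples)   # random rolls in the range 1..6
    print(f"{num_samples:>6} samples: estimated mean = {samples.mean():.3f}")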


Consider the level, shown in figure 1 below, where Baby Robot currently finds himself:

Figure 1: A level containing a glass wall and the co-ordinates on this level.

At first glance this level appears to be rather simple, with a short path from the start of the level to the exit. However, there are 2 obstacles of note:

  • In the top-middle square (coordinate [1,0]) there is a large puddle. As we've seen before, Baby Robot doesn't like puddles. They take longer to move through, incurring a negative reward of -4, and can cause him to skid. When a skid occurs Baby Robot won't reach the target state. Normally this would result in him moving to one of the other possible states, but in this case there are no other possible states, so he'll stay exactly where he is and receive another -4 penalty. If Baby Robot moves into this puddle there's a good chance he'll become stuck for several time periods and receive a large negative reward. It would be best to avoid this puddle!

  • The thick blue line, between the cells [1,1] and [1,2], represents a glass wall. This is a new type of challenge that Baby Robot hasn't encountered before. Unlike standard walls, Baby Robot can't see glass walls and may therefore select an action that causes him to walk into the wall. When this happens he'll bounce off the wall and, rather than reaching the target state, end up in the opposite state. Also, he'll be given a negative reward penalty of -1 for the additional time required to make the move. In this level there are 2 possible opportunities for walking into the glass wall:

  • if he moves South from cell [1,1] he'll instead end up in the puddle at [1,0] and receive a reward of -5 (-4 for moving into a puddle and -1 for hitting the glass wall).
  • if he moves North from [1,2], instead of arriving at [1,1] he'll actually bounce off the wall and end up at the exit. In this case he'll be given a reward of -2 (-1 wall penalty and -1 for moving to a dry square).

As mentioned above, when we have complete information about a system, and know all of its transition probabilities and rewards, we can use Policy Evaluation to calculate the state values.

For this environment with a stochastic policy, that just chooses randomly between the available actions in each state, Policy Evaluation gives the following state values:

Figure 2: State values calculated using Policy Evaluation for a stochastic policy.

Each of these state values represents the expected return from that state. So, starting in state 's' at time 't', the value of the state under policy 'π' is given by:

Equation 1: The state value function under policy π
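In standard notation (as in Sutton & Barto, 2018), this is simply the expected return when starting from state 's' and following policy 'π':

$$ v_{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right] $$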

Where the return Gₜ, the total amount of reward accumulated over an episode, starting at time ‘t', is given by:

Equation 2: The discount return at time ‘t'
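In standard form, with discount factor γ, this is:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$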

With a stochastic policy, for this simple level, we know that ultimately Baby Robot will reach the exit and the episode will terminate. Therefore we can set the discount factor 'γ' to 1. Under these conditions the state value gives the average future total reward that can be expected when starting in that state.

In other words, the value of a state is the total of all future rewards obtained from that state, until completion of an episode, averaged over infinite episodes.

Therefore, to get a simple estimate of the state value, we can take the average return over multiple episodes that start in that state. The more episodes we run, the better our estimate will be. This is the exact approach taken by Monte Carlo methods.

Figure 3: A sample episode with a trajectory that goes directly from the start state to the exit.

For example, from the start state, if Baby Robot was lucky enough to follow a path that took him directly to the exit (as shown in figure 3 above), his trajectory would have the following states and rewards:

[([1,1], -1), ([2,1], -1), ([2,2], -1), ([1,2], -1), ([1,3], -1)]

The moves on this trajectory are:

[0,1]->[1,1] - reward = -1
[1,1]->[2,1] - reward = -1
[2,1]->[2,2] - reward = -1
[2,2]->[1,2] - reward = -1
[1,2]-> Exit - reward = -1

So, for this episode, the return value is -5. If we then ran the experiment again, with a stochastic policy that produces a different, random, path through the level, we'd get a different return value. By adding the returns from many such episodes and taking the average we'd get an estimate of the state value for the start state [0,1]. The more episodes we run, the closer this estimate would be to the true state value.

For the initial state on this level, if we run multiple episodes and take the average return, we get an estimated state value of approximately -102, which matches the state value calculated using Policy Evaluation.

However, unlike Policy Evaluation, we've now calculated this state value purely by exploration, without knowing the model of the environment. Additionally, by following a random trajectory until termination of the episode, we've actually just performed our first Monte Carlo search.

To see this in action, check out the accompanying Jupyter notebook: Part 4 – Monte Carlo Methods.ipynb


Calculating the return values for other states on the trajectory

In the above example we saw that, by repeatedly running episodes from the start of the level, we could produce an estimate of the state value for that initial state. However, each episode passes through many states before finally reaching the terminal state, yet we only considered the return obtained from the start state and ignored all the other information gathered during the episode. This is very inefficient.

A better approach is to calculate the return value for every state that's seen during the episode. Then, over multiple random episodes, we can create an estimate for every state value that's been visited.

As you can see for the sample trajectory given above (figure 3), on his way to the exit, Baby Robot also visits the states [1,1], [2,1], [2,2] and [1,2]. So, during this episode, we can also gather information for all of these states. For each state visited, we can take the sum of the rewards obtained after that visit as the return for that state. This gives us an estimated return value for every state seen during the trajectory, not just the start state.

For this sample trajectory we get the following return values for the states that have been visited:

[0,1] - return = -5
[1,1] - return = -4
[2,1] - return = -3
[2,2] - return = -2
[1,2] - return = -1

If we then average these values over many episodes we'll get an increasingly accurate estimate of the state value for all states that have been visited.
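As a small sketch of how these per-state returns can be computed, the snippet below sums the step rewards of the figure 3 trajectory in reverse and reproduces the values listed above:

# rewards received on each step of the figure 3 trajectory,
# in the order the steps were taken (every move costs -1)
step_rewards = [-1, -1, -1, -1, -1]

# the return for the state at each step is the sum of all rewards from
# that step until the end of the episode, so accumulate in reverse
returns = []
running_total = 0
for reward in reversed(step_rewards):
    running_total += reward
    returns.append(running_total)
returns.reverse()

print(returns)   # [-5, -4, -3, -2, -1]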


First-Visit Monte Carlo

When each state is visited only once during an episode, as in our example above, then it's easy to calculate the return for each state – you just add up all future rewards from that state. But what happens when a state is seen multiple times?

The simplest approach to calculate the value, for a state that can be visited repeatedly in a single episode, is to simply take the rewards obtained from the first time the state is visited until the end of the episode. When these returns are averaged over multiple episodes we can build up an estimate of the state value for every state that's been visited. Unsurprisingly, this approach is known as First-Visit Monte Carlo.

Example:

One possible path from the entrance to the exit of the sample level is shown in Figure 4 below:

Figure 4: A trajectory with repeated state visits

This path has the following steps and rewards:

1. [0,1]->[1,1] - reward = -1
2. [1,1]->[2,1] - reward = -1
3. [2,1]->[2,2] - reward = -1
4. [2,2]->[2,1] - reward = -1
5. [2,1]->[2,2] - reward = -1
6. [2,2]->[1,2] - reward = -1
7. [1,2]-> Exit - reward = -1

In this trajectory, rather than moving straight to the exit, on step 4 Baby Robot takes a step backwards to state [2,1]. After this he follows a direct path to the exit. As a result of this backwards step he visits states [2,1] and [2,2] twice.

With First-Visit Monte Carlo we sum the rewards obtained from the first time a state is seen until the end of the episode, to give an estimate of that state's value. When a state that's been seen before is re-visited, its reward is still used as part of the return, but we don't consider this to be a new trajectory from that state.

For this trajectory the first-visit return values are shown below. Note that we don't calculate returns from steps 5 and 6, for states that have already been seen:

1. [0,1] - return = -7
2. [1,1] - return = -6
3. [2,1] - return = -5
4. [2,2] - return = -4
5. [2,1]
6. [2,2]
7. [1,2] - return = -1
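The snippet below is a minimal sketch of this first-visit calculation for the figure 4 trajectory. It's a simplified, hypothetical stand-in for the helper functions used in the accompanying notebook, not the actual implementation:

# (state, reward) pairs for the figure 4 trajectory: the state each step
# was taken from and the reward received for making that step
trajectory = [((0, 1), -1), ((1, 1), -1), ((2, 1), -1), ((2, 2), -1),
              ((2, 1), -1), ((2, 2), -1), ((1, 2), -1)]

# total reward from each step until the end of the episode
returns = []
running_total = 0
for state, reward in reversed(trajectory):
    running_total += reward
    returns.append((state, running_total))
returns.reverse()

# keep only the return from the *first* visit to each state
first_visit_returns = {}
for state, ret in returns:
    if state not in first_visit_returns:
        first_visit_returns[state] = ret

print(first_visit_returns)
# {(0, 1): -7, (1, 1): -6, (2, 1): -5, (2, 2): -4, (1, 2): -1}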

The steps of the method we've described above are detailed in the following Python Pseudo-Code, for the First-Visit Monte Carlo algorithm:

''' calculate state values using First-Visit Monte Carlo '''

import numpy as np

# keep a count of the first-visits to each state
total_state_visits = np.zeros((env.height,env.width))                    ❶

# the total returns for each state
total_state_returns = np.zeros((env.height,env.width))

for episode in range(max_episodes):                                      ❷

  state_rewards = single_episode(env)                                    ❸
  state_returns = rewards_to_returns(state_rewards)                      ❹
  episode_returns,episode_visits = get_first_visit_return(state_returns) ❺

  # add the states that were visited to the total visits
  total_state_visits += episode_visits

  # add the return for each state visited to the total for each state
  total_state_returns += episode_returns

# average return per state (a state that was never visited gives 0/0 = NaN)
returns = (total_state_returns/total_state_visits)                       ❻

The points to note in the code above, for First-Visit Monte Carlo, are as follows:

  1. We want to keep track of the episode returns and the number of visits to each state, therefore we start by creating zeroed NumPy arrays for each of these. The array dimensions are set to the width and height of the environment.
  2. Run for multiple episodes to get the returns generated over multiple trajectories. Here the number of runs is defined by 'max_episodes'.
  3. Get the rewards for a single episode. This returns a list of all the states visited and the rewards received between the start and end of the episode.
  4. Convert the rewards into returns. This is done by simply summing the rewards in the reverse direction.
  5. Take the first-visit return value for each state and add it to the 'total_state_returns' array. Additionally, for every state visited, we increment the count of the total number of visits to that state (i.e. we count only the number of first-visits).
  6. Finally, after all episodes have completed, we divide the total first-visit returns by the total number of visits, to get the average return for each state.

Figure 5: First-Visit MC calculated state values and state first-visit count for 10000 episodes.

The state values, calculated using First-Visit Monte Carlo, running for 10000 episodes on this level, are shown above. Comparing these values to the ones we calculated previously using Policy Evaluation (see Figure 2), we can see that, for this very simple level, the values are identical.

This shows the power of Monte Carlo methods. By simply observing the results from a set of random episodes we've been able to calculate very good estimates of the state values and all without any knowledge of the underlying environment properties.


Sample Average Estimates

In the example above, to calculate the mean state values, we kept track of the total return and total number of first visits to each state. When all the episodes were complete the total return was divided by the number of visits to get the estimated state values. However, since these totals keep growing with every episode, this approach can run into problems with storage and compute time.

A better approach is to use a running mean where, as each new return value is obtained, we update the estimated state value based on the last estimate. Not only does this help avoid problems with storage and compute time, but it also lets us have an estimate of the state value at each step. We don't need to wait until the process has finished to compute the average state value.

The formula for calculating the new state value, in terms of the previous estimate and the return obtained from the latest episode, is shown below (where 'n' is the number of visits to the state):

Equation 3: The estimated value of state ‘s' calculated in terms of the previous value and new return.
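Using the notation above, with Gₙ the return obtained from the n-th visit to state 's', the standard incremental (sample-average) form of this update is:

$$ V_{n}(s) = V_{n-1}(s) + \frac{1}{n}\left( G_n - V_{n-1}(s) \right) $$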

For a proof of this equation check out the Bandits article, where we used a similar method to calculate a sample average estimate for the action value.


Exploring Starts

The second image in Figure 5 above shows the state first-visit count. This number represents the number of episodes in which the state was visited at least once, as opposed to the total number of visits to that state. It is also equal to the number of trajectories that passed through this state on the way to the exit.

In this simple environment there's only a single route from the entrance to the exit (this is the direct path that we saw back in Figure 3). To reach the exit Baby Robot has to pass through each of the states on this path. Therefore, all of the states on this trajectory have to be visited at least once during an episode and, as a result, all have 10,000 first-visits, equal to the total number of episodes.

In contrast, the grid square containing the puddle (at grid location [1,0]) doesn't lie on the direct path to the exit. It's not necessary to visit this state while moving to the exit and therefore this state only gets visited when an action, chosen at random by the policy, causes Baby Robot to move there. Consequently, it was only visited in 8,737 of the 10,000 episodes.

From Baby Robot's point of view this was a good thing, as it stopped him getting wet, but from the point of view of the accuracy of our state estimate it's not so good, since this state was sampled fewer times than the other states. In general, the most frequently visited states will have more accurate estimates of the state values than the seldom-seen states.

To compensate for this we can instead choose to begin each episode in a different state and thereby get a more balanced accuracy for the state value estimates. Additionally, if our policy resulted in some states never being visited we could start some episodes in these states and, in this way, estimate a value for these states.

This technique, of beginning each episode in a different state chosen from the total state space, is referred to as Exploring Starts.
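As a rough sketch of how exploring starts could be added to the earlier pseudo-code, each episode can simply be started from a randomly chosen grid square. The 'start' argument used here is hypothetical; the actual Baby Robot environment and notebook helpers may expose the start position differently:

import numpy as np

rng = np.random.default_rng()

for episode in range(max_episodes):

    # exploring starts: choose a random grid square for this episode's
    # start (in practice terminal squares would be excluded)
    start = (rng.integers(env.width), rng.integers(env.height))
    state_rewards = single_episode(env, start=start)

    # ...the returns and first-visit counts are then accumulated exactly
    # as in the First-Visit Monte Carlo pseudo-code above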

Figure 6: First-Visit MC with Exploring Starts. Calculated state values and state first-visit count for 10000 episodes.

The state values and first-visit counts when using Exploring Starts for 10,000 episodes are shown in Figure 6 above. You can see that, in this case, the number of first-visits to most states has actually decreased. This is due to the average trajectory being shorter since, in most cases, the starting state will be closer to the exit. For a simple environment, such as this one, exploring starts doesn't really give any benefit. Indeed, while exploring starts can help balance the visits to each state, it may not always be possible or even desirable.

When, as in this case, we're simply running a simulation of an environment, it's very easy to select where each episode will begin. If, on the other hand, this was an actual, real-life maze that Baby Robot was navigating, then selecting the start state would be considerably more difficult and, for certain environments, might actually be impossible.

Additionally, although the values calculated for each state should be of a similar level of accuracy, this may not always be beneficial. Without exploring starts, the states that are most frequently visited will have a higher level of accuracy than those that are rarely visited. This can be a good thing, since we're concentrating our efforts on finding the state values for the most often seen states and not wasting time on those that would be seen infrequently or never visited. However, by not using exploring starts we run the risk of missing states that would potentially give high levels of return. As is always the case with Reinforcement Learning, it's a balancing act between exploration and exploitation.


Every-Visit Monte Carlo

So far, with both our experiments that began at the level's start point and with exploring starts, we were using First-Visit Monte Carlo, in which the returns were taken from the first time a state was visited until the end of the episode.

However, during any episode, a state may be visited multiple times (for example, see the sample trajectory given in figure 4, where the states [2,1] and [2,2] are both visited twice). So, another approach to calculating an estimate of the state value is to treat each visit to a state as the start of a separate trajectory and record the return from that visit until the episode terminates. This approach to estimating state values is known as Every-Visit Monte Carlo.
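Continuing the figure 4 example, the sketch below shows how an every-visit version of the earlier first-visit snippet would accumulate returns and visit counts (again a simplified, hypothetical illustration rather than the notebook's actual code):

from collections import defaultdict

# (state, return) pairs for the figure 4 trajectory, as calculated earlier
returns = [((0, 1), -7), ((1, 1), -6), ((2, 1), -5), ((2, 2), -4),
           ((2, 1), -3), ((2, 2), -2), ((1, 2), -1)]

# every-visit: every occurrence of a state contributes a return and a visit,
# and these are then added to the running totals across all episodes
episode_returns = defaultdict(float)
episode_visits = defaultdict(int)
for state, ret in returns:
    episode_returns[state] += ret
    episode_visits[state] += 1

print(episode_visits[(2, 1)])    # 2 - this state was visited twice
print(episode_returns[(2, 1)])   # -8.0 - the sum of its returns -5 and -3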

The results obtained for our sample level, when recording every visit to a state, are shown in Figure 7 below. Although the test was run for the standard 10,000 episodes, by recording every visit to a state the actual number of visits recorded for each state is, in most cases, much higher.

Both First-Visit and Every-Visit Monte Carlo are, given enough time, guaranteed to converge on the true state value. However, although Every-Visit MC records more visits to each state, it's not clear that it actually gives better results than First-Visit MC. This is probably because the information from the extra trajectories recorded by Every-Visit MC is already largely contained in the return obtained from the first-visit trajectory.

Figure 7: Every-Visit MC, calculated state values and state visit count for 10,000 episodes.

Monte Carlo Control

Using any of the Monte Carlo methods described above, we are able to generate a pretty good estimate for the state values for any supplied policy. However, although this lets us see the relative goodness of any state with respect to the other states, this doesn't actually help us to navigate the level.

When we have full information about state transitions and rewards, we can use Dynamic Programming to turn the Bellman Equations into a set of update rules that can calculate the state values under the current policy. From these state values we can then select the action that gives the maximum expected return.

This is summarised by the following equation for the greedy policy π in state s:

Equation 4: Greedy policy update w.r.t. the state values.
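In standard form this greedy update is:

$$ \pi(s) = \underset{a}{\operatorname{argmax}} \sum_{s', r} p(s', r \mid s, a)\left[\, r + \gamma V(s') \,\right] $$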

where:

  • p(s',r|s,a) is the probability of moving to the next state s' and getting reward r when starting in state s and taking action a.
  • r is the reward received after taking this action.
  • γ is the discount factor.
  • V(s') is the value of the next state.

Unfortunately, since we're not given a model of the environment, we don't know p(s',r|s,a) and therefore can't make use of the next state value V(s'): we don't know where we'll end up if we take a specific action, nor the reward we'll get for taking it.

As an illustration, consider the level that Baby Robot has been exploring, along with the first-visit state values that we calculated, shown again in Figure 8 below:

Figure 8: State values and coordinates.

When Baby Robot reaches the grid square [1,1], as shown above, if we simply act greedily with respect to the calculated state values, then the optimal action appears to be a move south, to grid square [1,2], since this is the adjacent state with the highest state value. However, since we don't have a model of the environment, we know nothing about the transition probabilities or rewards that would be received for taking this action.

If Baby Robot was to take this action his probability of reaching the target state [1,2] is actually zero and instead he'd bounce off the glass wall and end up in the puddle at [1,0]. Additionally, he'd receive a large negative reward of -5 for taking this action. So, in reality, this wouldn't be a good action to choose.

Therefore, for Monte Carlo control, rather than calculating the values of each state, we need to calculate the action values. As a reminder, these values are known as Q-Values and represent the returns that can be expected from taking an individual action. We can then use these to find the optimal policy and navigate the best path through the maze.

Once we know the state-action values (or approximations of them) then we can do policy improvement by acting greedily with respect to these values, simply choosing the action with the maximum value in each state. We don't need the model of the environment and we don't bootstrap (i.e. no use is made of the value of the potential next state).

This policy update is described by equation 5, where, for state ‘s', the greedy policy ‘π' simply chooses the action ‘a' with the maximum state-action value ‘q'.

Equation 5: Greedy policy update w.r.t. the maximum state-action value
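In standard form:

$$ \pi(s) = \underset{a}{\operatorname{argmax}}\; q(s, a) $$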

The code to calculate the state-action values using Monte Carlo is almost identical to the code that was used to calculate the state values, except now, rather than considering visits to individual states, we're analysing the visits to the actions within those states.

The Python pseudo-code to calculate the state-action values using First-Visit Monte Carlo is shown below:

''' calculate state-action values using First-Visit Monte Carlo '''

import numpy as np
from tqdm import tqdm

# keep a count of the first-visits to each action
visits = np.zeros((env.height,env.width,len(Actions)))                    ❶

# the total returns for each action
returns = np.zeros((env.height,env.width,len(Actions)))                   

for episode in tqdm(range(max_episodes)):                                 ❷

  action_rewards = single_episode(env,policy)                             ❸
  action_returns = rewards_to_returns(action_rewards)                     ❹
  episode_returns,episode_visits = get_first_visit_return(action_returns) ❺

  # add the episode returns to the total returns
  returns += episode_returns

  # increment the count of any actions that have been visited
  visits += episode_visits

# average return per action (an action that was never taken gives 0/0 = NaN)
avg_returns = (returns/visits)                                            ❻

If you compare this code with the Monte Carlo First-Visit code that we used to calculate the state values, you'll see that it's practically identical. The only real difference is that now, instead of using a 2-dimensional array of the states in the environment, we use a third dimension to allow the Q-values to be calculated.

The points to note in the code above, for First-Visit Monte Carlo, are as follows:

  1. Over the course of all episodes, we want to keep track of the returns that are received after the first visit to each action and the number of episodes in which each action was visited (for First-Visit MC). Therefore we start by creating zeroed NumPy arrays for each of these. As before, the array dimensions are set to the width and height of the environment, but now a third dimension is added for the number of possible actions in each state.
  2. As in all Monte Carlo methods we want to sample the rewards obtained over multiple episodes.
  3. For each episode get the list of the state actions visited and the rewards received for each of these.
  4. Sum the rewards in reverse direction to calculate the returns for each action.
  5. Calculate the first-visit return value for the actions taken during the episode and then add this to the total returns. This updates the total return of every action seen during the episode and increments the count of the number of first visits for these actions.
  6. Finally, after all episodes have completed, divide the total first-visit returns by the total number of visits, to get the estimated Q-Value for each state-action.

Running this code for 1000 episodes gives the following state-action values and action visit counts:

Figure 9: Action values and counts calculated using Monte Carlo.

In Figure 9, above, we can see that the actions that must be taken to move from the entrance to the exit of the level have all been sampled 1000 times, equal to the number of episodes. Similarly, since Baby Robot is smart enough not to walk into standard walls, the actions that would result in hitting a wall are never taken and so have a zero visit count. Due to the stochastic policy being used, all other possible actions have been visited across the set of episodes.

If we now act greedily with respect to these action values we get the following policy which, for this very simple level, is also the optimal policy. By following this, Baby Robot will reach the exit in the shortest possible time:

Figure 10: The optimal policy, calculated by acting greedily with respect to the estimated action values.
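The sketch below shows one way this greedy step could be implemented, assuming the avg_returns array produced by the code above (actions that were never taken end up as NaN after the division, so they're ignored here):

import numpy as np

# choose the action with the highest estimated Q-value in each grid square,
# skipping actions that were never taken (these are NaN after the division
# of total returns by a zero visit count)
greedy_actions = np.full((env.height, env.width), -1, dtype=int)

for y in range(env.height):
    for x in range(env.width):
        q_values = avg_returns[y, x]
        if not np.all(np.isnan(q_values)):
            greedy_actions[y, x] = np.nanargmax(q_values)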

Generalised Policy Iteration (GPI) with Monte Carlo

For the simple level that we've been using as an example, with a stochastic policy where, in any state, each action was equally likely to be chosen, we evaluated the policy using Monte Carlo to estimate the action values and then improved the policy by selecting the action with the maximum value in each state (i.e. we acted greedily with respect to the action values).

This represents a single iteration of Generalised Policy Iteration (GPI) in which repeated evaluation and improvement steps move the policy ever closer to the optimal policy, as shown in figure 11. But now, rather than using Dynamic Programming to evaluate each policy, we've approximated the value function using Monte Carlo.

Figure 11: Generalised Policy Iteration (GPI)

In the case of our simple level we reached the optimal policy after only a single iteration but, in practice, for more complex environments many iterations may be required.

As an example, consider the more complex level and an associated deterministic policy shown below:

Figure 12: A slightly more complex level and a deterministic policy for this level

We can run a single iteration of Monte Carlo policy evaluation for this deterministic policy, which gives us the following state-action values:

Figure 13: state-action values after a single iteration of MC policy evaluation for a deterministic policy

And, if we act greedily with respect to these values, we get the following updated policy:

Figure 14: Updated deterministic policy after acting greedily with respect to the estimated action values

We appear to have a problem! Not only has the majority of the level not been visited, but Baby Robot is stuck running in circles, never reaching the exit (and this would have continued forever if we hadn't set a limit on the allowed length of an episode).

Clearly, when using a deterministic policy, we also need to add some form of exploration, otherwise we'll never be able to estimate the state-action values for actions that haven't been visited and, as a result, we'd never be able to improve the policy.

We could achieve this exploration by using the exploring starts idea that we saw for state values. In this case, rather than beginning each episode in a new state, we'd extend this to also cover all actions. However, outside of a simple simulation of the environment, this would be impractical. A better approach is to consider policies that have a non-zero chance of selecting all actions in each state.


The Epsilon-Greedy Algorithm (ε-Greedy)

When following a deterministic policy only a single action will be taken in each state. However, to allow us to choose the best actions we need to form estimates of the values for all actions, so we need to add some form of exploration.

One of the simplest, but highly effective, methods for achieving this is the Epsilon-Greedy algorithm (which we saw previously when looking at Bandit problems). In this method, in any state, we normally choose the action with the maximum estimated action value, but with probability 'ε' (epsilon) an action is selected at random.

Therefore, for the current state, we can select an action as shown below:

import numpy as np

# draw a random number in the range [0,1)
p = np.random.random()

# with probability epsilon a random action is chosen
# from the state's available actions
if p < epsilon:
    action = env.action_space.sample()
else:
    # otherwise take the policy's action in the current state
    action = policy.get_action(env.x,env.y)

Now, when following an epsilon-greedy policy, we can discover other states and actions that wouldn't be found under a purely deterministic policy. As a result, we gather a lot more information during an episode than we would if simply following the deterministic policy.

For example, for the same deterministic policy that was used above, but with the addition of random actions, taken with probability 0.1 (so approximately every 10th action will be chosen at random), for a single episode, we get action values like the ones shown below:

Figure 15: Action values for a single episode of Epsilon-Greedy (with the allowed number of steps in an episode increased to 2000)

When we act greedily with respect to these action values, we get the following policy which, as you can see, has defined actions based on a much larger set of states:

Figure 16: Policy formed from acting greedily with respect to action values calculated using an Epsilon-Greedy policy.

Now that we've got a means by which we can explore the environment, we can return to our initial problem of using GPI to find the optimal policy.

In the example above, we only ran a single episode of Monte Carlo to evaluate the state-action values and then improved the policy based on these values. Instead, before the policy improvement step, we could have evaluated the policy over multiple episodes, to form a more accurate approximation of the action values.

These two approaches are similar to Value Iteration, where only a single step of policy evaluation is done before the policy is improved, and Policy Iteration, in which the policy is evaluated to convergence before the policy improvement step. In either case, for Monte Carlo methods, we need to run complete episodes in the policy evaluation step, before policy improvement takes place.

Figure 17: an initial deterministic policy that does reach the exit

The deterministic policy, shown in Figure 17 above, defines a trajectory that would take Baby Robot from the start to the exit of the level. But, as you've probably noticed, it's less than optimal.

Therefore, we'd like to improve this policy by running repeated rounds of MC policy evaluation and policy improvement to move us progressively closer to the optimal policy. With epsilon set to 0.5, to give us a high degree of exploration, and running a single episode of MC policy evaluation before each policy improvement, we get the action value estimates and associated policies shown below:

Figure 18: Action value estimates and associated policy improvements when running GPI with epsilon = 0.5

The delta value, shown at each iteration of GPI in figure 18, and in the graph below, represents the average difference between the state-action values at the start and end of the iteration. As you can see, it initially begins with a high value and converges towards zero as the estimated action values become better estimates of the true action values.
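The sketch below outlines this loop under the same assumptions as the earlier pseudo-code; 'evaluate_policy', 'make_greedy_policy' and 'max_iterations' are hypothetical stand-ins for the notebook's functions and settings, and delta is taken here as the mean absolute change in the action values over an iteration:

import numpy as np

# one Q-value estimate per state-action, as in the earlier pseudo-code
q_values = np.zeros((env.height, env.width, len(Actions)))
min_delta = 0.04   # stopping threshold for the change in action values

for iteration in range(max_iterations):
    old_q_values = q_values.copy()

    # policy evaluation: one episode of epsilon-greedy Monte Carlo,
    # returning updated action value estimates (hypothetical helper)
    q_values = evaluate_policy(env, policy, q_values, epsilon=0.5)

    # policy improvement: act greedily w.r.t. the new estimates (hypothetical helper)
    policy = make_greedy_policy(q_values)

    # delta: mean absolute change in the action values over this iteration
    delta = np.mean(np.abs(q_values - old_q_values))
    if delta < min_delta:
        break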

Figure 19: The change in the delta value, representing the mean difference of all action values between the start and end of each iteration

In this run we've actually used the delta value to control the stopping of the GPI process (considering convergence to have been reached when delta is less than 0.04). The final policy is shown in figure 20 below:

Figure 20: The policy formed at the final iteration of GPI, with a minimum delta of 0.04 used as the stopping criterion.

This final policy defines a much shorter route from the entrance to the exit of the level. Additionally, the action value of the start state's single action has improved from -45 under the initial policy to -27 under the final policy. Since this value represents the return that can be expected when taking this action, and every move in this environment incurs a penalty of at least -1, you can see that we've dramatically reduced the amount of time it will take Baby Robot to reach the exit.

However, things aren't perfect and this isn't the optimal policy for this level. For example, if you consider the bottom left-hand state, you can see that the final policy defines an action that would take Baby Robot North, when a shorter route to the exit would be to move East.

The best action hasn't been found due to a lack of sampling of the actions in this state. If you look at the number of first-visits to each action, as shown in Figure 21 below, you can see that the North action was only seen 3 times across the 23 episodes used for policy evaluation.

Figure 21: The first-visit count of the actions visited during the 23 iterations of GPI.

There are a few reasons why this action was hardly ever visited:

  • We defined a minimum delta value that stopped GPI before the optimal policy was found.
  • We only ran a single episode for each policy evaluation step. Increasing the number of episodes would increase the accuracy of the action value estimates and increase the number of visits to each action.
  • By default, Epsilon-Greedy follows the current deterministic policy. Although other actions are chosen at random to increase the exploration of the environment, remote states and actions are less likely to be visited.

As mentioned previously, the fact that this state and its actions have been visited infrequently may not be a bad thing. Rather than visiting all states and actions an equal number of times, by concentrating on those that lead from the start state to the exit, we've reduced the time required to locate the optimal trajectory through the maze. As usual, it's the old exploration-exploitation trade-off.


Summary

Monte Carlo methods use sampling to learn from experience. Over the course of multiple episodes they gradually build up an increasingly accurate picture of the environment. As a result, they can estimate state and action values without being given a model of the environment.

During the policy evaluation step of Generalized Policy Iteration (GPI), by running complete episodes using the current policy, Monte Carlo methods can be used to form an average of the visited state-action values.

To ensure that all states and actions are visited, exploration can be introduced in a number of ways, such as exploring starts, in which each new episode begins in a different state or action, or epsilon-greedy policies where, by default, the action specified by the policy is taken, but with probability 'epsilon' a different, random action is chosen.

If the policy is then improved by acting greedily with respect to these estimated values, the policy improvement theorem guarantees that the new policy will be as good or better than the previous policy. By repeating these steps of policy evaluation and improvement, Monte Carlo methods can be used to find the optimal policy.


What's Next

So far we've only considered 'on-policy' methods, in which a single policy is evaluated and improved. Going forwards we'll take a look at 'off-policy' methods which employ more than one policy. For example, one policy can be used to explore and gather information, and a second policy, which is the policy that's actually used to negotiate the environment, is then created from this information.

We've also seen that, unlike Dynamic Programming, Monte Carlo methods don't use the values from other states when updating their value estimates. In other words they don't 'bootstrap'. In the next article we'll look at Temporal-Difference (TD) learning which, like Monte Carlo, learns from experience but also employs bootstrapping when calculating its value estimates.


References:

"Reinforcement Learning: An Introduction", Sutton & Barto (2018)

"_Lecture 4: Model-free Prediction_", David Silver



