A note about these notes

I made these notes a while ago, never completed them, and never double checked for correctness after becoming more comfortable with the content, so proceed at your own risk. I am only leaving them online as some people seemed to have found them useful in the past.

These notes and exercises are based on the 15th of May 2018 draft of Reinforcement Learning – An Introduction by Sutton & Barto (the newest version is available here). I tend to summarize the main concepts from the chapters I go through and attempt the exercises. Feel free to let me know if you spot any mistakes.

My notes will not match up with the book, as I have skipped things I was familiar with and delved into more detail on concepts I found confusing. As background before reading this book, I took CS188 in Spring 2017. However, since then I have definitely become pretty rusty on many of the concepts.

1. Introduction

What is reinforcement learning?

It’s picking actions in an environment with the objective of maximizing a reward signal.

Important aspects of RL are:

  • Trial-and-error search: try things and fail until things work out
  • Delayed reward: being able to take actions not only to maximize immediate reward but also more long-term rewards
  • Exploration vs exploitation: a tradeoff very characteristic of RL, arising from the fact that in order to find the best way to maximize reward, one could naively just adopt the best strategy seen so far (exploitation) or look for new strategies (exploration). However, an agent can be truly successful only if it combines both of these modes during its learning.
  • There is a clear goal that the agent is trying to achieve
  • Can sense environment and affect it with their actions: Pacman can see the food and eat the food pellets (changing the state of the environment)

Reinforcement learning is commonly considered as a different field from supervised and unsupervised learning, although they share some interesting characteristics.

An attempt to provide a brief description of a RL problem could be:

“we have an interaction between an active decision-making agent and its environment, within which the agent tries to achieve a goal despite uncertainty about its environment.”

RL vs Supervised Learning

I feel like, in a way, reinforcement learning and supervised learning are pretty similar. The book disagrees with this, but I can see how you could consider supervised learning as a form of reinforcement learning, where each item that is labeled correctly gives a positive reward, and vice versa.

However, there is one important difference: reinforcement learning agents take actions sequentially and interact with the environment. In this sense I think that it makes more sense to say that reinforcement learning is a very different beast compared to supervised learning.

RL and the curse of dimensionality

One of RL’s strong suits is that it partially addresses the curse of dimensionality characteristic of control theory and operations research.

How? Mainly (I believe) through function approximation, as used in temporal difference approaches: the value function is defined more simply, in that we don’t learn a value for each state independently, but learn values based on certain features of states.

Backgammon has about $10^{20}$ states, so we can’t define a policy explicitly for each state. What we do instead is come up with approximations of the value of each state, based on features of the state, that can be handcrafted, or can be also learned with supervised learning methods (for example neural networks).

Elements of RL

  • Policy: defines the agent’s behaviour at a given time
  • Reward signal: the immediate feedback the environment gives at every choice of action of the agent
  • Value function: a function that returns for each state $s$ the estimated expected total reward that the agent could gain starting from the state $s$
  • Model (optional): Enables the agent to estimate how the environment will react to its actions, enabling better planning

One characteristic element of RL compared to other “learning algorithms” is the dependence on value functions. Other solution methods such as genetic algorithms, genetic programming, and simulated annealing apply multiple static policies and estimate their effectiveness statistically. We call these evolutionary methods.

What is the clear distinction between evolutionary methods and RL? Are evolutionary methods reinforcement learning methods or are they separate things?

RL is more effective as it manages to use the information of each run, while genetic algorithms just evaluate policies in their entirety and throw away all information within each individual run.

Prior information can be incorporated algorithmically in order to make the policy search more efficient.

Some reinforcement learning methods don’t need any kind of environment model at all. These systems are called model-free. Their advantage is that they are a lot less computationally intensive, and can give better results if coming up with an accurate model is hard for the problem at hand. In general if model-based systems have bad environment approximations, it might be better to just go model-free.

Exercise 1.1: Self-Play

Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

If both sides were learning, I believe it would end up learning the optimal game policy. Interestingly, I think it’s not obvious what the optimal game policy is. If faced with a sub-optimal static-policy adversary, the learned policy would be guaranteed to converge to the optimal policy against that agent. However, if the agent is trained against itself, it would eventually converge to the policy that, in expectation, performs best against a policy optimized against it. This is a bit of a tongue twister. I think this is similar to the prisoner’s dilemma, in that the game-theoretically optimal action can be sub-optimal in an absolute sense. It also has a very strong relationship to minimax, even though it is a bit more complicated (and less trivial) since the agent is playing against itself rather than against an optimal agent.

It would learn a different policy, as the optimal actions against some arbitrary opponent differ from those learned by training against itself. Something else to consider is that both sides would constantly be changing their strategy in order to have a better chance against the other (basically itself). I don’t think it’s completely trivial to even state that convergence is guaranteed, as I can imagine a scenario in which the policy just bounces back and forth between an absolutely optimal and a game-theoretically optimal stance.

Exercise 1.2: Symmetries

Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

If we have a tabular approach, where we store the expected value associated with each state, we could enforce some constraints of equality between all states that are identical up to symmetry. This would definitely make sense when playing against an optimal player, for which states identical up to symmetry should have the same value. However, if the opponent has a skewed policy, in which he behaves differently given states identical up to symmetry, then we shouldn’t have these symmetry constraints enforced: in fact, not enforcing them might enable us to exploit the biases of the opponent. If the opponent behaves in a particularly sub-optimal way in a certain symmetric state, then the value we associate with that state should be higher than the other states that are identical up to symmetry.
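As a concrete sketch of the tabular idea, one could key the value table on a canonical representative of each symmetry class. The helpers below use my own (hypothetical) names, assuming a 3x3 board stored as a tuple of 9 cells:

```python
def rot90(b):
    # clockwise rotation: new cell (r, c) <- old cell (2 - c, r)
    return tuple(b[(2 - c) * 3 + r] for r in range(3) for c in range(3))

def reflect(b):
    # horizontal mirror: new cell (r, c) <- old cell (r, 2 - c)
    return tuple(b[r * 3 + (2 - c)] for r in range(3) for c in range(3))

def canonical(b):
    """Smallest board among the 8 symmetric variants; use it as the table key."""
    variants, cur = [], b
    for _ in range(4):
        variants += [cur, reflect(cur)]
        cur = rot90(cur)
    return min(variants)
```

With this, all four corner openings share one table entry, while a corner opening and a centre opening remain distinct. Dropping `canonical` (keying on the raw board instead) recovers the unconstrained table that can exploit a biased opponent.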

If instead we are using a temporal difference approach, enforcing symmetry constraints could be a lot easier or harder. In general, naive approaches would already enforce symmetry constraints implicitly, as the most common features that come to mind are usually symmetrical. However, if we wanted to break this and enable non-symmetrical values, that would be harder and require a lot more thought.

Exercise 1.3: Greedy Play

Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a non-greedy player? What problems might occur?

I believe that it would be guaranteed to play equally well or worse compared to a normal RL agent, under reasonable assumptions. This is because RL agents try to optimize not only immediate but also long-term rewards, while a greedy agent only optimizes for immediate ones. This could cause it to prioritize the wrong things, optimizing for small short-term rewards while missing out on bigger long-term ones.

Exercise 1.4: Learning from Exploration

Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

If we perform learning updates even on exploratory moves, we will underestimate the actual value of the states we update after exploratory moves. This is because, unless the previous estimates were wrong, the action we pick during exploration is actually sub-optimal, and will lead to a worse state than the one we would have reached by exploiting our policy.


  • When we also perform updates on exploratory moves, our probabilities (value function values) will be lower bounds on the optimal ones
  • Otherwise, we will have unbiased values

Assuming that we continue to make exploratory moves, it’s better to learn the unbiased ones. If an action ($s \to a_{explore} \to s’$) that we choose during exploration is actually better than the greedy action, once we explore the state $s’$ and update its value, it could overtake the current best action $a_{exploit}$ for state $s$.

Given that the value estimates in the first case are biased, they would perform at most as well as the unbiased estimates. I’m not totally sure whether the bound would be strict or not.

Part I: Tabular Solution Methods

In this part we only deal with problems in which state and action spaces are sufficiently small that they can be stored in a table (or equivalently, an array). In this case it is often possible to find exact solutions, i.e. the optimal policy for our agent.

Problems with just one state are called bandit problems.

return to this after having finished part one

2. Multi-armed Bandits

  • evaluative feedback: indicates how good an action taken was, but not whether it was the worst or best action possible
  • instructive feedback: indicates the correct (best) action, independently of the action actually taken

Instructive feedback is the basis of supervised learning. Evaluative feedback is instead at the basis of RL.

Another important distinction is that between associative and non-associative problems: non-associative problems are simpler, as one doesn’t have to care about the combination of action and state, but just about the action. This means that taking a certain action right now will not affect the possibilities of the agent in the future in any way.

2.1 A k-armed Bandit Problem

In a k-armed bandit problem there are k possible actions to choose from, and after you select an action you receive a reward drawn from a distribution corresponding to that action. The goal is to identify the best actions as soon as possible and concentrate on them (or, more likely, the one best/optimal action). The problem becomes more complicated if the reward distributions are non-stationary, as our learning algorithm must notice the change in optimality and adapt its policy.

In the stationary version, each of our actions has an expected reward given that that action is selected, called action value.

\[q_\star(a) = \mathbb{E}[R_t \ \mid \ A_t = a]\]

We want $Q_t(a)$, our approximation of the action value, to be close to $q_\star(a)$, the true value. $Q_t(a)$ is also commonly called the $Q$ Value for action $a$.

One of the main ideas in the exploration vs exploitation tradeoff is that if we want $Q$ to be close to $q_\star$, it isn’t sufficient to take greedy (exploitative) moves all the time. With exploration, reward is lower in the short run, but higher in the long run, because after you have discovered the better actions, you can exploit them many times.

Whether to explore or exploit is a very hard question, and depends in a very complex way on the precise values of estimates, uncertainties, and number of remaining steps. There are some methods to determine what is the best tradeoff, but they usually make very limiting assumptions, such as stationarity and prior knowledge that can be impossible to verify or are just not true in most applications.

Does the definition above for action value contrast with CS188? Isn’t value in CS188 defined as “expected rewards from all following actions”?

I think the conflict is resolved by the fact that in this case we’re considering a stationary problem with infinite horizon and no discount, so it doesn’t make sense to define value as expected rewards up to infinity, as they would be infinite.

2.2 Action-value Methods

Action-value methods are all methods that estimate the values of actions and use these estimates to pick the agent’s next action. Remember that the true value of an action is its expected/mean reward.

One natural way to estimate the value of an action is to take the average reward obtained every time we have taken that action:

\[Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}\]

As the denominator goes to infinity, our estimate converges to the real value $q_\star(a)$. This is called the sample-average method.

The simplest method for selecting actions is instead the greedy approach:

\[A_t = \operatorname{argmax}_a Q_t(a)\]

A better, but still simple approach is the $\epsilon$-greedy approach, in which the greedy option is selected with probability $1-\epsilon$ and a random action (among the remaining) is selected with probability $\epsilon$.
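A minimal sketch of $\epsilon$-greedy selection (function name is my own; note this version samples the random action among all actions, so the greedy one can also be picked while exploring):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick the greedy action with prob 1 - epsilon, else a uniformly random one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))    # explore among all actions
    best = max(q_values)
    # break ties randomly among the maximizing actions
    return random.choice([a for a, q in enumerate(q_values) if q == best])
```

Random tie-breaking matters early on, when many actions still share the same (initial) estimate.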

Exercise 2.1

In $\epsilon$-greedy action selection, for the case of two actions and $ε = 0.5$, what is the probability that the greedy action is selected?

It is $\epsilon = 0.5$ if we follow the convention that when choosing the random action (if we’re in the $\epsilon$ probability case) we exclude the greedy action.

Otherwise: \(\begin{equation} \begin{split} P(\text{greedy}) & = P(\text{pick greedy} \mid \text{exploit}) P(\text{exploit}) + P(\text{pick greedy} \mid \text{exploration}) P(\text{exploration}) \\ & = 1 * (1 - \epsilon) + \frac{1}{\mid A\mid} \epsilon \\ & = 1 - (1 - \frac{1}{\mid A\mid})\epsilon \\ & = 1 - \frac{\mid A\mid - 1}{\mid A\mid} \epsilon \end{split} \end{equation}\)

I think that picking the greedy action when exploring should not be allowed, as it complicates what it means to be $\epsilon$-greedy. In fact, as seen above, under that convention we do not pick a non-greedy action an $\epsilon$ proportion of the time.

2.3 The 10-armed Testbed

The 10-armed testbed is a test set for the performance of these algorithms. It’s composed of $N=2000$ randomly generated k-bandit problems, with $k=10$. The action values $q_\star(a)$ were selected at random from a normal distribution with mean 0 and variance 1. The reward distributions for each action were also Gaussian, centered at $q_\star(a)$ with variance 1.

For any learning method, we can benchmark it by measuring performance and behaviour over 1000 timesteps when applied to one of the $N$ bandit problems. This makes up one run. Repeating this for $N$ independent runs, each with a different bandit problem, enables us to measure expected behaviour.
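One run of this benchmark might be sketched as follows (a simplification with my own names, using the sample-average update that is derived incrementally in Section 2.4):

```python
import random

def run_bandit(k=10, steps=1000, eps=0.1, seed=0):
    """One epsilon-greedy run on a random k-armed problem; returns mean reward."""
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(k)]    # q*(a) ~ N(0, 1)
    q_est, n = [0.0] * k, [0] * k
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:                      # explore
            a = rng.randrange(k)
        else:                                       # exploit (greedy)
            a = max(range(k), key=lambda i: q_est[i])
        r = rng.gauss(q_true[a], 1)                 # reward ~ N(q*(a), 1)
        n[a] += 1
        q_est[a] += (r - q_est[a]) / n[a]           # sample-average estimate
        total += r
    return total / steps
```

Averaging `run_bandit(seed=s)` over 2000 seeds reproduces the expected-behaviour curves from the book.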

We see that low $\epsilon$ converges slower, but to higher values. See book for details/images. It’s all pretty straightforward.

Things get a bit more complicated if the distribution of the rewards is non-stationary, i.e. it changes over time. In this case we always want to be exploring, and probably with a rate that is tuned to do well with the rate of change of the reward functions.

Would be interesting to look into this above point more.

Exercise 2.2: Bandit example

Consider a k-armed bandit problem with $k = 4$ actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using $ε$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all $a$. Suppose the initial sequence of actions and rewards is $A_1 = 1, R_1 = 1, A_2 = 2, R_2 = 1, A_3 = 2, R_3 = 2, A_4 = 2, R_4 = 2, A_5 = 3, R_5 = 0$. On some of these time steps the $ε$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

At timestep $2$ this definitely occurred, as we know that the average reward associated with $A_1$ is $1$ and so $Q_2(a_1) > Q_2(a_2) = 0$. Therefore, choosing $A_2 = 2$ must have been the result of an exploration step. Similarly, this must have also occurred at timestep $5$.

It might have occurred at timestep $1$, depending on how the algorithm picks among actions with equal $Q$ values. The same is true for timestep $3$, at the beginning of which $a_2$ and $a_1$ have the same $Q$ value.

If when one is picking a random action, one chooses among all actions rather than just all the ones currently considered suboptimal, then it is possible that a random action was selected at any of the timesteps.

Exercise 2.3

In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

The $\epsilon = 0.01$ method will perform the best in the long run. This is because – assuming stationarity – it is guaranteed to find the optimal action and then exploit it. Once the algorithm has found the optimal action, it will exploit it 99% of the time, while the $\epsilon = 0.1$ will only exploit the optimal action 90% of the time.

Let’s say that the average $Q$ value of suboptimal values is $\bar{q}_s$. Then we know that:

  • $\epsilon = 0.01$ will have an average reward of $0.01 \, \bar{q}_s + 0.99 \, q_{opt}$
  • $\epsilon = 0.1$ will have an average reward of $0.1 \, \bar{q}_s + 0.9 \, q_{opt}$

Depending on how large the gap between $\bar{q}_s$ and $q_{opt}$ is, this could change the performance of the algorithm very much.

From the figure mentioned in the exercise above: I wonder why it takes so long for the algorithm to figure it out – 1000 timesteps is a long time. It would be nice to see how the belief changes over time.

Note that $\epsilon=0.01$ means the agent only explores on average 10 times within the 1000 timesteps. The chance that the optimal action is not explored at all is decently high, and the chance that, even if it was explored at some timestep, its reward was not decisively better than that of all other actions, is even higher. Therefore it makes sense that it would take quite a long time to converge. The graph looks the way it does (with incremental improvement) not because that’s the way it would look for a single bandit game, but because it is averaged out: it shows the incremental discovery of the optimal action across all 2000 games.

2.4 Incremental Implementation

How can these averages be computed in a computationally efficient manner?

Basically just store the average and update it at each timestep. (see book for details)

Given $Q_n$ and the $n$th reward $R_n$ (note that the $n$th reward forms the $Q_{n+1}$ estimate as $Q_1$ is based off of a prior – usually just set to zero), we get:

\[Q_{n+1} = \frac{1}{n} \sum_i^n R_i = \frac{1}{n} \Bigg(R_n + (n-1) \frac{1}{(n-1)}\sum_i^{n-1} R_i\Bigg) = \frac{1}{n} \Big(R_n + (n-1) Q_{n}\Big) = Q_n + \frac{1}{n} \Big[R_n - Q_{n}\Big]\]

The update formula turns out to take the form:

\[NewEstimate \leftarrow OldEstimate + StepSize [Target - OldEstimate]\]

This is a form that comes up very often in RL. Note that $[Target - OldEstimate]$ can be considered a measure of error of the estimate.

Note that the step-size parameter changes from timestep to timestep, depending on the number of times that one has chosen that action. Usually step-size is denoted by $\alpha_n$.
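The whole of this section fits in a couple of lines; here is a small sketch (my own function name) showing that the incremental rule reproduces the plain average:

```python
def update_mean(q, r, n):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate), StepSize = 1/n."""
    return q + (r - q) / n

rewards = [2.0, 4.0, 9.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_mean(q, r, n)
# q now equals the plain average of the three rewards, i.e. 5.0
```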

2.5 Tracking a Non-stationary Problem

If the reward probabilities change over time, we don’t want to give the same weight to all observations when calculating our estimates for the mean-rewards for each action. We want to give more weight to the most recent observations.

We can modify the formula derived in 2.4 to have a constant step size $\alpha$. Note that the step size found in 2.4 decreased with the number of observations ($\frac{1}{n}$), indicating that the weight given to observations decreases over time. However, we want the weight given to new observations to be the same, but that given to old observations to decrease.

By setting $\alpha$ as a constant, we effectively achieve this, as we get (see book for full derivation):

\[Q_{n+1} = Q_n + \alpha [R_n - Q_n] = ... = (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha (1-\alpha)^{n-i} R_i\]

This is a form of weighted average because the weights sum up to one: $(1-\alpha)^n + \sum_{i=1}^n \alpha (1-\alpha)^{n-i} = 1$

The above expression shows us that the weight given to each reward $R_i$ decreases exponentially the further back in time it was received. This is sometimes referred to as an exponential recency-weighted average.
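A quick numerical check of the two claims above (the weights form a proper weighted average, and recent rewards carry exponentially more weight); the function name is mine:

```python
def exp_recency_weights(alpha, n):
    """Weight on Q_1 and weights on R_1..R_n for a constant step size alpha."""
    w_q1 = (1 - alpha) ** n
    w_r = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_q1, w_r

w_q1, w_r = exp_recency_weights(0.1, 20)
# w_q1 + sum(w_r) == 1 (up to float error); w_r is increasing, so the most
# recent reward R_n gets the largest weight, namely alpha itself
```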

Instead of having a constant step size, it might sometimes be beneficial to have a variable step size $\alpha_n$. Such a step size should be chosen mindful of the properties below.

To assure convergence with probability 1: \(\sum_{n=1}^\infty \alpha_n(a) = \infty \ \text{ and } \ \sum_{n=1}^\infty \alpha_n^2(a) < \infty\) Why is this? Intuitively, the first condition makes sure that the step size is large enough for the estimate to travel an infinite distance (hence overcoming an initial bias of arbitrary size). The second condition instead guarantees that the step size will eventually be small enough to guarantee convergence.

Note that for the step-size $\frac{1}{n}$, both conditions are met, but this is not the case for a constant step size. In fact, in the latter case the second condition is not met, i.e. the estimate never converges. This means that there will always be some bias in our estimate, but it also means that throughout time, our estimate will be able to “follow” our parameter through parameter space, without ever getting “tired” (converging).

Note that these last sentences also apply to the step size $\frac{1}{n}$, but as that step size decreases over time, its convergence (though guaranteed in the limit) becomes slower and slower as time progresses, making it less useful in real-world applications. In fact, if the true parameters move fast enough, an estimate updated with a step size of $\frac{1}{n}$ would never be able to catch up, while for a constant step size that is not necessarily true.

The tradeoff here appears to be between speed of convergence and bias in the estimate.

Exercise 2.4

If the step-size parameters $\alpha_n$ are not constant, then the estimate $Q_n$ is a weighted average of previously received rewards with a weighting different from that given by $Q_{n+1} = Q_n + \alpha [R_n - Q_n]$. What is the weighting on each prior reward for the general case, analogous to $Q_{n+1} = Q_n + \alpha [R_n - Q_n]$, in terms of the sequence of step-size parameters?

\[\begin{equation} \begin{split} Q_{n + 1} & = Q_n + \alpha_n [R_n - Q_n] \\ & = (1 - \alpha_n)Q_n + \alpha_n R_n \\ & = (1 - \alpha_n)\Big((1 - \alpha_{n - 1})Q_{n-1} + \alpha_{n-1} R_{n-1}\Big) + \alpha_n R_n \\ & = (1 - \alpha_n)(1 - \alpha_{n - 1})Q_{n-1} + (1 - \alpha_n)\alpha_{n-1}R_{n-1} + \alpha_n R_n \\ & = Q_1 \prod_{i=1}^n(1-\alpha_i) + \sum_{i=1}^n \Bigg(\alpha_i R_i \prod_{j=i+1}^{n} (1 - \alpha_j)\Bigg) \end{split} \end{equation}\]

Therefore, we see that the expressions end up being pretty similar, where the one in section 2.5 is just a special case of this more general one.
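A small check (with made-up numbers) that this closed form agrees with applying the incremental rule step by step; function names are my own:

```python
def q_incremental(q1, rewards, alphas):
    """Apply Q_{n+1} = Q_n + alpha_n * (R_n - Q_n) step by step."""
    q = q1
    for r, a in zip(rewards, alphas):
        q = q + a * (r - q)
    return q

def q_closed_form(q1, rewards, alphas):
    """Q_1 * prod(1 - a_i) + sum_i a_i * R_i * prod_{j > i}(1 - a_j)."""
    n = len(rewards)
    q = q1
    for a in alphas:
        q *= (1 - a)
    for i in range(n):
        w = alphas[i]
        for j in range(i + 1, n):
            w *= (1 - alphas[j])
        q += w * rewards[i]
    return q
```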

Exercise 2.5 (programming)

Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for non-stationary problems. Use a modified version of the 10-armed testbed in which all the $q_∗(a)$ start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation $0.01$ to all the $q_∗(a)$ on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, $α = 0.1$. Use $ε = 0.1$ and longer runs, say of $10,000$ steps.
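I haven’t written this experiment up properly, but a sketch of it could look as follows (function name is my own; it returns the fraction of steps on which the currently-optimal action was chosen, with `alpha=None` meaning sample averages):

```python
import random

def nonstationary_run(steps=10000, k=10, eps=0.1, alpha=None, seed=0):
    """Epsilon-greedy run with drifting q*(a); alpha=None uses sample averages."""
    rng = random.Random(seed)
    q_true = [0.0] * k                  # all q*(a) start out equal
    q_est, n = [0.0] * k, [0] * k
    optimal_picks = 0
    for _ in range(steps):
        a = (rng.randrange(k) if rng.random() < eps
             else max(range(k), key=lambda i: q_est[i]))
        if a == max(range(k), key=lambda i: q_true[i]):
            optimal_picks += 1
        r = rng.gauss(q_true[a], 1)
        n[a] += 1
        step = alpha if alpha is not None else 1 / n[a]
        q_est[a] += step * (r - q_est[a])
        for i in range(k):              # independent random walks of q*(a)
            q_true[i] += rng.gauss(0, 0.01)
    return optimal_picks / steps
```

Averaging many seeds and comparing `alpha=None` against `alpha=0.1` should show the sample-average method falling behind as the true values drift, which is the difficulty the exercise asks to demonstrate.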


2.6 Optimistic Initial Values

All the methods for estimating $Q(a)$ so far depend on $Q_1(a)$ in some amount. Note that $Q_1$ is the initial action-value estimate that we have before having observed any action whatsoever. Therefore, we can say that our estimators of $Q$ are biased because of this by their initial estimates.

For sample-average methods, the bias disappears once all actions have been selected at least once. This is true because after we have taken an action once we have $Q_2(a) = Q_1 + \frac{1}{1} [R_1 - Q_1] = R_1$, which is an unbiased estimate. From then onwards, the estimate does not depend on $Q_1$ anymore, and so the resulting estimator is unbiased.

Instead, for methods with a constant $\alpha$, the bias is permanent, even though it decreases over time (this is true as we can see from section 2.5 that $Q_1$ does not cancel out in this case, even though it is given exponentially less weight in the final estimate as time progresses).

This kind of bias can actually turn out to be helpful. This is the main idea underlying the method of optimistic initial values: using the bias as a tool to encourage exploration and speed up convergence. Specifying the initial values $Q_1$ can be seen as introducing additional information (a prior) to the problem. In particular, consider setting all $Q_1$ values to $+5$ in our testbed case. This is very optimistic and the learner will definitely be “disappointed”. However, this has the effect of reducing the value of the tried action, making the agent more prone to try the other ones, still valued at $+5$. This way, we encourage very active exploration at the start, without having to implement an $\epsilon$-greedy policy.

This trick is quite effective on stationary problems but not on non-stationary ones. This is because once the value estimates have started to converge to their true values, the agent will not explore anymore. The problem is that the drive for exploration is inherently temporary.
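A tiny sketch of the effect (hypothetical helper): with $Q_1 = 5$ and a constant step size, even a purely greedy agent ends up trying every arm early on, because each disappointing reward drops that arm’s estimate below the untried ones.

```python
import random

def greedy_optimistic(q_true, steps=100, q1=5.0, alpha=0.1, seed=0):
    """Purely greedy agent; optimism in Q_1 alone drives the early exploration."""
    rng = random.Random(seed)
    k = len(q_true)
    q_est = [q1] * k
    tried = set()
    for _ in range(steps):
        a = max(range(k), key=lambda i: q_est[i])   # always greedy, no epsilon
        tried.add(a)
        r = rng.gauss(q_true[a], 1)
        q_est[a] += alpha * (r - q_est[a])          # constant-step-size update
    return tried
```

Despite never taking a random action, the agent samples every arm within the first few steps.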

“This criticism also applies to sample-average methods, in which the beginning of time is treated as a special event”? (unsure what point the book is trying to make here)

Exercise 2.6: Mysterious Spikes

The results shown in Figure 2.3 should be quite reliable because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks. Why, then, are there oscillations and spikes in the early part of the curve for the optimistic method? In other words, what might make this method perform particularly better or worse, on average, on particular early steps?

At initial steps, most agents across all tests will do the same thing: they will go through all actions once and be “disappointed” in all the rewards they receive. Depending on the rewards they receive, they now have a pretty high chance (it seems plausible that it would be around 40%, as in the figure) of selecting the optimal action in their next move after trying out all actions once, as on average the optimal action should have yielded a higher reward.

The fact that a large proportion of agents selects the optimal action at exactly that timestep causes the spike in the graph. However, after having selected it, most agents will have been “disappointed” even more, and switch back to selecting other actions. At this point most agents are doing different things, and there is no longer the large constructive interference that occurred at $t=10$.

I’m not quite sure why the optimistic method takes so long to converge to the optimal action, however, even though it takes fewer iterations than the realistic $\epsilon$-greedy approach. Might be something to look into more.

Another thing to notice is that the graph reflects that the probability of choosing the optimal action prior to $t=10$ is around 10%, i.e. a random pick out of all actions with $Q_1 = 5$.

Exercise 2.7: Unbiased Constant-Step-Size Trick

In most of this chapter we have used sample averages to estimate action values because sample averages do not produce the initial bias that constant step sizes do. However, sample averages are not a completely satisfactory solution because they may perform poorly on nonstationary problems. Is it possible to avoid the bias of constant step sizes while retaining their advantages on nonstationary problems? One way is to use a step size of $\beta_t = \frac{\alpha}{\bar{o}_t}$, where $α > 0$ is a conventional constant step size, and $\bar{o}_t$ is a trace of one that starts at $0$:

\[\bar{o}_{t+1} = \bar{o}_t + \alpha (1 - \bar{o}_t) \text{ for } t \geq 1 \text{ with } \bar{o}_1 = \alpha\]

Carry out an analysis to show that $\beta_t$ is an exponential recency-weighted average without initial bias.
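A sketch of how the analysis could go, reusing the general weighting derived in Exercise 2.4 with $\alpha_n = \beta_n$: since $\bar{o}_1 = \alpha$, the first step size is $\beta_1 = \alpha / \bar{o}_1 = 1$, so

\[Q_2 = Q_1 + \beta_1 (R_1 - Q_1) = R_1\]

and the weight on $Q_1$, namely $\prod_{i=1}^n (1 - \beta_i)$, contains the factor $(1 - \beta_1) = 0$: the initial bias is gone. For the remaining weights, the trace update gives $\bar{o}_j = (1-\alpha)\bar{o}_{j-1} + \alpha$, so

\[1 - \beta_j = \frac{\bar{o}_j - \alpha}{\bar{o}_j} = \frac{(1-\alpha)\,\bar{o}_{j-1}}{\bar{o}_j}\]

and the product telescopes, making the weight on each reward $R_i$

\[\beta_i \prod_{j=i+1}^{n} (1 - \beta_j) = \frac{\alpha}{\bar{o}_i} \cdot (1-\alpha)^{n-i} \, \frac{\bar{o}_i}{\bar{o}_n} = \frac{\alpha (1-\alpha)^{n-i}}{\bar{o}_n}\]

i.e. an exponential recency-weighted average (normalized by $\bar{o}_n$, which tends to 1) with no dependence on $Q_1$.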


2.7 Upper-Confidence-Bound Action Selection

One thing that can be improved in $\epsilon$-greedy methods is that when the $\epsilon$ chance of exploring comes up, the action is taken at random. We might instead want to explore according to how much benefit we expect to gain: if we are already very sure that an action is suboptimal, it might not be worth exploring it anymore, preferring actions where the certainty is lower. In other words, we might want to select among the non-greedy actions according to their potential for actually being optimal (taking into account both how close their estimates are to the optimal action value and their uncertainty).

One way to do this is to select every action as follows (we don’t have $\epsilon$-greedy actions anymore):

\[A_t = \operatorname{argmax}_a \Bigg[Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\Bigg]\]

Note: when $N_t(a)=0$, $a$ is considered to be the maximizing action.

Why this expression? We are selecting according to a proxy value that is the sum of our current belief about the $Q$ value and a proxy for the uncertainty of the estimate, rescaled by $c$. Every time an action is explored, $N_t(a)$ increases, decreasing this uncertainty proxy. Meanwhile, $\ln t$ increases at every timestep (though at a slower, logarithmic rate): this serves to increase the “uncertainty” of actions even when they are not selected. In this case it’s not really about uncertainty, but more about making sure that every action is selected every once in a while, even if we are pretty confident of its sub-optimality.
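A sketch of the selection rule (function name is my own), including the $N_t(a) = 0$ convention from the note above:

```python
import math

def ucb_select(q_est, counts, t, c=2.0):
    """UCB action selection: untried actions first, then argmax of Q + bonus."""
    for a, n in enumerate(counts):
        if n == 0:                      # N_t(a) = 0: treat a as maximizing
            return a
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(q_est, counts)]
    return scores.index(max(scores))
```

Note how a rarely-tried arm can win on its bonus alone, even with a lower estimate.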

This approach is called upper confidence bound (UCB). The upper bound aspect of it comes from the fact that the quantity argmaxed over can be considered an approximate upper bound for the actual value of each action $a$.
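As a concrete sketch of this selection rule, here is my own minimal Python (variable names are made up; `c` rescales the uncertainty bonus as in the equation above):

```python
import math
import random

def ucb_action(q, n, t, c=2.0):
    """Select an action by the UCB rule: argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].

    q: list of current action-value estimates Q_t(a)
    n: list of selection counts N_t(a)
    t: current time step (1-indexed)
    Untried actions (N_t(a) == 0) are considered maximizing, so one of them is
    picked (at random among ties) before any action is ever repeated.
    """
    untried = [a for a in range(len(q)) if n[a] == 0]
    if untried:
        return random.choice(untried)
    return max(range(len(q)), key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
```

With `c = 0` this degenerates into pure greedy selection; larger `c` keeps favouring rarely-tried actions for longer.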

This all seems very reasonable and more theoretically grounded than most of the previous methods (even though the statistics is still very rough and approximate). However, unfortunately, even though UCB performs well, it is very hard to extend to general RL cases. In particular, it does not support nonstationary problems, or large state spaces (where the states cannot be stored in tabular form).

Why can’t we apply the nonstationary update rule for $Q_t$ from Section 2.5 to UCB and be able to track nonstationary problems effectively?

The main reason behind this is that even though we would be updating the $Q$ values in the “right way” – capable of dealing with nonstationary problems – we would not be selecting the actions in a very legitimate way. This arises from the fact that our proxy for uncertainty assumes stationarity. In fact, there is no clear way to change the expression for action selection (presented in the equation above) in order to be good at dealing with nonstationary problems.

Exercise 2.8:

UCB Spikes In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the 11th step. Why is this? Note that for your answer to be fully satisfactory it must explain both why the reward increases on the 11th step and why it decreases on the subsequent steps. Hint: if $c = 1$, then the spike is less prominent.

This is related to Exercise 2.6. In this case this phenomenon arises from the way we define the argmax to behave when $N_t(a)=0$: as mentioned above, when $N_t(a)=0$, $a$ is considered to be the maximizing action.

This means that at each of the first 10 moves (as there are 10 actions in our testbed), we will pick at random among the actions that have not been chosen yet, as they will have $N_t(a)= 0$ and so will be considered “optimal”.

On the 11th step, all the agents will pick the action among the 10 that in this first round of attempts yielded the highest reward: for all actions, the value of $c \sqrt{\frac{\ln t}{N_t(a)}}$ will be the same, so the only discriminatory factor will be the value of $Q_t$. And we know that this only depends on the first round of exploration.

Therefore the 11th step will be an example of “constructive interference” (as in Exercise 2.6) where all agents across games are in sync with selecting a “pretty good” action (not necessarily optimal, but most likely not too bad). This is what causes the spike. After step 11, the agents across games will not be “in sync” anymore, and the average reward will go down, to then slowly increase as more certainty about the system is gathered.

The spike therefore is mainly due to this “sync” across different games, that is guaranteed by the structure of the UCB action selection.

2.8 Gradient Bandit Algorithms

Up until now, we have concerned ourselves with estimating action values ($Q$ values) and using those estimates to select actions. This is not the only possible approach: one can instead learn a numerical preference $H_t(a)$ for each action. These preference values are unrelated to reward: only the relative preference of one action over another matters, as the probability of selecting each action is determined by performing a soft-max on the preference values.

\[P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}} = \pi_t(a)\]

where $\pi_t(a)$ denotes the probability of taking action $a$ at time $t$. This notation is very common.

There is a natural learning algorithm in this setting based on the idea of stochastic gradient ascent. It updates the preference values as follows, after taking action $A_t$ and obtaining reward $R_t$:

\[H_{t+1} (A_t) = H_t (A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))\]

\[H_{t+1} (a) = H_t(a) - \alpha (R_t - \bar{R}_t) \pi_t(a) \ \text{ for all } \ a \neq A_t\]

Note that $(R_t - \bar{R}_t)$ will be positive if the reward obtained at time $t$ is greater than the average reward obtained.

This means that after taking action $A_t$, we will update the preference $H_{t+1}$ for $A_t$:

  • Increasing it if the reward obtained $R_t$ is greater than average
  • Decreasing it if the reward obtained $R_t$ is smaller than average

The amount the preference for $A_t$ is updated depends on $\pi_t(A_t)$: in particular, if $A_t$ has a high chance of being selected (high $\pi_t(A_t)$), then we will have a small update, and vice versa.

The opposite happens for all the other actions $a \neq A_t$. Their preference is updated too, in the opposite direction as that for $A_t$, and the update size will be proportional to their probability of being selected (the higher the chance, the bigger the update).

Note that the amount of increase/decrease is proportional to the step-size $\alpha$ that basically measures how aggressively we update our preferences.

There are quite a lot of moving parts, but they can all be derived by porting the concept of stochastic gradient ascent to this context. In particular, it turns out that the above update scheme is equivalent to stochastic gradient ascent with batch size 1. (See book for details.)
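The update equations above can be sketched in a few lines of Python (my own code, not from the book; `avg_reward` stands in for the baseline $\bar{R}_t$):

```python
import math

def softmax(h):
    """Convert preference values H_t(a) into selection probabilities pi_t(a)."""
    m = max(h)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in h]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_update(h, action, reward, avg_reward, alpha=0.1):
    """Return updated preferences after taking `action` and observing `reward`.

    The chosen action's preference moves by +alpha*(R - Rbar)*(1 - pi(a));
    every other action's preference moves by -alpha*(R - Rbar)*pi(a).
    """
    pi = softmax(h)
    return [
        h[a] + alpha * (reward - avg_reward) * ((1 - pi[a]) if a == action else -pi[a])
        for a in range(len(h))
    ]
```

Note that the updates sum to zero across actions, so only relative preferences change, which matches the observation that preference values are meaningful only relative to each other.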

Exercise 2.9:

Show that in the case of two actions, the soft-max distribution is the same as that given by the logistic, or sigmoid, function often used in statistics and artificial neural networks.
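Attempted solution (a sketch): with two actions $a_1$ and $a_2$, divide numerator and denominator by $e^{H_t(a_1)}$:

\[\pi_t(a_1) = \frac{e^{H_t(a_1)}}{e^{H_t(a_1)} + e^{H_t(a_2)}} = \frac{1}{1 + e^{-(H_t(a_1) - H_t(a_2))}} = \sigma\big(H_t(a_1) - H_t(a_2)\big)\]

where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid. So the two-action soft-max is exactly the logistic function applied to the difference in preferences, and $\pi_t(a_2) = 1 - \pi_t(a_1)$; as noted above, only the relative preference matters.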


2.9 Associative Search (Contextual Bandits)

For now we have only considered nonassociative tasks, that is, tasks in which there is no need to associate different actions with different situations. However, in the general RL task there is more than one situation, and the goal is to learn a policy, i.e. a mapping from states to optimal actions given the state.

Suppose that instead of having one bandit for the entire game, at each turn you were given a bandit selected among a pool of bandits, and also a clue as to which bandit it might be. In this case, the optimal action depends on the state (the bandit you are actually confronting). Now you can learn a policy that associates an optimal action with each state you could potentially be in.

This is an example of an associative search task, called such because it involves both trial-and-error learning to search for the best actions and association of these actions with the situations in which they are best. These tasks are now called contextual bandits in the literature.

In this setting, actions still do not affect the next situation, only the reward. Allowing actions to affect the next situation as well will add another layer of complexity and lead us to full reinforcement learning problems.

Exercise 2.10

Suppose you face a 2-armed bandit task whose true action values change randomly from time step to time step. Specifically, suppose that, for any time step, the true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you face at any step, what is the best expectation of success you can achieve and how should you behave to achieve it? Now suppose that on each step you are told whether you are facing case A or case B (although you still don’t know the true action values). This is an associative search task. What is the best expectation of success you can achieve in this task, and how should you behave to achieve it?

If we are not told which case we face at any step, the best we can do is take the (weighted) average of the $Q$ values associated with each action across cases, and always pick the action that leads to the highest (weighted) average reward.

In this case we are given the weights (0.5 for case A and 0.5 for case B), so the weighted average just corresponds to the plain average. Given that the average rewards for actions 1 and 2 are both $1/2$ (e.g. $0.5 \cdot 0.1 + 0.5 \cdot 0.9 = 0.5$ for action 1), the choice of action doesn’t matter: we expect, on average, to receive the same reward in the long run.

To be able to achieve this optimal behavior, we could just use one of the standard stationary approaches, such as UCB with incremental $Q$ value update. (I think?)

If instead we were told which case we are facing, then we would want to run two separate learners: instead of keeping one set of $Q$ value estimates, we keep one for case A and one for case B, and update them independently. If implemented correctly, this would lead us to select action 2 in case A and action 1 in case B.
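A quick numerical check of the two expectations (my own sketch; the dictionary just encodes the true action values from the exercise statement):

```python
# True action values for the two cases of Exercise 2.10.
cases = {"A": {1: 0.1, 2: 0.2}, "B": {1: 0.9, 2: 0.8}}
p = 0.5  # each case occurs with probability 0.5

# Not knowing the case: average each action's value across cases, pick the best.
blind = {a: p * cases["A"][a] + p * cases["B"][a] for a in (1, 2)}
best_blind = max(blind.values())        # both actions average to 0.5

# Knowing the case: pick the best action within each case.
best_informed = p * max(cases["A"].values()) + p * max(cases["B"].values())
# i.e. 0.5 * 0.2 + 0.5 * 0.9 = 0.55
```

So the case information raises the best achievable expected reward from 0.5 to 0.55.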

2.10 Summary

This chapter presented several ways to balance exploration and exploitation:

  • $\epsilon$-greedy methods choose randomly among actions a small fraction of the time to encourage exploration
  • UCB methods choose deterministically but favouring actions with uncertain estimates
  • Gradient methods don’t estimate action values, but preferences, and choose actions probabilistically according to the preferences
  • Optimistic initialization sets optimistic initial $Q$ values in order to encourage a very active exploration in the initial phase, leading to a fast convergence with no bias

No single method among the above is best. Common ways to assess the performance of these kinds of algorithms are through graphs:

  • learning curve: shows the performance of an algorithm vs iterations for a certain parameter setting
  • parameter studies: summarize each algorithm by the average reward over the first 1000 steps, and show such a value for various parameter settings and algorithms

One should also consider the sensitivity to parameter settings, which is an indication of robustness.

Despite their simplicity, these methods can be considered state of the art, as more sophisticated methods usually impose additional complexity and assumptions.

Stuff to potentially explore more: Gittins indices

Something to keep in mind is that in a Bayesian setting it would be possible to even compute the optimal balance between exploration and exploitation. However, this is often computationally infeasible. See book for more details. The main idea is that one could imagine expanding all possible outcomes for the next $x$ actions, and computing which tradeoff would lead, in expectation, to the best outcome.

Exercise 2.11 (programming)

Make a figure analogous to Figure 2.6 for the nonstationary case outlined in Exercise 2.5. Include the constant-step-size $ε$-greedy algorithm with $α=0.1$. Use runs of 200,000 steps and, as a performance measure for each algorithm and parameter setting, use the average reward over the last 100,000 steps.


13. Policy Gradient Methods

In contrast to action-value methods, which associate a value to each action given the state and pick actions according to these values, there are also policy-gradient methods. These methods learn a parametrized policy that can select actions without consulting a value function. A value function might still be used to learn the policy parameter, but is not required for action selection.

What does this mean? As far as I can tell, the major difference is that now the choice of action (the policy) is made distinct from the values of the state. An implicit assumption in action-value methods is that we always pick the actions that maximize our value, or at least we do so most of the time.

In this context, we let $\theta \in \mathbb{R}^{d'}$ be the policy parameter vector: $\pi(a \mid s, \theta) = P(A_t = a \mid S_t = s, \theta_t = \theta)$. Our policy now also depends on the value of $\theta$.

If the method uses a learned value function as well, we can denote it as $\hat{v}(s, w)$ where $w \in \mathbb{R}^d$ is the weight vector.

We can now consider methods for learning the policy parameter $\theta$ based on some performance measure $J(\theta)$, by performing gradient ascent on $J$ with respect to $\theta$:

\[\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}\]

where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimated gradient of the performance measure $J$ with respect to the parameter $\theta$.

All methods that follow this schema are called policy gradient methods. Note that in this case, by performing gradient ascent with respect to $\theta$, if our parametrized policy is sufficiently expressive, we are guaranteed (I believe) to converge to the optimal policy, given that our performance measure is correct and probably some other assumptions. In general an important takeaway is this: by picking a value for $\theta$ we are implicitly approximating the optimal policy function.
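As a toy illustration of this update schema (entirely made up: $J(\theta) = -(\theta - 3)^2$ with Gaussian noise standing in for the stochastic gradient estimate $\widehat{\nabla J(\theta_t)}$):

```python
import random

random.seed(0)  # for reproducibility

def noisy_grad(theta):
    """Stochastic estimate of dJ/dtheta for the toy measure J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0) + random.gauss(0.0, 0.1)

theta, alpha = 0.0, 0.05
for _ in range(500):
    theta += alpha * noisy_grad(theta)  # theta_{t+1} = theta_t + alpha * grad-hat J(theta_t)
# theta ends up near the maximizer theta* = 3
```

Real policy gradient methods differ in how $\widehat{\nabla J}$ is obtained (e.g. from sampled episodes), but the ascent step itself is exactly this.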

Can’t action value methods be considered as a parametrized policy case also? Where the $\epsilon$ (for $\epsilon$-greedy methods) could be considered part of the $\theta$? What is the main difference? I think that the answer to this is yes – action-value methods are a subset of methods with a parametrized policy. However, the difference is that in that case $\theta$ (i.e. $\epsilon$) is not updated over time in order to optimize performance, so it is not to be considered a gradient method.

Methods that learn approximations to both the policy and value functions are called actor-critic methods, where actor refers to the learned policy, and critic refers to the learned value function, usually a state-value function.

What other function could we have that is not state-value? Probably skipped too much for now to have a good example.

13.1 Policy Approximation and its Advantages

In the setting of policy-based methods, the policy can be parametrized in any way, as long as $\pi(a \mid s, \theta)$ is differentiable with respect to $\theta$.

In practice, to ensure exploration, we require the policy to never become deterministic (i.e. $\pi(a \mid s, \theta) \in (0,1) \ \ \forall s, \theta$).

I think that the concept of not exploring is related to that of local minima in optimization. If our policy is deterministic, we don’t explore, and so the agent has a hard time improving. However, in this setting it’s a bit more complicated: even if the policy were deterministic, the $\theta$ value would still be updated, and thus its policy would still change over time to maximize performance. However, given that this relies on the environment being stochastic, the whole matter becomes more complicated. In fact, if the environment is deterministic, would a deterministic policy leave us unable to improve at all? I’m kind of confused as to how one would go about actually calculating $J$ in a real setting, or what $J$ would be. (I’m imagining something like the long-run average reward gained by playing the game with policy $\pi$.)

Policy-based methods can also work well with continuous action spaces, as explored in 13.7, but here we only consider reasonably small discrete action spaces. In this case, the natural way to parametrize the policy function is with a vector of numerical preferences $h(s, a, \theta) \in \mathbb{R}$ for each state-action pair. Then one can recover the actual policy approximation with a soft-max, that converts these preferences to probabilities. This kind of parameterization is called soft-max in action preferences. The preferences could be parameterized arbitrarily, for example with a neural net or just with a linear combination of features.
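A minimal sketch of the linear-preferences version (my own code; `features` is a hypothetical matrix whose rows are the feature vectors $x(s, a)$ for the current state):

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(a | s, theta): soft-max over linear preferences h(s, a, theta) = theta . x(s, a).

    features: (num_actions, d) array, one feature vector x(s, a) per action.
    Returns a probability for each action in the given state.
    """
    h = features @ theta          # preference for each action in this state
    h = h - h.max()               # shift for numerical stability (doesn't change probs)
    p = np.exp(h)
    return p / p.sum()
```

Note that scaling $\theta$ up spreads the preferences apart, so the distribution can get arbitrarily close to deterministic, which is the advantage discussed next.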

One advantage of using the soft-max is that the approximate policy can approach a deterministic policy, while this is not possible with $\epsilon$-greedy approaches, that always maintain a certain amount of stochasticity. One could select according to a soft-max distribution based on action values, but this alone would not allow the policy to approach a deterministic policy.

Let’s think about why for a minute. If we were to select actions according to soft-max action values, then we would always preserve stochasticity in our action choices. This is because eventually, if we are estimating our action-values correctly, they will converge to their true values, and so their soft-max version will inevitably preserve some non-negligible probability of selecting all of them. The main difference from using the preference values $h$ is that they could grow arbitrarily large or small such that the soft-max probabilities actually approach a deterministic policy (one action has probability arbitrarily close to one and the others arbitrarily close to zero). One could still include a temperature parameter for the soft-max, but then the question arises as to how to vary it over time.

Action preferences are different because they are driven by the policy gradient (which drives the updates of $\theta$) to produce the optimal stochastic policy. If the optimal policy is deterministic, the updates will produce a policy that is arbitrarily close to that deterministic one.

Another advantage of parametrizing policies according to the soft-max in action preferences is that it enables convergence to policies with particular probabilities of selecting each action. There are situations in which, for a given state, there does not exist one optimal action, and the optimal thing to do is to return a random action with a certain distribution. This occurs for example in poker, with bluffs: one has to switch up one’s game and can’t play too predictably, so there is no optimal deterministic strategy. This is different from action-value methods, which in their vanilla version determine an “optimal action” and then return it a $1-\epsilon$ proportion of the time. Even in the case in which there are multiple “optimal actions” (actions with the same action-value), one usually picks among them uniformly at random.

This is to say that action-value methods have no natural way of finding stochastic policies, whereas policy approximating methods can.

The simplest advantage that policy parameterization might have over action-value parameterizations is that the policy may be a simpler function to approximate. For some problems, the action-value function is simpler and thus is easier to approximate, and for others the policy function is.

The choice of policy parameterization can also be a good way of injecting prior knowledge into the problem. This is often the most important reason in using a policy-based learning method.

Exercise 13.1

Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.

Note that in this example, $J(\theta) = v_{\pi_\theta}(S)$. One can derive a closed form symbolic expression for $v_{\pi_\theta}(S)$ and then differentiate with respect to $\theta$ to find the stationary point for the gradient ascent. This would parametrize the optimal policy, so one can derive the optimal action probabilities from the soft-max optimal preference values (that in turn are determined by the optimal $\theta$). TODO

Still in progress