Learning the hard way
Reinforcement Learning Saga — Part I:
From zero to Q-learning

Code available here
In this article
- Introduction to Reinforcement Learning
- Markov Decision Processes
- Model-based vs. Model-free
- Return, Policy, and Value Functions
- Bellman Equations
- Monte Carlo Methods
- Temporal Difference Learning
- SARSA
- Q-Learning
Abstract
This article introduces reinforcement learning (RL), a framework for decision-making in uncertain, sequential settings. An agent repeatedly observes the state of a system, takes an action, and receives a numerical reward, with the aim of maximizing the total reward over time. The interaction is modeled as a Markov Decision Process (MDP), which formalizes states, actions, transition probabilities, and reward functions. The main interests are the policy (a strategy for choosing actions) and the value functions (quantifying the long-term benefit of states or actions). We derive recursive relationships for the value functions, known as Bellman equations, which form the foundation for computational methods.
The fundamental problems of RL are prediction (estimating the value functions for each state or state-action pair under a given policy) and control (finding the best policy). For prediction, we introduce Monte Carlo (MC) methods, which learn from complete episodes, and Temporal Difference (TD) methods, which update the value estimates incrementally by combining observed rewards with bootstrapping. For control, we introduce SARSA and Q-learning, an on-policy and an off-policy algorithm respectively. Examples throughout illustrate how these methods operate in practice.
Introduction
Reinforcement Learning (RL) is a framework for sequential decision making in which an agent learns by trial and error through interaction with its environment. The agent is what we directly control; it acts on an environment that responds but is not directly controllable.
At each discrete time step $t$ the agent observes the current state $s_t$, selects an action $a_t$, and receives a scalar reward $r_{t+1}$ that evaluates the outcome as the system moves to a new state $s_{t+1}$. The goal is to maximize the expected cumulative reward over time.
Markov Decision Processes
To model decision making we use Markov Decision Processes (MDPs). An MDP formalizes the interaction between an agent and the environment at discrete time steps $t$ through:
- a set of states \(S =\{s_t\}_t\)
- a set of actions \(A = \{a_t\}_t\)
- a transition probability function ${p}(s'|s,a)$ that maps pairs $(s,a)$ of state and action to a probability distribution over next states; i.e. it defines the probability of reaching state $s'$ after taking action $a$ in state $s$
- a reward system
- $r(s,a,s’)$: reward received by taking action $a$ in state $s$ and moving to state $s’$
- $r(s,a)$: expected immediate reward for taking action $a$ in state $s$, averaged over all the possible next states $s’$ using the transition probability $p(s’|s,a)$
$$ r(s,a) = \mathbb{E}[r_{t+1}\,|\, s_t=s, a_t=a] = \sum_{s'} p(s'| s,a)\;r(s,a,s') $$
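To make these objects concrete, the sketch below represents a toy MDP with plain Python dictionaries and computes $r(s,a)$ from $p(s'|s,a)$ and $r(s,a,s')$. The two states, two actions, and all the numbers are invented purely for illustration.

```python
# A toy MDP with two states and two actions, stored in plain dictionaries.
# All probabilities and rewards below are made up for illustration.

states = ["low", "high"]
actions = ["x", "y"]

# p[(s, a)][s2] is the transition probability p(s' | s, a)
p = {
    ("low", "x"):  {"low": 0.2, "high": 0.8},
    ("low", "y"):  {"low": 0.9, "high": 0.1},
    ("high", "x"): {"low": 0.5, "high": 0.5},
    ("high", "y"): {"low": 0.3, "high": 0.7},
}

# r3[(s, a, s2)] is the reward r(s, a, s')
r3 = {
    ("low", "x", "low"): 0.0,  ("low", "x", "high"): 1.0,
    ("low", "y", "low"): 0.5,  ("low", "y", "high"): 0.0,
    ("high", "x", "low"): 2.0, ("high", "x", "high"): 0.0,
    ("high", "y", "low"): 0.0, ("high", "y", "high"): 1.0,
}

def expected_reward(s, a):
    """r(s, a) = sum over s' of p(s'|s,a) * r(s, a, s')."""
    return sum(p[(s, a)][s2] * r3[(s, a, s2)] for s2 in states)

print(expected_reward("low", "x"))  # 0.2 * 0.0 + 0.8 * 1.0 = 0.8
```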
MDPs rely on the Markov property, which assumes that the future is independent of the past, given the present. This translates into the agent considering only the present state, and not the full history, to make a decision.
\[{p}(s_{t+1}\,|\,s_t, a_t) = {p}(s_{t+1}\,|\,s_t, a_t, s_{t-1}, a_{t-1}, .., s_0, a_0)\]
Model-based vs. Model-free
Reinforcement learning methods differ primarily in whether they rely on an explicit model of the environment. By model, we mean the transition probability function $p(s'|s,a)$ and the reward function $r(s,a)$. This distinction governs the way the agent learns from interaction.
- Model-based methods use a known model (or one learned from data) to plan by simulating outcomes. These methods improve a strategy by repeatedly evaluating and optimizing action sequences without the need to interact with the real environment. As a downside, modeling an environment can be complex and can introduce bias.
- Model-free methods skip modeling and learn directly from experience. These methods improve behavior through trial and error, using signals from the environment. They are very flexible but, as a downside, they often require many interactions.
Return, Policy, and Value Functions
The goal is to maximize the expected cumulative reward received over time. To formalize this, we define the return $G_t$ as the total future reward from time $t$ onward:
\[G_t = r_{t+1} + r_{t+2} + r_{t+3} + \;..\]where $r_i$ is the reward obtained at time $i$ after taking action at time $i-1$.
This definition does not take into account two important aspects. First, the further into the future we try to predict, the less certain we are about the outcomes. Second, some tasks can last indefinitely: summing raw rewards may yield an infinite value, making the return ill-defined. A revised definition of the return therefore introduces a discount factor $\gamma\in[0,1]$ that controls how much we value future rewards.
\[G_t = r_{t+1} + \gamma\,r_{t+2} + \gamma^2\,r_{t+3} + \;..\; = \sum_{k=0}^{\infty} \gamma^k\,r_{t+1+k}\]- If $\gamma =0$ only the immediate reward matters.
- If $\gamma <1$ the rewards in the far future are less valuable than the immediate rewards.
If we define the effective horizon $H_{\text{eff}}$ as how far into the future an agent needs to plan or consider rewards to make good decisions, a rule of thumb is to approximate $H_{\text{eff}} \approx \small{\frac{1}{1-\gamma}}$. For example, $\gamma=0.99$ corresponds to roughly 100 steps of lookahead. This means that at time $t$ the contribution of the rewards coming after time $t+100$ is negligible.
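As a quick sanity check of the discounted return, the short sketch below computes $G_t$ for a finite list of rewards; the reward values themselves are arbitrary.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+1+k}, for a finite list of rewards [r_{t+1}, r_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]           # arbitrary rewards
print(discounted_return(rewards, gamma=1.0))  # 5.0: undiscounted sum
print(discounted_return(rewards, gamma=0.9))  # ~4.1: future rewards count less
print(discounted_return(rewards, gamma=0.0))  # 1.0: only the immediate reward matters
```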
An agent's behavior is governed by a policy $\pi$, a distribution over all possible actions given a state. Formally, we denote by $\pi(a|s)$ the probability of selecting action $a$ when in state $s$. The optimal policy is the one that ensures the highest expected return. To evaluate how good a policy is, we introduce value functions. Value functions assign a numerical estimate to states or state-action pairs to represent the expected return the agent can achieve from that point onward under the given policy.
- The state-value function $V_{\pi}(s)$ represents the total expected return when starting from the state $s$ and following the policy $\pi$. $$ V_{\pi}(s) = \mathbb{E}_{\pi}[G_t\,|\,s_t=s] $$
- The action-value function $Q_{\pi}(s,a)$ represents the total expected return starting from state $s$, taking action $a$, and then following the policy $\pi$. $$ Q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t\,|\,s_t=s, a_t=a] $$
The state-value function $V_{\pi}(s)$ and the action-value function $Q_{\pi}(s,a)$ are directly connected. Both represent total expected returns under the same policy $\pi$, but with slightly different starting conditions: $V_{\pi}(s)$ assumes we are in state $s$ and follow policy $\pi$ from there, $Q_{\pi}(s,a)$ assumes we are in state $s$, take a specific action $a$, and then follow policy $\pi$ from the next state onward. Their mathematical relationship is the following:
\[V_{\pi}(s) = \sum_{a} \pi(a|s)\;Q_{\pi}(s,a)\]The intuition behind it is that, to get the value of a state $s$, we average the value of all possible actions $a$ from that state, assuming actions are chosen according to $\pi$ (each action value is weighted by how likely the policy $\pi$ will choose that action).
Bellman Equations
By expanding the value functions $V_{\pi}(s)$ and $Q_{\pi}(s,a)$ using their definitions, we derive the Bellman equations. These formulations reveal that the value of a state (or a state–action pair) can be expressed in terms of the values of its successor states.
State-value function $V_{\pi}(s)$
\[\begin{align} V_{\pi}(s) &= \mathbb{E}_{\pi}[G_t\,|\,s_t=s]\\[2pt] &= \mathbb{E}_{\pi}[\textstyle{\sum}_{k=0}^\infty \gamma^k\, r_{t+1+k}\,|\,s_t=s]\\[2pt] &=\mathbb{E}_{\pi}[r_{t+1} + \textstyle{\sum}_{k=1}^\infty \gamma^k\, r_{t+1+k}\,|\,s_t=s]\\[2pt] &=\mathbb{E}_{\pi}[r_{t+1} + \textstyle{\sum}_{k=0}^\infty \gamma^{k+1}\, r_{t+1+k+1}\,|\,s_t=s]\\[2pt] &=\mathbb{E}_{\pi}[r_{t+1} + \gamma \,\textstyle{\sum}_{k=0}^\infty \gamma^k\, r_{t+2+k}\,|\,s_t=s]\\[2pt] &=\mathbb{E}_{\pi}[r_{t+1} + \gamma \,G_{t+1}\,|\,s_t=s]\\[-4pt] &\overset{1}{=} \mathbb{E}_{\pi}[r_{t+1} + \gamma\,\mathbb{E}_{\pi}[G_{t+1}|S_{t+1}]\,|\,s_t=s]\\[-4pt] &\overset{2}{=} \mathbb{E}_{\pi}[r_{t+1} + \gamma \,V_{\pi}(S_{t+1})\,|\,s_t=s] \end{align}\]$^1$ Law of iterated expectations: $\mathbb{E}[G_{t+1} \,|\, s_t = s] = \mathbb{E}\big[\,\mathbb{E}[G_{t+1} \,|\, S_{t+1} ]\,|\,s_t=s\big]$.
$^2$ We write $S_{t+1}$ as a random variable, meaning whatever state $S_{t+1}$ actually turns out to be. If $S_{t+1} = s'$ then $V_{\pi}(S_{t+1}) = V_{\pi}(s')$.
By definition of the mathematical operators involved, we can rewrite the Bellman equation for the state-value function as follows.
Bellman equation for the state-value function $V_{\pi}(s)$ $$ \begin{align} V_{\pi}(s) &= \mathbb{E}_{\pi}[r_{t+1} + \gamma \,V_{\pi}(S_{t+1})\,|\,s_t=s]\\[2pt] &= \textstyle{\sum}_a \pi(a|s) \; \textstyle{\sum}_{s'} \,p(s'|s,a)\big[r(s,a,s') + \gamma\;V_{\pi}(s')\big] \end{align} $$
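When the model is known, this equation can be solved numerically by applying it repeatedly as a fixed-point update (iterative policy evaluation). Below is a minimal sketch on a made-up two-state MDP (the same toy numbers as in the MDP section), evaluated under a uniform-random policy; both the MDP and the policy are illustrative assumptions.

```python
# Iterative policy evaluation: apply the Bellman equation as a fixed-point update.
# The two-state MDP and the uniform-random policy are illustrative assumptions.

states, actions, gamma = ["low", "high"], ["x", "y"], 0.9

p = {  # p[(s, a)][s2] = p(s' | s, a)
    ("low", "x"):  {"low": 0.2, "high": 0.8}, ("low", "y"):  {"low": 0.9, "high": 0.1},
    ("high", "x"): {"low": 0.5, "high": 0.5}, ("high", "y"): {"low": 0.3, "high": 0.7},
}
r = {  # r[(s, a, s2)] = r(s, a, s')
    ("low", "x", "low"): 0.0,  ("low", "x", "high"): 1.0,
    ("low", "y", "low"): 0.5,  ("low", "y", "high"): 0.0,
    ("high", "x", "low"): 2.0, ("high", "x", "high"): 0.0,
    ("high", "y", "low"): 0.0, ("high", "y", "high"): 1.0,
}
pi = {s: {a: 0.5 for a in actions} for s in states}  # uniform-random policy

V = {s: 0.0 for s in states}
for _ in range(200):  # enough sweeps to converge on this tiny MDP
    V = {s: sum(pi[s][a] * sum(p[(s, a)][s2] * (r[(s, a, s2)] + gamma * V[s2])
                               for s2 in states)
                for a in actions)
         for s in states}

print(V)  # state values under the uniform-random policy
```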
Action-value function $Q_{\pi}(s,a)$
\[\begin{align} Q_{\pi}(s,a) &= \mathbb{E}_{\pi}[G_t \,|\, s_t = s, a_t = a] \\[-4pt] &\overset{1}{=} \mathbb{E}_{\pi}[r_{t+1} + \gamma \,V_{\pi}(S_{t+1}) \,|\, s_t = s, a_t = a] \\[-4pt] &\overset{2}{=} \mathbb{E}_{\pi}[r_{t+1} + \gamma \,\mathbb{E}_{\pi}[Q_{\pi}(S_{t+1}, A_{t+1})\,|\,S_{t+1}] \,|\, s_t = s, a_t = a] \\[-4pt] &\overset{3}{=} \mathbb{E}_{\pi}[r_{t+1} + \gamma \,Q_{\pi}(S_{t+1}, A_{t+1}) \,|\, s_t = s, a_t = a] \end{align}\]$^1$ From the expansion that we did for the state-value function.
$^2$ As stated earlier, the mathematical relation between state-value and action-value functions is \(V_{\pi}(s) = \textstyle{\sum}_a \pi(a|s)\;Q_{\pi}(s,a)\) and, by definition, \(\textstyle{\sum}_a \pi(a|s)\;Q_{\pi}(s,a) = \mathbb{E}_{\pi}[Q_{\pi}(s,A)\,|\,s_t=s]\). In terms of fully random variables, this translates to:
\[V_{\pi}(S_{t+1}) = \mathbb{E}_{\pi}[Q_{\pi}(S_{t+1}, A_{t+1})\,|\,S_{t+1}]\]
$^3$ Law of iterated expectations:
\[\mathbb{E}\big[\,\mathbb{E}[Q_{\pi}(S_{t+1}, A_{t+1})\,|\,S_{t+1}]\,|\,s_t=s, a_t=a\big] = \mathbb{E}[Q_{\pi}(S_{t+1}, A_{t+1})\,|\,s_t=s, a_t=a]\]By definition of the mathematical operators involved, we can rewrite the Bellman equation for the action-value function as follows.
Bellman equation for the action-value function $Q_{\pi}(s,a)$ $$ \begin{align} Q_{\pi}(s,a) &= \mathbb{E}_{\pi}[r_{t+1} + \gamma \,Q_{\pi}(S_{t+1}, A_{t+1}) \,|\, s_t = s, a_t = a] \\[2pt] &= \textstyle{\sum}_{s'}\, p(s'|s,a) \big[r(s,a,s') + \gamma\,\textstyle{\sum}_{a'}\,\pi(a'|s')\;Q_{\pi}(s',a')\big] \end{align} $$
Monte Carlo Methods
Solving the Bellman equations requires knowledge of the environment dynamics, i.e. the explicit model of the environment: the transition probabilities $p(s'|s,a)$ and the reward function $r(s,a,s')$.
Monte Carlo (MC) methods offer an alternative by directly estimating value functions from sampled experience, without requiring explicit knowledge of the dynamics. In MC, the agent interacts with the environment to generate complete episodes, i.e. sequences of states, actions, and rewards that terminate at an end state. For each visited state or state-action pair, the return $G_t$ is then calculated as the sum of all rewards obtained from that time step until the episode conclusion:
\[G_t = r_{t+1} + \gamma\,r_{t+2} + \gamma^2\,r_{t+3} + \,..\, + \,\gamma^{T-(t+1)}\,r_{T}\]where $T$ is the final time step of the episode. The Monte Carlo estimate of the state-value function is therefore given by:
\[V_{\pi}(s) \leftarrow \text{average of all } G_t \text{ for episodes where } s_t = s\]The Monte Carlo estimate of the action-value function is given by:
\[Q_{\pi}(s,a) \leftarrow \text{average of all } G_t \text{ for episodes where } (s_t,a_t) = (s,a)\]Let us consider a tiny example. Suppose we have two states \(S = \{\text{low}, \text{high}\}\), two available actions \(A = \{x,y\}\), and we generate a single episode following the given policy $\pi$.
\[(\text{low}, x) \xrightarrow{+1} (\text{high}, y) \xrightarrow{+2} (\text{low}, x) \xrightarrow{+3} \text{ terminal}\]This episode translates into the following trajectory:
\[(\text{low}, x, +1) \rightarrow (\text{high}, y, +2) \rightarrow (\text{low}, x, +3) \rightarrow \text{ terminal}\]We can now compute the returns $G_t$ for each time step (for simplicity $\gamma =1$):
- $t=0: \quad G_0 = 1 + 2 + 3 = 6$
- $t=1: \quad G_1 = 2 + 3 = 5$
- $t=2: \quad G_2 = 3$
Then, we update the state values:
- $\text{low}$ is visited at $t=0, 2$, with returns \(\{6,3\}\) $\Rightarrow V_\pi(\text{low}) = \small\frac{6+3}{2} \normalsize = 4.5$
- $\text{high}$ is visited at $t=1$, with return \(\{5\}\) $\Rightarrow V_\pi(\text{high}) = 5$
and action values:
- $(\text{low}, x)$ is visited at $t=0, 2$, with returns \(\{6,3\}\) $\Rightarrow Q_\pi(\text{low},x) = \small\frac{6+3}{2} \normalsize = 4.5$
- $(\text{high}, y)$ is visited at $t=1$, with return \(\{5\}\) $\Rightarrow Q_\pi(\text{high},y) = 5$
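These averages can be reproduced with a short script; a minimal sketch using every-visit averaging and $\gamma = 1$, matching the computation in the bullets above:

```python
from collections import defaultdict

# Every-visit Monte Carlo estimates for the tiny episode above (gamma = 1).
episode = [("low", "x", 1.0), ("high", "y", 2.0), ("low", "x", 3.0)]  # (s_t, a_t, r_{t+1})
gamma = 1.0

returns_v = defaultdict(list)   # state           -> list of returns G_t
returns_q = defaultdict(list)   # (state, action) -> list of returns G_t

g = 0.0
for s, a, r in reversed(episode):   # walk backwards: G_t = r_{t+1} + gamma * G_{t+1}
    g = r + gamma * g
    returns_v[s].append(g)
    returns_q[(s, a)].append(g)

V = {s: sum(gs) / len(gs) for s, gs in returns_v.items()}
Q = {sa: sum(gs) / len(gs) for sa, gs in returns_q.items()}
print(V)  # {'low': 4.5, 'high': 5.0}
print(Q)  # {('low', 'x'): 4.5, ('high', 'y'): 5.0}
```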
The advantages of Monte Carlo methods are that they do not require knowledge of the environment dynamics, and they provide unbiased estimates of the value functions as the number of episodes grows large. The only assumption needed is the ability to sample episodes under a given policy. However, MC methods must wait until the end of an episode to update the value estimates, which can be inefficient for long episodes. Methods that wait for complete data before updating are called offline.
Temporal Difference Learning
Offline methods like Monte Carlo have the advantage of more stable updates, since the full return is used. As a downside, they are slower to learn and cannot update mid-episode. Online methods address this limitation by updating estimates immediately after each interaction with the environment. Their downside is that the updates are noisy, since they rely on single steps, but the advantage is that they can start learning while still collecting data.
Temporal Difference (TD) learning is an online framework that updates the value-function estimates after a small number of steps, based on the observed rewards plus the current estimate of the value function at the state reached. The strategy of updating estimates based on other current estimates is called bootstrapping.
TD learning methods are characterized by the number of steps they look ahead, denoted by $n$, defining TD($n$) algorithms. The simplest case is TD($0$), where the value of a state (or state–action pair) is updated using the immediate reward plus the estimated value of the next state (just one step ahead). TD($n$) generalizes this idea: the value function is updated using an $n$-step return, which combines the actual rewards collected over the next $n$ steps with the estimated value of the state reached after those $n$ steps. In other words, TD($n$) looks $n$ steps into the future before bootstrapping with the value function.
TD learning can be used for both the state-value function $V_\pi(s)$ and the action-value function $Q_\pi(s,a)$. What follows explains the method in terms of the state-value function.
$\small\pmb{\text{TD(0)}}$
For a fixed policy $\pi$, the Bellman equation for the state value is:
\[V_\pi(s) = \mathbb{E}_{\pi}[r_{t+1} + \gamma\,V_\pi(S_{t+1}) \,|\, s_t=s]\]At time $t$, we observe the state $s_t$, take an action $a_t \sim \pi(\cdot|s_t)$, then receive a reward $r_{t+1}$ and land in the next state $s_{t+1}$. From this single transition we get one sample of what the Bellman equation says $V_\pi(s_t)$ should equal:
\[\hat{V}_{\substack{\text{one-step}\\\text{TD target}}}(s_t) = r_{t+1} + \gamma\,V_\pi(s_{t+1})\]Notice that $V_\pi(s_{t+1})$ is available as an estimate of the value function in $s_{t+1}$ from previous iterations. The difference between the current estimate and the sampled target is defined as temporal difference error:
\[\delta_t = \underbrace{r_{t+1} + \gamma\,V_\pi(s_{t+1})\\[4pt]}_{\text{TD target}} - \underbrace{V_\pi(s_t)\\[4pt]}_{\substack{\text{current}\\\text{estimate}}}\]This error quantifies how surprising the observed transition is compared to the expectation, i.e. the current estimate. Using this error, we update the estimate of $V_\pi(s_t)$ gradually, with the magnitude of the adjustment controlled by a step size $\alpha \in (0,1]$ denoted as learning rate.
\[V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,\delta_t\]This brings us to the final form of the update rule for the state value function via TD($0$):
\[V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha \,\big(\underbrace{\underbrace{r_{t+1} + \gamma\,V_\pi(s_{t+1})\\[4pt]}_{\text{TD target}} - V_\pi(s_t)}_{\text{TD error}}\big)\]Let us now go back to the tiny example. The generated episode was:
\[(\text{low}, x) \xrightarrow{+1} (\text{high}, y) \xrightarrow{+2} (\text{low}, x) \xrightarrow{+3} \text{ terminal}\]Let us consider $\alpha = 0.5$, and again for simplicity $\gamma =1$. In the TD($0$) framework, we are able to update $V_\pi(\cdot)$ after each step. In the beginning, we initialize $V_\pi(\text{low}) = 0$ and $V_\pi(\text{high}) = 0$.
\(\begin{align}
\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=0: \quad V_\pi(\text{low})
&\leftarrow V_\pi(\text{low}) + \alpha \,\big(r_1 + \gamma\,V_\pi(\text{high}) - V_\pi(\text{low})\big)\\
&\leftarrow 0 + 0.5 \, \big(1 + 0 - 0\big) = 0.5
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=1: \quad V_\pi(\text{high})
&\leftarrow V_\pi(\text{high}) + \alpha \,\big(r_2 + \gamma\,V_\pi(\text{low}) - V_\pi(\text{high})\big)\\
&\leftarrow 0 + 0.5 \, \big(2 + 0.5 - 0\big) = 1.25
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=2: \quad V_\pi(\text{low})
&\leftarrow V_\pi(\text{low}) + \alpha \,\big(r_3 + \gamma\,V_\pi(\text{terminal}) - V_\pi(\text{low})\big)\\
&\leftarrow 0.5 + 0.5 \, \big(3 + 0 - 0.5\big) = 1.75
\end{align}\)
The estimated values of the states are quite different from the Monte Carlo estimates. MC uses the full return, so its first estimate after one episode already reflects the complete episode; TD(0) is more gradual and, since it bootstraps, it initially underestimates future rewards when the value function starts at 0. If we collect many episodes, TD(0) will converge toward the true values.
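The three TD($0$) updates above can be reproduced with a short loop; a minimal sketch, where the terminal state is given value 0 by convention:

```python
# TD(0) state-value updates for the tiny episode above (alpha = 0.5, gamma = 1).
transitions = [("low", 1.0, "high"), ("high", 2.0, "low"), ("low", 3.0, "terminal")]
alpha, gamma = 0.5, 1.0

V = {"low": 0.0, "high": 0.0, "terminal": 0.0}  # terminal state is worth 0
for s, r, s_next in transitions:
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error            # online update, one transition at a time
    print(s, "->", V[s])

# Prints 0.5, 1.25, 1.75 for low, high, low — matching the hand computation.
```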
$\small\pmb{\text{TD(n)}}$
Instead of using only the immediate reward and the value of the following state, TD($n$) accumulates the next $n$ rewards and then bootstraps from the state reached at time $t+n$. The $n$-step return is defined as:
\[G_t^{(n)} = r_{t+1} + \gamma\,r_{t+2} + \gamma^2\,r_{t+3} + \,..\, + \gamma^{n-1} r_{t+n} + \gamma^n\,V_\pi(s_{t+n})\]This combines the actual rewards received in the next $n$ steps and the bootstrapped estimate of the value after those $n$ steps. The corresponding update rule for TD($n$) is:
\[V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha\,(G_t^{(n)} - V_\pi(s_t))\]For $n=1$, this reduces to TD($0$). For large $n$ in episodic tasks (when the episode ends before the $n$-step horizon is reached) $\textstyle{G_t\small{^{(n)}}}$ approaches the full Monte Carlo return. Thus, TD($n$) provides a spectrum of methods that interpolate between pure bootstrapping (TD($0$)) and pure sampling (Monte Carlo).
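A minimal sketch of the $n$-step return, assuming the rewards of one episode and the current value estimates are stored in lists:

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n rewards starting at r_{t+1}, then bootstrap with V(s_{t+n}).

    rewards[i] is r_{i+1} (the reward after step i); values[i] is V(s_i).
    If the episode ends before t + n, this reduces to the Monte Carlo return.
    """
    T = len(rewards)                          # episode length
    g = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:                             # bootstrap only if s_{t+n} is not terminal
        g += gamma ** n * values[t + n]
    return g

# Tiny check on the episode used above (gamma = 1, V initialized to 0 everywhere):
rewards, values = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
print(n_step_return(rewards, values, t=0, n=1, gamma=1.0))  # 1.0  (TD(0) target)
print(n_step_return(rewards, values, t=0, n=3, gamma=1.0))  # 6.0  (Monte Carlo return)
```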
SARSA
While TD methods are useful for prediction, i.e. estimating the value functions for a given policy $\pi$, in most real world scenarios we are interested in control, that is, finding the optimal policy that maximizes cumulative reward. Control problems require not only evaluating a policy, but also improving it over time.
SARSA is a model-free, on-policy, temporal difference algorithm that learns the action value $Q_\pi(s,a)$ for the current policy $\pi$, and uses it to improve the policy gradually.
The term on-policy refers to a type of learning where the policy being learned about (target policy) is the same as the policy used to generate data (behavior policy). This means that SARSA learns about the policy that the agent behaves according to. In contrast, in off-policy learning the behavior policy is different from the target policy: an off-policy algorithm is able to learn about one policy while having the agent behaving according to another.
SARSA's core idea is to learn the action-value function $Q_\pi(s,a)$ for the current policy $\pi$ and update it as the agent interacts with the environment. At each step $t$, the agent acts according to the policy $\pi$, observes the next state $s_{t+1}$, and selects the next action $a_{t+1} \sim \pi(\cdot|s_{t+1})$.
The update rule is:
\[Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \,\big(r_{t+1} + \gamma\,Q_\pi(s_{t+1}, a_{t+1}) - Q_\pi(s_t, a_t)\big)\]This is simply TD($0$) applied to the action-value function.
Every update of $Q_\pi(s,a)$ is implicitly used to update the current policy according to a greedy or $\varepsilon$-greedy approach. In the greedy approach, the policy is defined for every state $s$ as the action $a$ that maximizes $Q_\pi(s,a)$:
\[\pi'(s) = \arg\max_a Q_\pi(s,a)\]In the $\varepsilon$-greedy approach, the policy selects for every state $s$ the action $a$ that maximizes $Q_\pi(s,a)$ with probability $1-\varepsilon$, and a random action with probability $\varepsilon$:
\[\pi'(s) = \begin{cases} \;\;\arg\max_a Q_\pi(s,a) & \quad\text{with probability } 1-\varepsilon\\[1pt] \;\;\text{random action} & \quad\text{with probability } \varepsilon \end{cases}\]The $\varepsilon$-greedy approach allows the agent to keep exploring the environment.
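A minimal sketch of $\varepsilon$-greedy action selection over a Q-table stored as a Python dictionary; the Q-values below are placeholders just to exercise the function:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick argmax_a Q[(state, a)] with probability 1 - epsilon, a random action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit

# Placeholder Q-values, not learned from anything:
Q = {("low", "x"): 1.75, ("low", "y"): -0.5, ("high", "x"): 0.0, ("high", "y"): 1.25}
print(epsilon_greedy(Q, "low", ["x", "y"], epsilon=0.1))  # usually 'x', occasionally 'y'
```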
At every step $t$ we get an updated estimate of the action-value function $Q_\pi(s,a)$ for the state $s_t$ and the chosen action $a_t\sim\pi$. If $Q_\pi(s,a)$ changes, the policy $\pi$ is automatically updated by its definition (greedy or $\varepsilon$-greedy). The agent continues interacting with the environment following the updated policy, generating new samples that are used to update again $Q_{\pi}(s,a)$. This cycle of on-policy evaluation and policy improvement continues until the policy stabilizes, eventually converging to the optimal policy $\pi^*$.
- Initialize $Q$-table arbitrarily (e.g. zeros)
- Initialize $\alpha$, $\varepsilon$, $\gamma$
- Define a starting policy $\pi$ (e.g. random with $\varepsilon$-greedy)
- Repeat until convergence:
  - For each episode:
    - Start the agent at state $s_0$
    - Take action $a$ according to $\pi$ ($\varepsilon$-greedy)
    - Repeat until $s$ is terminal:
      - Observe reward $r$, transition to the next state $s'$
      - Choose the next action $a'$ according to $\pi$ ($\varepsilon$-greedy) at $s'$
      - Update $Q_\pi(s,a)$:
        $$Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha \,\big( r + \gamma\, Q_\pi(s',a') - Q_\pi(s,a) \big)$$
      - Update policy $\pi$ as $\varepsilon$-greedy over updated $Q_\pi(s,a)$
      - Move to the next state: $s \leftarrow s'$, $a \leftarrow a'$
    - Decay $\alpha$ and $\varepsilon$
- Return optimal $Q$-table and derived optimal policy $\pi^*$
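Expressed in Python, the loop above could look roughly like the sketch below. The `env` object, with `reset()` returning the initial state and `step(action)` returning the next state, the reward, and a done flag, is a hypothetical Gym-style interface, not code from the linked repository.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=1000, alpha=0.5, gamma=0.99, epsilon=0.1):
    """On-policy TD control (SARSA). `env` is a hypothetical Gym-style environment:
    env.reset() -> s0, env.step(a) -> (s_next, reward, done)."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0

    def policy(s):  # epsilon-greedy over the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next) if not done else None
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
            s, a = s_next, a_next
    return Q
```

For brevity the sketch keeps $\alpha$ and $\varepsilon$ fixed; in practice they are usually decayed over episodes, as in the pseudocode.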
Let us now go back to the tiny example. The generated episode was:
\[(\text{low}, x) \xrightarrow{+1} (\text{high}, y) \xrightarrow{+2} (\text{low}, x) \xrightarrow{+3} \text{ terminal}\]Let us consider $\alpha = 0.5$, and again for simplicity $\gamma =1$. For additional simplicity, let us focus on the greedy approach, meaning that $\varepsilon=0$. At the beginning, we initialize all $Q_\pi(s,a) = 0$.
\(\begin{align}
\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=0: \quad Q_\pi(\text{low}, x)
&\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_1 + \gamma\,Q_\pi(\text{high},y) - Q_\pi(\text{low},x)\big)\\
&\leftarrow 0 + 0.5 \, \big(1 + 0 - 0\big) = 0.5
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=1: \quad Q_\pi(\text{high}, y)
&\leftarrow Q_\pi(\text{high},y) + \alpha \,\big(r_2 + \gamma\,Q_\pi(\text{low},x) - Q_\pi(\text{high},y)\big)\\
&\leftarrow 0 + 0.5 \, \big(2 + 0.5 - 0\big) = 1.25
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=2: \quad Q_\pi(\text{low}, x)
&\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_3 + \gamma\,Q_\pi(\text{terminal},\,\cdot\,) - Q_\pi(\text{low},x)\big)\\
&\leftarrow 0.5 + 0.5 \, \big(3 + 0 - 0.5\big) = 1.75
\end{align}\)
$Q$-table evolution over time:
Step | ${Q_\pi(\text{low},x)}$ | ${Q_\pi(\text{low},y)}$ | ${Q_\pi(\text{high},x)}$ | ${Q_\pi(\text{high},y)}$ |
---|---|---|---|---|
Init | $0$ | $0$ | $0$ | $0$ |
$t=0$ | $0.5$ | $0$ | $0$ | $0$ |
$t=1$ | $0.5$ | $0$ | $0$ | $1.25$ |
$t=2$ | $1.75$ | $0$ | $0$ | $1.25$ |
Policy $\pi$ evolution over time with greedy approach:
Step | ${\pi(\text{low})}$ | ${\pi(\text{high})}$ |
---|---|---|
Init | tie $(x,y)$ | tie $(x,y)$ |
$t=0$ | $x$ | tie $(x,y)$ |
$t=1$ | $x$ | $y$ |
$t=2$ | $x$ | $y$ |
Q-Learning
Another approach for control is Q-learning, a model-free, off-policy, temporal difference algorithm that directly learns the optimal action-value function $Q^{*}(s,a)$. This function represents the maximum expected return achievable from state $s$ by taking action $a$ and then following the best possible policy. Once $Q^*(s,a)$ is learned, the optimal policy $\pi^*$ is obtained by selecting in each state the action that greedily maximizes $Q^*(s,a)$.
\[\pi^*(s) = \arg\max_a Q^*(s,a)\]The core of Q-learning is the Bellman optimality equation for action values:
\[Q^*(s,a) = \mathbb{E}[r_{t+1} + \gamma\,\max_{a'} Q^*(S_{t+1}, a') \,|\, s_t=s, a_t=a]\]The difference with the policy evaluation Bellman equation is that, instead of taking an expectation over the next action under the current policy $\pi$, we take a maximum over all possible next actions. This substitution is what makes Q-learning an off-policy method: its updates are always directed toward the optimal action values, regardless of which behavior policy generated the data. In other words, Q-learning is able to learn about the best policy even if its agents behave according to another policy.
The Bellman optimality equation defines $Q^*$ but, in practice, solving it directly is infeasible for two reasons: it requires the environment dynamics and, even if the dynamics were known, computing the expectation over all possible next states and rewards becomes expensive in large spaces.
Q-learning overcomes these issues by using a sample-based temporal difference update. The structure of the update rule is the same as in SARSA, with the difference that in SARSA the action in $s_{t+1}$ is picked according to the ongoing policy $\pi$, while in Q-learning the action is picked according to what maximizes the expected return from $s_{t+1}$:
\[Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \, \big(r_{t+1} + \gamma \,\max_{a'}Q_\pi(s_{t+1}, a') - Q_\pi(s_t, a_t)\big)\]Intuitively, the update pulls the current Q-value toward a better estimate based on the reward just observed plus the estimated value of the best next action. Repeating this over many episodes gradually drives the estimates $Q_\pi(s,a)$ toward the true optimal action-value function $Q^*(s,a)$.
- Initialize $Q$-table arbitrarily (e.g. zeros)
- Initialize $\alpha$, $\varepsilon$, $\gamma$
- Define a starting policy $\pi$ (e.g. random with $\varepsilon$-greedy)
- Repeat until convergence:
  - For each episode:
    - Start the agent at state $s_0$
    - Repeat until $s$ is terminal:
      - Take action $a$ according to $\pi$ ($\varepsilon$-greedy)
      - Observe reward $r$, transition to the next state $s'$
      - Update $Q_\pi(s,a)$:
        $$Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha \,\big( r + \gamma\, \textstyle{\max_{a'}}\,Q_\pi(s',a') - Q_\pi(s,a)\big)$$
      - Update policy $\pi$ as $\varepsilon$-greedy over updated $Q_\pi(s,a)$
      - Move to the next state: $s \leftarrow s'$
    - Decay $\alpha$ and $\varepsilon$
- Return optimal $Q$-table and derived optimal policy $\pi^*$
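The corresponding Q-learning loop differs from the SARSA sketch only in the target: it bootstraps from the greedy action in $s'$ rather than from the action actually taken next. As before, `env` is a hypothetical Gym-style interface, not code from the linked repository.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.5, gamma=0.99, epsilon=0.1):
    """Off-policy TD control (Q-learning). `env` is a hypothetical Gym-style environment:
    env.reset() -> s0, env.step(a) -> (s_next, reward, done)."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily...
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # ...but learn from the greedy action in s_next.
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Q-learning update
            s = s_next

    # Greedy policy derived from the learned Q-values
    seen = {s for (s, _) in Q}
    policy = {st: max(actions, key=lambda x: Q[(st, x)]) for st in seen}
    return Q, policy
```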
Let us now reuse the same tiny example.
\[(\text{low}, x) \xrightarrow{+1} (\text{high}, y) \xrightarrow{+2} (\text{low}, x) \xrightarrow{+3} \text{ terminal}\]Let $\alpha = 0.5$, $\gamma =1$, $\varepsilon=0$ (greedy approach). At the beginning, we initialize all $Q_\pi(s,a) = 0$.
\(\begin{align}
\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=0: \quad Q_\pi(\text{low}, x)
&\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_1 + \gamma\, \textstyle{\max_{a'}}\, Q_\pi(\text{high},a') - Q_\pi(\text{low},x)\big)\\
&\leftarrow 0 + 0.5 \,(1 + 0 - 0) = 0.5
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=1: \quad Q_\pi(\text{high}, y)
&\leftarrow Q_\pi(\text{high},y) + \alpha \,\big(r_2 + \gamma\, \textstyle{\max_{a'}}\, Q_\pi(\text{low},a') - Q_\pi(\text{high},y)\big)\\
&\leftarrow 0 + 0.5 \,(2 + 0.5 - 0) = 1.25
\end{align}\)
\(\begin{align}
\\[-20pt]\quad\,\mathrel{\raise 1pt\hbox{$\scriptsize\bullet$}}
\normalsize\,\, t=2: \quad Q_\pi(\text{low}, x)
&\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_3 + \gamma\, \textstyle{\max_{a'}}\, Q_\pi(\text{terminal},a') - Q_\pi(\text{low},x)\big)\\
&\leftarrow 0.5 + 0.5 \,(3 + 0 - 0.5) = 1.75
\end{align}\)
Q-table evolution over time:
Step | ${Q(\text{low},x)}$ | ${Q(\text{low},y)}$ | ${Q(\text{high},x)}$ | ${Q(\text{high},y)}$ |
---|---|---|---|---|
Init | $0$ | $0$ | $0$ | $0$ |
$t=0$ | $0.5$ | $0$ | $0$ | $0$ |
$t=1$ | $0.5$ | $0$ | $0$ | $1.25$ |
$t=2$ | $1.75$ | $0$ | $0$ | $1.25$ |
Policy $\pi$ evolution over time with greedy approach:
Step | ${\pi(\text{low})}$ | ${\pi(\text{high})}$ |
---|---|---|
Init | tie $(x,y)$ | tie $(x,y)$ |
$t=0$ | $x$ | tie $(x,y)$ |
$t=1$ | $x$ | $y$ |
$t=2$ | $x$ | $y$ |
In this simple example, the Q-learning updates are numerically identical to SARSA, because the action taken in $s'$ happened to also be the maximizing action. To obtain different values, let us add one more step after $t=2$, so that the full episode becomes:
\[(\text{low}, x) \xrightarrow{+1} (\text{high}, y) \xrightarrow{+2} (\text{low}, x) \xrightarrow{+3} (\text{low}, y) \xrightarrow{-1} \text{ terminal}\]At $t=3$, the agent takes $y$ at state $\text{low}$, which is not the greedy action, since by then $Q(\text{low},x)$ is higher. In our simplified setting we assumed $\varepsilon=0$, so such behavior would not occur. However, if $\varepsilon > 0$, exploratory (non-greedy) actions are possible.
- SARSA updates toward the action actually chosen ($y$)
  \(\begin{align} t=2: \quad Q_\pi(\text{low}, x) &\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_3 + \gamma\, Q_\pi(\text{low}, y) - Q_\pi(\text{low},x)\big)\\ &\leftarrow 0.5 + 0.5 \,(3 + 0 - 0.5) = 1.75 \end{align}\\[16pt]\) \(\begin{align} t=3: \quad Q_\pi(\text{low}, y) &\leftarrow Q_\pi(\text{low},y) + \alpha \,\big(r_4 + \gamma\, Q_\pi(\text{terminal}, \,\cdot\,) - Q_\pi(\text{low},y)\big)\\ &\leftarrow 0 + 0.5 \,(-1 + 0 - 0) = -0.5 \end{align}\)
- Q-learning updates toward the best possible next action at $\text{low}$
  \(\begin{align} t=2: \quad Q_\pi(\text{low}, x) &\leftarrow Q_\pi(\text{low},x) + \alpha \,\big(r_3 + \gamma\, \textstyle{\max_{a'}}\, Q_\pi(\text{low},a') - Q_\pi(\text{low},x)\big)\\ &\leftarrow 0.5 + 0.5 \,(3 + 0.5 - 0.5) = 2.0 \end{align}\\[16pt]\) \(\begin{align} t=3: \quad Q_\pi(\text{low},y) &\leftarrow Q_\pi(\text{low},y) + \alpha \,\big(r_4 + \gamma\, \textstyle{\max_{a'}}\, Q_\pi(\text{terminal},a') - Q_\pi(\text{low},y)\big)\\ &\leftarrow 0 + 0.5 \,(-1 + 0 - 0) = -0.5 \end{align}\\[10pt]\)
Final Q-table for SARSA and Q-learning:
Algorithm | ${Q(\text{low},x)}$ | ${Q(\text{low},y)}$ | ${Q(\text{high},x)}$ | ${Q(\text{high},y)}$ |
---|---|---|---|---|
SARSA | $1.75$ | $-0.5$ | $0$ | $1.25$ |
Q-learning | $2.0$ | $-0.5$ | $0$ | $1.25$ |
Final policy $\pi$ for SARSA and Q-learning:
Algorithm | ${\pi(\text{low})}$ | ${\pi(\text{high})}$ |
---|---|---|
SARSA | $x$ | $y$ |
Q-learning | $x$ | $y$ |
The final policy is the same in the two cases. The difference lies in the value estimates. For SARSA $Q(\text{low},x) = 1.75$ is lower, because it punished $x$ for leading to a bad outcome. For Q-learning $Q(\text{low},x) = 2.0$ is higher, because it assumed that the agent would act optimally from then on, ignoring that the policy really chose $y$. This is the textbook distinction between SARSA, which evaluates the policy it is actually following, and Q-learning, which evaluates the optimal policy, even if it’s not followed.
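The diverging values in the tables can be checked with a few lines; a minimal sketch of the $t=2$ and $t=3$ updates, with the terminal value taken as 0:

```python
alpha, gamma = 0.5, 1.0
# Q-values just before t = 2, as in the tables above
Q = {("low", "x"): 0.5, ("low", "y"): 0.0, ("high", "x"): 0.0, ("high", "y"): 1.25}
r3, r4 = 3.0, -1.0

# SARSA at t=2 bootstraps from the action actually taken next, (low, y)
sarsa_low_x = Q[("low", "x")] + alpha * (r3 + gamma * Q[("low", "y")] - Q[("low", "x")])

# Q-learning at t=2 bootstraps from the best action available at 'low'
best_low = max(Q[("low", "x")], Q[("low", "y")])
qlearning_low_x = Q[("low", "x")] + alpha * (r3 + gamma * best_low - Q[("low", "x")])

# At t=3 both algorithms make the same update (the terminal state is worth 0)
low_y = Q[("low", "y")] + alpha * (r4 + gamma * 0.0 - Q[("low", "y")])

print(sarsa_low_x, qlearning_low_x, low_y)  # 1.75 2.0 -0.5
```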
Reinforcement Learning Saga — Part I: From zero to Q-learning
- State-value function
  \(V_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t=s]\)
- Action-value function
  \(Q_\pi(s,a) = \mathbb{E}_\pi[G_t\mid s_t=s,a_t=a]\)
- Bellman equations
  \(V_\pi(s) = \mathbb{E}_\pi[r_{t+1} + \gamma\,V_\pi(S_{t+1})\,|\,s_t=s]\)
  \(Q_\pi(s,a) = \mathbb{E}_\pi[r_{t+1} + \gamma\,Q_\pi(S_{t+1}, A_{t+1})\,|\,s_t=s, a_t=a]\)
- Monte Carlo
  \(V_\pi(s_t) \approx \frac{1}{N} \sum_t G_t\)
- TD(0) sample-based estimate
  \(V_\pi(s_t) \approx r_{t+1} + \gamma\, V_\pi(s_{t+1})\)
- TD(0) update
  \(V_\pi(s_t) \leftarrow V_\pi(s_t) + \alpha \,(r_{t+1} + \gamma \,V_\pi(s_{t+1}) - V_\pi(s_t))\)
- SARSA update
  \(Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha \,\big( r + \gamma\, Q_\pi(s',a') - Q_\pi(s,a) \big)\)
- Q-learning update
  \(Q_\pi(s,a) \leftarrow Q_\pi(s,a) + \alpha \,\big( r + \gamma\, \textstyle{\max_{a'}}\,Q_\pi(s',a') - Q_\pi(s,a) \big)\)