In Progress!!! 👇
What piqued my curiosity to investigate why $Q$-learning and SARSA converge was the realization that these methods use
their own action-value estimates at time step $t$ to build the update targets for the estimates at time step $t + 1$.
This sounded really strange to me, as if convergence should not happen at all. So let's dive into what's going on.
Facts
The law of large numbers still holds for ergodic processes. Even with correlated samples, averages converge under mild mixing conditions (this is why you can estimate expectations from Markov chains; it is the same principle as MCMC). That's why Monte Carlo methods do not actually need an i.i.d. assumption.
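Here is a minimal sketch of that fact (a toy two-state chain of my own, nothing RL-specific): the running average of a function along a single, correlated trajectory still converges to the stationary expectation.

```python
# A minimal sketch (my own toy 2-state chain): the running average of f along
# a single correlated trajectory converges to the stationary expectation.
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.9, 0.1],      # ergodic (irreducible, aperiodic) transition matrix
              [0.2, 0.8]])
f = np.array([1.0, 5.0])       # f(state)

mu = np.array([2/3, 1/3])      # stationary distribution: mu P = mu
print("true expectation:", mu @ f)       # 7/3 ~ 2.333

x, total, n = 0, 0.0, 200_000
for _ in range(n):                       # one long trajectory: samples are correlated
    total += f[x]
    x = rng.choice(2, p=P[x])
print("empirical average:", total / n)   # close to 2.333
```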
Also, in RL, we often assume episodes are independent because the environment resets (to a start state distribution). This makes the return samples conditionally i.i.d. given the start state.
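A hedged sketch of this second fact, with a made-up episodic toy process: every episode starts from the same start state, so the per-episode returns are i.i.d. and their plain average estimates the start-state value.

```python
# A minimal sketch (toy episodic process of my own): every episode resets to
# the same start state, so the returns G_1, G_2, ... are i.i.d. and their
# sample mean estimates v_pi(s0).
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

def episode_return():
    """Reward 1 per step; the episode terminates with probability 0.25 each step."""
    G, discount = 0.0, 1.0
    while True:
        G += discount
        if rng.random() < 0.25:
            return G
        discount *= gamma

returns = [episode_return() for _ in range(50_000)]
print(np.mean(returns))   # ~ 1 / (1 - gamma * 0.75) ~ 3.08
```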
Assumptions
The MDP is ergodic (aperiodic and irreducible), and the policy the agent follows is held fixed for a long time. Under these two assumptions, if the agent chooses actions according to a Markov policy $\pi$, we should expect the observed tuples to roughly follow the stationary distribution of the Markov chain induced by $\pi$ (roughly, because in practice the assumptions are never met exactly).
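As a sanity check on that claim, here is a sketch with a made-up two-state, two-action MDP and a fixed stochastic policy $\pi$: the empirical state-visit frequencies along one long trajectory approach the stationary distribution of the induced chain $P_\pi$.

```python
# A minimal sketch (made-up 2-state, 2-action MDP): under a fixed Markov policy
# pi, long-run state-visit frequencies match the stationary distribution of the
# induced state chain P_pi.
import numpy as np

rng = np.random.default_rng(2)

# P[a, s, s'] = probability of moving from s to s' under action a.
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
pi = np.array([[0.6, 0.4],     # pi(a | s), rows sum to 1
               [0.5, 0.5]])

# Induced chain: P_pi[s, s'] = sum_a pi(a | s) * P[a, s, s'].
P_pi = np.einsum("sa,asn->sn", pi, P)

# Stationary distribution = left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

# Empirical visit frequencies from one long trajectory under pi.
counts, s = np.zeros(2), 0
for _ in range(100_000):
    counts[s] += 1
    a = rng.choice(2, p=pi[s])
    s = rng.choice(2, p=P[a, s])
print("stationary:", mu, " empirical:", counts / counts.sum())
```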
Stochastic approximation
Why do we need stochastic approximation techniques? To handle stochasticity. I'll show that $Q$-learning and SARSA are instantiations of a stochastic approximation scheme, so I'll spend almost the whole document on stochastic approximation; the results for $Q$-learning and SARSA then follow almost immediately.
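Before the RL-specific updates, here is the prototypical stochastic approximation (Robbins-Monro) iteration in a hedged toy form: averaging noisy samples with decaying step sizes. The bootstrapped targets in $Q$-learning and SARSA play the role of the noisy sample $Y_n$ here.

```python
# A minimal Robbins-Monro sketch (my own illustration): x <- x + alpha_n (Y_n - x)
# with noisy samples Y_n of an unknown mean. With sum(alpha_n) = inf and
# sum(alpha_n^2) < inf (e.g. alpha_n = 1/n), x converges to E[Y_n] = 3.
import numpy as np

rng = np.random.default_rng(3)

x = 0.0
for n in range(1, 100_001):
    y = 3.0 + rng.normal()     # noisy observation of the target quantity
    x += (1.0 / n) * (y - x)   # with alpha_n = 1/n this is exactly the running mean
print(x)                       # close to 3.0
```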
$Q$-learning
The update rule:
\(Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha_t(S_t, A_t)\left[R_{t + 1} + \gamma \max_a Q_t(S_{t + 1}, a) - Q_t(S_t, A_t) \right]\).
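A minimal tabular sketch of this update (toy sizes and a hypothetical transition of my own choosing; terminal states simply zero out the bootstrap term):

```python
# One Q-learning update for a transition (S_t, A_t, R_{t+1}, S_{t+1}).
import numpy as np

gamma = 0.9
Q = np.zeros((3, 2))   # 3 states, 2 actions (arbitrary toy sizes)

def q_learning_update(Q, s, a, r, s_next, alpha, done=False):
    """Off-policy target: bootstrap with max over a' of Q(S_{t+1}, a')."""
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (r + bootstrap - Q[s, a])

# Hypothetical transition: S_t=0, A_t=1, R_{t+1}=1.0, S_{t+1}=2, alpha=0.1.
q_learning_update(Q, 0, 1, 1.0, 2, alpha=0.1)
print(Q[0, 1])   # 0.1, since Q started at zero
```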
SARSA
The update rule:
\(Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha_t(S_t, A_t)\left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) - Q_t(S_t, A_t) \right]\).
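The same kind of sketch for SARSA; the only change is that the target bootstraps with the action $A_{t+1}$ actually taken by the behavior policy rather than the max:

```python
# One SARSA update for a tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}).
import numpy as np

gamma = 0.9
Q = np.zeros((3, 2))   # same arbitrary toy sizes as above

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, done=False):
    """On-policy target: bootstrap with Q(S_{t+1}, A_{t+1})."""
    bootstrap = 0.0 if done else gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (r + bootstrap - Q[s, a])

# Hypothetical tuple: S_t=0, A_t=1, R_{t+1}=1.0, S_{t+1}=2, A_{t+1}=0, alpha=0.1.
sarsa_update(Q, 0, 1, 1.0, 2, 0, alpha=0.1)
print(Q[0, 1])   # 0.1, since Q started at zero
```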