Journal

The life as a grad student taught me that I should maintain this page very seriously.

2025

Some cool questions

27 minute read

Published: October 31, 2025

$\begin{align*} \newcommand{\I}{\mathbb{1}} \newcommand{\R}{\mathbb{R}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\N}{\mathbb{N}} \DeclareMathOperator{\EE}{\mathbb{E}} \DeclareMathOperator{\PP}{\mathbb{P}} \newcommand{\Ev}[1]{\EE\left[ #1 \right]} \newcommand{\Pr}[1]{\PP\left( #1 \right)} \end{align*}$

An elementary theorem in number theory states that if two integers $m$ and $n$ are relatively prime (i.e., greatest common divisor equal to 1 ), then there exist integers $x$ and $y$ (positive or negative) such that $mx + ny = 1$. Using this theorem show the following:
A If $m$ and $n$ are relatively prime then the set {$xm + ny : x, y$ positive integers} contains all but a finite number of the positive integers.
B. Let $J$ be a set of nonnegative integers whose greatest common divisor is $d$. Suppose also that $J$ is closed under addition, $m, n \in J \Rightarrow m + n \in J$. Then $J$ contains all but a finite number of integers in the set {$0, d, 2d,\dots$}.

Source: Introduction to Stochastic Processes

answer.

Part (A).

First, note that if we want to contain only positive integers, then $n$ and $m$ must be nonnegative. If one of them is negative and the other is positive, we would inevitably contain the whole integers. Also, to rule out the trivial case, let us assume $n, m \geq 1$.

Since $\mathsf{gcd} (n, m) = 1$, the great common divisor of any multiples of $n$ from $1n$ to $mn$ with $m$ is also one. In other words, the set {$n, 2n, \dots, mn$} forms the residue system modulo $m$. Let $N$ be any nonzero integer, we have $N \equiv ny \pmod m$, where $1 \leq y \leq m$. Equivalently, we can say $N - ny = mx$ for some integer $x$.

We know that since $1 \leq y \leq m$, $y$ is positive. On the other hand, if $N \geq mn + 1$, then $x = \frac{N - ny}{m} > 0$, hence $x$ is also positive. Therefore, we just proved that for positive integers $x$ and $y$, all positive integers bigger than $mn$ are in the set described in the question.

Part (B).

Consider the set $J’ =$ {$\frac{j}{d} \mid j \in J$}. If we show for any integer $k$ in $J’$ that is bigger than some threshold $K$, all but a finite number of nonnegative integers are in $J’$, then we can conclude for any integer $k$ in $J$ that is bigger than $Kd$, all but a finite number of nonnegative integers are in $J$

By construction, the greatest common divisor of $J’$s elements is one and also $J’$ is closed under addition. Now, consider an arbitrary element in $J’$ and let us denote it by $a$. The set of remainder of $J’$s elements when divided by $a$ is $R = $ {$j’ \pmod{a} \mid j’ \in J’$}. Since the greatest common divisor of $J’$ is one, the greatest common divisor of $R$ is also one. Therefore, $R$ forms the residue system modulo $a$, i.e., $R = $ {$0, 1, \dots a - 1$}, where for each element of $r \in R$ there exits at least one element in $J’$ denoted by $j’_r$ such that $j’_r \equiv r \pmod{a}$. Let us denote the largest representative of these elements by $K$:

[K = \max \left{j’0, j’_1, \dots j’{a - 1} \right}.]

Now we show that any integer bigger than $K$ is in $J’$. Let $k$ be an integer bigger than $K$ and let its remainder when divided by $a$ be $r$, i.e., $k = r \pmod{ a}$. We know that there is an element in $J’$ with same remainder, i.e., $j’_r$, therefore:

[k = j’_r + c\cdot a,]

for some integer $c$. Since $k > M$ and $j’_r \leq M$, $k - j’_r > 0$ and $c \cdot a$ is positive multiple of $a$.
Since $j’_r, a \in J’$, and $J’$ is closed under addition, both $j’_r$ and $c \cdot a = \underbrace{a + a + \dots + a}_{c\, \mathrm{times}}$ are in $J’$.
Therefore, their sum, $k$, must also be in $J’$. This proves that $J’$ contains all integers greater than $K$.
Consequently, $J$ contains {$(K + 1)d, (K + 2)d, \dots$}, which completes the proof.

What is an ergodic Markov chain?

answer.

It’s an aperiodic, irreducible, and recurrent Markov chain:

Aperiodic:
The period of the chain is one.

Irreducible:
If there is only one communication class, i.e., if for all $i, j$ there exists an $n = n(i,j)$ with $P_n(i,j) > 0$, then the chain is called irreducible.

Recurrent:
If the chain starts in a transient class (set of states), then with probability one it eventually leaves this class and never returns. Classes with this property are called transient classes and the states are called transient states. Other classes are called recurrent classes with recurrent states. A Markov chain starting in a recurrent class never leaves that class.

Why is the left eigenvector of a transition matrix with the eigen value of one equal to the stationary distribution of the Markov chain modelled by the transition matrix?

answer.

Suppose $\bar{\pi}$ is the stationary distribution and $P$ is the transition matrix. Then, we know that since all rows of $P^\infty$ are the same, then for any initial any probability vector $\bar{v}$ we have

[\bar{\pi} = \lim_{n \to \infty} \bar{v}P^n.]

Hence, we have,

$\bar{\pi} = \lim_{n \to \infty} \bar{v}P^{n + 1} = \left(\lim_{n \to \infty} \bar{v}P^{n}\right)P = \bar{\pi}P.$
The above display is a left eigenvector of $P with eigenvalue one.

Source: Introduction to Stochastic Processes

Prove that every state in an irreducible Markov chain has the same period.

The period of state $i$ is defined as the greatest common divisor of the set $J_i :=$ {$n \geq 1: P_n(i, i) >0$}. Not that $J_i$ is closed under addition because $P_{n +m} (i, i) = \sum_k P_n(i, k) P_m(k, i) \geq P_n(i, i) P_m(i, i) > 0$. Let $d$ be the greatest common divisor of $J_i$. We have shown before that $J_i$ contains all but a finite number of integers. Hence, $J_i$ contains $md$ for all $m$ greater than some integer $M$.

Let $j$ be another state and $n,m$ such that $P_n(i, j), P_m(j, i) > 0$ (chain is irreducible). Clearly $n + m \in J_i$ and $m + n \in J_j$. If $l \in J_j$, then

[P_{n+m+l}(i, j) \geq P_{n}(i, j)P_l(j, j)P_m(j, i) > 0.]

Therefore, $n+m+l \in J_i, J_j$. $d$ used to divide $n + m$, now we showed that it must divide $l$ as well.
So, we have just shown that if $d $divides every element of $J_i$
then it divides every element of $J_j$. From this we see that all states have the same period.

Source: Introduction to Stochastic Processes

— - Consider the Markov chain with state space {1, 2, 3, 4, 5} and matrix

[P = \begin{pmatrix}
0 & 1/3 & 2/3 & 0 & 0 \
0 & 0 & 0 & 1/4 & 3/4 \
0 & 0 & 0 & 1/2 & 1/2 \
1 & 0 & 0 & 0 & 0 \
1 & 0 & 0 & 0 & 0
\end{pmatrix}]

What are $P_{1000}(2, 1 ), P_{1000}(2, 2), P_{1000}(2, 4)$?

Due to the structure of the chain at no time step there would be probability for transitioning from state 2 to state 1, nor from state 2 to itself. Hence, $P_{1000}(2, 1) = P_{1000}(2, 2) = 0$

On the other hand to compute $P_{1000}(2, 4)$, since the chain is periodic with the period of three and the remainder of $1000$ when divided by $3$ is one, the same as number $4$, hence using Python software we raise the matrix to the power of $4$ instead, and $P_{1000}(2, 4) \approx P_4(2, 4) \approx 0.4167$.

[\begin{equation}
P_4 = \begin{pmatrix}
0 & 0.3333 & 0.6665 & 0 & 0\
0 & 0 & 0 & 0.4167 & 0.5835 \
0 & 0 & 0 & 0.4167 & 0.5835\
1 & 0 & 0 & 0 & 0 \
1 & 0 & 0 & 0 & 0
\end{pmatrix}
\end{equation}]

Source: Introduction to Stochastic Processes

Let $X_1, X_2,\dots$ be the successive values from independent rolls of a standard six-sided die. Let $S_n= X_1 + \dots + X_n$. Let

[T_1 = \min {n \geq 1: S_n \, \text{is divisible by 8} },]

[T_2 = \min{n \geq 1: S_{n - 1}\, \text{is divisible by 8}}.]

Find $\mathbb{E}[T_1]$ and $\mathbb{E}[T_2]$. (Hint: consider the remainder of $S_n after division by 8 as a Markov chain.)

answer.

The state space of the chain comprises ${0, 1, 2, 3, 4, 5, 6, 7}$. Let $g(i)$ denotes the expected number of rolls until the chain reaches state zero, while starting at state $i$. We have that $g(0) = 0$. Definitely we won’t reach the state $0$ with just one roll because the outcome would be among $1$ and $6$. Hence,

[\begin{equation}
\mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right).
\end{equation}]

On the other hand,

[\begin{equation}
\sum_{i=1}^7g(i) = \sum_{i=1}^7 \left(1 + \frac{1}{6}\sum_{j=1}^6 g\left(i + j \bmod{8}\right) \right).
\end{equation}]

That is, the expected number of rolls starting from state $i \neq 0$ is equal to rolling the die and the expected number of rolls of the resulting state. We rearrange the summations to get:

[\begin{equation}
\sum_{i=1}^7g(i) = 7 + \frac{1}{6}\sum_{j=1}^6\sum_{i=7}^6 g(i + j \bmod{8}).
\end{equation}]

Now for each fixed $j$, $(i + j \bmod{8})$ is a permutation of ${0, 1, 2, 3, 4, 5, 6, 7}$ excluding the number $j$.
Hence, by denoting $S := \sum_{i = 1}^7 g(i)$, we have:

[\begin{align}
S = 7 + \frac{1}{6}\sum_{j=1}^6(S - g(j)) \Rightarrow \sum_{j = 1}^6 g(j) = 42.
\end{align}]

Therefore,

[\begin{equation}
\mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right) = 8.
\end{equation}]

For computing $\mathbb{E}[T_2]$, we only need to roll the die once, because the sum of the random variables before the first roll is zero, so $S_{1 - 1} = S_0 = 0$, and $0$ is divisible by 8.

Source: Introduction to Stochastic Processes

Let $X_n, Y_n$ be independent Markov chains with state space {$0, 1, 2$} and transition matrix

[P = \begin{pmatrix}
1/2 & 1/4 & 1/4 \
1/4 & 1/4 & 1/2 \
0 & 1/2 & 1/2
\end{pmatrix}.]

Suppose $X_0 = 0, Y_0 = 2$ and let

[T = \inf{n: X_n = Y_n}.]

(A) Find $\mathbb{E}[T]$.

(B) What is $\mathbb{P}{X_T = 2}$?

(C) In the long run, what percentage of the time are both chains in the same state?
[Hint: consider the nine-state Markov chain $Z_n = (X_n, Y_n$).]

answer.

Part(A).

We need to set up a recursive formula connecting the expected number of steps until two chains meet.

Let $g(i, j)$ be the expected number of steps until the meeting time, when ${X_n}$ has started at state $i$ and ${Y_n}$ has started in state $j$. We have that $g(k, k) = 0,: k \in {0, 1, 2}$. For all other state, since ${X_n}$ and ${Y_n}$ are independent, we have that:

[\begin{equation}
g(i, j) = 1 + \sum_{k=0}^3\sum_{l=0}^3 P(i, k)P(j, l)g(k, j).
\end{equation}]

This gives a set of 6 equations with 6 unknowns
($g(0, 1), g(0, 2), g(1, 0), g(1, 2), g(2, 0),\, \mathrm{and}\, g(2, 1)$) as follows:

[\begin{align}
g(0, 1) & = 1 + \frac{1}{2}g(0, 0) + \frac{1}{4}g(0, 1) + \frac{1}{4}g(0, 2) \
g(0, 2) & = 1 + \frac{1}{2}g(0, 0) + \frac{1}{4}g(0, 1) + \frac{1}{4}g(0, 2)
\
g(1, 0) & = 1 + \frac{1}{4}g(1, 0) + \frac{1}{4}g(1, 1) + \frac{1}{2}g(1, 2) \
g(1, 2) & = 1 + \frac{1}{4}g(1, 0) + \frac{1}{4}g(1, 1) + \frac{1}{2}g(1, 2) \
g(2, 0) & = 1 + 0\cdot g(2, 0) + \frac{1}{2}g(2, 1) + \frac{1}{2}g(2, 2) \
g(2, 1) & = 1 + 0 \cdot g(2, 0) + \frac{1}{2}g(2, 1) + \frac{1}{2}g(2, 2).
\end{align}]

Solving this system of equations gives us: $g(0, 2) = \frac{118}{35} \approx 3.37$.

Part (B).

We need to set up a recursive formula connecting how the chains can meet at state 2 while starting at state 0 and 2 respectively.

Let $h(i, j)$ be the probability of first meeting at state 2, when ${X_n}$ has started at state $i$ and ${Y_n}$ has started in state $j$. We have that $h(1, 1) = h(0 , 0) = 0$ (first meeting should be at state 2), and $h(2, 2) = 1$. For all other state, since ${X_n}$ and ${Y_n}$ are independent, we have that:

[\begin{equation}
h(i, j) = \sum_{k=0}^3\sum_{l=0}^3 P(i, k)P(j, l)h(k, j).
\end{equation}]

This gives a set of 6 equations with 6 unknowns ($h(0, 1), h(0, 2), h(1, 0), h(1, 2), h(2, 0),\, \mathrm{and}\, h(2, 1)$) as follows:

[\begin{align}
h(0, 1) & = \frac{1}{2}h(0, 0) + \frac{1}{4}h(0, 1) + \frac{1}{4}h(0, 2) \
h(0, 2) & = \frac{1}{2}h(0, 0) + \frac{1}{4}h(0, 1) + \frac{1}{4}h(0, 2)
\
h(1, 0) & = \frac{1}{4}h(1, 0) + \frac{1}{4}h(1, 1) + \frac{1}{2}h(1, 2) \
h(1, 2) & = \frac{1}{4}h(1, 0) + \frac{1}{4}h(1, 1) + \frac{1}{2}h(1, 2) \
h(2, 0) & = 0\cdot h(2, 0) + \frac{1}{2}h(2, 1) + \frac{1}{2}h(2, 2) \
h(2, 1) & = 0 \cdot h(2, 0) + \frac{1}{2}h(2, 1) + \frac{1}{2}h(2, 2).
\end{align}]

Solving this system of equations gives us: $h(0, 2) = \frac{15}{28} \approx 0.535$.

Part (C).

Since the chains are independent, we need to multiple their invariant distributions for states ${1, 2, 3}$. Let us find the invariant distribution by solving $\bar{\pi} = \bar{\pi}P$:

[\begin{equation}
\begin{pmatrix}
\bar{\pi}_1 \
\bar{\pi}_2 \
\bar{\pi}_3
\end{pmatrix} =
\begin{pmatrix}
\bar{\pi}_1 \
\bar{\pi}_2 \
\bar{\pi}_3
\end{pmatrix}^\top \begin{pmatrix}
1/2 & 1/4 & 1/4 \
1/4 & 1/4 & 1/2 \
0 & 1/2 & 1/2
\end{pmatrix}
\quad \mathrm{and} \quad \bar{\pi}_1 + \bar{\pi}_2 + \bar{\pi}_3 = 1.
\end{equation}]

The solution of this equation is: $\bar{\pi} = \frac{1}{11}(2, 4, 5)$. Hence, the probability that both chains spend time at the same states in a long run is equal to: $\frac{1}{11^2}(4 + 16 + 25) = \frac{45}{121}$.

Source: Introduction to Stochastic Processes

Let $X_n$ be a Markov chain on state space {1, 2, 3, 4, 5} with transition matrix

[P = \begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0\
0 & 0 & 0 & 1/5 & 4/5 \
0 & 0 & 0 & 2/5 & 3/5 \
1 & 0 & 0 & 0 & 0 \
1/2 & 0 & 0 & 0 & 1/2
\end{pmatrix}]

(A) Suppose $X_0 = 1$. What is the expected number of steps until the chain is in state 4?

(B) Suppose $X_0 = 1$. What is the probability that the chain will enter state 5 before it enters state 3?

answer.

Part (A).

We need to set up a recursive formula connecting the expected number of steps until reaching state 4 for each distinct state. Let $g(i)$ be the expected number of steps until reaching state $4$, when the chain has started at state $i$. We have that $g(4) = 0$. For all other state we have that:

[\begin{equation}
g(i) = 1 + \sum_{j=0, j \neq 4}^5P(i, j)g(j).
\end{equation}]

This gives a set of 4 equations with 4 unknowns ($g(1), g(2), g(3),\, \mathrm{and}\, g(5)$) as follows:

[\begin{align}
g(5) & = 1 + \frac{1}{2}g(0) + \frac{1}{2}g(5) \
g(3) & = 1 + \frac{2}{5}g(4) + \frac{3}{5}g(5) \
g(2) & = 1 + \frac{1}{5}g(4) + \frac{4}{5}g(5) \
g(1) & = 1 + \frac{1}{2}g(2) + \frac{1}{2}g(3)
\end{align}]

Solving this system of equations gives us: $g(1) = \frac{34}{3}$.

Part (B).

We need to set up a recursive formula connecting how the chain can hit state 5 before state 3, while starting at state 1.

Let $h(i)$ be the probability that the chain hits state $5$ before state 3, when the chain has started at state $i$. We have that $h(5) =1$ and $h(3) = 0$. For all other state we have that:

[\begin{equation}
h(i) = \sum_{j=1}^5P(i, j)h(j).
\end{equation}]

This gives a set of 3 equations with 3 unknowns ($h(1), h(2),\, \mathrm{and}\, h(4)$) as follows:

[\begin{align}
h(4) & = h(1) \
h(2) & = \frac{1}{5}h(4) + \frac{4}{5}h(5) = \frac{1}{5}h(4) + \frac{4}{5}\
h(1) & = \frac{1}{2}h(2) + \frac{1}{2}h(3) = \frac{1}{2}h(2)
\end{align}]

Solving this system of equations gives us: $h(1) = \frac{4}{9}$.

Source: Introduction to Stochastic Processes

Theorem. Suppose the state space $S$ and the action space $A$ are finite. Then, there exits a deterministic stationary Markov Blackwell optimal policy.

Proof. Let $\Pi^\mathrm{MD}$ be the set of all deterministic Markov polices for an MDP with finite state and action spaces. Since $\Pi^\mathrm{MD}$ is finite, there exits a sequence of discount factors {$\gamma_n$} converging to one for which there exits a $\pi^* \in \Pi^\mathrm{MD}$ with $\left(\pi^*\right)^\infty = (\pi, \pi, \dots) = (\pi_0, \pi_1, \dots)$ (as the stationary policy) being discount optimal for all $\gamma_n$.

The reason the aforementioned fact is true is that since $\Pi^\mathrm{MD}$ finite and $0 \leq \gamma <1$ is infinite, due to pigeonhole principle some optimal policies are shared for some discount factors. Therefore, we can pick a subsequence of discount factors that increases toward one, which will have an associated optimal policy.

With having the above fact in mind, for each $\pi \in \Pi^\mathrm{MD}$,

[v_\gamma^{\left(\pi^*\right)^\infty}(s) - v_\gamma^{\pi^\infty}(s) \geq 0,]

for all states, $\gamma = \gamma_n$ and $n = 1, 2, \dots$ . Each function on L.H.S. is a rational function of $\gamma$, so is their difference. Hence, the difference is zero for all $\gamma$, or equals zero for at most finitely many $\gamma$s. Therefore, there exists a $\gamma_\pi < 1$ for which the above display holds for $\gamma_\pi \leq \gamma < 1$. Since $\Pi^\mathrm{MD}$ is finite the above displays holds for all $\gamma^* \leq \gamma < 1$ where $\gamma^* = \max_\pi \gamma_{\pi}$.

Now that we have fixed $\gamma$, by virtue of the existence of a deterministic stationary Markov policy in the discounted setting, the result follows. $\square$

I’ve been concise and incomplete for this theorem. I’ll add supplementary details one day that I stumbled on this post. :D

Source: Markov Decision Processes: Discrete Stochastic Dynamic Programming

Let $\Omega$ be a measurable set, and let $f: \Omega \to [0, \infty]$ be a non-negative measurable function. Prove that we have $0 \leq \int_\Omega f \leq \infty$. Furthermore, we have $\int_\Omega f = 0$ if and only if $f(x) = 0$ for almost every $x \in \Omega$.

Proof. Since $f$ is a non-negative measurable function, we have that

[\int_\Omega f = \sup {\int_\Omega s: s \text{ is non-negative, simple and dominated by } f }.]

Step 1: $0 \leq \int_\Omega \leq \infty$.

Consider a fixed $s$. Since $s$ is simple, $\int_\Omega s = \sum_{j = 1}^N c_j \cdot m(E_j)$,
where $N \in \mathbb{N}, c_j > 0, m: \Omega \to \mathbb{R}^*$ is the Lebesgue measure, $E_1, \dots, E_N \in \Omega$,
and $E_i \cap E_j = \varnothing$ for all $i, j \in [N], \mathrm{and}\, i \neq j$. The sum of non-negative terms on the
extended real line $\mathbb{R}^*$ is in $[0, \infty]$, hence is their supremum.

Step 2: If $f = 0$ a.e., then $\int_\Omega f = 0$.

Since $f$ dominates $s$, then $0 \leq s(x) \leq f(x)$, for all $x \in \Omega$. If $f(x) = 0$ for almost every $x \in \Omega$, then $s = 0$ a.e., hence $\int_\Omega s = 0$, and consequently $\sup \int_\Omega s = 0$.

Step 3: If $\int_\Omega f = 0$, then $f = 0$ a.e.

Consider the set

[E_j := {x \in \Omega: f(x) > \frac 1j }, \quad j \geq 1.]

The function $s_j := \frac 1j \mathbb{1}_{E_j}$ is simple and also dominated by $f$ because $0 \leq s_j \leq f$. By Step 1, we have $\frac 1j m(E_j) = \int_\Omega s_j \leq \int_\Omega f = 0$. This means $m(E_j) = 0$ for all $j$. The domain of $f$ where it is non zero is

[\Omega’ := {x : f(x) > 0 } = \cup_{j = 1}^\infty E_j.]

By the subadditivity property of the Lebesgue measure we have

[m\left(\Omega’ \right) = \sum_{j = 1}^\infty m(E_j) = 0 \Rightarrow f = 0: \mathrm{a.e.}]

Source: Analysis II

Convergence of Positive Supermartingales

26 minute read

Published: October 27, 2025

Some cool things about finite Markov chains and thereof

12 minute read

Published: September 23, 2025

In this post, I wanna summarize Sections 1.5 and 1.6 of Introduction to Stochastic Processes and kinda blew my mind.

A reminder on Student’s t-distribution and p-values

4 minute read

Published: September 14, 2025

Let’s see what Sir Larry Wasserman has to tell us about it.

So we’re in the realm of hypothesis testing [Aaditya Ramdas calls it stochastic proof by contraction, I
love this phrase]. Suppose we divide our parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$.
We wish to test

$H_0: \theta \in \Theta_0 \quad\text{versus} \quad H_1: \theta \in \Theta_1$.

We call $H_0$ the null hypothesis that we’d like to reject. Because it says nothing interesting is going on
(hence the name null). Let $\mathbb{P}_\theta$ be a probability distribution with support $\mathcal{X}$
parameterized by $\theta$. Define $\mathcal{R} \subset \mathcal{X}$ called the rejection region.
Let $X \sim \mathbb{P}$_$\theta$$(\cdot)$. Then, if

[X \in \mathcal{R} \Rightarrow \text{ reject } H_0, \
X \notin \mathcal{R} \Rightarrow \text{ retain } H_0.]

Usually, the rejection region is of the form

[R = {x \in \mathcal{X}: T(x) > c },]

where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate critical value $c$.

P.S.: A test statistic is a single number calculated from sample data that is used to evaluate a hypothesis in statistical analysis. It quantifies the difference between the observed data and what would be expected if the null hypothesis were true. Essentially, it helps determine how compatible your data is with a specific hypothesis.

Definition: The power function of a test with rejection region $\mathcal{R}$ is defined by

[\beta(\theta) = \mathbb{P}_\theta(X \in \mathcal{R}).]

The size of a test is defined to be

[\sup_{\theta \in \Theta_0} \beta(\theta).]

A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Basically, the level $\alpha$ specifies the maximum probability of rejection the null hypothesis.

Some reminders on Linear Algebra

2 minute read

Published: September 13, 2025

In this post I’ll mention a handful reminders of Linear Algebra.

Induction by Contradiction

1 minute read

Published: September 12, 2025

In this post, we wanna prove the correctness of the proof by induction itself. Lol

When is a deterministic optimal policy in MDPs attainable?

10 minute read

Published: September 10, 2025

In non-theory research papers, I often see that the optimal (deterministic Markov) policy for MDPs is defined as

Stochastic Approximation Part 1

26 minute read

Published: September 07, 2025

What struck my curiosity to investigate why $Q$-learning and SARSA converge was the realization that these methods use
their estimates of action values at time step $t$ to estimate the action values at time step $t + 1$ in their update
targets. This sounded really weired to me as though the convergence should not happen. So, I dug in and discovered the
answer lies in the topic of stochastic approximation. Stochastic approximation is fairly a big topic, hence I cover it
in four separate parts. In the last part I’ll turn my attention to Q-Learning and SARSA eventually.

I’ll mention the assumption required to each proposition during their corresponding proofs to see where those
assumptions were inevitably needed.

Needed background

In this section I’ll review some concepts needed throughout the document.

The law of total variance in practice

8 minute read

Published: August 19, 2025

In this post, I’ll solve an example that requires the use the law of total variance in RL.

Background

Background on some technical tools used in the main section. Before we start, note that for two random variables $X\, \mathrm{and}\, Y$, $\mathbb{E}[Y \mid X]$ is a shorthand notation for $\mathbb{E}[Y \mid \sigma(X)]$!

The law of total variance

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\mathcal{G}_1 \subseteq \mathcal{G}_2 \subseteq \mathcal{F}$ two sub-$\sigma$-algebras of $\mathcal{F}$, $X$ an integrable random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ with finite variance. The following holds true:

$\mathbb{V}(X \mid \mathcal{G}_1) = \mathbb{\mathbb{E}}[\mathbb{V}(X \mid \mathcal{G}_2) \mid \mathcal{G}_1] - \mathbb{\mathbb{V}}[\mathbb{E}(X \mid \mathcal{G}_2) \mid \mathcal{G}_1].$

The law of the unconscious statistician (LOTUS)

Let $\mathbb{P}_X$ be the pushforward of the random element $X \in \mathcal{X}$. For any real-valued, $f: \mathcal{X} \to \mathbb{R}$ measurable function,

[\mathbb{E}[f(X)] = \sum_xf(x)\mathbb{P}_X(x),]

[\mathbb{E}[f(X)] = \int_\mathcal{X} f(x)\mathrm{d}\mathbb{P}_X(x),]

provided that either the right-hand side, or the left-hand side exist. This is known as the “law of the unconscious statistician”, or LOTUS.

Jensen’s inequality

Let $\bar{\mathbb{R}} = \mathbb{R} \cup$ 2025-08-19-ltv.md 2025-08-19-ltv.md{$ -\infty, +\infty$}, and $\mathrm{dom}(f) = {x \in \mathbb{R}^d: f(x) < \infty }$ for a real-valued function $f$ on $\mathbb{R}^d$.

Jensen’s inequality: Let $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ be a measurable convex function and $X$ be an $\mathbb{R}^d$-valued random element on some probability space such that $\mathbb{E}[X]$ exists and $X \in \mathrm{dom}(f)$ holds almost surely. Then,

[\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).]

An example the law of total variance in practice

We want to compare the variance of the target for SARSA and expected SARSA. The update rule of is

[Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) - Q_t(S_t, A_t) \right].]

And the update rule for expected SARSA is

[Q_{t + 1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[R_{t + 1} + \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) - Q_t(S_t, A_t) \right],]

where $\pi$ is the fixed policy that was used to generate the data $S_0, A_0, R_1, \, \dots\;$.

Let $H_t = (S_0, A_0, R_1, \dots, S_t)$, and $H’_t = (S_0, A_0, R_1, \dots, S_t, A_t)$.

Show that

[\mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H_{t + 1}\right] \geq \mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert H_{t + 1}\right].]

First, note that using Markov property, we can replace $H_{t + 1}$ in the above expressions with $S_{t + 1}$. Second,

[\mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert S_{t + 1}\right] = 0.]

Because given $S_{t + 1}$ (it is not random anymore), the randomness in the actions is averaged out using the expectation, so there is no randomness remaining more. Third,

[\begin{align}
\mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right] & = \mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1})^2 \middle\vert S_{t + 1}\right] - \mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right]^2 \quad \text{(def of variance)} \
& = \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(a, S_{t + 1})^2 - \left(\sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(a, S_{t + 1}) \right)^2 \quad \text{(def of expectation and LOTUS)} \
&\geq \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(a, S_{t + 1})^2 - \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1})^2 Q_t(a, S_{t + 1})^2 \quad \text{(Cauchy-Swhartz)} \
&\geq \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(a, S_{t + 1})^2 - \sum_{a \in \mathcal{A}} \pi(a \mid S_{t + 1}) Q_t(a, S_{t + 1})^2 \
&\geq 0.
\end{align}]

Right off the bat, we already knew that the variance is always non-negative anyway. 😅

When would have the equality happened?

Well you’re asking when the variance of a random variable is zero. Then, the answer is when it’s deterministic. For a deterministic random variable $X$ we have that $\mathbb{E}[X] = X$. Hence, we want

[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1}\right] = Q_t(S_{t + 1}, A_{t + 1}),]

which can only happen when the policy is deterministic.

Show that

[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H’t \right] \geq \mathbb{V}\left[R{t + 1} + \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert H’_t\right],]

that is, the appropriate conditional variance of the SARSA target is always at least as large as that of for the expected SARSA target.

For convenience, let $Z_t = (S_t, A_t)$. We have,

[\begin{align}
& \mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H’_t \right] - \mathbb{V}\left[R_{t + 1} + \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert H’_t\right] = \
& \mathbb{V} [R_{t + 1} \mid Z_t] + \gamma^2 \mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] + 2\mathrm{Cov}\left[R_{t + 1}, \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V} [R_{t + 1} \mid Z_t] - \gamma^2 \mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t\right] - 2\mathrm{Cov}\left[R_{t + 1}, \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t \right]= \
& \gamma^2 \mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] + 2\mathrm{Cov}\left[R_{t + 1}, \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \gamma^2 \mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t\right] - 2\mathrm{Cov}\left[R_{t + 1}, \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t \right].
\end{align}]

For covariance terms, note that although $R_{t + 1}$ is not independent of $Q_t$ but given $Z_t$, it is (it is naturally independent of $S_{t + 1}$ and $A_{t + 1}$). So, covariance terms are zero. Hence, we end up with

[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H’t \right] - \mathbb{V}\left[R{t + 1} + \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert H’t\right] =
\gamma^2 \left(\mathbb{V} \left[Q_t(S{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t\right]\right).]

Now we apply the law of total variance

[\begin{align}
&\gamma^2 \left(\mathbb{V} \left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert Z_t \right] - \mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert Z_t\right]\right) = \
&\gamma^2 \left(\mathbb{E}\left[\mathbb{V}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] - \mathbb{V}\left[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] \right) - \
&\gamma^2 \left( \mathbb{E}\left[\mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert S_{t + 1}\right] \middle\vert Z_t \right] - \mathbb{V}\left[\mathbb{E}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert S_{t + 1}\right] \middle\vert Z_t \right] \right)
\end{align}]

First, $\mathbb{E}\left[\mathbb{V}\left[\sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] = 0$. Given $S_{t + 1}$ everything inside becomes deterministic. Also,

[\mathbb{V}\left[\mathbb{E}\left[Q_t(S_{t + 1}, A_{t + 1}) \middle\vert S_{t + 1} \right] \middle\vert Z_t \right] = \mathbb{V}\left[ \sum_{a’} \pi(a’ \mid S_{t + 1})Q_t(S_{t + 1}, a’) \middle\vert Z_t \right]= \mathbb{V}\left[\mathbb{E}\left[ \sum_{a’} \pi(a’ \mid S_{t + 1})Q_t(S_{t + 1}, a’) \middle\vert S_{t + 1}\right] \middle\vert Z_t \right]]

So, we have

[\mathbb{V} \left[R_{t + 1} + \gamma Q_t(S_{t + 1}, A_{t + 1}) \middle\vert H’t \right] - \mathbb{V}\left[R{t + 1} + \gamma \sum_{a’ \in \mathcal{A}} \pi\left(a’ \mid S_{t + 1}\right)Q_t \left(S_{t + 1}, a’ \right) \middle\vert H’t\right] = \gamma^2 \mathbb{E}[\mathbb{V}[Q_t(S{t + 1}, A_{t + 1}) \mid S_{t + 1}] \mid Z_t] \geq 0.]

Since the variance is always positive.

Reference

Sobolev spaces (not too wiggly functions), Hilbert spaces, and RHKS

11 minute read

Published: August 17, 2025

This post will be hell of a post! Haha 🤩

Random variables’ moments and thereof

5 minute read

Published: August 17, 2025

Mostly from All of Statistics.

What to remember from RLC 2025

3 minute read

Published: August 06, 2025

Day 1:

What to remember from ICML 2025

less than 1 minute read

Published: August 06, 2025

Andreas Krause’s first step in research (keynote talk):

You have to know what you don’t know.

1900

Why does Q-Learning and SARSA converge? (a.k.a Stochastic Approximation Part 4)

1 minute read

Published: August 20, 1900

Alireza Kazemipour

Journal

2025

Some cool questions

Convergence of Positive Supermartingales

Some cool things about finite Markov chains and thereof

A reminder on Student’s t-distribution and p-values

Some reminders on Linear Algebra

Induction by Contradiction

When is a deterministic optimal policy in MDPs attainable?

Stochastic Approximation Part 1

Needed background

The law of total variance in practice

Background

The law of total variance

The law of the unconscious statistician (LOTUS)

Jensen’s inequality

An example the law of total variance in practice

Reference

Sobolev spaces (not too wiggly functions), Hilbert spaces, and RHKS

Random variables’ moments and thereof

What to remember from RLC 2025

Day 1:

What to remember from ICML 2025

1900

Why does Q-Learning and SARSA converge? (a.k.a Stochastic Approximation Part 4)

In Progress!!! 👇