Journal

The life as a grad student taught that I should maintain this page very seriously.

2025

Some cool questions

18 minute read

Published:

  • An elementary theorem in number theory states that if two integers $m$ and $n$ are relatively prime (i.e., greatest common divisor equal to 1 ), then there exist integers $x$ and $y$ (positive or negative) such that $mx + ny = 1$. Using this theorem show the following:
    A If $m$ and $n$ are relatively prime then the set {$xm + ny : x, y$ positive integers} contains all but a finite number of the positive integers.
    B. Let $J$ be a set of nonnegative integers whose greatest common divisor is $d$. Suppose also that $J$ is closed under addition, $m, n \in J \Rightarrow m + n \in J$. Then $J$ contains all but a finite number of integers in the set {$0, d, 2d,\dots$}.

Source: Introduction to Stochastic Processes

answer.

Part (A).

First, note that if we want to contain only positive integers, then $n$ and $m$ must be nonnegative. If one of them is negative and the other is positive, we would inevitably contain the whole integers. Also, to rule out the trivial case, let us assume $n, m \geq 1$.

Since $\mathsf{gcd} (n, m) = 1$, the great common divisor of any multiples of $n$ from $1n$ to $mn$ with $m$ is also one. In other words, the set {$n, 2n, \dots, mn$} forms the residue system modulo $m$. Let $N$ be any nonzero integer, we have $N \equiv ny \pmod m$, where $1 \leq y \leq m$. Equivalently, we can say $N - ny = mx$ for some integer $x$.

We know that since $1 \leq y \leq m$, $y$ is positive. On the other hand, if $N \geq mn + 1$, then $x = \frac{N - ny}{m} > 0$, hence $x$ is also positive. Therefore, we just proved that for positive integers $x$ and $y$, all positive integers bigger than $mn$ are in the set described in the question.

Part (B).

Consider the set $J’ =$ {$\frac{j}{d} \mid j \in J$}. If we show for any integer $k$ in $J’$ that is bigger than some threshold $K$, all but a finite number of nonnegative integers are in $J’$, then we can conclude for any integer $k$ in $J$ that is bigger than $Kd$, all but a finite number of nonnegative integers are in $J$

By construction, the greatest common divisor of $J’$s elements is one and also $J’$ is closed under addition. Now, consider an arbitrary element in $J’$ and let us denote it by $a$. The set of remainder of $J’$s elements when divided by $a$ is $R = $ {$j’ \pmod{a} \mid j’ \in J’$}. Since the greatest common divisor of $J’$ is one, the greatest common divisor of $R$ is also one. Therefore, $R$ forms the residue system modulo $a$, i.e., $R = $ {$0, 1, \dots a - 1$}, where for each element of $r \in R$ there exits at least one element in $J’$ denoted by $j’_r$ such that $j’_r \equiv r \pmod{a}$. Let us denote the largest representative of these elements by $K$:

[K = \max \left{j’0, j’_1, \dots j’{a - 1} \right}.]

Now we show that any integer bigger than $K$ is in $J’$. Let $k$ be an integer bigger than $K$ and let its remainder when divided by $a$ be $r$, i.e., $k = r \pmod{ a}$. We know that there is an element in $J’$ with same remainder, i.e., $j’_r$, therefore:

[k = j’_r + c\cdot a,]

for some integer $c$. Since $k > M$ and $j’_r \leq M$, $k - j’_r > 0$ and $c \cdot a$ is positive multiple of $a$.
Since $j’_r, a \in J’$, and $J’$ is closed under addition, both $j’_r$ and $c \cdot a = \underbrace{a + a + \dots + a}_{c\, \mathrm{times}}$ are in $J’$.
Therefore, their sum, $k$, must also be in $J’$. This proves that $J’$ contains all integers greater than $K$.
Consequently, $J$ contains {$(K + 1)d, (K + 2)d, \dots$}, which completes the proof.


  • What is an ergodic Markov chain?

answer.

It’s an aperiodic, irreducible, and recurrent Markov chain:

Aperiodic:
The period of the chain is one.

Irreducible:
If there is only one communication class, i.e., if for all $i, j$ there exists an $n = n(i,j)$ with $P_n(i,j) > 0$, then the chain is called irreducible.

Recurrent:
If the chain starts in a transient class (set of states), then with probability one it eventually leaves this class and never returns. Classes with this property are called transient classes and the states are called transient states. Other classes are called recurrent classes with recurrent states. A Markov chain starting in a recurrent class never leaves that class.


  • Why is the left eigenvector of a transition matrix with the eigen value of one equal to the stationary distribution of the Markov chain modelled by the transition matrix?

answer.

Suppose $\bar{\pi}$ is the stationary distribution and $P$ is the transition matrix. Then, we know that since all rows of $P^\infty$ are the same, then for any initial any probability vector $\bar{v}$ we have

[\bar{\pi} = \lim_{n \to \infty} \bar{v}P^n.]

Hence, we have,

\(\bar{\pi} = \lim_{n \to \infty} \bar{v}P^{n + 1} = \left(\lim_{n \to \infty} \bar{v}P^{n}\right)P = \bar{\pi}P.\)
The above display is a left eigenvector of $P with eigenvalue one.

Source: Introduction to Stochastic Processes


  • Prove that every state in an irreducible Markov chain has the same period.

The period of state $i$ is defined as the greatest common divisor of the set $J_i :=$ {$n \geq 1: P_n(i, i) >0$}. Not that $J_i$ is closed under addition because $P_{n +m} (i, i) = \sum_k P_n(i, k) P_m(k, i) \geq P_n(i, i) P_m(i, i) > 0$. Let $d$ be the greatest common divisor of $J_i$. We have shown before that $J_i$ contains all but a finite number of integers. Hence, $J_i$ contains $md$ for all $m$ greater than some integer $M$.

Let $j$ be another state and $n,m$ such that $P_n(i, j), P_m(j, i) > 0$ (chain is irreducible). Clearly $n + m \in J_i$ and $m + n \in J_j$. If $l \in J_j$, then

[P_{n+m+l}(i, j) \geq P_{n}(i, j)P_l(j, j)P_m(j, i) > 0.]

Therefore, $n+m+l \in J_i, J_j$. $d$ used to divide $n + m$, now we showed that it must divide $l$ as well.
So, we have just shown that if $d $divides every element of $J_i$
then it divides every element of $J_j$. From this we see that all states have the same period.

Source: Introduction to Stochastic Processes


  • Consider the Markov chain with state space {1, 2, 3, 4, 5} and matrix

[P = \begin{pmatrix}
0 & 1/3 & 2/3 & 0 & 0 \
0 & 0 & 0 & 1/4 & 3/4 \
0 & 0 & 0 & 1/2 & 1/2 \
1 & 0 & 0 & 0 & 0 \
1 & 0 & 0 & 0 & 0
\end{pmatrix}]

What are $P_{1000}(2, 1 ), P_{1000}(2, 2), P_{1000}(2, 4)$?

Due to the structure of the chain at no time step there would be probability for transitioning from state 2 to state 1, nor from state 2 to itself. Hence, $P_{1000}(2, 1) = P_{1000}(2, 2) = 0$

On the other hand to compute $P_{1000}(2, 4)$, since the chain is periodic with the period of three and the remainder of $1000$ when divided by $3$ is one, the same as number $4$, hence using Python software we raise the matrix to the power of $4$ instead, and $P_{1000}(2, 4) \approx P_4(2, 4) \approx 0.4167$.

[\begin{equation}
P_4 = \begin{pmatrix}
0 & 0.3333 & 0.6665 & 0 & 0\
0 & 0 & 0 & 0.4167 & 0.5835 \
0 & 0 & 0 & 0.4167 & 0.5835\
1 & 0 & 0 & 0 & 0 \
1 & 0 & 0 & 0 & 0
\end{pmatrix}
\end{equation
}]

Source: Introduction to Stochastic Processes


  • Let $X_1, X_2,\dots$ be the successive values from independent rolls of a standard six-sided die. Let $S_n= X_1 + \dots + X_n$. Let

[T_1 = \min {n \geq 1: S_n \, \text{is divisible by 8} },]

[T_2 = \min{n \geq 1: S_{n - 1}\, \text{is divisible by 8}}.]

Find $\mathbb{E}[T_1]$ and $\mathbb{E}[T_2]$. (Hint: consider the remainder of $S_n after division by 8 as a Markov chain.)

answer.

The state space of the chain comprises ${0, 1, 2, 3, 4, 5, 6, 7}$. Let $g(i)$ denotes the expected number of rolls until the chain reaches state zero, while starting at state $i$. We have that $g(0) = 0$. Definitely we won’t reach the state $0$ with just one roll because the outcome would be among $1$ and $6$. Hence,

[\begin{equation}
\mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right).
\end{equation
}]

On the other hand,

[\begin{equation}
\sum_{i=1}^7g(i) = \sum_{i=1}^7 \left(1 + \frac{1}{6}\sum_{j=1}^6 g\left(i + j \bmod{8}\right) \right).
\end{equation
}]

That is, the expected number of rolls starting from state $i \neq 0$ is equal to rolling the die and the expected number of rolls of the resulting state. We rearrange the summations to get:

[\begin{equation}
\sum_{i=1}^7g(i) = 7 + \frac{1}{6}\sum_{j=1}^6\sum_{i=7}^6 g(i + j \bmod{8}).
\end{equation
}]

Now for each fixed $j$, $(i + j \bmod{8})$ is a permutation of ${0, 1, 2, 3, 4, 5, 6, 7}$ excluding the number $j$.
Hence, by denoting $S := \sum_{i = 1}^7 g(i)$, we have:

[\begin{align}
S = 7 + \frac{1}{6}\sum_{j=1}^6(S - g(j)) \Rightarrow \sum_{j = 1}^6 g(j) = 42.
\end{align
}]

Therefore,

[\begin{equation}
\mathbb{E}[T_1] = 1 + \frac{1}{6}\sum_{i = 1}^6 g(i) = 1 + \frac{1}{6}\left(g(1) + g(2) + g(3) + g(4) + g(5) + g(6)\right) = 8.
\end{equation
}]

For computing $\mathbb{E}[T_2]$, we only need to roll the die once, because the sum of the random variables before the first roll is zero, so $S_{1 - 1} = S_0 = 0$, and $0$ is divisible by 8.

Source: Introduction to Stochastic Processes


  • Let $X_n, Y_n$ be independent Markov chains with state space ${0, 1, 2}$ and transition matrix

[P = \begin{pmatrix}
1/2 & 1/4 & 1/4 \
1/4 & 1/4 & 1/2 \
0 & 1/2 & 1/2
\end{pmatrix}.]

Suppose $X_0 = 0, Y_0 = 2$ and let

[T = \inf{n: X_n = Y_n}.]

(A) Find $\mathbb{E}[T]$.

(B) What is $\mathbb{P}{X_T = 2}$?

(C) In the long run, what percentage of the time are both chains in the same state?
[Hint: consider the nine-state Markov chain $Z_n = (X_n, Y_n$).]

answer.

Part(A).

We need to set up a recursive formula connecting the expected number of steps until two chains meet.

Let $g(i, j)$ be the expected number of steps until the meeting time, when ${X_n}$ has started at state $i$ and ${Y_n}$ has started in state $j$. We have that $g(k, k) = 0,: k \in {0, 1, 2}$. For all other state, since ${X_n}$ and ${Y_n}$ are independent, we have that:

[\begin{equation}
g(i, j) = 1 + \sum_{k=0}^3\sum_{l=0}^3 P(i, k)P(j, l)g(k, j).
\end{equation
}]

This gives a set of 6 equations with 6 unknowns
($g(0, 1), g(0, 2), g(1, 0), g(1, 2), g(2, 0),\, \mathrm{and}\, g(2, 1)$) as follows:

[\begin{align}
g(0, 1) & = 1 + \frac{1}{2}g(0, 0) + \frac{1}{4}g(0, 1) + \frac{1}{4}g(0, 2) \
g(0, 2) & = 1 + \frac{1}{2}g(0, 0) + \frac{1}{4}g(0, 1) + \frac{1}{4}g(0, 2)
\
g(1, 0) & = 1 + \frac{1}{4}g(1, 0) + \frac{1}{4}g(1, 1) + \frac{1}{2}g(1, 2) \
g(1, 2) & = 1 + \frac{1}{4}g(1, 0) + \frac{1}{4}g(1, 1) + \frac{1}{2}g(1, 2) \
g(2, 0) & = 1 + 0\cdot g(2, 0) + \frac{1}{2}g(2, 1) + \frac{1}{2}g(2, 2) \
g(2, 1) & = 1 + 0 \cdot g(2, 0) + \frac{1}{2}g(2, 1) + \frac{1}{2}g(2, 2).
\end{align
}]

Solving this system of equations gives us: $g(0, 2) = \frac{118}{35} \approx 3.37$.

Part (B).

We need to set up a recursive formula connecting how the chains can meet at state 2 while starting at state 0 and 2 respectively.

Let $h(i, j)$ be the probability of first meeting at state 2, when ${X_n}$ has started at state $i$ and ${Y_n}$ has started in state $j$. We have that $h(1, 1) = h(0 , 0) = 0$ (first meeting should be at state 2), and $h(2, 2) = 1$. For all other state, since ${X_n}$ and ${Y_n}$ are independent, we have that:

[\begin{equation}
h(i, j) = \sum_{k=0}^3\sum_{l=0}^3 P(i, k)P(j, l)h(k, j).
\end{equation
}]

This gives a set of 6 equations with 6 unknowns ($h(0, 1), h(0, 2), h(1, 0), h(1, 2), h(2, 0),\, \mathrm{and}\, h(2, 1)$) as follows:

[\begin{align}
h(0, 1) & = \frac{1}{2}h(0, 0) + \frac{1}{4}h(0, 1) + \frac{1}{4}h(0, 2) \
h(0, 2) & = \frac{1}{2}h(0, 0) + \frac{1}{4}h(0, 1) + \frac{1}{4}h(0, 2)
\
h(1, 0) & = \frac{1}{4}h(1, 0) + \frac{1}{4}h(1, 1) + \frac{1}{2}h(1, 2) \
h(1, 2) & = \frac{1}{4}h(1, 0) + \frac{1}{4}h(1, 1) + \frac{1}{2}h(1, 2) \
h(2, 0) & = 0\cdot h(2, 0) + \frac{1}{2}h(2, 1) + \frac{1}{2}h(2, 2) \
h(2, 1) & = 0 \cdot h(2, 0) + \frac{1}{2}h(2, 1) + \frac{1}{2}h(2, 2).
\end{align
}]

Solving this system of equations gives us: $h(0, 2) = \frac{15}{28} \approx 0.535$.

Part (C).

Since the chains are independent, we need to multiple their invariant distributions for states ${1, 2, 3}$. Let us find the invariant distribution by solving $\bar{\pi} = \bar{\pi}P$:

[\begin{equation}
\begin{pmatrix}
\bar{\pi}_1 \
\bar{\pi}_2 \
\bar{\pi}_3
\end{pmatrix} =
\begin{pmatrix}
\bar{\pi}_1 \
\bar{\pi}_2 \
\bar{\pi}_3
\end{pmatrix}^\top \begin{pmatrix}
1/2 & 1/4 & 1/4 \
1/4 & 1/4 & 1/2 \
0 & 1/2 & 1/2
\end{pmatrix}
\quad \mathrm{and} \quad \bar{\pi}_1 + \bar{\pi}_2 + \bar{\pi}_3 = 1.
\end{equation
}]

The solution of this equation is: $\bar{\pi} = \frac{1}{11}(2, 4, 5)$. Hence, the probability that both chains spend time at the same states in a long run is equal to: $\frac{1}{11^2}(4 + 16 + 25) = \frac{45}{121}$.

Source: Introduction to Stochastic Processes


  • Let $X_n$ be a Markov chain on state space {1, 2, 3, 4, 5} with transition matrix

[P = \begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0\
0 & 0 & 0 & 1/5 & 4/5 \
0 & 0 & 0 & 2/5 & 3/5 \
1 & 0 & 0 & 0 & 0 \
1/2 & 0 & 0 & 0 & 1/2
\end{pmatrix}]

(A) Suppose $X_0 = 1$. What is the expected number of steps until the chain is in state 4?

(B) Suppose $X_0 = 1$. What is the probability that the chain will enter state 5 before it enters state 3?

answer.

Part (A).

We need to set up a recursive formula connecting the expected number of steps until reaching state 4 for each distinct state. Let $g(i)$ be the expected number of steps until reaching state $4$, when the chain has started at state $i$. We have that $g(4) = 0$. For all other state we have that:

[\begin{equation}
g(i) = 1 + \sum_{j=0, j \neq 4}^5P(i, j)g(j).
\end{equation
}]

This gives a set of 4 equations with 4 unknowns ($g(1), g(2), g(3),\, \mathrm{and}\, g(5)$) as follows:

[\begin{align}
g(5) & = 1 + \frac{1}{2}g(0) + \frac{1}{2}g(5) \
g(3) & = 1 + \frac{2}{5}g(4) + \frac{3}{5}g(5) \
g(2) & = 1 + \frac{1}{5}g(4) + \frac{4}{5}g(5) \
g(1) & = 1 + \frac{1}{2}g(2) + \frac{1}{2}g(3)
\end{align
}]

Solving this system of equations gives us: $g(1) = \frac{34}{3}$.

Part (B).

We need to set up a recursive formula connecting how the chain can hit state 5 before state 3, while starting at state 1.

Let $h(i)$ be the probability that the chain hits state $5$ before state 3, when the chain has started at state $i$. We have that $h(5) =1$ and $h(3) = 0$. For all other state we have that:

[\begin{equation}
h(i) = \sum_{j=1}^5P(i, j)h(j).
\end{equation
}]

This gives a set of 3 equations with 3 unknowns ($h(1), h(2),\, \mathrm{and}\, h(4)$) as follows:

[\begin{align}
h(4) & = h(1) \
h(2) & = \frac{1}{5}h(4) + \frac{4}{5}h(5) = \frac{1}{5}h(4) + \frac{4}{5}\
h(1) & = \frac{1}{2}h(2) + \frac{1}{2}h(3) = \frac{1}{2}h(2)
\end{align
}]

Solving this system of equations gives us: $h(1) = \frac{4}{9}$.

Source: Introduction to Stochastic Processes

A reminder on Student’s t-distribution and p-values

4 minute read

Published:

Let’s see what Sir Larry Wasserman has to tell us about it.

So we’re in the realm of hypothesis testing [Aaditya Ramdas calls it stochastic proof by contraction, I
love this phrase
]. Suppose we divide our parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$.
We wish to test

\(H_0: \theta \in \Theta_0 \quad\text{versus} \quad H_1: \theta \in \Theta_1\).

We call $H_0$ the null hypothesis that we’d like to reject. Because it says nothing interesting is going on
(hence the name null). Let $\mathbb{P}_\theta$ be a probability distribution with support $\mathcal{X}$
parameterized by $\theta$. Define $\mathcal{R} \subset \mathcal{X}$ called the rejection region.
Let $X \sim \mathbb{P}$$\theta$$(\cdot)$. Then, if

[X \in \mathcal{R} \Rightarrow \text{ reject } H_0, \
X \notin \mathcal{R} \Rightarrow \text{ retain } H_0.]

Usually, the rejection region is of the form

[R = {x \in \mathcal{X}: T(x) > c },]

where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate critical value $c$.

P.S.: A test statistic is a single number calculated from sample data that is used to evaluate a hypothesis in statistical analysis. It quantifies the difference between the observed data and what would be expected if the null hypothesis were true. Essentially, it helps determine how compatible your data is with a specific hypothesis.

Definition: The power function of a test with rejection region $\mathcal{R}$ is defined by

[\beta(\theta) = \mathbb{P}_\theta(X \in \mathcal{R}).]

The size of a test is defined to be

[\sup_{\theta \in \Theta_0} \beta(\theta).]

A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Basically, the level $\alpha$ specifies the maximum probability of rejection the null hypothesis.

Induction by Contradiction

1 minute read

Published:

In this post, we wanna prove the correctness of the proof by induction itself. Lol

Stochastic Approximation Part 1

26 minute read

Published:

What struck my curiosity to investigate why $Q$-learning and SARSA converge was the realization that these methods use
their estimates of action values at time step $t$ to estimate the action values at time step $t + 1$ in their update
targets. This sounded really weired to me as though the convergence should not happen. So, I dug in and discovered the
answer lies in the topic of stochastic approximation. Stochastic approximation is fairly a big topic, hence I cover it
in four separate parts. In the last part I’ll turn my attention to Q-Learning and SARSA eventually.

I’ll mention the assumption required to each proposition during their corresponding proofs to see where those
assumptions were inevitably needed.

Needed background

In this section I’ll review some concepts needed throughout the document.

What to remember from ICML 2025

less than 1 minute read

Published:

Andreas Krause’s first step in research (keynote talk):

You have to know what you don’t know.

1900