Random variables’ moments and related results

Mostly from All of Statistics.

Definitions

Definition. The $k^{\text{th}}$ moment of a random variable $X$ is $\mathbb{E}\left[X^k\right]$,
assuming $\mathbb{E}\left[\lvert X \rvert^k\right]$ exists. With this definition, the mean is the first moment.

Definition. The variance of a random variable $X$ with mean $\mu$ is defined by

\[\mathbb{V}(X) = \mathbb{E}\left[(X - \mu)^2\right].\]

The variance relates to the first two moments as follows (it is the difference between the second moment and the square of the first moment):

\[\mathbb{V}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2 .\]
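
As a quick numerical sanity check of this identity, here is a minimal sketch in NumPy (the exponential distribution and the sample size are arbitrary choices for illustration):

```python
# Monte Carlo check that V(X) = E[X^2] - (E[X])^2.
# The exponential distribution is an arbitrary choice (mean 2, variance 4).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)

lhs = x.var()                          # sample version of E[(X - mu)^2]
rhs = (x ** 2).mean() - x.mean() ** 2  # second moment minus squared first moment
print(lhs, rhs)                        # both close to 4
```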

Moment generating function (MGF)

MGFs are useful for finding random variables’ moments.

The MGF, or Laplace transform, of $X$ is defined by

\[\psi_X(t) = \mathbb{E}\left[e^{tX}\right] = \int e^{tx} \, d\mathbb{P}(x), \quad t \in \mathbb{R}.\]

Suppose the MGF is well-defined (discussed below) in a neighborhood of $t = 0$, so that we can exchange derivatives and expectations. Then,

\[\psi'(0) = \frac{d}{dt} \mathbb{E}\left[e^{tX}\right] \Big\vert_{t = 0} = \mathbb{E}\left[\frac{d}{dt} e^{tX}\right] \Big\vert_{t = 0} = \mathbb{E}\left[X e^{tX}\right] \Big\vert_{t = 0} = \mathbb{E}[X].\]

Hence, by taking $k$ derivatives, $\psi^{(k)}(0) = \mathbb{E}\left[X^k \right]$.
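
As a small illustration, here is a sketch in SymPy that differentiates the MGF of a standard normal, $\psi(t) = e^{t^2/2}$, and recovers its moments $1, 0, 1, 0, 3$ (the choice of distribution is mine, not from the text above):

```python
# Check psi^{(k)}(0) = E[X^k] for X ~ N(0, 1), whose MGF is exp(t^2 / 2).
import sympy as sp

t = sp.symbols("t")
psi = sp.exp(t ** 2 / 2)  # MGF of the standard normal

for k in range(5):
    # k-th derivative of the MGF, evaluated at t = 0
    print(k, sp.diff(psi, t, k).subs(t, 0))  # 1, 0, 1, 0, 3
```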

When can we exchange the order of applying a derivative with an expectation?

Proposition. Let $X \in \mathcal{X}$ be a random variable and let $g: \mathbb{R} \times \mathcal{X} \to \mathbb{R}$ be a function such that $g(t, X)$ is integrable for all $t$ and continuously differentiable with respect to $t$. Assume there exists a random variable $Z$ such that $\left \vert\frac{\partial}{\partial t} g(t, X) \right\vert \leq Z$ almost surely for all $t$ and $\mathbb{E}[Z] < \infty$. Then,

\[\frac{\partial}{\partial t}\mathbb{E}[g(t, X)] =\mathbb{E}\left[\frac{\partial}{\partial t} g(t, X) \right]\]

Proof.

\[\begin{align*} \frac{\partial}{\partial t}\mathbb{E}[g(t, X)] & = \lim_{h \to 0} \frac 1h \mathbb{E}[g(t + h, X) - g(t, X)] \\ &= \lim_{h \to 0} \mathbb{E}\left[\frac{g(t + h, X) - g(t, X)}{h}\right] \\ & = \lim_{h \to 0} \mathbb{E}\left[\frac{\partial}{\partial t}g(\tau(h), X)\right], \end{align*}\]

where $\tau(h) \in (t, t + h)$ exists by the mean value theorem. Since $\left \vert\frac{\partial}{\partial t} g(\tau(h), X) \right\vert \leq Z$, we can apply the dominated convergence theorem and move the limit inside,

\[\lim_{h \to 0} \mathbb{E}\left[\frac{\partial}{\partial t}g(\tau(h), X)\right] = \mathbb{E}\left[\lim_{h \to 0} \frac{\partial}{\partial t}g(\tau(h), X)\right] = \mathbb{E}\left[\frac{\partial}{\partial t}g(t, X)\right]. \; \square\]

If you only need the derivative at a single point $t^\star$, boundedness in a neighborhood of $t^\star$ suffices, which is the case for $t^\star = 0$ when using the MGF to compute moments.
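
As a numerical illustration of the interchange, here is a sketch with $g(t, X) = \sin(tX)$ and $X$ uniform on $[0, 1]$ (an arbitrary choice with $\left\vert \frac{\partial}{\partial t} g(t, X)\right\vert \leq 1$, so the proposition applies): a finite-difference estimate of $\frac{\partial}{\partial t}\mathbb{E}[g(t, X)]$ agrees with a Monte Carlo estimate of $\mathbb{E}\left[\frac{\partial}{\partial t} g(t, X)\right]$.

```python
# Finite-difference estimate of d/dt E[sin(t X)] vs Monte Carlo estimate of
# E[X cos(t X)], for X ~ Uniform(0, 1) at t = 1, using the same samples on
# both sides of the difference so the comparison is stable.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)
t, h = 1.0, 1e-5

lhs = (np.sin((t + h) * x).mean() - np.sin((t - h) * x).mean()) / (2 * h)
rhs = (x * np.cos(t * x)).mean()
print(lhs, rhs)  # approximately equal
```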

Mean and variance in discounted MDPs

We assume we’re in the episodic setting.
The state space is finite, $\mathcal{X} = \{1, \dots, n\}$, and there is a special terminal state $x_T$.
We assume the agent follows a fixed proper (definition below) policy $\pi$.
Let $r_\pi \in \mathbb{R}^n$ and $P_\pi \in \mathbb{R}^{n \times n}$ denote the expected reward vector
and the transition probability matrix induced by $\pi$. Let $\tau = \min \{t > 0 \mid X_t = x_T\}$ denote the first visit time to the terminal state, and let the random variable $G$ denote the accumulated discounted reward along the trajectory until that time:

\[G = \sum_{t = 0}^{\tau - 1} \gamma^t r_\pi(X_t).\]

Then the first and second moments of $G$, conditioned on the initial state, are

\[\begin{align*} J(x) &= \mathbb{E}[G \mid X_0 = x], \\ M(x) &= \mathbb{E}[G^2 \mid X_0 = x]. \end{align*}\]

Let’s expand each. First $J(x)$:

\[\begin{align*} J(x) &= \mathbb{E} \left[ \sum_{t = 0}^{\tau - 1} \gamma^t r_\pi(X_t) \middle\vert X_0 = x \right] \\ &= r_\pi(x) + \mathbb{E} \left[ \sum_{t = 1}^{\tau - 1} \gamma^t r_\pi(X_t) \middle\vert X_0 = x\right] \\ & = r_\pi(x) + \gamma \mathbb{E} \left[ \mathbb{E}\left[\sum_{t = 1}^{\tau - 1} \gamma^{t - 1} r_\pi(X_t) \middle\vert X_0 = x, X_1 = x' \right]\right] & \text{(tower rule)} \\ & = r_\pi(x) + \gamma \mathbb{E}\left[J(x') \right] \\ & = r_\pi(x) + \gamma \sum_{x'} P_\pi\left(x' \middle\vert x\right)J\left(x'\right). & \text{(Bellman equation)} \end{align*}\]
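
In vector form this Bellman equation reads $J = r_\pi + \gamma P_\pi J$, so $J$ can be computed by solving a single linear system. Below is a minimal NumPy sketch; the 3-state chain, its rewards, and $\gamma$ are made-up numbers for illustration, and the terminal state is left out of $P_\pi$ (making it substochastic):

```python
# Solve (I - gamma * P) J = r for a small made-up chain:
# state 0 -> 1 -> 2, and from state 2 the episode terminates w.p. 0.5 per step.
import numpy as np

gamma = 0.9
r = np.array([1.0, 0.0, 2.0])        # expected reward at each non-terminal state
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.5]])      # row sums <= 1; the missing mass goes to x_T

J = np.linalg.solve(np.eye(3) - gamma * P, r)
print(J)
```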

Now $M(x)$:

\[\begin{align*} M(x) &= \mathbb{E} \left[ \left(\sum_{t = 0}^{\tau - 1} \gamma^t r_\pi(X_t)\right)^2 \middle\vert X_0 = x \right] \\ &= \mathbb{E} \left[ \left(r_\pi(x) + \sum_{t = 1}^{\tau - 1} \gamma^t r_\pi(X_t)\right)^2 \middle\vert X_0 = x \right] \\ & = r_\pi(x)^2 + 2r_\pi(x)\mathbb{E} \left[ \sum_{t = 1}^{\tau - 1} \gamma^t r_\pi(X_t) \middle\vert X_0 = x \right] + \mathbb{E} \left[ \left(\sum_{t = 1}^{\tau - 1} \gamma^t r_\pi(X_t)\right)^2 \middle\vert X_0 = x \right] \\ & = r_\pi(x)^2 + 2\gamma r_\pi(x) \sum_{x' }P_\pi\left(x' \middle\vert x\right)J(x') + \gamma^2 \sum_{x' }P_\pi\left(x' \middle\vert x\right)M(x'). & \text{(tower rule)} \end{align*}\]
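
In vector form, $M = r_\pi^2 + 2\gamma\, r_\pi \odot (P_\pi J) + \gamma^2 P_\pi M$ (with $\odot$ and the square taken element-wise), which is again a linear system. A sketch on the same made-up chain as above:

```python
# Solve the second-moment recursion M = r^2 + 2*gamma*r*(P @ J) + gamma^2 * P @ M
# for the same made-up chain as in the previous sketch.
import numpy as np

gamma = 0.9
r = np.array([1.0, 0.0, 2.0])
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.5]])

J = np.linalg.solve(np.eye(3) - gamma * P, r)
M = np.linalg.solve(np.eye(3) - gamma ** 2 * P, r ** 2 + 2 * gamma * r * (P @ J))
print(M, M - J ** 2)  # second moments and the variances V = M - J^2
```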

So the variance of $G$ at state $x$ satisfies (note that when subtracting $J(x)^2$ from $M(x)$, the $r_\pi(x)^2$ and cross terms cancel):

\[\begin{align*} V(x) = \mathbb{V}(G \mid X_0=x) & = M(x) - J(x)^2 \\ & = \gamma^2 \sum_{x' }P_\pi\left(x' \middle\vert x\right)M(x') - \left(\gamma \sum_{x'} P_\pi\left(x' \middle\vert x\right)J\left(x'\right)\right)^2 \\ & = \gamma^2\left[ \sum_{x' }P_\pi\left(x' \middle\vert x\right)M(x') - \left(\sum_{x'} P_\pi\left(x' \middle\vert x\right)J\left(x'\right)\right)^2 \right] \\ & \geq \gamma^2\left[ \sum_{x' }P_\pi\left(x' \middle\vert x\right)M(x') - \sum_{x'} P_\pi\left(x' \middle\vert x\right) J\left(x'\right)^2 \right] & \text {(Cauchy-Schwarz)} \\ & = \gamma^2 \sum_{x'} P_\pi\left(x' \middle\vert x\right) \left(M\left(x'\right) - J(x')^2\right) \\ & = \gamma^2\, \mathbb{E}\left[V(X_1) \middle\vert X_0 = x\right]. \end{align*}\]

The above display shows that the variance of the return at a state is mostly influenced by the states that are temporally close to it: the variance of more distant states is discounted at a fast geometric rate. For example, the variance at a state reached just before termination, roughly $n$ steps away, affects the variance at the initial state by a factor of about $\gamma^{2n}$. This matches our intuition as well.
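
As a sanity check of these closed-form moments, the sketch below simulates episodes of the same made-up chain from state $0$ and compares the empirical mean and variance of the return with $J(0)$ and $V(0)$ computed above:

```python
# Monte Carlo estimate of the return's mean and variance from state 0,
# for the same made-up chain (0 -> 1 -> 2, terminate from 2 w.p. 0.5 per step).
import numpy as np

rng = np.random.default_rng(0)
gamma, r = 0.9, np.array([1.0, 0.0, 2.0])

def episode_return(x=0):
    g, disc = 0.0, 1.0
    while True:
        g += disc * r[x]
        disc *= gamma
        if x < 2:
            x += 1                    # deterministic transitions 0 -> 1 -> 2
        elif rng.random() < 0.5:
            return g                  # reach the terminal state
        # otherwise stay in state 2 for another step

returns = np.array([episode_return() for _ in range(200_000)])
print(returns.mean(), returns.var())  # compare with J(0) and V(0) from above
```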

Definition. A stationary policy $\pi$ is said to be proper if, under this policy, there is a positive probability that the terminal state will be reached after at most $n$ stages, regardless of the initial state; that is, if

\[\max_{x = 1, \dots, n} \mathbb{P}_{\pi} \left( X_n \neq x_T \mid X_0=x\right) < 1.\]
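
With the terminal state excluded from $P_\pi$ (as in the sketches above), the row sums of $P_\pi^n$ equal $\mathbb{P}_\pi(X_n \neq x_T \mid X_0 = x)$, so properness can be checked directly; here is a minimal check on the same made-up chain:

```python
# Properness check: with the terminal state dropped from P, the row sums of P^n
# equal P(X_n != x_T | X_0 = x); the policy is proper iff all of them are < 1.
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.5]])
n = P.shape[0]

not_terminated = np.linalg.matrix_power(P, n).sum(axis=1)
print(not_terminated.max() < 1)  # True -> the policy is proper
```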

References

  1. Tamar et al., 2013.
  2. Wasserman, All of Statistics.
  3. When can we interchange the derivative with an expectation?
  4. Bertsekas and Tsitsiklis, Neuro-Dynamic Programming.