Continuous Random Variables
1. Continuous Random Variables
For a continuous probability distribution, it doesn’t make sense to consider the probability of a random variable at a single point \(\mathbb{P}[X=a]\): there are infinitely many points, so the probability of any individual one of them is zero. Therefore, instead of specifying \(\mathbb{P}[X=a]\), we specify \(\mathbb{P}[a \leq X \leq b]\) for intervals \([a,b]\). To do this, we consider continuous random variables with a probability density function (PDF), which is a function \(f: \mathbb{R} \to \mathbb{R}\) such that:
\begin{align} \boxed{\mathbb{P}[a\leq X \leq b] = \int_{a}^{b}f(x)\text{ d}x} \end{align}for all \(a < b\). This function must satisfy the following two properties:
- \(f(x) \geq 0\) for all \(x \in \mathbb{R}\)
- \(\int_{-\infty}^{\infty}f(x)\text{ d}x = 1\)
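To make this concrete, here is a quick numerical sanity check in Python (the density \(f(x)=2x\) on \([0,1]\) is our own example; scipy handles the integration):

```python
from scipy.integrate import quad

# Example density (our own choice): f(x) = 2x on [0, 1], 0 elsewhere.
def f(x):
    return 2 * x if 0 <= x <= 1 else 0.0

# Property 2: f must integrate to 1 over all of R.
# f vanishes outside [0, 1], so a finite window suffices.
total, _ = quad(f, -10, 10, points=[0, 1])
print(total)  # ~1.0

# P[0.25 <= X <= 0.75] is the area under f between those endpoints.
p, _ = quad(f, 0.25, 0.75)
print(p)  # 0.75^2 - 0.25^2 = 0.5
```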
1.1. Cumulative Distribution Function
The cumulative distribution function (CDF) is the function \(F(x) = \mathbb{P}[X\leq x]\), and is related to the probability density function like so:
\begin{align} F(x) = \mathbb{P}[X\leq x] = \int_{-\infty}^xf(z)\text{ d}z \end{align}Conversely, if we know the CDF \(F(x)\) of a random variable \(X\), we can differentiate it to recover the probability density function:
\begin{align} f(x)=\frac{\text{d}F(x)}{\text{d}x} \end{align}
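Continuing the example density \(f(x)=2x\) on \([0,1]\) from above, a sketch of this relationship numerically (here the CDF is \(F(x)=x^2\) on \([0,1]\)):

```python
from scipy.integrate import quad

# Example density from above: f(x) = 2x on [0, 1].
def f(x):
    return 2 * x if 0 <= x <= 1 else 0.0

# CDF: F(x) = integral of f up to x (f vanishes below 0).
def F(x):
    val, _ = quad(f, 0, x)
    return val

print(F(0.5))  # ~0.25, matching F(x) = x^2 on [0, 1]

# Differentiating F numerically recovers the density.
h = 1e-6
print((F(0.5 + h) - F(0.5 - h)) / (2 * h))  # ~f(0.5) = 1.0
```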
1.2. Expectation and Variance
The expectation of a continuous random variable \(X\) with probability density function \(f\) is:
\begin{align} \boxed{\mathbb{E}[X] = \int_{-\infty}^{\infty}xf(x)\text{d}x} \end{align}The variance of a continuous random variable \(X\) with probability density function \(f\) is:
\begin{align} \boxed{\text{Var}(X) = \int_{-\infty}^{\infty}(x-\mathbb{E}[X])^2f(x)\text{ d}x} \end{align}The equivalent expression \(\text{Var}(X) = \mathbb{E}[X^2] - \mu^2\), where \(\mu = \mathbb{E}[X]\), also holds in the continuous case.
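Again with the example density \(f(x)=2x\) on \([0,1]\), a quick check of both formulas (here \(\mathbb{E}[X]=2/3\) and \(\text{Var}(X)=1/18\)):

```python
from scipy.integrate import quad

# Example density from above: f(x) = 2x on [0, 1].
def f(x):
    return 2 * x if 0 <= x <= 1 else 0.0

mean, _ = quad(lambda x: x * f(x), 0, 1)               # E[X] = 2/3
var, _ = quad(lambda x: (x - mean) ** 2 * f(x), 0, 1)  # Var(X) = 1/18
ex2, _ = quad(lambda x: x ** 2 * f(x), 0, 1)           # E[X^2] = 1/2

print(mean, var)        # ~0.667, ~0.0556
print(ex2 - mean ** 2)  # ~0.0556, matching E[X^2] - mu^2
```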
1.3. Joint Density
Just as the distribution of a continuous random variable \(X\) can be characterized by its density function, the joint distribution of \(X\) and \(Y\) can be characterized by their joint density.
A joint density function for two random variables \(X\) and \(Y\) is a function \(f_{X,Y}:\mathbb{R}^2\to\mathbb{R}\) satisfying:
- \(f_{X,Y}(x,y) \geq 0\) for all \(x,y\in\mathbb{R}\)
- \(\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{X,Y}(x,y)\text{ d}x\text{ d}y=1\)
Then, the joint distribution of \(X\) and \(Y\) is given by:
\begin{align} \mathbb{P}[a \leq X \leq b, c \leq Y \leq d] = \int_c^d\int_a^bf_{X,Y}(x,y)\text{ d}x\text{ d}y \end{align}We can interpret \(f_{X,Y}(x,y)\) as the probability per unit area in the vicinity of \((x,y)\).
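A sketch with a concrete joint density (our own example, \(f_{X,Y}(x,y) = x + y\) on the unit square):

```python
from scipy.integrate import dblquad

# Example joint density (our own choice): f(x, y) = x + y on [0,1]^2.
def f_xy(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

# Note: dblquad's integrand takes its arguments as (y, x).
# Total mass must be 1.
total, _ = dblquad(lambda y, x: f_xy(x, y), 0, 1, lambda x: 0, lambda x: 1)
print(total)  # ~1.0

# P[0 <= X <= 0.5, 0 <= Y <= 0.5]
p, _ = dblquad(lambda y, x: f_xy(x, y), 0, 0.5, lambda x: 0, lambda x: 0.5)
print(p)  # 0.125
```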
1.4. Marginal and Conditional Density
Consider a joint distribution on \(X\) and \(Y\) with joint density function \(f_{X,Y}(x,y)\).
The marginal density function \(f_X\) of \(X\) is defined as:
\begin{align} f_X(x) = \int_{-\infty}^{\infty}f_{X,Y}(x,y)\text{ d}y \end{align}The conditional density function \(f_{Y|X=x}\) of \(Y\) given \(X=x\) is defined as:
\begin{align} f_{Y|X=x}(y) = \frac{f_{X,Y}(x,y)}{f_X(x)} \end{align}
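For the example joint density \(f_{X,Y}(x,y) = x+y\) on the unit square from above, the marginal works out to \(f_X(x) = x + \frac{1}{2}\), and we can check numerically that the conditional density integrates to 1:

```python
from scipy.integrate import quad

# Example joint density from above: f(x, y) = x + y on [0,1]^2.
def f_xy(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

# Marginal of X: integrate the joint density over y.
def f_x(x):
    val, _ = quad(lambda y: f_xy(x, y), 0, 1)
    return val  # equals x + 1/2 on [0, 1]

# Conditional density of Y given X = x.
def f_y_given_x(y, x):
    return f_xy(x, y) / f_x(x)

print(f_x(0.5))                                      # ~1.0
print(quad(lambda y: f_y_given_x(y, 0.5), 0, 1)[0])  # ~1.0: a valid density
```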
1.5. Independence
Two continuous random variables \(X\) and \(Y\) are said to be independent if the events \(a\leq X \leq b\) and \(c\leq Y \leq d\) are independent for all \(a \leq b\) and \(c \leq d\):
\begin{align} \mathbb{P}[a\leq X \leq b, c \leq Y \leq d] = \mathbb{P}[a \leq X \leq b]\mathbb{P}[c \leq Y \leq d] \end{align}Equivalently, \(X\) and \(Y\) are independent if:
\begin{align} f_{X|Y=y}(x) = f_X(x) \end{align}for all \(x\) and all \(y\) with \(f_Y(y) > 0\).
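Combining this with the definition of the conditional density, independence also means the joint density factors: \(f_{X,Y}(x,y) = f_X(x)f_Y(y)\). A quick numerical sanity check (our own example: independent \(X\) with density \(2x\) and \(Y\) with density \(3y^2\), both on \([0,1]\)):

```python
from scipy.integrate import quad, dblquad

# Our own example: independent X with density 2x and Y with density 3y^2,
# both on [0, 1], so the joint density factors as f(x, y) = f_X(x) f_Y(y).
f_x = lambda x: 2 * x
f_y = lambda y: 3 * y ** 2

# Interval probabilities then multiply:
joint, _ = dblquad(lambda y, x: f_x(x) * f_y(y),
                   0, 0.5, lambda x: 0, lambda x: 0.5)
px, _ = quad(f_x, 0, 0.5)
py, _ = quad(f_y, 0, 0.5)
print(joint, px * py)  # both = 0.25 * 0.125 = 0.03125
```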
1.6. Law of Total Probability
The law of total probability in the continuous case says that:
\begin{align} f_X(x) = \int_{-\infty}^{\infty}f_{X,Y}(x,y)\text{ d}y = \int_{-\infty}^{\infty}f_{X|Y=y}(x)f_Y(y)\text{ d}y \end{align}
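A sketch of the law in action (our own toy example: \(Y\) is uniform on \([0,1]\) and, given \(Y=y\), \(X\) is uniform on \([0,y]\); integrating out \(y\) gives the marginal \(f_X(x) = -\ln x\) on \((0,1]\)):

```python
import numpy as np
from scipy.integrate import quad

# Our own example: f_Y(y) = 1 on [0, 1], and f_{X|Y=y}(x) = 1/y on [0, y].
def f_x_given_y(x, y):
    return 1 / y if 0 <= x <= y else 0.0

# Law of total probability: integrate f_{X|Y=y}(x) * f_Y(y) over y.
def f_x(x):
    val, _ = quad(lambda y: f_x_given_y(x, y) * 1.0, 0, 1, points=[x])
    return val

print(f_x(0.25), -np.log(0.25))  # both ~1.386
```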
2. Continuous Probability Distributions
2.1. Uniform Distribution
A random variable \(U\) is uniformly distributed on \([0,L]\), written \(U \sim \text{Uniform}(0,L)\), if we have
\begin{align} \mathbb{P}[a \leq U \leq b] = \frac{b-a}{L} \end{align}for all \(0 \leq a \leq b \leq L\). The PDF of the uniform distribution is then:
\begin{align} f_U(u) = \begin{cases} \frac{1}{L} &u\in[0,L] \\ 0 &\text{otherwise} \end{cases} \end{align}The expectation and variance of the uniform distribution are:
\begin{align} \mathbb{E}[U] &= \frac{L}{2} \\ \text{Var}(U) &= \frac{L^2}{12} \end{align}
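A quick simulation check of these formulas (numpy's uniform sampler; the choice \(L=4\) is arbitrary):

```python
import numpy as np

# Simulate Uniform(0, L) and compare with E[U] = L/2 and Var(U) = L^2/12.
L = 4.0
rng = np.random.default_rng(0)
u = rng.uniform(0, L, size=1_000_000)

print(u.mean(), L / 2)       # ~2.0
print(u.var(), L ** 2 / 12)  # ~1.333
```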
2.2. Exponential Distribution
The exponential distribution is the continuous version of the geometric distribution. For \(\lambda>0\), a continuous random variable \(X\) with PDF
\begin{align} f(x) = \begin{cases} \lambda e^{-\lambda x} &x\geq 0 \\ 0 &\text{otherwise} \end{cases} \end{align}is called an exponential random variable with parameter \(\lambda\), and we write \(X\sim \text{Exp}(\lambda)\). The expectation and variance of the exponential distribution are:
\begin{align} \mathbb{E}[X] &= \frac{1}{\lambda} \\ \text{Var}(X) &= \frac{1}{\lambda^2} \end{align}
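A simulation check, including the memorylessness property \(\mathbb{P}[X > s + t \mid X > s] = \mathbb{P}[X > t]\) that makes the exponential the continuous analogue of the geometric (the parameter \(\lambda=2\) is arbitrary):

```python
import numpy as np

# Simulate Exp(lambda) and compare with E[X] = 1/lambda, Var(X) = 1/lambda^2.
lam = 2.0
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / lam, size=1_000_000)

print(x.mean(), 1 / lam)      # ~0.5
print(x.var(), 1 / lam ** 2)  # ~0.25

# Memorylessness: P[X > s + t | X > s] = P[X > t] = e^{-lambda * t}.
s, t = 0.5, 1.0
print((x > s + t).mean() / (x > s).mean(), np.exp(-lam * t))
```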
2.3. Normal Distribution
A continuous random variable \(X\) has a normal distribution (or Gaussian distribution) if it has a PDF of:
\begin{align} f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}} \end{align}We write \(X \sim N(\mu,\sigma^2)\). In the special case of \(N(0,1)\), \(X\) is said to have the standard normal distribution. In fact, given any normal distribution, we can shift and rescale it to become the standard normal distribution.
More specifically, if \(X \sim N(\mu,\sigma^2)\), then \(Y=\frac{X-\mu}{\sigma} \sim N(0,1)\). Equivalently, if \(Y\sim N(0,1)\), then \(X=\sigma Y + \mu \sim N(\mu,\sigma^2)\).
The expectation and variance of the normal distribution are:
\begin{align} \mathbb{E}[X] &= \mu \\ \text{Var}(X) &= \sigma^2 \end{align}
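The standardization above is how normal probabilities are computed in practice: every question about \(N(\mu,\sigma^2)\) reduces to the standard normal CDF \(\Phi\). A sketch (the parameters are arbitrary; scipy's norm provides the CDF):

```python
from scipy.stats import norm

# If X ~ N(mu, sigma^2), then P[X <= c] = Phi((c - mu) / sigma).
mu, sigma = 10.0, 3.0
c = 13.0

print(norm.cdf(c, loc=mu, scale=sigma))  # ~0.8413
print(norm.cdf((c - mu) / sigma))        # same value: Phi(1)
```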
2.3.1. Sum of Independent Normal Random Variables
An important property of the normal distribution is that the sum of independent normal random variables is also normally distributed.
For the standard normal case: if \(X \sim N(0,1)\) and \(Y \sim N(0,1)\) are independent and \(a,b\in\mathbb{R}\) are constants, then \(Z = aX + bY \sim N(0,a^2+b^2)\).
The key insight behind this is that the joint density of two independent standard normals, \(f_{X,Y}(x,y)=\frac{1}{2\pi}e^{-(x^2+y^2)/2}\), is rotationally symmetric around the origin (it depends on \((x,y)\) only through \(x^2+y^2\), the squared distance from the origin). This means that rotating the half-plane \(\{(x,y) \mid ax+by \leq t\}\) into \(\{(x,y) \mid x \leq \frac{t}{\sqrt{a^2+b^2}}\}\) does not change the probability, so \(Z\) has the same distribution as \(\sqrt{a^2+b^2}\,X\), which by the shift/scale property is \(N(0,a^2+b^2)\).
The general case follows from the shift/scale property. If \(X \sim N(\mu_X,\sigma_X^2)\) and \(Y \sim N(\mu_Y,\sigma_Y^2)\) are independent, then for any constants \(a,b\in\mathbb{R}\):
\begin{align} \boxed{Z = aX + bY \sim N(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2)} \end{align}This follows by standardizing: let \(Z_1 = \frac{X-\mu_X}{\sigma_X}\) and \(Z_2 = \frac{Y-\mu_Y}{\sigma_Y}\), which are independent standard normals. Then \(Z = (a\mu_X + b\mu_Y) + (a\sigma_X Z_1 + b\sigma_Y Z_2)\), and the standard normal result tells us the second term is \(N(0, a^2\sigma_X^2 + b^2\sigma_Y^2)\). Shifting by the constant gives the result.
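A simulation sketch of the boxed result (the constants are arbitrary):

```python
import numpy as np

# Z = aX + bY for independent X ~ N(1, 4) and Y ~ N(3, 0.25).
rng = np.random.default_rng(0)
a, b = 2.0, -1.0
mu_x, sig_x, mu_y, sig_y = 1.0, 2.0, 3.0, 0.5

x = rng.normal(mu_x, sig_x, size=1_000_000)
y = rng.normal(mu_y, sig_y, size=1_000_000)
z = a * x + b * y

print(z.mean(), a * mu_x + b * mu_y)                       # ~-1.0
print(z.var(), a ** 2 * sig_x ** 2 + b ** 2 * sig_y ** 2)  # ~16.25
```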
3. Central Limit Theorem
The Central Limit Theorem (CLT) is a much stronger statement than the Law of Large Numbers. While the Law of Large Numbers tells us that the sample average \(\frac{S_n}{n}\) converges to the mean \(\mu\), the CLT tells us that the distribution of the sample average, for large enough \(n\), looks like a normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).
To state the CLT precisely, we standardize \(\frac{S_n}{n}\) so that the limiting distribution is a fixed constant (rather than depending on \(n\)):
\begin{align} \frac{\frac{S_n}{n} - \mu}{\frac{\sigma}{\sqrt{n}}} = \frac{S_n - n\mu}{\sigma\sqrt{n}} \end{align}Let \(X_1,X_2,\ldots\) be a sequence of i.i.d. random variables with common finite expectation \(\mathbb{E}[X_i]=\mu\) and finite variance \(\text{Var}(X_i)=\sigma^2\). Let \(S_n = \sum_{i=1}^{n}X_i\). Then:
\begin{align} \boxed{\mathbb{P}\left[\frac{S_n - n\mu}{\sigma\sqrt{n}} \leq c\right] \to \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{c}e^{-x^2/2}\text{ d}x \quad \text{as } n\to\infty} \end{align}In other words, the distribution of \(\frac{S_n - n\mu}{\sigma\sqrt{n}}\) converges to \(N(0,1)\) as \(n\to\infty\).
This is a striking result: all trace of the original distribution of \(X\) disappears as \(n\) gets large. No matter how complex the distribution of \(X\) is, the sample average always looks normal for large \(n\). The only effect of the original distribution is through the variance \(\sigma^2\), which determines the width of the bell curve (and hence how fast it shrinks to a spike at \(\mu\)).
This immediately explains why the normal distribution shows up so often in practice. Much of the data we work with in real life (mean wealth, mean height, mean temperature, etc.) is the result of averaging, so by the CLT it naturally follows a normal distribution. Even something like the height of a population tends to be normal, because a person’s height is itself the result of averaging many independent factors (nutrition, environment, various genetic factors, etc.).
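To see the CLT at work, here is a small simulation (our own choice of \(\text{Exp}(1)\) summands, which have \(\mu = \sigma = 1\) and are heavily skewed, so the resulting normality is not built in):

```python
import numpy as np

# Standardize sums of n i.i.d. Exp(1) variables (mu = sigma = 1) and
# compare a tail probability against the standard normal CDF.
rng = np.random.default_rng(0)
n, trials = 500, 20_000
s = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
z = (s - n) / np.sqrt(n)  # (S_n - n*mu) / (sigma * sqrt(n))

print((z <= 1.0).mean())  # ~0.8413 = Phi(1), despite the skewed summands
```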