# The Central Limit Theorem isn’t a Statistical Silver Bullet

Chances are, if you took anything away from that high school or college statistics class you were dragged into, you remember some vague notion about the Central Limit Theorem. It’s likely the most famous theorem in statistics, and the most widely used. Most introductory statistics textbooks state the theorem in broad terms: as the sample size increases, the sampling distribution of the (suitably centered and scaled) sum of the sample elements becomes approximately normal, regardless of the underlying distribution. Much of statistical inference across a broad variety of fields, such as the classical z-test, relies on this theorem for justification, and many conclusions in science, economics, public policy, and social studies have been drawn from tests that lean on it. We’re going to look at this theorem a bit more formally, and then work through a counterexample. Not every sequence of random variables obeys the conditions of the theorem, and the assumptions are stricter than the way the theorem gets used in practice would suggest.

First off, the counterexamples I’m going to discuss come from the book Counterexamples in Probability by Jordan M. Stoyanov. The book has many great counterexamples, but it’s quite heavy going for a non-mathematician, which is why I’m going to pull examples out and walk through them in detail. Counterexamples are important in mathematics because they remind us where the limitations of the theory are.

[Fair warning here. This post will get a little math-heavy. Stick with me, and we’ll walk through the valley of the shadow of math together.]

## The Central Limit Theorem, now in formal attire

First, we’ll state the actual Central Limit Theorem, and I’ll make comments as we go to point some things out. From Stoyanov:

Let $\{X_{n}, n \geq 1\}$ be a sequence of independent random variables defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

Let $S_{n} = \sum_{i=1}^{n}X_{i}$, let $\sigma_{i}^{2}$ be the variance of the random variable $X_{i}$, and define $A_{n} = \text{E}[S_{n}] = \sum_{i=1}^{n}\text{E}[X_{i}]$ and $s_{n}^{2} = \text{Var}[S_{n}] = \sum_{i=1}^{n}\sigma_{i}^{2}$.

Then the sequence $\{X_{n}\}$ obeys the Central Limit Theorem if

$$\lim\limits_{n \to \infty}\mathbb{P}\left[\frac{S_{n}-A_{n}}{s_{n}} \leq x\right] = \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-u^{2}/2}du$$

for all $x \in \mathbb{R}$.

I’d like to talk about this for a bit. What this is saying is that the probability that the centered (subtract the expected value of the sum) and scaled (divide by the standard deviation of the sum, $s_{n}$) sum of random variables is less than some number $x$ converges to the probability that a standard normal random variable is less than that same $x$. The limit here is important. Limits describe what happens as we go on forever, and this theorem says nothing about how fast we get there. That speed matters, because many people just use the sample-size-is-over-30 rule of thumb (which has no general mathematical justification, by the way) and assume that after 30 units in your sample, you’re close enough. Depending on the underlying distribution, you may need 100, 1,000, or billions of terms.
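To see why the speed of convergence matters, here’s a small simulation sketch (standard library only; the helper names and the choice of $p = 0.01$ are mine). It standardizes sums of heavily skewed Bernoulli($p$) draws and compares the empirical CDF at $x = 0$ against $\Phi(0) = 0.5$: at $n = 30$ the gap is still large, and it takes a much bigger $n$ to shrink it.

```python
import math
import random

random.seed(42)

def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def standardized_sum(n, p=0.01):
    """(S_n - n*mu) / (sigma * sqrt(n)) for a sum of n Bernoulli(p) draws."""
    s = sum(random.random() < p for _ in range(n))
    mu, sigma = p, math.sqrt(p * (1 - p))
    return (s - n * mu) / (sigma * math.sqrt(n))

def cdf_gap_at_zero(n, trials=4000):
    """|empirical P(standardized sum <= 0) - Phi(0)| over many trials."""
    hits = sum(standardized_sum(n) <= 0 for _ in range(trials))
    return abs(hits / trials - phi(0.0))

gap_small = cdf_gap_at_zero(30)    # n = 30: nowhere near normal yet
gap_large = cdf_gap_at_zero(2000)  # n = 2000: much closer
print(gap_small, gap_large)
```

The exact gaps depend on the seed, but the qualitative picture is stable: for a skewed enough distribution, $n = 30$ is not close to "converged."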

One other note here. This theorem is general enough to handle random variables on the same probability space that have different means and variances. If we assume the variances are all the same $\sigma^{2}$, and that the means are all the same $\mu$, then you get this more “typical” formulation:

$$\lim\limits_{n \to \infty}\mathbb{P}\left[\frac{S_{n}-n\mu}{\sigma \sqrt{n}} \leq x \right] = \Phi(x)$$

## Counterexample

The Central Limit Theorem looked pretty broad. We just needed a sequence of independent random variables on the same probability space. So as long as we have that, we’re good, right? Not exactly. Let’s work through a fairly simple example.

Let’s suppose we have a random variable $Y$ that can take on two possible values: $\pm 1$, each with probability $1/2$. That is, $P(Y = 1) = P(Y = -1) = 1/2$.

Suppose now that we build a sequence of independent copies of $Y$. That is, we define $\{Y_{k}, k \geq 1\}$, where each $Y_{k}$ has the same distribution as $Y$ and the $Y_{k}$ are independent of one another. We’re now going to build a new random sequence $\{X_{k}, k \geq 1\}$ from these $\{Y_{k}, k \geq 1\}$ by letting $X_{k} = \frac{\sqrt{15}Y_{k}}{4^{k}}$. This means that each time we generate an $X_{k}$, we flip a fair coin to decide whether $Y_{k} = +1$ or $Y_{k} = -1$, so

$$X_{k} = \frac{\pm\sqrt{15}}{4^{k}}$$

with equal probability regarding the sign.

Now,

$$\text{E}[Y] = 1 \cdot \frac{1}{2} + (-1)\cdot \frac{1}{2} = 0$$

and

$$\text{Var}[Y] = \text{E}[Y^{2}]-\text{E}[Y]^{2} = \left(1^{2}\cdot \frac{1}{2} + (-1)^{2}\cdot \frac{1}{2}\right)-0^{2} = 1$$

Now, since $X_{k}$ is just a constant multiple of $Y_{k}$, $\text{E}[X_{k}] = 0$, and so

$$A_{n} = \text{E}[S_{n}] = \sum_{k=1}^{n}\text{E}[X_{k}] = 0$$

Now we have to calculate $\text{Var}[S_{n}]$. Start with the variance of a single term:

$$\text{Var}[X_{k}] = \text{E}[X_{k}^{2}]-\text{E}[X_{k}]^{2}$$

We already know that $\text{E}[X_{k}] = 0$, so we just need to find $\text{E}[X_{k}^{2}]$:

$$\begin{aligned}\text{E}[X_{k}^{2}] &=\frac{15\cdot 1^{2}}{(4^{k})^{2}}\cdot P\left(X_{k}=\frac{\sqrt{15}}{4^{k}}\right)+\frac{15\cdot (-1)^{2}}{(4^{k})^{2}}\cdot P\left(X_{k}=\frac{-\sqrt{15}}{4^{k}}\right)\\&=\frac{15}{16^{k}}\cdot\frac{1}{2}+\frac{15}{16^{k}}\cdot\frac{1}{2}\\&=\frac{15}{16^{k}}\end{aligned}$$

so $\text{Var}[X_{k}] = \frac{15}{16^{k}}$.
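As a quick sanity check on that calculation, you can grind out $\text{Var}[X_{k}]$ directly from the two equally likely values of $X_{k}$ (a throwaway sketch; the helper name is mine):

```python
import math

def var_Xk(k):
    """Variance of X_k, computed directly over its two equally likely values."""
    values = (math.sqrt(15) / 4**k, -math.sqrt(15) / 4**k)
    mean = sum(values) / 2                     # = 0, by symmetry
    return sum((v - mean) ** 2 for v in values) / 2

# Matches the closed form 15 / 16**k for the first few k.
for k in (1, 2, 3, 4):
    print(k, var_Xk(k), 15 / 16**k)
```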

Since the $X_{k}$ are independent, the variance of the sum is the sum of the variances, so we can quickly find $\text{Var}[S_{n}]$:

$$\begin{aligned}\text{Var}[S_{n}] &= \sum_{k=1}^{n}\text{Var}[X_{k}]\\&= \sum_{k=1}^{n}\frac{15}{16^{k}}\end{aligned}$$

This sum is the partial sum of a geometric series. So

$$\begin{aligned}\text{Var}[S_{n}] &= \sum_{k=1}^{n}\frac{15}{16^{k}}\\&= \frac{15}{16}\left(\dfrac{1-\left(\frac{1}{16}\right)^{n}}{1-\frac{1}{16}}\right)\\&= 1-\left(\frac{1}{16}\right)^{n}\end{aligned}$$
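If you don’t trust the algebra, a quick numeric check of the closed form (a sketch, nothing more):

```python
# Partial sums of 15/16**k should match the closed form 1 - (1/16)**n.
for n in (1, 2, 5, 10, 20):
    partial = sum(15 / 16**k for k in range(1, n + 1))
    closed = 1 - (1 / 16) ** n
    print(n, partial, closed)
```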

Since we are discussing convergence, we really only care about very large $n$. As $n$ grows, $\left(\frac{1}{16}\right)^{n}$ vanishes, so $\text{Var}[S_{n}] \approx 1$ for large $n$. The $\sqrt{15}$ in the definition of $X_{k}$ may have seemed to come from left field, but it is exactly what we needed to make the variance of the sum converge to 1.

Now, we’re going to note that $P(|S_{n}| \leq 1/2) = 0$. Why? Because $|X_{1} + X_{2} +\ldots + X_{n}| > 1/2$ always. The first term alone has $|X_{1}| = \frac{\sqrt{15}}{4} \approx 0.968$, while all the remaining terms together total at most $\sum_{k=2}^{\infty}\frac{\sqrt{15}}{4^{k}} = \frac{\sqrt{15}}{12} \approx 0.323$ in absolute value. So no matter how the signs fall, $|S_{n}| \geq \frac{\sqrt{15}}{4} - \frac{\sqrt{15}}{12} = \frac{\sqrt{15}}{6} \approx 0.645 > \frac{1}{2}$.
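This claim is easy to poke at numerically. A simulation sketch (variable names are mine) that draws many copies of $S_{n}$ with independent signs, then checks both the forbidden interval and the variance:

```python
import math
import random

random.seed(0)

def sample_S(n):
    """One draw of S_n = sum_{k=1}^{n} sqrt(15) * (+/-1) / 4**k, independent signs."""
    return sum(math.sqrt(15) * random.choice((-1, 1)) / 4**k
               for k in range(1, n + 1))

draws = [sample_S(20) for _ in range(50_000)]

# No draw ever lands in [-1/2, 1/2]; in fact |S_n| >= sqrt(15)/6 ~ 0.645.
min_abs = min(abs(s) for s in draws)

# The sample variance sits near the theoretical 1 - (1/16)**20, i.e. ~1.
sample_var = sum(s * s for s in draws) / len(draws)  # E[S_n] = 0 by symmetry
print(min_abs, sample_var)
```

Every simulated $S_{n}$ avoids the interval, even though a standard normal variable would land in $[-1/2, 1/2]$ about 38% of the time.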

This means that the interval $\left[-\frac{1}{2},\frac{1}{2}\right]$ is off limits for this sum: $S_{n}$ can never land there. And since $A_{n} = 0$ and $s_{n} \approx 1$, centering and scaling barely moves the distribution, so the normalized sum avoids essentially the same interval, while a standard normal variable lands in it with positive probability. This means that there are real numbers $x$ such that

$$\mathbb{P}\left[\frac{S_{n}-A_{n}}{s_{n}} \leq x\right] \neq \Phi(x)$$

for every $n$, no matter how large, so the limit can never converge to $\Phi(x)$ there. That means this sequence of random variables fails to obey the Central Limit Theorem. No amount of approximation, sample size, etc. can justify invoking it in this case.

## Conclusion

This example may have seemed a bit artificial and constructed, but it serves to show that the Central Limit Theorem is not universal: there are sequences of independent random variables that cannot be manipulated into fitting its conclusion at all. This is different from the usual cautions about using the Central Limit Theorem when the sequence does obey it. In those cases, the distinction is subtler, because we have to really look at the underlying distribution to see how fast the scaled and centered sum converges to a standard normal distribution. We’ll tackle that issue in another post.