The Central Limit Theorem isn’t a Statistical Silver Bullet

Chances are, if you took anything away from that high school or college statistics class you were dragged into, you remember some vague notion about the Central Limit Theorem. It’s likely the most famous and most widely used theorem in statistics. Most introductory statistics textbooks state the theorem in broad terms: as the sample size increases, the distribution of the sum of the sample elements becomes approximately normal, regardless of the underlying distribution. Tools across a broad variety of fields, such as the classical z-test, rely on this theorem for their justification. Many conclusions in science, economics, public policy, and social studies have been drawn with tests that lean on the Central Limit Theorem. We’re going to dive into this theorem a bit more formally, and discuss some counterexamples to it. Not every sequence of random variables satisfies the conditions of the theorem, and those conditions are stricter than common practice would suggest.

First off, the counterexamples I’m going to discuss come from the book Counterexamples in Probability, written by Jordan M. Stoyanov. This book has many great counterexamples, but is quite heavy for a non-mathematician. That’s why I’m going to pull examples from it and discuss them in detail. Counterexamples are important in mathematics, because they remind us where the limitations of the theory are.

[Fair warning here. This post will get a little math-heavy. Stick with me, and we’ll walk through the valley of the shadow of math together.]

The Central Limit Theorem, now in formal attire

First, we’ll state the actual Central Limit Theorem. I’m going to make comments as we go to point some things out, so watch the footnotes. From Stoyanov:

 Let \{X_{n}, n \geq 1\} be a sequence of independent[1] random variables defined on the probability space (\Sigma, \mathcal{F}, \mathbb{P}).[2]

Let S_{n} = \sum_{i=1}^{n}X_{i},[3] let \sigma_{i}^{2} be the variance of the random variable X_{i}, let A_{n} = \text{E}[S_{n}] = \sum_{i=1}^{n}\text{E}[X_{i}],[4] and let s_{n}^{2} = \text{Var}[S_{n}] = \sum_{i=1}^{n}\sigma_{i}^{2}.[5]

Then the sequence \{X_{n}\} obeys the Central Limit Theorem if 

\lim\limits_{n \to \infty}\mathbb{P}\left[\frac{S_{n}-A_{n}}{s_{n}} \leq x\right] = \Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-u^{2}/2}du

for all x \in \mathbb{R}

I’d like to talk about this for a bit. What this is saying is that the probability that the centered (subtract the expected value of the sum) and scaled (divide by the standard deviation of the sum) sum of random variables is less than some number x converges to the probability that a standard normal random variable is less than that x. The limit here is important. Limits describe what happens as n goes to infinity, and this theorem says nothing about how fast we get there. That speed matters, because many people just use the sample-size-is-over-30 rule of thumb (which has no general mathematical justification, by the way) to assume that after 30 units in your sample, you’ll get close enough. Depending on the underlying distribution, you may need 100, 1000, or billions of terms.
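To make that concrete, here is a small simulation sketch (my own, not from the book): it compares how far the standardized sum of n = 30 samples sits from the normal CDF for a uniform distribution versus a heavily skewed lognormal. The distributions, the sample size, and the KS-style distance are all my choices for illustration.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_distance(draws):
    """Largest gap between the empirical CDF of draws and the standard normal CDF."""
    draws = np.sort(draws)
    ecdf = np.arange(1, draws.size + 1) / draws.size
    theory = np.array([norm_cdf(x) for x in draws])
    return float(np.max(np.abs(ecdf - theory)))

rng = np.random.default_rng(0)
n, reps = 30, 50_000

# Uniform(0, 1): mean 1/2, variance 1/12 -- light-tailed and symmetric.
u = rng.uniform(0.0, 1.0, size=(reps, n)).sum(axis=1)
z_uniform = (u - n * 0.5) / math.sqrt(n / 12.0)

# Lognormal(0, 2): heavily skewed, so convergence is far slower.
mu = math.exp(2.0)                            # mean of a lognormal(0, 2)
var = (math.exp(4.0) - 1.0) * math.exp(4.0)   # its variance
ln = rng.lognormal(0.0, 2.0, size=(reps, n)).sum(axis=1)
z_lognorm = (ln - n * mu) / math.sqrt(n * var)

d_uniform = ks_distance(z_uniform)
d_lognorm = ks_distance(z_lognorm)
print(d_uniform, d_lognorm)  # the skewed case sits much farther from normal at n = 30
```

The point is not the exact numbers but the gap: at the same n = 30, one distribution is already essentially normal while the other is nowhere close.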

One other note here. This theorem is general enough to handle random variables from the same probability space that have different variances. If we assume the variances are all the same \sigma^{2}, and that the means are all the same \mu, then you get this more “typical” formulation:

\lim\limits_{n \to \infty}\mathbb{P}\left[\frac{S_{n}-n\mu}{\sigma \sqrt{n}} \leq x \right] = \Phi(x)


The Central Limit Theorem looked pretty broad. We just needed a sequence of independent random variables on the same probability space. So as long as we have that, we’re good, right? Not exactly. Let’s work through a fairly simple example.

Let’s suppose we have a random variable Y that can take on two possible values: \pm 1, each with probability 1/2. That is, P(Y = 1) = P(Y = -1) = 1/2.

Suppose now that we build a sequence of copies of Y. That is, we define \{Y_{k}, k \geq 1\}, where each Y_{k} is distributed exactly like Y, with each flip made independently.[6] We’re now going to build a new random sequence \{X_{k}, k \geq 1\} from these \{Y_{k}, k \geq 1\}. Let each X_{k} = \frac{\sqrt{15}Y_{k}}{4^{k}}.[7] This means that each time we get an X_{k}, we flip the coin to see whether Y_{k} = \pm 1, so

X_{k} = \frac{\pm\sqrt{15}}{4^{k}}

with equal probability regarding the sign. 


E[Y] = 1 \cdot 1/2 + (-1)\cdot 1/2 = 0


\text{Var}[Y] = E[Y^{2}]-E[Y]^{2} = (1^{2}\cdot 1/2 + (-1)^{2}\cdot 1/2)-0^{2} = 1 [8]
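These two calculations are simple enough to check directly; a quick sketch of my own (not from the original post):

```python
# Y takes the values +1 and -1, each with probability 1/2.
values = [1.0, -1.0]
probs = [0.5, 0.5]

mean_Y = sum(v * p for v, p in zip(values, probs))                 # E[Y] = 0
var_Y = sum(v * v * p for v, p in zip(values, probs)) - mean_Y**2  # Var[Y] = 1
print(mean_Y, var_Y)  # 0.0 1.0
```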

Now, since X_{k} is just a constant multiple of Y_{k}, \text{E}[X_{k}] = 0, so \text{E}[S_{n}] = \sum_{k=1}^{n}\text{E}[X_{k}] = 0.

Now we have to calculate \text{Var}[S_{n}].[9] We start with the variance of a single term: \begin{aligned}\text{Var}[X_{k}] &= \text{E}[X_{k}^{2}]-\text{E}[X_{k}]^{2}\end{aligned}

We already know that \text{E}[X_{k}] = 0, so we just need to find \text{E}[X_{k}^{2}] \begin{aligned}\text{E}[X_{k}^{2}] &=\frac{15\cdot 1^{2}}{(4^{k})^{2}}\cdot P\left(X_{k}=\frac{\sqrt{15}\cdot 1}{4^{k}}\right)+\frac{15\cdot (-1)^{2}}{(4^{k})^{2}}\cdot P\left(X_{k}=\frac{\sqrt{15}\cdot (-1)}{4^{k}}\right)\\&=\frac{15\cdot 1^{2}}{(4^{k})^{2}}\cdot\frac{1}{2}+\frac{15\cdot (-1)^{2}}{(4^{k})^{2}}\cdot\frac{1}{2}\\&=\frac{15}{16^{k}}\end{aligned}

Now, because the X_{k} are independent, the variance of the sum is the sum of the variances, which quickly gives us \text{Var}[S_{n}]:

\begin{aligned}\text{Var}[S_{n}] &= \sum_{k=1}^{n}\text{Var}[X_{k}]\\&= \sum_{k=1}^{n}\frac{15}{16^{k}}\end{aligned}

This sum is the partial sum of a geometric series. So

\begin{aligned}\text{Var}[S_{n}] &= \sum_{k=1}^{n}\text{Var}[X_{k}]\\&= \sum_{k=1}^{n}\frac{15}{16^{k}}\\&= \frac{15}{16}\left(\dfrac{1-\left(\frac{1}{16}\right)^{n}}{1-\frac{1}{16}}\right)\\&= 1-\left(\frac{1}{16}\right)^{n}\end{aligned}
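The closed form is easy to sanity-check against the raw sum numerically (again, a quick sketch of my own):

```python
# Var[S_n] = sum_{k=1}^{n} 15 / 16^k should collapse to 1 - (1/16)^n.
for n in (1, 2, 5, 10, 30):
    raw = sum(15.0 / 16.0**k for k in range(1, n + 1))
    closed = 1.0 - (1.0 / 16.0)**n
    assert abs(raw - closed) < 1e-12, (n, raw, closed)
```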

Since we are discussing convergence, we really only care about very large n. As n gets very large, \frac{1}{16^{n}} is basically 0, so \text{Var}[S_{n}] \approx 1 when n is large. The \sqrt{15} inserted in the definition of X_{k} seemed to come from left field, but it was chosen precisely so the variance of the sum works out this way.

Now, we’re going to note that P(|S_{n}| \leq 1/2) = 0. Why? Because |X_{1} + X_{2} +\ldots + X_{n}| > 1/2 always.[10]
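That claim can also be verified exhaustively for small n, since there are only 2^n equally likely sign patterns. Here is a brute-force sketch of my own (the choice n = 12 is arbitrary):

```python
import itertools
import math

n = 12
coeffs = [math.sqrt(15) / 4**k for k in range(1, n + 1)]

# Enumerate every equally likely outcome of S_n = sum_k ±sqrt(15)/4^k.
sums = [sum(s * c for s, c in zip(signs, coeffs))
        for signs in itertools.product((1, -1), repeat=n)]

min_abs = min(abs(s) for s in sums)          # never inside [-1/2, 1/2]
var = sum(s * s for s in sums) / len(sums)   # E[S_n] = 0, so this is Var[S_n]

print(min_abs > 0.5)                         # True
print(abs(var - (1 - (1 / 16)**n)) < 1e-9)   # True: matches the closed form
```

The smallest possible |S_{n}| here is \sqrt{15}/4 minus the sum of all the later terms, which works out to roughly 0.65, comfortably above 1/2.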

This means that the interval \left[\frac{-1}{2},\frac{1}{2}\right] is off limits for this sum: S_{n} never lands there, so its distribution has a gap around 0 that a normal distribution does not. Since the variance of the sum is approximately 1, there’s no scaling factor that will “squish” the distribution enough to cover that gap. This means that there are real numbers x such that

\mathbb{P}\left[\frac{S_{n}-A_{n}}{s_{n}} \leq x\right] \neq \Phi(x)

for every n, no matter how large, meaning the limit can never equal \Phi(x) there. That means this sequence of random variables violates the Central Limit Theorem. No amount of approximation, sample size, etc. can justify its use in this case.


This example may have seemed a bit artificial and constructed, but it serves to show that the Central Limit Theorem is not universal. As in, there are sequences of random variables that cannot be manipulated into fitting the hypotheses at all. This is different from listing the cautions about using the Central Limit Theorem when the sequence does obey it. In those cases, the distinction is a bit more subtle, because we have to really look at the underlying distribution to see how fast the scaled and centered sum converges to a standard normal distribution. We’ll tackle that issue in another post.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


  1. This is really important here. If the sequence of random variables isn’t independent, do not pass go, do not collect $200. You do not fit the CLT. At best you can look for some approximation argument.
  2. We’re not going to get into measure theory in this post. Just keep in mind here that the capital Sigma is the sample space of possible outcomes, \mathcal{F} is the collection of events, and \mathbb{P} is the probability measure. No need to get into Borel sigma-algebras just yet.
  3. Just the sum of the first n terms.
  4. This is the expected value (typically called the mean) of the sum of n of these random variables. The expectation operator is linear, so the expectation of a sum is the sum of expectations. I’ll write a quick post on linear operators later.
  5. This is the variance of the sum of random variables, given by the sum of variances of the individual ones
  6. Like repeating 1s for an entire column in Excel. I can’t believe I just made that analogy.
  7. Yea, sometimes counterexamples look oddly specific or contrived. Powerful theorems can be hard to break, which is a good thing. Nonetheless, a theorem is only true forever, always, and for every single thing if you can prove that no counterexample, no matter how weird, can be constructed.
  8. Basic calculations of the mean and variance, especially for discrete random variables, can be found almost anywhere.
  9. Bear with me. We’re going to have to do a little work here.
  10. The first term dominates heavily over the other terms, and in absolute value is very close to 1. The absolute value of the partial sum will never actually hit 1/2 because of this. I realize this is a bit hand-wavy, but I’m already getting pretty deep in the weeds, and I don’t want to detract from the overall message here. This is provable.