Expecting the Unexpected: Borel’s Paradox

One of the best ways to shorten a proof in statistics or probability is to use a conditioning argument. I have used the Law of Total Probability extensively in my work, along with other conditioning arguments in my PhD dissertation. Like many things in mathematics, though, there are subtleties that, if ignored, can cause quite a bit of trouble. It’s a theme on which I almost feel like I sound preachy, because subtlety, nuance, and deliberation followed by cautious proceeding are about as old-fashioned as my MS in Statistics.1

One particularly good paper that discusses this was written by Michael Proschan and Brett Presnell in The American Statistician in August 1998, titled “Expect the Unexpected from Conditional Expectation”. In it, the authors noted the following seemingly innocuous question posed on a statistics exam:

If X and Y are independent standard normal random variables, what is the conditional distribution of Y given Y=X?

There are three approaches to this problem.

(1) Interpret the statement Y=X by declaring a new random variable Z_{1} = Y-X and conditioning on Z_{1}=0.

Here, the argument proceeds as follows:

Y and Z_{1} have a bivariate normal distribution with \mu = (0,0), \sigma_{Y}^{2}=1, \sigma_{Z_{1}}^{2}=2, and correlation \rho = \tfrac{1}{\sqrt{2}}. Thus, we know that the conditional distribution of Y given Z_{1}=0 is itself normal with mean \mu_{Y}+\rho\frac{\sigma_{Y}}{\sigma_{Z_{1}}}(0-\mu_{Z_{1}})=0 and variance \sigma_{Y}^{2}(1-\rho^{2}) = \tfrac{1}{2}. Thus, the conditional density is 

f(y|Z_{1}=0) = \frac{1}{\sqrt{\pi}}e^{-y^{2}}
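
As a quick sanity check, here is a minimal simulation sketch (not from the paper; the sample size, seed, and tolerance eps are arbitrary choices of mine). It approximates the zero-probability event Y=X by the thickened event |Y-X|<\varepsilon, which matches the Z_{1} interpretation, and compares the retained Y values to the predicted N(0,\tfrac{1}{2}).

    import numpy as np

    rng = np.random.default_rng(0)
    n, eps = 2_000_000, 0.01

    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    # Approximate {Y = X} by the positive-probability event {|Y - X| < eps}.
    kept = y[np.abs(y - x) < eps]

    print(f"retained {kept.size} samples")
    print(f"sample mean of Y     : {kept.mean():.3f}  (theory: 0)")
    print(f"sample variance of Y : {kept.var():.3f}  (theory: 0.5)")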

This was the expected answer. However, one creative student made a different argument:

(2) Interpret the statement Y=X by declaring a new random variable Z_{2} = \tfrac{Y}{X} and conditioning on Z_{2}=1.

This student used the Jacobian method2 and transformed the variables by setting s=y, z_{2}=\tfrac{y}{x} and finding the joint density of S and Z_{2}. The conditional density of S given Z_{2}=1 was then found by dividing the joint density by the marginal density of Z_{2} evaluated at z_{2}=1. That marginal density is known: the ratio of independent standard normal random variables has a Cauchy distribution3. This student’s answer was

f(y|Z_{2}=1) = |y|e^{-y^{2}}
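
For completeness, here is a sketch of the Jacobian computation (my reconstruction, not the student’s actual write-up). With s=y and z_{2}=y/x we have x=s/z_{2}, y=s, and the transformation has absolute Jacobian determinant |s|/z_{2}^{2}, so

    f_{S,Z_{2}}(s,z_{2}) = f_{X,Y}\left(\tfrac{s}{z_{2}}, s\right)\frac{|s|}{z_{2}^{2}} = \frac{1}{2\pi}\exp\left(-\frac{s^{2}/z_{2}^{2}+s^{2}}{2}\right)\frac{|s|}{z_{2}^{2}}

Dividing by the standard Cauchy marginal f_{Z_{2}}(z_{2}) = \frac{1}{\pi(1+z_{2}^{2})} evaluated at z_{2}=1, which equals \tfrac{1}{2\pi}, gives

    f(s|Z_{2}=1) = \frac{f_{S,Z_{2}}(s,1)}{f_{Z_{2}}(1)} = \frac{\tfrac{1}{2\pi}|s|e^{-s^{2}}}{\tfrac{1}{2\pi}} = |s|e^{-s^{2}}

and since S = Y, this is the student’s |y|e^{-y^{2}}.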

Not the expected answer. Yet this is a correct interpretation of the condition Y=X, and the calculations that follow from it are correct. There was a third, different answer, from a student who had taken a more advanced probability course.

(3) Interpret the statement Y=X by declaring a new random variable Z_{3} = \mathbb{I}(Y=X) and conditioning on Z_{3}=1.

Here \mathbb{I}(\cdot) is the indicator function, which is 1 if the condition is met and 0 otherwise. The argument here is that Y and Z_{3} are independent. Why? Since \mathbb{P}(Y=X)=0, Z_{3} is a constant zero with probability 1, and an almost surely constant random variable is independent of any random variable. Thus the conditional distribution of Y given Z_{3}=1 must be the same as the unconditional distribution of Y, which is standard normal.
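
To spell out the independence claim, since it does the heavy lifting: for any Borel sets A and B,

    \mathbb{P}(Y \in A, Z_{3} \in B) = \mathbb{P}(Y \in A)\,\mathbb{P}(Z_{3} \in B)

because both sides equal \mathbb{P}(Y \in A) when 0 \in B and equal 0 when 0 \notin B.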

This is also a correct interpretation of the condition. 

From the paper, “At this point the professor decided to drop the question from the exam and seek therapy.”

What happened?

At this point we shall do as the paper did and revisit exactly what conditional probability means. Suppose we have continuous random variables X and Y. We usually write expressions like \mathbb{P}(Y\leq y|X=x) or \mathbb{E}(Y|X=x). However, an astute reader might already ask about conditioning on sets of probability 0: for a continuous random variable, the probability of landing on any specific real value x is exactly 0, which hearkens back to the measure-theoretic foundations of probability. As it turns out, this little subtlety is indeed the culprit.

Formal definition of conditional distribution and conditional expectation

First, we take a look at the formal definition of conditional expectation.

Definition.  A conditional expected value \mathbb{E}(Y|X) of Y given X is any Borel function g(X) = Z that satisfies 

\mathbb{E}(Z\mathbb{I}(X \in B)) = \mathbb{E}(Y\mathbb{I}(X \in B))

for every Borel set B. Then g(x)4 is the conditional expected value of Y given X=x, and we write \mathbb{E}(Y|X=x) = g(x).

What this means is that the conditional expectation is actually defined as a random variable whose integral over any event of the form \{X \in B\} agrees with that of Y5. Of import here is the fact that the conditional expectation is defined only in terms of such integrals. From Billingsley (1979), there always exists at least one such Borel function g, but the problem is that there may be infinitely many. Each Borel function g that satisfies the above definition is called a version of \mathbb{E}(Y|X).
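
Here is a minimal numerical illustration of the defining property (everything in it, including the choice Y = X + noise, the version g(x)=x, and the Borel set B=[0,1], is my own toy example, not anything from the references):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    x = rng.standard_normal(n)
    y = x + rng.standard_normal(n)   # independent noise, so E(Y | X) = X, i.e. g(x) = x
    z = x                            # one version of E(Y | X)

    in_B = (x >= 0.0) & (x <= 1.0)   # indicator of the Borel set B = [0, 1]

    # Both sides of E(Z * I(X in B)) = E(Y * I(X in B)) should agree up to Monte Carlo error.
    print((z * in_B).mean(), (y * in_B).mean())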

So let’s say we have two versions of \mathbb{E}(Y|X), say g_{1}(X) and g_{2}(X) (not to be confused with the Z’s from the exam question). Then \mathbb{P}(g_{1}(X)=g_{2}(X))=1. This still seems pedantic, but from a measure-theoretic perspective it means that two versions are equal except on a set N of x-values with \mathbb{P}(X \in N)=0.

What does this have to do with conditional distributions? 

For each fixed y, we can find some function6 G_{y}(x) = G(y|x) such that

G(y|x) = \mathbb{P}(Y\leq y|X=x)

In other words7, G(y|X) is a version of \mathbb{E}(\mathbb{I}(Y\leq y)|X). This expression is a conditional distribution of Y given X=x.

Notice I said “a” conditional distribution, and not “the” conditional distribution. The words were chosen carefully. This leads us into the Borel Paradox, and the answer to why all three answers of that exam question are technically correct. 

The Borel Paradox

Also known as the Equivalent Event Fallacy, this paradox was noted by Rao (1988) and DeGroot (1986). If we attempt to condition on a set of probability (or measure) 0, then the conditioning may not be well-defined, which can lead to different conditional expectations and thus different conditional distributions. 

In the exam question, the events \{Z_{1}=0\}, \{Z_{2}=1\}, and \{Z_{3}=1\} are all equivalent (up to a set of probability 0). The fallacy lies in assuming that this equivalence means that conditioning on, say, \{Y-X=0\} is the same as conditioning on \{Y/X=1\}. In almost all8 cases this is true: if the events in question have nonzero probability, the professor’s assumption holds. However, when we condition on events that have zero probability, the classic Bayes’s formula interpretation doesn’t hold anymore, because the denominator is now 0.9
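
Concretely, the elementary definition

    \mathbb{P}(A|B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}

becomes the indeterminate form 0/0 when \mathbb{P}(B)=0, which is exactly the situation with B = \{Y=X\}.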

If we have an event like \{Y=X\}, then from the previous section we know that a conditional expectation given such an event can have infinitely many versions. We have to think of conditioning on a random variable, not just on a value, and there were three such random variables above (a small simulation follows the list):

  • Z_{1} = Y-X, which has value z_{1}=0
  • Z_{2} = Y/X, with value z_{2}=1
  • Z_{3} = \mathbb{I}(Y=X) with value z_{3}=1
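
To see the divergence numerically, here is a small simulation sketch (again my own; sample size, seed, and tolerance are arbitrary). Thickening \{Y-X=0\} to \{|Y-X|<\varepsilon\} and \{Y/X=1\} to \{|Y/X-1|<\varepsilon\} retains different subpopulations, and the retained Y values have visibly different spread, matching N(0,\tfrac{1}{2}) and |y|e^{-y^{2}} respectively.

    import numpy as np

    rng = np.random.default_rng(2)
    n, eps = 4_000_000, 0.01

    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    # Two thickenings of the same zero-probability event {Y = X}.
    via_diff  = y[np.abs(y - x) < eps]                 # {|Y - X| < eps},   i.e. Z_1 near 0
    via_ratio = y[np.abs(y - x) < eps * np.abs(x)]     # {|Y/X - 1| < eps}, i.e. Z_2 near 1

    print(f"var(Y) given |Y-X| < eps   : {via_diff.var():.3f}   (limiting theory: 0.5)")
    print(f"var(Y) given |Y/X-1| < eps : {via_ratio.var():.3f}   (limiting theory: 1.0)")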

Proschan and Presnell dig a bit deeper and discuss exactly where these equivalent interpretations diverge10 and yield different conditional distributions. They also discuss the substitution fallacy, which again stems from the existence of infinitely many versions of \mathbb{E}(Y|X), and which shows why the convolution argument typically given to derive the distribution of a sum of independent random variables is nowhere near as air-tight as it appears.11

What’s the solution?

The Borel Paradox reared its ugly head because different yet correct interpretations of a conditioning event of probability 0 yielded different conditional expectations and thus different conditional distributions. The core of the problem was that those conditioning sets have probability 0. How do we avoid this? We would still like to calculate \mathbb{E}(Y|X=x) (and the conditional distributions that result from it), so how do we get around this paradox of infinitely many versions?

We take the analysts’ favorite trick: sequences and limits. Take a sequence of sets (B_{n}) with B_{n} \downarrow \{x\} and \mathbb{P}(X \in B_{n})>0 for every n (for example, B_{n} = (x-\tfrac{1}{n}, x+\tfrac{1}{n})). Now we are only ever computing \mathbb{E}(Y|X \in B_{n}), where the conditioning event has positive probability. We then define

\mathbb{E}(Y|X=x) = \lim_{n\to\infty}\mathbb{E}(Y|X \in B_{n})
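
A toy illustration of this limiting recipe (my own example, not from the paper: Y = X^{2} plus a little noise, with shrinking intervals B_{n} around x=1, so the limit should be \mathbb{E}(Y|X=1) = 1):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5_000_000

    x = rng.uniform(-2.0, 2.0, n)
    y = x**2 + 0.1 * rng.standard_normal(n)   # so E(Y | X = x) = x^2

    # Shrinking sets B_k = (1 - 1/k, 1 + 1/k) around x = 1; each has positive probability.
    for k in (2, 8, 32, 128):
        in_Bk = np.abs(x - 1.0) < 1.0 / k
        print(f"k = {k:4d}:  E(Y | X in B_k) is approximately {y[in_Bk].mean():.4f}   (limit: 1.0)")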

 

(I admit I’m seriously sidestepping some careful considerations of convergence and choices of sequence here. There is more subtlety and consideration in the analytic study of sequences of sets, but I don’t want to push too far down the rabbit hole here. The point is that we can avoid the paradox with care.)

Conclusion

Subtleties and paradoxes occur all over mathematics. This doesn’t mean mathematics is broken, that all statistics are lies, or any other variation of frustrated exasperation I’ll hear when discussing these. What these subtleties, fallacies, and paradoxes do show is that careful consideration is paramount to the study, practice, and application of mathematics. 

References

  1. Billingsley, P. (1979), Probability and Measure (1st ed.), New York: Wiley.
  2. Casella, G. and Berger, R. (2002), Statistical Inference (2nd ed.), Wadsworth.
  3. DeGroot, M.H. (1986), Probability and Statistics (2nd ed.), Reading, MA: Addison-Wesley.
  4. Proschan, M.A. and Presnell, B. (1998), “Expect the Unexpected from Conditional Expectation,” The American Statistician, 52, 248-252.
  5. Rao, M.M. (1988), “Paradoxes in Conditional Probability,” Journal of Multivariate Analysis, 27, 434-446.

Footnotes

  1. Yep, I’ve had a job interview in the last year where the interviewer actually told me that.
  2. As a brief aside, this method isn’t just for probability. This shows up in multivariate calculus as well when we wish to change coordinates.
  3. This distribution is particularly interesting, as another side note: it has no mean and no variance (both are undefined), but it does have a median and a mode.
  4. Note the small x here.
  5. From Casella and Berger, 2002
  6. Again, please note the small letters.
  7. Now note the capital letter X, signifying a random variable.
  8. See what I did there?
  9. Let that be a lesson too. Bayesian everything isn’t a silver bullet.
  10. Pardon the language here, used colloquially
  11. This also has consequences in the study of random processes. Perhaps I’ll write something on that specific error.