

## Should I accept this shipment?

Suppose you work for an engineering or manufacturing firm, and you receive shipments of different parts from various suppliers. It’s not good business practice to just blindly accept a shipment, because some parts in your batch may be defective. Too many defective parts, and you’ll slow down manufacturing (in addition to wasting valuable time and money). You come up with a brilliant plan (which is used in industry already): you’ll take a sample from the shipment, test the sample, and if the number of defective parts in the sample is below some predetermined number, you’ll accept the shipment. Otherwise, you’ll send it back and probably place an angry call to your supplier.

## Considerations

### (1) Why not just test everything?

For one, shipments may contain thousands of parts, and testing them all would take forever. Secondly, some tests are destructive, meaning you push an item to its breaking point, which renders it unusable. (In these cases, you’re testing to ensure the breaking point is high enough.) Thirdly, certain kinds of tests may be expensive or time-consuming. Time, money, and physical scarcity are real costs we must consider.

### (2) How big should our sample be?

There are statistical considerations and business/cost considerations. There’s no perfect answer here. Sampling theory is a large branch of statistics, and there are tools for analysis of optimal sample size. However, we must also consider the specific case. If we have a destructive test, the “statistically optimal” size may destroy too many. If the test is time-consuming, we may lack the time to perform the test on a large number of items.

### (3) What should be the “cutoff” for shipment rejection?

This discussion is the main focus of this article. We’re going to take a pretty small and simplistic overview of it, just to illustrate how powerful even basic probability can be for engineering and business problems. To do this, we’ll briefly describe the hypergeometric distribution, and then illustrate its use in an operating characteristic curve.

## The Hypergeometric Distribution

Suppose we have $N$ objects belonging to one of two classes, with $N_{1}$ in the first class and $N_{2}$ in the second class, so $N = N_{1} + N_{2}$. (In this example, $N_{1}$ is the total number of defective parts in the batch, and $N_{2}$ is the number of good parts.) We select $n$ objects without replacement. Find the probability that exactly $x$ belong to the first class and $n-x$ to the second.

We can select $x$ objects from the $N_{1}$ in class one in ${N_{1} \choose x} = \frac{N_{1}!}{x!(N_{1}-x)!}$ ways. The remaining $n-x$ objects must come from the $N_{2}$ in class two, and we can select those in ${N_{2} \choose n-x}$ ways. Then $P(X=x) = \frac{{N_{1} \choose x}{N_{2} \choose n-x}}{{N \choose n}}.$
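This probability mass function can be sketched directly with Python’s `math.comb` (the function name `hypergeom_pmf` is mine):

```python
from math import comb

def hypergeom_pmf(x, N1, N2, n):
    """P(X = x): choose x of the N1 class-one objects and n - x of the
    N2 class-two objects, out of all ways to draw n from N = N1 + N2."""
    return comb(N1, x) * comb(N2, n - x) / comb(N1 + N2, n)

# Sanity check: the probabilities over all feasible x sum to 1.
# (math.comb conveniently returns 0 when x exceeds the class size.)
total = sum(hypergeom_pmf(x, 4, 21, 5) for x in range(6))
```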

## Evaluating a shipment acceptance plan

We’ll create a small example with manageable numbers to illustrate the use of the hypergeometric distribution in acceptance sampling.

A supplier ships parts to another company in lots of 25. Some items in the shipment may be defective. The receiving company has an acceptance sampling plan to inspect $n=5$ parts from the lot without replacement. If none of the sample is defective, they accept the shipment, otherwise they reject it. Suppose we want to evaluate this plan.

If $X$ is the random number of defectives in the sample, then $X$ has a hypergeometric distribution. Except this time, $N_{1}$ and $N_{2}$ (the numbers of defective and nondefective parts, respectively) are unknown. We only know that $N_{1} + N_{2} = 25$.

In designing an acceptance plan, we want the probability of accepting the lot to be large if $N_{1}$ is small. That is, we want to have a high probability of accepting the lot if the true number of defective parts is very small. We also want the probability of “falsely accepting” the lot to be low. (That is, we want the probability of acceptance to be low when $N_{1}$ is high).

When we treat these probabilities as a function of $N_{1}$ (or equivalently, of the fraction defective $p = \frac{N_{1}}{25}$ in the lot), we call the result the operating characteristic curve. Mathematically, the operating characteristic curve, denoted $OC(p)$, is given in this case by:

$OC(p) = P(X=0) = \frac{{N_{1} \choose 0}{25-N_{1} \choose 5}}{{25 \choose 5}}$

Note here that $OC(p) = P(X=0)$ because that’s the plan in this fictional example. If the plan changed, the operating characteristic curve would be defined by something different. For example, if the plan was to accept shipments that contain 1 or fewer defects, then $OC(p) = P(X\leq 1)$ and we would recalculate those probabilities using the hypergeometric probability mass function given above.
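As a hedged sketch of that recalculation (the function name `oc` and its keyword defaults are mine), here is the acceptance probability for a general acceptance number $c$, so the $c=0$ plan in the text and the “one or fewer defects” variant are both covered:

```python
from math import comb

def oc(N1, N=25, n=5, c=0):
    """Probability of accepting a lot containing N1 defectives when we
    sample n of N parts and accept iff at most c defectives are found."""
    return sum(comb(N1, x) * comb(N - N1, n - x)
               for x in range(c + 1)) / comb(N, n)

accept_strict  = oc(1)        # c = 0 plan with one defective in the lot: 0.8
accept_lenient = oc(1, c=1)   # allow one defect: 1.0, since only one exists
```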

Let’s look at some numerical values of $OC(p)$ for our fictional example. Remember that $p = \frac{N_{1}}{25}$.

| $p$ | $OC(p)$ |
| --- | --- |
| 0 | 1 |
| 0.04 | 0.8 |
| 0.08 | 0.633 |
| 0.12 | 0.496 |
| 0.16 | 0.383 |
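These values can be reproduced with a few lines (the helper name `oc_zero_defects` is mine, written out just for this $c=0$ plan):

```python
from math import comb

def oc_zero_defects(N1, N=25, n=5):
    # Accept only when the sample of n contains zero defectives.
    return comb(N - N1, n) / comb(N, n)

# Map fraction defective p = N1/25 to OC(p), rounded as in the table.
table = {N1 / 25: round(oc_zero_defects(N1), 3) for N1 in range(5)}
```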

Is the acceptance plan satisfactory? With $N_{1} = 1$, $OC(0.04) = 0.8$ which may be considered too low (we may reject perfectly valid shipments), and with $N_{1} = 4$, $OC(0.16) = 0.383$ may be too high (we may not want that high a probability of accepting a shipment with that many defects). Thus, we may want to reconsider the number of items we test, or reconsider our acceptance plan.

Usually lot sizes are far larger than the numbers seen here, and sample sizes are in the hundreds, so as the values get large, this computation becomes cumbersome. We can approximate the hypergeometric distribution with the Poisson distribution, which we won’t cover here.
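For a taste of why the approximation helps, here is a rough numerical comparison on an assumed larger lot (the numbers are illustrative, not from the text): the exact hypergeometric acceptance probability and the Poisson approximation with rate $\lambda = n N_{1}/N$ land close together.

```python
from math import comb, exp

N, n, N1 = 1000, 100, 20                   # assumed lot: 2% defective
p_exact  = comb(N - N1, n) / comb(N, n)    # exact hypergeometric P(X = 0)
lam      = n * N1 / N                      # expected defectives in the sample
p_approx = exp(-lam)                       # Poisson P(X = 0)
gap      = abs(p_exact - p_approx)         # small for large lots
```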

## Conclusion

This is a small illustration of the very practical use of the hypergeometric distribution to devise an intelligent strategy for accepting/rejecting shipments of goods. This type of work falls within the purview of the services we offer.

## Expecting the Unexpected: Borel’s Paradox

One of the best ways to shorten a proof in statistics or probability is to use conditioning arguments. I myself have used the Law of Total Probability extensively in my work, as well as other conditioning arguments in my PhD dissertation. Like many things in mathematics, there are subtleties that, if ignored, can cause quite a bit of trouble. It’s a theme on which I almost feel preachy, because subtlety, nuance, and deliberation followed by cautious proceeding are about as old-fashioned as my MS in Statistics.

One particularly good paper that discusses this was written by Michael Proschan and Brett Presnell in The American Statistician in August 1998, titled “Expect the Unexpected from Conditional Expectation.” In it, the authors noted the following seemingly innocuous question posed on a statistics exam:

If $X$ and $Y$ are independent standard normal random variables, what is the conditional distribution of $Y$ given $Y=X$?

There are three approaches to this problem.

### (1) Interpret the statement that $Y=X$ by declaring a new random variable $Z_{1} = Y-X$ where $Z_{1}=0$.

Here, the argument proceeds as follows:

$Y$ and $Z_{1}$ have a bivariate normal distribution with $\mu = (0,0)$, $\sigma_{Y}^{2}=1$, $\sigma_{Z_{1}}^{2}=2$, and correlation $\rho = \tfrac{1}{\sqrt{2}}$. Thus, we know that the conditional distribution of $Y$ given $Z_{1}=0$ is itself normal with mean $\mu_{Y}+\rho\frac{\sigma_{Y}}{\sigma_{Z_{1}}}(0-\mu_{Z_{1}})=0$ and variance $\sigma_{Y}^{2}(1-\rho^{2}) = \tfrac{1}{2}$. Thus, the conditional density is

$$f(y|Z_{1}=0) = \frac{1}{\sqrt{\pi}}e^{-y^{2}}$$

This was the expected answer. However, one creative student did a different argument:

### (2) Interpret the statement that $Y=X$ by declaring a new random variable $Z_{2} = \tfrac{Y}{X}$ where $Z_{2}=1$.

This student used the Jacobian method and transformed the variables by declaring $s=y$, $z_{2}=\tfrac{y}{x}$ and finding the joint density of $S$ and $Z_{2}$. The conditional density for $S$ was then found by dividing the joint density by the marginal density of $Z_{2}$ evaluated at $z_{2}=1$. That marginal density is known: $Z_{2}$, being the ratio of independent standard normal random variables, has a Cauchy distribution. This student’s answer was

$$f(y|Z_{2}=1) = |y|e^{-y^{2}}$$

Not the expected answer. This is a correct interpretation of the condition $Y=X$, so the calculations are correct. There was a third different answer, from a student who had taken a more advanced probability course.

### (3) Interpret the statement $Y=X$ as $Z_{3} = 1$ where $Z_{3} = \mathbb{I}[Y=X]$

Here $\mathbb{I}(\cdot)$ is the indicator function, which is 1 if the condition is met and 0 otherwise. The argument here is that $Y$ and $Z_{3}$ are independent. Why? Since $\mathbb{P}(Y=X)=0$, $Z_{3}$ is constant (zero) with probability 1, and a constant is independent of any random variable. Thus the conditional distribution of $Y$ given $Z_{3}=1$ must be the same as the unconditional distribution of $Y$, which is standard normal.

This is also a correct interpretation of the condition.
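All three answers really are legitimate distributions; they just disagree with one another. A quick numerical check (midpoint-rule integration; the `integrate` helper is mine) confirms that the densities from answers (1) and (2) both integrate to 1 yet have different second moments (1/2 versus 1):

```python
from math import exp, pi, sqrt

def integrate(f, a=-10.0, b=10.0, steps=100_000):
    # Midpoint rule; the integrands decay fast, so [-10, 10] is plenty.
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

f1 = lambda y: exp(-y * y) / sqrt(pi)   # answer (1): the N(0, 1/2) density
f2 = lambda y: abs(y) * exp(-y * y)     # answer (2)

mass1, mass2 = integrate(f1), integrate(f2)   # both ~ 1
m2_1 = integrate(lambda y: y * y * f1(y))     # ~ 1/2
m2_2 = integrate(lambda y: y * y * f2(y))     # ~ 1
```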

From the paper, “At this point the professor decided to drop the question from the exam and seek therapy.”

## What happened?

At this point, as the paper did, we shall revisit exactly what conditional probability means. Suppose we have continuous random variables $X$ and $Y$. We’ll usually write expressions like $\mathbb{P}(Y\leq y|X=x)$ or $\mathbb{E}(Y|X=x)$. However, an astute reader might already raise the question of conditioning on sets of probability 0. For a continuous random variable, the probability that we land on any specific real value $x$ is indeed 0, which hearkens back to the measure-theoretic basis of probability. As it turns out, this little subtlety is indeed the culprit.

### Formal definition of conditional distribution and conditional expectation

First, we take a look at the formal definition of conditional expectation:

Definition.  A conditional expected value $\mathbb{E}(Y|X)$ of $Y$ given $X$ is any Borel function $g(X) = Z$ that satisfies

$$\mathbb{E}(Z\mathbb{I}(X \in B)) = \mathbb{E}(Y\mathbb{I}(X \in B))$$

for every Borel set $B$. Then $g(x)$ is the conditional expected value of $Y$ given $X=x$, and we write $\mathbb{E}(Y|X=x) = g(x)$.

What this means is that the conditional expectation is actually defined as a random variable whose integral over any event of the form $\{X \in B\}$ agrees with that of $Y$. Of import here is the fact that the conditional expectation is defined only in terms of an integral. From Billingsley (1979), there always exists at least one such Borel function $g$, but the problem is that there may be infinitely many. Each Borel function $g$ that satisfies the above definition is called a version of $\mathbb{E}(Y|X)$.

So let’s say we have two versions of $\mathbb{E}(Y|X)$, called $Z_{1} = g_{1}(X)$ and $Z_{2} = g_{2}(X)$. Then $\mathbb{P}(Z_{1}=Z_{2})=1$. This may still seem pedantic, but from a measure-theoretic perspective it means that two versions are equal except on a set $N$ of values $x$ such that $\mathbb{P}(X \in N)=0$.

What does this have to do with conditional distributions?

For each $y$ we fix, we can find some function $G_{y}(x) = G(y|x)$ such that

$G(y|x) = \mathbb{P}(Y\leq y|X=x)$

In other words, $G(y|X)$ is a version of $\mathbb{E}(\mathbb{I}(Y\leq y)|X)$. This expression is a conditional distribution of $Y$ given $X=x$.

Notice I said “a” conditional distribution, and not “the” conditional distribution. The words were chosen carefully. This leads us into the Borel Paradox, and the answer to why all three answers of that exam question are technically correct.

Also known as the Equivalent Event Fallacy, this paradox was noted by Rao (1988) and DeGroot (1986). If we attempt to condition on a set of probability (or measure) 0, then the conditioning may not be well-defined, which can lead to different conditional expectations and thus different conditional distributions.

In the exam question, $Z_{1}, Z_{2}$ and $Z_{3}$ all describe equivalent events. The fallacy lies in assuming that this equivalence means conditioning on, say, $\{Y-X=0\}$ is the same as conditioning on the event $\{Y/X=1\}$. In almost all cases, this is true: if the events in question have nonzero probability, then the professor’s assumption holds. However, when we’re conditioning on events that have zero probability, the classic Bayes’s formula interpretation no longer applies, because the denominator is now 0.

If we have an event like $\{Y=X\}$, then from the previous section we know that conditioning on it can yield infinitely many versions. We have to think of conditioning on a random variable, not just a value. Three such random variables were given above:

• $Z_{1} = Y-X$, which has value $z_{1}=0$
• $Z_{2} = Y/X$, with value $z_{2}=1$
• $Z_{3} = \mathbb{I}(Y=X)$ with value $z_{3} =1$

Proschan and Presnell dig a bit deeper and discuss the details of exactly where these equivalent interpretations diverge and yield different conditional distributions. They also discuss the substitution fallacy, which again stems from the existence of infinitely many versions of $\mathbb{E}(Y|X)$, and why the convolution argument typically given to derive the distribution of a sum of independent random variables is nowhere near as air-tight as it appears.

## What’s the solution?

The Borel Paradox reared its ugly head because different yet correct interpretations of the conditioning events/sets of probability 0 yielded different conditional expectations and different conditional distributions. The core of the problem was those sets having probability 0. How do we avoid this? We actually would like to calculate $\mathbb{E}(Y|X=x)$ (and conditional distributions that result from it), so how do we get around this paradox of infinite versions?

We take the analysts’ favorite trick: sequences and limits. Take a sequence of sets $(B_{n})$ with $B_{n} \downarrow \{x\}$. Now we’re only ever actually computing $\mathbb{E}(Y|X \in B_{n})$ where $\mathbb{P}(X \in B_{n})>0$. Now define

$\mathbb{E}(Y|X=x) = \lim_{n\to\infty}\mathbb{E}(Y|X \in B_{n})$

(I admit I’m seriously sidestepping some careful considerations of convergence and choices of sequence here. There is more subtlety and consideration in the analytic study of sequences of sets, but I don’t want to push too far down the rabbit hole here. The point is that we can avoid the paradox with care.)

## Conclusion

Subtleties and paradoxes occur all over mathematics. This doesn’t mean mathematics is broken, that all statistics are lies, or any other variation of frustrated exasperation I’ll hear when discussing these. What these subtleties, fallacies, and paradoxes do show is that careful consideration is paramount to the study, practice, and application of mathematics.

## References

1. Billingsley, P. (1979), Probability and Measure (1st ed.), New York: Wiley.
2. Casella, G. and Berger, R. (2002), Statistical Inference (2nd ed.), Wadsworth.
3. Proschan, M. A. and Presnell, B. (1998), “Expect the Unexpected from Conditional Expectation,” The American Statistician, 52, 248-252.
4. Rao, M. M. (1988), “Paradoxes in Conditional Probability,” Journal of Multivariate Analysis, 27, 434-446.

## Time Series Analysis Part 1: Regression with a Twist

We’re surrounded by time series. It’s one of the more common plots we see in day-to-day life. Finance and economics are full of them – stock prices, GDP over time, and 401K value over time, to name a few. The plot looks deceptively simple; just a nice univariate squiggle. No crazy vectors, no surfaces, just one predictor – time. It turns out time is a tricky and fickle explanatory variable, which makes analysis of time series a bit more nuanced than it appears at first glance. This nuance is obscured by the ease of automatic implementation of time series modeling in languages like R. As nice as this is for practitioners, the mathematics behind the analysis is lost. Ignoring the mathematics can lead to improper use of these tools. This series will examine some of the mathematics behind stationarity and what is known as ARIMA (Auto-Regressive Integrated Moving Average) modeling. Part 1 will examine the very basics, showing that time series modeling is really just regression with a twist.


## Poisson Processes and Data Loss

There are many applications for counting arrivals over time. Perhaps I want to count the arrivals into a store, or shipments into a postal distribution center, or node failures in a cloud cluster, or hard drive failures in a traditional storage array. It’s rare that these events come neatly, one after the other, with a constant amount of time between each event or arrival. Typically those interarrival times, the time between two events in a sequence arriving, are random. How then do we study these processes?


## The Central Limit Theorem isn’t a Statistical Silver Bullet

Chances are, if you took anything away from that high school or college statistics class you were dragged into, you remember some vague notion about the Central Limit Theorem. It’s likely the most famous theorem in statistics, and the most widely used. Most introductory statistics textbooks state the theorem in broad terms: as the sample size increases, the sampling distribution of the sum of the sample elements becomes approximately normal, regardless of the underlying distribution. Many tools used for statistical inference in a broad variety of fields, such as the classical z-test, rely on this theorem. Many conclusions in science, economics, public policy, and social studies have been drawn with tests that rely on the Central Limit Theorem as justification. We’re going to dive into this theorem a bit more formally, and discuss some counterexamples. Not every sequence of random variables obeys the conditions of the theorem, and the assumptions are stricter than practice often acknowledges.


## Uncorrelated and Independent: Related but not Equivalent

Mathematics is known for its resolute commitment to precision in definitions and statements. However, when words are pulled from the English language and given rigid mathematical definitions, the connotations and colloquial use outside of mathematics still remain. This can lead to immutable mathematical terms being used interchangeably, even though the mathematical definitions are not equivalent. This occurs frequently in probability and statistics, particularly with the notions of uncorrelated and independent. We will focus this post on the exact meaning of both of these words, and how they are related but not equivalent.

## Independence

First, we will give the formal definition of independence:

Definition (Independence of Random Variables).

Two random variables $X$ and $Y$ are independent if the joint probability distribution $P(X, Y)$ can be written as the product of the two individual distributions. That is,

$$P(X, Y) = P(X)P(Y)$$

Essentially this means that the joint probability of the random variables $X$ and $Y$ together is actually separable into the product of their individual probabilities. Here are some other equivalent definitions:

• $P(X \cap Y) = P(X)P(Y)$
• $P(X|Y) = P(X)$ and $P(Y|X) = P(Y)$

This first alternative definition states that the probability of any outcome of $X$ and any outcome of $Y$ occurring simultaneously is the product of those individual probabilities.

For example, suppose the probability that you will put ham and cheese on your sandwich is $P(H \cap C) = 1/3$. The probability that ham is on your sandwich (with or without any other toppings) is $P(H) = 1/2$ and the probability that cheese is on your sandwich (again, with or without ham or other goodies) is $P(C) = 1/2$. If ham and cheese were independent sandwich fixings, then $P(H\cap C) = P(H)P(C)$, but

$$P(H\cap C)= 1/3 \neq 1/4 = P(H)P(C)$$

Thus, ham and cheese are not independent sandwich fixings. This leads us into the next equivalent definition:

Two random variables are independent if  $P(X|Y) = P(X)$ and $P(Y|X) = P(Y)$.

The vertical bar denotes conditional probability. $P(X|Y)$ reads “probability of $X$ given $Y$,” and is the probability $X$ will have any outcome $x$ given that the random variable $Y$ has occurred.

The second definition means that two random variables are independent if the outcome of one has no effect on the other. That is, putting cheese on my sandwich doesn’t affect the likelihood that I will then add ham, and  if I started with ham, then it doesn’t affect the likelihood I will add cheese second. The example above already showed that ham and cheese were not independent, but we’ll repeat it again with this other equivalent definition.

By Bayes’ formula,

$$P(H | C) = \frac{P(H\cap C)}{P(C)} = \frac{1/3}{1/2} = \frac{2}{3} \neq \frac{1}{2} = P(H)$$

The probability that ham will be on my sandwich given that cheese is on my sandwich is 2/3. This means the presence of cheese increases the likelihood that ham will be there too, which also tells me ham and cheese are not independent.

$$P(C | H) = \frac{P(H\cap C)}{P(H)} = \frac{2}{3} \neq \frac{1}{2} = P(C)$$

In addition, I’m more likely to add cheese to the sandwich if ham is already there. In both of these, the presence of one affects the probability of the other, so they are not independent.
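The ham-and-cheese arithmetic can be checked end to end with exact fractions (the variable names are mine):

```python
from fractions import Fraction

P_H  = Fraction(1, 2)   # ham on the sandwich
P_C  = Fraction(1, 2)   # cheese on the sandwich
P_HC = Fraction(1, 3)   # ham AND cheese

p_h_given_c = P_HC / P_C            # 2/3, not 1/2
p_c_given_h = P_HC / P_H            # 2/3, not 1/2
independent = P_HC == P_H * P_C     # False: 1/3 != 1/4
```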

Coin flips are independent. The probability of me flipping a quarter and getting a head doesn’t affect the probability of you then getting a tail (or head) when you pick up that quarter and flip it after me. Independence is a common assumption in statistics because many standard distributions and tests rely on it, though real data is rarely truly independent. For the development of the theory of dependent random variables, see here, here, and here.

Next we’ll define what it means to be uncorrelated, and discuss some of the subtleties and equivalent interpretations.

## Uncorrelated

When people use the word “uncorrelated”, they are typically referring to the Pearson correlation coefficient (or product-moment coefficient) having a value of 0. The Pearson correlation coefficient of random variables $X$ and $Y$ is given by

$$\rho = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}$$

Where $\text{Var}(X)$ is the variance of the random variable, and $\text{Cov}(X,Y)$ is the covariance between $X$ and $Y$. A correlation of 0, or $X$ and $Y$ being uncorrelated implies $\text{Cov}(X,Y) = 0$, and thus it suffices to just look at the numerator.

The covariance between two random variables $X$ and $Y$ measures the joint variability, and has the formula

$$\text{Cov}(X,Y) = E[XY]-E[X]E[Y]$$

$E[\cdot]$ is the expectation operator and gives the expected value (or mean) of the object inside. $E[X]$ is the mean value of the random variable $X$. $E[XY]$ is the mean value of the product of the two random variables $X$ and $Y$.

For an example, suppose $X$ and $Y$ can take on the joint values (expressed as ordered pairs) (0,0), (1,0), and (1,1) with equal probability. Then for any of the three possible points $(x,y)$, $P( (X,Y) = (x,y)) = 1/3$. We will find the covariance between these two random variables.

The first step is to calculate the mean of each individual random variable. $X$ only takes on two values, 0 and 1, with probability 1/3 and 2/3 respectively. (Remember that two of the points have $X = 1$, with each of those probabilities as 1/3.) Then

$$E[X] = 0\cdot 1/3 + 1\cdot 2/3 = 2/3$$

Similarly, $E[Y] = 0\cdot 2/3 + 1\cdot 1/3 = 1/3$. Now, we must calculate the expected value of the product of $X$ and $Y$. That product can take on values 0 or 1 (multiply the elements of each ordered pair together) with respective probabilities 2/3 and 1/3. These probabilities are obtained the same way as for the individual expectations. Thus,

$$E[XY] = 0\cdot 2/3 + 1\cdot 1/3 = 1/3$$

Finally, we put it all together:

$$\text{Cov}(X,Y) = E[XY]-E[X]E[Y] = \frac{1}{3}-\frac{2}{3}\cdot\frac{1}{3} = \frac{1}{3}-\frac{2}{9} = \frac{1}{9}$$
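The whole computation can be reproduced with exact fractions (variable names mine):

```python
from fractions import Fraction

points = [(0, 0), (1, 0), (1, 1)]   # joint support, each with probability 1/3
p = Fraction(1, 3)

E_X  = sum(x * p for x, _ in points)       # 2/3
E_Y  = sum(y * p for _, y in points)       # 1/3
E_XY = sum(x * y * p for x, y in points)   # 1/3
cov  = E_XY - E_X * E_Y                    # 1/9
```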

Covariance (and correlation, the normalized form of covariance) measure the linear relationship between two random variables. This is important. Next, we’ll look at how independence and correlation are related.

## Independence $\Rightarrow$ Uncorrelated

Here’s the start of where the confusion lies. First, it is absolutely true that if two random variables are independent, then they are uncorrelated. It’s important to prove this, so I will do it. I will prove this for discrete random variables to avoid calculus, but this holds for all random variables, both continuous and discrete.

Theorem. If two random variables $X$ and $Y$ are independent, then they are uncorrelated.

Proof. Uncorrelated means that their correlation is 0 or, equivalently, that the covariance between them is 0. Therefore, we want to show that if two random variables are independent, then the covariance between them is 0.

Now, recall the formula for covariance:

$$\text{Cov}(X,Y) = E[XY]-E[X]E[Y]$$

If the covariance is 0, then $E[XY] = E[X]E[Y]$. Therefore, it is sufficient to show that $E[XY] = E[X]E[Y]$: being uncorrelated is equivalent to a 0 covariance, which is equivalent to this last equality, so establishing it establishes that $X$ and $Y$ are uncorrelated.

OK, we are given that $X$ and $Y$ are independent, which by definition means that $P(X,Y) = P(X)P(Y)$. This is our starting point, and we know what we want to show, so let’s calculate $E[XY]$:

$$E[XY] = \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)$$

This is what $E[XY]$ is. We have to sum the product of the two random variables times the probability that $X=x$ and $Y=y$ over all possible values of $X$ and all possible values of $Y$. Now, we use the definition of independence and substitute $P(X=x)P(Y=y)$ for $P(X=x, Y=y)$:

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\end{aligned}

If I sum over $y$ first, then everything related to $X$ is constant with respect to $Y$. Then I can factor out everything related to $X$ from the sum over $y$:

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\\&= \sum_{x}x\cdot P(X=x)\sum_{y}y\cdot P(Y=y)\end{aligned}

Then if I actually carry out that inner sum over $y$, it becomes a completed object with no relationship to the outer sum going over $x$. That means I can put parentheses around it and pull it out of the sum over $x$:

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\\&= \sum_{x}x\cdot P(X=x)\left(\sum_{y}y\cdot P(Y=y)\right)\\&= \left(\sum_{y}y\cdot P(Y=y)\right)\left(\sum_{x}x\cdot P(X=x)\right)\end{aligned}

Looking at the objects in each group of parentheses, we see that it matches the definition of expectation for $X$ and $Y$. That is $E[X] = \sum_{x}x\cdot P(X=x)$, and similar for $Y$. Therefore, we have shown that $E[XY] = E[X]E[Y]$, and have proven that independence always implies uncorrelated.
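The factorization in the proof can be seen numerically with any pair of independent discrete distributions (these particular marginals are made up for illustration):

```python
from itertools import product

px = {0: 0.2, 1: 0.5, 2: 0.3}    # hypothetical marginal pmf for X
py = {-1: 0.4, 3: 0.6}           # hypothetical marginal pmf for Y

E_X = sum(x * p for x, p in px.items())
E_Y = sum(y * p for y, p in py.items())
# Under independence the joint pmf factors: P(x, y) = P(x) * P(y).
E_XY = sum(x * y * px[x] * py[y] for x, y in product(px, py))
```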

Now, to use these two words (independent and uncorrelated) interchangeably, we would have to know that the converse of the statement we just proved is true:

Uncorrelated implies independence

If we find even one counterexample (an example where the two variables have 0 correlation but do not fit the definition of independence), then the converse is false and we cannot use those terms interchangeably.

## No luck, I have a counterexample

Let’s take $X$ and $Y$ to exist as an ordered pair at the points (-1,1), (0,0), and (1,1) with probabilities 1/4, 1/2, and 1/4 respectively. Then $E[X] = -1\cdot 1/4 + 0 \cdot 1/2 + 1\cdot 1/4 = 0$, while $E[Y] = 1\cdot 1/4 + 0\cdot 1/2 + 1\cdot 1/4 = 1/2$, and

$$E[XY] = -1\cdot 1/4 + 0 \cdot 1/2 + 1\cdot 1/4 = 0 = E[X]E[Y]$$

and thus $X$ and $Y$ are uncorrelated.

Now let’s look at the marginal distributions. $X$ takes the values -1, 0, and 1 with probabilities 1/4, 1/2, and 1/4, while $Y$ takes the values 0 and 1 with probabilities 1/2 and 1/2. Then, looping through the possibilities, we check whether $P(X=x, Y=y) = P(X=x)P(Y=y)$:

$$P(X=-1, Y=1) =1/4 \neq 1/8 = P(X=-1)P(Y=1)$$

One failing point is already enough to conclude that $X$ and $Y$ do not meet the definition of independence.

Therefore, we just found an example where two uncorrelated random variables are not independent, and thus the converse statement does not hold.
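The counterexample can be verified end to end with exact fractions (helper names mine):

```python
from fractions import Fraction
from collections import defaultdict

joint = {(-1, 1): Fraction(1, 4), (0, 0): Fraction(1, 2), (1, 1): Fraction(1, 4)}

E_X  = sum(x * p for (x, _), p in joint.items())       # 0
E_Y  = sum(y * p for (_, y), p in joint.items())       # 1/2
E_XY = sum(x * y * p for (x, y), p in joint.items())   # 0
cov  = E_XY - E_X * E_Y                                # 0: uncorrelated

# Build the marginals and test the independence definition pointwise.
px, py = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

independent = all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
                  for x in list(px) for y in list(py))  # False
```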

## Conclusion

Correlation is a linear association measure. We saw in our counterexample that there was no linear relationship between the two random variables. That doesn’t mean the variables don’t have any effect on each other (again, as we saw in our counterexample). The common mistake is to forget that correlation is a restrictive measure of relationships, since it only captures linear ones. Independence is a measure of “probabilistic effect,” which encompasses far more than simple linear association.

The words uncorrelated and independent may be used interchangeably in English, but they are not synonyms in mathematics. Independent random variables are uncorrelated, but uncorrelated random variables are not always independent. In mathematical terms, we conclude that independence is a more restrictive property than uncorrelatedness.


## Interventions, not Anomalies

Anomaly Detection is becoming almost universally considered a “hot topic” across industries and business fields.

A venture capitalist wants to know if a potential investment is an anomaly that will make billions at the time of IPO. A credit card company wants to know if a particular transaction is anomalous, as it could be potentially fraudulent. A systems administrator wants to know if the latency on his storage system is unique to him, or widespread throughout the network. A logistics manager wants to know if the recent influx of orders to his warehouse is unusual and/or permanent. In each of these cases, they are all implicitly asking the same question:

Is my current observation abnormal?

Before we discuss how to detect anomalies/abnormalities/outliers, the term must be defined. What is an anomaly? The brutal secret is that there is no rigid (i.e. mathematical) definition of an anomaly. (See the previous discussion here.) Some attempts at a definition are:

• “An outlying observation, or ‘outlier,’ is one that appears to deviate markedly from other members of the sample in which it occurs.” [1]

• An outlier is an observation that is far removed from the rest of the observations. [2]

• Anything that lies outside the median $\pm$ 1.5 IQR

All of these statements are relative to some definition of “typical” or “normal,” and they define an outlier as lying “too far” from it according to some distance metric. Many debates focus on exactly which metric to use, depending on the type of data ($L^{2}$ norm, absolute distance from the median, more than three standard deviations from the mean, etc.), but the crux of the issue is sidestepped.
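To make one of the listed rules concrete, here is a small sketch of the IQR rule, implemented with the common quartile-based Tukey fences (the helper name and the sample data are mine):

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

flagged = iqr_outliers([1, 2, 3, 4, 5, 100])
```

Note that every threshold here (the quartiles, the multiplier $k$) is a convention, which is exactly the subjectivity the text is pointing at.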

In order to define an anomaly, one must define “normal”, and then decide how much variability in “normal” is tolerable. There are two stages of subjectivity at play here:

1. The definition of “normal”
2. How much variability is acceptable

These two items must be decided before a definition of an anomaly is attempted, which means the resulting definition of an anomaly has a shaky foundation at best. That is, the working definition of an anomaly is relative, and relative to moving targets. Subjectivity has no place in mathematics; applications of vague mathematics are weak and prone to misuse and abuse. (See previous post.)

There are occasions when the definition of an anomaly using the above two steps can be made clear. These occasions are rare, but should be noted. These circumstances typically show up in rigid, well-defined environments, like manufacturing, or engineering/physics laws.

Manufacturing specifications define “normal” clearly and unambiguously, like the location of a hole in sheet metal.  Laws of physics are immutable, and not dependent on judgment or subjective interpretation. The laws of motion define how a “macro” object behaves (I hear you, quantum physicists), regardless of the object. We know where both a bowling ball and a golf ball are going to land given their initial position and velocity on Earth.

Acceptable variability is also clearly defined in both of these cases. In manufacturing, a design engineer specifies an acceptable tolerance for deviation of the hole’s location. In the case of observing physical laws, the variability comes from measurement error, and is given by the accuracy of the measurement tool.

Example. (The Ideal Gas Law)

The ideal gas law describes the relationship between the pressure (in atmospheres), volume (in liters), temperature (in kelvin), and amount (in moles) of an “ideal gas”. Nitrogen behaves very nearly like an ideal gas, so for this example we will assume that nitrogen follows the ideal gas law exactly. The ideal gas law is given by

$$PV = nRT$$

where $R \approx 0.08 \tfrac{L\cdot \text{atm}}{\text{mol}\cdot K}$ is the ideal gas constant (rounded from 0.0821 for this example). Suppose now that we have exactly 1 mol of nitrogen in a 1 L container. We wish to increase the temperature of the nitrogen to see the effect on the pressure. In this case, we know exactly what the result should be, because we have a physical law that defines the model for this behavior. We will also assume that the hypothetical pressure sensor has a tolerance of 0.001 atm. That is, the sensor can be up to 0.001 atm off when a measurement is taken.

The graph above shows the exact pressure readings we should expect as we increase the temperature of the container. For example, if the temperature of the container is known to be 2 degrees Celsius (275.15 K), the law gives $P = nRT/V = (1)(0.08)(275.15)/1 \approx 22.012$ atm. If the pressure sensor reads outside $22.012 \pm 0.001$ atm, we can say the reading is anomalous, because the tolerance is defined by the sensor, and the model is defined exactly by a physical law.
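The ideal gas example can be sketched in a few lines. Everything here (the rounded $R = 0.08$, 1 mol, 1 L, the 0.001 atm tolerance) comes from the example above; the sensor and its readings are hypothetical:

```python
# With a known physical law and a known sensor tolerance, "anomalous"
# has an objective meaning: any reading that deviates from the law's
# prediction by more than the tolerance.
R = 0.08          # L*atm/(mol*K), rounded as in the text
n_mol = 1.0       # moles of nitrogen
V = 1.0           # liters

def expected_pressure(temp_celsius):
    """Pressure predicted by PV = nRT, in atm."""
    return n_mol * R * (temp_celsius + 273.15) / V

def is_anomalous(reading_atm, temp_celsius, tol=0.001):
    """A reading is anomalous if it deviates from the law's prediction
    by more than the sensor tolerance."""
    return abs(reading_atm - expected_pressure(temp_celsius)) > tol

print(expected_pressure(2.0))     # about 22.012 atm
print(is_anomalous(22.5, 2.0))    # True: well outside tolerance
print(is_anomalous(22.012, 2.0))  # False: within tolerance
```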

That is, both (1) and (2) from above are specified rigidly and objectively, and not according to a person’s judgment. In the vast majority of use cases for “anomaly detection”, measurements are not being taken according to a known model. Most of the time, the mathematical model that governs the data is unknown prior to even the desire to detect anomalies (whatever we mean by that) in the dataset. Thus, we cannot extend the principles used in manufacturing and physics to the uncertainty of data science.

### Intervention vs. Anomalies

The data science world requires a different approach that encompasses the inherent uncertainty involved in modeling behavior in the first place, let alone determining “excessive” deviations from it. Enter intervention analysis.

Intervention analysis is a broader yet more rigorous term that describes the study of changes to a data set.2 Conceptually, this term embodies the question

Did the behavior change, and is it permanent?

When we express a desire for an “anomaly detection system”, this is the fundamental question we are asking. Typical algorithms used for outlier analysis, like the Extreme Studentized Deviate test3, only look for “strange points” within the sequence or dataset relative to the data currently in view. If a decision is to be made based on a model or a deviation from it, we would like to know what kind of deviation it is. Intervention analysis formally classifies five4:

1. Level Shift: at a particular point, the series makes a stepwise change that is permanent.

2. Level Shift with Lag: same as (1), but the change takes a few time steps to occur. This shows up in the dataset as a gradual increase to the new permanent level.

3. Temporary Change: the series experiences a level shift whose effect decays over time.

4. Innovation Outlier: the most complex type, typically defined to represent the onset of a causal effect or a change in the model itself.

5. Additive Outlier: the “point outlier”. It has no effect on future behavior.

For examples of each, visit this page for a good introductory description. I also cite the relevant papers in the references.
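As a rough sketch (my own illustration, not taken from the cited papers), the five types can be written as simple perturbations of a flat baseline series. The magnitudes, the decay rate, and the AR(1) coefficient used for the innovation outlier are arbitrary choices:

```python
# Illustrative sketch of the five intervention types on a flat baseline.
# The innovation outlier is shown for an AR(1) model, where a shock at
# T0 propagates through the model's impulse response (phi ** k).
N, T0, SIZE = 100, 50, 5.0
DELTA = 0.7   # decay rate for the temporary change
PHI = 0.8     # AR(1) coefficient for the innovation outlier

def effect(kind, t):
    """Deviation from baseline at time t for each intervention type."""
    if t < T0:
        return 0.0
    if kind == "level_shift":        # permanent step change
        return SIZE
    if kind == "level_shift_lag":    # gradual rise to the new level
        return SIZE * min(t - T0, 5) / 5
    if kind == "temporary_change":   # step that decays away
        return SIZE * DELTA ** (t - T0)
    if kind == "innovation":         # shock filtered through the model
        return SIZE * PHI ** (t - T0)
    if kind == "additive":           # one-off point, no memory
        return SIZE if t == T0 else 0.0

series = {k: [10.0 + effect(k, t) for t in range(N)]
          for k in ["level_shift", "level_shift_lag", "temporary_change",
                    "innovation", "additive"]}

# The level shift is permanent; the additive outlier leaves no trace:
print(series["level_shift"][-1], series["additive"][-1])  # 15.0 10.0
```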

The important aspect of this approach is that a model is formally defined (for time series, typically an ARIMA model), as well as the changes to that model. Thus, when we talk about a level shift, there is no ambiguity. Moreover, with formal definitions and types, we can classify a change according to the effect the point has on future points.5

## Proposed Strategy

Detecting interventions in practice is difficult; this post is not meant to diminish that. Creating a model from data without much prior knowledge of the data's behavior is also difficult, because interventions can contaminate the model-building itself. If a level shift is present in the data at a certain point, we may end up with an initial model that doesn't notice the shift, because it was absorbed into the model as if it were normal behavior, simply because we didn't know any better. Estimation techniques exist, and some papers are referenced at the end for those interested. Here we are interested in an overall strategy for approaching these problems in the first place.

Since we must estimate the model, then attempt to identify interventions and classify them, a sensible solution is a probabilistic approach. That is, the modelling is done via some (hopefully robust) method X, and then the potential intervention points are identified via another method Y. Behind the scenes, we ultimately want to classify each point as

1. nothing
2. one of the five types of interventions described above

and rank these possibilities according to the likelihood that the point falls into each of the six categories. The uncertainty is then expressed in a truly probabilistic fashion, and a decision can be made taking all the possibilities for a particular intervention into account.
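One way to sketch the final step, turning raw scores into a ranked probability distribution over the six possibilities, is a softmax normalization. The scores below are hypothetical placeholders for whatever the detection method Y produces:

```python
import math

LABELS = ["nothing", "level shift", "level shift with lag",
          "temporary change", "innovation outlier", "additive outlier"]

def to_probabilities(scores):
    """Softmax: turn raw scores into probabilities summing to 1.
    Subtracting the max score first keeps exp() numerically stable."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one point in a series:
scores = [0.5, 2.0, 0.1, 1.2, 0.3, 0.4]
probs = to_probabilities(scores)

# Rank the six possibilities from most to least likely:
ranked = sorted(zip(LABELS, probs), key=lambda pair: -pair[1])
for label, p in ranked[:3]:
    print(f"{label}: {p:.2f}")
```

The decision-maker then sees the full distribution rather than a hard yes/no label, which is the point of the probabilistic approach.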

## Conclusion

In order to identify and solve business problems effectively, the terms must be rigorously defined. By looking at intervention analysis, we can approach very difficult problems on a more solid foundation, since the model and the interventions/outliers have unambiguous, formal definitions. Moreover, the “big picture” strategy should be probabilistic rather than threshold- or classification-based. In this way, the uncertainty can be expressed fully, which allows for more informed decision-making.