Browsed by
Category: Probability Theory

Should I accept this shipment?

Should I accept this shipment?

Suppose you work for an engineering or manufacturing firm, and you receive shipments of different parts from various suppliers. It’s not good business practice to just blindly accept a shipment, because some parts in your batch may be defective. Too many defective parts, and you’ll slow down manufacturing (in addition to wasting valuable time and money). You come up with a brilliant plan (which is used in industry already): you’ll take a sample from the shipment, test the sample, and if the number of defective parts in the sample is below some predetermined number, you’ll accept the shipment. Otherwise, you’ll send it back and probably place an angry call to your supplier.


(1) Why not just test everything?

For one, shipments may contain thousands of parts. That will take forever. Secondly, some tests are destructive, meaning you push an item to its breaking point, which renders it unusable. (In these cases, you’re testing to ensure the breaking point is high enough.) Thirdly, certain kinds of tests may be expensive or time-consuming. We have real costs — time and money and physical scarcity — we must consider now.

(2) How big should our sample be?

There are statistical considerations and business/cost considerations. There’s no perfect answer here. Sampling theory is a large branch of statistics, and there are tools for analysis of optimal sample size. However, we must also consider the specific case. If we have a destructive test, the “statistically optimal” size may destroy too many. If the test is time-consuming, we may lack the time to perform the test on a large number of items1

(3) What should be the “cutoff” for shipment rejection?

This discussion is the main focus of this article. We’re going to take a pretty small and simplistic overview of it, just to illustrate how powerful even basic probability can be for engineering and business problems. To do this, we’ll briefly describe the hypergeometric distribution, and then illustrate its use in an operating characteristic curve.

The Hypergeometric Distribution

Suppose we have N objects belonging to one of two classes, with N_{1} in the first class, and N_{2} in the second class, so N = N_{1} + N_{2}. (In this example, N_{1} is the total number of defective parts in the batch, and N_{2} is the number of good parts.) We select n objects without replacement2. Find the probability that exactly x belong to the first class and n-x to the second.

We can select x objects from the N_{1} in class one in {N_{1} \choose x} = \frac{N_{1}!}{x!(N_{1}-x)!} ways3. There are n-x left and they must come from N_{2} objects. We can select those in {N_{2} \choose n-x} ways. Then P(X=x) = \frac{{N_{1} \choose x}{N_{2} \choose n-x}}{{N \choose n}}.

Evaluating a shipment acceptance plan

We’ll create a small example with manageable numbers to illustrate the use of the hypergeometric distribution in acceptance sampling.

A supplier ships parts to another company in lots of 25. Some items in the shipment may be defective. The receiving company has an acceptance sampling plan to inspect n=5 parts from the lot without replacement. If none of the sample is defective, they accept the shipment, otherwise they reject it. Suppose we want to evaluate this plan.

If X is the random number of defectives in the sample, then X has a hypergeometric distribution. Except this time, N_{1} and N_{2} (number of defective and nondefective parts respectively) is unknown. We only know N_{1} + N_{2} = 25.

In designing an acceptance plan, we want the probability of accepting the lot to be large if N_{1} is small. That is, we want to have a high probability of accepting the lot if the true number of defective parts is very small. We also want the probability of “falsely accepting” the lot to be low. (That is, we want the probability of acceptance to be low when N_{1} is high).

When we treat these probabilities as a function of N_{1} (or equivalently, the fraction defective given by p = \frac{N_{1}}{25} in the lot) we call this the operating characteristic curve. Mathematically, the operating characteristic curve, denoted OC(p) is given in this case by:

OC(p) = P(X=0) = \frac{{N_{1} \choose 0}{25-N_{1} \choose 5}}{{25 \choose 5}}

Note here that OC(p) = P(X=0) because that’s the plan in this fictional example. If the plan changed, the operating characteristic curve would be defined by something different. For example, if the plan was to accept shipments that contain 1 or fewer defects, then OC(p) = P(X\leq 1) and we would recalculate those probabilities using the hypergeometric probability mass function given above.

Let’s look at some numerical values of OC(p) for our fictional example. Remember that p = \frac{N_{1}}{25}.


Is the acceptance plan satisfactory? With N_{1} = 1, OC(0.04) = 0.8 which may be considered too low (we may reject perfectly valid shipments), and with N_{1} = 4, OC(0.16) = 0.383 may be too high (we may not want that high a probability of accepting a shipment with that many defects). Thus, we may want to reconsider the number of items we test, or reconsider our acceptance plan.

Usually lot sizes are far larger than the numbers seen here, and sample sizes are in the hundreds, so as the values get large, this computation becomes cumbersome. We can approximate the hypergeometric distribution with the Poisson distribution, which we won’t cover here.


This is a small illustration of the very practical use of the hypergeometric distribution to devise an intelligent strategy for accepting/rejecting shipments of goods. This type of work falls within the purview of the services we offer.

Expecting the Unexpected: Borel’s Paradox

Expecting the Unexpected: Borel’s Paradox

One of the best ways to shorten a proof in statistics or probability is to use conditioning arguments. I myself have used the Law of Total Probability extensively in my work, as well as other conditioning arguments in my PhD dissertation. Like many things in mathematics, there are subtleties that, if ignored, can cause quite a bit of trouble. It’s a theme on which I almost feel like I sound preachy, because subtlety, nuance, and deliberation followed by cautious proceeding is about as old-fashioned as my MS Statistics1

One particularly good paper that discusses this was written by Michael Proschan and Brett Presnell  in The American Statistician  in August 1998 titled “Expect the Unexpected from Conditional Expectation”. In it, the authors noted the following seemingly innocuous question posed on a statistics exam

If X and Y are independent standard normal random variables, what is the conditional distribution of Y given Y=X?

There are three approaches to this problem.

(1) Interpret the statement that Y=X by declaring a new random variable Z_{1} = Y-X where Z_{1}=0.

Here, the argument proceeds as follows:

Y and Z_{1} have a bivariate normal distribution with \mu = (0,0), \sigma_{Y}^{2}=1, \sigma_{Z_{1}}^{2}=2, and correlation \rho = \tfrac{1}{\sqrt{2}}. Thus, we know that the conditional distribution of Y given Z_{1}=0 is itself normal with mean \mu_{Y}+\rho\frac{\sigma_{Y}}{\sigma_{Z_{1}}}(0-\mu_{Z_{1}})=0 and variance \sigma_{Y}^{2}(1-\rho^{2}) = \tfrac{1}{2}. Thus, the conditional density is 

f(y|Z_{1}=0) = \frac{1}{\sqrt{2\pi}}e^{-y^{2}}

This was the expected answer. However, one creative student did a different argument:

(2) Interpret the statement that Y=X by declaring a new random variable Z_{2} = \tfrac{Y}{X} where Z_{2}=1.

This student used the Jacobian method 2 and transformed the variables via declaring s=y, z_{2}=\tfrac{y}{x} and finding the joint density of S and Z_{2}. The marginal density for S was then found by dividing the joint density by the marginal density of Z_{2} evaluating at z_{2}=1. The reason for the ratio is that the marginal density of Z_{2} being the ratio of independent standard normal random variables has a Cauchy distribution3. This student’s answer was 

f(y|Z_{2}=1) = |y|e^{-y^{2}}

Not the expected answer. This is a correct interpretation of the condition Y=X, so the calculations are correct. There was a third different answer, from a student who had taken a more advanced probability course.

(3) Interpret the statement Y=X as Z_{3} = 1 where Z_{3} = \mathbb{I}[Y=X]

Here \mathbb{I}(\cdot) is the indicator function, where the variable is 1 if the condition is met, and 0 otherwise. The argument here is that Y and Z_{3} are independent. Why? Z_{3} is a constant zero with probability 1. A constant is independent of any random variable. Thus the conditional distribution of Y given Z_{3}=1 must be the same as the unconditional distribution of Y, which is standard normal.

This is also a correct interpretation of the condition. 

From the paper, “At this point the professor decided to drop the question from the exam and seek therapy.”

What happened?

At this point, both the paper did and we shall revisit exactly what conditional probability means. Suppose we have continuous random variables X and Y. We’ll usually write expressions like \mathbb{P}(Y\leq y|X=x) or \mathbb{E}(Y|X=x). However, an acute reader might already ask the question about conditioning on sets of probability 0. For a continuous random variable, the probability that we land on any specific real value x is indeed 0, which hearkens back to the measure-theoretic basis of probability. As it turns out, this little subtlety is indeed the culprit.

Formal definition of conditional distribution and conditional expectation

First, we take a look at the formal definition of conditional expectation

Definition.  A conditional expected value \mathbb{E}(Y|X) of Y given X is any Borel function g(X) = Z that satisfies 

\mathbb{E}(Z\mathbb{I}(X \in B)) = \mathbb{E}(Y\mathbb{I}(X \in B))

for every Borel set B. Then g(x)4 is the conditional expected value of Y given X=x and we can write \mathbb{E}(Y|X=x)

What this means is that the conditional expectation is actually defined as a random variable whose integral over any Borel set agrees with that of X5. Of import here is the fact that the conditional expectation is defined only in terms of an integral. From Billingsley, 1979, there always exists at least one such Borel function g, but the problem here is that there may be infinitely many. Each Borel function g that satisfies the above definition is called a version of \mathbb{E}(Y|X)

So let’s say we have two versions of \mathbb{E}(Y|X), called Z_{1} = g_{1}(X) and Z_{2} = g_{2}(X). Then we have that \mathbb{P}(Z_{1}=Z_{2})=1. Still seems pedantic, but if we look at this from a measure-theoretic perspective, this means that two versions are equal except on a set N of x such that \mathbb{P}(X \in N)=0.

What does this have to do with conditional distributions? 

For each y we fix we can find some function6 G_{y}(x) = G(y|x) such that 

G(y|x) = \mathbb{P}(Y\leq y|X=x)

In other words7, G(y|X) is a version of \mathbb{E}(\mathbb{I}(Y\leq y)|X). This last expression here is a conditional distribution of Y given X=x

Notice I said “a” conditional distribution, and not “the” conditional distribution. The words were chosen carefully. This leads us into the Borel Paradox, and the answer to why all three answers of that exam question are technically correct. 

The Borel Paradox

Also known as the Equivalent Event Fallacy, this paradox was noted by Rao (1988) and DeGroot (1986). If we attempt to condition on a set of probability (or measure) 0, then the conditioning may not be well-defined, which can lead to different conditional expectations and thus different conditional distributions. 

In the exam question, Z_{1}, Z_{2} and Z_{3} are all equivalent events. The fallacy lies in assuming that this equivalence would mean that conditioning on, say, \{Y-X=0\} is the same as conditioning on the event \{Y/X=1\}. In almost all8 cases, this is true. If the events in question have nonzero probability, then the professor’s assumption was true. However, when we’re conditioning on events that have zero probability, the classic Bayes’s formula interpretation doesn’t hold anymore, because the denominator is now 0.9

If we have an event like \{Y=X\}, then from the previous section we know that there are infinitely many versions of this event. We have to think of conditioning on a random variable, not just a value. There were three versions given above:

  • Z_{1} = Y-X, which has value z_{1}=0
  • Z_{2} = Y/X, with value z_{2}=1
  • Z_{3} = \mathbb{I}(Y=X) with value z_{3} =1

Proschan and Presnell dig a bit deeper and discuss the details here on exactly where these equivalent interpretations diverge10 and yield different conditional distributions. They also discuss the substitution fallacy, which again notes the consequences of having infinitely many versions of \mathbb{E}(Y|X), and why the convolution argument typically given to derive the distribution of the sum of independent random variables is nowhere near as air-tight as it appears.11

What’s the solution?

The Borel Paradox reared its ugly head because different yet correct interpretations of the conditioning events/sets of probability 0 yielded different conditional expectations and different conditional distributions. The core of the problem was those sets having probability 0. How do we avoid this? We actually would like to calculate \mathbb{E}(Y|X=x) (and conditional distributions that result from it), so how do we get around this paradox of infinite versions?

We take the analysts’ favorite trick: sequences and limits. Take a sequence of sets (B_{n}) where B_{n} \downarrow x. Now we’re only ever actually computing \mathbb{E}(Y|X \in B_{n}) where \mathbb{P}(X \in B_{n})>0. Now define

\mathbb{E}(Y|X=x) = \lim_{n\to\infty}\mathbb{E}(Y|X \in B_{n})


(I admit I’m seriously sidestepping some careful considerations of convergence and choices of sequence here. There is more subtlety and consideration in the analytic study of sequences of sets, but I don’t want to push too far down the rabbit hole here. The point is that we can avoid the paradox with care.)


Subtleties and paradoxes occur all over mathematics. This doesn’t mean mathematics is broken, that all statistics are lies, or any other variation of frustrated exasperation I’ll hear when discussing these. What these subtleties, fallacies, and paradoxes do show is that careful consideration is paramount to the study, practice, and application of mathematics. 


  1. Billingsley P,. (1979), Probability and Measure(1st ed.),New York:Wiley
  2. Casella, G. and Berger, R.(2002) Statistical Inference (2nd ed.), Wadsworth
  3. DeGroot,M.H. (1986), Probability and Statistics (2nd ed.), Reading,MA: Addison-Wesley.
  4. Proschan, M.A. and Presnel, B. (1998), “Expect the Unexpected from Conditional Expectation”, American Statistician, 48, 248-252
  5. Rao,M.M.(1988),”Paradoxes in Conditional Probability,” Journal of Multivariate Analysis,27, 434-446
Networking Mathematics: Random Early Detection and TCP synchronization

Networking Mathematics: Random Early Detection and TCP synchronization

Computer networks are something most of us take for granted–speed, reliability, availability are expectations. In fact, network problems tend to make us very angry, whether it’s dropped packets (yielding jittery Skype calls), congestion (that huge game download eating all the bandwidth), or simply a network outage. There’s an awful lot going on underneath the hood of all devices on a network to load that webpage or download that song for you. 

Much of the reliability in networking relies on maintaining good Quality of Service (QoS) policies, which involve buffering and queue management. Networks aren’t unlike the roads we travel on; traffic isn’t steady, and congestion at interfaces (or intersections on the road) happen. How do we handle that? We’ll explore some basic networking principles in order to uncover some interesting mathematics governing buffering and queue management at congested network interfaces.

Update: Mr. Fred Baker reached out to me with a few corrections regarding my interchangeable use of queue and buffer. I’ve inserted his comments into the “Buffer” section. Incidentally, Mr. Baker was the inventor of the WRED (Weighted Random Early Detection) algorithm mentioned as an extension below. 

Read More Read More

Little’s Law: For Estimation Only

Little’s Law: For Estimation Only

I had been intending on writing some posts on queuing theory for a while now, as this branch is the closest to my research interests and was the spark that sent me down the road that eventually led to my PhD dissertation. Most are quite familiar with the concepts of queuing theory, at least intuitively, so this is one of the more tangible topics in mathematics. We’ve all stood in queues at grocery stores, grouched in rush hour traffic, or had to refresh a webpage when the connection drops. I was reminded of my intentions when Datrium (to my surprise) mentioned a common queuing theory result called Little’s Law in their Storage Field Day presentation, and they even have an article they’ve written that makes use of it1. What exactly is Little’s Law, and what does and doesn’t it say?

Some queuing theory terms

To be rigorous in our discussion, we’ll need to formalize some concepts. A queuing system is composed of two components: a queue and a server. The queue is the line you stand in, and the server is the object performing some kind of task (like scanning your groceries). A queuing system can have more than one server2, and different kinds of service policies3

Counting the number of customers in each stage

A customer is in a queuing system when he is standing in the queue itself, and when he is being served by a server. So if we let N_{q} be the number of customers waiting in a queue, and N_{s} be the number of customers currently being served, then the total number of customers in a queuing system (N) is given by 

N = N_{q} + N_{s}

It’s important to note here that customers arrive randomly and are served with random service times. So the number of customers in a queue or in service is a random variable, and changing with time. The arrival times, the service times, and the number of customers in the queuing system all can be assigned probability distributions, but these are random variables. 

Looking at the time

When standing in a queue, you’re now in the queue; that is, you’ve arrived. Now you have to wait. How long will you wait before service? How long will service take? As anyone moving through everyday life can attest, both of those times are random. We denote W as the waiting time a customer spends in the queue before service, and X the amount of time a customer spends being served. Thus, the total is 

T = W+X

and is commonly referred to as the sojourn time, or the total time spent in the queuing system.

Once again, W and X are random variables, and therefore so is T

Arrival Rates

Arrivals to a queuing system are random, and governed by a counting process with a probability distribution. The most common counting process used in queuing theory is the Poisson process due to its very comfortable mathematical properties. 

If we watched the arrivals to a queue for a very long time, a simple method of analyzing a queue is to find the mean arrival rate, or the average number of arrivals per unit time, typically denoted by \lambda. We could make a rough guess at it by dividing the number of arrivals in a long period of time by the length of our “watch period”. A Poisson process has a mean arrival rate, and the number of arrivals in a period of time follows a Poisson distribution. 

At the risk of sounding like a broken record, I’d like to point out that a mean arrival rate of 1/hr does not mean that each hour, only one customer will arrive. It’s an average. Some hours will have no arrivals, some will have 5 arrivals, or 3 arrivals. An average is simply a point estimate of a typical hour.4

The Statement of Little’s Law

Little’s Law is an elegant little formula that relates the mean number of customers in a queuing system, the mean sojourn time, and the mean arrival rate. It’s quite simply written

\text{E}[N] = \lambda\text{E}[T]

There’s not much to analyze here, which in many ways is nice. If you want to get the mean or expected number of customers in your queuing system at a given time, simply multiply the average arrival rate with the average time in the system. 

What’s great about this law is that it’s just that–a law. It’s true regardless of the service distribution, the arrival process, or the service policies. Sounds like a silver bullet, right? Queuing theory solved!

Some subtleties

I keep insisting the reader pay close attention to the fact that we’re dealing with random variables the whole way through. Everything is random, which means that nothing really hits the mean that often. Little’s Law is great for quick calculations and estimations, but relying too heavily on it oversimplifies a queuing system, which can be damaging. (This is similar to wantonly applying the Central Limit Theorem, even when your assumptions are not met. )

Little’s Law essentially smooths out the randomness into a deterministic, steady state equation. The means are numbers, not random variables. What we need to understand about this formula is what it doesn’t help us with. Random variables have variances, and Little’s Law just gives us an average number calculated from other averages.


A queuing system is dynamic, not a steadily flowing river. Little’s Law is great in practice for estimation, and having such an elegant relationship between the means of several random variables is useful to get a general idea of what’s going on in your queuing system. Traffic planning and analysis is a difficult task because of all the randomness involved.

The danger of using Little’s Law as your silver bullet (much the way the Central Limit Theorem is commonly used as a silver bullet) is that you risk losing valuable information about the variation in the random variables that make up your queuing system, which can wreak havoc on the best-laid plans.

Example: Suppose a store owner applies Little’s Law in order to determine how many cashiers he should call in each day. He calculates via Little’s Law that he expects 10 customers/hr, and thus feels perfectly justified in only having two cashiers per day in the store. What he didn’t realize is that the weekends tended to be very busy, with swarms of people coming in the store, and Wednesdays were practically a ghost town in the store. 

Everything averaged out to 10 customers/hr, but that wasn’t much help for the two cashiers who had to handle the weekend rush, and it wasn’t much help to the owner who ended up paying two cashiers that weren’t needed on Wednesdays. He probably could have just handled the store alone on Wednesdays. 

The example above is a simple one, but there is an important lesson in here. Too much smoothing causes valuable information loss. Little’s Law is great for estimation, and should only be used as such.


There is a distributional version of Little’s Law, published in 1993, which is much better suited than the original Little’s Law because it discusses the relationship of the probability distributions of the random variables in the queuing system rather than simply their averages.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The Red-Headed Step-Distributions

The Red-Headed Step-Distributions

Almost every textbook in probability or statistics will speak of classifying distributions into two different camps: discrete (singular in some older textbooks) and continuous. Discrete distributions have either a finite or a countable sample space (also known as a set of Lebesgue measure 0), such as the Poisson or binomial distribution, or simply rolling a die. The probability of each point in the sample space is nonzero. Continuous distributions have a continuous sample space, such as the normal distribution. A distribution in either of these classes is either characterized by a probability mass function (pmf) or probability distribution function (pdf) derived from the distribution function via taking a derivative. There is, however, a third kind.

One rarely talked about, or mentioned quickly and then discarded. This class of distributions is defined on a set of Lebesgue measure 0, yet the probability of any point in the set is 0, unlike discrete distributions. The distribution function is continuous, even uniformly continuous, but not absolutely continuous, meaning it’s not a continuous distribution. The pdf doesn’t exist, but one can still find moments of the distribution (e.g. mean, variance). They are almost never encountered in practice, and the only real example I’ve been able to find thus far is based on the Cantor set. This class is the set of red-headed step-distributions– the singular continuous distributions.

Back up, what is Lebesgue measure?

Measure theory itself can get extremely complicated and abstract. The idea of measures is to give the “size” of subsets of a space. Lebesgue measure is one type of measure, and is actually something most people are familiar with: the “size” of subsets of Euclidean space in n dimensions. For example, when n=1, we live in 1D space. Intervals. The Lebesgue measure of an interval [a,b] on the real line is just the length of that interval: b-a. When we move to two dimensions, \mathbb{R}\times \mathbb{R}, the Cartesian product of 1D space with itself, our intervals combine to make rectangles. The Lebesgue measure in 2D space is area; so a rectangle built from [a,b]\times [c,d] has Lebesgue measure (b-a)(d-c). Lebesgue measure in 3D space is volume. And so forth. 

Now, points are 0-dimensional in Euclidean space. They have no size, no mass. They have Lebesgue measure 01. Intuitively, we can simply see that Lebesgue measure helps us see how much “space” something takes up in the Euclidean world, and points take up no space, and hence should have measure 0. 

In fact, any countable set of points has Lebesgue measure 0. Even an infinite but countable set. The union of disjoint Lebesgue measurable sets has a measure equal to the sum of the individual sets. Points are certainly disjoint, and they each have measure 0, and summing 0 forever still yields 0.2 So, the set \{0,1,2\} has Lebesgue measure 0. But so do the natural numbers \mathbb{N}and the rational numbers \mathbb{Q}, even though the rational numbers contain the set of natural numbers.

It is actually possible to construct an uncountable infinite set that has Lebesgue measure 0, and we will need that in constructing our example of a singular continuous distribution. For now, we’ll examine discrete and continuous distributions briefly.

Discrete (Singular) Distributions

These are the ones most probability textbooks begin with, and most of the examples that are familiar.

Roll a fair die. 

The sample space for a roll of a fair die X is S =\{1,2,3,4,5,6\}. The PMF is P(X = x) = 1/6, where x \in S. The CDF is given by the function P(X\leq x) = \sum_{j\leq x}P(X=j) 


P(X \leq 4) = \sum_{j\leq 4}\frac{1}{6} = \frac{2}{3}

Binomial Distribution

A binomial random variable X counts the number of “successes” or 1s in a binary sequence of n Bernoulli random variables. Think a sequence of coin tosses, and counting the number of heads. In this case, the sample space is infinite, but countable: S = \{0,1,2,\ldots\}. If the probability of a 1, or “success” is p, then the PMF of X is given by 

P(X=x) = {n \choose x}p^{x}(1-p)^{n-x}

Note here again that the sample space is of Lebesgue measure 0, but the probability of any point in that space is a positive number. 

Continuous Distributions

Continuous distributions operate on a continuous sample space, usually an interval or Cartesian product of intervals or even a union of intervals. Continuous distribution functions F are absolutely continuous, meaning that (in one equivalent definition), the distribution function has a derivative f=F' almost everywhere that is Lebesgue integrable, and obeys the Fundamental Theorem of Calculus:

F(b)-F(a) = \int_{a}^{b}f(x)dx

for a< b. This f is the probability distribution function (PDF), derived by differentiating the distribution function. Let’s mention some examples of these:

The Continuous Uniform Distribution

Suppose we have a continuous interval [a,b], and the probability mass is spread equally along this interval, meaning that the probability that our random variable X lies in any subinterval of size s has the same probability, regardless of location. Suppose we do not allow the random variable to take any values outside the interval. The sample space is continuous but over a finite interval. The distribution function for this X is given by 

F(x) = \left\{\begin{array}{lr}0&x< a\\\frac{x-a}{b-a}&a\leq x \leq b\\1&x > b\end{array}\right.

This is an absolutely continuous function. Then we may easily derive the PDF by differentiating F:

f(x) = \mathbb{1}_{x \in [a,b]}\frac{1}{b-a}

where \mathbb{1}_{x \in [a,b]} is the indicator function that takes value 1 if x is in the interval, and 0 otherwise. 

This distribution is the continuous version of a die roll. The die roll is the discrete uniform distribution, and here we just allow for a die with uncountably many sides with values in [a,b]. The probability of any particular point is 0, however, even though it is possible to draw a random number from this interval. To see this, note that the probability that the random variable X lies between two points in the interval, say x_{1} and x_{2} is given by multiplying the height of the PDF by the length (Lebesgue measure) of the subinterval. The Lebesgue measure of a point is 0, so even though a value for the PDF exists at that point, the probability is 0. 

We don’t run into issues here mathematically because we are on a continuous interval. 

The Normal Distribution

Likely the most famous continuous distribution, the normal distribution is given by the famous “bell curve.” In this case, the sample space is the entire real line. The probability that a normally distributed random variable X lies between any two points a and b is given by 

P(a\leq X \leq b) = \int_{a}^{b}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)dx

where \mu is the mean and \sigma^{2} is the variance. 

Singular Continuous Distributions

We’re going to begin this section by discussing everyone’s favorite counterexample in mathematics: the Cantor set. 

The Cantor set

The Cantor set is given by the limit of the following construction:

  1. Take the interval [0,1]
  2. Remove the middle third: (1/3, 2/3), so you’re left with [0,1/3]\cup[2/3,1]
  3. Remove the middle third of each of the remaining intervals. So you remove (1/9,2/9) from [0,1/3] and (7/9,8/9) from [2/3,1], leaving you with the set [0,1/9]\cup[2/9,1/3]\cup[2/3,7/9]\cup[8/9,1]

Continue this process infinitely.

This is an example of a set that is uncountable, yet has Lebesgue measure 0. Earlier, when we discussed Lebesgue measure, we noted that all countable sets had measure 0. Thus we may conclude that only uncountable sets (like intervals) have nonzero Lebesgue measure. However, the Cantor set illustrates that not all uncountable sets have positive Lebesgue measure. To see why the Cantor set has Lebesgue measure 0, we will look at the measure of the sets that are removed (the complement of the Cantor set):

At the first step, we have removed one interval of size 1/3. At the second step, we remove two intervals of size 1/9. At the third step, we remove four intervals of size 1/27. Let’s call S_{n} the subset removed from the interval [0,1] by the nth step. By the end of the third step, we have removed a set of size

m(S_{3}) = \frac{1}{3} + \frac{2}{3^{2}} + \frac{4}{3^{3}}

By the nth step, 

m(S_{n}) = \sum_{j=0}^{n}\frac{2^{j}}{3^{j+1}}

This is the partial sum of a geometric series, so

m(S_{n}) = 1-\left(\frac{2}{3}\right)^{n}

Now, the Cantor set is formed when n \to \infty. The measure of the complement of the Cantor set, which we called S_{\infty} then has measure

m(S_{\infty}) = \lim_{n \to \infty}m(S_{n}) = \lim_{n \to \infty}1-\left(\frac{2}{3}\right)^{n} = 1

But the original interval we started with had Lebesgue measure 1, and the union of the Cantor set with its complement S_{\infty} is the interval [0,1]. That means that the measure of the Cantor set plus the measure of its complement must add to 1, which implies that the Cantor set is of measure 0. However, since we removed open intervals during the construction, there must be something left; in fact, there are uncountably many points left. 

Now we have an uncountable set of Lebesgue measure 0. We’re going to use this set to construct the only example I could find of a singular continuous distribution. It is very important that the Cantor set is an uncountable set of Lebesgue measure 0. 

Building the Cantor distribution

Update: Following a correction from an earlier version, I’m going to show how to construct this distribution directly and via the complement of the Cantor set. The latter was used in a textbook I found, and is a bit convoluted in its construction, but I’m going to leave it.

The direct construction is to look at the intervals left behind at each stage n of constructing the Cantor set. Assign a probability mass of \frac{1}{2^{n}} to each of the 2^{n} intervals left behind, and this is your distribution function. It’s basically a continuous uniform distribution, but on stages of the Cantor set construction. Sending n \to \infty yields the Cantor set, but the probability distribution moves to 0 on a set of measure 0. Thus, unlike the continuous uniform distribution, where the probability of any single point was 0, but the support has positive measure, we essentially have the continuous uniform distribution occurring on a set of measure 0, which means we have a continuous distribution function on a singular support of measure 0 that is uncountable and thus not discrete. This distribution is therefore neither continuous nor discrete. 

Another way to construct this is by complement, via Kai Lai Chung’s A Course in Probability Theory. 

(Note: after a second glance at this, I found this to be a relatively convoluted way of constructing this distribution, since it can be fairly easily constructed directly. However, I imagine the author’s purpose was to be very rigid and formal to cover all his bases, so I present a review of it here:)

Let’s go back to the construction of the Cantor set. At each step n we have removed in total 2^{n}-1 disjoint intervals. Let’s number those intervals, going from left to right as J_{n,k}, where k = 1,2,\ldots, 2^{n}-1

For example, at n=2 we have that J_{2,1} = (1/9,2/9),J_{2,2} = (1/3,2/3), and J_{2,3} = (7/9,8/9)

Now let the quantity c_{n,k} = \frac{k}{2^{n}}. This will be the probability mass assigned to interval J_{n,k}. So we define the distribution function as 

F(x) = c_{n,k}, x \in J_{n,k}

Let U_{n} = \cup_{k=1}^{2^{n}-1}J_{n,k}, and U = \lim_{n\to\infty}U_{n} The function F is indeed a distribution function and can be shown to be uniformly continuous on the set D = (-\infty,0)\cup U \cup (1,\infty). However, none of the points in D is in the support of F, so the support of F is contained in the Cantor set (and in fact is the Cantor set).  The support (the Cantor set) has measure 0, so it is singular, but the distribution function is continuous, so it cannot be a discrete distribution. This distribution fits nowhere in our previous two classes, so we must now create a third class — the singular continuous distribution.

(By the way, even though the PDF doesn’t exist, the Cantor distribution still has mean of 1/2 and a variance of 1/8, but no mode. It does have a moment generating function.)

Any other examples?

With some help, I spent some time poring through quite a few probability books to seek further study and other treatment of singular continuous distributions. Most said absolutely nothing at all, as if the class didn’t exist. 

One book, Modern Probability Theory and Its Applications has a rather grumpy approach:

There also exists another kind of continuous distribution function, called singular continuous, whose derivative vanishes at almost all points. This is a somewhat difficult notion to picture, and examples have been constructed only by means of fairly involved analytic operations. From a practical point of view, one may act as if singular continuous distribution functions do not exist, since examples of these functions are rarely, if ever, encountered in practice.

This notion also has led me to a couple papers, which I intend to review and continue presenting my findings. I happen to have a great fondness for these “edge cases” and forgotten areas of mathematics. I believe they are the most ripe for groundbreaking new material. 


Poisson Processes and Data Loss

Poisson Processes and Data Loss

There are many applications for counting arrivals over time. Perhaps I want to count the arrivals into a store, or shipments into a postal distribution center, or node failures in a cloud cluster, or hard drive failures in a traditional storage array. It’s rare that these events come neatly, one after the other, with a constant amount of time between each event or arrival. Typically those interarrival times, the time between two events in a sequence arriving, are random. How then do we study these processes?

Read More Read More

The Central Limit Theorem isn’t a Statistical Silver Bullet

The Central Limit Theorem isn’t a Statistical Silver Bullet

Chances are, if you took anything away from that high school or college statistics class you were dragged into, you remember some vague notion about the Central Limit Theorem. It’s likely the most famous theorem in statistics, and the most widely used. Most introductory statistics textbooks state the theorem in broad terms, that as the sample size increases, the sample distribution of the sum of the sample elements will be approximately normally distributed, regardless of the underlying distribution. Many things used in statistical inference as justification in a broad variety of fields, such as the classical z-test, rely on this theorem. Many conclusions in science, economics, public policy, and social studies have been drawn with tests that rely on the Central Limit Theorem as justification. We’re going to dive into this theorem a bit more formally, and discuss some counterexamples to this theorem. Not every sequence of random variables will obey the conditions of theorem, and the assumptions are a bit more strict than are used in practice.

Read More Read More

Probabilistic Ways to Represent the Lifetime of an Object

Probabilistic Ways to Represent the Lifetime of an Object

Every item, system, person, or animal has a lifetime. For people and animals, we typically just measure the lifetime in years, but we have other options for items and systems. We can measure airplane reliability in flight hours (hours actually flown), or stress test a manufacturing tool in cycles. Regardless of the units we use, there are many things in common. We don’t know how long any item will “live” before it’s manufactured or deployed, so an item’s lifetime is a random variable. We wish to make decisions about manufacturing, warranties, or even purchasing by taking the reliability of an object into account.

We can represent each class of items (a brand of 100W lightbulbs, USR’s NS4 robots, etc) by a random variable for the lifetime. We’ll call it Y. Like any random variable, it has a probability distribution. There isn’t only one way to represent the distribution of Y. We can look at equivalent representations, each one useful for answering different types of questions. This article will run through a few of them and the uses by studying the theoretical lifetime distribution of USR’s famous NS4 robots.

Disclaimer: This example is derived from the fictonal company USR from Isaac Asimov’s I, Robot. This should not be construed to represent any real product. I’m sure I’m missing some other disclaimer notices, but assume the standard ones are here.


Survivor Function

The survivor function is the most common way to study the lifetime of an item. Colloquially, this is the probability that the item survives past time t. We denote it by S(t), and we can write mathematically that

S(t) = P(Y \geq t)

This equation can be given by a standard probability distribution (the exponential distribution is the most common) or other formula.

Example (Exponential NS4s)
Without having access to USR’s manufacturing data, let’s assume that the survivor function of the NS4 robot is given by S(t) = e^{-t/3}. Let’s also assume that t is measured in years. What is the probability that a brand new NS4 lasts more than 5 years?

S(t) = exp(-t/3). Look at t=5 to get the correct survival probability.

From the graph above, we can simply plug t=5 into the survivor function to get the answer to our question. The probability that the new NS4 survives longer than 5 years is

S(5) = e^{-5/3} \approx 0.189

or about an 18.9% chance.

We could use the survivor function to help NSR decide where to place the cutoff for warranty claims. Depending on the cost of either repairing the NS4 or replacing the NS4 with the NS5, we can backsolve to find out what t would satisfy management. Suppose the cost function requires that the probability of surviving past the cutoff t is 85%. Then we can use the survivor function to backsolve for t:

\begin{aligned}0.85 &= e^{-t/3} \\\ln(0.85) &= \frac{-t}{3} \\0.49 &\approx t\end{aligned}

Thus, we would set the warranty claims to be valid only for about the first half year after the NS4 is purchased.

Remark. Another way to judge an item is by looking at the shape of the survivor function. A steep decline like the one shown in the above graph tells us that the NS4 isn’t exactly the most reliable robot. Only about half of them survive two years.


Conditional Survivor Function

For those who wish to dive into a little bit more math, we can dive into the conditional survivor function. This “spinoff” of the survivor function will tell us the probability of surviving past time t when it is currently functioning at time a. The survivor function above assumes t starts at 0; that is, the object is brand new. If we have bought a used NS4, or perhaps have been sending it to the grocery store for a while, then we need to take into account the fact that the NS4 has been operational for some time a.

We write the conditional survivor function S_{Y|Y\geq a}(t) for some fixed a. We can use the famous Bayes formula to express this mathematically:

S_{Y|Y\geq a}(t) = P(Y \geq t | Y \geq a) = \frac{P(Y \geq t \text{ and } Y \geq a)}{P(Y \geq a)} = \frac{P(Y \geq t)}{P(Y\geq a)} = \frac{S(t)}{S(a)}

What this formula is basically saying is that the probability that the NS4 survives past t given that it has already lived for a years is given by S(t)/S(a), and derived via Bayes’ formula.

Suppose we bought a used NS4 that was 2 years old (and it is working now). What is the probability that this NS4 is still working more than 3 years from now?

We are looking for the probability that the NS4 is still operational after more than 5 years given that it has already been working for 2. So

S_{Y|Y\geq 2}(5) = \frac{S(5)}{S(2)} = \frac{e^{-5/3}}{e^{-2/3}} = \frac{1}{e} \approx 0.367

Thus, we only have a 36.7% chance of getting more than 3 years out of our used NS4. Perhaps we may want to consider haggling for a lower price…


Cumulative Distribution Function (Cumulative Failure Probability)

This is the cumulative distribution function straight from basic probability, but we can add an additional interpretation in the context of reliability. Mathematically, the cumulative distribution function for a random variable Y is the probability that the random variable is less than or equal to a fixed value y. Mathematically, we denote this by F_{Y}(t) = P(Y \leq t).

You may recognize this as the “opposite” of the survival function (S_{Y}(t) = P(Y \geq t)). In probability, we call this the complement, and we can get from one event to its complement by noting that for an event A and its complement A^{c}, P(A) + P(A^{c}) = 1. Thus, moving back to the survivor function and CDF, S_{Y}(t) + F_{Y}(t) = 1. Therefore, F_{Y}(t) = 1-S_{Y}(t). With the NS4 example, the CDF is given by

F_{Y}(t) = 1-e^{-t/3}

The interpretation is exactly the opposite of the survivor function. The CDF gives us the probability that the NS4 will fail before time t.


Survival Function S(t) and the CDF F(t)


The probability that a new NS4 will fail before the 5th year is given by either F_{Y}(t) = 1-e^{-5/3} = 1-0.189 = 0.821.


Hazard Function

The most common way to look at a lifetime distribution in reliability is called the hazard function. This is also called the failure rate in some circles, and gives a measure of risk. We’ll take a little bit of a dive into math to derive this, since the derivation is illuminating as to its interpretation.

The hazard function is denoted by h(t). The question we want to answer here is the conditional probability that the item will fail in a time interval [t, t+\Delta t], given that it has not occurred before. We want to know the probability of failure per unit time. So,

h(t)\Delta t = P(t \leq Y \leq t + \Delta t| Y \geq t).

We’re going to get into some calculus here, so this can be skipped if you’d rather not deal with this.

h(t) = \frac{P(t \leq Y \leq t + \Delta t| Y \geq t)}{\Delta t}= \frac{F'(t)}{S(t)\Delta t}

Now, if we let the interval length \Delta t get smaller (to 0), we get the instantaneous failure rate.

h(t) = \frac{-S'(t)}{S(t)}

Remark Those sharp in calculus will notice via the chain rule that h(t) = -\frac{d}{dt}\ln(S(t)), so we can express the hazard function in terms of the survivor function.

While the lifetime distribution doesn’t look so good for the NS4, this author would politely request USR work on improving the reliability of this model rather than moving forward with the NS5 project…

We can directly derive the hazard rate for the NS4 population.

h(t) = -\frac{d}{dt}\ln(e^{-t/3}) = \frac{d}{dt}\frac{t}{3} = \frac{1}{3}

which is one failure every three years of operation.


The hazard function is commonly used in engineering maintenance to determine schedules for checks or component replacement. For example, the hazard function can be used to determine how many flight hours an Lockheed F-22 fighter jet can be operated before a certain component is at risk for failure and should be inspected for replacement.

There are other forms we can use to express the distribution of an object’s lifetime, but these are the most common. Another thing to note is that we can easily move from one form to another. They all represent the same thing–lifetime of a system, but in slightly different ways. We were able to make several different decisions about USR’s NS4 robots thanks to these different representations.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.