Browsed by
Author: Rachel Traylor

Little’s Law: For Estimation Only

I had been intending to write some posts on queuing theory for a while now, as this branch is the closest to my research interests and was the spark that sent me down the road that eventually led to my PhD dissertation. Most people are quite familiar with the concepts of queuing theory, at least intuitively, so this is one of the more tangible topics in mathematics. We’ve all stood in queues at grocery stores, grouched in rush hour traffic, or had to refresh a webpage when the connection drops. I was reminded of my intentions when Datrium (to my surprise) mentioned a common queuing theory result called Little’s Law in their Storage Field Day presentation; they even have an article that makes use of it. What exactly is Little’s Law, and what does and doesn’t it say?

Some queuing theory terms

To be rigorous in our discussion, we’ll need to formalize some concepts. A queuing system is composed of two components: a queue and a server. The queue is the line you stand in, and the server is the object performing some kind of task (like scanning your groceries). A queuing system can have more than one server, and different kinds of service policies.

Counting the number of customers in each stage

A customer is in a queuing system when he is standing in the queue itself, and when he is being served by a server. So if we let N_{q} be the number of customers waiting in a queue, and N_{s} be the number of customers currently being served, then the total number of customers in a queuing system (N) is given by 

N = N_{q} + N_{s}

It’s important to note here that customers arrive randomly and are served with random service times. So the number of customers in a queue or in service is a random variable that changes with time. The arrival times, the service times, and the number of customers in the queuing system can all be assigned probability distributions, but they remain random variables.

Looking at the time

When you’re standing in a queue, you’ve arrived. Now you have to wait. How long will you wait before service? How long will service take? As anyone moving through everyday life can attest, both of those times are random. We denote by W the waiting time a customer spends in the queue before service, and by X the amount of time a customer spends being served. Thus, the total is

T = W+X

and is commonly referred to as the sojourn time, or the total time spent in the queuing system.

Once again, W and X are random variables, and therefore so is T.

Arrival Rates

Arrivals to a queuing system are random, and governed by a counting process with a probability distribution. The most common counting process used in queuing theory is the Poisson process due to its very comfortable mathematical properties. 

A simple method of analyzing a queue is to find the mean arrival rate, or the average number of arrivals per unit time, typically denoted by \lambda. If we watched the arrivals to a queue for a very long time, we could make a rough guess at it by dividing the number of arrivals by the length of our “watch period”. A Poisson process has a mean arrival rate, and the number of arrivals in a period of time follows a Poisson distribution.

At the risk of sounding like a broken record, I’d like to point out that a mean arrival rate of 1/hr does not mean that each hour, only one customer will arrive. It’s an average. Some hours will have no arrivals, some will have 5 arrivals, or 3 arrivals. An average is simply a point estimate of a typical hour.
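To make the "it's an average" point concrete, here is a minimal sketch that simulates a Poisson process and then estimates \lambda exactly as described above: number of arrivals divided by the length of the watch period. The rate of 1/hr, the 1000-hour watch period, and the helper names are all assumptions chosen for illustration.

```python
import random

def simulate_arrival_times(rate, horizon, seed=0):
    """Arrival times of a Poisson process with the given rate on [0, horizon)."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)  # gaps between arrivals are exponential
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def hourly_counts(times, horizon):
    """Number of arrivals that fall in each unit-length interval."""
    counts = [0] * int(horizon)
    for t in times:
        counts[int(t)] += 1
    return counts

rate, horizon = 1.0, 1000  # hypothetical: 1 arrival/hr, watched for 1000 hours
times = simulate_arrival_times(rate, horizon)
counts = hourly_counts(times, horizon)

lambda_hat = len(times) / horizon  # arrivals divided by the watch period
print("estimated rate:", lambda_hat)
print("quietest hour:", min(counts), "busiest hour:", max(counts))
```

The estimate lands near 1, yet individual hours range from empty to several arrivals, which is the spread the mean alone does not show.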

The Statement of Little’s Law

Little’s Law is an elegant little formula that relates the mean number of customers in a queuing system, the mean sojourn time, and the mean arrival rate. It’s quite simply written

\text{E}[N] = \lambda\text{E}[T]

There’s not much to analyze here, which in many ways is nice. If you want to get the mean or expected number of customers in your queuing system at a given time, simply multiply the average arrival rate by the average time in the system.
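As a quick check of the formula, the sketch below simulates a single-server, first-come-first-served queue and compares the time-averaged number in the system against \lambda times the mean sojourn time. The exponential arrival and service times, the rates 0.8 and 1.0, and the function names are assumptions for illustration only; the law itself does not depend on them.

```python
import random

def simulate_fifo_queue(arrival_rate, service_rate, n_customers, seed=1):
    """Arrival and departure times for a single-server FIFO queue."""
    rng = random.Random(seed)
    arrivals, departures = [], []
    t, server_free_at = 0.0, 0.0
    for _ in range(n_customers):
        t += rng.expovariate(arrival_rate)   # next arrival
        start = max(t, server_free_at)       # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        arrivals.append(t)
        departures.append(server_free_at)
    return arrivals, departures

def time_average_in_system(arrivals, departures):
    """Integrate N(t) over time and divide by the total observed time."""
    events = sorted([(a, 1) for a in arrivals] + [(d, -1) for d in departures])
    area, n, last = 0.0, 0, 0.0
    for time, change in events:
        area += n * (time - last)
        n += change
        last = time
    return area / last

arrivals, departures = simulate_fifo_queue(0.8, 1.0, 50_000)
horizon = departures[-1]                       # system is empty again here
lam = len(arrivals) / horizon                  # observed arrival rate
mean_T = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)
mean_N = time_average_in_system(arrivals, departures)
print("E[N] estimate:", mean_N)
print("lambda * E[T]:", lam * mean_T)
```

Because this run starts and ends with an empty system, the two printed numbers agree up to floating-point rounding. Note what the check does not tell you: nothing here says how far N(t) strays from its average at any given moment.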

What’s great about this law is that it’s just that: a law. It’s true for any stable queuing system in the long run, regardless of the service distribution, the arrival process, or the service policies. Sounds like a silver bullet, right? Queuing theory solved!

Some subtleties

I keep insisting the reader pay close attention to the fact that we’re dealing with random variables the whole way through. Everything is random, which means that nothing really hits the mean that often. Little’s Law is great for quick calculations and estimations, but relying too heavily on it oversimplifies a queuing system, which can be damaging. (This is similar to wantonly applying the Central Limit Theorem even when its assumptions are not met.)

Little’s Law essentially smooths out the randomness into a deterministic, steady state equation. The means are numbers, not random variables. What we need to understand about this formula is what it doesn’t help us with. Random variables have variances, and Little’s Law just gives us an average number calculated from other averages.


A queuing system is dynamic, not a steadily flowing river. Little’s Law is great in practice for estimation, and having such an elegant relationship between the means of several random variables is useful to get a general idea of what’s going on in your queuing system. Traffic planning and analysis is a difficult task because of all the randomness involved.

The danger of using Little’s Law as your silver bullet (much the way the Central Limit Theorem is commonly used as a silver bullet) is that you risk losing valuable information about the variation in the random variables that make up your queuing system, which can wreak havoc on the best-laid plans.

Example: Suppose a store owner applies Little’s Law in order to determine how many cashiers he should call in each day. He calculates via Little’s Law that he expects 10 customers/hr, and thus feels perfectly justified in only having two cashiers per day in the store. What he didn’t realize is that the weekends tended to be very busy, with swarms of people coming in the store, and Wednesdays were practically a ghost town in the store. 

Everything averaged out to 10 customers/hr, but that wasn’t much help for the two cashiers who had to handle the weekend rush, and it wasn’t much help to the owner who ended up paying two cashiers that weren’t needed on Wednesdays. He probably could have just handled the store alone on Wednesdays. 

The example above is a simple one, but there is an important lesson in here. Too much smoothing causes valuable information loss. Little’s Law is great for estimation, and should only be used as such.
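A small sketch of the store owner's problem, with made-up numbers: the daily arrival rates below are hypothetical and were chosen only so that they average to the 10 customers/hr in the example. The assumption that one cashier can handle 5 customers/hr is also invented for illustration.

```python
import math
import statistics

# Hypothetical arrival rates (customers/hr) by day; only the 10/hr average
# comes from the example in the text.
daily_rate = {"Mon": 8, "Tue": 6, "Wed": 1, "Thu": 7, "Fri": 10, "Sat": 20, "Sun": 18}

# Hypothetical capacity: one cashier handles 5 customers/hr.
customers_per_cashier = 5

mean_rate = statistics.mean(list(daily_rate.values()))
spread = statistics.pstdev(list(daily_rate.values()))
cashiers_from_average = math.ceil(mean_rate / customers_per_cashier)
cashiers_needed = {day: math.ceil(rate / customers_per_cashier)
                   for day, rate in daily_rate.items()}

print("mean rate:", mean_rate, "std dev:", round(spread, 2))
print("cashiers suggested by the average:", cashiers_from_average)
print("cashiers needed per day:", cashiers_needed)
```

Under these assumptions the average suggests two cashiers every day, while the per-day view asks for one on Wednesday and four on the weekend. The standard deviation is the piece of information the averaged figure throws away.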


There is a distributional version of Little’s Law, published in 1993, which is much better suited to this kind of analysis than the original Little’s Law because it relates the probability distributions of the random variables in the queuing system rather than simply their averages.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Dialogue: What do We Mean by Predictive Analytics?

Predictive analytics is a phrase that gets used often as a feature in many tech offerings. Predicting when a problem is likely to occur allows either a human or an automated system to take some action to mitigate a potential issue before it occurs and has potentially catastrophic effects. If we take data storage, for example, the worst thing that can happen on your storage array is the loss of data. Data loss is permanent (unless you have a backup), so when it’s gone, it’s gone.

Mitigating problems proactively involves modeling system behavior (whatever that system happens to be) according to some metric or set of metrics, then evaluating new data as it arrives and deciding if this new data is “abnormal” or indicative of a potential problem. This means there are three steps: modeling the behavior of the system or aspect of the system you are trying to monitor, determining what is normal (and by complement, abnormal), and then deciding how you will handle an abnormal data point/event/other metric.

These three steps, while simple to articulate, are three very difficult and distinct problems that each involve a whole host of considerations that are themselves interrelated. For instance:

  • How do we decide which metrics to use in our modeling? 
  • How do we verify that this model is an accurate representation of system behavior?
  • Once we have a model, how do we define unusual or anomalous behavior?
  • Once we define anomalous behavior, how do we decide our courses of action? Do we act on any “weird” point that crosses some threshold, or should we see the threshold crossed repeatedly, or something else?

Proactively mitigating system issues is a well-justified desire of many companies, because it increases system reliability. I watched Starwind present their Virtual Tape Library at Storage Field Day 15, and they, like many other companies, strive to create a way to detect impending failure patterns and take preventative measures before a catastrophic failure. The presentation was only two hours long and covered their entire architecture, not just the specific feature regarding failure pattern detection, so we were unable to take the time to discuss the specifics of Starwind’s Proactive Support, as they call it.

Detecting any kind of pattern in data is difficult, especially a failure pattern. There are always tradeoffs. If we set our tolerance for what we consider “normal behavior” to be too low, we risk alerting on potential issues too often. When this happens, alerts get ignored, and real problems are assumed to be just another “false alarm.” On the other hand, if we set the tolerance for what we consider normal too high, we run the risk of not detecting an issue at all. 
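The tradeoff can be sketched with a toy alerting rule: raise an alert whenever a metric exceeds a threshold. Everything in the block below is an assumption for illustration, including the normal distributions standing in for "healthy" and "pre-failure" readings and the threshold values; it is not a description of any vendor's method.

```python
import random

rng = random.Random(42)
# Hypothetical metric readings: healthy systems centered at 0,
# systems heading toward failure centered at 3.
healthy = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
failing = [rng.gauss(3.0, 1.0) for _ in range(10_000)]

def alert_rates(threshold):
    """Fraction of healthy readings that alert, and failing readings that don't."""
    false_alarm = sum(x > threshold for x in healthy) / len(healthy)
    missed = sum(x <= threshold for x in failing) / len(failing)
    return false_alarm, missed

rates = {th: alert_rates(th) for th in (1.0, 2.0, 3.0, 4.0)}
for th, (false_alarm, missed) in rates.items():
    print(f"threshold={th}: false alarms={false_alarm:.3f}, missed={missed:.3f}")
```

Raising the threshold drives the false alarm rate down and the miss rate up, and no single setting makes both small when the two populations overlap.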

At this point, I’d like to open dialogue in the comments, particularly because these subjects are deep; in many cases so deep an entire two-hour presentation can be devoted to just a few aspects of this very large challenge. How do we balance these tradeoffs? How do we decide whether an unusual data point or set of points is really something bad? Is it possible that, if we are generating too many “false alarms”, our original behavior model is off?


Commentary: White Papers Don’t Impress Me Much

I spent the last week at an event called Tech Field Day (my second time). In a nutshell, it’s a traveling panel of 12-15 delegates who are generally IT professionals (and me) that visits 8-10 companies over three days to hear various presentations on their technology. Sometimes it’s storage tech, sometimes networking, or cloud, or a mixture of all sorts of things. The common thread, in theory, is that these presentations are supposed to be “deep dives”, to use an industry buzzword. The delegates around the table are all highly proficient in their fields, and are expected to ask questions to drill into claims made and get more details about various IT architectures presented. In my case, I am obviously interested in uncovering the interesting mathematics behind various enterprise technologies. From erasure coding to graph theory to the statistics underneath the vague “analytics” every company claims to do, my interest lies in discussing how they’re employing mathematics to make their tech better or drive business decisions.

Typically, most companies release white papers that claim to detail their architecture (or math, as one claimed). In reality, and with rare exception (Datrium actually comes to mind here), they’re little more than five to seven pages of marketing-style technical claims with no citations or justification. As an overview, I understand keeping the lengths shorter, but references to more detailed publications and reports should be available when making certain claims. Therefore, as part of the Tech Field Day panel, I felt a responsibility to press the presenters on some of these claims, earnestly hoping for more details. My thought was that they were putting out a “teaser”, so to speak, and just waiting excitedly for someone interested to ask about technology they built and are proud of. For the most part, my initial thought was wrong. From having my questions dismissed to watching presenters hide behind the curtain of “secret sauces” and “proprietary” code, I was left disappointed.

My frustration can be traced to the very Silicon Valley style idea that flashy marketing must pervade everything, which blurs opinion and fact. White papers which should contain technical details and references become little more than press releases disguised as objective reports. I debated how to really articulate my opinion, and decided to do something a bit out of character for my typical article. With apologies to Shania Twain, I present my version of the song “That Don’t Impress Me Much”:

That Don’t Impress Me Much (Tech Edition)

I’ve noticed in tech they think they’re pretty smart
They’ve clearly got their marketing down to an art.
The white papers are “genius”; it drives me up a wall
There’s nothing original, not at all

Oh-oo-oh, you think you’re special
Oh-oo-oh you think you’re something else

Okay, so the erasure coding’s novel
That don’t impress me much
So you made the claim, but have you got the proof?
Don’t get me wrong, yea, I think you’re all right
But that won’t give me inspiration in the night
That don’t impress me much.

Every white paper says they’re the best on the market
“Independently verified”—just in case
Writing uncited claims, publishing as fact (I want to vomit)
Cause we all know tech’s really a private arms race

Oh-oo-oh, you think you’re special
Oh-oo-oh you think you’re something else

Okay, so it’s “secret sauce”
That don’t impress me much
So you got some code, but have you got some proof?
Don’t get me wrong, yea, I think you’re all right
But that won’t give me inspiration in the night
That don’t impress me much.

So you’re one of those firms using learning machines
But you’ve no earthly clue what’s going on underneath
I can’t believe you think that it’s all right
Come on baby tell me, you must be joking right?

Oh-oo-oh, you think you’re special
Oh-oo-oh you think you’re something else

Okay, so you’ve got analytics
That don’t impress me much
So you can “predict” but have you got some proof?
Don’t get me wrong, yea, I think you’re all right
But that won’t give me inspiration in the night
That don’t impress me much.


Isomorphisms: Making Mathematics More Convenient

Much of pure mathematics exists to simplify our world, even if it means entering an abstract realm (or creating one) to do it. The isomorphism is one of the most powerful tools for discovering structural similarities between two groups that on the surface look completely unrelated (or for showing that the two groups are structurally identical). In this post, we’ll look at what an isomorphism is.

The Red-Headed Step-Distributions

Almost every textbook in probability or statistics will speak of classifying distributions into two different camps: discrete (singular in some older textbooks) and continuous. Discrete distributions have either a finite or a countable sample space (which is a set of Lebesgue measure 0), such as the Poisson or binomial distribution, or simply rolling a die. The probability of each point in the sample space is nonzero. Continuous distributions have a continuous sample space, such as the normal distribution. A distribution in either of these classes is characterized by a probability mass function (pmf) or a probability density function (pdf), respectively, the latter derived from the distribution function by taking a derivative. There is, however, a third kind.

It is rarely talked about, or is mentioned quickly and then discarded. This class of distributions is defined on a set of Lebesgue measure 0, yet the probability of any point in the set is 0, unlike discrete distributions. The distribution function is continuous, even uniformly continuous, but not absolutely continuous, meaning it’s not a continuous distribution. The pdf doesn’t exist, but one can still find moments of the distribution (e.g. mean, variance). They are almost never encountered in practice, and the only real example I’ve been able to find thus far is based on the Cantor set. This class is the set of red-headed step-distributions: the singular continuous distributions.

Back up, what is Lebesgue measure?

Measure theory itself can get extremely complicated and abstract. The idea of measures is to give the “size” of subsets of a space. Lebesgue measure is one type of measure, and is actually something most people are familiar with: the “size” of subsets of Euclidean space in n dimensions. For example, when n=1, we live in 1D space. Intervals. The Lebesgue measure of an interval [a,b] on the real line is just the length of that interval: b-a. When we move to two dimensions, \mathbb{R}\times \mathbb{R}, the Cartesian product of 1D space with itself, our intervals combine to make rectangles. The Lebesgue measure in 2D space is area; so a rectangle built from [a,b]\times [c,d] has Lebesgue measure (b-a)(d-c). Lebesgue measure in 3D space is volume. And so forth. 

Now, points are 0-dimensional in Euclidean space. They have no size, no mass. They have Lebesgue measure 0. Intuitively, we can simply see that Lebesgue measure helps us see how much “space” something takes up in the Euclidean world, and points take up no space, and hence should have measure 0.

In fact, any countable set of points has Lebesgue measure 0, even an infinite but countable set. The union of countably many disjoint Lebesgue measurable sets has a measure equal to the sum of the measures of the individual sets. Points are certainly disjoint, and they each have measure 0, and summing 0 forever still yields 0. So, the set \{0,1,2\} has Lebesgue measure 0. But so do the natural numbers \mathbb{N} and the rational numbers \mathbb{Q}, even though the rational numbers contain the set of natural numbers.
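For readers uneasy about adding up infinitely many zeros, the same conclusion can be reached with a covering argument. This is the standard textbook route, sketched here with notation (x_{k}, I_{k}, \varepsilon) that is not used elsewhere in this post: list the countable set as x_{1}, x_{2}, \ldots, pick any \varepsilon > 0, and cover the kth point by an interval of length \varepsilon/2^{k}.

```latex
\{x_{1}, x_{2}, \ldots\} \subset \bigcup_{k=1}^{\infty} I_{k},
\qquad
I_{k} = \left(x_{k} - \frac{\varepsilon}{2^{k+1}},\; x_{k} + \frac{\varepsilon}{2^{k+1}}\right)

m(\{x_{1}, x_{2}, \ldots\}) \leq \sum_{k=1}^{\infty} m(I_{k})
= \sum_{k=1}^{\infty} \frac{\varepsilon}{2^{k}} = \varepsilon
```

Since \varepsilon can be made as small as we like, the measure of the countable set must be 0.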

It is actually possible to construct an uncountable infinite set that has Lebesgue measure 0, and we will need that in constructing our example of a singular continuous distribution. For now, we’ll examine discrete and continuous distributions briefly.

Discrete (Singular) Distributions

These are the ones most probability textbooks begin with, and most of the examples that are familiar.

Roll a fair die. 

The sample space for a roll of a fair die X is S =\{1,2,3,4,5,6\}. The PMF is P(X = x) = 1/6, where x \in S. The CDF is given by the function P(X\leq x) = \sum_{j\leq x}P(X=j). For example,

P(X \leq 4) = \sum_{j\leq 4}\frac{1}{6} = \frac{4}{6} = \frac{2}{3}
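The same computation can be written out directly. This is a minimal sketch using exact fractions; the names pmf and cdf are just illustrative.

```python
from fractions import Fraction

# PMF of a fair die: each face has probability 1/6.
sample_space = range(1, 7)
pmf = {x: Fraction(1, 6) for x in sample_space}

def cdf(x):
    """P(X <= x): add up the PMF over every face j with j <= x."""
    return sum((p for j, p in pmf.items() if j <= x), Fraction(0))

print(cdf(4))  # 2/3
```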

Binomial Distribution

A binomial random variable X counts the number of “successes” or 1s in a binary sequence of n Bernoulli random variables. Think of a sequence of n coin tosses, and counting the number of heads. In this case, the sample space is finite: S = \{0,1,2,\ldots,n\}. If the probability of a 1, or “success”, is p, then the PMF of X is given by

P(X=x) = {n \choose x}p^{x}(1-p)^{n-x}

Note here again that the sample space is of Lebesgue measure 0, but the probability of any point in that space is a positive number. 
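A short sketch of the binomial PMF, evaluated at every possible count x = 0, 1, ..., n to confirm that each point carries positive probability and that the probabilities sum to 1. The values n = 10 and p = 0.5 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5  # hypothetical: ten fair coin tosses
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(binomial_pmf(5, n, p))   # 252/1024 = 0.24609375
print(sum(probs))              # total probability
print(all(q > 0 for q in probs))
```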

Continuous Distributions

Continuous distributions operate on a continuous sample space, usually an interval or Cartesian product of intervals or even a union of intervals. Continuous distribution functions F are absolutely continuous, meaning that (in one equivalent definition), the distribution function has a derivative f=F' almost everywhere that is Lebesgue integrable, and obeys the Fundamental Theorem of Calculus:

F(b)-F(a) = \int_{a}^{b}f(x)dx

for a< b. This f is the probability density function (PDF), derived by differentiating the distribution function. Let’s mention some examples of these:

The Continuous Uniform Distribution

Suppose we have a continuous interval [a,b], and the probability mass is spread equally along this interval, meaning that the probability that our random variable X lies in a subinterval of size s is the same regardless of the subinterval’s location. Suppose we do not allow the random variable to take any values outside the interval. The sample space is continuous but over a finite interval. The distribution function for this X is given by

F(x) = \left\{\begin{array}{lr}0&x< a\\\frac{x-a}{b-a}&a\leq x \leq b\\1&x > b\end{array}\right.

This is an absolutely continuous function. Then we may easily derive the PDF by differentiating F:

f(x) = \mathbb{1}_{x \in [a,b]}\frac{1}{b-a}

where \mathbb{1}_{x \in [a,b]} is the indicator function that takes value 1 if x is in the interval, and 0 otherwise. 

This distribution is the continuous version of a die roll. The die roll is the discrete uniform distribution, and here we just allow for a die with uncountably many sides with values in [a,b]. The probability of any particular point is 0, however, even though it is possible to draw a random number from this interval. To see this, note that the probability that the random variable X lies between two points in the interval, say x_{1} and x_{2} is given by multiplying the height of the PDF by the length (Lebesgue measure) of the subinterval. The Lebesgue measure of a point is 0, so even though a value for the PDF exists at that point, the probability is 0. 

We don’t run into issues here mathematically because we are on a continuous interval. 

The Normal Distribution

Likely the most famous continuous distribution, the normal distribution is given by the famous “bell curve.” In this case, the sample space is the entire real line. The probability that a normally distributed random variable X lies between any two points a and b is given by 

P(a\leq X \leq b) = \int_{a}^{b}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)dx

where \mu is the mean and \sigma^{2} is the variance. 
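The integral above has no elementary closed form, but it can be written in terms of the error function, which the Python standard library provides. A minimal sketch, with function names of my own choosing:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal random variable, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b), i.e. the integral of the bell curve from a to b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability of landing within one standard deviation of the mean.
print(normal_interval_prob(-1.0, 1.0))  # about 0.6827
```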

Singular Continuous Distributions

We’re going to begin this section by discussing everyone’s favorite counterexample in mathematics: the Cantor set. 

The Cantor set

The Cantor set is given by the limit of the following construction:

  1. Take the interval [0,1]
  2. Remove the middle third: (1/3, 2/3), so you’re left with [0,1/3]\cup[2/3,1]
  3. Remove the middle third of each of the remaining intervals. So you remove (1/9,2/9) from [0,1/3] and (7/9,8/9) from [2/3,1], leaving you with the set [0,1/9]\cup[2/9,1/3]\cup[2/3,7/9]\cup[8/9,1]

Continue this process infinitely.
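The construction steps above can be sketched in a few lines. The function name cantor_stage is hypothetical; exact fractions are used so the endpoints match the ones listed in the steps.

```python
from fractions import Fraction

def cantor_stage(n):
    """Closed intervals remaining after n rounds of removing middle thirds."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        next_intervals = []
        for a, b in intervals:
            third = (b - a) / 3
            next_intervals.append((a, a + third))   # keep the left third
            next_intervals.append((b - third, b))   # keep the right third
        intervals = next_intervals
    return intervals

for a, b in cantor_stage(2):
    print(f"[{a}, {b}]")

# Total length left after n rounds shrinks like (2/3)^n.
print(sum(b - a for a, b in cantor_stage(5)))
```

The first loop prints the four intervals from step 3, and the last line shows how little length remains after only five rounds.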

This is an example of a set that is uncountable, yet has Lebesgue measure 0. Earlier, when we discussed Lebesgue measure, we noted that all countable sets had measure 0. Thus we may conclude that only uncountable sets (like intervals) have nonzero Lebesgue measure. However, the Cantor set illustrates that not all uncountable sets have positive Lebesgue measure. To see why the Cantor set has Lebesgue measure 0, we will look at the measure of the sets that are removed (the complement of the Cantor set):

At the first step, we have removed one interval of size 1/3. At the second step, we remove two intervals of size 1/9. At the third step, we remove four intervals of size 1/27. Let’s call S_{n} the subset removed from the interval [0,1] by the nth step. By the end of the third step, we have removed a set of size

m(S_{3}) = \frac{1}{3} + \frac{2}{3^{2}} + \frac{4}{3^{3}}

By the nth step, 

m(S_{n}) = \sum_{j=0}^{n-1}\frac{2^{j}}{3^{j+1}}

This is the partial sum of a geometric series, so

m(S_{n}) = 1-\left(\frac{2}{3}\right)^{n}

Now, the Cantor set is formed when n \to \infty. The complement of the Cantor set, which we will call S_{\infty}, then has measure

m(S_{\infty}) = \lim_{n \to \infty}m(S_{n}) = \lim_{n \to \infty}\left[1-\left(\frac{2}{3}\right)^{n}\right] = 1

But the original interval we started with had Lebesgue measure 1, and the union of the Cantor set with its complement S_{\infty} is the interval [0,1]. That means that the measure of the Cantor set plus the measure of its complement must add to 1, which implies that the Cantor set is of measure 0. However, since we removed open intervals during the construction, there must be something left; in fact, there are uncountably many points left. 
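The closed form for the removed measure is easy to sanity-check with exact arithmetic. In this sketch, step j+1 of the construction removes 2^{j} intervals of length 1/3^{j+1}, so n steps contribute the terms j = 0, ..., n-1; the function name is my own.

```python
from fractions import Fraction

def removed_measure(n):
    """Total length removed from [0,1] after n steps of the construction."""
    return sum(Fraction(2**j, 3**(j + 1)) for j in range(n))

# The partial sums agree with the closed form 1 - (2/3)^n.
for n in range(1, 6):
    assert removed_measure(n) == 1 - Fraction(2, 3)**n

print(removed_measure(3))          # 19/27
print(float(removed_measure(50)))  # very close to 1
```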

Now we have an uncountable set of Lebesgue measure 0. We’re going to use this set to construct the only example I could find of a singular continuous distribution. It is very important that the Cantor set is an uncountable set of Lebesgue measure 0. 

Building the Cantor distribution

Update: Following a correction from an earlier version, I’m going to show how to construct this distribution directly and via the complement of the Cantor set. The latter was used in a textbook I found, and is a bit convoluted in its construction, but I’m going to leave it.

The direct construction is to look at the intervals left behind at each stage n of constructing the Cantor set. Assign a probability mass of \frac{1}{2^{n}} to each of the 2^{n} intervals left behind, spread uniformly within each interval; this defines a distribution function at stage n. It’s basically a continuous uniform distribution, but on stages of the Cantor set construction. Sending n \to \infty yields the Cantor set, and the limiting distribution puts all of its probability on this set of measure 0 while assigning probability 0 to every individual point. Thus, unlike the continuous uniform distribution, where the probability of any single point is 0 but the support has positive measure, we essentially have the continuous uniform distribution occurring on a set of measure 0. That gives a continuous distribution function on a singular support of measure 0 that is uncountable and thus not discrete. This distribution is therefore neither continuous nor discrete.

Another way to construct this is by complement, via Kai Lai Chung’s A Course in Probability Theory. 

(Note: after a second glance at this, I found this to be a relatively convoluted way of constructing this distribution, since it can be fairly easily constructed directly. However, I imagine the author’s purpose was to be very rigid and formal to cover all his bases, so I present a review of it here:)

Let’s go back to the construction of the Cantor set. At each step n we have removed in total 2^{n}-1 disjoint intervals. Let’s number those intervals, going from left to right as J_{n,k}, where k = 1,2,\ldots, 2^{n}-1

For example, at n=2 we have that J_{2,1} = (1/9,2/9),J_{2,2} = (1/3,2/3), and J_{2,3} = (7/9,8/9)

Now let the quantity c_{n,k} = \frac{k}{2^{n}}. This will be the value of the distribution function on the interval J_{n,k} (the total probability mass lying to its left). So we define the distribution function as

F(x) = c_{n,k}, x \in J_{n,k}

Let U_{n} = \cup_{k=1}^{2^{n}-1}J_{n,k}, and U = \lim_{n\to\infty}U_{n}. The function F (with F = 0 on (-\infty,0) and F = 1 on (1,\infty)) is indeed a distribution function and can be shown to be uniformly continuous on the set D = (-\infty,0)\cup U \cup (1,\infty). However, none of the points in D is in the support of F, so the support of F is contained in the Cantor set (and in fact is the Cantor set). The support (the Cantor set) has measure 0, so it is singular, but the distribution function is continuous, so it cannot be a discrete distribution. This distribution fits nowhere in our previous two classes, so we must now create a third class: the singular continuous distribution.
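This distribution function can be evaluated numerically by reading off the base-3 digits of x. The sketch below is one common way to do it and is not taken from Chung's book: digits equal to 2 contribute a binary digit 1, and the first digit equal to 1 means x sits inside a removed interval, where F is constant. The name cantor_cdf and the depth cutoff are assumptions.

```python
def cantor_cdf(x, depth=40):
    """Approximate the Cantor distribution function F at x."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        digit = int(x)        # next ternary digit of x
        x -= digit
        if digit == 1:        # x is inside a removed middle third
            return value + scale
        if digit == 2:        # right branch: add the current binary place
            value += scale
        scale /= 2
    return value

# One point from each removed interval at stage n = 2.
print(cantor_cdf(0.15))  # in (1/9, 2/9): 1/4
print(cantor_cdf(0.5))   # in (1/3, 2/3): 2/4
print(cantor_cdf(0.8))   # in (7/9, 8/9): 3/4
```

The three printed values match c_{2,k} = k/4 for k = 1, 2, 3, and any other point in the same removed interval returns the same value.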

(By the way, even though the PDF doesn’t exist, the Cantor distribution still has mean of 1/2 and a variance of 1/8, but no mode. It does have a moment generating function.)
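The mean and variance can be checked by simulation. A standard way to draw from the Cantor distribution (an assumption here, not something from the textbook construction above) is to choose each ternary digit of the sample to be 0 or 2 by a fair coin flip, which gives each of the 2^{n} stage-n intervals probability 1/2^{n}. The sample size, seed, and truncation depth below are arbitrary.

```python
import random

def sample_cantor(rng, depth=40):
    """One draw from the Cantor distribution, truncated to `depth` ternary digits."""
    # Each ternary digit is 0 or 2 with probability 1/2 each.
    return sum(2 * rng.getrandbits(1) / 3**j for j in range(1, depth + 1))

rng = random.Random(7)
samples = [sample_cantor(rng) for _ in range(50_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
print("sample mean:", mean)          # should be near 1/2
print("sample variance:", variance)  # should be near 1/8
```

The variance also falls out by hand from the same representation: the jth digit contributes variance (2/3^{j})^{2}/4 = 1/9^{j}, and summing the geometric series over j \geq 1 gives 1/8.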

Any other examples?

With some help, I spent some time poring through quite a few probability books to seek further study and other treatment of singular continuous distributions. Most said absolutely nothing at all, as if the class didn’t exist. 

One book, Modern Probability Theory and Its Applications has a rather grumpy approach:

There also exists another kind of continuous distribution function, called singular continuous, whose derivative vanishes at almost all points. This is a somewhat difficult notion to picture, and examples have been constructed only by means of fairly involved analytic operations. From a practical point of view, one may act as if singular continuous distribution functions do not exist, since examples of these functions are rarely, if ever, encountered in practice.

This notion has also led me to a couple of papers, which I intend to review as I continue presenting my findings. I happen to have a great fondness for these “edge cases” and forgotten areas of mathematics. I believe they are the most ripe for groundbreaking new material.


(Commentary) Spectre and Meltdown: Spokes on a Wheel

There has been a flurry of articles and discussions related to Intel’s Spectre and Meltdown vulnerabilities. Many good writings discuss the technical nature and implications to hardware, and you can find a selection here, here, and here. As of this writing, many software developers and security experts are frantically trying to create patches to protect their infrastructure and customers from those who would exploit a 20-year-old design flaw, and we obviously wish them the best of luck.

The severity of the issue is in large part due to a design flaw that dates back to 1995, when Bill Clinton was president, the DVD was first announced, eBay was founded, Braveheart was in theaters, and Alanis Morissette’s Jagged Little Pill was released. That means that hundreds of thousands of programs, apps, and products were built on top of a fundamental design flaw that went unnoticed longer than some of our siblings have been alive. What happened?

Complexity happened. Not complexity in the sense of the human body, a finely tuned machine. Complexity born of rushed thinking, continuous development cycles, and a mentality we discouraged in our students when we were still academics: the notion that just turning in an assignment, even if it was not well-done, was acceptable. We have been scolded in other jobs that “perfect is the enemy of good”, to “fail fast and fail often” and to “just get something delivered.” We at The Math Citadel fundamentally disagree with and reject all of these strategies and attitudes. Rushed thinking and a desperation to be seen as “first done” with the most hype has led to complexity born of brute force solutions, with patches to fix holes discovered after release. When those patches inevitably break something else, more patches are applied to fix the first patches.

“And on and on it spins, crushing those on the ground”, in the words of Daenerys Targaryen.

Lest it be thought that we are picking only on Intel, or that this is an isolated issue, let us explore other manifestations.

  • In November 2017, a developer found a security vulnerability in Apple’s High Sierra operating system that enables access to the root superuser account with a blank password on any Mac (local or remote) running macOS 10.13.1. Apple quickly released a patch meant to fix it, but another update ended up reintroducing the “root bug.”
  • When iOS 11.1 was released, autocorrect would change the letter “I” to “A” with a question mark in a box.

The gaming industry has had its share of problems from rushing releases that weren’t complete. (One might almost be forgiven for assuming it’s a business strategy.)

  • No Man’s Sky was released to much hype, but the first release had very few of the promised features, generating a huge player backlash. The company released further features as DLC and patches, but the damage was done.
  • Call of Duty: WWII had server issues at launch that took the game offline, random disconnects from matches, and some reports of gamer rankings being reset. After two patches, users reported improvements but no real fixes.
  • Batman: Arkham Knight was released for the PC, and it became a disaster. Players had to turn off textures and move graphics quality settings to “low” to even make the game playable, regardless of how nice their graphics card was.

The machine learning/“artificial intelligence” space has quite a few examples, and these range from amusing to sinister.

  • Algorithmic pricing without a sanity check leads to a $23 million book price on Amazon
  • Automatic t-shirt slogan generator causes a business to fold after the owner’s algorithm generates a t-shirt saying “Keep calm and rape on.”
  • Automated re-pricing software RePricer Express erroneously changes the prices of thousands of items on Amazon to a penny. Compounding the problem is the automatic order fulfillment from Amazon, making it impossible to retract the order. One small business owner cites a $150,000 loss.
  • Accusations of price-gouging on flights out of Florida prior to Hurricane Irma are more likely due to the automatic pricing algorithms than active price gouging. Nonetheless, it was a PR nightmare.

We can list many more examples, enough to provide clear evidence of a pattern. There have already been those calling for a re-examination of machine learning and data science in particular in response to these issues. The real problem, however, goes much deeper.

Entire companies are based around the notion of scrum development, a continued cycle of “sprints” that last a couple of weeks and end in some deliverable. The original methodology may be good for a prototype, but when scaled to company operations, it inspires a culture of “just get it done.” It leads to a toxic environment, where both leaders and individual contributors are driven by a fury to “turn it in” and release before a competitor or by the time VMworld comes around. It means products are being built on top of other products that were just barely good enough to ship with a shiny marketing veneer.

In the physical world, this would be akin to building a bridge by throwing stones wantonly into the water in order to hop across. Yes, you can get across the river quickly, but misplacing any one of those stones means you may slip and fall into the water, or someone coming behind you who distributes his weight differently may fall. Worse, if the rocks seem stable enough for a long time, people begin constructing a hasty bridge using those stones as a foundation. The bridge holds for a while, but one day the cumulative effect of poor materials and high traffic volume causes the bridge to collapse, and people get hurt.

If civil engineers designed and built bridges the way tech develops and releases products, people would die. If aerospace engineers rushed the design of a commercial airliner and patched issues the way tech does, people would die.

If mathematicians developed their theories and equations the way tech develops and releases products, your world would crumble.

Let’s run a thought experiment. Suppose George Boole, the inventor of Boolean algebra, rushed his theories out so he could beat an academic rival. He didn’t really prove everything or make sure it was airtight. There was maybe that funny edge case, but he couldn’t see it ever arising in practice, so he just ignored it. Unbeknownst to him, that edge case was a counterexample that showed all of his notions to be false overall.

Boolean algebra is the fundamental theory by which your computers work today, and will be such until and unless quantum computing takes off. If what seemed like an edge case 150 years ago became the foundation for the development of computers, the ramifications would be so vast as to be unrecoverable. It would require a complete redesign of how computers work. Let that effect snowball in your mind.

That’s one topic in mathematics. Imagine hundreds of mathematicians developing the foundations of your world the same way. But we don’t. We study the river, the banks, and the earth carefully. Only when we are sure do we begin constructing a bridge, starting with the foundation. Stone by stone, making sure the bottom ones are perfect and unbreakable. The work takes years, decades, lifetimes sometimes, but the results are built to last. Mathematics is the only discipline that has never had to “reinvent” itself upon the discovery of new knowledge. All of mathematics builds, expands, and generalizes.

What does this have to do with business? To fix the attitudes that ultimately led to Spectre, Meltdown, and the patches to fix them, and the patches to fix those, companies need to think like mathematicians. To fix the ideologies that rushed out a new macOS with a serious security vulnerability, companies need to think like mathematicians. To avoid the PR nightmares from “AI gone wrong”, companies need to think like mathematicians.

Leaders and individual contributors need to think like mathematicians, searching deliberately for elegant, simple solutions that are provable, explainable, and fundamentally strong. The elegance and simplicity will allow for other things to be built on top that won’t break the foundation. Even when something is built on top of a foundation, it is carefully examined as to its stability. Provable solutions mean no surprises later when something fails.

This requires lateral thinking, creativity, and most importantly, a willingness to take a bit longer in product development and business decisions. It’s a difficult thing to do, when all your competitors move so fast you think you would only hear the Doppler effect as they scream by you. Adopting a mathematician’s outlook takes longer. However, the results are simpler, with less maintenance, less need for software janitors to clean up the mess from the frantic development party, and stronger, more resilient products. Every one of these things yields cost savings, long-term revenue, and perhaps most importantly, customer trust.

We at The Math Citadel are mathematicians, refusing the siren song of scrum-like mentalities. We’re here to help the companies who want to look past the hypes, who want to carve their own paths rather than be the leader on a paved course. We’re here for the companies who say “enough” to shortsightedness and continuous patching. Spectre and Meltdown are just spokes on a wheel. We don’t intend to stop the wheel, we intend to break it.

Generalizing the Negative Binomial Distribution via First-Kind Dependence

This paper generalizes the negative binomial random variable by generating it from a sequence of first-kind dependent Bernoulli trials under the identity permutation. The PMF, MGF, and various moments are provided, and it is proven that the distribution is indeed an extension of the standard negative binomial random variable. We examine the effect of complete dependence of the Bernoulli trials on the generalized negative binomial random variable. We also show that the generalized geometric random variable is a special case of the generalized negative binomial random variable, but the generalized negative binomial random variable cannot be generated from a sum of i.i.d. generalized geometric random variables.

To download the paper with all proofs, click here

Mailbox Answers: Calculating New Parity After an Overwrite

I recently did some work for Mr. Howard Marks, an independent analyst and founder of Deep Storage, on the subject of data protection and data loss. He e-mailed me with a question regarding calculating the new parity for a stripe of data on a storage system.

Let us consider the case of a 10+1 RAID 5 set with a strip size of 64KB. When an application performs a 4KB write the system must:

  1. Read the 64KB strip that contains the 4KB to be overwritten into a memory buffer
  2. Modify the memory buffer with the data to be written
  3. Read however much other data as would be needed to recalculate the parity strip for this stripe
  4. Write the new data and new parity strip (Min 1 4KB write, 1 64KB write)

When we casually talk about this condition we say the RAID controller would need to read all 10 data strips in the stripe so it can XOR all ten together as part of step 4. I, however have been thinking about XOR and think that rather than requiring N+1 total I/Os I can get it down to three. 

If P, the parity strip, already contains

D_1 \oplus D_2 \oplus D_3 \oplus D_4 \oplus D_5 \oplus D_6 \oplus D_7 \oplus D_8 \oplus D_9 \oplus D_{10}

and we’re overwriting part of D_4 couldn’t I [do the following]:

  1. Read the existing D_4 into a variable D'_4.
  2. Modify the data into D_4.
  3. Calculate the changes as D_4\oplus D_{4}' into variable C
  4. Read the parity strip P
  5. Calculate new parity strip as P=P \oplus C

In short, the answer is yes. We’re going to prove that here, because I think this is a great exercise to really show off the power of XOR. We’ve explored the operation here and begun our discussion of coding theory by looking at maximum likelihood decoding. Let’s take a brief review of the XOR (or mod 2) operation:

XOR is just modulo 2 addition

Let’s call generic binary words D_{j}. That is, each D_{j} is simply a string of 1s and 0s of whatever length l we feel like setting. So a binary word D_{j} = d_{1}^{(j)}d_{2}^{(j)}\ldots d_{l}^{(j)} consists of binary bits given by the lowercase d_{i}^{(j)}. The XOR operation works bit-by-bit, and will be denoted by \oplus:

\begin{aligned}D_{j} \oplus D_{k} &= (d_{1}^{(j)}d_{2}^{(j)}\ldots d_{l}^{(j)})\oplus (d_{1}^{(k)}d_{2}^{(k)}\ldots d_{l}^{(k)})\\&= (d_{1}^{(j)} \oplus d_{1}^{(k)})(d_{2}^{(j)}\oplus d_{2}^{(k)})\ldots (d_{l}^{(j)}\oplus d_{l}^{(k)})\end{aligned}

For a quick numerical example, suppose D_{j} = 1011 and D_{k} = 0010. Then

D_{j} \oplus D_{k} = 1011 \oplus 0010 = (1\oplus 0)(0\oplus 0)(1\oplus 1)(1\oplus 0)

Remember, too, that XOR is addition modulo 2, so we add the bits together, then divide by 2 and take the remainder. So, in particular, 1\oplus 1 = 0 because 1+1 leaves a remainder of 0 when divided by 2. So,

D_{j} \oplus D_{k} = 1001
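As a quick sanity check, here is a minimal Python sketch of the bit-by-bit definition above. The helper name `xor_words` is my own choice for illustration, and representing words as strings of 0s and 1s is an assumption made only to mirror the notation.

```python
def xor_words(a: str, b: str) -> str:
    """Bitwise XOR of two equal-length binary words written as strings of 0s and 1s."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    # Add each pair of bits, then keep the remainder after dividing by 2.
    return "".join(str((int(x) + int(y)) % 2) for x, y in zip(a, b))


result = xor_words("1011", "0010")
print(result)  # 1001
```

Python’s built-in `^` operator on integers performs the same operation, so `0b1011 ^ 0b0010` gives `0b1001` as well.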

Back to the question

Mr. Marks’ question can be stated mathematically in the following way (and I’m going to generalize it to any finite number of words of any length XORed together, because that’s what mathematicians do):

Suppose P = D_{1} \oplus D_{2} \oplus \ldots \oplus D_{j} \oplus \ldots \oplus D_{K} for some K, and one word (say D_{j}) is modified to give D_{j}'. Let C be the binary word that represents the changes between D_{j} and D_{j}'. That is,

C = D_{j} \oplus D_{j}'

(Note: XOR as an operation to identify differences in binary words is one of the more elegant features. If all the bits in two words are the same, then bitwise XORing would always give the binary word of all 0s. Only when two bits are different is their XOR result a 1.) Call P' the new XOR sum with D_{j}' substituted for D_{j}. So

P' := D_{1}\oplus D_{2}\oplus \ldots \oplus D_{j}'\oplus \ldots \oplus D_{K}.

Then does P'= P \oplus C?

Numerical example

Whenever I’m seeking to prove a statement, I always “play” with an example. Now, simply finding an example that fits the statement doesn’t constitute proof. But playing with explicit numbers can often yield a way to prove the statement in general. Plus, we can deepen our understanding by really “seeing the theorem in action,” as opposed to just manipulating symbols via logic. 

Let’s just test this with a sum of 3 words to make our lives easier. Let D_{1} = 110, D_{2} = 010, and D_{3} = 101. Then

P = D_{1} \oplus D_{2} \oplus D_{3} = 110 \oplus 010 \oplus 101 = 001

Now suppose we change D_{2} to D_{2}' = 101. First, the new sum P' is given by 

P' = 110 \oplus 101 \oplus 101 = 110

Now, the change between D_{2} and D_{2}' is given by

C = 010 \oplus 101 = 111

Notice that all three positions changed. Each position that is different has a 1. Let’s see if P \oplus C = P':

P \oplus C = 001 \oplus 111 = 110

Success! Now, this doesn’t mean we’re done. One example doesn’t constitute proof. We have to show this is true for any finite number of binary words of any length.
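If you’d like to “play” a little more before the proof, the sketch below reproduces the example above and then brute-forces every possible case for three words of length 3. The function names are hypothetical, and words are stored as Python integers purely for convenience. Keep in mind that even an exhaustive check like this only covers one choice of K and l, so it still isn’t a proof of the general statement.

```python
from itertools import product


def xor_all(words):
    """XOR a sequence of binary words (stored as integers) together."""
    total = 0
    for w in words:
        total ^= w
    return total


# The worked example: D_1 = 110, D_2 = 010, D_3 = 101, with D_2 changed to 101.
D = [0b110, 0b010, 0b101]
P = xor_all(D)
new_D2 = 0b101
C = D[1] ^ new_D2
P_new = xor_all([D[0], new_D2, D[2]])
print(format(P, "03b"), format(C, "03b"), format(P_new, "03b"))  # 001 111 110


def shortcut_holds_everywhere(length=3, count=3):
    """Check P' == P xor C for every set of words, every position, every new word."""
    words = range(2 ** length)
    for original in product(words, repeat=count):
        parity = xor_all(original)
        for j in range(count):
            for replacement in words:
                modified = list(original)
                modified[j] = replacement
                change = original[j] ^ replacement
                if xor_all(modified) != parity ^ change:
                    return False
    return True


print(shortcut_holds_everywhere())  # True
```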

Time to prove this statement is true

So, let D_{1},...,D_{K} be binary words of generic length l. Choose one word D_{j} and modify it to form the new word D_{j}'. Let C = D_{j} \oplus D_{j}' denote the change vector. Then

\begin{aligned}P^{\prime}&=D_{1}\oplus D_{2}\oplus\ldots\oplus D^{\prime}_{j}\oplus\ldots\oplus D_{K}\end{aligned}


Now, let’s note that C = D_{j}\oplus D^{\prime}_{j} tells us which positions changed between the two. Another way to look at it is that C is the word you need to XOR with D_{j} to get to the new D^{\prime}_{j}. That is, D^{\prime}_{j} = D_{j} \oplus C, because D_{j} \oplus C = D_{j} \oplus D_{j} \oplus D^{\prime}_{j} and any word XORed with itself is the all-zero word. Now, let’s plug in the new expression for D_{j}' into P':

\begin{aligned}P^{\prime}&=D_{1}\oplus D_{2}\oplus\ldots\oplus D^{\prime}_{j}\oplus\ldots\oplus D_{K}\\&=D_{1}\oplus D_{2}\oplus\ldots\oplus (D_{j} \oplus C)\oplus\ldots\oplus D_{K}\end{aligned}

Now, we know from this post that XOR is a commutative operation. Coupled with the associative property, we can actually rearrange the order of the XORing to put C last.

(You’ve done this with regular arithmetic all the time. 5 + 1 + 6 + 3 can be rearranged and grouped into (6+3+1) + 5 = 10 + 5 = 15. Commutativity and associativity combined allow this to happen with any operation that has both properties.)


\begin{aligned}P^{\prime}&=D_{1}\oplus D_{2}\oplus\ldots\oplus D^{\prime}_{j}\oplus\ldots\oplus D_{K}\\&=D_{1}\oplus D_{2}\oplus\ldots\oplus (D_{j} \oplus C)\oplus\ldots\oplus D_{K}\\&=(D_{1}\oplus D_{2}\oplus\ldots\oplus D_{j}\oplus\ldots\oplus D_{K})\oplus C\end{aligned}

But wait, that last thing in parentheses is exactly P. Therefore,

\begin{aligned}P^{\prime}&=D_{1}\oplus D_{2}\oplus\ldots\oplus D^{\prime}_{j}\oplus\ldots\oplus D_{K}\\&=D_{1}\oplus D_{2}\oplus\ldots\oplus (D_{j} \oplus C)\oplus\ldots\oplus D_{K}\\&=(D_{1}\oplus D_{2}\oplus\ldots\oplus D_{j}\oplus\ldots\oplus D_{K})\oplus C\\&= P \oplus C\end{aligned}

Since we showed this for any generic number of binary words added together, and allowed the binary words to be any length, we’ve proven the statement.

Bonus: Multiple modifications

What if we modified more than one word in our original sum P? It’s a pretty simple extension to run through the proof again with multiple modified words and show that if we have multiple change vectors C, one for each modified word, we can perform the same substitution and show that the new P^{\prime} is simply the old P XORed with all of the change vectors. Alternatively, you could XOR all the change vectors first into one big change vector, then XOR that with your original P to compute the new P^{\prime}. If you want to verify it for yourself formally, simply follow the same steps we did above for one modified word. You’ll just be performing the same type of substitution multiple times to account for each modification.
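Here is a small sketch of the multiple-modification case, assuming 10 random 16-bit words and three arbitrarily chosen positions to overwrite. The sizes, positions, and names are all illustrative assumptions. It combines the change vectors into one big change vector first, then XORs that with the old P.

```python
import random
from functools import reduce
from operator import xor


def parity(words):
    """XOR all words (stored as integers) together; an empty input gives 0."""
    return reduce(xor, words, 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
length, count = 16, 10
words = [rng.getrandbits(length) for _ in range(count)]
P = parity(words)

# Overwrite three of the words and record one change vector per modification.
modified = list(words)
changes = {}
for j in (1, 4, 7):
    new_word = rng.getrandbits(length)
    changes[j] = words[j] ^ new_word
    modified[j] = new_word

# XOR the change vectors into one big change vector, then apply it to the old P.
combined_change = parity(changes.values())
shortcut = P ^ combined_change

# Compare against recomputing the parity from the definition.
direct = parity(modified)
print(shortcut == direct)  # True
```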


Mr. Marks brought this up because he was seeking a way to compute the new parity strip in a more efficient way (with fewer arithmetic steps) than simply following the definition. You can absolutely “brute force” your way to calculating the new parity strip. Companies and startups are always concerned about “scalability”. Sure, you won’t notice the time difference between 10 extra things added together. But what about 10 million? 1 billion? More than that? None of those numbers are infeasible for the amount of calculations we perform on data now. In those cases, the brute force method of simply using the definition starts to cause performance problems. It was worth taking the time to “be clever” and search for a nice, elegant way that cuts down the number of operations necessary to calculate a new parity strip. It took a little upfront work, but the result speaks loudly for itself.
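To tie this back to the original question, here is a rough in-memory sketch of the 10+1 layout with 64KB strips and a 4KB overwrite inside D_4. It only illustrates the arithmetic: the strips are random bytes, the offset of the overwrite is an arbitrary assumption, and nothing here models a real RAID controller or its I/O.

```python
import os

STRIP_SIZE = 64 * 1024   # 64KB strips, as in the question
DATA_STRIPS = 10         # the "10" in 10+1


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("strips must have the same length")
    value = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return value.to_bytes(len(a), "big")


def full_parity(strips):
    """Parity by the definition: XOR every data strip together."""
    parity = bytes(len(strips[0]))
    for strip in strips:
        parity = xor_bytes(parity, strip)
    return parity


strips = [os.urandom(STRIP_SIZE) for _ in range(DATA_STRIPS)]
parity = full_parity(strips)

# Overwrite 4KB inside D_4 (index 3); the offset is an arbitrary choice.
offset, length = 8 * 1024, 4 * 1024
old_d4 = strips[3]
new_d4 = old_d4[:offset] + os.urandom(length) + old_d4[offset + length:]

# The shortcut: C = D_4 xor D_4', then P' = P xor C. Only D_4 and P are touched.
change = xor_bytes(old_d4, new_d4)
new_parity = xor_bytes(parity, change)

# The brute force route: substitute the new strip and XOR all ten again.
strips[3] = new_d4
print(new_parity == full_parity(strips))  # True
```

Notice that the change vector is all zeros outside the 4KB that was overwritten, so only the matching 4KB of the parity strip actually changes.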


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Welcome to GF(4)

Everyone has solved some version of a linear system in either high school or college mathematics. If you’ve been keeping up with some of my other posts on algebra, you know that I’m about to either take something familiar away, or twist it into a different form. This time is no different; we’re going to change the field we operate over, and solve a basic linear system in a Galois field called GF(4).
