Browsed by
Month: July 2017

Like Clockwork: Modulo Addition and Finite Groups of Integers

Like Clockwork: Modulo Addition and Finite Groups of Integers

Modulo arithmetic always scared me in college. (Honestly, it’s still not something I can do as easily as integration or spotting pdfs hidden inside icky integrals.) Then when you add abstract algebra (or modern algebra as some call it) and group theory on top of it, the whole topic can become intimidating. Many books on algebra don’t slow down enough to really examine the examples they give, so I’m going dive in and explore some of these. Algebra is a powerful branch of mathematics, with ripple effects all throughout mathematics, engineering, experimental design, and computer science. It’s worth study and understanding, and the topics that scare us tend to lose their fangs after we get to know them a bit.

Addition on the clock

Most of us do modulo arithmetic more often than we think, especially if we look at an analog clock to tell time. For example, if I asked you to convert 14:00 from military time into “regular” time, you’d respond “2 pm” pretty quickly. But how did you do it?

I picked 2 o’clock because I think it should become a standard in America to break at 2 for tea.


You got that answer because you went all the way around the clock and then had 2 remaining. This is modulo addition — in this case \mod 12.

Modulo addition is regular addition with an extra couple steps: dividing by your “base” n (on a clock n = 12) and then grabbing the remainder1. We write mathematically that for integers a,b, and n, that

a \equiv b \bmod n \Leftrightarrow b-a= k\cdot n \text{ for an integer } k

You read this as  a is equivalent to b modulo n if b-a is a multiple (integer multiple) of n. Another way to write it is to say that

a \equiv b \bmod n \Leftrightarrow b = k\cdot n + a \text{ for an integer } k

In other words, if b can be expressed as a multiple of n plus some extra amount a, then a is equivalent to b modulo n.

Back to our clock example, obviously 2 \neq 14, but 2\equiv 14 \bmod 12, because

14 = 1\cdot 12 + 2

Here, our b, 14 can be expressed as an integer multiple of 12 (k=1), plus some extra a=2, so we say that 2\equiv 14 \bmod 12.

That n matters. If I changed the n, that statement doesn’t hold true anymore. That is, 2  is not equivalent to 14 \mod 5 because the remainder when you divide 14 by 5 is 4. Next, let’s study the set of possible remainders when we divide an integer by a given n

The Finite Groups \mathbb{Z}_{n}

Since you’re super clever, I bet you’re already one step ahead. If we look at dividing any number by 12, the remainder can only be the numbers 0 – 11. Why? Well if the remainder is 12, then it’s a multiple of 12. The remainder of any integer, no matter how big it is, when dividing by n has to be somewhere in the set \{0,1,2,\ldots, n-1\}

Let’s look at a few with n=12. As an exercise, verify each of these on your own:

1\equiv 13 \bmod 12 \text{ because } 13 = 1\cdot 12 + 1 0\equiv 24 \bmod 12 \text{ because } 24 = 2\cdot 12 + 0 11\equiv 35 \bmod 12 \text{ because } 35 = 2\cdot 12 + 11 0\equiv 48 \bmod 12 \text{ because } 48 = 4\cdot 12 + 0 3\equiv 123 \bmod 12 \text{ because } 123 = 10\cdot 12 + 3

That k is telling us how many times we’ve gone around the clock. So we’re always stuck on the clock, no matter how many times we go around, which means that the remainder can’t be any larger than 11, or we’ve started another cycle around the clock. The set \{0,1,2,\ldots, n-1\} forms an algebraic structure called a group when we pair the set with the operation modulo addition.

So what’s a group?

group is a set paired with an operation that has to have 3 certain properties. The operation doesn’t always have to be addition, or modulo addition, and the set doesn’t have to be integers. Operations can be addition, multiplication, function composition, or some weird one you define. But to pair an operation with a set, we have to check to make sure some rules are followed:

  • First, your operation has to fit the formal definition of an operation. An operation is defied mathematically as a rule that takes any two elements in a set A and returns a unique element that is still in A. This means that you can’t create a rule that lets you escape the set. If my set is integers, and my rule is division, then it’s not an operation on that set because the result of 3/4 isn’t an integer. I escaped the set. The uniqueness requirement is what it means to be well-defined. If my rule is that I take two integers a and b and return the number whose square is the product ab, then 2\cdot 8 = 16 could return either 4 or -4. That’s not well-defined. The rule doesn’t guarantee that a unique element is returned.

Assuming our operation really is an operation (and modulo addition on the integers is. If you don’t believe me, try to prove it for yourself.), then we can move on. We have the set (integers modulo n) and the operation (modulo addition). Now, to meet the definition of a group, the set and operation must follow these three axioms

  1. The operation is associative. That is, the grouping or order shouldn’t matter when I go to execute the operation on 3 elements. That is, [(a+b) + c] \bmod n \equiv [a + (b+c)]\bmod n.
  2. There is an identity element in my set. In other words, there has to be some element in my set such that performing the operation (in this case, modulo n) on any other element in the set return that other element, and the order doesn’t matter. If you take some element and operate on it using the identity, or if you take the identity element and operate on it using that other element, you would always get that other element. 2
  3. Every element in the set has to have an inverse. This means that for every element in my group, there has to be another element such that performing the operation on the element and its proposed inverse in any order returns the identity element we talked about in (2).3


We have a special way of writing the set of the integers modulo n: \mathbb{Z}_{n} = \{0,1,\ldots,n-1\}. We’ll do a specific example and verify that \mathbb{Z}_{3} under modulo addition is a group.

\mathbb{Z}_{3} and operation tables

We can examine groups via operation tables, which are a nice visual. Think of your multiplication tables when you were young. The elements are placed on the rows and columns, and the entries are the result of applying the operation to the two elements. The row entry goes first, then the column entry. For \mathbb{Z}_{3}, we know the elements are \{0,1,2\}, so the operation table looks like this:

\begin{array}{l|rrr}\color{purple}{+}&\color{purple}{0} &\color{purple}{1} & \color{purple}{2}\\\color{purple}{0} & 0 & 1 & 2\\\color{purple}{1} & 1 & 2 & 0\\\color{purple}{2} & 2 & 0 & 1\end{array}

So, reading the table, the first entry is row 1 + column 1 (\bmod 3\!), which is

(0 + 0) \bmod 3 \equiv 0 \bmod 3

Another one is the row 2 + column 3:

(1 + 2) \bmod 3 = 3 \bmod 3 \equiv 0\bmod 3

Why? Well, after we add, we have to divide by 3 and take the remainder. As an exercise, make sure you can generate this table yourself. It gives good practice, and if you can comfortably generate this, then you’re doing great!

Now, let’s verify that \mathbb{Z}_{3} under modulo addition is a group

We know that modulo addition is an operation, so let’s see if combining it with \mathbb{Z}_{3} fits the definition of a group. This means we have to verify the three axioms, starting with the first:


Axiom 1: Associativity:

We need to show the following to satisfy the first axiom:

[(a+b) + c] \bmod 3 \equiv [a + (b+c)]\bmod 3.

First, let’s take any three unspecified elements of \mathbb{Z}_{3}. Call the a, b, and c. We don’t specify these because we have to keep things general in order to show that it doesn’t matter which three elements we pick.4 We are going to start with [(a+b) + c]\bmod 3 and manipulate it to show that it’s equivalent to [a+(b+c)]\bmod 3. To differentiate between regular old addition and modulo addition (modulo 3 in this case), I am going to use the notation +_{3} when I mean addition modulo 3.

First, by the definition of modulo arithmetic,

[(a+b)+c]\bmod 3 = (a+b)\bmod 3 +_{3} c\bmod 3

The right hand side of the above is the same thing as using regular addition to add a, b, and c together first, then take that result modulo 3. 5 So we write that

\begin{aligned}[(a+b)+c]\bmod 3 &= (a+b)\bmod 3 +_{3} c\bmod 3\\&= (a+b+c)\bmod 3\end{aligned}

Look inside the parentheses. Now we’re back in the nice, safe world of regular addition, and we know we can group and add any ol’ way we want to inside those parentheses. So let’s do that. We know that

(a+b+c) = (a+b) + c = a + (b+c).

Grab that last equality.6 Then we can write

\begin{aligned}[(a+b)+c]\bmod 3 &= (a+b)\bmod 3 +_{3} c\bmod 3\\&= (a+b+c)\bmod 3\\&=[a+(b+c)]\bmod 3\end{aligned}

Hmm, now it looks like we can “descend” the same ladder we walked up to start with. If [(a+b)+c]\bmod 3 = (a+b)\bmod 3 +_{3} c\bmod 3, then it also holds that [a+(b+c)]\bmod 3 = a\bmod 3 +_{3} (b+c)\bmod 3.7 \begin{aligned}[(a+b)+c]\bmod 3 &= (a+b)\bmod 3 +_{3} c\bmod 3\\&= (a+b+c)\bmod 3\\&=[a+(b+c)]\bmod 3\\&= a \bmod 3 +_{3} (b+c) \bmod 3\end{aligned}

Last step. We are now basically done. Use the definition of modulo arithmetic one more time to finish.

\begin{aligned}[(a+b)+c]\bmod 3 &= (a+b)\bmod 3 +_{3} c\bmod 3\\&= (a+b+c)\bmod 3\\&=[a+(b+c)]\bmod 3\\&= a \bmod 3 +_{3} (b+c) \bmod 3\\&\equiv [a+(b+c)]\bmod 3\end{aligned}


Axiom 2: Existence of an identity 

This one is nice and easy here. We already did the work by calculating out that operation table. Which element when added to any other element returns the other element? Does the order matter? Use the multiplication table to verify that 0 is the identity element of \mathbb{Z}_{3}


Axiom 3: Every element has to have an inverse

We also did the setup for this one too. Look at each element (there are only 3). Each element needs to have a corresponding pair that you can modulo add to it to retrieve the identity element 0 \mod 3. Clearly, 08 is its own inverse.9 (0+0) \bmod 3 = 0\mod 3. Also,

(1+2)\bmod 3 \equiv (2+1) \bmod 3 \equiv 0 \bmod 3.

So 1 and 2 are inverses of each other. Therefore, every element has an inverse, and we’ve proven that \mathbb{Z}_{3} is a group under the operation modulo addition.


I read all the way down to here, so what should I take from this?

All the things we just did were to learn a couple things:

  1. There is more than one kind of addition. Modulo addition is very commonly used, with the typical example being clock reading. We also use it when we program computers (because machines operate in binary, which is actually \mathbb{Z}_{2}. It also shows up in many programming applications, like calendars (also use modulo addition) and the storage mapping function. (For a fantastic explanation of the “why I should care”, check out I Programmer’s article on SMF and the article on the mod function. )
  2. Group theory can simplify our lives by giving us fewer things to study. If I want to study the effect dividing by 3 has on numbers, I can just study \mathbb{Z}_{3}, because dividing by 3 can only have, well, 3 possible options: a remainder of 0, 1, or 2. Now I don’t have to study all the integers in this situation to get the information I need.

This article will be just the start of various explorations of group theory. Here I gave an introduction to a specific type of group, the integers modulo n, because it will turn out that the structure and behavior of these groups are actually the same as other groups that look much harder to study, like symmetries of regular shapes. For example, \mathbb{Z}_{3} has the same mathematical structure as all the lines of symmetry through an equilateral triangle.10. This has vast uses in chemistry and physics, studying the shapes of molecules.

Abstract algebra is a vastly useful subject, despite its reputation for living “in the clouds.” Since I myself need to deepen my understanding of this remarkable branch of mathematics, join me on the journey and stay tuned for more posts exploring notions in group theory. Perhaps as I become better friends with algebra, it will appear more useful and beautiful to you as well.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Uncorrelated and Independent: Related but not Equivalent

Uncorrelated and Independent: Related but not Equivalent

Mathematics is known for its resolute commitment to precision in definitions and statements. However, when words are pulled from the English language and given rigid mathematical definitions, the connotations and colloquial use outside of mathematics still remain. This can lead to immutable mathematical terms being used interchangeably, even though the mathematical definitions are not equivalent. This occurs frequently in probability and statistics, particularly with the notion of uncorrelated and independent. We will focus this post on the exact meaning of both of these words, and how they are related but not equivalent.


First, we will give the formal definition of independence:


Definition (Independence of Random Variables).

Two random variables X and Y are independent if the joint probability distribution P(X, Y) can be written as the product of the two individual distributions. That is,

P(X, Y) = P(X)P(Y)


Essentially this means that the joint probability of the random variables X and Y together are actually separable into the product of their individual probabilities. Here are some other equivalent definitions:

  • P(X \cap Y) = P(X)P(Y)
  • P(X|Y) = P(X) and P(Y|X) = P(Y)

This first alternative definition states that the probability of any outcome of X and any outcome of Y occurring simultaneously is the product of those individual probabilities.

For example, suppose the probability that you will put ham and cheese on your sandwich is P(H \cap C) = 1/3. The probability that ham is on your sandwich (with or without any other toppings) is P(H) = 1/2 and the probability that cheese is on your sandwich (again, with or without ham or other goodies) is P(C) = 1/2. If ham and cheese were independent sandwich fixings, then P(H\cap C) = P(H)P(C), but

P(H\cap C)= 1/3 \neq 1/4 = P(H)P(C)

Thus, ham and cheese are not independent sandwich fixings. This leads us into the next equivalent definition:

Two random variables are independent if  P(X|Y) = P(X) and P(Y|X) = P(Y).

The vertical bar is a conditional probability.  P(X|Y) reads “probability of X given Y“, and is the probability X will have any outcome x given that the random variable Y has occurred.

The second definition means that two random variables are independent if the outcome of one has no effect on the other. That is, putting cheese on my sandwich doesn’t affect the likelihood that I will then add ham, and  if I started with ham, then it doesn’t affect the likelihood I will add cheese second. The example above already showed that ham and cheese were not independent, but we’ll repeat it again with this other equivalent definition.

By Bayes’ formula,

P(H | C) = \frac{P(H\cap C)}{P(C)} = \frac{1/3}{1/2} = \frac{2}{3} \neq \frac{1}{2} = P(H)

The probability that ham will be on my sandwich given that cheese is on my sandwich is 2/3. This means the presence of cheese increases the likelihood that ham will be there too, which also tells me ham and cheese are not independent.

P(C | H) = \frac{P(H\cap C)}{P(H)} = \frac{2}{3} \neq \frac{1}{2} = P(C)

In addition, I’m more likely to add cheese to the sandwich if ham is already there. In both of these, the presence of one affects the probability of the other, so they are not independent.

Coin flips are independent. The probability of me flipping a quarter and getting a head doesn’t affect the probability of you then getting a tail (or head) when you pick up that quarter and flip it after me. Independence is a common assumption in statistics because most distributions rely on it, though real data is rarely truly independent. For the development of the theory of dependent random variables, see herehere, and here.

Next we’ll define what it means to be uncorrelated, and discuss some of the subtleties and equivalent interpretations.


When people use the word “uncorrelated”, they are typically referring to the Pearson correlation coefficient (or product-moment coefficient) having a value of 0. The Pearson correlation coefficient of random variables X and Y is given by

\rho = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}

Where \text{Var}(X) is the variance of the random variable, and \text{Cov}(X,Y) is the covariance between X and Y. A correlation of 0, or X and Y being uncorrelated implies \text{Cov}(X,Y) = 0, and thus it suffices to just look at the numerator.

The covariance between two random variables X and Y measures the joint variability, and has the formula

\text{Cov}(X,Y) = E[XY]-E[X]E[Y] E[\cdot] is the expectation operator and gives the expected value (or mean) of the object inside. E[X] is the mean value of the random variable X. E[XY] is the mean value of the product of the two random variables X and Y.

For an example, suppose X and Y can take on the joint values (expressed as ordered pairs (0,0), (1,0), and (1,1) with equal probability. Then for any of the three possible points (x,y), P( (X,Y) = (x,y)) = 1/3. We will find the covariance between these two random variables.

The first step is to calculate the mean of each individual random variable. X only takes on two values, 0 and 1, with probability 1/3 and 2/3 respectively. (Remember that two of the points have X = 1, with each of those probabilities as 1/3.) Then

E[X] = 0\cdot 1/3 + 1\cdot 2/3 = 2/3

Similarly, E[Y] = 0\cdot 2/3 + 1\cdot 1/3 = 1/3. Now, we must calculate the expected value of the product of X and Y. That product can take on values 0 or 1 (multiply the elements of each ordered pair together) with respective probabilities 2/3 and 1/3. These probabilities are obtained the same way as for the individual expectations. Thus,

E[XY] = 0\cdot 2/3 + 1\cdot 1/3 = 1/3

Finally, we put it all together:

\text{Cov}(X,Y) = E[XY]-E[X]E[Y] = \frac{1}{3}-\frac{2}{3}\cdot\frac{1}{3} = \frac{1}{3}-\frac{2}{9} = \frac{1}{9}

Covariance (and correlation, the normalized form of covariance) measure the linear relationship between two random variables. This is important. Next, we’ll look at how independence and correlation are related.

Independence \Rightarrow Uncorrelated

Here’s the start of where the confusion lies. First, it is absolutely true that if two random variables are independent, then they are uncorrelated. It’s important to prove this, so I will do it. I will prove this for discrete random variables to avoid calculus, but this holds for all random variables, both continuous and discrete.

Theorem. If two random variables X and Y are independent, then they are uncorrelated.


Proof. Uncorrelated means that their correlation is 0, or, equivalently, that the covariance between them is 0. Therefore, we want to show that for two given (but unknown) random variables that are independent, then the covariance between them is 0.

Now, recall the formula for covariance:

\text{Cov}(X,Y) = E[XY]-E[X]E[Y]

If the covariance is 0, then E[XY] = E[X]E[Y]. Therefore, we say mathematically that it is sufficient to show that E[XY] = E[X]E[Y]. This is because uncorrelated is equivalent to a 0 covariance, which is equivalent to E[XY] = E[X]E[Y], and thus showing this last equality is equivalent to showing that X and Y are uncorrelated.

OK, we are given that X and Y are independent, which by definition means that P(X,Y) = P(X)P(Y). This is our starting point, and we know what we want to show, so let’s calculate E[XY]:

E[XY] = \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)

This is what E[XY] is. We have to sum the product of the two random variables times the probability that X=x and Y=y over all possible values of X and all possible values of Y. Now, we use the definition of independence and substitute P(X=x)P(Y=y) for P(X=x, Y=y):

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\end{aligned}

If I sum over Y first, then everything related to X is constant with respect to Y. 1. Then I can factor out everything related to X from the sum over y:

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\\&= \sum_{x}x\cdot P(X=x)\sum_{y}y\cdot P(Y=y)\end{aligned}

Then if I actually carry out that inner sum over y, it becomes a completed object with no relationship to the outer sum going over x. That means I can put parentheses around it and pull it out of the sum over x:

\begin{aligned}E[XY] &= \sum_{x}\sum_{y}x\cdot y\cdot P(X=x, Y=y)\\&=\sum_{x}\sum_{y}x\cdot y\cdot P(X=x)P(Y=y)\\&= \sum_{x}x\cdot P(X=x)\left(\sum_{y}y\cdot P(Y=y)\right)\\&= \left(\sum_{y}y\cdot P(Y=y)\right)\left(\sum_{x}x\cdot P(X=x)\right)\end{aligned}

Looking at the objects in each group of parentheses, we see that it matches the definition of expectation for X and Y. That is E[X] = \sum_{x}x\cdot P(X=x), and similar for Y. Therefore, we have shown that E[XY] = E[X]E[Y], and have proven that independence always implies uncorrelated.

Now, to use these two words (independent and uncorrelated) interchangeably, then we would have to know that the converse of the statement we just proven is true: that

Uncorrelated implies independence

If we find even one counterexample (an example where the two variables have 0 correlation but do not fit the definition of independence), then the converse is false and we cannot use those terms interchangeably.

No luck, I have a counterexample


Let’s take X and Y to exist as an ordered pair at the points (-1,1), (0,0), and (1,1) with probabilities 1/4, 1/2, and 1/4. Then E[X] = -1\cdot 1/4 + 0 \cdot 1/2 + 1\cdot 1/4 = 0 = E[Y] and

E[XY] = -1\cdot 1/4 + 0 \cdot 1/2 + 1\cdot 1/4 = 0 = E[X]E[Y]

and thus X and Y are uncorrelated.

Now let’s look at the marginal distributions of X and Y. X can take on values -1, 0, and 1, and the probability it takes each of those is 1/4, 1/2, and 1/4. Same with Y. Then looping through the possibilities, we have to check if P(X=x, Y=y) = P(X=x)P(Y=y) P(X=-1, Y=1) =1/4 \neq 1/16 = P(X=-1)P(Y=1)

We loop through the other two points, and see that X and Y do not meet the definition of independent.

Therefore, we just found an example where two uncorrelated random variables are not independent, and thus the converse statement does not hold.



Correlation is a linear association measure. We saw in our counterexample that there was no linear relationship between the two random variables2 That doesn’t mean the variables don’t have any effect on each other (again, as we saw in our counterexample.) The common mistake is to forget that correlation is a restrictive measure of relationships, since it only covers linear types. Independence is a measure of “probabilistic effect”, which encompasses far more than simply linear association.

The words uncorrelated and independent may be used interchangeably in English, but they are not synonyms in mathematics. Independent random variables are uncorrelated, but uncorrelated random variables are not always independent. In mathematical terms, we conclude that independence is a more restrictive property than uncorrelated-ness.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

A Generalized Geometric Distribution from Vertically Dependent Bernoulli Random Variables

A Generalized Geometric Distribution from Vertically Dependent Bernoulli Random Variables

 For full proofs and derivations, read here.


This paper generalizes the notion of the geometric distribution to allow for dependent Bernoulli trials generated from dependency generators as defined in Traylor and Hathcock’s previous work. The generalized geometric distribution describes a random variable X that counts the number of dependent Bernoulli trials until the first success. The main result of the paper is  X can count dependent Bernoulli trials from any dependency structure and retain the same distribution. That is, if  X counts Bernoulli trials with dependency generated by \alpha_{1} \in \mathcal{C}_{\delta}, and Y counts Bernoulli trials with dependency generated by \alpha_{2} \in \mathscr{C}_{\delta}, then the distributions of  X and Y are the same, namely the generalized geometric distribution. Other characterizations and properties of the generalized geometric distribution are given, including the MGF, mean, variance, skew, and entropy.


The standard geometric distribution counts one of two phenomena:

  1.  The count of i.i.d. Bernoulli trials until the first success
  2.  The count of i.i.d. Bernoulli trials that resulted in a failure prior to the first success

The latter case is simply a shifted version of the former. However, this distribution, in both forms, has limitations because it requires a sequence of independent and identically distributed Bernoulli trials. Korzeniowski [2] originally defined what is now known as first-kind (FK) dependent Bernoulli random variables, and gave a generalized binomial distribution that allowed for dependence among the Bernoulli trials. Traylor [4] extended the work of Korzeniowski into FK-dependent categorical random variables and derived a generalized multinomial distribution in a similar fashion. Traylor and Hathcock [5] extended the notion of dependence among categorical random variables to include other kinds of dependency besides FK dependence, such as sequential dependence. Their work created a class of vertical dependency structures generated by a set of functions

\mathscr{C}_{\delta} = \{\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N} : \alpha(n) < n \text{ and } \forall n \exists j \in \{1,\ldots,n-1\} : \alpha(n) = j\},

where the latter property is known as dependency continuity. In this paper, we derive a generalized geometric distribution from identically distributed but dependent Bernoulli random variables. The main result is that the pdf for the generalized geometric distribution is the same regardless of the dependency structure. That is, for any \alpha \in \mathscr{C}_{\delta} that generates a sequence of identically distributed but dependent Bernoulli trials, the generalized geometric distribution remains unchanged.


The standard geometric distribution is built from a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables with probability of success p and probability of failure q=1-p. There are two “versions” of the geometric distribution:

  • A random variable X has a geometric distribution if it counts the number of Bernoulli trials needed to observe the first success.
  •  A random variable Y = X-1 has a geometric distribution if it counts the number of failures in a sequence of Bernoulli trials before the first observed success.


In the first case, X has support \{1,2,3,...\}, because we are looking for the first success, which can occur on trial 1, 2, 3, and so forth. In the latter case, Y has support \{0,1,2,...\} because we are counting the number of failures before the first success occurs. That is, if the first success occurs on Trial 1, then there were 0 failures preceding the first success. If the first success occurs on trial 2, then one failure occurred prior to the first success, and thus Y=1. Essentially, Version 2 is a shifted Version 1, because our perspective changes– we do not include the success in the count in Version 2.

For Version 1, known as the standard geometric distribution the pdf is given by

f_{X}(k) = q^{k-1}p, \quad k = 1, 2, 3, \ldots

For Version 2, (the shifted generalized geometric distribution)the pdf is given by

f_{Y}(k) = q^{k}p, \quad k = 0, 1, 2, \ldots

The next section derives the generalized geometric distribution for FK-dependent random variables, and then shows that the pdf is the same regardless of dependency structure.

Generalized Geometric Distribution

Derivation from FK-Dependent Bernoulli Random Variables

Suppose we have a sequence of FK-dependent Bernoulli Random variables. Recall from [2] and [4] that FK-dependent random variables are weighted toward the outcome of the first variable \epsilon_{1}. That is, in the Bernoulli case, P(\epsilon_{1} = 1) = p and P(\epsilon_{1} = 0) = q = 1-p. For subsequent variables in the sequence,

\begin{aligned}P(\epsilon_{n} = 1 | \epsilon_{1}= 1) = p^{+} &\qquad P(\epsilon_{n} = 1 | \epsilon_{1} = 0) = p^{-} \\P(\epsilon_{n} = 0 | \epsilon_{1} = 1) = q^{-} &\qquad P(\epsilon_{n} = 0 | \epsilon_{1} = 0) = q^{+}\end{aligned}

for n\geq 2, where q = 1-p, p^{+} = p + \delta q, p^{-} = p-\delta p, q^{-}= q-\delta q, q^{+} = q + \delta p, and 0 \leq \delta \leq 1 is the dependency coefficient.

We will first give the generalized “Version 1” of the geometric distribution for FK-dependent random variables.


Suppose \epsilon = (\epsilon_{1},\epsilon_{2},\ldots, \epsilon_{n},\ldots) is a FK-dependent sequence of Bernoulli random variables. Let X be the count of such Bernoulli variables needed until the first success. Then X has a generalized geometric distribution with pdf
f_{X}(k) = \left\{\begin{array}{lr}p, & k = 1 \\q\left(q^{+}\right)^{k-2}p^{-}, & k \geq 2\end{array}\right.


In a similar fashion, suppose we prefer to count the number of failures before the first success occurs. For this generalized “Version 2”, we have the following proposition.


Suppose \epsilon = (\epsilon_{1},\epsilon_{2},\ldots,\epsilon_{n},\ldots) is a FK-dependent sequence of Bernoulli random variables. Let Y = X-1 be the count of failures prior to the first success. Then Y has a shifted generalized geometric distribution with pdf

f_{Y}(k) = \left\{\begin{array}{lr}p, & k = 0 \\q\left(q^{+}\right)^{k-1}p^{-}, & k \geq 1\end{array}\right.


Generalized Geometric Distribution for any Vertical Dependency Structure

Propositions 1 and 2 were derived for FK-dependent random variables, but in fact these random variables X and Y remain distributed according to the generalized geometric distribution and the shifted generalized geometric distribution regardless of the vertical dependency structure specified, as long as the dependency structure was generated from a function in \mathscr{C}_{\delta}.


Let \epsilon = (\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{n},\ldots) be a vertically dependent sequence of Bernoulli random variables, where the dependency is generated by \alpha \in \mathscr{C}_{\delta}. Let X be the count of such Bernoulli trials needed until the first success, and let Y = X-1 be the count of failures of such Bernoulli trials prior to the first success. Then the pdf of X and Y is identical to those given in Propositions 1 and 2.


This result is quite powerful, and not one that holds for all generalized distributions constructed from dependent random variables. Given any vertical dependency structure generated from the broad class \mathscr{C}_{\delta}, the count of trials before a success and the count of failures before a success have the same probability distribution. Thus, if this information is desired, no information about the dependency structure other than the membership of its generating function in \mathscr{C}_{\delta} is necessary. The only information needed to calculate the generalized geometric probabilities for dependent Bernoulli trials is p and \delta.

The next section gives some basic properties of the Generalized Geometric Distribution, such as the moment generating function and selected moments.


Properties  of the Generalized Geometric Distribution

Moment Generating Function


The moment generating function of the generalized geometric distribution is
M_{X}(t) = pe^{t} + \frac{qp^{-}e^{2t}}{1-q^{+}e^{t}}




The mean of the generalized geometric distribution is
E[X] = \mu = \frac{1-\delta p}{p(1-\delta)}


The effect of dependence can be seen in the plot of E[X] below in Figure 1.. For fixed p, when \delta \to 1, E[X] \to \infty, though the rate changes with p.

Figure 1. Expectation of the Generalized Geometric Distribution.


Figure 2. Expected value of generalized geometric distribution with fixed p=1/2

To explore further, suppose p = 1/2, and the Bernoulli trials are thus balanced between success and failure. Figure 2 shows the effect of delta for a fixed p. Notice that the effect of \delta on the expected value become more pronounced as \delta \to 1. In particular, for \delta = 1/2 and p=1/2, E[X] = 3, but after this point, an increase of only 1/6 in \delta to \delta = 2/3 increased the expected value to 4 trials before a success. To double the expected number of trials before a success again to E[X] = 8 requires an increase of \delta by only 4/21 to 6/7.

A smaller probability of success p will yield an expected value \mu that is much more susceptible to effects of dependency \delta, and a larger p will yield an expected value more resistant to high dependency \delta. Since the geometric distribution is a count of the number of trials needed to obtain the first success, a higher p increases the probability that the first success occurs on the first trial, while a lower p decreases that probability. Therefore, the dependency \delta would have a higher effect for lower p, because a longer (dependent) sequence is expected to be generated prior to the first success, which increases the expected number of trials faster than if the Bernoulli trials were independent.

Remark. Notice that when \delta = 0, the Bernoulli trials are independent. The mean of the generalized geometric distribution when \delta = 0 is E[X] = \frac{1}{p}, the mean of the standard geometric distribution.




The variance of the generalized geometric distribution is

\text{Var}(X) = \sigma^{2} = \frac{1-p + \delta p(1-p)}{p^2(1-\delta)^2}


Figure 3. Variance as a function of dependency for fixed p.

Figure 3 shows the effect of \delta on the variance for different values of p. As with the mean, a smaller p induces a higher effect of \delta on the variance of the number of dependent Bernoulli trials before the first success is observed. The shape of all 4 cases is similar, but the scales are vastly different. As p increases, the scale of the variance decreases dramatically.

Remark.  Again, note that when \delta = 0, the variance of the generalized geometric distribution reduces to that of the standard geometric distribution.




The skew of the generalized geometric distribution is given by
\text{Skew}[X] = \frac{2-3p + p^2 + \delta p[q+\delta p q + p(2\delta-1-p)]}{\left(q + \delta pq\right)^{3/2}}

Figure 4. Skew of Generalized Geometric Distribution as a function of p

The skew of the generalized geometric distribution gets more complicated in its behavior as a function of both p and \delta. Figure 4 shows the skew as a function of p for the two extreme cases: complete independence (\delta = 0) and complete dependence \delta = 1. From p=0 to p \approx 0.658, the skew for the independent geometric distribution is greater than the completely dependent case. For p \gtrapprox 0.658, the skew is greater under complete dependence.



The entropy of a random variable measures the average information contained in the random variable. It can also be viewed as a measure of how unpredictable or “truly random” the variable is [1] . The definition of entropy, denoted H(X), was coined by Claude Shannon [3] in 1948.


Definition. (Entropy)
H(X) := -\sum_{i}P(x_{i})\log_{2}(P(x_{i}))


For the standard geometric distribution, the entropy is given by

H_{sg}(X) = \frac{-(1-p)\log_{2}(1-p)-p\log_{2}(p)}{p}


The entropy for the generalized geometric distribution is
H_{gg}(X) = \frac{-\left[pp^{-}\log_{2}(p) + qp^{-}\log_{2}(qp^{-}) + qq^{+}\log_{2}(q^{+})\right]}{p^{-}}


Figure 5a. Entropy of the generalized geometric distribution as a function of p
Figure 5b. Entropy as a function of dependency for fixed p


Figure 5a shows H_{gg}(X) as a function of p for fixed values of \delta. Notice that while the entropy decreases to 0 for all curves as p \to 1, the entropy curve is shifted upward for larger \delta. Figure 5b fixes p and looks at entropy as a function of \delta. Notice that for smaller p, the entropy is much higher, which aligns with intuition.


The standard geometric distribution counts the number of independent Bernoulli trials until the first success. This paper uses the works of Korzeniowski, and Traylor and Hathcock [2, 4, 5] on dependent Bernoulli and categorical random variables to develop a generalized geometric distribution built from dependent Bernoulli random variables. The main result of the paper is that the pdf for the generalized geometric distribution is independent of the dependency structure of the Bernoulli random variables that comprise it. That is, regardless of dependency structure, the pdf for the generalized geometric distribution is given by Proposition 1. Various properties and characterizations were given, including the moment generating function, mean, variance, skew, and entropy. The effect of dependency on each property was studied.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


  1. Skewness.
  2. Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science,pages 43–58, 2013.
  3. Claude Shannon. A mathematical theory of communication. The Bell System Technical Journal, 1948.
  4.  Rachel Traylor. A generalized multinomial distribution from dependent categorical random variables. Academic Advances of the CTO, 2017.
  5. Rachel Traylor and Jason Hathcock. Vertical dependency in sequences of categorical random variables. Academic Advances of the CTO, 2017.