### Browsed byMonth: June 2017

The Rigor of Fuzzy Sets

## The Rigor of Fuzzy Sets

Perhaps “fuzzy set theory”, “fuzzy arithmetic”, and “fuzzy rules” could have been named something a bit less mock-worthy. The word “fuzzy” has English synonyms “blurred”, “unfocused”, and the worst “ill-defined”. However, fuzzy set theory is anything but fuzzy. Maybe cuddly and fun, but certainly not ill-defined. We’ll take a look in this post at what exactly fuzzy sets are. Fuzzy set theory is an extension of classical set theory, so we’ll first take a look at the basics of a classical set.

## Classical sets

A classical set is one we all deal with explicitly on a daily basis, even if we don’t use that terminology. An item or element is either a member of a classical set, or it is not. Examples are

• Even numbers, denoted $2\mathbb{Z}$
• A list of members of a particular country club
• Wavelengths that are classified as UV light
• The set of perfect squares $S = \{ k^2 : k \in \mathbb{N}\}$

If we take an element in the relevant universe for each example, we are faced with the binary decision of whether or not it is a member of the set. 3 is not a member of $2\mathbb{Z}$; 4 is. I am not a member of any country club, so given a particular one, I do not fall in that set. The electromagnetic spectrum is a continuum, but the cutoffs are strict. A wavelength between 10nm and 400 nm is ultraviolent. Any wavelength less than 10nm or greater than 400 nm is not. Finally, a the square of a rational number cannot be a perfect square, because a rational number is not a natural number. 3 is a natural number, but there is no natural number whose square gives 3.

For all of these sets, we can define a characteristic function of a set $A$ (denoted $\chi_{A}(x)$) that takes an input and returns a 1 if the input is an element of the set $A$ and 0 if it is not. Mathematically, we write

$$\chi_{A}(x) = \left\{ \begin{array}{lr} 1, & x \in A \\ 0, &x \notin A \end{array}\right.$$

That is, if $A = 2\mathbb{Z}$, then $\chi_{A}(3) = 0$, and $\chi_{A}(4) = 1$.

Now, what if we relax the requirement that you either be in or out? Here or there? Yes or no? What if I allow “shades of grey” to use a colloquialism? Then we extend classical sets to fuzzy sets.

## Partial Membership is well-defined

Many things we encounter daily aren’t strict or crisp as is used in mathematics. Someone might tell you that you can either be in or out of a room, but what if you stand with one foot on either side of the threshold? It doesn’t make sense to then say you are neither in nor out, because part of you is certainly inside the room. You can’t “pick one”, because the whole you is not wholly in either place all at once.

Fuzzy sets allow for partial membership in a set. That is, you can be “a little bit” in the room (perhaps a toe is sticking in), “halfway” in the room (straddling the threshold), “most of the way” in the room (only a foot sticking out of the room), and everything in between. We formally define this notion by extending the characteristic function $\chi_{A}(x)$ to include the possibility of fractional membership.

The membership function of a fuzzy set $A$ (denoted $\mu_{A}(x)$) takes an element $x$ in a universe $X$ and maps it somewhere in the interval [0,1] according to the level of membership it has in the set $A$. That is,

$$\mu_{A}(x): X \to [0,1]$$

Example (Defining a bacon lover)

Let $x \in \mathbb{Q}$ be the number of slices of bacon a person eats per day. (We will allow for partial consumption of bacon slices.) Suppose for simplicity that no one would ever eat more than 5 slices per day. Then a possible membership function for classifying someone as a “bacon-lover” could be

$$\mu_{A}(x) = \left\{\begin{array}{lr}\frac{x-1}{2}, &1\leq x \leq 3 \\ 1, &3 \leq x\leq 5\\0, & \text{ otherwise}\end{array}\right.$$

So given the number of bacon slices you eat per day, we can use the membership function $\mu_{A}(x)$ to determine your level of commitment to bacon. If you eat 2 slices of bacon per day, $\mu_{A}(2) = 0.5$, so you are a “halfway” bacon-lover.

### Level sets

We can also look at fuzzy sets in terms of their level sets, as a way to “backsolve” to determine what values of $x$ are at least an $\alpha$ level of membership in the set, where $0\leq \alpha\leq 1$.  Denoted $A_{\alpha}$, we define the $\alpha$-level set as

$$A_{\alpha} = \{x \in X : \mu_{A}(x) \geq \alpha\}$$

Example

For the bacon set, we can use the graph above.

• $A_{1} = [3,5]$
• $A_{1/2} = [2, 5]$
• $A_{3/4} = [2.5,5]$

In general, we can find the $\alpha$-level set by inverting the membership function. In this case, the right endpoint is always 5, and the left endpoint is given by $2\alpha + 1$.

So, $A_{1/8} = \left[ 2\cdot \frac{1}{8} + 1, 5\right] = [1.25 , 5]$

Some special level sets are the $A_{1}$, called the core, and $A_{0}$, called the support. The core tells us the interval that represents full membership, and the support tells us how “wide” the entire fuzzy set actually is.

Now that we understand what fuzzy sets are, let’s look at doing some things with them.

## Fuzzy set operations are identical to classical set operations

We’re going to dive a little further into fuzzy set theory and discuss operations on fuzzy sets such as union, intersection, and complementation. Since the fuzzy set is a nice extension of the classical set, the definitions of union, intersection, and complementation actually remain the same for fuzzy sets as for classical sets. The only difference is that the result is a fuzzy set rather than a classical set.1 We’ll discuss each one in turn.

### Union

The union is represented by the phrase “or”– that is, given two sets $A$ and $B$, an element is a member of the union of $A$ and $B$ if it is in either $A$ or $B$.2

In classical set theory, the characteristic function of the union of two sets $A$ and $B$ is given by

$$\chi_{A \cup B}(x) = \text{max}\left(\chi_{A}(x),\chi_{B}(x)\right)$$

Let’s look at classical sets again, and let $A$ be the set of prime numbers, and $B$ be the set of odd numbers.

• $\chi_{A \cup B}(3) = \text{max}(A(3), B(3)) = \text{max}(1,1) = 1$ so 3 is in $A \cup B$
• $\chi_{A \cup B}(2) = \text{max}(A(2), B(2)) = \text{max}(1,0) = 1$ so 2 is in  $A \cup B$
• $\chi_{A \cup B}(4) = \text{max}(A(4), B(4)) = \text{max}(0,0) = 0$ so 4 is not in  $A \cup B$

For fuzzy sets, the formula for the membership function of the union of two fuzzy sets  $A \vee B$3 remains exactly the same as for the characteristic function of the union of two classical sets.

$$\mu_{A \cup B}(x) = \text{max}\left(\mu_{A}(x),\mu_{B}(x)\right)$$

Let’s keep our bacon set $A$ and define $B$ as the fuzzy set that classifies a “cheese-aficionado” with membership function

$$\mu_{B}(x) = \left\{\begin{array}{lr}\frac{x}{5}, &0\leq x\leq5\\0, &\text{otherwise}\end{array}\right.$$

We can now find $\mu_{A \vee B}(x) = \max(\mu_{A}(x),\mu_{B}(x))$. For unions, the left endpoint of the support will be the smaller of the two left endpoints of the supports of $A$ and $B$, and the right endpoint of the union will be the larger of the two right endpoints of the supports of $A$ and $B$.  The support of “bacon-lover” union “cheese aficionado” will then be [0,5], which you can see from the graph.

The membership function is given by

$$\mu_{A \vee B}(x) = \left\{\begin{array}{lr}\frac{x}{5}, & 0\leq x \leq \frac{5}{3}\\\frac{x-1}{2}, & \frac{5}{3}\leq x \leq 3\\ 1, & 3\leq x \leq 5\end{array}\right.$$

### Intersection

The intersection of two sets denoted $A \cap B$ is represented in language by the word “and”– an element must be a member of both sets simultaneously to be in the intersection of two sets. For the classical set example, we’ll look again at $A$ as the set of prime numbers, and $B$ as the set of odd numbers. The membership function of the intersection of two classical sets, denoted $\chi_{A \cap B}$ is given by

$$\chi_{A \cap B} = \min\left(\chi_{A}(x), \chi_{B}(x)\right)$$
• $\chi_{A \cap B}(3) = \min(A(3), B(3)) = \min(1,1) = 1$ so 3 is in $A \cap B$
• $\chi_{A \cap B}(2) = \min(A(2), B(2)) = \min(1,0) = 0$ so 2 is not in  $A \cap B$
• $\chi_{A \cap B}(4) = \min(A(4), B(4)) = \min(0,0) = 0$ so 4 is not in  $A \cap B$

Just like for the union, the membership function of the intersection of two fuzzy sets (denoted $A \wedge B$) has the same formula as that for the classical counterpart.

$$\mu_{A \wedge B} = \min\left(\mu_{A}(x), \mu_{B}(x)\right)$$

The intersection of our fuzzy bacon set and our fuzzy cheese set then is given by

$$\mu_{A \wedge B} = \left\{\begin{array}{lr}0, & 0\leq x \leq 1\\\frac{x-1}{2}, & 1\leq x \leq \frac{5}{3}\\\frac{x}{5}, & \frac{5}{3}\leq x \leq 5\end{array}\right.$$

### Complement

The complement of a set (denoted $A^{c} \text{ or } \overline{A}$) is represented by the English word “not”, and indicates negation. Everything outside a given set is the complement of a set. The membership function for the complement of a fuzzy set is identical to the characteristic function of the complement of a classical set:

$$\mu_{A^{c}}(x) = 1-A(x)$$

The complement of a “bacon-lover”, (a bacon-hater?) is then

$$\mu_{A^{c}}(x) = \left\{\begin{array}{lr}\frac{3-x}{2}, & 1\leq x \leq 3 \\ 0, & 3\leq x \leq 5\end{array}\right.$$

## Conclusion

Fuzzy sets are simply an extension, or generalization of the classical sets we all implicitly knew already. A good definition, especially one that intends to extend a concept should contain the original concept as a special case. Fuzzy sets do indeed contain classical sets inside their definition. Moreover, operations on fuzzy sets should not need wholly new definitions. We saw that union, intersection, and complementation still hold for fuzzy sets just as they did before, so the theory of fuzzy sets is well defined and represents a consistent extension of classical set theory. The next post will extend the concept of fuzzy sets to fuzzy numbers, and then investigate how arithmetic works for fuzzy numbers.

Interventions, not Anomalies

## Interventions, not Anomalies

Anomaly Detection is becoming almost universally considered a “hot topic” across industries and business fields.

A venture capitalist wants to know if a potential investment is an anomaly that will make billions at the time of IPO. A credit card company wants to know if a particular transaction is anomalous, as it could be potentially fraudulent. A systems administrator wants to know if the latency on his storage system is unique to him, or widespread through out the network. A logistics manager wants to know if the recent influx of orders to his warehouse is unusual and/or permanent. In each of these cases, they are all implicitly asking the same question:

Is my current observation abnormal?

Before we discuss how to detect anomalies/abnormalities/outliers, the term must be defined. What is an anomaly?           The brutal secret is that there is no rigid (i.e. mathematical) definition of an anomaly. (See the previous discussion here).  Some attempts at a definition are

• “An outlying observation, or “outlier,” is one that appears to deviate markedly from other members of the sample in which it occurs.” [1]

•  An outlier is an observation that is far removed from the rest of the observations. [2]

• Anything that lies outside the median $\pm$ 1.5IQR1

In all of these cases, these statements are all relative to some definition of “typical” or “normal”, and defining an outlier to lie “too far” according to some distance metric away from it. Many debates focus on exactly which metric to use, depending on the type of data ($L^{2}$ norm, absolute distance away from the median, more than three standard deviations away from the mean, etc), but the crux of the issue is sidestepped.

In order to define an anomaly, one must define “normal”, and then decide how much variability in “normal” is tolerable. There are two stages of subjectivity at play here:

1. The definition of “normal”
2. How much variability is acceptable

These two items must be decided before a definition of an anomaly is attempted, which means the resulting definition of an anomaly has a shaky foundation at best. That is, the working definition of an anomaly is relative, and relative to moving targets.  Subjectivity has no place in mathematics, for the applications of the vague mathematics are weak and prone to misuse and abuse. (See previous post)

There are occasions when the definition of an anomaly using the above two steps can be made clear. These occasions are rare, but should be noted. These circumstances typically show up in rigid, well-defined environments, like manufacturing, or engineering/physics laws.

Manufacturing specifications define “normal” clearly and unambiguously, like the location of a hole in sheet metal.  Laws of physics are immutable, and not dependent on judgment or subjective interpretation. The laws of motion define how a “macro” object behaves (I hear you, quantum physicists), regardless of the object. We know where both a bowling ball and a golf ball are going to land given their initial position and velocity on Earth.

Acceptable variability is also clearly defined in both of these cases. In manufacturing, a design engineer specifies an acceptable tolerance for deviation of the hole’s location. In the case of observing physical laws, the variability comes from measurement error, and is given by the accuracy of the measurement tool.

Example. (The Ideal Gas Law)

The ideal gas law describes the relationship between pressure (in atmospheres), volume (in liters), temperature (in degrees Kelvin), and amount of gas (in moles) of an “ideal gas”. Nitrogen is considered to behave very closely to an ideal gas, so for this example, we will assume that nitrogen is an ideal gas and follows the ideal gas law exactly. The ideal gas law is given by

$$PV = nRT$$

where $R = 0.08 \tfrac{L\cdot \text{atm}}{\text{mol}\cdot K}$ is the ideal gas constant. Suppose now that we have exactly 1 mol of nitrogen in a 1 L container. We wish to increase the temperature of the nitrogen to see the effect on the pressure. In this case, we know exactly what the result should be, because we have a physical law that defines the model for this behavior. Here we will also note that the hypothetical pressure sensor has a tolerance of 0.001 atm. That is, the sensor can be up to 0.001 atm off when a measurement is taken.

The graph above shows the exact pressure readings we should expect as we increase the temperature of the container. Therefore, if the temperature of the container is known to be 2 degrees Celsius, and the pressure sensor reads more than 22.001 atm, then we can say the pressure reading is anomalous, because the tolerance is defined by the sensor, and the model is defined exactly by a physical law.

That is, both (1) and (2) from above are specified rigidly and objectively, and not according to a person’s judgment. In the vast majority of use cases for “anomaly detection”, measurements are not being taken according to a known model. Most of the time, the mathematical model that governs the data is unknown prior to even the desire to detect anomalies (whatever we mean by that) in the dataset. Thus, we cannot extend the principles used in manufacturing and physics to the uncertainty of data science.

### Intervention vs. Anomalies

The data science world requires a different approach that encompasses the inherent uncertainty involved in modeling behavior in the first place, let alone determining “excessive” deviations from it. Enter intervention analysis.

Intervention analysis is a broader, yet more rigid term that describes the study of changes to a data set.2 Conceptually, this term embodies the question

Did the behavior change, and is it permanent?

When we express a desire for an “anomaly detection system”, this is the fundamental question we are asking. Typical algorithms used for outlier analysis, like the Extreme Studentized Deviate3 are only looking for “strange points” within the sequence or dataset relative to what you currently see. If a decision is to be made based on a model or a deviation from it, we would like to know what kind of deviation it is. Intervention analysis formally classifies five4:

1. Level shift: At a particular point, the model made a stepwise change that is permanent.

1. Level shift with Lag: Same as (1), but takes a few time steps to occur. This shows up in the dataset as a gradual increase to the new permanent level.

1. Temporary Change: The model experiences a level shift that has a decaying effect over time.

1. Innovation Outlier: This is the most complex type, typically is defined to represent the onset of a causal effect or a change in the model itself.

1. Additive Outlier: the “point outlier”. It has no effect on future behavior.

For examples of each, visit this page for a good introductory description. I also cite the relevant papers in the references.

The important aspect with this approach is that a model is formally defined (for time series, this is called ARIMA), as well as changes to that model. Thus, when we talk about a level shift, there is no ambiguity. Moreover, with formal definitions and types, we can classify a change according to the effect the point has on the future points.5

## Proposed Strategy

Detecting innovations in practice is difficult; this post is not meant to diminish this. Creating a model from data without much prior knowledge of the behavior of the data is also difficult, as these interventions can have adverse effects on building the model in the first place. That is, if a level shift is present in the data at a certain point, we may end up with a different initial model that doesn’t notice the level shift, because it got built into the model as if it were normal, simply because we didn’t know any better. There are estimation techniques, and some papers are referenced at the end for those interested. Here we are interested in an overall strategy to consider these problems in the first place.

Since we must estimate the model, then attempt to identify interventions and classify them, a sensible solution is a probabilistic approach. That is, the modelling is done via some (hopefully robust) method X, and then the potential intervention points are identified via another method Y. Behind the scenes, we ultimately want to classify each point as

1. nothing
2. one of the types of interventions describes above

and rank these possibilities according to the likelihood a particular point belongs in one of the 6 possibilities. The uncertainty is therefore demonstrated in a true probabilistic fashion, and a decision can be made taking into account all the possibilities for a particular intervention.

## Conclusion

In order to identify and solve business problems effectively, the terms must be rigidly defined. By looking at intervention analysis, we can approach very difficult problems with a more solid foundation, since the model and the interventions/outliers have unambiguous, formal definitions. Moreover, a “big picture” strategy for employment should be probabilistic, rather than threshold or classification-based. In this way, the uncertainty can be expressed fully, which allows for more informed decision-making.

## References

1. Grubbs, F. E. (February 1969). “Procedures for detecting outlying observations in samples”. Technometrics11 (1): 1–21. doi:10.1080/00401706.1969.10490657.
2. Maddala, G. S. (1992). “Outliers”Introduction to Econometrics (2nd ed.). New York: MacMillan. pp. 88–96 [p. 89]. ISBN 0-02-374545-2.
3. Joint Estimation of Model Parameters and Outlier Effects in Time Series. Chung Chen and Lon-Mu Liu. Journal of the American Statistical Association Vol. 88 , Iss. 421,1993
Defining An Anomaly

## Defining An Anomaly

Everyone speaks about anomaly detection and its importance, and for good reason. Excessive credit card charges can signal a stolen card. A large change in errors can signal an impending hard drive failure. Unusually low crop yields can signal a pest infestation.

In order to detect these strange occurrences, there are dozens of algorithms with plenty of data science buzzwords that attempt to solve this difficult problem; every company has one. Twitter has a “windowed seasonal variation on the standard ESD algorithm” that treats a data set a simply a collection of points. Others use “ARIMA forecasting” and then see if the most recent point lies “outside the confidence band for the forecast”. I can list a lot of other examples with plenty of acronyms and buzzwords that will make your head spin.

Probabilistic Ways to Represent the Lifetime of an Object

## Probabilistic Ways to Represent the Lifetime of an Object

Every item, system, person, or animal has a lifetime. For people and animals, we typically just measure the lifetime in years, but we have other options for items and systems. We can measure airplane reliability in flight hours (hours actually flown), or stress test a manufacturing tool in cycles. Regardless of the units we use, there are many things in common. We don’t know how long any item will “live” before it’s manufactured or deployed, so an item’s lifetime is a random variable. We wish to make decisions about manufacturing, warranties, or even purchasing by taking the reliability of an object into account.

We can represent each class of items (a brand of 100W lightbulbs, USR’s NS4 robots, etc) by a random variable for the lifetime. We’ll call it $Y$. Like any random variable, it has a probability distribution. There isn’t only one way to represent the distribution of $Y$. We can look at equivalent representations, each one useful for answering different types of questions. This article will run through a few of them and the uses by studying the theoretical lifetime distribution of USR’s famous NS4 robots.

Disclaimer: This example is derived from the fictonal company USR from Isaac Asimov’s I, Robot. This should not be construed to represent any real product. I’m sure I’m missing some other disclaimer notices, but assume the standard ones are here.

## Survivor Function

The survivor function is the most common way to study the lifetime of an item. Colloquially, this is the probability that the item survives past time $t$. We denote it by $S(t)$, and we can write mathematically that

$$S(t) = P(Y \geq t)$$

This equation can be given by a standard probability distribution (the exponential distribution is the most common) or other formula.

Example (Exponential NS4s)
Without having access to USR’s manufacturing data, let’s assume that the survivor function of the NS4 robot is given by $S(t) = e^{-t/3}$. Let’s also assume that $t$ is measured in years. What is the probability that a brand new NS4 lasts more than 5 years?

From the graph above, we can simply plug $t=5$ into the survivor function to get the answer to our question. The probability that the new NS4 survives longer than 5 years is

$$S(5) = e^{-5/3} \approx 0.189$$

We could use the survivor function to help NSR decide where to place the cutoff for warranty claims. Depending on the cost of either repairing the NS4 or replacing the NS4 with the NS5, we can backsolve to find out what $t$ would satisfy management. Suppose the cost function requires that the probability of surviving past the cutoff $t$ is 85%. Then we can use the survivor function to backsolve for $t$:

\begin{aligned}0.85 &= e^{-t/3} \\\ln(0.85) &= \frac{-t}{3} \\0.49 &\approx t\end{aligned}

Thus, we would set the warranty claims to be valid only for about the first half year after the NS4 is purchased.

Remark. Another way to judge an item is by looking at the shape of the survivor function. A steep decline like the one shown in the above graph tells us that the NS4 isn’t exactly the most reliable robot. Only about half of them survive two years.

### Conditional Survivor Function

For those who wish to dive into a little bit more math, we can dive into the conditional survivor function. This “spinoff” of the survivor function will tell us the probability of surviving past time $t$ when it is currently functioning at time $a$. The survivor function above assumes $t$ starts at 0; that is, the object is brand new. If we have bought a used NS4, or perhaps have been sending it to the grocery store for a while, then we need to take into account the fact that the NS4 has been operational for some time $a$.

We write the conditional survivor function $S_{Y|Y\geq a}(t)$ for some fixed $a$. We can use the famous Bayes formula to express this mathematically:

$$S_{Y|Y\geq a}(t) = P(Y \geq t | Y \geq a) = \frac{P(Y \geq t \text{ and } Y \geq a)}{P(Y \geq a)} = \frac{P(Y \geq t)}{P(Y\geq a)} = \frac{S(t)}{S(a)}$$

What this formula is basically saying is that the probability that the NS4 survives past $t$ given that it has already lived for $a$ years is given by $S(t)/S(a)$, and derived via Bayes’ formula.

Example
Suppose we bought a used NS4 that was 2 years old (and it is working now). What is the probability that this NS4 is still working more than 3 years from now?

We are looking for the probability that the NS4 is still operational after more than 5 years given that it has already been working for 2. So

$$S_{Y|Y\geq 2}(5) = \frac{S(5)}{S(2)} = \frac{e^{-5/3}}{e^{-2/3}} = \frac{1}{e} \approx 0.367$$

Thus, we only have a 36.7% chance of getting more than 3 years out of our used NS4. Perhaps we may want to consider haggling for a lower price…

## Cumulative Distribution Function (Cumulative Failure Probability)

This is the cumulative distribution function straight from basic probability, but we can add an additional interpretation in the context of reliability. Mathematically, the cumulative distribution function for a random variable $Y$ is the probability that the random variable is less than or equal to a fixed value $y$. Mathematically, we denote this by $F_{Y}(t) = P(Y \leq t)$.

You may recognize this as the “opposite” of the survival function ($S_{Y}(t) = P(Y \geq t)$). In probability, we call this the complement, and we can get from one event to its complement by noting that for an event $A$ and its complement $A^{c}$, $P(A) + P(A^{c}) = 1$. Thus, moving back to the survivor function and CDF, $S_{Y}(t) + F_{Y}(t) = 1$. Therefore, $F_{Y}(t) = 1-S_{Y}(t)$. With the NS4 example, the CDF is given by

$$F_{Y}(t) = 1-e^{-t/3}$$

The interpretation is exactly the opposite of the survivor function. The CDF gives us the probability that the NS4 will fail before time t.

Example
The probability that a new NS4 will fail before the 5th year is given by either $F_{Y}(t) = 1-e^{-5/3} = 1-0.189 = 0.821$.

## Hazard Function

The most common way to look at a lifetime distribution in reliability is called the hazard function. This is also called the failure rate in some circles, and gives a measure of risk. We’ll take a little bit of a dive into math to derive this, since the derivation is illuminating as to its interpretation.

The hazard function is denoted by $h(t)$. The question we want to answer here is the conditional probability that the item will fail in a time interval $[t, t+\Delta t]$, given that it has not occurred before. We want to know the probability of failure per unit time. So,

$$h(t)\Delta t = P(t \leq Y \leq t + \Delta t| Y \geq t).$$

We’re going to get into some calculus here, so this can be skipped if you’d rather not deal with this.

$$h(t) = \frac{P(t \leq Y \leq t + \Delta t| Y \geq t)}{\Delta t}= \frac{F'(t)}{S(t)\Delta t}$$

Now, if we let the interval length $\Delta t$ get smaller (to 0), we get the instantaneous failure rate.

$$h(t) = \frac{-S'(t)}{S(t)}$$

Remark Those sharp in calculus will notice via the chain rule that $h(t) = -\frac{d}{dt}\ln(S(t))$, so we can express the hazard function in terms of the survivor function.

Example.
We can directly derive the hazard rate for the NS4 population.

$$h(t) = -\frac{d}{dt}\ln(e^{-t/3}) = \frac{d}{dt}\frac{t}{3} = \frac{1}{3}$$

which is one failure every three years of operation.

The hazard function is commonly used in engineering maintenance to determine schedules for checks or component replacement. For example, the hazard function can be used to determine how many flight hours an Lockheed F-22 fighter jet can be operated before a certain component is at risk for failure and should be inspected for replacement.

There are other forms we can use to express the distribution of an object’s lifetime, but these are the most common. Another thing to note is that we can easily move from one form to another. They all represent the same thing–lifetime of a system, but in slightly different ways. We were able to make several different decisions about USR’s NS4 robots thanks to these different representations.

Stochastic Reliability of a Server under a Random Workload

## Abstract

Editor’s note: This paper is the first chapter from a PhD thesis published in 2016 by Rachel Traylor. We generalize a 2011 model from Cha and Lee that gave a closed form for the survival function of a server under a random workload, and with random service times. In this work, the constant stress assumption of Cha and Lee is relaxed and allowed to be random; that is, every request to the server brings a random stress or workload to the server for the duration of its stay until completion. The efficiency measure of a random stress server is defined as the long run average number of jobs completed per unit time, and provides a way to measure server performance.

## Introduction

There are many types of systems which can be dubbed servers, such as a retail checkout counter, a shipping company, a web server, or a customer service hotline. All of these systems have common general behavior. Requests or customers arrive via a stochastic process, the service times vary randomly, and each request stresses the server if only temporarily. A general stochastic model that describes the reliability of such a server can provide the necessary information for optimal resource allocation and efficient task scheduling, leading to significant cost savings for businesses and improved performance metrics [6]. Such topics have been studied in literature for several decades [1, 2, 3, 20].

Much attention was devoted to reliability principles that model software failures and bug fixes, starting with Jelinski and Moranda in 1972 [11]. The hazard function under this model shows the time between the i-th failure and the (+ 1)st failure. Littlewood (1980) [16] extended this initial reliability model for software by assuming differences in error size [12].

These models have been extended into software testing applications [4, 5, 19] and optimal software release times [7, 8, 17, 22]. The explosion of e-commerce and the resulting increase in internet traffic have led to the development of reliability models for Web applications. Heavy traffic can overload and crash a server; thus, various control policies for refreshing content and admission of page requests were created [10, 13, 15, 18, 23].

In particular, Cha and Lee (2011) [9] proposed a stochastic breakdown model for an unreliable web server whose requests arrive at random times according to a nonhomogenous Poisson process and bring a constant stress factor to the system that dissipates upon service completion. The authors provide a fairly general survival function under any service distribution $g_{W}(w)$, define server efficiency to measure performance, and illustrate a possible admission control policy due to an observed property of the server efficiency under a specific numerical example.

Thus far, no extensions of [9] have been proposed. This work generalizes the model put forth by Cha and Lee in a variety of ways. First, the assumption of constant job stress is relaxed and replaced by a random variable, and a new survival function and efficiency equation are derived. This work, while suitable for IT applications, is general enough for use in almost any industry, including logistics, retail, manufacturing, and engineering systems.

## Background (Cha and Lee’s Model)

### System Description and Survival Function

Cha and Lee considered a web server wherein each request arrives via a nonhomogenous Poisson process $\{N(t): t \geq 0\}$ with intensity function $\lambda(t)$. Each request adds a constant stress $\eta$, increasing the breakdown rate for the duration of service. Suppose $r_{0}(t)$ is the breakdown rate of the idle server. Then the breakdown rate process $\mathscr{B}(t)$ is defined as

$$\mathscr{B}(t) := r_{0}(t) + \eta\sum_{j=1}^{N(t)}\mathbb{1}(T_{j} \leq t \leq T_{j} + W_{j})$$

where $N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)}$ are the random variables that describe the number of arrivals, arrival times, and service times, respectively. It is assumed that $\{T_{j}\}_{j=1}^{N(t)}$ are independent of each other. Furthermore, the $\{W_{j}\}_{j=1}^{N(t)} \sim g_{W}(w)$ are i.i.d. and are mutually independent of all $T_{j}$‘s.

Under these conditions, Cha and Lee proved the following theorem that gives the explicit form of the survival function under this model:

Theorem.

Suppose that $\{N(t), t \geq 0\}$ is a nonhomogenous Poisson process with intensity function $\lambda(t)$, i.e. $m(t) \equiv \int_{0}^{t}\lambda(x)dx$. Assuming the conditional survival function is given by

$$P\left(Y > t \left| N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)}\right.\right) = \bar{F}_{0}(t)\exp\left(-\eta\sum_{j=1}^{N(t)}\min(W_{j}, t-T_{j})\right)$$

and $m(t)$ has an inverse, the survival function of $Y$ is given by

$$S_{Y}(t) = \bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}\exp(-\eta w)\bar{G}_{W}(w)m(t-w)dw\right)$$

and the hazard function of $Y$, denoted $r(t)$, is given by

$$r(t) = r_{0}(t) + \eta\int_{0}^{t}e^{-\eta w}\bar{G}_{W}(w)\lambda(t-w)dw$$

### Efficiency of the Server

It is natural to develop some measure of server performance. Cha and Lee measure such performance by defining the efficiency, $\psi$, of the web server as the long-run expected number of jobs completed per unit time. That is, with the number of jobs completed as $M$, the efficiency is defined

$$\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}$$

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted $\{N^{*}(t), t \geq 0\}$, is a nonhomogenous Poisson process with the same intensity function $\lambda(t)$ as before, and that $\{N^{*}(t), t \geq 0\}$ is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution $H(t)$ with expected value $\nu$. This process is a renewal reward process, with the renewal $\{R_{n}\} = \{M_{n}\}$, the number of jobs completed. The length of a renewal cycle is $Y_{n} + H_{n}$, where $Y_{n}$ is the length of time the server was operational, and $H_{n}$ is the time to reboot after a server crash. Then, by [21],

$$\psi = \frac{E[M]}{E[Y]+ \nu}$$

where $M$ is the number of jobs completed in a particular renewal cycle, $\nu$ is the mean time to reboot of the server, and $Y$ is the length of a particular renewal cycle. Then, using the definition of efficiency, the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.

Theorem. (Efficiency of the Constant Stress Server)
Suppose $\{N(t), t \geq 0\}$ is a nonhomogenous Poisson process with intensity $\lambda(t)\geq 0$. Then the efficiency is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\kern{10em}\biggl.\times\left(r_{0}(t)a(t) + \eta a(t)b(t)\right)\biggr]\end{aligned}

where $a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv$, $b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr$, $\bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds$, and $m(x) = \int_{0}^{x}\lambda(s)ds$.

### Numerical Examples and Control Policies

As an illustrative example, Cha and Lee considered the case when $\lambda(t) \equiv \lambda$,
$r_{0}(t) \equiv r_{0} = 0.2$, $\eta = 0.01$, $\nu = 1$, and $g_{W}(w) = we^{-w^{2}/2}$ (the PDF of the Rayleigh distribution). As shown in the figure above, there exists a $\lambda^{*}$ such that $\psi(\lambda)$ is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

1. If the real time arrival rate $\lambda < \lambda^{*}$, do not interfere with arrivals.
2. If $\lambda \geq \lambda^{*}$, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability $1-\frac{\lambda^{*}}{\lambda}$. The next section presents a generalization of the above model by relaxing the assumption that the job stress $\eta$ is constant. This yields a far more generalized model that can encompass random stress with any distribution.

## Random Stress Reliability Model

The basic model here is very similar to Cha and Lee’s. The main difference is that we relax the constant stress assumption and allow it to be random.

Assume that each job $j$ coming into the server adds a random stress $\mathcal{H}_{j}$ to the server for the duration of its time in the system. Suppose $\{\mathcal{H}_{j}\}_{j=1}^{N(t)} \sim \mathcal{H}$, and are i.i.d. where WLOG $\mathcal{H}$ is a discrete random variable with a finite sample space $S = \{\eta_{i}: \eta_{i} \in \mathbb{R}^{+}, i = 1,...,m \text{ for } m \in \mathbb{N}\}$ and probability distribution given by

$$P(\mathcal{H} = \eta_{i}) = p_{i}, \quad i = 1,\ldots,m$$

The following assumptions from Cha and Lee are retained:

1. (CL1) Requests arrive via a nonhomogenous Poisson Process $\{N(t), t \geq 0\}$ with intensity $\lambda(t)$.
2. (CL2) Arrival times $\{T_{j}\}_{j=1}^{N(t)}$ are independent.
3. (CL3) Service times $\{W_{j}\}_{j=1}^{N(t)}$ are i.i.d. with pdf $g_{W}(w)$ and mutually independent of all arrival times.

Then, the random stress breakdown rate (RSBR) process $\mathscr{B}(t)$ is given by

$$\mathscr{B}(t) = r_{0}(t) + \sum_{j=1}^{N(t)}\mathcal{H}_{j}\mathbb{1}(T_{j} < t \leq T_{j} + W_{j}), \quad t \geq 0$$

Compare the sample trajectory shown in the figure above to the sample trajectory of breakdown rate under the original model (see page 2). The random stress brought by each request still disappears upon job completion, but the effect on the server is no longer deterministic. Thus, for the same set of arrival times and respective completion times, the realization of the breakdown rate process under the RSBR model has one more element of variation.

Let $Y$ be the random time to breakdown of the web server given the workload from client requests. Let $\mathfrak{T} = \{T_{j}\}_{j=1}^{N(t)}$, $\mathfrak{W} = \{W_{j}\}_{j=1}^{N(t)}$, and $\mathfrak{H} = \{\mathcal{H}_{j}\}_{j=1}^{N(t)}$, with observed values $\mathfrak{t} = \{t_{j}\}_{j=1}^{N(t)}$, $\mathfrak{w} = \{w_{j}\}_{j=1}^{N(t)}$, and $\mathfrak{h} = \{\eta_{i_{j}}\}_{i_{j}=1}^{N(t)}$. Then the conditional survival function of the server, given the arrival process of client requests ($N(t)$), job stresses ($\mathfrak{H}$), service times ($\mathfrak{W}$), and arrival times ($\mathfrak{T}$) is

\begin{aligned}S_{\scriptstyle{Y|N(t), \mathfrak{T}, \mathfrak{W}, \mathfrak{H} }}(t|n, \mathfrak{t}, \mathfrak{w}, \mathfrak{h}) &=e^{ -\int_{0}^{t}\mathscr{B}(x)dx}\\&= \bar{F}_{0}(t)e^{-\sum_{j=1}^{N(t)} \mathcal{H}_{i}\min(W_{j}, t-T_{j})}\end{aligned}

where $\bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(x)dx\right)$.

### Survival Function for the RSBR Server

Under the RSBR generalization, the survival function of the server is given in the following theorem.

Theorem 1. (Survival Function for a Random Stress Server)
Suppose that jobs arrive to a server according to a nonhomogenous Poisson process
$\{N(t), t \geq 0\}$ with intensity function $\lambda(t) \geq 0$ and $m(t) \equiv E[N(t)] = \int_{0}^{t}\lambda(x)dx$. Let the arrival times $\{T_{j}\}_{j=1}^{N(t)}$ be independent, and let the service times $\{W_{j}\}_{j=1}^{N(t)} g_{W}(w)$ be i.i.d. and mutually independent of all arrival times. Assume the random job stresses $\mathcal{H}_{j} \sim \mathcal{H}$. Then

$$S_{Y}(t) = \bar{F}_{0}(t)\exp\left(- E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]\right)$$

where $\bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(s)ds\right)$.

The compound failure rate function (or hazard function) $r(t)$ is given by

$$r(t) = -\frac{d}{dt}\ln(S_{Y}(t)) = r_{0}(t) + E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]$$

See [14] or [21].

## Efficiency Measure of a Server under RSBR

Upon server crash, the server must be rebooted. This section gives the server efficiency as defined in [9], but for the RSBR model. The server efficiency given in Theorem 1 holds quite a few similarities to that of [9], except that, like Theorem 1, the expectation must be taken over the random stress variable. The following assumptions from the original model are retained:

1. (E1) The arrival process after rebooting, $\{N^{rb}(t), t \geq 0\}$, remains a nonhomogenous Poisson process with the same intensity function $\lambda(t), t \geq 0$ as before.
2. (E2) $\{N^{rb}(t), t \geq 0\}$ is independent of the arrival process of client requests before rebooting. Hence, $\{N^{rb}(t), t \geq 0\} = \{N(t), t \geq 0\}$, since it retains all the same characteristics as before.
3. (E3) The time to reboot the server follows a continuous distribution $H(t)$ with mean $\nu$.

Recall that $M(t)$ is defined as the total number of jobs completed by the server during the time $(0,t]$. Also, recall the definition of server efficiency from [9]:

$$\psi \equiv \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}$$

The efficiency of the server under a random stress environment is given in the following theorem.

Theorem (Server Efficiency under Random Stress Environment)
Suppose that $\{N(t): t \geq 0\}$ is a nonhomogenous Poisson process with intensity function $\lambda(t), t \geq 0$. Suppose also the conditions of Theorem 1 and the conditions (E1)-(E3) are met. Then the efficiency of the server is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\\&\kern{2em}\times\left\{\int_{0}^{\infty}e^{-\int_{0}^{t}r_{0}(x)-\int_{0}^{t}\lambda(x)dx+E_{\mathcal{H}}[a(t)+b(t)]}(r_{0}(t)E_{\mathcal{H}}[a(t)] + E_{\mathcal{H}}[\mathcal{H} a(t)b(t)])dt \right\}\end{aligned}

where

$$a(t) = \int_{0}^{t}e^{-\mathcal{H} v}g_{W}(v)m(t-v)dv$$

and

$$b(t) = \int_{0}^{t}e^{-\mathcal{H}(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr.$$

## Remarks and Implications

Note that $\mathcal{H}$ was assumed discrete, but the proofs of Theorems 1 and 2 are unaffected if $\mathcal{H}$ is continuous. Thus the generality of this model is significantly stronger than in [9]. In practical considerations, these integrals will need to be numerically evaluated.

For certain distributions of $\mathcal{H}$, $S_{Y}(t)$ has a fairly compact form. Subsequent installments will examine the case where $\mathcal{H}$ has a binomial distribution, formed from both independent and dependent Bernoulli trials. We will also explore the effects of various service life distributions on $\psi$.

## Conclusion

This paper has given a simple but large generalization of [9]. The random stress model allows for the reality that we likely will not know the stress a request will bring to the server, but may know its distribution. Subsequent installments will further study the properties of the RSBR model and extend it to other single-server models, and networks.

Vertical Dependency in Sequences of Categorical Random Variables

## Abstract

This paper develops a more general theory of sequences of dependent categorical random variables, extending the works of Korzeniowski (2013) and Traylor (2017) that studied first-kind dependency in sequences of Bernoulli and categorical random variables, respectively. A more natural form of dependency, sequential dependency, is defined and shown to retain the property of identically distributed but dependent elements in the sequence. The cross-covariance of sequentially dependent categorical random variables is proven to decrease exponentially in the dependency coefficient $\delta$ as the distance between the variables in the sequence increases. We then generalize the notion of vertical dependency to describe the relationship between a categorical random variable in a sequence and its predecessors, and define a class of generating functions for such dependency structures. The main result of the paper is that any sequence of dependent categorical random variables generated from a function in the class $\mathscr{C}_{\delta}$ that is dependency continuous yields identically distributed but dependent random variables. Finally, a graphical interpretation is given and several examples from the generalized vertical dependency class are illustrated.

## Introduction

Many statistical tools and distributions rely on the independence of a sequence or set of random variables. The sum of independent Bernoulli random variables yields a binomial random variable. More generally, the sum of independent categorical random variables yields a multinomial random variable. Independent Bernoulli trials also form the basis of the geometric and negative binomial distributions, though the focus is on the number of failures before the first (or rth success). [2] In data science, linear regression relies on independent and identically distributed (i.i.d.) error terms, just to name a few examples.

The necessity of independence filters throughout statistics and data science, although real data rarely is actually independent. Transformations of data to reduce multicollinearity (such as principal component analysis) are commonly used before applying predictive models that require assumptions of independence. This paper aims to continue building a formal foundation of dependency among sequences of random variables in order to extend these into generalized distributions that do not rely on mutual independence in order to better model the complex nature of real data. We build on the works of Korzeniowski [1] and Traylor [3] who both studied first-kind (FK) dependence for Bernoulli and categorical random variables, respectively, in order to define a general class of functions that generate dependent sequences of categorical random variables.

The next section gives a brief review of the original work by Korzeniowski [1] and Traylor [3]. In Section 2, a new dependency structure, sequential dependency is introduced, and the cross-covariance matrix for two sequentially dependent categorical random variables is derived. Sequentially dependent categorical random variables are identically distributed but dependent. Section 3.1 generalized the notion of vertical dependency structures into a class that encapsulates both the first-kind (FK) dependence of Korzeniowski [1] and Traylor [3] and shows that all such sequences of dependent random variables are identically distributed. We also provide a graphical interpretation and illustrations of several examples of vertical dependency structures.

We repeat a section from A Generalized Multinomial Distribution from Dependent Categorical Random Variables  in order to give a review of the original first-kind (FK) dependency created by Korzeniowski. For the full details, see the previous work on the topic.

Korzeniowski defined the notion of dependence in a way we will refer to here as dependence of the first kind (FK dependence). Suppose $(\epsilon_{1},...,\epsilon_{N})$ is a sequence of Bernoulli random variables, and $P(\epsilon_{1} = 1) = p$. Then, for $\epsilon_{i}, i \geq 2$, we weight the probability of each binary outcome toward the outcome of $\epsilon_{1}$, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let $0 \leq \delta \leq 1$, and $q = 1-p$. Then define the following quantities
\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q &\qquad p^{-} :=P(\epsilon_{i} = 0 | \epsilon_{1} = 1) = q -\delta q\\q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = p-\delta p&\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q + \delta p\end{aligned}

Given the outcome $i$ of $\epsilon_{1}$, the probability of outcome $i$ occurring in the subsequent Bernoulli variables $\epsilon_{2}, \epsilon_{3},..., \epsilon_{n}$ is $p^{+}, i = 1$ or $q^{+}, i=0$. The probability of the opposite outcome is then decreased to $q^{-}$ and $p^{-}$, respectively.

The figure above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, $P(\epsilon_{i} = 1) = p \quad\forall i$. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be
$$\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr}\delta, & i=1 \\\delta^{2}, &i \neq j, \quad i,j \geq 2\end{array}\right.$$

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution.

In the previous work, the concept of Bernoulli FK dependence was extended to categorical random variables. That is, given a sequence of categorical random variables with $K$ categories, $P(\epsilon_{1} = i) = p_{i}$, $i = 1,\ldots,K,$ \begin{aligned}P(\epsilon_{j} = i | \epsilon_{1} = i) &= p_{i}^{+} = p_{i} + \delta(1-p_{i});\\P(\epsilon_{j} = k | \epsilon_{1} = i) &= p_{k}^{-} = p_{k}-\delta p_{k}, i \neq k, \:\:k = 1,\ldots,K.\end{aligned}

Traylor proved that FK dependent categorical random variables remained identically distributed, and showed that the cross-covariance matrix of categorical random variables has the same structure as the correlation between FK dependent Bernoulli random variables. In addition, the concept of a generalized binomial distribution was extended to a generalized multinomial distribution.

In the next section, we will explore a different type of dependency structure, sequential dependency.

## Sequentially Dependent Random Variables

While FK dependence yielded some interesting results, a more realistic type of dependence is sequential dependence, where the outcome of a categorical random variable depends with coefficient δ on the outcome of the variable immediately preceeding it in the sequence. Put formally, if we let $\mathcal{F}_{n} = \{\epsilon_{1},\ldots,\epsilon_{n-1}\}$, then $P(\epsilon_{n}|\epsilon_{1},\ldots,\epsilon_{n-1}) = P(\epsilon_{n}|\epsilon_{n-1}) \neq P(\epsilon_{n})$. That is, $\epsilon_{n}$ only has direct dependence on the previous variable $\epsilon_{n-1}$. We keep the same weighting as for FK-dependence. That is,

$$\begin{array}{lr}P(\epsilon_{n} = j | \epsilon_{n-1} = j) = p_{j}^{+} = p_{j} + \delta(1-p_{j}),&\\ P(\epsilon_{n} = j | \epsilon_{n-1} = i) = p_{j}^{-} = p_{j}-\delta p_{j}, &\:\: j = 1,\ldots,K;\:\:i \neq j\end{array}$$ As a comparison, for FK dependence, $P(\epsilon_{n}|\epsilon_{1},\ldots,\epsilon_{n-1}) = P(\epsilon_{n}|\epsilon_{1}) \neq P(\epsilon_{n})$. That is, $\epsilon_{n}$ only has direct dependence on $\epsilon_{1}$, and

$$\begin{array}{lr}P(\epsilon_{n} = j | \epsilon_{1} = j) = p_{j}^{+} = p_{j} + \delta(1-p_{j}),&\\ P(\epsilon_{n} = j | \epsilon_{1} = i) = p_{j}^{-} = p_{j}-\delta p_{j},&\:\:j = 1,\ldots,K;\:\: i \neq j\end{array}$$ Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{n})$ be a sequence of categorical random variables of length n (either independent or dependent) where the number of categories for all $\epsilon_{i}$ is K. Denote $\Omega_{n}^{K}$ as the sample space of this random sequence. For example,

$$\Omega_{3}^{3} = \{(1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2),\ldots,(3,3,1),(3,3,2),(3,3,3)\}$$

Dependency structures like FK-dependence and sequential dependence change the probability of a sequence $\epsilon$ of length n taking a particular $\omega = (\omega_{1},\ldots,\omega_{n}) \in \Omega_{n}^{K}$. The probability of a particular $\omega \in \Omega_{n}^{K}$ is given by the dependency structure. For example, if the variables are independent, $P((1,2,1)) = p_{1}^{2}p_{2}$. Under FK-dependence, $P((1,2,1)) = p_{1}p_{2}^{-}p_{1}^{+}$, and under sequential dependence, $P((1,2,1)) = p_{1}p_{2}^{-}p_{1}^{-}$. See Figures 2 and 3 for a comparison of the probability mass flows of sequential dependence and FK dependence. Sequentially dependent sequences of categorical random variables remain identically distributed but dependent, just like FK-dependent sequences. That is,

Lemma 1

Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{n})$ be a sequentially dependent categorical sequence of length n with K categories. Then

$$P(\epsilon_{j} = i) = p_{i}; \qquad i = 1,\ldots,K;\:\:j = 1,\ldots,n;\:\: n \in \mathbb{N}$$

## Cross-Covariance Matrix

The $K \times K$ cross-covariance matrix $\Lambda^{m,n}$ of $\epsilon_{m}$ and $\epsilon_{n}$ in a sequentially dependent categorical sequence has entries $\Lambda_{i,j}^{m,n}$ that give the cross-covariance $\text{Cov}([\epsilon_{m} = i]$, $[\epsilon_{n} = j])$, where $[\cdot]$ denotes an Iverson bracket. In the FK-dependent case, the entries of $\Lambda^{m,n}$ are given by

$$\Lambda^{1,n}_{ij}=\left\{\begin{array}{lc}\delta p_{i}(1-p_{i}), & i = j \\-\delta p_{i}p_{j}, & i \neq j,\end{array}\right.\quad\,n \geq 2,$$

and

$$\Lambda^{m,n}_{ij}=\left\{\begin{array}{lc}\delta^{2}p_{i}(1-p_{i}), & i = j \\-\delta^{2}p_{i}p_{j}, & i \neq j,\end{array}\right. \quad\,n > m; m\neq 1$$

Thus, the cross covariance between any two $\epsilon_{m}$ and $\epsilon_{n}$ in the FK-dependent case is never smaller than δ 2 times the independent cross-covariance. In the sequentially dependent case, the cross-covariances of $\epsilon_{m}$ and $\epsilon_{n}$ decrease in powers of δ as the distance between the two variables in the sequence increases.

Theorem 1 (Cross-Covariance of Dependent Categorical Random Variables).

Denote $\Lambda^{m,n}$ as the $K \times K$ cross-covariance matrix of $\epsilon_{m}$ and $\epsilon_{n}$ in a sequentially dependent sequence of categorical random variables of length N, $m \leq n$, and $n \leq N$, defined as $\Lambda^{m,n} = E[(\epsilon_{m} - E[\epsilon_{m}])(\epsilon_{n} - E[\epsilon_{n}])]$. Then the entries of the matrix are given by

$$\Lambda^{m,n}_{ij} = \left\{\begin{array}{lc}\delta^{n-m}p_{i}(1-p_{i}), & i = j \\-\delta^{n-m} p_{i}p_{j}, & i \neq j.\end{array}\right.$$

The pairwise covariance between two Bernoulli variables in a sequentially dependent sequence is given in the following corollary

Corollary 1

Denote $P(\epsilon_{i} = 1) = p$; $i = 1,\ldots,n$, and let $q=1-p$. Under sequential dependence,
$$\text{Cov}(\epsilon_{m}, \epsilon_{n}) = pq\delta^{n-m}.$$

We give some examples to illustrate.

Example 1 (Bernoulli Random Variables).

If we want to find the covariance between $\epsilon_{2}$ and $\epsilon_{3}$, then we note that the set $S = \{\omega \in \Omega_{3}^{2}\,:\,\omega_{2} = 1, \omega_{3} = 1\}$ is given by $S = \{(1,1,1),(0,1,1)\}$. $P(\epsilon_{2} = 1, \epsilon_{3} = 1) = P(S)$. Thus,

$$\begin{array}{rl}\text{Cov}(\epsilon_{2},\epsilon_{3})&=P(\epsilon_{2} = 1, \epsilon_{3} = 1)-P(\epsilon_{2} = 1)P(\epsilon_{3} = 1)\\&=pp^{+}p^{+} + qp^{-}p^{+}-p^{2}\\&= p^{+}(pp^{+} + qp^{-})-p^{2}\\&= p(p^{+}-p)\\&= pq\delta\end{array}$$

Similarly,
$$\begin{array}{rl}\text{Cov}(\epsilon_{1}, \epsilon_{3}) &= P(\epsilon_{1} = 1, \epsilon_{3} = 1)-P(\epsilon_{1} = 1)P(\epsilon_{3} = 1)\\&= (pp^{+}p^{+} + pq^{-}p^{-})-p^{2} \\&= p((p+\delta q)^{2} + pq(1-\delta)^{2})-p^{2}\\&= p(p^{2} + pq+\delta^{2}q^{2} + \delta^{2}pq)-p^{2}\\&= p(p(p+q) + \delta^{2}q(q+p)- p^{2} \\&= p(p + \delta^{2}q)-p^{2}\\&= pq\delta^{2}\end{array}$$

Example 2 (Categorical Random Variables).

Suppose we have a sequence of categorical random variables, where K = 3. Then
$[\epsilon_{m} = i]$ is the Bernoulli random variable with the binary outcome of 1 if $\epsilon_{m} = i$ and 0 if not. Thus,

$$\text{Cov}([\epsilon_{m} = i], [\epsilon_{n} = j]) = P(\epsilon_{m} = i \vee \epsilon_{n}= j)-P(\epsilon_{m} = i)P(\epsilon_{n} = j).$$

We have shown that every $\epsilon_{n}$ in the sequence is identically distributed, so $P(\epsilon_{m} = i) = p_{i}$ and $P(\epsilon_{n} = j) = p_{j}$.

$$\begin{array}{rl}\text{Cov}([\epsilon_{2} = 1], [\epsilon_{3} = 1]) &=(p_{1}p_{1}^{+}p_{1}^{+} + p_{2}p_{1}^{-}p_{1}^{+} + p_{3}p_{1}^{-}p{1}^{+})-p_{1}^{2} \\&= p_{1}^{+}(p_{1}p_{1}^{+} + p_{2}p_{1}^{-} + p_{3}p_{1}^{-})-p_{1}^{2} \\&= p_{1}^{+}p_{1}-p_{1}^{2} \\&=\delta p_{1}(1-p_{1})\end{array}$$ $$\begin{array}{rl}\text{Cov}([\epsilon_{2} = 1], [\epsilon_{3} = 2]) &=(p_{1}p_{1}^{+}p_{2}^{-} + p_{2}p_{1}^{-}p_{2}^{-}+p_{3}p_{1}^{-}p_{2}^{-})-p_{1}p_{2} \\&= p_{2}^{-}(p_{1}p_{1}^{+} + p_{2}p_{1}^{-} + p_{3}p_{1}^{-})-p_{1}p_{2}\\&= p_{1}p_{2}^{-}-p_{1}p_{2} \\&=-\delta p_{1}p_{2}\end{array}$$

We may obtain the other entries of the matrix in a similar fashion. So, the cross-covariance matrix for a $\epsilon_{2}$ and $\epsilon_{3}$ with K = 3 categories is given by

$$\Lambda^{2,3}=\delta\begin{pmatrix}p_{1}(1-p_{1})&-p_{1}p_{2}&-p_{1}p_{3}\\-p_{1}p_{2} & p_{2}(1-p_{2}) &-p_{2}p_{3}\\-p_{1}p_{3} &-p_{2}p_{3} & p_{3}(1-p_{3})\end{pmatrix}$$

Note that if $\epsilon_{2}$ and $\epsilon_{3}$ are independent, then the cross-covariance matrix is all zeros, because δ = 0.

## Dependency Generators

Both FK-dependency and sequential dependency structures are particular examples of a class of vertical dependency structures. We denote the dependency of subsequent categorical random variables on a previous variable in the sequence as a vertical dependency. In this section, we define a class of functions that generate vertical dependency structures with the property that all the variables in the sequence are identically distributed but dependent.

### Vertical Dependency Generators Produce Identically Distributed Sequences

Define the set of functions

$$\mathscr{C}_{\delta} = \{\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N} : \alpha(n) \geq 1 \wedge \alpha(n) < n\:\: \forall\:n\}.$$

The mapping defines a direct dependency of n on $\alpha(n)$, denoted $n \stackrel{\delta}{\rightsquigarrow} \alpha(n)$. We have already defined the notion of direct dependency, but we formalize it here.

Definition. (Direct Dependency)

Let $\epsilon_{m}$, $\epsilon_{n}$ be two categorical random variables in a sequence, where $m < n$. We say $\epsilon_{n}$ has a direct dependency on $\epsilon_{m}$, denoted $\epsilon_{n} \rightsquigarrow_{\delta} \epsilon_{m}$, if $P(\epsilon_{n}|\epsilon_{n-1},\ldots,\epsilon_{1}) = P(\epsilon_{n}|\epsilon_{m})$.

Example 3.
For FK-dependence, $\alpha(n) \equiv 1$. That is, for any $n, \epsilon_{n} \rightsquigarrow_{\delta} \epsilon_{1}$

Example 4.
The function $\alpha(n) = n-1$ generates the sequential dependency structure.

We now define the notion of dependency continuity.

Definition. (Dependency Continuity)
A function $\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N}$ is dependency continuous if $\forall n \exists j \in \{1,...,n-1\}$ such that $\alpha(n) = j$

We require that the functions in $\mathscr{C}_{\delta}$ be dependency continuous. Thus, the formal definition of the class of dependency generators $\mathscr{C}_{\delta}$ is

Definition. (Dependency Generator)
We define the set of functions $\mathscr{C}_{\delta} = \{\alpha : \mathbb{N}_{\geq 2} \to \mathbb{N}\}$ such that

•  $\alpha(n) < n$
• $\forall n \exists j \in \{1,...,n-1\} : \alpha(n) = j$

and refer to this class as dependency generators.

Each function in $\mathscr{C}_{\delta}$ generates a unique dependency structure for sequences of dependent categorical random variables where the individual elements of the sequence remain identically distributed.

Theorem 2. (Identical Distribution of Sequences Generated by Dependency Generators)
Let $\alpha \in \mathscr{C}_{\delta}$. Then for any $n\in \mathbb{N}, n \geq 2$, the dependent categorical random sequence generated by $\alpha$ has identically distributed elements.

### Graphical Interpretation and Illustrations

We may visualize the dependency structures generated by $\alpha \in \mathscr{C}_{\delta}$ via directed dependency graphs. Each $\epsilon_{n}$ represents a vertex in the graph, and a directed edge connects $n$ to $\alpha(n)$ to represent the direct dependency generated by $\alpha$. This section illustrates some examples and gives a graphical interpretation of the result in Theorem 2.

First-Kind Dependency

For FK-dependency, $\alpha(n) \equiv 1$ generates the graph above. Each $\epsilon_{i}$ depends directly on $\epsilon_{1}$, and thus we see no connections between any other vertices $i, j$ where $j \neq 1$. There are $n-1$ separate subsequences of length 2 in this graph.

Sequential Dependency

$\alpha(n) = n-1$ generates the sequential dependency structure this work has studied in detail. We can see that a path exists from any $n$ back to 1. This  is a visual way to see the result of Theorem2,  in that if a path exists from any $n$ back to 1, then the variables in that path must be identically distributed. Here, there is only one sequence and no subsequences.

A Monotonic Example

Here, $\alpha(n) = \lfloor \sqrt{n}\rfloor$ gives an example of a monotonic function in $\mathscr{C}_{\delta}$ and the dependency structure it generates. Again, note that any $n$ has a path to 1, and the number of subsequences is between 1 and $n-1$.

A Nonmonotonic Example

Here, $\alpha(n)=\left\lfloor\frac{\sqrt{n}}{2}\left(\sin(n)\right)+\frac{n}{2}\right\rfloor$ illustrates a more contrived example where the function is nonmonotonic. It is neither increasing nor decreasing. The previous examples have all been nondecreasing.

A Prime Example

Let $\alpha \in \mathscr{C}_{\delta}$ be defined in the following way. Let $p_{m}$ be the $m$th prime number, and let $\{kp_{m}\}$ be the set of all postive integer multiples of $p_{m}$. Then the set $\mathscr{P}_{m} = \{kp_{m}\} \setminus \cup_{i=1}^{m-1}\{kp_{i}\}$ gives a disjoint partitioning of $\mathbb{N}$. That is, $\mathbb{N} = \sqcup_{m}\mathscr{P}_{m}$, and thus for any $n \in \mathbb{N}, n \in \mathscr{P}_{m}$ for exactly one $m$. Now let $\alpha(n) = m$. Thus, the function is well-defined. We may write $\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N}$ as $\alpha[\mathscr{P}_{m}] = \{m\}$. As an illustration,

\begin{aligned}\alpha[\{2k\}] &= \{1\} \\\alpha[\{3k\}\setminus \{2k\}] &= \{2\} \\\alpha[\{5k\}\setminus (\{2k\}\cup \{3k\})] &= \{3\} \\&\qquad\vdots\\\alpha[\{kp_{m}\}\setminus (\cup_{i=1}^{m-1}\{kp_{i}\})] &= \{m\}\end{aligned}

## Conclusion

This paper has extended the concept of dependent random sequences first put forth in the works of Korzeniowski [1] and Traylor [3] and developed a generalized class of vertical dependency structures. The sequential dependency structure was studied extensively, and a formula for the cross-covariance obtained. The class of dependency generators was defined and shown to always produce a unique dependency structure for any $\alpha \in \mathscr{C}_{\delta}$ in which the a sequence of categorical random variables under that $\alpha$ is identically distributed but dependent. We provided a graphical interpretation of this class, and illustrated with several key examples.

## References

1. Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science, pages 43–58, 2013.
2. Joseph McKean, Robert Hogg, and Allen Craig. Introduction to Mathematical Statistics. Prentice Hall, 6 edition.
3. Rachel Traylor. A generalized multinomial distribution from dependent categorical random variables. Academic Advances of the CTO, 2017.
Summation Chains of Sequences Part 3: Sequence Chains from Linear Functions

## Summation Chains of Sequences Part 3: Sequence Chains from Linear Functions

### Abstract

(Editor’s note:) This paper represents the third installment of a masters thesis by Jonathan Johnson. The first two can be found here and here. This paper continues the development of the theory of summation chains of sequences. Since summation chains are doubly infinite, it’s important to know how little information we actually need to define a chain. The linearity of the function rules that generates a summation chain helps to answer this question. The notion of uniquely completable is defined from the set of positions, and several important theorems are developed to determine when a set of positions is uniquely completeable.

### Introduction

(Editor’s note:) In Part 2, Johnson notes that summation chains can be generated by the function rule $T_{\Sigma}: M^{\infty} \to M^{\infty}$,
$$T_{\Sigma}(x_{1}, x_{2}, x_{3},\ldots) = (x_{1}, x_{1} + x_{2}, x_{1} + x_{2} + x_{3},\ldots).$$ where $M^{\infty}$ is a $\mathbb{Z}$-Module. This is a formal way to define the operator that generates the partial sums of a given sequence. Johnson proves in this installment that this function rule is a linear operator. The linearity of these function operators assists in determining how little information we need to know about the chain to be able to uniquely define it.

Johnson introduces the notion of the set of positions, which becomes the smallest amount of information that can define a summation chain under certain conditions.

### Chains Generated by Linear Functions

Since vector-valued chains are vector-valued functions, they inherit the scalar multiplication and addition operations defined for functions on vector spaces. The linearity of the summation chain function rule leads to many interesting and useful results. The first of these is the closure of the set of summation chains under linear operations. This result holds for any function rule that is a linear operator.

#### Lemma.

Let M be a an 1-dimensional vector space with scalar field F. The summation function rule $T_{\Sigma}$ defined in Remark 3.1 of [6] is a linear operator on $M^{\infty}$.

#### Proposition.

Let $T:V\to V$ be a function rule where V is a vector space with scalar field F.
$\mathcal{C}_T$ is closed under scalar multiplication and addition if and only if T is a linear operator.

#### Corollary 1.

Let $T:V\to V$ be a bijective linear operator on vector space V with scalar field F, then $\mathcal{C}_T$ is a subspace of the space of all functions from $\mathbb{Z}$ to V, $V^{\mathbb{Z}}$, $\Phi_{\Sigma[M]}$ is an isomorphism, and $\mathcal{C}_T\cong V$.

#### Corollary 2.

Let M be a an 1-dimensional vector space with scalar field F.  $\mathcal{C}_{\Sigma[M]}$ is a vector space, and $\Phi_{\Sigma[M]}$ is an isomorphism from $M^{\infty}$ to $\mathcal{C}_{\Sigma[M]}$.

### Sets of Positions and Completeability

Viewing the summation operator as a linear operator is useful for answering the question on how little information must be known to define a chain.

Definition. A set of positions, U, is a nonempty subset of $\mathbb{Z}\times\mathbb{N}$. Let $N\in\mathbb{N}, U_N=\{(m,n)\in U:n\leq N\}$

Definition. Let M be an 1-dimensional vector space with scalar field F. Let $T:M^{\infty}\to M^{\infty}$ be a bijective linear operator. A set of positions, U, is (uniquely) completable for if for every function $d:U\to M$ there exists a (unique) T-chain, c, such that $c(m,n)=d(m,n)$ for all $(m,n)\in U$.

To say that a set of positions U is uniquely compleatable for T is the same as saying that every T-chain can be defined by defining each of the positions in U.

#### Lemma

Let M be an 1-dimensional vector space with scalar field F. Let $T:M^{\infty}\to M^{\infty}$ be a bijective linear operator. Let U be a set of positions, U is completable for T if and only if every nonempty subset of U is completable for T.

#### Lemma

Let M be an 1-dimensional vector space with scalar field F. Let $T:M^{\infty}\to M^{\infty}$ be a bijective linear operator. Let U be a set of positions, and let V be a set of positions containing U. If U is uniquely completable for T and V is completable for T, then V is uniquely completable for T.

How much information is needed to define a summation chain? The summation function rule has the additional property that it can be represented by an infinite-dimensional lower-triangular matrix. The question of how much information is need to define a summation chain will be addressed by studying matrices of this form. First, an ordering of the elements of $\mathbb{Z}\times\mathbb{N}$ is defined in order to create a consistent indexing of sets of positions.

Definition. Let $(m_1,n_1)$ and $(m_2,n_2)$ be in $\mathbb{Z}\times\mathbb{N}$, define $(m_1,n_1)\leq(m_2,n_2)$ to be the dictionary ordering $n_1 or both $n_1=n_2$ and $m_1\leq m_2$.

Remark. This definition defines a total order on $\mathbb{Z}\times\mathbb{N}$.[7]

#### Lemma

Let U be a set of positions such that $|U_N|$ is finite for all natural numbers, N, then U is well-ordered by the ordering in the above definition.

For the remainder of this section, given a set of positions, U, the elements U are assumed to be indexed by the well-ordering induced by the above definition, and $(m_i,n_i)=u_i$ for all $u_i\in U$. Given a vector space M, a linear function rule $T:M^{\infty}\to M^{\infty}$, and a set of positions U, a linear function on $M^{\infty}$ can be defined that maps a sequence $x\in M^{\infty}$ to that values of the positions in U for $\Phi_T(x)$.

Definition. Let M be an 1-dimensional vector space with scalar field F. Let $T:M^{\infty}\to M^{\infty}$ be a bijective linear operator with a lower-triangular matrix. Let U be a set of positions such that the cardinality of $U_N$ is finite for all natural numbers, N. The completion function for T with  is defined as follows.$S^T_U:M^{\infty}\to M^{|U|}$ where for all $x\in M^{\infty}$,

$$S^T_U(x)=[(T^{m_i}(x))_{n_i}]^{|U|}_{i=1}$$

The remaining results in this section show how to use completion functions to determine the completeability of sets of positions.

#### Proposition

Let M be an 1-dimensional vector space with scalar field F. Let $T:M^{\infty}\to M^{\infty}$ be a bijective linear operator with a lower-triangular matrix. Let U be a set of positions such that the cardinality of $U_N$ is finite for all natural numbers, N.

1. U is completable for T if and only if $S^T_U$ is surjective.
2. U is uniquely completable for T if and only if $S^T_U$ is bijective.

Solving completion problems utilizes many properties of infinite matrices that can be referenced in Cooke [4]. The next proposition addresses sets of positions for which a completion function is not defined.

#### Proposition

Let M be an 1-dimensional vector space with scalar field F. Let U be a set of positions such that for a natural number, N, $N<|U_N|$, then U is not completable for any bijective linear operator $T:M^{\infty}\to M^{\infty}$ with a lower-triangular matrix.

#### Proposition

Let M be an 1-dimensional vector space with scalar field F. Let U be a set of positions such that for a natural number, N, $N = |U_N|$, then U is uniquely completable for any bijective linear operator $T:M^{\infty}\to M^{\infty}$ with a lower-triangular matrix.

### Examples of Determining Completeability

Remark. Let S, T be linear transformations from vector space V to vector space W, and let $P:W\to W$ be a bijective linear operator such that $S=PV$.
$$\dim\text{Im}(S)=\dim\text{Im}(T)$$ and S is bijective if and only if T is bijective.[4, 8]

Example 1.
$$\begin{array}{c|cccccccc}m\backslash n& 1 & 2 & 3 & 4 & 5 & 6 & 7 & \cdots\\\vdots & & & & & & &\\2& & & & & & &\\1 & & & \bigcirc & & \bigcirc & & \bigcirc\\0& & & \bigcirc & & \bigcirc & & \bigcirc\\-1& & & & & & &\\\vdots & & & & & & &\end{array}$$

Let $T:\mathbb{R}^{\infty}\to\mathbb{R}^{\infty}$ where for $(x_n)\in\mathbb{R}^{\infty}$, $T(x_n)=(y_n)$ defined as follows:
$$y_n= \left\{\begin{array}{ll}x_1& \text{ if } n=1\\x_3-x_1&\text{ if } n=3\\x_n&\text{ if } n\text{ is even}\\x_n-x_{n-3}&\text{ if } n>4 \text{ and } n \text{ is odd}\\\end{array}\right.$$ Let $U=\{(0,n),(1,n):n\in\mathbb{N}, n>2, n \text{ is odd }\}$.
$$\mathcal{M}(S^T_U)=\left( \begin{array}{cccccccc}0 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\1 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 1 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & 1 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\end{array}\right)$$

Let $A$ be the matrix row-equivalent to $\mathcal{M}(S^T_U)$ defined as follows.
\begin{aligned}A&=\left( \begin{array}{ccccccccc}-1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0& -1 & 1 & 0 & 0 & \cdots \\1 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & -1 & 1 & \cdots \\0 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\end{array}\right)\left( \begin{array}{cccccccc}0 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\1 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 1 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & 1 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{array}\right)\\&=Id\\&=\left( \begin{array}{cccccccc}0 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\1 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 1 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & 1 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\end{array}\right)\left( \begin{array}{ccccccccc}-1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & -1 & 1 & 0 & 0 & \cdots \\1 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & -1 & 1 & \cdots \\0 & 0 & 1 & 0 & 0 & 0 & 0 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots &\vdots\end{array}\right)\\\end{aligned} $S^T_U$ has an inverse so $S^T_U$ is bijective.Therefore, it is uniquely completable for T.

Example 2.
$$\begin{array}{c|cccccccc}m\backslash n& 1 & 2 & 3 & 4 & 5 & 6 & 7 & \cdots\\\vdots & & & & & & &\\2& & & & & & &\\1 & & \bigcirc &\bigcirc & & & & & \\0 & & & & \bigcirc & & & &\\-1& & & & \bigcirc & & & &\\2& & & & & & &\\\vdots & & & & & & &\end{array}$$

Let T be the summation operator on $\mathbb{R}^{\infty}$ with a set of positions $U=\{(1,2),(1,3),(-1,4),(0,4)\}$.

$$\mathcal{M}(S^T_U)=\left(\begin{array}{cccccccc}1 & 1 & 0 & 0 & 0 & 0 & \cdots\\1 & 1 & 1 & 0 & 0 & 0 & \cdots\\0 & 0 & -1 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & \cdots\end{array}\right)$$

Let $A$ be the matrix row-equivalent to $\mathcal{M}(S^T_U)$ defined as follows.

\begin{aligned}A&=\left(\begin{array}{cccc}1 & 0 & 0 & 0 \\-1 & 1 & 1 & -1 \\0 & 0 & -1 & 1 \\0 & 0 & 0 & 1 \end{array}\right)\left(\begin{array}{cccccccc}1 & 1 & 0 & 0 & 0 & 0 & \cdots\\1 & 1 & 1 & 0 & 0 & 0 & \cdots\\0 & 0 & -1 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & \cdots\end{array}\right)\\&=\left(\begin{array}{cccccccc}1 & 1 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 1 & 0 & 0 & 0 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & \cdots\end{array}\right)\end{aligned} $\dim\text{Im}(S^T_U)=\text{rank}(A)=3<4$ so $S^T_U$ is not surjective. Therefore, U is not completable for T.

Example 3.
$$\begin{array}{c|cccccccc}m\backslash n& 1 & 2 & 3 & 4 & 5 & 6 & 7 & \cdots\\\vdots & & & & & & &\\4& & & & & & & &\\3& & & & & &\bigcirc & &\\2 & & & & \bigcirc & & & &\\1 & & \bigcirc & & & & & & \\0&&&&&& & &\\-1 & & \bigcirc & & & & & &\\-2 & & & & \bigcirc & & & &\\-3& & & & & &\bigcirc & &\\-4& & & & & & & &\\\vdots & & & & & & &\end{array}$$

Let T be the summation operator on $\mathbb{R}^{\infty}$ with a set of positions $U=\{(n,2n),(-n,2n):n\in\mathbb{N}\}$.

$$\mathcal{M}(S^T_U)=\left(\begin{array}{ccccccccccc}1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\-1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\4 & 3 & 2 & 1 & 0 & 0 & 0 &\cdots & 0 & 0 & \cdots\\0 & 2 & -2 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\21 & 15 & 10 & 6 & 3 & 1 & 0 & \cdots & 0 & 0 & \cdots\\0 & 0 & -1 & 3 & -3 & 1 & 0 & \cdots & 0 & 0 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \\& & & & & & & \cdots & n & 1 & \cdots\\& & & & & & & \cdots & -n & 1 & \cdots\\& & & & & & & & \vdots & \vdots &\end{array}\right)\begin{array}{l}\\\\\\\\\\\\\\\leftarrow \text{ row } 2n-1\\\leftarrow\text{ row }2n\\\\\end{array}$$

Let $A$ be the matrix row-equivalent to $\mathcal{M}(S^T_U)$ defined as follows.

\begin{aligned}A&=\left(\begin{array}{cccccccc}1 & -1 & 0 & 0 & 0 & 0 & \cdots\\0 & 1 & 0 & 0 & 0 & 0 & \cdots\\0 & 0 & 1 & -1 & 0 & 0 & \cdots\\0 & 0 & 0 & 1 & 0 & 0 & \cdots\\0 & 0 & 0 & 0 & 1 & -1 & \cdots\\0 & 0 & 0 & 0 & 0 & 1 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots\end{array}\right)\mathcal{M}(S^T_U)\\&=\left( \begin{array}{ccccccccccc}2 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\-1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\4 & 1 & 4 & 0 & 0 & 0 & 0 &\cdots & 0 & 0 & \cdots\\0 & 2 & -2 & 1 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots\\21 & 15 & 11 & 3 & 6 & 0 & 0 & \cdots & 0 & 0 & \cdots\\0 & 0 & -1 & 3 & -3 & 1 & 0 & \cdots & 0 & 0 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \\& & & & & & & \cdots & 2n & 0 & \cdots\\& & & & & & & \cdots & -n & 1 & \cdots\\& & & & & & & & \vdots & \vdots & \end{array}\right)\\\end{aligned}

Since $A$ is lower-triangular with no zeros on the diagonal, $A$ is the matrix of a bijective linear operator. Since $A$ is equivalent to $\mathcal{M}(S^T_U)$, $S^T_U$ is bijective.
Therefore, $U$ is uniquely completable for $T$.

### Conclusion

(Editor’s Note): This installment examined sequence chains from linear functions with the goal of looking for the minimum amount of information necessary to uniquely define a summation chain. The set of positions was defined, and the notion of uniquely completable is based on this definition. Several elegant theorems gave the conditions under which the set of positions is uniquely completable. The fourth and final installment will discuss the notion of convergence for complex-valued summation chains.

### References

1. Apostol, T., 1974, Mathematical Analysis 2nd Edition, Addison-Wesley Publishing Company
2. Brin, M. and Stuck, G., 2002, Introduction to Dynamical Systems, Cambridge University Press
3. Cameron, P. J., 1992, Combinatorics: Topics, Techniques, Algorithms, Cambridge University Press
4. Cook, R. G., 2014, Infinite Matrices and Sequence Spaces, Dover Publications, Inc.
5. Johnson, J. (2017) Summation Chains of Sequences Part 1: Introduction, Generation, and Key Definitions. AACTO, Vol. 2, Issue 2.
6. Johnson, J. (2017) Summation Chains of Sequences Part 2: Relationships Between Sequences via Summation Chains AACTO, Vol. 2, Issue 2.
7. Lee, J., 2011, Introduction to Topological Manifolds, Springer Science+Business Media LLC
8. Leon, S. J., 2002, Linear Algebra with Applications 6th Edition, Prentice Hall, Inc.
9. Scheinerman, E. R., 1996, Invitation to Dynamical Systems, Dover Publications, Inc.
Summation Chains of Sequences Part 2: Relationships between Sequences via Summation Chains

## Summation Chains of Sequences Part 2: Relationships between Sequences via Summation Chains

### Abstract

Editor’s note: This paper represents the second installment of a masters thesis by Jonathan Johnson. This paper continues the development of the theory of summation chains of sequences. The concept of sum-related is defined: two sequences are sum-related if one sequence appears in the summation chain of the other. The main result is a theorem to determine if two given sequences are sum-related.

### Introduction

Editor’s note: Previously, the concept of a summation chain of sequences was defined. We reproduce the example and introduction here:

Given a complex-valued sequence $(a_n)^{\infty}_{n=1}$, the sequence of partial sums of $(a_n)$ is given by the sequence

$$(a_1,a_1+a_2,a_1+a_2+a_3,\ldots,\sum^n_{i=1}a_i,\ldots).$$

The sequence of differences of $(a_n)$ is given by the sequence

$$(a_1, a_2-a_1,a_3-a_2,\ldots,a_n-a_{n-1},\ldots).$$

The processes of finding the sequence of partial sums and finding the sequence of differences of a sequence are inverses of each other so every sequence is the sequence of differences of its sequence of partial sums and the sequence of partial sums of its sequence of differences. Every sequence has a unique sequence of partial sums and a unique sequence of differences so it is always possible to find the sequence of partial sums of the sequence of partial sums and repeat the process ad infinitum. Similarly, we can find the sequence of differences of the sequence of differences and repeat ad infinitum. The result is a doubly infinite sequence or “chain” of sequences where each sequence is the sequence of partial sums of the previous sequence and the sequence of differences of the following sequence.

Example
Let $a^{(0)}$ be the sequence defined by $a^{(0)}_n=(-1)^{n+1}$ for all $n\in\mathbb{N}$. For all integers $m>0$, let $a^{(m)}$ be the sequence of partial sums of $a^{(m-1)}$, and for all integers $m<0$, let $a^{(m)}$ be the sequence of differences of $a^{(m+1)}$.

$$\begin{array}{r|ccccccccc}\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\a^{(2)} & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & \cdots\\a^{(1)} & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & \cdots\\a^{(0)} & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & \cdots\\a^{(-1)} & 1 & -2 & 2 & -2 & 2 & -2 & 2 & -2 & \cdots\\a^{(-2)} & 1 & -3 & 4 & -4 & 4 & -4 & 4 & -4 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\\end{array}$$

Every sequence can be used to create a unique chain of sequences.

Editor’s note: This installment examines the relationship between sequences and their summation chains, and provides a way to determine whether two sequences are sum-related.

### Summation Chains and Sequences

#### Sum-Related Chains

Consider the following sequences with their $\Sigma$-chains.

$$s_1=(1,0,0,0,\ldots)$$ $$s_2=(1,1,1,1,\ldots)$$ $$s_3=(0,1,0,0,\ldots)$$ $\Phi_{\Sigma[\mathbb{R}]}(s_1) : \begin{array}{c|ccccc} \color{SteelBlue}{m} \backslash \color{LightSeaGreen}{n} & \color{LightSeaGreen}{1} &\color{LightSeaGreen}{2} & \color{LightSeaGreen}{3} & \color{LightSeaGreen}{4} &\color{LightSeaGreen}{\cdots}\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\color{SteelBlue}{2} & 1 & 2 & 3 & 4 & \cdots\\\color{SteelBlue}{1} & 1 & 1 & 1 & 1 & \cdots\\\color{SteelBlue}{0} & 1 & 0 & 0 & 0 & \cdots\\\color{SteelBlue}{-1} & 1 &-1 & 0 & 0 & \cdots\\\color{SteelBlue}{-2} & 1 & -2 & 1 & 0 & \cdots\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\end{array}\quad\Phi_{\Sigma[\mathbb{R}]}(s_2) : \begin{array}{c|ccccc} \color{SteelBlue}{m} \backslash \color{LightSeaGreen}{n} & \color{LightSeaGreen}{1} &\color{LightSeaGreen}{2} & \color{LightSeaGreen}{3} & \color{LightSeaGreen}{4} &\color{LightSeaGreen}{\cdots}\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\color{SteelBlue}{2} & 1 & 3 & 6 & 10& \cdots\\\color{SteelBlue}{1} & 1 & 2 & 3 & 4& \cdots\\\color{SteelBlue}{0} & 1 & 1 & 1 & 1 & \cdots\\\color{SteelBlue}{-1} & 1 & 0 & 0 & 0 & \cdots\\\color{SteelBlue}{-2} & 1 & -1 & 0 & 0 & \cdots\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\end{array}$ $$\Phi_{\Sigma[\mathbb{R}]}(s_3) : \begin{array}{c|ccccc} \color{SteelBlue}{m} \backslash \color{LightSeaGreen}{n} & \color{LightSeaGreen}{1} &\color{LightSeaGreen}{2} & \color{LightSeaGreen}{3} & \color{LightSeaGreen}{4} &\color{LightSeaGreen}{\cdots}\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\color{SteelBlue}{2} & 0 & 1 & 2 & 3 & \cdots\\\color{SteelBlue}{1} & 0 & 1 & 1 & 1 & \cdots\\\color{SteelBlue}{0} & 0 & 1 & 0 & 0 & \cdots\\\color{SteelBlue}{-1} & 0 &1 &-1 & 0 & \cdots\\\color{SteelBlue}{-2} & 0 & 1 & -2 & 1 & \cdots\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots &\\\end{array}$$

These chains, while distinct, contain the same pattern of numbers. The entries of $\Phi_{\Sigma[\mathbb{R}]}(s_2)$ are the entries of $\Phi_{\Sigma[\mathbb{R}]}(s_1)$ moved up.This occurs because $s_2$ is the sequence of partial sums of $s_1$. The entries of $\Phi_{\Sigma[\mathbb{R}]}(s_3)$ are the entries of $\Phi_{\Sigma[\mathbb{R}]}(s_1)$ with an extra column of zeros.

The remainder of this section focuses on these relationships.

Definition. Let M be a $\mathbb{Z}$-Module. Let $(a_n)$ and $(b_n)$ be sequences in M. $(a_n)$ is sum-related to $(b_n)$, denoted $(a_n)\sim_{\Sigma}(b_n)$, if $(a_n)$ is a sequence in $\Phi_{\Sigma} ((b_n))$.

If two sequences are sum-related, then one sequence can be obtained by finding the sequences of partial sums of the sequences of partial sums of the other sequence.

Remark. The trivial sequence $(0,0,0,\ldots)$ is sum-related only to itself.

#### Proposition.

Sum relation defines an equivalence class.

#### Proposition.

Let c and d be $\Sigma$-chains. If there exists $m\in \mathbb{Z}$ such that $c(m,n)=d(m,n)$, for all $n\in \mathbb{N}$, then c=d.

#### Proposition.

Let M be a $\mathbb{Z}$-Module. Let c be a $\Sigma$-chain in M. Let $d:\mathbb{Z}\times\mathbb{N}\to M$ be defined such that there exists $k\in \mathbb{Z}$ such that $d(m,n)=c(m+k,n)$ for every $m\in \mathbb{Z}$ and $n\in \mathbb{N}$, thend is a $\Sigma$-chain,and every sequence in c is sum-related to every sequence in d.

Definition. Let c be a $\Sigma$-chain, and let $k\in \mathbb{Z}$.The $\Sigma$-chain $c^{\,k}$ is defined for all integers m and natural numbers n, by $c^{\,k}(m,n)=c(m+k,n)$. $c^{\,k}$ is said to be the $\Sigma$-chain c shifted by k.

Remark. If $s_1$ and $s_2$ are sequences generating $\Sigma$-chains $c_1$ and $c_2$ respectively, then saying $s_1 \sim_{\Sigma} s_2$ is equivalent to saying $c_1=c_2^k$ for some integer k.

When do two sequences appear in the same summation chain? By the above definition, this is the same as asking when sequences are sum-related. Consider the sequence $(x_k)=(0,0,1,-1,1,-1,\ldots)$. The sequences sum-related to $(x_k)$ are the sequences that are in $\Phi_{\Sigma[\mathbb{R}]}((x_k))$.

$$\Phi_{\Sigma[\mathbb{R}]}((x_{k}) : \begin{array}{c|ccccccc} \color{SteelBlue}{m} \backslash \color{LightSeaGreen}{n} & \color{LightSeaGreen}{1} &\color{LightSeaGreen}{2} & \color{LightSeaGreen}{3} & \color{LightSeaGreen}{4} &\color{LightSeaGreen}{5} &\color{LightSeaGreen}{6} &\color{LightSeaGreen}{\cdots}\\\color{SteelBlue}\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &\\\color{SteelBlue}{2} & 0 & 0 & 1 & 1 & 2 & 2 & \cdots\\\color{SteelBlue}{1} & 0 & 0 & 1 & 0 & 1 & 0 & \cdots\\\color{SteelBlue}{0} & 0 & 0 & 1 & -1 & 1 & -1 & \cdots\\\color{SteelBlue}{-1} & 0 & 0 & 1 & -2 & 2 & -2 & \cdots\\\color{SteelBlue}{-2} & 0 & 0 & 1 & -3 & 4 & -4 & \cdots\\\color{SteelBlue}{\vdots} & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots &\\\end{array}$$

Notice that each sequence in the chain starts with two zeros followed by a one. The sequences $(0,1,10,20,30,\ldots)$ and $(0,0,4,5,6,\ldots)$ could not be sum-related to $(x_k)$ since they do not begin with this sequence. This strategy can eliminate most pairs of sequences.

#### Lemma.

Given two nontrivial sequences $(a_k)$ and $(b_k)$ such that $(a_k)\sim_{\Sigma}(b_k)$, define $N_a$ and $N_b$ to be the indices of the first nonzero terms of $(a_k)$ and $(b_k)$ respectively, then $N_a= N_b$, and $a_{N_a}=b_{N_b}$.

In general, after exploiting the above lemma, it is difficult to determine sum relation of sequences. It is not possible to exhaustively check every sequence in a chain. However, when every nonzero element in the $\mathbb{Z}$-Module has infinite order, the corollary from the previous post tells where to look for sum-related sequences.
Thus in this case, a closed form procedure for determining sum relation can be used.

#### Theorem.

Let M be a $\mathbb{Z}$-Module. such that the order of every nonzero element in M is infinite. Let $(a_n)$ and $(b_n)$ be sequences in M, and let $c_a=\Phi_{\Sigma}((a_n))$. Let $N_1$ be the index of the first nonzero term of $(a_n)$, and let $N_2$ be the index of the first nonzero term of $(b_{n})$. Thus, $(a_{n})\sim_{\Sigma}(b_{n})$ if and only if the following conditions hold.

1. $N_1=N_2$, if this is true let $N=N_1=N_2$
2. $a_N=b_N$,
3. There exists $m \in \mathbb{Z}$ such that $m\cdot a_N=b_{N+1}-a_{N+1}$,
4. $b_n=c_a(m,n), \forall n \in \mathbb{N}$.

Remark.  When the integer m exists in the theorem above, it is unique.

Example. Let $M=\mathbb{Z}$. Let $(a_n)=(1,-1,-1,1,0,0,0,0,0,\ldots)$ and $b_n=n^2$, $\forall n\in \mathbb{N}$. Let $c_a=\Phi_{\Sigma}((a_n))$.

The index of the first nonzero terms of both sequences is 1. Also, $a_1=b_1=1$ so (1) and (2) are satisfied. $b_2-a_2=5=5\cdot a_1$ so (3) holds. If there is a chain relation, $(b_n)$ will be the fifth sequence in $c_a$. (4) has already been shown for $n=1$ and $n=2$. Using formulas from the theorem, the computation for the other indices is as follows:

$$c_a(5,3)={6 \choose 4}(1)+{5 \choose 4}(-1)+{4 \choose 4}(-1)=15-5-1=9=b_3$$

For $n\geq 4$,
\begin{aligned}c_a(5,n)=&\sum^{n}_{k=1}{4+n-k \choose 4}a_k\\=&{n+3 \choose 4}(1)+{n+2 \choose 4}(-1)+{n+1 \choose 4}(-1)+{n \choose 4}(1)\\=&\frac{(n+3)(n+2)(n+1)n}{24}-\frac{(n+2)(n+1)n(n-1)}{24}\\&-\frac{(n+1)n(n-1)(n-2)}{24}+\frac{n(n-1)(n-2)(n-3)}{24}\\=&\, n^2\\=&\, b_n\\\end{aligned}

Therefore, by the theorem above, $(a_n)\sim_{\Sigma}(b_n)$.

Example
Let $M=\mathbb{R}$. Let $a_n=\frac{1}{n}$ and $b_n=n$, $\forall n\in \mathbb{N}$. Let $c_a=\Phi_{\Sigma}((a_n))$. The index of the first nonzero terms of both sequences is 1. Also, $a_1=b_1=1$ so (1) and (2) are satisfied.
However $b_2-a_2=1.5$ which cannot be obtained by multiplying an integer by $a_1$.
Therefore, $(a_n)\nsim_{\Sigma}(b_n)$.

Example

Let $M=\mathbb{Z}$. Let $a_n=n^2-3n+2$ and $b_n=-\frac{n^4+3n^3+14n^2-48n+32}{7}$, $\forall n\in \mathbb{N}$. Let $c_a=\Phi_{\Sigma}((a_n))$. The index of the first nonzero terms of both sequences is 3. Also, $a_3=b_3=2$ so (1) and (2) are satisfied.
$b_4-a_4=-6=-3\cdot a_3$ so (3) holds. If there is a chain relation, $(b_n)$ will be the third sequence down in $c_a$. However, $c_a(-3,5)=0$ while $b_5=-\frac{108}{7}$.
Therefore, $(a_n)\nsim_{\Sigma}(b_n)$.

Example
Let $M=\mathcal{Z}[X]$. Let $a_n=\sum^{n-1}_{i=0}(i+1)x^{n-i}$, $\forall n\in \mathbb{N}$, and let $b_1=x$ and $b_n=x^n-x^{n-1}$, $n\geq 2$. Let $c_a=\Phi_{\Sigma}((a_n))$. The index of the first nonzero terms of both sequences is 1.
Also, $a_1=b_1=x$ so (1) and (2) are satisfied. $b_2-a_2=(x^2-x)-(x^2+2x)=-3x=-3\cdot a_1$ so (3) holds. If there is a chain relation, $(b_n)$ will be the third sequence down in $c_a$.
(4) has already been shown for $n=1$ and $n=2$.
\begin{aligned}c_a(-3,3)&={3 \choose 1}(x)-{3 \choose 2}(x^2+2x)+{3 \choose 3}(x^3+2x^2+3x)\\&=3x+3(x^2+2x)+x^3+2x^2+3x\\&=x^3-x^2\\&=b_3\\\end{aligned}

For $n\geq 4$,

\begin{aligned}c_a(-3,n)=&\sum^{3}_{k=0}(-1)^{3-k}{3 \choose 3-k}a_{n-3+k}\\=&-{3 \choose 0}\sum^{n-4}_{i=0}(i+1)x^{n-i-3}+{3 \choose 1}\sum^{n-3}_{i=0}(i+1)x^{n-i-1}\\&-{3 \choose 2}\sum^{n-2}_{i=0}(i+1)x^{n-i-1}+{3 \choose 3}\sum^{n-1}_{i=0}(i+1)x^{n-i}\\=&-\sum^{n-1}_{i=3}(i-2)x^{n-i}+3\sum^{n-1}_{i=2}(i-1)x^{n-i}-3\sum^{n-1}_{i=1}ix^{n-i}\\ &\quad+\sum^{n-1}_{i=0}(i+1)x^{n-i}\\=&-\sum^{n-1}_{i=3}(i-2)x^{n-i}+3\sum^{n-1}_{i=3}(i-1)x^{n-i}-3\sum^{n-1}_{i=3}ix^{n-i}\\&+\sum^{n-1}_{i=3}(i+1)x^{n-i}+3x^{n-2}-3x^{n-1}-6x^{n-2}+x^n+2x^{n-1}+3x^{n-2}\\=&\sum^{n-1}_{i=3}[-(i-2)+3(i-1)-3i(i+1)]x^{n-i}+x^n-x^{n-1}\\=&\, x^n-x^{n-1}\\=&\, b_n\\\end{aligned} Therefore,$(a_n)\sim_{\Sigma}(b_n)$.

#### Chains of Sequences with Leading Zeros

Adding zeros to the beginning of a sequence adds columns of zeros to be beginning of the sequences summation chain.

#### Lemma.

Let c be a nontrivial $\Sigma$-chain in $M$. Let $N=\min\{n \in \mathbb{N}:c(0,n)\neq 0\}$. For every integer, m, and for every natural number, $n, $c(m,n)=0$.

#### Proposition.

Let M be a $\mathbb{Z}$-Module. Let c be a nontrivial $\Sigma$-chain in $M$. Let the natural number $N=\min\{n \in \mathbb{N}:c(0,n)\neq 0\}$. Let $d:\mathbb{Z}\times\mathbb{N}\to M$ be defined such that $d(m,n) = c(m,n+N+1)$ for every $m\in \mathbb{Z}$ and $n\in \mathbb{N}$. Then d is a $\Sigma$-chain.

#### Corollary

Let M be a $\mathbb{Z}$-Module, and let $(a_n)$ and $(b_n)$ be sequences in M, such that for a positive integer k, $b_{n+k}=a_n$ for all natural numbers n and $b_n=0$ when $n\leq k$. Let $c_a=\Phi_{\Sigma}((a_n))$ and $c_b=\Phi_{\Sigma}((a_b))$. For every integer m and natural number $n$, $c_b(m,n+k)=c_a(m,n)$, and $c_b(m,n)=0$ when $n< k$.

### Conclusion

Editor’s note: This paper represents the second installment of the concept of summation chains of complex valued sequences. Here, a new way to determine the relationship between two sequences was defined. The concept of sum-relation was defined, and a theorem provided a way to determine the relationship between two sequences.  Part 3 will discuss sequence chains generated by linear function rules.

### References

1. Apostol, T., 1974, Mathematical Analysis 2nd Edition, Addison-Wesley Publishing Company
2. Brin, M. and Stuck, G., 2002, Introduction to Dynamical Systems, Cambridge University Press
3. Cameron, P. J., 1992, Combinatorics: Topics, Techniques, Algorithms, Cambridge University Press
4. Cook, R. G., 2014, Infinite Matrices and Sequence Spaces, Dover Publications, Inc.
5. Johnson, J. (2017)Summation Chains of Sequences Part 1: Introduction, Generation, and Key Definitions. AACTO, Vol. 2, Issue 2.
6. Lee, J., 2011, Introduction to Topological Manifolds, Springer Science+Business Media LLC
7. Leon, S. J., 2002, Linear Algebra with Applications 6th Edition, Prentice Hall, Inc.Scheinerman, E. R., 1996, Invitation to Dynamical Systems, Dover Publications, Inc.
Summation Chains of Sequences Part 1: Introduction, Generation, and Key Definitions

## Abstract

(Editor’s note:) This paper represents the first installment of a masters thesis by Jonathan Johnson. This work introduces the notion of summation chains of sequences. It examines the sequence of sequences generated by partial sums and differences of terms in each level of the chain, looks at chains generated by functions, then introduces a formal definition and key formulae in the analysis of such chains.

## Introduction

Given a complex-valued sequence $(a_n)^{\infty}_{n=1}$, the sequence of partial sums of $(a_n)$ is given by the sequence $(a_1,a_1+a_2,a_1+a_2+a_3,\ldots,\sum^n_{i=1}a_i,\ldots)$. The sequence of differences of $(a_n)$ is given by the sequence $(a_1, a_2-a_1,a_3-a_2,\ldots,a_n-a_{n-1},\ldots)$.

The processes of finding the sequence of partial sums and finding the sequence of differences of a sequence are inverses of each other so every sequence is the sequence of differences of its sequence of partial sums and the sequence of partial sums of its sequence of differences. Every sequence has a unique sequence of partial sums and a unique sequence of differences so it is always possible to find the sequence of partial sums of the sequence of partial sums and repeat the process ad infinitum. Similarly, we can find the sequence of differences of the sequence of differences and repeat ad infinitum. The result is a doubly infinite sequence or  “chain” of sequences where each sequence is the sequence of partial sums of the previous sequence and the sequence of differences of the following sequence.

Example
Let $a^{(0)}$ be the sequence defined by $a^{(0)}_n=(-1)^{n+1}$ for all $n\in\mathbb{N}$.
For all integers $m>0$, let $a^{(m)}$ be the sequence of partial sums of $a^{(m-1)}$,
and for all integers $m<0$, let $a^{(m)}$ be the sequence of differences of $a^{(m+1)}$.

$$\begin{array}{r|ccccccccc}\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\a^{(2)} & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & \cdots\\a^{(1)} & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & \cdots\\a^{(0)} & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & \cdots\\a^{(-1)} & 1 & -2 & 2 & -2 & 2 & -2 & 2 & -2 & \cdots\\a^{(-2)} & 1 & -3 & 4 & -4 & 4 & -4 & 4 & -4 & \cdots\\\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\\end{array}$$

Every sequence can be used to create a unique chain of sequences. This paper studies properties of these chains and explores the relationship between sequences and the chains they create. In particular, the following questions are investigated:

• How can the sequences in a summation chain be computed quickly? Clearly, every sequence in a chain of sequences can be computed by repeatedly finding sequences of partial sums of sequences of partial sums or finding sequences of differences of sequences of differences. It is useful, however, to be able compute any sequence in a chain given the starting sequence, $a^{(0)}$, without having to compute all the sequences in between. Methods for computing chains are discussed on Page 3.
• When do two given sequences appear in the same summation chain? When two sequences appear in the same chain, one sequence can be obtained by repeatedly finding the sequences of partial sums of sequences of partial sums of the other sequence. This process could take a long time, and it is not able to determine if two sequences do not appear in the same chain. (Editor’s note: The next installment presents a process to determine with certainty whether or not two sequences are in the same chain.)
• How much information is needed to define a summation chain? Once a chain has been computed, it appears as an array of entries. Starting with a blank array, if some numbers are added to a blank array, can they be used to define the remaining entries uniquely? (Editor’s note: Later installments explore how much information in an array of numbers is needed to determine a chain.)
• How are the convergent behaviors of sequences in a summation chain related? Can every sequence in a chain diverge? Can every sequence in a chain converge? (Editor’s note: The final installment investigates the nature of the limits of sequences in a chain.)

## Chains Generated by Functions

The summation chains we have considered are doubly infinite sequences (of sequences) in which a fixed rule determines the chain of sequences recursively in both directions. The recursive rule in the backward direction is the inverse of the rule in the forward direction. Unlike single infinite sequences, in a chain there is no starting point to begin the recursive pattern. A sequence must be chosen to “start” the chain, and could be thought of as a “seed” of the chain. Starting with a given sequence, the next sequence in the chain can be obtained by applying a rule, and the previous sequence can be obtained by applying the rule’s inverse. However, this same pattern of sequences could be generated starting with any other sequence in the chain, together with the same generating rule. In order for this process to be well-defined, the rule used must have a well-defined inverse. The following definition captures abstractly what chains are. It also generalizes chains of sequences to chains of elements of more general sets.

Definition. Let $f:X\to X$ be a bijective function on a set X. We will call f a function rule. An f-chain is any function $c:\mathbb{Z} \to X$ such that $c(m)=f(c(m-1))$ for all $m \in \mathbb{Z}$. $c(0)\in X$ is called the seed of c, and c is generated by f with seed c(0). c(m) is called the mth term of the chain.

Each term of a chain is given uniquely as $c(m)=f^{m}(c(0))$.

Definition. Let f be a function rule on X. Define $\mathcal{C}_f$ to be the set of all f-chains. The f-chain seed mapping is the function $\Phi_f :X \to \mathcal{C}_f$ where $\Phi_f (x)$ is the chain generated by the seed x.

Remark. $\Phi_f$ is bijective.

Example.
Let $f:\mathbb{Z}_5 \to \mathbb{Z}_5$ where $f(n)=2n$.

This function has five chains generated by 0, 1, 2, 3, and 4. See below. Notice that the chains generated by 1, 2, 3, and 4 contain the same pattern differing only by a shift in starting point.

##### Chain Seeded by 0
$$\begin{array}{cccccccccc}m: & \cdots & -3 & -2 & -1 & 0 & 1 & 2 & 3 & \cdots \\ c(m): & \cdots & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \cdots \\\end{array}$$
##### Chain Seeded by 1
$$\begin{array}{cccccccccc}m: & \cdots & -3 & -2 & -1 & 0 & 1 & 2 & 3 & \cdots \\ c(m): & \cdots & 2 & 4 & 3 & 1 & 2 & 4 & 3 & \cdots \\\end{array}$$
##### Chain Seeded by 2
$$\begin{array}{cccccccccc}m: & \cdots & -3 & -2 & -1 & 0 & 1 & 2 & 3 & \cdots \\ c(m): & \cdots & 1 & 2 & 4 & 3 & 1 & 2 & 4 & \cdots \\\end{array}$$

Definition Let f be a function rule on X. Let $x_1$ and $x_2$ be in $X$. $x_1$ is f-chain related to $x_2$, denoted $x_1 \sim_{f} x_2$, if $x_1$ is in the image of the function chain $\Phi_f (x_2)$.

#### Proposition.

An f-chain relation is an equivalence relation.

This relation is useful to find relationships between elements in the set X induced by f. In the example above, 1, 2, 3, and 4 are all f-chain related while 0 is related only to itself. This section ends with a few notable properties of chains that are useful, in particular when dealing with summation chains.

#### Proposition.

Let $c_1$ and $c_2$ be f-chains. If for any $m \in \mathbb{Z}, c_1(m)=c_2(m)$, then $c_1=c_2$.

#### Proposition.

Let X be a set with rule f. Let c be an f-chain with co-domain X, and let function $d:\mathbb{Z}\to X$ be such that for some $k\in \mathbb{Z}$, $d(m)=c(m+k)$ for all $m \in \mathbb{Z}$, then d is an f-chain. Also, for every pair $x\in \text{Im}(c)$ and $y\in \text{Im}(d)$, $x \sim_{f} y$.

A more detained analysis of sequences generated by recursive funtions in Brin and Stuck (2). The remainder of this paper will focus on summation chains of sequences.

## Summation Chains of Sequences

### Key Definition and Formulas

Let M be a set, and let X be the set of sequences of elements in M. Given a function rule f on X, and an f-chain, c, in order to keep notation clean, new notation is defined as follows. c(m,n) := the nth term of the sequence c(m).

Using this notation we give a clean definition of a summation chain of sequences.

Definition. Let M be a $\mathbb{Z}$-Module. A summation chain, denoted $\Sigma$-chain, of is a function $c:\mathbb{Z} \times \mathbb{N} \to M$ where $c(m,n)=\sum_{k=1}^{n} c(m-1,k)$, for all $m\in \mathbb{Z}$ and for all $n\in \mathbb{N}$$\Phi_{\Sigma[M]}$ is the seed mapping for summation chains in M. The notation $\Phi_{\Sigma}$ is used when the $\mathbb{Z}$-Module, M, is clear. $\mathcal{C}_{\Sigma[M]}$ denotes the set of all M-valued summation chains.

Remark. Summation chains are generated by the function rule $T_{\Sigma}:M^{\infty}\to M^{\infty}$ where $T_{\Sigma}(x_1,x_2,x_3,\ldots)=(x_1,x_1+x_2,x_1+x_2+x_3,\ldots)$.

Example. Let $t:\mathbb{Z}\times \mathbb{N} \to M$ be the constant function $t(m,n)=0$.

$$\sum_{k=1}^{n} t(m-1,k) =\sum_{k=1}^{n} 0 =0=t(m,n),$$

so t is a $\Sigma$-chain. This chain is the trivial chain and is generated with the sequence of all zeros.

How can the sequences in a summation chain be computed quickly? The next two results provide ways to compute entries of a summation chain. The next proposition provides a way to compute the entries in a chain using the other entries close to it.

#### Proposition.

Let M be a $\mathbb{Z}$-Module. Let $c:\mathbb{Z} \times \mathbb{N} \to M$ be a function. c is a $\Sigma$-chain in M if and only if for all $m,m' \in \mathbb{Z}$ and $n \in \mathbb{N}$, the following equalities hold.

1. $c(m,1)=c(m',1)$,
2. If $n \geq 2$, then
$$c(m,n)=c(m,n-1)+c(m-1,n)=c(m+1,n)-c(m+1,n-1).$$

The next theorem provides a way to calculate any sequence in a summation chain directly given any other sequence in the chain without having to compute any of the sequences in between.

#### Lemma.

Let natural numbers, n and k, where k < n.
$${n \choose k}={n-1 \choose k-1}{n-1 \choose k}$$

#### Theorem.

Let c be a $\Sigma$-chain in $\mathbb{Z}$-Module M. Let $m_1,m_2 \in \mathbb{Z}$ such that $m_1>m_2$, and let $n\in \mathbb{N}$. Define $m=m_1-m_2$.
$$c(m_1,n)=\sum_{k=1}^{n} {m+n-k-1 \choose m-1} c(m_2,k)$$

If $n\geq m+1$,

$$c(m_2,n)= \sum_{k=0}^{m} (-1)^{m-k}{m \choose m-k} c(m_1,n-m+k).$$

If $n\leq m+1$,

$$c(m_2,n)=\sum_{k=1}^{n} (-1)^{n-k}{m \choose n-k} c(m_1,k).$$

Remark.When $n=m+1$, the two equations above are the same.

Example. Let $c=\Phi_{\Sigma[\mathbb{R}]}((2^{k-1}))$. The following is a computation of the sequences c(-5,n) and c(-6,n).

c(-5,1)=c(0,1)=1.
c(-5,2)=c(0,2)-5c(0,1)=2-5(1)=-3.
c(-5,3)=c(0,3)-5c(0,2)+10c(0,1)=4-5(2)+10(1)=4.
c(-5,4)=c(0,4)-5c(0,3)+10c(0,2)-10c(0,1)=8-5(4)+10((2)-10(1)=-2.
c(-5,5)=c(0,5)-5c(0,4)+10c(0,3)-10c(0,2)+5c(0,1)=16-5(8)+10(4)-10((2)+5(1)=1.

When $n\geq 6$,

\begin{aligned}c(-5,n)&=\sum_{i=0}^4{4 \choose i}c(0,n-i)\\&=c(0,n)-5c(0,n-1)+10c(0,n-2)\\&\kern{3em}-10c(0,n-3)+5c(0,n-4)-c(0,n-5)\\&=2^{n-1}-5\cdot 2^{n-2}+10\cdot 2^{n-3}-10\cdot 2^{n-4}+5\cdot 2^{n-5}-2^{n-6}\\&=2^{n-6}(32-5\cdot 16+10\cdot 6-10\cdot 4+5\cdot 2-1)\\&=2^{n-6}(1)\\&=2^{n-6}\end{aligned}

c(-6,1)=c(-5,1)=1
c(-6,2)=c(-5,2)-c(-5,1)=-3-1=-4
c(-6,3)=c(-5,3)-c(-5,2)=4-(-3)=7
c(-6,4)=c(-5,4)-c(-5,3)=-2-4=-6
c(-6,5)=c(-5,5)-c(-5,4)=1-(-2)=3
c(-6,6)=c(-5,6)-c(-5,5)=1-1=0

When $n\geq 7, c(-6,n)=c(-5,n)-c(-5,n-1)=2^{n-6}-2^{n-7}=2^{n-7}$

Example. Let $a_1=1, a_2=-1$, and $a_k=0$ when $k\geq 3$. Let $c=\Phi_{\Sigma[\mathbb{R}]}((a_k))$. The following is a computation of the sequence c(4,n).

c(4,1)=c(0,1)=a_1=1

When $n\geq 2$,
\begin{aligned}c(4,n)&=\sum^{n}_{k=1}{3+n-k \choose 3}c(0,k)\\&={n+2 \choose 3}-{n+1 \choose 3}\\&=\frac{n(n+1)(n+2)}{6}-\frac{(n-1)n(n+1)}{6}\\&=\frac{n(n+1)[(n+2)-(n-1)]}{6}\\&=\frac{n(n+1)}{2}\\\end{aligned}

### Notable Corollaries

In general, there is no way to determine the exact distance between two sequences that occur in a summation chain.
Under certain conditions, however, the corollary below determines the location of sequences in relation to each other within a summation chain. This corollary is essential in determining whether two arbitrary sequences appear in the same summation chain.

#### Corollary

Let M be a $\mathbb{Z}$-Module such that the order of every nonzero element in M is infinite. Let c be a non-trivial $\Sigma$-chain in M. Let $N=\min\{n \in \mathbb{N}\::\:c(0,n)\neq 0\}$. Let $m^*,m_1,m_2 \in \mathbb{Z}$.
$$m^*\cdot c(m_1,N)=c(m_2,N+1)-c(m_1,N+1)$$

if and only if $m^*=m_2-m_1$.

Remark. When M is a field with unity $1_M$, the equation above can be written,
$$m^*\cdot(1_M)=\frac{c(m_2,N+1)-c(m_1,N+1)}{c(m_1,N)}$$

In the case of complex valued summation chains, each chain never repeats a sequence unless the chain is trivial. This property comes from the fact that zero is the only complex number that has finite order as stated by the next corollary.

#### Corollary

Let M be a $\mathbb{Z}$-Module such that the order of every nonzero element in M is infinite. Let c be a $\Sigma$-chain in M. Let $m_1,m_2 \in \mathbb{Z}$ such that $m_1\neq m_2$. If $c(m_1,n)=c(m_2,n), \forall n\in \mathbb{N}$, then c is the trivial chain.

It has not been determined whether or not the corollary above is true for all summation chains of sequences.

## Conclusion

Editor’s note: This paper represents the introduction of the concept of summation chains of complex valued sequences. Key concepts and definitions that set the stage for further results in the development of this theory. Part 2 will discuss sum-related chains, and handling leading zeros.

A Generalized Multinomial Distribution from Dependent Categorical Random Variables

## Abstract

Categorical random variables are a common staple in machine learning methods and other applications across disciplines. Many times, correlation within categorical predictors exists, and has been noted to have an effect on various algorithm effectiveness, such as feature ranking and random forests. We present a mathematical construction of a sequence of identically distributed but dependent categorical random variables, and give a generalized multinomial distribution to model the probability of counts of such variables.

## Introduction

Bernoulli random variables are invaluable in statistical analysis of phenomena having binary outcomes, however, many other variables cannot be modeled by only two categories. Many topics in statistics and machine learning rely on categorical random variables, such as random forests and various clustering algorithms [6,7]. Many datasets exhibit correlation or dependency among predictors as well as within predictors, which can impact the model used. [6,9]. This can result in unreliable feature ranking [9], and inaccurate random forests [6].

Some attempts to remedy these effects involve Bayesian modeling [2] and various computational and simulation methods [8]. In particular, simulation of correlated categorical variables has been discussed in the literature for some time [1, 3, 5]. Little research has been done to create mathematical framework of correlated or dependent categorical variables and the resulting distributions of functions of such variables.

Korzeniowski [4] studied dependent Bernoulli variables, formalizing the notion of identically distributed but dependent Bernoulli variables and deriving the distribution of the sum of such dependent variables, yielding a Generalized Binomial Distribution.

In this paper, we generalize the work of Korzeniowski [4] and formalize the notion of a sequence of identically distributed but dependent categorical random variables. We then derive a Generalized Multinomial Distribution for such variables and provide some properties of said distribution. We also give an algorithm to generate a sequence of correlated categorical random variables.

## Background

Korzeniowski defined the notion of dependence in a way we will refer to here as dependence of the first kind (FK dependence). Suppose $(\epsilon_{1},...,\epsilon_{N})$ is a sequence of Bernoulli random variables, and $P(\epsilon_{1} = 1) = p$. Then, for $\epsilon_{i}, i \geq 2$, we weight the probability of each binary outcome toward the outcome of $\epsilon_{1}$, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let $0 \leq \delta \leq 1$, and $q = 1-p$. Then define the following quantities
\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q, &\qquad p^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = p-\delta p \\ q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = q+\delta p , &\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q -\delta q\end{aligned}

Given the outcome $i$ of $\epsilon_{1}$, the probability of outcome $i$ occurring in the subsequent Bernoulli variables $\epsilon_{2}, \epsilon_{3},...,\epsilon_{n}$ is $p^{+}, i = 1$ or $q^{+}, i=0$. The probability of the opposite outcome is then decreased to $q^{-}$ and $p^{-}$, respectively.

The above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, $P(\epsilon_{i} = 1) = p$ for all $i$. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be
$$\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr} \delta, & i=1 \\ \delta^{2}, & i \neq j;\,i,j \geq 2\end{array}\right.$$

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution. In our generalization, we use the same form of FK dependence, but for categorical random variables. We will construct a sequence of identically distributed but dependent categorical variables from which we will build a generalized multinomial distribution. When the number of categories $K = 2$, the distribution reverts back to the generalized binomial distribution of Korzeniowski [4]. When the sequence is fully independent, the distribution reverts back to the independent categorical model and the standard multinomial distribution, and when the sequence is independent and $K=2$, we recover the standard binomial distribution. Thus, this new distribution represents a much larger generalization than prior models.

## Construction of Dependent Categorical Variables

Suppose each categorical variable has K possible categories, so the sample space $S = \{1,\ldots,K\}$ (These integers should not be taken as ordered or sequential, but rather as character titles of categories.) The construction of the correlated categorical variables is based on a probability mass distribution over K-adic partitions of [0,1]. We will follow graph terminology in our construction, as this lends a visual representation of the construction. We begin with a parent node and build a K-nary tree, where the end nodes are labeled by the repeating sequence (1,…, K). Thus, after N steps, the graph has $K^{N}$ nodes, with each node labeled 1,…,K repeating in the natural order and assigned injectively to the intervals
$$\left(0,\frac{1}{K^{N}}\right],\left(\frac{1}{K^{n}},\frac{2}{K^{N}}\right],\ldots,\left(\frac{K^{N}-1}{K^{n}},1\right].$$

Define

\begin{aligned}&\epsilon_{N} = i \qquad \text{ on } \left(\tfrac{Kj}{K^{N}}, \tfrac{Kj+i}{K^{N}}\right] \\&\epsilon_{N} = K \qquad \text{on } \left(\tfrac{K(j+1)-1}{K^{N}}, \tfrac{K(j+1)}{K^{N}}\right] \end{aligned}

where $i = 1,...,K-1$, and $j = 0,1,\ldots,K^{N-1}-1$. An alternate expression for the above equation is

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1, \ldots ,K-1\\ \epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}

To each of the branches of the tree, and by transitivity to each of the $K^{N}$ partitions of [0,1], we assign a probability mass to each node such that the total mass is 1 at each level of the tree in a similar manner to [4].

Let $0 < p_{i} < 1, i=1,\ldots,K$ such that $\sum_{i=1}^{K}p_{i}=1$, and let $0 \leq \delta \leq 1$ be the dependency coefficient. Define

$p_{i}^{+} := p_{i} + \delta\sum_{l \neq i}p_{l}, \quad p_{i}^{-} := p_{i}-\delta p_{i}$

for= 1,…,K. These probabilities satisfy two important criteria detailed in the following lemmata:

Lemma 1.

\begin{aligned}\sum_{i=1}^{K}p_{i} = p_{i}^{+} + \sum_{l \neq i}p_{l}^{-} = 1\\ p_{i}p_{i}^{+} + p_{i}^{-}\sum_{l \neq i}p_{l} = p_{i}\end{aligned}

We now give the construction in steps down each level of the K-nary tree.

##### LEVEL 1:

Parent node has mass 1, with mass split $1\cdot\prod_{i=1}^{K}p_{i}^{[\epsilon_{1} = i]}$, where $[\cdot]$ is an Iverson bracket. This level corresponds to a sequence $\epsilon$ of dependent categorical variables of length 1.

$$\begin{array}{cccc}\epsilon_{1}(\text{\textbf{Branch}}) & \textbf{Path} & \textbf{Mass at Node} & \textbf{Interval}\\1 &\textit{parent}\to 1 & p_{1}& (0,1/K] \\2 &\textit{parent} \to 2 & p_{2}& (1/K,2/K]\\ \vdots & \vdots & \vdots &\vdots \\i &\textit{parent} \to i & p_{i}& ((i-1)/K,i/K]\\K &\textit{parent} \to K & p_{K}& ((K-1)/K,1]\end{array}$$
##### LEVEL 2:

Level 2 has K nodes, with K branches stemming from each node. This corresponds to a sequence of length 2: $\epsilon = (\epsilon_{1}, \epsilon_{2})$. Denote i.1 as node i from level 1. For i = 1,…,K,

Node i.1 has mass $p_{i}$, with mass split $p_{i}\left(p_{i}^{+}\right)^{[\epsilon_{2}=i]}\prod\limits_{j=1,j\neq i}^{K}\left(p_{j}^{-}\right)^{[\epsilon_{2} = j]}$.

$$\begin{array}{cccc}\epsilon_{2}(\textbf{Branch})& \textbf{Path} & \textbf{Mass at Node} & \textbf{Interval}\\1 &i.1 \to 1 & p_{i}p_{1}^{-} & \left(\frac{(i-1)K}{K^{2}},\frac{(i-1)K + 1}{K^{2}}\right]\\2 &i.1 \to 2 & p_{i}p_{2}^{-} & \left(\frac{(i-1)K + 1}{K^{2}},\frac{(i-1)K+ 2}{K^{2}}\right]\\\vdots & \vdots & \vdots & \vdots\\i &i.1 \to i & p_{i}p_{i}^{+}& \left(\frac{(i-1)K + (i-1)}{K^{2}},\frac{(i-1)K+ i}{K^{2}}\right]\\\vdots & \vdots & \vdots & \vdots \\K &i.1 \to K & p_{i}p_{K}^{-}& \left(\frac{iK-1}{K^{2}},\frac{iK}{K^{2}}\right]\end{array}$$

In this fashion, we distribute the total mass 1 across level 2. In general, at any level r, there are K streams of probability flow at Level r. For $\epsilon_{1} = i$, the probability flow is given by
$$p_{i}\prod_{j = 2}^{r}\left[\left(p_{i}^{+}\right)^{[\epsilon_{j} = i]}\prod_{l\neq i }\left(p_{l}^{-}\right)^{[\epsilon_{j} = l]}\right],\qquad i = 1,\ldots,K\qquad\qquad(*)$$

We use this flow to assign mass to the Kr intervals of [0,1] at level r in the same way as above. We may also verify via algebraic manipulation that

$$p_{i} = p_{i}\left(p_{i}^{+} + \sum_{l\neq i}p_{i}^{-}\right)^{r-1} =p_{i}\sum_{\epsilon_{2},\ldots,\epsilon_{r}}\prod_{j=2}^{r}\left[\left(p_{i}^{+}\right)^{[\epsilon_{j} = i]}\prod_{l \neq i}\left(p_{l}^{-}\right)^{[\epsilon_{j} = l]}\right]$$

where the first equality is due to the first statement of Lemma 1, and the second is due to the probability flow at level r, given in $(*)$ above.

### Example Construction

For an illustration, refer to Figure 2 above.  In this example, we construct a sequence of length N = 3 of categorical variables with K = 3 categories. At each level r, there are 3r nodes corresponding to 3r partitions of the interval [0,1]. Note that each time the node splits into 3 children, the sum of the split probabilities is 1. Despite the outcome of the previous random variable, the next one always has three possibilities. The sample space of categorical variable sequences of length 3 has 33 = 27 possibilities.

### Properties

#### Identically Distributed but Dependent

We now show the most important property of this class of sequences– that they remain identically distributed despite losing independence.

#### Lemma 2:

$P(\epsilon_{r} = i) = p_{i}\qquad i = 1,\ldots,K; \:\: r \in \mathbb{N}$

#### Pairwise Cross-Covariance Matrix

We now give the pairwise cross-covariance matrix for dependent categorical random variables.

Theorem 1: (Cross-Covariance of Dependent Categorical Random Variables).  Denote by $\Lambda^{\iota,\tau}$ the K × K cross-covariance matrix of $\epsilon_{\iota}$ and $\epsilon_{\tau}$, $\iota,\tau = 1,\ldots,n$, defined as $\Lambda^{\iota,\tau} = E[(\epsilon_{\iota} - E[\epsilon_{\iota}])(\epsilon_{\tau} - E[\epsilon_{\tau}])]$. Then the entries of the matrix are given by

$\Lambda^{1,\tau}_{ij} = \left\{\begin{array}{cc}\delta p_{i}(1-p_{i}), & i = j \\-\delta p_{i}p_{j}, & i \neq j\end{array}\right.\qquad \tau \geq 2,$
and

$\Lambda^{\iota,\tau}_{ij} = \left\{\begin{array}{cc}\delta^{2}p_{i}(1-p_{i}), & i = j \\-\delta^{2}p_{i}p_{j}, & i \neq j\end{array}\right. \qquad\tau > \iota,\:\iota\neq 1.$

In the next section, we exploit the desirable identical distribution of the categorical sequence in order to provide a generalized multinomial distribution for the counts in each category.

## Generalized Multinomial Distribution

From the construction in Section 2, we derive a generalized multinomial distribution in which all categorical variables are identically distributed but no longer independent.

Theorem 2: Generalized Multinomial Distribution.  Let $\epsilon_{1},\ldots,\epsilon_{N}$ be categorical random variables with categories $1,\ldots,K$, constructed as in Section 2. Let $X_{i} = \sum_{j=1}^{N}[\epsilon_{j} = i], i = 1,\ldots,K$, where $[\cdot]$ is the Iverson bracket. Denote $\mathbf{X} = (X_{1},\ldots,X_{K})$, with observed values $\mathbf{x} = (x_{1},\ldots,x_{K})$. Then

$$P(\mathbf{X} = \mathbf{x}) = \sum_{i=1}^{K}p_{i}\frac{(N-1)!}{(x_{i}-1)!\prod_{j \neq i}x_{j}!}\left(p_{i}^{+}\right)^{x_{i}-1}\prod_{j \neq i}\left(p_{j}^{-}\right)^{x_{j}}.$$

### Properties

This section details some useful properties of the Generalized Multinomial Distribution and the dependent categorical random variables.

#### Marginal Distributions

Theorem 3: Univariate Marginal Distribution.  The univariate marginal distribution of the Generalized Multinomial Distribution is the Generalized Binomial Distribution. That is,

$P(X_{i} = x_{i}) = q{N-1 \choose x_{i}}\left(p_{i}^{-}\right)^{x_{i}}\left(q^{-}\right)^{N-1-x_{i}} + p_{i}{N-1 \choose x_{i}-1}\left(p_{i}^{+}\right)^{x_{i}-1}\left(q^{+}\right)^{N-1-(x_{i}-1)},$
where $q = \sum_{j \neq i}p_{j}$, $q^{+} = q + \delta p_{i}$, and $q^{-} = q - \delta q$.
The above theorem shows another way the generalized multinomial distribution is an extension of the generalized binomial distribution.

Theorem 4: Moment Generating Function.  The moment generating function of the generalized multinomial distribution with K categories is given by
$$M_{\mathbf{X}}(\mathbf{t}) = \sum_{i=1}^{K}p_{i}e^{t_{i}}\left(p_{i}^{+}e^{t_{i}} + \sum_{j \neq i}p_{j}^{-}e^{t_{j}}\right)^{n-1}$$

where $\mathbf{X} = (X_{1},...,X_{K})$ and $\mathbf{t} = (t_{1},...,t_{K})$.

#### Moments of the Generalized Multinomial Distribution

Using the moment generating function in the standard way, the mean vector $\mu$ and the covariance matrix $\Sigma$ may be derived.

Expected Value. The expected value of $\mathbf{X}$ is given by $\mu = n\mathbf{p}$ where $\mathbf{p} = (p_{1},...,p_{K})$

Covariance Matrix The entries of the covariance matrix are given by
$$\Sigma_{ij} = \left\{\begin{array}{lr}p_{i}(1-p_{i})(n + \delta(n-1) + \delta^{2}(n-1)(n-2)), & i = j \\p_{i}p_{j}(\delta(1-\delta)(n-2)(n-1)-n), & i \neq j\end{array}\right.$$ Note that if $\delta = 0$, the generalized multinomial distribution reduces to the standard multinomial distribution and $\Sigma$ becomes the familiar multinomial covariance matrix. The entries of the corresponding correlation matrix are given by
$$\rho(X_{i},X_{j}) = -\sqrt{\frac{p_{i}p_{j}}{(1-p_{i})(1-p_{j})}}\left(\frac{n-\delta(n-1)(n-2)}{n+\delta(n-1)+\delta^{2}(n-1)(n-2)}\right)$$

If $\delta = 1$, the variance of $X_{i}$ tends to $\infty$ with n. This is intuitive, as $\delta = 1$ implies perfect dependence of $\epsilon_{2},...,\epsilon_{n}$ on the outcome of $\epsilon_{1}$. Thus, $X_{i}$ will either be 0 or n, and this spread increases to $\infty$ with n.

## Generating a Sequence of Correlated Categorical Random Variables

For brevity, we will take the acronym DCRV for a Dependent Categorical Random Variable. A DCRV sequence $\epsilon = (\epsilon_{1},...,\epsilon_{n})$ is in itself a random variable, and thus has a probability distribution. In order to provide an algorithm to generate such a sequence, we first derive this probability distribution.

### Probability Distribution of a DCRV Sequence

The probability distribution of the DCRV sequence $\epsilon$ of length n is given formally in the following theorem. The proof follows in a straightforward fashion from the construction of dependent categorical random variables and is therefore omitted.

Theorem 5: Distribution of a DCRV Sequence.  Let $(\Sigma,\mathcal{F}, \mathbb{P}) = ([0,1], \mathcal{B}, \mu)$. Let $\epsilon_{i} : [0,1] \to \{1,...,K\},i = 1,...,n,n \in \mathbb{N}$ be DCRVs as defined in the construction. Let $\epsilon = (\epsilon_{1},...,\epsilon_{n})$ denote the DCRV sequence with observed values $e = (e_{1},...,e_{n})$. Then $\mu$ has the density
$$f(x) = \sum_{i=1}^{K^{n}}K^{n}m^{i}\mathbb{1}_{\tiny((i-1)/K^{n}, i/K^{n}]}(x)$$
and
$$P(\epsilon = e) = \int\limits_{\left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right]}f(x)dx = m^{i}$$

where $m^{i}$ is the mass allocated to the interval $\left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right]$ by the probability mass flow and i as the lexicographic order of e in the sample space $\{1,...,K\}^{n}$ given by the relation $\frac{i}{K^{n}} = \sum_{j=1}^{K^{n}}\frac{\epsilon_{j}-1}{K^{j}}$

### Algorithm

We take a common notion of using a uniform random variable in order to generate the desired random variable $\epsilon$. For $\epsilon$ with distribution $F(x) = \int_{0}^{x}f(y)dy, f(x)$ as in the above equation, it is clear that F is invertible with inverse $F^{-1}$. Thus, $F^{-1}(U)$ for $U \sim \text{Uniform}[0,1]$ has the same distribution as $\epsilon$. Therefore, sampling u from U is equivalent to the sample $e = F^{-1}(u)$ from $\epsilon$.

In the construction of the DCRVs, we associated $\epsilon_{n}$ to the intervals

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1,\ldots,K-1 \\\epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}

From the construction, each sequence has a 1-1 correspondence with the interval $\left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right)$ for a unique $i = 1,\ldots,K^{n}$. The probability of such a sequence can be found using Theorem 5:

$$P(\epsilon = e) = F\left(\left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right)\right) = m^{i} = l\left([s_{i-1}, s_{i})\right)$$

where l is the length of the above interval, and $s_{i} = \sum_{j=1}^{i}m^{j}$. Therefore, we have now partitioned the interval [0,1) according to the distribution of $\epsilon$ bijectively to the K-nary partition of [0,1) corresponding to the particular sequence. Thus, sampling $u \in [0,1)$ from a uniform distribution and finding the interval $[s_{i-1},s_{i})$ and corresponding i will yield the unique DCRV sequence.

Algorithm Strategy: Given $u \in [0,1)$ and $n \in \mathbb{N}$, find the unique interval $[s_{i-1}, s_{i}), i= 1,\ldots,K^{n}$ that contains u by “moving down the tree” and narrowing down the “search interval” until level n is reached.

We provide an explicit example prior to the pseudocode to illustrate the strategy.

#### Example DCRV Sequence Generation

Suppose $K = 3, p_{1} = p_{2} = p_{3} = 1/3,$  and  $\delta = 1/2$, and suppose we wish to generate a DCRV sequence of length 2. The figure above gives the probability flow and the corresponding probability intervals $[s_{i-1}, s_{i})$ that partition [0, 1) according to Theorem 4. Now, suppose we sample a uniform random variable u, with observed value $u = \frac{3}{4}$. We now illustrate the process of moving down the above tree to generate the sequence.

1. The first level of the probability tree partitions the interval [0, 1) into three intervals given in the figure above. $u = \frac{3}{4}$ lies in the third interval, which corresponds to $\epsilon_{1} = 3$. Thus, the first entry of e is given by $e_{1} = 3$.
2. Next, the search interval is reduced to the interval from the first step [2/3, 1). We then generate the partitions of [2/3, 1) by cumulatively adding $p_{3}p_{i}^{-}, i = 1,2,3$ to the left endpoint 2/3. Thus, the next partition points are
• 2/3 + (1/3)(1/6) = 13/18
• 2/3 + (1/3)(1/6) + (1/3)(1/6) = 7/9, and
• 2/3 + (1/3)(1/6) + (1/3)(1/6) + (1/3)(2/3) = 1

Yielding the subintervals of [2/3, 1):

• [2/3, 13/18)
• [13/18, 7/9)
• [7/9, 1)

We now find the interval from above that contains u is the second: [13/18,  7/9). Thus, $\epsilon_{2} = 2$. Since we only sought a sequence of length 2, the algorithm is finished, and we have generated the sequence e = (3, 2). If a longer sequence is desired, we repeat step 2 until we are at level n. For the pseudocode of the algorithm, see the pdf, linked at the top of the page.

## Conclusion

Categorical variables play a large role in many statistical and practical applications across disciplines. Moreover, correlations among categorical variables are common and found in many scenarios, which can cause problems with conventional assumptions. Different approaches have been taken to mitigate these effects, because a mathematical framework to define a measure of dependency in a sequence of categorical variables was not available. This paper formalized the notion of dependent categorical variables under a first-dependence scheme and proved that such a sequence is identically distributed but now dependent. With an identically distributed but dependent sequence, a generalized multinomial distribution was derived in Section~\ref{sec: gen multinomial} and important properties of this distribution were provided. An efficient algorithm to generate a sequence of dependent categorical random variables was given.