### Browsed byCategory: Probability Theory

Extensions of the Single Server Efficiency Model

## Abstract

Editor’s note: this paper comprises the third chapter of the PhD dissertation by Rachel Traylor. Visit here and here to see chapters one and two, respectively. Herein we further generalize the single server model presented in [3]. In particular, we consider a multichannel server under the cases of both singular and clustered tasks. In the instance of singular tasks, we present a load balancing allocation scheme and obtain a stochastic breakdown rate process, as well as derive the conditional survival function as a result. The model of a multichannel server taking in clustered tasks gives rise to two possibilities, namely, independent and correlated channels. We derive survival functions for both of these scenarios.

## Load Balancing Allocation for a Multichannel Server

### Model Description

Previously, we had assumed that a web server functions as a single queue that attempts to process jobs as soon as they arrive. These jobs originally brought a constant stress $\eta$ to the server, with the system stress reducing by $\eta$ at the completion of each job.

Now, suppose we have a server partitioned into K channels. Denote each channel as $Q_k,$ $k = 1,\ldots,K$. Jobs arrive via a nonhomogenous Poisson process with rate $\lambda(t)$. Upon arrival, each job falls (or is routed) to the channel with the shortest queue length. If all queue lengths are equal or multiple channels have the shortest length, the job will enter one of the appropriate queues with equal probability.

We retain the previous notation for the baseline breakdown rate, or hazard function. This is denoted by $r_0(t)$ and is the hazard function under an idle system. We also retain the assumption that the arrival times $\mathbf{T}$ are independent. In addition, the service times $\mathfrak{W}$ are i.i.d. with distribution $G_W(w)$. We assume that all channels are serving jobs at the same time, i.e., a job can be completed from any queue at any time. We do not require load balancing for service. In other words, any queue can empty with others still backlogged. We also retain the FIFO service policy for each queue.

Since we have now “balanced,” or distributed, the load of jobs in the server, not all jobs will cause additional stress to the system. Suppose all jobs bring the same constant stress $\eta$ upon arrival. Under load balancing, we will define the additional stress to the system as $\eta\max_k|Q_k|$. Figure 1 shows an example server with current stress of $4\eta$.

### Examples

Due to the dynamic nature of arrival times, allocation to queues, and service times, we have many possible configurations of jobs at any point in time. Therefore, the allocation scheme adds an additional layer of variation to the service times and order of service. The placement of jobs in the various queues (and thus the order of service and service times) is wholly dependent on all arrival times and service times of the prior arrivals. The following examples illustrate the effect on the workload stress added to the system in various scenarios.

###### Figure 2(b). Breakdown Rate Process Trajectory

Example 1Suppose for simplicity we have 2 channels. Suppose at the time of observation of the system, 3 jobs have arrived and none have finished. WLOG, suppose job 3 fell into $Q_{1}$. See Figure 2(a). The stress to the system at $t=t_{\text{obs}}$ is $r_{0}(t_{\text{obs}}) + 2\eta$, as shown in Figure 2(b).

Note in Example 1 that Job 2 does not add any additional stress to the system. Job 1 sees an empty queue upon arrival, and $\max_{K}|Q_{K}| = 1$ when it falls into any particular queue. Job 2 arrives as Job 1 is still being processed, and thus the placement of Job 1 forces Job 2 into the empty channel. Since $\max_{K}|Q_{K}|$ is still 1, the stress to the system doesn’t change. Job 3 arrives as Jobs 1 and 2 are in service, and thus its choice of queue is irrelevant due to the configuration of the two queues at $T_{3}$. Regardless of which queue Job 3 falls into, $\max_{K}|Q_{K}| = 2$. Thus the arrival of Job 3 increases the breakdown rate by $\eta$ again.

The next example shows the change in system stress Job 1 from Example 1 when one job has finished processing before $T_{3}$.

###### Figure 3(b). Example 2 Breakdown Rate Process Trajectory

Example 2. Consider the same two-channel system from Example 1. However, now suppose WLOG that $T_{3} < T_{1}+W_{1}$. In other words, service for Job 1 was completed before Job 3 arrived. Hence Job 3 will fall into the opposite queue as Job 2. The stress to the system at the time of observation would be $r_{0}(t) + \eta$. See Figures 3(a) and 3(b).

In this scenario, the workload due to Job 3 does not contribute any additional stress to the server. Also observe that upon completion of Job 1, the workload stress to the server does not decrease, as Job 2 still resides in the system and is being served.

Contrast this behavior with the breakdown rate process given in [3]. In the single-channel, single-server model described in both [1] and [3], each job adds stress to the server upon arrival. Under the load balancing allocation scheme, the additional stress to the server depends on the arrival and service times of all prior jobs. From a stochastic perspective, this breakdown rate process has full memory.

The examples above illustrate that $\max_{K}|Q_{K}|$ depends on the intersection of the intervals $I_{j} = [T_{j}, T_{j} + W_{j}]$, $j = 1,\ldots,N(t)$. The next section details the methodology to obtain the configuration of jobs in the server at time $t$ by deccomposition of $\bigcup_{j=1}^{N(t)}I_{j}$ into disjoint atoms and derives the stochastic breakdown rate process under the load balancing allocation scheme.

### Breakdown Rate Process and Conditional Survival Function

Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{N(t)})$ be a $N(t)$-tuple whose components $\epsilon_{j}\in\{\emptyset, c\}$, where $\emptyset$ denotes the empty set, and $c$ denotes the complement of the set. Let $E = \{\epsilon \mathrel{:} \epsilon_{j}\in\{\emptyset, c\}\}$ denote the set of all possible $\epsilon$, excepting $\epsilon = (c,\ldots,c)$. Then by Lemma 1 (see Appendix),

$$\bigcup_{j=1}^{N(t)}I_{j} = \bigcup_{\epsilon \in E}\bigcap_{j=1}^{N(t)}I_{j}^{\epsilon_{j}}$$

Remark. $\cap_{j=1}^{N(t)}I_{j}^{\epsilon_{j}}$ indicates which jobs are still in the server at time $t$. The union is disjoint; thus only one $\epsilon$ will describe the server configuration at any given time $t$. For example, if 3 jobs have arrived to the server at time $t_{\text{obs}}$, $|E| = 3\times 2 - 1 = 5$. These may be enumerated:

• $I_{1} \cap I_{2} \cap I_{3}$

• $I_{1}^{c} \cap I_{2}^{c} \cap I_{3}$

• $I_{1} \cap I_{2}^{c} \cap I_{3}^{c}$
• $I_{1}^{c} \cap I_{2} \cap I_{3}$

• $I_{1}^{c} \cap I_{2} \cap I_{3}^{c}$

As an illustration, refer to Example 1. All three jobs are in the system at $t = t_{\text{obs}}$ (that is, none have completed service), and thus $t_{\text{obs}} \in I_{1} \cap I_{2} \cap I_{3}$. Expanding, $t_{\text{obs}} \in [T_{1}, T_{1}+W_{1}]$, $[T_{2}, T_{2}+W_{2}]$, and $[T_{3}, T_{3}+W_{3}]$.

Compare the case with that of Example 2. In this case, three jobs have arrived at $t = t_{\text{obs}}$, but Job 1 has finished by $t_{\text{obs}}$. Thus $t_{\text{obs}} \not\in I_{1}$, but since Jobs 2 and 3 are still in the system, $t_{\text{obs}} \in I_{2} \cap I_{3}$. Thus $t_{\text{obs}} \in I_{1}^{c} \cap I_{2} \cap I_{3}$.

Now, since the additional workload stress to the server is a multiple of $\eta\max_{K}|Q_{K}|$, it remains to derive the appropriate multiplier that accounts for the number of jobs that contribute additional stress to the system.

Let $n = \sum_{j=1}^{N(t)}\mathbf{1}(\epsilon_{j} = \emptyset | \epsilon_{j} \in \epsilon)$ for a particular $\epsilon$, and let $\alpha_{\epsilon}$ be the multiplier that indicates the number of jobs that contribute stress $\eta$ to the system. Under [1] and the generalization in [3], every uncompleted job in the system contributes stress, thus $\alpha_{\epsilon} = n$.

Under the load balancing scheme, $\alpha_{\epsilon} = \lfloor\frac{n+1}{K}\rfloor$, where $K$ is the number of channels in the server. This is due to the allocation scheme’s attempts to evenly distribute jobs across channels. Thus, for Example 1, $n=3$, and $K=2$, meaning $\alpha_{\epsilon} = 2$, as illustrated in Figure 2(b) and for Example 2, $\alpha_{\epsilon} = \lfloor\frac{3+1}{2}\rfloor = 1$, as in Figure 3(b).

Then, the stochastic breakdown rate process under the load balancing allocation scheme is given by
$$\mathcal{B}(t) = r_{0}(t) + \eta\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{I_{1}^{\epsilon_{1}}\cap I_{2}^{\epsilon_{2}} \cap I_{N(t)}^{\epsilon_{N(t)}}}(t)$$

Under this expression, only one indicator function will be nonzero at any given point in time, since all atoms are disjoint. Now, $I_{1}^{\epsilon_{1}} \cap I_{2}^{\epsilon_{2}} \cap \ldots \cap I_{N(t)}^{\epsilon_{N(t)}}$ may be expressed as one interval $[L_{\epsilon},R_{\epsilon}]$, where

\begin{aligned}L_{\epsilon}&=\max\left(\{T_{j}\mathrel{:}\epsilon_{j} = \emptyset\}_{j=1}^{N(t)}\right);\\R_{\epsilon}&=\min\left(\{T_{j} + W_{j}\mathrel{:}\epsilon_{j} = \emptyset \}_{j=1}^{N(t)},\{T_{j}\mathrel{:}\epsilon_{j} = c\}_{j=1}^{N(t)}\right)\end{aligned}

Thus, for a server with $K$ channels under a load balancing routing scheme with all jobs bringing constant stress $\eta$, the breakdown rate process $\mathcal{B}(t)$ may be expressed as

$$\mathcal{B}(t) = r_{0}(t) + \eta\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{[L_{\epsilon}, R_{\epsilon}]}(t)\qquad(*)$$

Thus, the conditional survival function under the load balancing scheme is given by
\begin{aligned}S_{Y|\mathfrak{T},\mathfrak{W}, N(t)}(t|\mathfrak{t},\mathfrak{w}, n) &= e^{-\int_{0}^{t}\mathcal{B}(s)ds}\\&=\bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{[L_{\epsilon}, R_{\epsilon}]}(s)ds\right)\\&=\bar{F}_{0}(t)\exp\left(-\eta\sum_{\epsilon\in E}\alpha_{\epsilon}\min(t-L_{\epsilon},R_{\epsilon})\right) \end{aligned}

### Remarks

Finding the survival function of the single-channel environment relied on the independence of the set of arrival times and service times. From $(*)$, the independence is clearly lost. As noted before, the random breakdown process has full memory, and thus is completely dependent upon the entire trajectory up to $t = t_{\text{obs}}$.

## Clustered Tasks in a Multichannel Server

###### Figure 4. Clustered Tasks in a Multichannel Server

The previous multichannel server model in Section 1 implicitly assumed each job comes with one task, and all channels are identical in their ability to serve any task brought by a job. A classic illustration is a block of registers at a retail establishment. Each customer will survey the length of the various queues at each register before choosing the shortest queue. Viewing each of these separate registers as a channel in a single server under these conditions gave rise to the load balancing allocation model detailed in the previous section. This section presents a different interpretation of a multichannel, single-server model.

Suppose a server has multiple channels $Q_{1},\ldots,Q_{K}$, but each channel serves a different type of task. A customer arrives to the server and may select any number from $0$ to $K$ tasks for the server to perform. Said customer will select each possible task $j$ with probability $p_{j}$. Figure 4 illustrates an example of such a situation in which three customers visit the server and each customer picks a different number and set of tasks at random. A customer is considered fully serviced (i.e. the job is complete) upon completion of the last task belonging to that particular customer.

### Model Assumptions

The following mathematical assumptions are made for the multichannel server with clustered tasks:

1. Customers arrive to the server with $K$ channels via a nonhomogenous Poisson process (NHPP) with intensity $\lambda(t)$.
2. The breakdown rate of the idle server is given by $r_{0}(t)$.
3. Each channel corresponds to a different task the server can perform.
4. The selection of each task is a Bernoulli random variable with probability $p_{k}$. Thus the number of tasks selected by each customer is a binomial random variable.
5. The workload stress to the server is a constant multiple $\eta$ of the number of tasks requested by the customer, i.e. the additional stress is given by $\eta N$, where $N$ is the number of tasks requested.
6. The PDF of each channel’s service time is given by $g_{i}(w)$, $i = 1,\ldots ,K$. Since the customer’s service is not complete until all requested tasks have finished, the service life distribution for the customers is given by $\max_{i}G_{i}(w)$.

Under these assumptions, this model is a special interpretation of the random stress environment developed in [3]. In this case, the random workload stress is $\eta N$, where $N$ is a binomial random variable, and the service life distribution $G_{W}(w) = \max_{i}G_{i}(w)$, which may be easily obtained through the mathematical properties of order statistics. Two variations are considered in the  next section: independent channels and correlated channels.

### Independent Channels in a Clustered Task Server

Suppose the selection probabilities for each task in the server are identical, that is, $p_{1}=p_{2}=\ldots=p_{K}=p$. Then $N\sim\text{Bin}(K,p)$. Using Theorem 3 in [3], the survival function of the multichannel server is given in the following theorem.

Theorem 1 (Survival Function of Multichannel Server with Clustered Tasks and Independent Channels). Suppose model assumptions (1)-(6) above are satisfied. In addition, assume $p_{1}=p_{2}=\ldots=p_{K}=p$. Then the survival function of the server is given by
\begin{aligned}S_{Y}(t)&=\bar{F}_{0}(t)\exp\left(-K\eta\left[e^{-\eta t}\left(1-p+pe^{-\eta t}\right)^{K-1}-p(1-p)^{K-1}\right]\right.\\&\qquad\left.\times\int_{0}^{t}m(t-w)\bar{G}_{W}(w)dw\right)\end{aligned}

where $m(x) = \int_{0}^{x}\lambda(s)ds$, $\bar{F}_{0}(t) = e^{-\int_{0}^{t}r_{0}(s)ds}$, $\bar{G}_{W}(w) = 1-G_{W}(w)$, and $G_{W}(w) = \max_{i}G_{i}(w)$.

### Correlated Channels in a Cluster Server

Now suppose the server tasks are correlated, in that the selection of one particular task may affect the selection of any or all of the other tasks. Thus the channels are a sequence of dependent Bernoulli random variables. The construction of dependent Bernoulli random variables is given in [2], and a summary is given.

#### Dependent Bernoulli Random Variables and the Generalized Binomial Distribution

Korzenwioski [2] constructs a sequence of dependent Bernoulli random variables using a binary tree that distributes probability mass over dyadic partitions of $[0,1]$. Let $0\leq\delta\leq1$, $0, and $q=1-p$. Then define the following quantities:
\begin{aligned}q^{+}\mathrel{:=}q+\delta p&\qquad p^{+}\mathrel{:=}p+\delta q\\q^{-}\mathrel{:=}q(1-\delta)&\qquad p^{-}\mathrel{:=}p(1-\delta)\end{aligned}

The quantities above satisfy the following conditions:
\begin{aligned}q^{+}+p^{-}=q^{-}&+p^{+}=q+p=1\\qq^{+}+pq^{-}=q&,\quad qp^{-}+pp^{+}=1\end{aligned}

###### Figure 5. Construction of Dependent Bernoulli Random Variables

Figure 5 shows the construction shows the dependencies. The following examples using coin flips illustrate the effect of the dependency coefficient $\delta$:

Example 1 $(\delta=1)$. For $\delta = 1$, $q^{+} = q+p=1, q^{-} = 0$, $p^{+} = p+q = 1$, and $p^{-} = 0$. Supposing the first coin flip $\epsilon_{1} = 1$. Then every successive $\epsilon_{i}$ will also be $1$. Similarly if $\epsilon_{1} = 0$. Thus the result of the first coin flip completely determines the outcomes of all the rest.

Example 2 $(\delta = 0)$For $\delta = 0$, $q^{+} = q^{-} = q$, and $p^{+} = p^{-} = p$. Thus, the first coin flip (and all subsequent ones) have no effect on the ones that follow.

Example 3 $(\delta = \frac{1}{4})$Suppose $p=q=\frac{1}{2}$. Then $p^{+} = q^{+} = \frac{5}{8}$, and $p^{-} = q^{-} = \frac{3}{8}$. Then the subsequent outcomes $\epsilon_{i}$, $i \geq 2$ are more likely to match the outcomes of $\epsilon_{1}$ than not.

Now suppose $p = \frac{1}{4}$, $q = \frac{3}{4}$. Then $p^{+} = \frac{7}{16}$$p^{-} = \frac{3}{16}$, $q^{+} = \frac{13}{16}$, and $q^{-} = \frac{9}{16}$. In this example of an unfair coin, the dependency coefficient $\delta$ still attempts to skew the results following the first coin flip in favor of the outcome of $\epsilon_{1}$. However, the dependency here heightens the effect of $\epsilon_{1} = 0$ on subsequent flips, and cannot overcome the discrepancy between the probability of success and failure to skew $\epsilon_{i}$, $i \geq 2$ in favor of a $1$ following the outcome of $\epsilon_{1} = 1$.

Using these dependent Bernoulli random variables, [2] presents a Generalized Binomial Distribution for identically distributed but dependent Bernoulli random variables.

Generalized Binomial Distribution

Let $X = \sum_{i=1}^{n}\epsilon_{i}$, where $\epsilon_{i}$, $i = 1,\ldots ,n$ are identically distributed Bernoulli random variables with probability of success $p$ and dependency coefficient $\delta$.  Then

$$P(X=k) = q\tbinom{n-1}{k}(p^{-})^{k}(q^{+})^{n-1-k} + p\tbinom{n-1}{k-1}(p^{+})^{k-1}(q^{-})^{n-1-(k-1)}$$

#### Survival Function of Correlated Channels in a Cluster Server

Suppose the selection of tasks may be modeled by the dependent Bernoulli random variables given in the previous section. That is, suppose the customer selects Tasks 1-$K$ in sequence, and the selection or rejection of Task 1 affects all subsequent tasks by a dependency coefficient $\delta$. From [2], the correlation between task selections $\epsilon_{i}$ and $\epsilon_{j}$ is given by
$$\rho = \text{Cor}(\epsilon_{i},\epsilon_{j})=\begin{cases}\delta,&i=1;j=2,\ldots,K\\\delta^{2},&i\neq j;i,j\geq 2\end{cases}$$

This illustrates the dependency of Tasks 2-$K$ on the outcome of Task 1, and notes that while Tasks 2-$K$ are still correlated with each other, the dependency is much lower. In a similar fashion to the independent channel server, the survival function is derived.

Theorem 2 (Survival Function of Multichannel Server with Clustered Tasks and Dependent Channels). Suppose model assumptions 1-6 are satisfied. In addition, suppose the selection of channels 1-$K$ are determined by identically distributed Bernoulli random variables with dependency coefficient $\delta$ as defined in [2]. Then the survival function of the server is given by
$$S_{Y}(t)=\bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}m(t-w)\bar{G}_{W}(w)\mathscr{S}(w)dw\right)$$

where $m(x) = \int_{0}^{x}\lambda(s)ds$, and

\begin{aligned}\mathscr{S}(w)&=\sum_{n=0}^{K}e^{-\eta n w}\sum_{j=0}^{K-n-1}\tbinom{K-1}{n-1, j,K-1-n-j}p^{K-1-j}(1-p)^{j+1}\delta^{K-1-n-j}(1-\delta)^{n}\\&\quad+\sum_{n=0}^{K}ne^{-\eta nw}\sum_{i=0}^{n-1}\tbinom{K-1}{K-1-n,i,n-1-i}p^{i+1}(1-p)^{K-n}\delta^{n-1-j}(1-\delta)^{K-n-j}\end{aligned}

## Conclusion

The generalized model of a server under random workload proposed in [3] admits further expansion by way of relaxing the assumption that incoming tasks have exactly one queue to enter on arrival. In considering a server partitioned into several channels, a cost is incurred, namely that additional stress to the server is dependent upon arrival and service times of all previous jobs. However, even under these circumstances, we may obtain a breakdown rate process and satisfactory conditional survival function for the server, and the door is opened to further discussion. By examining the multichannel server, we consider the interrelations of the channels themselves, and derive survival functions to meet the case when the channels are independent as well as when they are correlated.

### References

[1] Ki Hwan Cha and Eui Yong Lee. A stochastic breakdown model for an unreliable web server system and an optimal admission control policy. Journal of Applied Probability, 48(2):453–466, 2011.

[2] Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science, 11:43–58, 2013.

[3] R. Traylor. Stochastic reliability of a server under random workload. Academic Advances of the CTO, 1, 2017.

Generalizing the Negative Binomial Distribution via First-Kind Dependence

## Generalizing the Negative Binomial Distribution via First-Kind Dependence

This paper generalizes the negative binomial random variable by generating it from a sequence of first-kind dependent Bernoulli trials under the identity permutation. The PMF, MGF, and various moments are provided, and it is proven that the distribution is indeed an extension of the standard negative binomial random variable. We examine the effect of complete dependence of the Bernoulli trials on the generalized negative binomial random variable. We also show that the generalized geometric random variable is a special case of the generalized negative binomial random variable, but the generalized negative binomial random variable cannot be generated from a sum of i.i.d. generalized geometric random variables.

On Permuted First-Kind Dependence of Categorical Random Variables

## On Permuted First-Kind Dependence of Categorical Random Variables

This paper discusses the notion of horizontal dependency in sequences of first-kind dependent categorical random variables. We examine the necessary and sufficient conditions for a sequence of first-kind dependent categorical random variables to be identically distributed when the conditional probability distribution of subsequent variables after the first are permuted from the identity permutation used in previous works.

On Server Efficiency

## Abstract

Editor’s note: This paper comprises the second chapter of the PhD dissertation by Rachel Traylor. Cha and Lee defined a mathematical notion of server performance by measuring efficiency $\psi$ defined as the long run average number of jobs completed per unit time. The service time distribution heavily influences the shape of the server efficiency as a function of a constant arrival rate $\lambda$. Various classes of distributions are studied in order to find sufficient conditions for the existence of a single maximum. The existence of a maximum allows for simple binary control policies to handle traffic and optimize performance.

## Introduction, Motivation, and Background

Cha and Lee [1] studied the reliability of a single server under a constant stress workload, and also defined a notion of server efficiency $\psi$ for a given intensity $\lambda(t)$ as the long-run average number of jobs completed per unit time as a way to measure server performance. With the number of jobs completed as $M$, the efficiency is defined

$$\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}$$

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted $\{N^{*}(t), t \geq 0\}$, is a nonhomogenous Poisson process with the same intensity function $\lambda(t)$ as before, and that $\{N^{*}(t), t \geq 0\}$ is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution $H(t)$ with expected value $\nu$. This process is a renewal reward process, with the renewal $\{R_{n}\} = \{M_{n}\}$, the number of jobs completed. The length of a renewal cycle is $Y_{n} + H_{n}$, where $Y_{n}$ is the length of time the server was operational, and $H_{n}$ is the time to reboot after a server crash. Then, by [2],

$$\psi = \frac{E[M]}{E[Y]+ \nu}$$

where $M$ is the number of jobs completed in a particular renewal cycle, $\nu$ is the mean time to reboot of the server, and $Y$ is the length of a particular renewal cycle. Then, using the definition of $\psi$ the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.

Theorem 1 (Server Efficiency under Cha/Lee)
Suppose $\{N(t), t \geq 0\}$ is a nonhomogenous Poisson process with intensity $\lambda(t)\geq 0$. Then the efficiency is given by

\begin{aligned}\psi&=\frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\qquad\qquad\left.\times\left(r_{0}(t)a(t)+\eta a(t)b(t) \right)\right]\end{aligned} where $a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv$, $b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr$, $\bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds$, and $m(x) = \int_{0}^{x}\lambda(s)ds$.

### Numerical Example and Control Policies

As an illustrative example, Cha and Lee considered the case when $\lambda(t) \equiv \lambda$,
$r_{0}(t) \equiv r_{0} = 0.2$, $\eta = 0.01$, $\nu = 1$, and $g_{W}(w) = we^{-w^{2}/2}$ (the PDF of the Rayleigh distribution). As shown in Figure 1, there exists a $\lambda^{*}$ such that $\psi(\lambda)$ is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

(1) If the real time arrival rate $\lambda < \lambda^{*}$, do not interfere with arrivals.

(2) If $\lambda \geq \lambda^{*}$, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability $1-\frac{\lambda^{*}}{\lambda}$.

The Rayleigh distribution used in Figure 1 has applications in physics, typically when the magnitude of a vector is related to its directional components. It is a special case of the Weibull distribution which is widely used in survival analysis, failure analysis, weather forecasting, and communications. These distributions are not typically used to model service times. The exponential distribution is the most common due to its memoryless properties, followed by the Erlang and uniform distributions.

The efficiency, $\psi$, under the Rayleigh distribution example in Figure 1 shows the existence of a $0 < \lambda^{*} < \infty$ such that $\psi(\lambda)$ is maximized at $\lambda^{*}$. This useful feature of $\psi(\lambda)$ in this case allows for the implementation of a simple control policy for arrivals to the server to prevent overload, given above.

Numerical simulations under a variety of possible distribution classes, including convex, concave, exponential, uniform, and Erlang suggest that the mathematical properties of $\psi$ are heavily influenced by the choice and characteristics of service time distribution $g_{W}(w)$. In particular, it is of interest to seek sufficient conditions of $g_{W}(w)$ that will guarantee the existence of a $\lambda^{*}$ that maximizes $\psi$. This is done for the uniform, compact support, and Erlang classes. Furthermore, it is shown under certain conditions, not only does the server efficiency lack a maximum, but $\psi$ increases without bound. This is not representative of real server behavior, and thus of mathematical and practical interest to note for further modeling.

## Efficiency of a Server under Uniform Service Life Distribution

Suppose $\lambda(x) \equiv \lambda$, and suppose $r_{0}(x) \equiv r_{0} = \max_{x \in (0,\infty)}r_{0}(x)$. The efficiency $\psi$ is given for constant $\eta$ and $\lambda$ here:
$\psi(\lambda) = \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\int_{0}^{\infty}\exp\left(-r_{0}t-\lambda t + a(t)+b(t)\right)(r_{0}+b(t))a(t)dt\right]$

where $S_{Y}(t)$ is the survival function of the node, $a(t) = \int_{0}^{t}e^{-\eta v}g(v)(t-v)dv$, $b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}(t-r)dr$,
$g(v)$ is the pdf of the service time distribution, and $\bar{G}(x) = 1-\int_{0}^{x}g(s)ds$.

The following theorem gives sufficient conditions for the uniform distribution and $\eta$ that guarantee the existence of a finite maximum efficiency.

Theorem 2(Efficiency under Uniform Service Distribution)
Suppose the service life distribution is given by Uniform($c,d$) for some $0. Then if
$\sigma > \frac{ce^{-c\eta}}{\sqrt{12}\phi(-\eta)(1+\eta(c+d))+c\eta - 1}$ where $\phi(-\eta)$ the standard deviation of the service life $W$, $\psi(\lambda)$ has a maximum on $(0,\infty)$ is the moment generating function of a uniform distribution evaluated at $-\eta$.

Numerical simulations suggest that $\psi$ increases without bound for $c=0, d>1$. The following lemma proves this fact.

Lemma
Suppose the service life distribution is given by Uniform(0,d), with $d>1$. Then $\psi$ increases without bound.

This is worth discussing here. It makes no sense whatsoever for the efficiency of a server to increase forever as the arrival rate increases. So what’s happening here? Notice that if the uniform distribution include 0 as an endpoint, then there is a positive probability that the service time for a job will be exactly 0 (in whatever time units you like). This is impossible in reality. A small service time is possible, but it’s still strictly greater than 0. What we see here is that using distributions with positive support at 0 causes issues in the efficiency function; it’s not the fault of the definition of efficiency, but rather a consequence of using distributions that cannot mirror reality. The next section explores this further with a more broad class of distributions.

## Extension of the Uniform Distribution: Compact Support

The ideas and techniques presented in the previous section yield a powerful extension to any service life distribution with compact support away from 0. Supposing $g_{W}(w)$ has compact support $[a,b]$, it may be bounded above by a positively scaled uniform distribution. In practice, service times are finite and nonzero, thus this extension allows for very simple control policies to be implemented for a much larger class of distributions.

Theorem 3 (Efficiency under Compact Support)
Let $g_{W}(w)$ be the pdf of the service times having compact support $[c,d], d > 0$. Let $m = \max_{w}g_{W}(w) < \infty$, and $R = (d-c)$ be the length of the support. Then $\psi(\lambda)$ has a maximum if $m < \frac{c}{R\eta +e^{-b\eta}-e^{-a\eta}+\eta e^{-a\eta}}$.

Example 1

Let $g_{W}(w) = \frac{2}{5}w\mathbb{1}_{[2,3]}(w)$, and let $r_{0} = \nu = \eta = 1$. Then $m = \frac{6}{5}$. By Theorem 2, $m < \frac{2}{e^{-2}+e^{-3}+1}\approx 1.678$. Thus, the existence of a maximum $\psi$ is guaranteed. Figure 2 gives the numerical result with step size of 0.1. The maximum occurs around $\lambda^{*} = 0.5$

Example 2: Sufficient conditions aren’t always necessary conditions
The condition given in Theorem 2 is rather weak and is only a sufficient condition. As an illustration, consider the same increasing density shifted to a support of $[1,2]$. Thus
$g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w)$. Retain all other assumptions from Example 1. Then
$$m = \frac{4}{3} > \frac{1}{e^{-2}+1} \approx 0.88$$ which violates the condition. However, consider Figure 3 above. Clearly $\psi$ has a maximum at approximately $\lambda^{*} = 0.8$
The condition given in Theorem 2 relies on overestimating $g_{W}(w)$ by a constant. If the variance $\sigma_{W}^{2}$ of $g_{W}(w)$ is large, the maximum of the pdf will decrease and be more comparable to the rest of the distribution. In these cases, bounding $g_{W}(w)$ by its max $m$ over the support $[a,b]$ will give a reasonable approximation. However, in the case of a small support and high enough skew compared to the size and location of the support, much of the mass is concentrated at the right end of the support, and $m$ will be higher. In these cases, bounding $g_{W}(w)$ by $m$ will result in a large amount of overestimation, and thus the condition may fail, but $\psi$ still has a maximum. The numerical example wherein $g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w)$ illustrates the conservative nature of this type of estimation.

## Efficiency under Erlang Service Distribution

Now suppose $g(v)$ is of the Erlang class but shifted $\delta > 0$ to the right. For motivation, consider that service times can never be 0 in a practical setting. Then the PDF and the complement of the CDF are given by
\begin{aligned}g(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}0, & 0 \leq v \leq \delta \\\frac{\gamma^{k}(v-\delta)^{k-1}e^{-\gamma(v-\delta)}}{(k-1)!}, & v\geq \delta\end{array}\right.\\\bar{G}(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}1, & 0 \leq v \leq \delta \\e^{\gamma(\delta-v)}\sum_{j=0}^{k-1}\frac{\gamma^{j}(v-\delta)^{j}}{j!}, & v\geq \delta\end{array}\right.\end{aligned}

Theorem 4 (Server Efficiency under Erlang Distribution)
Let $\delta > 0, \eta > 0$. Let $\alpha(\delta) = \left(\frac{\gamma}{\gamma + \eta}\right)^{k}e^{-\eta\delta} + \frac{\gamma^{k}e^{\gamma\delta}(k-1)}{(\eta + \gamma)^{k-1}}$, and $0 < \beta(\delta,\eta) < 1$. If the service life distribution is of the $\delta-$shifted Erlang class, then $\psi(\lambda)$ has a maximum in $\lambda$ on $(0,\infty)$ for $\delta,\eta$ such that $\alpha(\delta) + \beta(\delta,\eta) < 1$.

Thus, we see that the Erlang service distribution produces a maximum as long as the support is shifted away from 0.

## Efficiency under Exponential Distribution

The exponential distribution is the most common service distribution assumed for a queue. The M/M/1 queue assumes a Poisson distribution for arrival times, and an exponential distribution or service times. However, the exponential distribution has most of its mass concentrated at or near 0. As we have seen in other distributions, this becomes a problem, both in considering what makes sense in practical applications, and in its effect on the efficiency.
Suppose $g(v)$ is of the exponential class. That is, $g(v) = \gamma e^{-\gamma v}, \gamma > 0$. It will be proven that under certain conditions on $\eta$ and $\gamma$, an exponential $g(v)$ causes $\psi$ to increase without bound.

Theorem 5 (Efficiency under Exponential Distribution)
Suppose $g(v) = \gamma e^{-\gamma v}$. Then if $\frac{2\gamma}{\gamma + \eta} > 1 + \frac{2}{\frac{2}{\gamma} + W\left(-\frac{\gamma}{\gamma +\eta}e^{-2-\frac{2\eta}{\gamma}}\right)}$, $\psi(\lambda) \to \infty$,  as $\lambda \to \infty$ where $W(\cdot)$ is the Lambert W-function.

## Extension to Random Stress and Nonconstant Intensity

All the above theorems regarding the server efficiency  assumed constant stress $\eta$ and constant intensity $\lambda$. This section generalizes the analyses in the previous sections for nonconstant intensity and random stress.

The stochastic reliability models in both [1]  and the previous chapter both assumed a time-dependent intensity $\lambda(t)$. By setting $\lambda \equiv \max_{t}\lambda(t)$, the established efficiency theorems provide a conservative set of conditions under which a maximum efficiency may be obtained. In these cases, $\psi$ is actually a function of all possible $\lambda_{\max}$.

If the job stresses are random, as in Theorem 2 of Traylor (2016), the above sections may still be utilized. Assume the sample space for $\mathcal{H}$ is compact and nonnegative. The efficiency is given by Theorem 2 of Traylor (2016).
WLOG, again suppose the sample space of $\mathcal{H}$ is discrete, given by $\{\eta_{1},...,\eta_{m}\}$ with respective probabilities $p_{i}, i =1,...,m$. Now suppose all mass is concentrated at $\eta_{[m]}$. Let $a_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}v}g(v)(t-v)dv$, and $b_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}(t-r)}\bar{G}_{W}(t-r)dr$. Then, the following are true:
(1) $E_{\mathcal{H}}[a(t)+b(t)] \leq a_{m}(t) + b_{m}(t)$

(2)$E_{\mathcal{H}}[a(t)] \leq a_{m}(t)$

(3) $E_{\mathcal{H}}[\mathcal{H} a(t)b(t)] \leq \eta_{m}a_{m}(t)b_{m}(t)$

Thus, by replacing the expectations in (1)-(3) with their respective upper bounds in the efficiency theorems, analyses of the efficiency for the uniform, compact support, and Erlang classes may proceed as previously detailed. These estimates are conservative but sufficient.

For an exponential service life distribution and random stress, create a lower bound for the expectations in
(1)-(3) by concentrating all mass at $\eta_{[1]}$. Then the conditions from Theorem 5 guarantee an explosion for the exponential distribution.

## Implications and Numerical Illustrations

### Uniform Service Distribution

Click on an image in the gallery to expand and explore the efficiency of a server under different instances of the uniform service distribution.
For the uniform service distribution, both the variance of $g_{W}(w)$ and the location of the support affect the efficiency itself. The gallery above shows $\psi(\lambda)$ for various uniform distributions as an illustration. In all four cases, $r_{0} = \nu =\eta =1$. The variance of a uniform distribution is given by $\sigma^{2} = \frac{d-c}{12}$. With the exception of the fourth gallery image, which illustrates the explosion of $\psi$ when the distribution has positive mass at 0, Table 1 below compares possible values of $\psi$ for different Uniform distributions. Notice that while the variance $\sigma^{2}$ does affect the range of $\psi$ by several orders of magnitude, the location of the support has a much more powerful effect. Thus, if all service times are equally likely, a server is less efficient if it is consistently but mildly slow (Uniform[10,11]) compared to an inconsistent server (Uniform[1,500]).

$$\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approximate Range of } \psi & \text{Approximate } \lambda^{*}\\\text{Uniform}[1,2] & 1/12 &1.5 & (0,0.012) & 1.3\\ \text{Uniform}[10,11] & 1/12 &10.5 & (0, 2\times10^{-11}) & 0.1 \\ \text{Uniform}[1,500] & 499/12 & 250.5 & (0,3\times 10^{-5}) & 1.6\end{array}$$

### Increasing but Compact Service Distribution

Click to expand and explore.

$$\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\small{\frac{2w}{3}}\mathbb{1}_{[1,2]}(w)& \approx 0.0802&\approx 1.56 & (0,7\times10^{-3}) & 0.75\\\small{\frac{2w}{5}}\mathbb{1}_{[2,3]}(w)&\approx 0.0822 &\approx 2.53 & (0, 7\times10^{-4}) & 0.5 \\\small{\frac{2w}{500^{2}-1}}\mathbb{1}_{[1,500]}(w)& \approx 13888& 333.35 & (0,1.3\times 10^{-7})& 1.4\end{array}$$

As an illustration of a distribution on compact support, consider the class of increasing densities $g_{W}(w) = cx\mathbb{1}_{[a,b]}(x)$. Several examples are given in the above gallery table. For both compact supports of length 1, the variance is approximately the same, but the mean changes, producing an order of magnitude decrease in efficiency. Compared to the compact support of length 499, with a much larger mean, the efficiency decreases by 3 orders of magnitude. Notice, however, that the decline after the maximum is much less sharp.

### Erlang Service Distribution

Click on an image to explore.

$$\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\text{Erlang}(2,1)& 2&2 & (0,0.7)& 9\\\text{Erlang}(9,1)&9 &9 & (0, 4\times10^{-6}) & 0.5 \\\text{Rayleigh}(1)& (4-\pi)/2& \sqrt{\pi/2} & (0,0.9)& 8\end{array}$$ The gallery above gives two examples of $\psi$ under an Erlang distribution. Notice the change in the efficiency as the mean increases. Here, since $\lambda =1$, $\sigma^{2} = \mu$, so the mean likely has the largest effect on $\psi$.

### Comparing all the examples

Comparing all examples, the Rayleigh(1) service distribution imposes the highest maximum efficiency, followed closely by the Erlang(2,1) service distribution, with the Uniform[1,2] service distribution following. $\lambda^{*}$ under the Erlang(2,1) service distribution is larger than for the Rayleigh(1) service distribution, indicating that a server whose service times follow the former distribution can handle a larger arrival intensity before its efficiency begins to decline than the latter.

The means for the Rayleigh(1), Erlang(2,1), and Uniform[1,2] distributions are similar, as shown in the tables, but the Uniform[1,2] distribution has equal probability of any service time in its support with large negative excess kurtosis. It is posulated that kurtosis, skew, and variance play large roles in the behavior and range of $\psi$. Compare the efficiency under the Erlang(2,1) service distribution with the efficiency under the Erlang(9,1) service distribution. Not only is the mean much lower for the Erlang(2,1) distribution, but the distribution is more strongly positive-skewed than the Erlang(9,1). Thus, more mass is concentrated at the left side of the distribution, indicating that the service times are more often shorter.

Finally, to note the effect of the typical stress level $\eta$ on the range of $\psi$, we compare the original Cha/Lee figure with our Rayleigh figure above. The service distribution and all other quantities remain the same, but Cha and Lee’s numerical example set $\eta = 0.01$, whereas the third image in the Erlang gallery shows $\psi$ under $\eta = 1$. The range of $\psi$ decreases by two orders of magnitude with a 100 fold increase in $\eta$, with the shape remaining similar. In addition, the location of the maximum $\lambda^{*}$ also inversely varies by the same magnitude.

Studying the efficiency under various service distributions aids not only in deciding when to implement a server intervention, but also aids in evaluating the performance of various servers given their service times.

## Conclusion

Many questions have been opened up about the various effects of the service distribution and stress on the server efficiency. The conditions given above are merely sufficient conditions, and have not been proven necessary. Studying the efficiency as a function of the arrival rate under various service distributions and stress provides a way to understand the performance of a server and the various factors that can affect it mathematically.

## References

1.  CHA, K. H., AND LEE, E. Y. A stochastic breakdown model for an unreliable web server system and an optimal admission control policy. Journal of Applied Probability 48, 2 (2011), 453–466.
2.  ROSS, S. Stochastic Processes, 2nd edition.
3.  TRAYLOR, R. Stochastic reliability of a server under random workload. Ph.D. thesis (2016).
A Generalized Geometric Distribution from Vertically Dependent Bernoulli Random Variables

## Abstract

This paper generalizes the notion of the geometric distribution to allow for dependent Bernoulli trials generated from dependency generators as defined in Traylor and Hathcock’s previous work. The generalized geometric distribution describes a random variable $X$ that counts the number of dependent Bernoulli trials until the first success. The main result of the paper is  $X$ can count dependent Bernoulli trials from any dependency structure and retain the same distribution. That is, if  $X$ counts Bernoulli trials with dependency generated by $\alpha_{1} \in \mathcal{C}_{\delta}$, and $Y$ counts Bernoulli trials with dependency generated by $\alpha_{2} \in \mathscr{C}_{\delta}$, then the distributions of  $X$ and $Y$ are the same, namely the generalized geometric distribution. Other characterizations and properties of the generalized geometric distribution are given, including the MGF, mean, variance, skew, and entropy.

## Introduction

The standard geometric distribution counts one of two phenomena:

1.  The count of i.i.d. Bernoulli trials until the first success
2.  The count of i.i.d. Bernoulli trials that resulted in a failure prior to the first success

The latter case is simply a shifted version of the former. However, this distribution, in both forms, has limitations because it requires a sequence of independent and identically distributed Bernoulli trials. Korzeniowski [2] originally defined what is now known as first-kind (FK) dependent Bernoulli random variables, and gave a generalized binomial distribution that allowed for dependence among the Bernoulli trials. Traylor [4] extended the work of Korzeniowski into FK-dependent categorical random variables and derived a generalized multinomial distribution in a similar fashion. Traylor and Hathcock [5] extended the notion of dependence among categorical random variables to include other kinds of dependency besides FK dependence, such as sequential dependence. Their work created a class of vertical dependency structures generated by a set of functions

$$\mathscr{C}_{\delta} = \{\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N} : \alpha(n) < n \text{ and } \forall n \exists j \in \{1,\ldots,n-1\} : \alpha(n) = j\},$$

where the latter property is known as dependency continuity. In this paper, we derive a generalized geometric distribution from identically distributed but dependent Bernoulli random variables. The main result is that the pdf for the generalized geometric distribution is the same regardless of the dependency structure. That is, for any $\alpha \in \mathscr{C}_{\delta}$ that generates a sequence of identically distributed but dependent Bernoulli trials, the generalized geometric distribution remains unchanged.

## Background

The standard geometric distribution is built from a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables with probability of success $p$ and probability of failure $q=1-p$. There are two “versions” of the geometric distribution:

• A random variable $X$ has a geometric distribution if it counts the number of Bernoulli trials needed to observe the first success.
•  A random variable $Y = X-1$ has a geometric distribution if it counts the number of failures in a sequence of Bernoulli trials before the first observed success.

In the first case, $X$ has support $\{1,2,3,...\}$, because we are looking for the first success, which can occur on trial 1, 2, 3, and so forth. In the latter case, $Y$ has support $\{0,1,2,...\}$ because we are counting the number of failures before the first success occurs. That is, if the first success occurs on Trial 1, then there were 0 failures preceding the first success. If the first success occurs on trial 2, then one failure occurred prior to the first success, and thus $Y=1$. Essentially, Version 2 is a shifted Version 1, because our perspective changes– we do not include the success in the count in Version 2.

For Version 1, known as the standard geometric distribution the pdf is given by

$$f_{X}(k) = q^{k-1}p, \quad k = 1, 2, 3, \ldots$$

For Version 2, (the shifted generalized geometric distribution)the pdf is given by

$$f_{Y}(k) = q^{k}p, \quad k = 0, 1, 2, \ldots$$

The next section derives the generalized geometric distribution for FK-dependent random variables, and then shows that the pdf is the same regardless of dependency structure.

## Generalized Geometric Distribution

### Derivation from FK-Dependent Bernoulli Random Variables

Suppose we have a sequence of FK-dependent Bernoulli Random variables. Recall from [2] and [4] that FK-dependent random variables are weighted toward the outcome of the first variable $\epsilon_{1}$. That is, in the Bernoulli case, $P(\epsilon_{1} = 1) = p$ and $P(\epsilon_{1} = 0) = q = 1-p$. For subsequent variables in the sequence,

\begin{aligned}P(\epsilon_{n} = 1 | \epsilon_{1}= 1) = p^{+} &\qquad P(\epsilon_{n} = 1 | \epsilon_{1} = 0) = p^{-} \\P(\epsilon_{n} = 0 | \epsilon_{1} = 1) = q^{-} &\qquad P(\epsilon_{n} = 0 | \epsilon_{1} = 0) = q^{+}\end{aligned}

for $n\geq 2$, where $q = 1-p$, $p^{+} = p + \delta q$, $p^{-} = p-\delta p$, $q^{-}= q-\delta q$, $q^{+} = q + \delta p$, and $0 \leq \delta \leq 1$ is the dependency coefficient.

We will first give the generalized “Version 1” of the geometric distribution for FK-dependent random variables.

Proposition.
Suppose $\epsilon = (\epsilon_{1},\epsilon_{2},\ldots, \epsilon_{n},\ldots)$ is a FK-dependent sequence of Bernoulli random variables. Let $X$ be the count of such Bernoulli variables needed until the first success. Then $X$ has a generalized geometric distribution with pdf
$$f_{X}(k) = \left\{\begin{array}{lr}p, & k = 1 \\q\left(q^{+}\right)^{k-2}p^{-}, & k \geq 2\end{array}\right.$$

In a similar fashion, suppose we prefer to count the number of failures before the first success occurs. For this generalized “Version 2”, we have the following proposition.

Proposition.
Suppose $\epsilon = (\epsilon_{1},\epsilon_{2},\ldots,\epsilon_{n},\ldots)$ is a FK-dependent sequence of Bernoulli random variables. Let $Y = X-1$ be the count of failures prior to the first success. Then $Y$ has a shifted generalized geometric distribution with pdf

$$f_{Y}(k) = \left\{\begin{array}{lr}p, & k = 0 \\q\left(q^{+}\right)^{k-1}p^{-}, & k \geq 1\end{array}\right.$$

### Generalized Geometric Distribution for any Vertical Dependency Structure

Propositions 1 and 2 were derived for FK-dependent random variables, but in fact these random variables $X$ and $Y$ remain distributed according to the generalized geometric distribution and the shifted generalized geometric distribution regardless of the vertical dependency structure specified, as long as the dependency structure was generated from a function in $\mathscr{C}_{\delta}$.

Theorem.
Let $\epsilon = (\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{n},\ldots)$ be a vertically dependent sequence of Bernoulli random variables, where the dependency is generated by $\alpha \in \mathscr{C}_{\delta}$. Let $X$ be the count of such Bernoulli trials needed until the first success, and let $Y = X-1$ be the count of failures of such Bernoulli trials prior to the first success. Then the pdf of $X$ and $Y$ is identical to those given in Propositions 1 and 2.

This result is quite powerful, and not one that holds for all generalized distributions constructed from dependent random variables. Given any vertical dependency structure generated from the broad class $\mathscr{C}_{\delta}$, the count of trials before a success and the count of failures before a success have the same probability distribution. Thus, if this information is desired, no information about the dependency structure other than the membership of its generating function in $\mathscr{C}_{\delta}$ is necessary. The only information needed to calculate the generalized geometric probabilities for dependent Bernoulli trials is $p$ and $\delta$.

The next section gives some basic properties of the Generalized Geometric Distribution, such as the moment generating function and selected moments.

## Properties  of the Generalized Geometric Distribution

### Moment Generating Function

Fact.

The moment generating function of the generalized geometric distribution is
$$M_{X}(t) = pe^{t} + \frac{qp^{-}e^{2t}}{1-q^{+}e^{t}}$$

### Mean

Fact.

The mean of the generalized geometric distribution is
$$E[X] = \mu = \frac{1-\delta p}{p(1-\delta)}$$

The effect of dependence can be seen in the plot of $E[X]$ below in Figure 1.. For fixed $p$, when $\delta \to 1$, $E[X] \to \infty$, though the rate changes with $p$.

To explore further, suppose $p = 1/2$, and the Bernoulli trials are thus balanced between success and failure. Figure 2 shows the effect of delta for a fixed $p$. Notice that the effect of $\delta$ on the expected value become more pronounced as $\delta \to 1$. In particular, for $\delta = 1/2$ and $p=1/2$, $E[X] = 3$, but after this point, an increase of only $1/6$ in $\delta$ to $\delta = 2/3$ increased the expected value to 4 trials before a success. To double the expected number of trials before a success again to $E[X] = 8$ requires an increase of $\delta$ by only $4/21$ to $6/7$.

A smaller probability of success $p$ will yield an expected value $\mu$ that is much more susceptible to effects of dependency $\delta$, and a larger $p$ will yield an expected value more resistant to high dependency $\delta$. Since the geometric distribution is a count of the number of trials needed to obtain the first success, a higher $p$ increases the probability that the first success occurs on the first trial, while a lower $p$ decreases that probability. Therefore, the dependency $\delta$ would have a higher effect for lower $p$, because a longer (dependent) sequence is expected to be generated prior to the first success, which increases the expected number of trials faster than if the Bernoulli trials were independent.

Remark. Notice that when $\delta = 0$, the Bernoulli trials are independent. The mean of the generalized geometric distribution when $\delta = 0$ is $E[X] = \frac{1}{p}$, the mean of the standard geometric distribution.

### Variance

Fact.

The variance of the generalized geometric distribution is

$$\text{Var}(X) = \sigma^{2} = \frac{1-p + \delta p(1-p)}{p^2(1-\delta)^2}$$

Figure 3 shows the effect of $\delta$ on the variance for different values of $p$. As with the mean, a smaller $p$ induces a higher effect of $\delta$ on the variance of the number of dependent Bernoulli trials before the first success is observed. The shape of all 4 cases is similar, but the scales are vastly different. As $p$ increases, the scale of the variance decreases dramatically.

Remark.  Again, note that when $\delta = 0$, the variance of the generalized geometric distribution reduces to that of the standard geometric distribution.

### Skew

Fact.

The skew of the generalized geometric distribution is given by
$$\text{Skew}[X] = \frac{2-3p + p^2 + \delta p[q+\delta p q + p(2\delta-1-p)]}{\left(q + \delta pq\right)^{3/2}}$$

The skew of the generalized geometric distribution gets more complicated in its behavior as a function of both $p$ and $\delta$. Figure 4 shows the skew as a function of $p$ for the two extreme cases: complete independence ($\delta = 0$) and complete dependence $\delta = 1$. From $p=0$ to $p \approx 0.658$, the skew for the independent geometric distribution is greater than the completely dependent case. For $p \gtrapprox 0.658$, the skew is greater under complete dependence.

### Entropy

The entropy of a random variable measures the average information contained in the random variable. It can also be viewed as a measure of how unpredictable or “truly random” the variable is [1] . The definition of entropy, denoted $H(X)$, was coined by Claude Shannon [3] in 1948.

Definition. (Entropy)
$$H(X) := -\sum_{i}P(x_{i})\log_{2}(P(x_{i}))$$

For the standard geometric distribution, the entropy is given by

$$H_{sg}(X) = \frac{-(1-p)\log_{2}(1-p)-p\log_{2}(p)}{p}$$

Fact.
The entropy for the generalized geometric distribution is
$$H_{gg}(X) = \frac{-\left[pp^{-}\log_{2}(p) + qp^{-}\log_{2}(qp^{-}) + qq^{+}\log_{2}(q^{+})\right]}{p^{-}}$$

Figure 5a shows $H_{gg}(X)$ as a function of $p$ for fixed values of $\delta$. Notice that while the entropy decreases to 0 for all curves as $p \to 1$, the entropy curve is shifted upward for larger $\delta$. Figure 5b fixes $p$ and looks at entropy as a function of $\delta$. Notice that for smaller $p$, the entropy is much higher, which aligns with intuition.

## Conclusion

The standard geometric distribution counts the number of independent Bernoulli trials until the first success. This paper uses the works of Korzeniowski, and Traylor and Hathcock [2, 4, 5] on dependent Bernoulli and categorical random variables to develop a generalized geometric distribution built from dependent Bernoulli random variables. The main result of the paper is that the pdf for the generalized geometric distribution is independent of the dependency structure of the Bernoulli random variables that comprise it. That is, regardless of dependency structure, the pdf for the generalized geometric distribution is given by Proposition 1. Various properties and characterizations were given, including the moment generating function, mean, variance, skew, and entropy. The effect of dependency on each property was studied.

## References

1. Skewness. https://en.wikipedia.org/wiki/skewness.
2. Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science,pages 43–58, 2013.
3. Claude Shannon. A mathematical theory of communication. The Bell System Technical Journal, 1948.
4.  Rachel Traylor. A generalized multinomial distribution from dependent categorical random variables. Academic Advances of the CTO, 2017.
5. Rachel Traylor and Jason Hathcock. Vertical dependency in sequences of categorical random variables. Academic Advances of the CTO, 2017.
Stochastic Reliability of a Server under a Random Workload

## Abstract

Editor’s note: This paper is the first chapter from a PhD thesis published in 2016 by Rachel Traylor. We generalize a 2011 model from Cha and Lee that gave a closed form for the survival function of a server under a random workload, and with random service times. In this work, the constant stress assumption of Cha and Lee is relaxed and allowed to be random; that is, every request to the server brings a random stress or workload to the server for the duration of its stay until completion. The efficiency measure of a random stress server is defined as the long run average number of jobs completed per unit time, and provides a way to measure server performance.

## Introduction

There are many types of systems which can be dubbed servers, such as a retail checkout counter, a shipping company, a web server, or a customer service hotline. All of these systems have common general behavior. Requests or customers arrive via a stochastic process, the service times vary randomly, and each request stresses the server if only temporarily. A general stochastic model that describes the reliability of such a server can provide the necessary information for optimal resource allocation and efficient task scheduling, leading to significant cost savings for businesses and improved performance metrics [6]. Such topics have been studied in literature for several decades [1, 2, 3, 20].

Much attention was devoted to reliability principles that model software failures and bug fixes, starting with Jelinski and Moranda in 1972 [11]. The hazard function under this model shows the time between the i-th failure and the (+ 1)st failure. Littlewood (1980) [16] extended this initial reliability model for software by assuming differences in error size [12].

These models have been extended into software testing applications [4, 5, 19] and optimal software release times [7, 8, 17, 22]. The explosion of e-commerce and the resulting increase in internet traffic have led to the development of reliability models for Web applications. Heavy traffic can overload and crash a server; thus, various control policies for refreshing content and admission of page requests were created [10, 13, 15, 18, 23].

In particular, Cha and Lee (2011) [9] proposed a stochastic breakdown model for an unreliable web server whose requests arrive at random times according to a nonhomogenous Poisson process and bring a constant stress factor to the system that dissipates upon service completion. The authors provide a fairly general survival function under any service distribution $g_{W}(w)$, define server efficiency to measure performance, and illustrate a possible admission control policy due to an observed property of the server efficiency under a specific numerical example.

Thus far, no extensions of [9] have been proposed. This work generalizes the model put forth by Cha and Lee in a variety of ways. First, the assumption of constant job stress is relaxed and replaced by a random variable, and a new survival function and efficiency equation are derived. This work, while suitable for IT applications, is general enough for use in almost any industry, including logistics, retail, manufacturing, and engineering systems.

## Background (Cha and Lee’s Model)

### System Description and Survival Function

Cha and Lee considered a web server wherein each request arrives via a nonhomogenous Poisson process $\{N(t): t \geq 0\}$ with intensity function $\lambda(t)$. Each request adds a constant stress $\eta$, increasing the breakdown rate for the duration of service. Suppose $r_{0}(t)$ is the breakdown rate of the idle server. Then the breakdown rate process $\mathscr{B}(t)$ is defined as

$$\mathscr{B}(t) := r_{0}(t) + \eta\sum_{j=1}^{N(t)}\mathbb{1}(T_{j} \leq t \leq T_{j} + W_{j})$$

where $N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)}$ are the random variables that describe the number of arrivals, arrival times, and service times, respectively. It is assumed that $\{T_{j}\}_{j=1}^{N(t)}$ are independent of each other. Furthermore, the $\{W_{j}\}_{j=1}^{N(t)} \sim g_{W}(w)$ are i.i.d. and are mutually independent of all $T_{j}$‘s.

Under these conditions, Cha and Lee proved the following theorem that gives the explicit form of the survival function under this model:

Theorem.

Suppose that $\{N(t), t \geq 0\}$ is a nonhomogenous Poisson process with intensity function $\lambda(t)$, i.e. $m(t) \equiv \int_{0}^{t}\lambda(x)dx$. Assuming the conditional survival function is given by

$$P\left(Y > t \left| N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)}\right.\right) = \bar{F}_{0}(t)\exp\left(-\eta\sum_{j=1}^{N(t)}\min(W_{j}, t-T_{j})\right)$$

and $m(t)$ has an inverse, the survival function of $Y$ is given by

$$S_{Y}(t) = \bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}\exp(-\eta w)\bar{G}_{W}(w)m(t-w)dw\right)$$

and the hazard function of $Y$, denoted $r(t)$, is given by

$$r(t) = r_{0}(t) + \eta\int_{0}^{t}e^{-\eta w}\bar{G}_{W}(w)\lambda(t-w)dw$$

### Efficiency of the Server

It is natural to develop some measure of server performance. Cha and Lee measure such performance by defining the efficiency, $\psi$, of the web server as the long-run expected number of jobs completed per unit time. That is, with the number of jobs completed as $M$, the efficiency is defined

$$\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}$$

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted $\{N^{*}(t), t \geq 0\}$, is a nonhomogenous Poisson process with the same intensity function $\lambda(t)$ as before, and that $\{N^{*}(t), t \geq 0\}$ is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution $H(t)$ with expected value $\nu$. This process is a renewal reward process, with the renewal $\{R_{n}\} = \{M_{n}\}$, the number of jobs completed. The length of a renewal cycle is $Y_{n} + H_{n}$, where $Y_{n}$ is the length of time the server was operational, and $H_{n}$ is the time to reboot after a server crash. Then, by [21],

$$\psi = \frac{E[M]}{E[Y]+ \nu}$$

where $M$ is the number of jobs completed in a particular renewal cycle, $\nu$ is the mean time to reboot of the server, and $Y$ is the length of a particular renewal cycle. Then, using the definition of efficiency, the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.

Theorem. (Efficiency of the Constant Stress Server)
Suppose $\{N(t), t \geq 0\}$ is a nonhomogenous Poisson process with intensity $\lambda(t)\geq 0$. Then the efficiency is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\kern{10em}\biggl.\times\left(r_{0}(t)a(t) + \eta a(t)b(t)\right)\biggr]\end{aligned}

where $a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv$, $b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr$, $\bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds$, and $m(x) = \int_{0}^{x}\lambda(s)ds$.

### Numerical Examples and Control Policies

As an illustrative example, Cha and Lee considered the case when $\lambda(t) \equiv \lambda$,
$r_{0}(t) \equiv r_{0} = 0.2$, $\eta = 0.01$, $\nu = 1$, and $g_{W}(w) = we^{-w^{2}/2}$ (the PDF of the Rayleigh distribution). As shown in the figure above, there exists a $\lambda^{*}$ such that $\psi(\lambda)$ is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

1. If the real time arrival rate $\lambda < \lambda^{*}$, do not interfere with arrivals.
2. If $\lambda \geq \lambda^{*}$, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability $1-\frac{\lambda^{*}}{\lambda}$. The next section presents a generalization of the above model by relaxing the assumption that the job stress $\eta$ is constant. This yields a far more generalized model that can encompass random stress with any distribution.

## Random Stress Reliability Model

The basic model here is very similar to Cha and Lee’s. The main difference is that we relax the constant stress assumption and allow it to be random.

Assume that each job $j$ coming into the server adds a random stress $\mathcal{H}_{j}$ to the server for the duration of its time in the system. Suppose $\{\mathcal{H}_{j}\}_{j=1}^{N(t)} \sim \mathcal{H}$, and are i.i.d. where WLOG $\mathcal{H}$ is a discrete random variable with a finite sample space $S = \{\eta_{i}: \eta_{i} \in \mathbb{R}^{+}, i = 1,...,m \text{ for } m \in \mathbb{N}\}$ and probability distribution given by

$$P(\mathcal{H} = \eta_{i}) = p_{i}, \quad i = 1,\ldots,m$$

The following assumptions from Cha and Lee are retained:

1. (CL1) Requests arrive via a nonhomogenous Poisson Process $\{N(t), t \geq 0\}$ with intensity $\lambda(t)$.
2. (CL2) Arrival times $\{T_{j}\}_{j=1}^{N(t)}$ are independent.
3. (CL3) Service times $\{W_{j}\}_{j=1}^{N(t)}$ are i.i.d. with pdf $g_{W}(w)$ and mutually independent of all arrival times.

Then, the random stress breakdown rate (RSBR) process $\mathscr{B}(t)$ is given by

$$\mathscr{B}(t) = r_{0}(t) + \sum_{j=1}^{N(t)}\mathcal{H}_{j}\mathbb{1}(T_{j} < t \leq T_{j} + W_{j}), \quad t \geq 0$$

Compare the sample trajectory shown in the figure above to the sample trajectory of breakdown rate under the original model (see page 2). The random stress brought by each request still disappears upon job completion, but the effect on the server is no longer deterministic. Thus, for the same set of arrival times and respective completion times, the realization of the breakdown rate process under the RSBR model has one more element of variation.

Let $Y$ be the random time to breakdown of the web server given the workload from client requests. Let $\mathfrak{T} = \{T_{j}\}_{j=1}^{N(t)}$, $\mathfrak{W} = \{W_{j}\}_{j=1}^{N(t)}$, and $\mathfrak{H} = \{\mathcal{H}_{j}\}_{j=1}^{N(t)}$, with observed values $\mathfrak{t} = \{t_{j}\}_{j=1}^{N(t)}$, $\mathfrak{w} = \{w_{j}\}_{j=1}^{N(t)}$, and $\mathfrak{h} = \{\eta_{i_{j}}\}_{i_{j}=1}^{N(t)}$. Then the conditional survival function of the server, given the arrival process of client requests ($N(t)$), job stresses ($\mathfrak{H}$), service times ($\mathfrak{W}$), and arrival times ($\mathfrak{T}$) is

\begin{aligned}S_{\scriptstyle{Y|N(t), \mathfrak{T}, \mathfrak{W}, \mathfrak{H} }}(t|n, \mathfrak{t}, \mathfrak{w}, \mathfrak{h}) &=e^{ -\int_{0}^{t}\mathscr{B}(x)dx}\\&= \bar{F}_{0}(t)e^{-\sum_{j=1}^{N(t)} \mathcal{H}_{i}\min(W_{j}, t-T_{j})}\end{aligned}

where $\bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(x)dx\right)$.

### Survival Function for the RSBR Server

Under the RSBR generalization, the survival function of the server is given in the following theorem.

Theorem 1. (Survival Function for a Random Stress Server)
Suppose that jobs arrive to a server according to a nonhomogenous Poisson process
$\{N(t), t \geq 0\}$ with intensity function $\lambda(t) \geq 0$ and $m(t) \equiv E[N(t)] = \int_{0}^{t}\lambda(x)dx$. Let the arrival times $\{T_{j}\}_{j=1}^{N(t)}$ be independent, and let the service times $\{W_{j}\}_{j=1}^{N(t)} g_{W}(w)$ be i.i.d. and mutually independent of all arrival times. Assume the random job stresses $\mathcal{H}_{j} \sim \mathcal{H}$. Then

$$S_{Y}(t) = \bar{F}_{0}(t)\exp\left(- E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]\right)$$

where $\bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(s)ds\right)$.

The compound failure rate function (or hazard function) $r(t)$ is given by

$$r(t) = -\frac{d}{dt}\ln(S_{Y}(t)) = r_{0}(t) + E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]$$

See [14] or [21].

## Efficiency Measure of a Server under RSBR

Upon server crash, the server must be rebooted. This section gives the server efficiency as defined in [9], but for the RSBR model. The server efficiency given in Theorem 1 holds quite a few similarities to that of [9], except that, like Theorem 1, the expectation must be taken over the random stress variable. The following assumptions from the original model are retained:

1. (E1) The arrival process after rebooting, $\{N^{rb}(t), t \geq 0\}$, remains a nonhomogenous Poisson process with the same intensity function $\lambda(t), t \geq 0$ as before.
2. (E2) $\{N^{rb}(t), t \geq 0\}$ is independent of the arrival process of client requests before rebooting. Hence, $\{N^{rb}(t), t \geq 0\} = \{N(t), t \geq 0\}$, since it retains all the same characteristics as before.
3. (E3) The time to reboot the server follows a continuous distribution $H(t)$ with mean $\nu$.

Recall that $M(t)$ is defined as the total number of jobs completed by the server during the time $(0,t]$. Also, recall the definition of server efficiency from [9]:

$$\psi \equiv \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}$$

The efficiency of the server under a random stress environment is given in the following theorem.

Theorem (Server Efficiency under Random Stress Environment)
Suppose that $\{N(t): t \geq 0\}$ is a nonhomogenous Poisson process with intensity function $\lambda(t), t \geq 0$. Suppose also the conditions of Theorem 1 and the conditions (E1)-(E3) are met. Then the efficiency of the server is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\\&\kern{2em}\times\left\{\int_{0}^{\infty}e^{-\int_{0}^{t}r_{0}(x)-\int_{0}^{t}\lambda(x)dx+E_{\mathcal{H}}[a(t)+b(t)]}(r_{0}(t)E_{\mathcal{H}}[a(t)] + E_{\mathcal{H}}[\mathcal{H} a(t)b(t)])dt \right\}\end{aligned}

where

$$a(t) = \int_{0}^{t}e^{-\mathcal{H} v}g_{W}(v)m(t-v)dv$$

and

$$b(t) = \int_{0}^{t}e^{-\mathcal{H}(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr.$$

## Remarks and Implications

Note that $\mathcal{H}$ was assumed discrete, but the proofs of Theorems 1 and 2 are unaffected if $\mathcal{H}$ is continuous. Thus the generality of this model is significantly stronger than in [9]. In practical considerations, these integrals will need to be numerically evaluated.

For certain distributions of $\mathcal{H}$, $S_{Y}(t)$ has a fairly compact form. Subsequent installments will examine the case where $\mathcal{H}$ has a binomial distribution, formed from both independent and dependent Bernoulli trials. We will also explore the effects of various service life distributions on $\psi$.

## Conclusion

This paper has given a simple but large generalization of [9]. The random stress model allows for the reality that we likely will not know the stress a request will bring to the server, but may know its distribution. Subsequent installments will further study the properties of the RSBR model and extend it to other single-server models, and networks.

Vertical Dependency in Sequences of Categorical Random Variables

## Abstract

This paper develops a more general theory of sequences of dependent categorical random variables, extending the works of Korzeniowski (2013) and Traylor (2017) that studied first-kind dependency in sequences of Bernoulli and categorical random variables, respectively. A more natural form of dependency, sequential dependency, is defined and shown to retain the property of identically distributed but dependent elements in the sequence. The cross-covariance of sequentially dependent categorical random variables is proven to decrease exponentially in the dependency coefficient $\delta$ as the distance between the variables in the sequence increases. We then generalize the notion of vertical dependency to describe the relationship between a categorical random variable in a sequence and its predecessors, and define a class of generating functions for such dependency structures. The main result of the paper is that any sequence of dependent categorical random variables generated from a function in the class $\mathscr{C}_{\delta}$ that is dependency continuous yields identically distributed but dependent random variables. Finally, a graphical interpretation is given and several examples from the generalized vertical dependency class are illustrated.

## Introduction

Many statistical tools and distributions rely on the independence of a sequence or set of random variables. The sum of independent Bernoulli random variables yields a binomial random variable. More generally, the sum of independent categorical random variables yields a multinomial random variable. Independent Bernoulli trials also form the basis of the geometric and negative binomial distributions, though the focus is on the number of failures before the first (or rth success). [2] In data science, linear regression relies on independent and identically distributed (i.i.d.) error terms, just to name a few examples.

The necessity of independence filters throughout statistics and data science, although real data rarely is actually independent. Transformations of data to reduce multicollinearity (such as principal component analysis) are commonly used before applying predictive models that require assumptions of independence. This paper aims to continue building a formal foundation of dependency among sequences of random variables in order to extend these into generalized distributions that do not rely on mutual independence in order to better model the complex nature of real data. We build on the works of Korzeniowski [1] and Traylor [3] who both studied first-kind (FK) dependence for Bernoulli and categorical random variables, respectively, in order to define a general class of functions that generate dependent sequences of categorical random variables.

The next section gives a brief review of the original work by Korzeniowski [1] and Traylor [3]. In Section 2, a new dependency structure, sequential dependency is introduced, and the cross-covariance matrix for two sequentially dependent categorical random variables is derived. Sequentially dependent categorical random variables are identically distributed but dependent. Section 3.1 generalized the notion of vertical dependency structures into a class that encapsulates both the first-kind (FK) dependence of Korzeniowski [1] and Traylor [3] and shows that all such sequences of dependent random variables are identically distributed. We also provide a graphical interpretation and illustrations of several examples of vertical dependency structures.

We repeat a section from A Generalized Multinomial Distribution from Dependent Categorical Random Variables  in order to give a review of the original first-kind (FK) dependency created by Korzeniowski. For the full details, see the previous work on the topic.

Korzeniowski defined the notion of dependence in a way we will refer to here as dependence of the first kind (FK dependence). Suppose $(\epsilon_{1},...,\epsilon_{N})$ is a sequence of Bernoulli random variables, and $P(\epsilon_{1} = 1) = p$. Then, for $\epsilon_{i}, i \geq 2$, we weight the probability of each binary outcome toward the outcome of $\epsilon_{1}$, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let $0 \leq \delta \leq 1$, and $q = 1-p$. Then define the following quantities
\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q &\qquad p^{-} :=P(\epsilon_{i} = 0 | \epsilon_{1} = 1) = q -\delta q\\q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = p-\delta p&\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q + \delta p\end{aligned}

Given the outcome $i$ of $\epsilon_{1}$, the probability of outcome $i$ occurring in the subsequent Bernoulli variables $\epsilon_{2}, \epsilon_{3},..., \epsilon_{n}$ is $p^{+}, i = 1$ or $q^{+}, i=0$. The probability of the opposite outcome is then decreased to $q^{-}$ and $p^{-}$, respectively.

The figure above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, $P(\epsilon_{i} = 1) = p \quad\forall i$. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be
$$\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr}\delta, & i=1 \\\delta^{2}, &i \neq j, \quad i,j \geq 2\end{array}\right.$$

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution.

In the previous work, the concept of Bernoulli FK dependence was extended to categorical random variables. That is, given a sequence of categorical random variables with $K$ categories, $P(\epsilon_{1} = i) = p_{i}$, $i = 1,\ldots,K,$ \begin{aligned}P(\epsilon_{j} = i | \epsilon_{1} = i) &= p_{i}^{+} = p_{i} + \delta(1-p_{i});\\P(\epsilon_{j} = k | \epsilon_{1} = i) &= p_{k}^{-} = p_{k}-\delta p_{k}, i \neq k, \:\:k = 1,\ldots,K.\end{aligned}

Traylor proved that FK dependent categorical random variables remained identically distributed, and showed that the cross-covariance matrix of categorical random variables has the same structure as the correlation between FK dependent Bernoulli random variables. In addition, the concept of a generalized binomial distribution was extended to a generalized multinomial distribution.

In the next section, we will explore a different type of dependency structure, sequential dependency.

## Sequentially Dependent Random Variables

While FK dependence yielded some interesting results, a more realistic type of dependence is sequential dependence, where the outcome of a categorical random variable depends with coefficient δ on the outcome of the variable immediately preceeding it in the sequence. Put formally, if we let $\mathcal{F}_{n} = \{\epsilon_{1},\ldots,\epsilon_{n-1}\}$, then $P(\epsilon_{n}|\epsilon_{1},\ldots,\epsilon_{n-1}) = P(\epsilon_{n}|\epsilon_{n-1}) \neq P(\epsilon_{n})$. That is, $\epsilon_{n}$ only has direct dependence on the previous variable $\epsilon_{n-1}$. We keep the same weighting as for FK-dependence. That is,

$$\begin{array}{lr}P(\epsilon_{n} = j | \epsilon_{n-1} = j) = p_{j}^{+} = p_{j} + \delta(1-p_{j}),&\\ P(\epsilon_{n} = j | \epsilon_{n-1} = i) = p_{j}^{-} = p_{j}-\delta p_{j}, &\:\: j = 1,\ldots,K;\:\:i \neq j\end{array}$$ As a comparison, for FK dependence, $P(\epsilon_{n}|\epsilon_{1},\ldots,\epsilon_{n-1}) = P(\epsilon_{n}|\epsilon_{1}) \neq P(\epsilon_{n})$. That is, $\epsilon_{n}$ only has direct dependence on $\epsilon_{1}$, and

$$\begin{array}{lr}P(\epsilon_{n} = j | \epsilon_{1} = j) = p_{j}^{+} = p_{j} + \delta(1-p_{j}),&\\ P(\epsilon_{n} = j | \epsilon_{1} = i) = p_{j}^{-} = p_{j}-\delta p_{j},&\:\:j = 1,\ldots,K;\:\: i \neq j\end{array}$$ Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{n})$ be a sequence of categorical random variables of length n (either independent or dependent) where the number of categories for all $\epsilon_{i}$ is K. Denote $\Omega_{n}^{K}$ as the sample space of this random sequence. For example,

$$\Omega_{3}^{3} = \{(1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2),\ldots,(3,3,1),(3,3,2),(3,3,3)\}$$

Dependency structures like FK-dependence and sequential dependence change the probability of a sequence $\epsilon$ of length n taking a particular $\omega = (\omega_{1},\ldots,\omega_{n}) \in \Omega_{n}^{K}$. The probability of a particular $\omega \in \Omega_{n}^{K}$ is given by the dependency structure. For example, if the variables are independent, $P((1,2,1)) = p_{1}^{2}p_{2}$. Under FK-dependence, $P((1,2,1)) = p_{1}p_{2}^{-}p_{1}^{+}$, and under sequential dependence, $P((1,2,1)) = p_{1}p_{2}^{-}p_{1}^{-}$. See Figures 2 and 3 for a comparison of the probability mass flows of sequential dependence and FK dependence. Sequentially dependent sequences of categorical random variables remain identically distributed but dependent, just like FK-dependent sequences. That is,

Lemma 1

Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{n})$ be a sequentially dependent categorical sequence of length n with K categories. Then

$$P(\epsilon_{j} = i) = p_{i}; \qquad i = 1,\ldots,K;\:\:j = 1,\ldots,n;\:\: n \in \mathbb{N}$$

## Cross-Covariance Matrix

The $K \times K$ cross-covariance matrix $\Lambda^{m,n}$ of $\epsilon_{m}$ and $\epsilon_{n}$ in a sequentially dependent categorical sequence has entries $\Lambda_{i,j}^{m,n}$ that give the cross-covariance $\text{Cov}([\epsilon_{m} = i]$, $[\epsilon_{n} = j])$, where $[\cdot]$ denotes an Iverson bracket. In the FK-dependent case, the entries of $\Lambda^{m,n}$ are given by

$$\Lambda^{1,n}_{ij}=\left\{\begin{array}{lc}\delta p_{i}(1-p_{i}), & i = j \\-\delta p_{i}p_{j}, & i \neq j,\end{array}\right.\quad\,n \geq 2,$$

and

$$\Lambda^{m,n}_{ij}=\left\{\begin{array}{lc}\delta^{2}p_{i}(1-p_{i}), & i = j \\-\delta^{2}p_{i}p_{j}, & i \neq j,\end{array}\right. \quad\,n > m; m\neq 1$$

Thus, the cross covariance between any two $\epsilon_{m}$ and $\epsilon_{n}$ in the FK-dependent case is never smaller than δ 2 times the independent cross-covariance. In the sequentially dependent case, the cross-covariances of $\epsilon_{m}$ and $\epsilon_{n}$ decrease in powers of δ as the distance between the two variables in the sequence increases.

Theorem 1 (Cross-Covariance of Dependent Categorical Random Variables).

Denote $\Lambda^{m,n}$ as the $K \times K$ cross-covariance matrix of $\epsilon_{m}$ and $\epsilon_{n}$ in a sequentially dependent sequence of categorical random variables of length N, $m \leq n$, and $n \leq N$, defined as $\Lambda^{m,n} = E[(\epsilon_{m} - E[\epsilon_{m}])(\epsilon_{n} - E[\epsilon_{n}])]$. Then the entries of the matrix are given by

$$\Lambda^{m,n}_{ij} = \left\{\begin{array}{lc}\delta^{n-m}p_{i}(1-p_{i}), & i = j \\-\delta^{n-m} p_{i}p_{j}, & i \neq j.\end{array}\right.$$

The pairwise covariance between two Bernoulli variables in a sequentially dependent sequence is given in the following corollary

Corollary 1

Denote $P(\epsilon_{i} = 1) = p$; $i = 1,\ldots,n$, and let $q=1-p$. Under sequential dependence,
$$\text{Cov}(\epsilon_{m}, \epsilon_{n}) = pq\delta^{n-m}.$$

We give some examples to illustrate.

Example 1 (Bernoulli Random Variables).

If we want to find the covariance between $\epsilon_{2}$ and $\epsilon_{3}$, then we note that the set $S = \{\omega \in \Omega_{3}^{2}\,:\,\omega_{2} = 1, \omega_{3} = 1\}$ is given by $S = \{(1,1,1),(0,1,1)\}$. $P(\epsilon_{2} = 1, \epsilon_{3} = 1) = P(S)$. Thus,

$$\begin{array}{rl}\text{Cov}(\epsilon_{2},\epsilon_{3})&=P(\epsilon_{2} = 1, \epsilon_{3} = 1)-P(\epsilon_{2} = 1)P(\epsilon_{3} = 1)\\&=pp^{+}p^{+} + qp^{-}p^{+}-p^{2}\\&= p^{+}(pp^{+} + qp^{-})-p^{2}\\&= p(p^{+}-p)\\&= pq\delta\end{array}$$

Similarly,
$$\begin{array}{rl}\text{Cov}(\epsilon_{1}, \epsilon_{3}) &= P(\epsilon_{1} = 1, \epsilon_{3} = 1)-P(\epsilon_{1} = 1)P(\epsilon_{3} = 1)\\&= (pp^{+}p^{+} + pq^{-}p^{-})-p^{2} \\&= p((p+\delta q)^{2} + pq(1-\delta)^{2})-p^{2}\\&= p(p^{2} + pq+\delta^{2}q^{2} + \delta^{2}pq)-p^{2}\\&= p(p(p+q) + \delta^{2}q(q+p)- p^{2} \\&= p(p + \delta^{2}q)-p^{2}\\&= pq\delta^{2}\end{array}$$

Example 2 (Categorical Random Variables).

Suppose we have a sequence of categorical random variables, where K = 3. Then
$[\epsilon_{m} = i]$ is the Bernoulli random variable with the binary outcome of 1 if $\epsilon_{m} = i$ and 0 if not. Thus,

$$\text{Cov}([\epsilon_{m} = i], [\epsilon_{n} = j]) = P(\epsilon_{m} = i \vee \epsilon_{n}= j)-P(\epsilon_{m} = i)P(\epsilon_{n} = j).$$

We have shown that every $\epsilon_{n}$ in the sequence is identically distributed, so $P(\epsilon_{m} = i) = p_{i}$ and $P(\epsilon_{n} = j) = p_{j}$.

$$\begin{array}{rl}\text{Cov}([\epsilon_{2} = 1], [\epsilon_{3} = 1]) &=(p_{1}p_{1}^{+}p_{1}^{+} + p_{2}p_{1}^{-}p_{1}^{+} + p_{3}p_{1}^{-}p{1}^{+})-p_{1}^{2} \\&= p_{1}^{+}(p_{1}p_{1}^{+} + p_{2}p_{1}^{-} + p_{3}p_{1}^{-})-p_{1}^{2} \\&= p_{1}^{+}p_{1}-p_{1}^{2} \\&=\delta p_{1}(1-p_{1})\end{array}$$ $$\begin{array}{rl}\text{Cov}([\epsilon_{2} = 1], [\epsilon_{3} = 2]) &=(p_{1}p_{1}^{+}p_{2}^{-} + p_{2}p_{1}^{-}p_{2}^{-}+p_{3}p_{1}^{-}p_{2}^{-})-p_{1}p_{2} \\&= p_{2}^{-}(p_{1}p_{1}^{+} + p_{2}p_{1}^{-} + p_{3}p_{1}^{-})-p_{1}p_{2}\\&= p_{1}p_{2}^{-}-p_{1}p_{2} \\&=-\delta p_{1}p_{2}\end{array}$$

We may obtain the other entries of the matrix in a similar fashion. So, the cross-covariance matrix for a $\epsilon_{2}$ and $\epsilon_{3}$ with K = 3 categories is given by

$$\Lambda^{2,3}=\delta\begin{pmatrix}p_{1}(1-p_{1})&-p_{1}p_{2}&-p_{1}p_{3}\\-p_{1}p_{2} & p_{2}(1-p_{2}) &-p_{2}p_{3}\\-p_{1}p_{3} &-p_{2}p_{3} & p_{3}(1-p_{3})\end{pmatrix}$$

Note that if $\epsilon_{2}$ and $\epsilon_{3}$ are independent, then the cross-covariance matrix is all zeros, because δ = 0.

## Dependency Generators

Both FK-dependency and sequential dependency structures are particular examples of a class of vertical dependency structures. We denote the dependency of subsequent categorical random variables on a previous variable in the sequence as a vertical dependency. In this section, we define a class of functions that generate vertical dependency structures with the property that all the variables in the sequence are identically distributed but dependent.

### Vertical Dependency Generators Produce Identically Distributed Sequences

Define the set of functions

$$\mathscr{C}_{\delta} = \{\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N} : \alpha(n) \geq 1 \wedge \alpha(n) < n\:\: \forall\:n\}.$$

The mapping defines a direct dependency of n on $\alpha(n)$, denoted $n \stackrel{\delta}{\rightsquigarrow} \alpha(n)$. We have already defined the notion of direct dependency, but we formalize it here.

Definition. (Direct Dependency)

Let $\epsilon_{m}$, $\epsilon_{n}$ be two categorical random variables in a sequence, where $m < n$. We say $\epsilon_{n}$ has a direct dependency on $\epsilon_{m}$, denoted $\epsilon_{n} \rightsquigarrow_{\delta} \epsilon_{m}$, if $P(\epsilon_{n}|\epsilon_{n-1},\ldots,\epsilon_{1}) = P(\epsilon_{n}|\epsilon_{m})$.

Example 3.
For FK-dependence, $\alpha(n) \equiv 1$. That is, for any $n, \epsilon_{n} \rightsquigarrow_{\delta} \epsilon_{1}$

Example 4.
The function $\alpha(n) = n-1$ generates the sequential dependency structure.

We now define the notion of dependency continuity.

Definition. (Dependency Continuity)
A function $\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N}$ is dependency continuous if $\forall n \exists j \in \{1,...,n-1\}$ such that $\alpha(n) = j$

We require that the functions in $\mathscr{C}_{\delta}$ be dependency continuous. Thus, the formal definition of the class of dependency generators $\mathscr{C}_{\delta}$ is

Definition. (Dependency Generator)
We define the set of functions $\mathscr{C}_{\delta} = \{\alpha : \mathbb{N}_{\geq 2} \to \mathbb{N}\}$ such that

•  $\alpha(n) < n$
• $\forall n \exists j \in \{1,...,n-1\} : \alpha(n) = j$

and refer to this class as dependency generators.

Each function in $\mathscr{C}_{\delta}$ generates a unique dependency structure for sequences of dependent categorical random variables where the individual elements of the sequence remain identically distributed.

Theorem 2. (Identical Distribution of Sequences Generated by Dependency Generators)
Let $\alpha \in \mathscr{C}_{\delta}$. Then for any $n\in \mathbb{N}, n \geq 2$, the dependent categorical random sequence generated by $\alpha$ has identically distributed elements.

### Graphical Interpretation and Illustrations

We may visualize the dependency structures generated by $\alpha \in \mathscr{C}_{\delta}$ via directed dependency graphs. Each $\epsilon_{n}$ represents a vertex in the graph, and a directed edge connects $n$ to $\alpha(n)$ to represent the direct dependency generated by $\alpha$. This section illustrates some examples and gives a graphical interpretation of the result in Theorem 2.

First-Kind Dependency

For FK-dependency, $\alpha(n) \equiv 1$ generates the graph above. Each $\epsilon_{i}$ depends directly on $\epsilon_{1}$, and thus we see no connections between any other vertices $i, j$ where $j \neq 1$. There are $n-1$ separate subsequences of length 2 in this graph.

Sequential Dependency

$\alpha(n) = n-1$ generates the sequential dependency structure this work has studied in detail. We can see that a path exists from any $n$ back to 1. This  is a visual way to see the result of Theorem2,  in that if a path exists from any $n$ back to 1, then the variables in that path must be identically distributed. Here, there is only one sequence and no subsequences.

A Monotonic Example

Here, $\alpha(n) = \lfloor \sqrt{n}\rfloor$ gives an example of a monotonic function in $\mathscr{C}_{\delta}$ and the dependency structure it generates. Again, note that any $n$ has a path to 1, and the number of subsequences is between 1 and $n-1$.

A Nonmonotonic Example

Here, $\alpha(n)=\left\lfloor\frac{\sqrt{n}}{2}\left(\sin(n)\right)+\frac{n}{2}\right\rfloor$ illustrates a more contrived example where the function is nonmonotonic. It is neither increasing nor decreasing. The previous examples have all been nondecreasing.

A Prime Example

Let $\alpha \in \mathscr{C}_{\delta}$ be defined in the following way. Let $p_{m}$ be the $m$th prime number, and let $\{kp_{m}\}$ be the set of all postive integer multiples of $p_{m}$. Then the set $\mathscr{P}_{m} = \{kp_{m}\} \setminus \cup_{i=1}^{m-1}\{kp_{i}\}$ gives a disjoint partitioning of $\mathbb{N}$. That is, $\mathbb{N} = \sqcup_{m}\mathscr{P}_{m}$, and thus for any $n \in \mathbb{N}, n \in \mathscr{P}_{m}$ for exactly one $m$. Now let $\alpha(n) = m$. Thus, the function is well-defined. We may write $\alpha: \mathbb{N}_{\geq 2} \to \mathbb{N}$ as $\alpha[\mathscr{P}_{m}] = \{m\}$. As an illustration,

\begin{aligned}\alpha[\{2k\}] &= \{1\} \\\alpha[\{3k\}\setminus \{2k\}] &= \{2\} \\\alpha[\{5k\}\setminus (\{2k\}\cup \{3k\})] &= \{3\} \\&\qquad\vdots\\\alpha[\{kp_{m}\}\setminus (\cup_{i=1}^{m-1}\{kp_{i}\})] &= \{m\}\end{aligned}

## Conclusion

This paper has extended the concept of dependent random sequences first put forth in the works of Korzeniowski [1] and Traylor [3] and developed a generalized class of vertical dependency structures. The sequential dependency structure was studied extensively, and a formula for the cross-covariance obtained. The class of dependency generators was defined and shown to always produce a unique dependency structure for any $\alpha \in \mathscr{C}_{\delta}$ in which the a sequence of categorical random variables under that $\alpha$ is identically distributed but dependent. We provided a graphical interpretation of this class, and illustrated with several key examples.

## References

1. Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science, pages 43–58, 2013.
2. Joseph McKean, Robert Hogg, and Allen Craig. Introduction to Mathematical Statistics. Prentice Hall, 6 edition.
3. Rachel Traylor. A generalized multinomial distribution from dependent categorical random variables. Academic Advances of the CTO, 2017.
A Generalized Multinomial Distribution from Dependent Categorical Random Variables

## Abstract

Categorical random variables are a common staple in machine learning methods and other applications across disciplines. Many times, correlation within categorical predictors exists, and has been noted to have an effect on various algorithm effectiveness, such as feature ranking and random forests. We present a mathematical construction of a sequence of identically distributed but dependent categorical random variables, and give a generalized multinomial distribution to model the probability of counts of such variables.

## Introduction

Bernoulli random variables are invaluable in statistical analysis of phenomena having binary outcomes, however, many other variables cannot be modeled by only two categories. Many topics in statistics and machine learning rely on categorical random variables, such as random forests and various clustering algorithms [6,7]. Many datasets exhibit correlation or dependency among predictors as well as within predictors, which can impact the model used. [6,9]. This can result in unreliable feature ranking [9], and inaccurate random forests [6].

Some attempts to remedy these effects involve Bayesian modeling [2] and various computational and simulation methods [8]. In particular, simulation of correlated categorical variables has been discussed in the literature for some time [1, 3, 5]. Little research has been done to create mathematical framework of correlated or dependent categorical variables and the resulting distributions of functions of such variables.

Korzeniowski [4] studied dependent Bernoulli variables, formalizing the notion of identically distributed but dependent Bernoulli variables and deriving the distribution of the sum of such dependent variables, yielding a Generalized Binomial Distribution.

In this paper, we generalize the work of Korzeniowski [4] and formalize the notion of a sequence of identically distributed but dependent categorical random variables. We then derive a Generalized Multinomial Distribution for such variables and provide some properties of said distribution. We also give an algorithm to generate a sequence of correlated categorical random variables.

## Background

Korzeniowski defined the notion of dependence in a way we will refer to here as dependence of the first kind (FK dependence). Suppose $(\epsilon_{1},...,\epsilon_{N})$ is a sequence of Bernoulli random variables, and $P(\epsilon_{1} = 1) = p$. Then, for $\epsilon_{i}, i \geq 2$, we weight the probability of each binary outcome toward the outcome of $\epsilon_{1}$, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let $0 \leq \delta \leq 1$, and $q = 1-p$. Then define the following quantities
\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q, &\qquad p^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = p-\delta p \\ q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = q+\delta p , &\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q -\delta q\end{aligned}

Given the outcome $i$ of $\epsilon_{1}$, the probability of outcome $i$ occurring in the subsequent Bernoulli variables $\epsilon_{2}, \epsilon_{3},...,\epsilon_{n}$ is $p^{+}, i = 1$ or $q^{+}, i=0$. The probability of the opposite outcome is then decreased to $q^{-}$ and $p^{-}$, respectively.

The above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, $P(\epsilon_{i} = 1) = p$ for all $i$. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be
$$\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr} \delta, & i=1 \\ \delta^{2}, & i \neq j;\,i,j \geq 2\end{array}\right.$$

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution. In our generalization, we use the same form of FK dependence, but for categorical random variables. We will construct a sequence of identically distributed but dependent categorical variables from which we will build a generalized multinomial distribution. When the number of categories $K = 2$, the distribution reverts back to the generalized binomial distribution of Korzeniowski [4]. When the sequence is fully independent, the distribution reverts back to the independent categorical model and the standard multinomial distribution, and when the sequence is independent and $K=2$, we recover the standard binomial distribution. Thus, this new distribution represents a much larger generalization than prior models.

## Construction of Dependent Categorical Variables

Suppose each categorical variable has K possible categories, so the sample space $S = \{1,\ldots,K\}$ (These integers should not be taken as ordered or sequential, but rather as character titles of categories.) The construction of the correlated categorical variables is based on a probability mass distribution over K-adic partitions of [0,1]. We will follow graph terminology in our construction, as this lends a visual representation of the construction. We begin with a parent node and build a K-nary tree, where the end nodes are labeled by the repeating sequence (1,…, K). Thus, after N steps, the graph has $K^{N}$ nodes, with each node labeled 1,…,K repeating in the natural order and assigned injectively to the intervals
$$\left(0,\frac{1}{K^{N}}\right],\left(\frac{1}{K^{n}},\frac{2}{K^{N}}\right],\ldots,\left(\frac{K^{N}-1}{K^{n}},1\right].$$

Define

\begin{aligned}&\epsilon_{N} = i \qquad \text{ on } \left(\tfrac{Kj}{K^{N}}, \tfrac{Kj+i}{K^{N}}\right] \\&\epsilon_{N} = K \qquad \text{on } \left(\tfrac{K(j+1)-1}{K^{N}}, \tfrac{K(j+1)}{K^{N}}\right] \end{aligned}

where $i = 1,...,K-1$, and $j = 0,1,\ldots,K^{N-1}-1$. An alternate expression for the above equation is

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1, \ldots ,K-1\\ \epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}

To each of the branches of the tree, and by transitivity to each of the $K^{N}$ partitions of [0,1], we assign a probability mass to each node such that the total mass is 1 at each level of the tree in a similar manner to [4].

Let $0 < p_{i} < 1, i=1,\ldots,K$ such that $\sum_{i=1}^{K}p_{i}=1$, and let $0 \leq \delta \leq 1$ be the dependency coefficient. Define

$p_{i}^{+} := p_{i} + \delta\sum_{l \neq i}p_{l}, \quad p_{i}^{-} := p_{i}-\delta p_{i}$

for= 1,…,K. These probabilities satisfy two important criteria detailed in the following lemmata:

Lemma 1.

\begin{aligned}\sum_{i=1}^{K}p_{i} = p_{i}^{+} + \sum_{l \neq i}p_{l}^{-} = 1\\ p_{i}p_{i}^{+} + p_{i}^{-}\sum_{l \neq i}p_{l} = p_{i}\end{aligned}

We now give the construction in steps down each level of the K-nary tree.

##### LEVEL 1:

Parent node has mass 1, with mass split $1\cdot\prod_{i=1}^{K}p_{i}^{[\epsilon_{1} = i]}$, where $[\cdot]$ is an Iverson bracket. This level corresponds to a sequence $\epsilon$ of dependent categorical variables of length 1.

$$\begin{array}{cccc}\epsilon_{1}(\text{\textbf{Branch}}) & \textbf{Path} & \textbf{Mass at Node} & \textbf{Interval}\\1 &\textit{parent}\to 1 & p_{1}& (0,1/K] \\2 &\textit{parent} \to 2 & p_{2}& (1/K,2/K]\\ \vdots & \vdots & \vdots &\vdots \\i &\textit{parent} \to i & p_{i}& ((i-1)/K,i/K]\\K &\textit{parent} \to K & p_{K}& ((K-1)/K,1]\end{array}$$
##### LEVEL 2:

Level 2 has K nodes, with K branches stemming from each node. This corresponds to a sequence of length 2: $\epsilon = (\epsilon_{1}, \epsilon_{2})$. Denote i.1 as node i from level 1. For i = 1,…,K,

Node i.1 has mass $p_{i}$, with mass split $p_{i}\left(p_{i}^{+}\right)^{[\epsilon_{2}=i]}\prod\limits_{j=1,j\neq i}^{K}\left(p_{j}^{-}\right)^{[\epsilon_{2} = j]}$.

$$\begin{array}{cccc}\epsilon_{2}(\textbf{Branch})& \textbf{Path} & \textbf{Mass at Node} & \textbf{Interval}\\1 &i.1 \to 1 & p_{i}p_{1}^{-} & \left(\frac{(i-1)K}{K^{2}},\frac{(i-1)K + 1}{K^{2}}\right]\\2 &i.1 \to 2 & p_{i}p_{2}^{-} & \left(\frac{(i-1)K + 1}{K^{2}},\frac{(i-1)K+ 2}{K^{2}}\right]\\\vdots & \vdots & \vdots & \vdots\\i &i.1 \to i & p_{i}p_{i}^{+}& \left(\frac{(i-1)K + (i-1)}{K^{2}},\frac{(i-1)K+ i}{K^{2}}\right]\\\vdots & \vdots & \vdots & \vdots \\K &i.1 \to K & p_{i}p_{K}^{-}& \left(\frac{iK-1}{K^{2}},\frac{iK}{K^{2}}\right]\end{array}$$

In this fashion, we distribute the total mass 1 across level 2. In general, at any level r, there are K streams of probability flow at Level r. For $\epsilon_{1} = i$, the probability flow is given by
$$p_{i}\prod_{j = 2}^{r}\left[\left(p_{i}^{+}\right)^{[\epsilon_{j} = i]}\prod_{l\neq i }\left(p_{l}^{-}\right)^{[\epsilon_{j} = l]}\right],\qquad i = 1,\ldots,K\qquad\qquad(*)$$

We use this flow to assign mass to the Kr intervals of [0,1] at level r in the same way as above. We may also verify via algebraic manipulation that

$$p_{i} = p_{i}\left(p_{i}^{+} + \sum_{l\neq i}p_{i}^{-}\right)^{r-1} =p_{i}\sum_{\epsilon_{2},\ldots,\epsilon_{r}}\prod_{j=2}^{r}\left[\left(p_{i}^{+}\right)^{[\epsilon_{j} = i]}\prod_{l \neq i}\left(p_{l}^{-}\right)^{[\epsilon_{j} = l]}\right]$$

where the first equality is due to the first statement of Lemma 1, and the second is due to the probability flow at level r, given in $(*)$ above.

### Example Construction

For an illustration, refer to Figure 2 above.  In this example, we construct a sequence of length N = 3 of categorical variables with K = 3 categories. At each level r, there are 3r nodes corresponding to 3r partitions of the interval [0,1]. Note that each time the node splits into 3 children, the sum of the split probabilities is 1. Despite the outcome of the previous random variable, the next one always has three possibilities. The sample space of categorical variable sequences of length 3 has 33 = 27 possibilities.

### Properties

#### Identically Distributed but Dependent

We now show the most important property of this class of sequences– that they remain identically distributed despite losing independence.

#### Lemma 2:

$P(\epsilon_{r} = i) = p_{i}\qquad i = 1,\ldots,K; \:\: r \in \mathbb{N}$

#### Pairwise Cross-Covariance Matrix

We now give the pairwise cross-covariance matrix for dependent categorical random variables.

Theorem 1: (Cross-Covariance of Dependent Categorical Random Variables).  Denote by $\Lambda^{\iota,\tau}$ the K × K cross-covariance matrix of $\epsilon_{\iota}$ and $\epsilon_{\tau}$, $\iota,\tau = 1,\ldots,n$, defined as $\Lambda^{\iota,\tau} = E[(\epsilon_{\iota} - E[\epsilon_{\iota}])(\epsilon_{\tau} - E[\epsilon_{\tau}])]$. Then the entries of the matrix are given by

$\Lambda^{1,\tau}_{ij} = \left\{\begin{array}{cc}\delta p_{i}(1-p_{i}), & i = j \\-\delta p_{i}p_{j}, & i \neq j\end{array}\right.\qquad \tau \geq 2,$
and

$\Lambda^{\iota,\tau}_{ij} = \left\{\begin{array}{cc}\delta^{2}p_{i}(1-p_{i}), & i = j \\-\delta^{2}p_{i}p_{j}, & i \neq j\end{array}\right. \qquad\tau > \iota,\:\iota\neq 1.$

In the next section, we exploit the desirable identical distribution of the categorical sequence in order to provide a generalized multinomial distribution for the counts in each category.

## Generalized Multinomial Distribution

From the construction in Section 2, we derive a generalized multinomial distribution in which all categorical variables are identically distributed but no longer independent.

Theorem 2: Generalized Multinomial Distribution.  Let $\epsilon_{1},\ldots,\epsilon_{N}$ be categorical random variables with categories $1,\ldots,K$, constructed as in Section 2. Let $X_{i} = \sum_{j=1}^{N}[\epsilon_{j} = i], i = 1,\ldots,K$, where $[\cdot]$ is the Iverson bracket. Denote $\mathbf{X} = (X_{1},\ldots,X_{K})$, with observed values $\mathbf{x} = (x_{1},\ldots,x_{K})$. Then

$$P(\mathbf{X} = \mathbf{x}) = \sum_{i=1}^{K}p_{i}\frac{(N-1)!}{(x_{i}-1)!\prod_{j \neq i}x_{j}!}\left(p_{i}^{+}\right)^{x_{i}-1}\prod_{j \neq i}\left(p_{j}^{-}\right)^{x_{j}}.$$

### Properties

This section details some useful properties of the Generalized Multinomial Distribution and the dependent categorical random variables.

#### Marginal Distributions

Theorem 3: Univariate Marginal Distribution.  The univariate marginal distribution of the Generalized Multinomial Distribution is the Generalized Binomial Distribution. That is,

$P(X_{i} = x_{i}) = q{N-1 \choose x_{i}}\left(p_{i}^{-}\right)^{x_{i}}\left(q^{-}\right)^{N-1-x_{i}} + p_{i}{N-1 \choose x_{i}-1}\left(p_{i}^{+}\right)^{x_{i}-1}\left(q^{+}\right)^{N-1-(x_{i}-1)},$
where $q = \sum_{j \neq i}p_{j}$, $q^{+} = q + \delta p_{i}$, and $q^{-} = q - \delta q$.
The above theorem shows another way the generalized multinomial distribution is an extension of the generalized binomial distribution.

Theorem 4: Moment Generating Function.  The moment generating function of the generalized multinomial distribution with K categories is given by
$$M_{\mathbf{X}}(\mathbf{t}) = \sum_{i=1}^{K}p_{i}e^{t_{i}}\left(p_{i}^{+}e^{t_{i}} + \sum_{j \neq i}p_{j}^{-}e^{t_{j}}\right)^{n-1}$$

where $\mathbf{X} = (X_{1},...,X_{K})$ and $\mathbf{t} = (t_{1},...,t_{K})$.

#### Moments of the Generalized Multinomial Distribution

Using the moment generating function in the standard way, the mean vector $\mu$ and the covariance matrix $\Sigma$ may be derived.

Expected Value. The expected value of $\mathbf{X}$ is given by $\mu = n\mathbf{p}$ where $\mathbf{p} = (p_{1},...,p_{K})$

Covariance Matrix The entries of the covariance matrix are given by
$$\Sigma_{ij} = \left\{\begin{array}{lr}p_{i}(1-p_{i})(n + \delta(n-1) + \delta^{2}(n-1)(n-2)), & i = j \\p_{i}p_{j}(\delta(1-\delta)(n-2)(n-1)-n), & i \neq j\end{array}\right.$$ Note that if $\delta = 0$, the generalized multinomial distribution reduces to the standard multinomial distribution and $\Sigma$ becomes the familiar multinomial covariance matrix. The entries of the corresponding correlation matrix are given by
$$\rho(X_{i},X_{j}) = -\sqrt{\frac{p_{i}p_{j}}{(1-p_{i})(1-p_{j})}}\left(\frac{n-\delta(n-1)(n-2)}{n+\delta(n-1)+\delta^{2}(n-1)(n-2)}\right)$$

If $\delta = 1$, the variance of $X_{i}$ tends to $\infty$ with n. This is intuitive, as $\delta = 1$ implies perfect dependence of $\epsilon_{2},...,\epsilon_{n}$ on the outcome of $\epsilon_{1}$. Thus, $X_{i}$ will either be 0 or n, and this spread increases to $\infty$ with n.

## Generating a Sequence of Correlated Categorical Random Variables

For brevity, we will take the acronym DCRV for a Dependent Categorical Random Variable. A DCRV sequence $\epsilon = (\epsilon_{1},...,\epsilon_{n})$ is in itself a random variable, and thus has a probability distribution. In order to provide an algorithm to generate such a sequence, we first derive this probability distribution.

### Probability Distribution of a DCRV Sequence

The probability distribution of the DCRV sequence $\epsilon$ of length n is given formally in the following theorem. The proof follows in a straightforward fashion from the construction of dependent categorical random variables and is therefore omitted.

Theorem 5: Distribution of a DCRV Sequence.  Let $(\Sigma,\mathcal{F}, \mathbb{P}) = ([0,1], \mathcal{B}, \mu)$. Let $\epsilon_{i} : [0,1] \to \{1,...,K\},i = 1,...,n,n \in \mathbb{N}$ be DCRVs as defined in the construction. Let $\epsilon = (\epsilon_{1},...,\epsilon_{n})$ denote the DCRV sequence with observed values $e = (e_{1},...,e_{n})$. Then $\mu$ has the density
$$f(x) = \sum_{i=1}^{K^{n}}K^{n}m^{i}\mathbb{1}_{\tiny((i-1)/K^{n}, i/K^{n}]}(x)$$
and
$$P(\epsilon = e) = \int\limits_{\left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right]}f(x)dx = m^{i}$$

where $m^{i}$ is the mass allocated to the interval $\left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right]$ by the probability mass flow and i as the lexicographic order of e in the sample space $\{1,...,K\}^{n}$ given by the relation $\frac{i}{K^{n}} = \sum_{j=1}^{K^{n}}\frac{\epsilon_{j}-1}{K^{j}}$

### Algorithm

We take a common notion of using a uniform random variable in order to generate the desired random variable $\epsilon$. For $\epsilon$ with distribution $F(x) = \int_{0}^{x}f(y)dy, f(x)$ as in the above equation, it is clear that F is invertible with inverse $F^{-1}$. Thus, $F^{-1}(U)$ for $U \sim \text{Uniform}[0,1]$ has the same distribution as $\epsilon$. Therefore, sampling u from U is equivalent to the sample $e = F^{-1}(u)$ from $\epsilon$.

In the construction of the DCRVs, we associated $\epsilon_{n}$ to the intervals

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1,\ldots,K-1 \\\epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}

From the construction, each sequence has a 1-1 correspondence with the interval $\left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right)$ for a unique $i = 1,\ldots,K^{n}$. The probability of such a sequence can be found using Theorem 5:

$$P(\epsilon = e) = F\left(\left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right)\right) = m^{i} = l\left([s_{i-1}, s_{i})\right)$$

where l is the length of the above interval, and $s_{i} = \sum_{j=1}^{i}m^{j}$. Therefore, we have now partitioned the interval [0,1) according to the distribution of $\epsilon$ bijectively to the K-nary partition of [0,1) corresponding to the particular sequence. Thus, sampling $u \in [0,1)$ from a uniform distribution and finding the interval $[s_{i-1},s_{i})$ and corresponding i will yield the unique DCRV sequence.

Algorithm Strategy: Given $u \in [0,1)$ and $n \in \mathbb{N}$, find the unique interval $[s_{i-1}, s_{i}), i= 1,\ldots,K^{n}$ that contains u by “moving down the tree” and narrowing down the “search interval” until level n is reached.

We provide an explicit example prior to the pseudocode to illustrate the strategy.

#### Example DCRV Sequence Generation

Suppose $K = 3, p_{1} = p_{2} = p_{3} = 1/3,$  and  $\delta = 1/2$, and suppose we wish to generate a DCRV sequence of length 2. The figure above gives the probability flow and the corresponding probability intervals $[s_{i-1}, s_{i})$ that partition [0, 1) according to Theorem 4. Now, suppose we sample a uniform random variable u, with observed value $u = \frac{3}{4}$. We now illustrate the process of moving down the above tree to generate the sequence.

1. The first level of the probability tree partitions the interval [0, 1) into three intervals given in the figure above. $u = \frac{3}{4}$ lies in the third interval, which corresponds to $\epsilon_{1} = 3$. Thus, the first entry of e is given by $e_{1} = 3$.
2. Next, the search interval is reduced to the interval from the first step [2/3, 1). We then generate the partitions of [2/3, 1) by cumulatively adding $p_{3}p_{i}^{-}, i = 1,2,3$ to the left endpoint 2/3. Thus, the next partition points are
• 2/3 + (1/3)(1/6) = 13/18
• 2/3 + (1/3)(1/6) + (1/3)(1/6) = 7/9, and
• 2/3 + (1/3)(1/6) + (1/3)(1/6) + (1/3)(2/3) = 1

Yielding the subintervals of [2/3, 1):

• [2/3, 13/18)
• [13/18, 7/9)
• [7/9, 1)

We now find the interval from above that contains u is the second: [13/18,  7/9). Thus, $\epsilon_{2} = 2$. Since we only sought a sequence of length 2, the algorithm is finished, and we have generated the sequence e = (3, 2). If a longer sequence is desired, we repeat step 2 until we are at level n. For the pseudocode of the algorithm, see the pdf, linked at the top of the page.

## Conclusion

Categorical variables play a large role in many statistical and practical applications across disciplines. Moreover, correlations among categorical variables are common and found in many scenarios, which can cause problems with conventional assumptions. Different approaches have been taken to mitigate these effects, because a mathematical framework to define a measure of dependency in a sequence of categorical variables was not available. This paper formalized the notion of dependent categorical variables under a first-dependence scheme and proved that such a sequence is identically distributed but now dependent. With an identically distributed but dependent sequence, a generalized multinomial distribution was derived in Section~\ref{sec: gen multinomial} and important properties of this distribution were provided. An efficient algorithm to generate a sequence of dependent categorical random variables was given.