## A Generalized Multinomial Distribution from Dependent Categorical Random Variables

###### For the full paper, which includes all proofs, download the pdf here.

## Abstract

Categorical random variables are a common staple in machine learning methods and other applications across disciplines. Many times, correlation within categorical predictors exists, and has been noted to have an effect on various algorithm effectiveness, such as feature ranking and random forests. We present a mathematical construction of a sequence of identically distributed but dependent categorical random variables, and give a generalized multinomial distribution to model the probability of counts of such variables.

## Introduction

Bernoulli random variables are invaluable in statistical analysis of phenomena having binary outcomes, however, many other variables cannot be modeled by only two categories. Many topics in statistics and machine learning rely on categorical random variables, such as random forests and various clustering algorithms [6,7]. Many datasets exhibit correlation or dependency among predictors as well as within predictors, which can impact the model used. [6,9]. This can result in unreliable feature ranking [9], and inaccurate random forests [6].

Some attempts to remedy these effects involve Bayesian modeling [2] and various computational and simulation methods [8]. In particular, simulation of correlated categorical variables has been discussed in the literature for some time [1, 3, 5]. Little research has been done to create mathematical framework of correlated or dependent categorical variables and the resulting distributions of functions of such variables.

Korzeniowski [4] studied dependent Bernoulli variables, formalizing the notion of identically distributed but dependent Bernoulli variables and deriving the distribution of the sum of such dependent variables, yielding a Generalized Binomial Distribution.

In this paper, we generalize the work of Korzeniowski [4] and formalize the notion of a sequence of identically distributed but dependent categorical random variables. We then derive a Generalized Multinomial Distribution for such variables and provide some properties of said distribution. We also give an algorithm to generate a sequence of correlated categorical random variables.

## Background

Korzeniowski defined the notion of dependence in a way we will refer to here as *dependence of the first kind* (FK dependence). Suppose (\epsilon_{1},...,\epsilon_{N}) is a sequence of Bernoulli random variables, and P(\epsilon_{1} = 1) = p. Then, for \epsilon_{i}, i \geq 2, we weight the probability of each binary outcome toward the outcome of \epsilon_{1}, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let 0 \leq \delta \leq 1, and q = 1-p. Then define the following quantities

\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q, &\qquad p^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = p-\delta p \\ q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = q+\delta p , &\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q -\delta q\end{aligned}

Given the outcome i of \epsilon_{1}, the probability of outcome i occurring in the subsequent Bernoulli variables \epsilon_{2}, \epsilon_{3},...,\epsilon_{n} is p^{+}, i = 1 or q^{+}, i=0. The probability of the opposite outcome is then decreased to q^{-} and p^{-}, respectively.

The above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, P(\epsilon_{i} = 1) = p for all i. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be

\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr} \delta, & i=1 \\ \delta^{2}, & i \neq j;\,i,j \geq 2\end{array}\right.

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution. In our generalization, we use the same form of FK dependence, but for categorical random variables. We will construct a sequence of identically distributed but dependent categorical variables from which we will build a generalized multinomial distribution. When the number of categories K = 2, the distribution reverts back to the generalized binomial distribution of Korzeniowski [4]. When the sequence is fully independent, the distribution reverts back to the independent categorical model and the standard multinomial distribution, and when the sequence is independent and K=2, we recover the standard binomial distribution. Thus, this new distribution represents a much larger generalization than prior models.

## Construction of Dependent Categorical Variables

Suppose each categorical variable has *K* possible categories, so the sample space S = \{1,\ldots,K\} (These integers should not be taken as ordered or sequential, but rather as character titles of categories.) The construction of the correlated categorical variables is based on a probability mass distribution over *K*-adic partitions of [0,1]. We will follow graph terminology in our construction, as this lends a visual representation of the construction. We begin with a parent node and build a *K*-nary tree, where the end nodes are labeled by the repeating sequence (1,…, *K*). Thus, after *N* steps, the graph has K^{N} nodes, with each node labeled 1,…,*K* repeating in the natural order and assigned injectively to the intervals

\left(0,\frac{1}{K^{N}}\right],\left(\frac{1}{K^{n}},\frac{2}{K^{N}}\right],\ldots,\left(\frac{K^{N}-1}{K^{n}},1\right].

Define

\begin{aligned}&\epsilon_{N} = i \qquad \text{ on } \left(\tfrac{Kj}{K^{N}}, \tfrac{Kj+i}{K^{N}}\right] \\&\epsilon_{N} = K \qquad \text{on } \left(\tfrac{K(j+1)-1}{K^{N}}, \tfrac{K(j+1)}{K^{N}}\right] \end{aligned}where i = 1,...,K-1, and j = 0,1,\ldots,K^{N-1}-1. An alternate expression for the above equation is

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1, \ldots ,K-1\\ \epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}To each of the branches of the tree, and by transitivity to each of the K^{N} partitions of [0,1], we assign a probability mass to each node such that the total mass is 1 at each level of the tree in a similar manner to [4].

Let 0 < p_{i} < 1, i=1,\ldots,K such that \sum_{i=1}^{K}p_{i}=1, and let 0 \leq \delta \leq 1 be the *dependency coefficient*. Define

for* i *= 1,…,*K.* These probabilities satisfy two important criteria detailed in the following lemmata:

**Lemma 1.**

\begin{aligned}\sum_{i=1}^{K}p_{i} = p_{i}^{+} + \sum_{l \neq i}p_{l}^{-} = 1\\ p_{i}p_{i}^{+} + p_{i}^{-}\sum_{l \neq i}p_{l} = p_{i}\end{aligned}

We now give the construction in steps down each level of the *K*-nary tree.

**LEVEL 1:**

*Parent node* has mass 1, with mass split 1\cdot\prod_{i=1}^{K}p_{i}^{[\epsilon_{1} = i]}, where [\cdot] is an Iverson bracket. This level corresponds to a sequence \epsilon of dependent categorical variables of length 1.

##### LEVEL 2:

Level 2 has *K* nodes, with *K* branches stemming from each node. This corresponds to a sequence of length 2: \epsilon = (\epsilon_{1}, \epsilon_{2}). Denote *i*.1 as node *i* from level 1. For *i* = 1,…,*K*,

**Node i.1** has mass p_{i}, with mass split p_{i}\left(p_{i}^{+}\right)^{[\epsilon_{2}=i]}\prod\limits_{j=1,j\neq i}^{K}\left(p_{j}^{-}\right)^{[\epsilon_{2} = j]}.

In this fashion, we distribute the total mass 1 across level 2. In general, at any level *r*, there are *K* streams of probability flow at Level *r*. For \epsilon_{1} = i, the probability flow is given by

p_{i}\prod_{j = 2}^{r}\left[\left(p_{i}^{+}\right)^{[\epsilon_{j} = i]}\prod_{l\neq i }\left(p_{l}^{-}\right)^{[\epsilon_{j} = l]}\right],\qquad i = 1,\ldots,K\qquad\qquad(*)

We use this flow to assign mass to the *K ^{r}* intervals of [0,1] at level

*r*in the same way as above. We may also verify via algebraic manipulation that

where the first equality is due to the first statement of Lemma 1, and the second is due to the probability flow at level *r*, given in (*) above.

### Example Construction

For an illustration, refer to Figure 2 above. In this example, we construct a sequence of length *N* = 3 of categorical variables with *K* = 3 categories. At each level *r*, there are 3^{r} nodes corresponding to 3^{r} partitions of the interval [0,1]. Note that each time the node splits into 3 children, the sum of the split probabilities is 1. Despite the outcome of the previous random variable, the next one always has three possibilities. The sample space of categorical variable sequences of length 3 has 3^{3} = 27 possibilities.

### Properties

#### Identically Distributed but Dependent

We now show the most important property of this class of sequences– that they remain identically distributed despite losing independence.

**Lemma 2:**

P(\epsilon_{r} = i) = p_{i}\qquad i = 1,\ldots,K; \:\: r \in \mathbb{N}

#### Pairwise Cross-Covariance Matrix

We now give the pairwise cross-covariance matrix for dependent categorical random variables.

**Theorem 1: (Cross-Covariance of Dependent Categorical Random Variables). **Denote by \Lambda^{\iota,\tau} the *K* × *K* cross-covariance matrix of \epsilon_{\iota} and \epsilon_{\tau}, \iota,\tau = 1,\ldots,n, defined as \Lambda^{\iota,\tau} = E[(\epsilon_{\iota} - E[\epsilon_{\iota}])(\epsilon_{\tau} - E[\epsilon_{\tau}])]. Then the entries of the matrix are given by

\Lambda^{1,\tau}_{ij} = \left\{\begin{array}{cc}\delta p_{i}(1-p_{i}), & i = j \\-\delta p_{i}p_{j}, & i \neq j\end{array}\right.\qquad \tau \geq 2,

and

\Lambda^{\iota,\tau}_{ij} = \left\{\begin{array}{cc}\delta^{2}p_{i}(1-p_{i}), & i = j \\-\delta^{2}p_{i}p_{j}, & i \neq j\end{array}\right. \qquad\tau > \iota,\:\iota\neq 1.

In the next section, we exploit the desirable identical distribution of the categorical sequence in order to provide a generalized multinomial distribution for the counts in each category.

## Generalized Multinomial Distribution

From the construction in Section 2, we derive a generalized multinomial distribution in which all categorical variables are identically distributed but no longer independent.

**Theorem 2: Generalized Multinomial Distribution. **Let \epsilon_{1},\ldots,\epsilon_{N} be categorical random variables with categories 1,\ldots,K, constructed as in Section 2. Let X_{i} = \sum_{j=1}^{N}[\epsilon_{j} = i], i = 1,\ldots,K, where [\cdot] is the Iverson bracket. Denote \mathbf{X} = (X_{1},\ldots,X_{K}), with observed values \mathbf{x} = (x_{1},\ldots,x_{K}). Then

P(\mathbf{X} = \mathbf{x}) = \sum_{i=1}^{K}p_{i}\frac{(N-1)!}{(x_{i}-1)!\prod_{j \neq i}x_{j}!}\left(p_{i}^{+}\right)^{x_{i}-1}\prod_{j \neq i}\left(p_{j}^{-}\right)^{x_{j}}.

### Properties

This section details some useful properties of the Generalized Multinomial Distribution and the dependent categorical random variables.

**Marginal Distributions**

**Theorem 3: Univariate Marginal Distribution. **The univariate marginal distribution of the Generalized Multinomial Distribution is the Generalized Binomial Distribution. That is,

P(X_{i} = x_{i}) = q{N-1 \choose x_{i}}\left(p_{i}^{-}\right)^{x_{i}}\left(q^{-}\right)^{N-1-x_{i}} + p_{i}{N-1 \choose x_{i}-1}\left(p_{i}^{+}\right)^{x_{i}-1}\left(q^{+}\right)^{N-1-(x_{i}-1)},

where q = \sum_{j \neq i}p_{j}, q^{+} = q + \delta p_{i}, and q^{-} = q - \delta q.

The above theorem shows another way the generalized multinomial distribution is an extension of the generalized binomial distribution.

**Theorem 4: Moment Generating Function.** The moment generating function of the generalized multinomial distribution with *K* categories is given by

M_{\mathbf{X}}(\mathbf{t}) = \sum_{i=1}^{K}p_{i}e^{t_{i}}\left(p_{i}^{+}e^{t_{i}} + \sum_{j \neq i}p_{j}^{-}e^{t_{j}}\right)^{n-1}

where \mathbf{X} = (X_{1},...,X_{K}) and \mathbf{t} = (t_{1},...,t_{K}).

#### Moments of the Generalized Multinomial Distribution

Using the moment generating function in the standard way, the mean vector \mu and the covariance matrix \Sigma may be derived.

**Expected Value.** The expected value of \mathbf{X} is given by \mu = n\mathbf{p} where \mathbf{p} = (p_{1},...,p_{K})

**Covariance Matrix** The entries of the covariance matrix are given by

\Sigma_{ij} = \left\{\begin{array}{lr}p_{i}(1-p_{i})(n + \delta(n-1) + \delta^{2}(n-1)(n-2)), & i = j \\p_{i}p_{j}(\delta(1-\delta)(n-2)(n-1)-n), & i \neq j\end{array}\right.
Note that if \delta = 0, the generalized multinomial distribution reduces to the standard multinomial distribution and \Sigma becomes the familiar multinomial covariance matrix. The entries of the corresponding correlation matrix are given by

\rho(X_{i},X_{j}) = -\sqrt{\frac{p_{i}p_{j}}{(1-p_{i})(1-p_{j})}}\left(\frac{n-\delta(n-1)(n-2)}{n+\delta(n-1)+\delta^{2}(n-1)(n-2)}\right)

If \delta = 1, the variance of X_{i} tends to \infty with *n*. This is intuitive, as \delta = 1 implies perfect dependence of \epsilon_{2},...,\epsilon_{n} on the outcome of \epsilon_{1}. Thus, X_{i} will either be 0 or *n*, and this spread increases to \infty with *n.*

## Generating a Sequence of Correlated Categorical Random Variables

For brevity, we will take the acronym DCRV for a **D**ependent **C**ategorical **R**andom **V**ariable. A DCRV sequence \epsilon = (\epsilon_{1},...,\epsilon_{n}) is in itself a random variable, and thus has a probability distribution. In order to provide an algorithm to generate such a sequence, we first derive this probability distribution.

### Probability Distribution of a DCRV Sequence

The probability distribution of the DCRV sequence \epsilon of length *n* is given formally in the following theorem. The proof follows in a straightforward fashion from the construction of dependent categorical random variables and is therefore omitted.

**Theorem 5: Distribution of a DCRV Sequence. **Let (\Sigma,\mathcal{F}, \mathbb{P}) = ([0,1], \mathcal{B}, \mu). Let \epsilon_{i} : [0,1] \to \{1,...,K\},i = 1,...,n,n \in \mathbb{N} be DCRVs as defined in the construction. Let \epsilon = (\epsilon_{1},...,\epsilon_{n}) denote the DCRV sequence with observed values e = (e_{1},...,e_{n}). Then \mu has the density

f(x) = \sum_{i=1}^{K^{n}}K^{n}m^{i}\mathbb{1}_{\tiny((i-1)/K^{n}, i/K^{n}]}(x)

and

P(\epsilon = e) = \int\limits_{\left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right]}f(x)dx = m^{i}

where m^{i} is the mass allocated to the interval \left(\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right] by the probability mass flow and *i* as the lexicographic order of *e* in the sample space \{1,...,K\}^{n} given by the relation \frac{i}{K^{n}} = \sum_{j=1}^{K^{n}}\frac{\epsilon_{j}-1}{K^{j}}

### Algorithm

We take a common notion of using a uniform random variable in order to generate the desired random variable \epsilon. For \epsilon with distribution F(x) = \int_{0}^{x}f(y)dy, f(x) as in the above equation, it is clear that *F* is invertible with inverse F^{-1}. Thus, F^{-1}(U) for U \sim \text{Uniform}[0,1] has the same distribution as \epsilon. Therefore, sampling *u* from *U* is equivalent to the sample e = F^{-1}(u) from \epsilon.

In the construction of the DCRVs, we associated \epsilon_{n} to the intervals

\begin{aligned}\epsilon_{N} =i &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & i \equiv l \mod K , & \quad i = 1,\ldots,K-1 \\\epsilon_{N} = K &\text{ on } \left(\tfrac{l-1}{K^{N}}, \tfrac{l}{K^{N}}\right], & 0 \equiv l \mod K\end{aligned}From the construction, each sequence has a 1-1 correspondence with the interval \left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right) for a unique i = 1,\ldots,K^{n}. The probability of such a sequence can be found using Theorem 5:

P(\epsilon = e) = F\left(\left[\frac{i-1}{K^{n}}, \frac{i}{K^{n}}\right)\right) = m^{i} = l\left([s_{i-1}, s_{i})\right)where *l* is the length of the above interval, and s_{i} = \sum_{j=1}^{i}m^{j}. Therefore, we have now partitioned the interval [0,1) according to the distribution of \epsilon bijectively to the *K*-nary partition of [0,1) corresponding to the particular sequence. Thus, sampling u \in [0,1) from a uniform distribution and finding the interval [s_{i-1},s_{i}) and corresponding *i* will yield the unique DCRV sequence.

**Algorithm Strategy:** Given u \in [0,1) and n \in \mathbb{N}, find the unique interval [s_{i-1}, s_{i}), i= 1,\ldots,K^{n} that contains *u* by “moving down the tree” and narrowing down the “search interval” until level *n* is reached.

We provide an explicit example prior to the pseudocode to illustrate the strategy.

#### Example DCRV Sequence Generation

Suppose K = 3, p_{1} = p_{2} = p_{3} = 1/3, and \delta = 1/2, and suppose we wish to generate a DCRV sequence of length 2. The figure above gives the probability flow and the corresponding probability intervals [s_{i-1}, s_{i}) that partition [0, 1) according to Theorem 4. Now, suppose we sample a uniform random variable*u,*with observed value u = \frac{3}{4}. We now illustrate the process of moving down the above tree to generate the sequence.

- The first level of the probability tree partitions the interval [0, 1) into three intervals given in the figure above. u = \frac{3}{4} lies in the third interval, which corresponds to \epsilon_{1} = 3. Thus, the first entry of
*e*is given by e_{1} = 3. - Next, the search interval is reduced to the interval from the first step [2/3, 1). We then generate the partitions of [2/3, 1) by cumulatively adding p_{3}p_{i}^{-}, i = 1,2,3 to the left endpoint 2/3. Thus, the next partition points are
- 2/3 + (1/3)(1/6) = 13/18
- 2/3 + (1/3)(1/6) + (1/3)(1/6) = 7/9, and
- 2/3 + (1/3)(1/6) + (1/3)(1/6) + (1/3)(2/3) = 1

Yielding the subintervals of [2/3, 1):

- [2/3, 13/18)
- [13/18, 7/9)
- [7/9, 1)

We now find the interval from above that contains *u* is the second: [13/18, 7/9). Thus, \epsilon_{2} = 2. Since we only sought a sequence of length 2, the algorithm is finished, and we have generated the sequence *e =* (3, 2). If a longer sequence is desired, we repeat step 2 until we are at level *n*. For the pseudocode of the algorithm, see the pdf, linked at the top of the page.

## Conclusion

Categorical variables play a large role in many statistical and practical applications across disciplines. Moreover, correlations among categorical variables are common and found in many scenarios, which can cause problems with conventional assumptions. Different approaches have been taken to mitigate these effects, because a mathematical framework to define a measure of dependency in a sequence of categorical variables was not available. This paper formalized the notion of dependent categorical variables under a first-dependence scheme and proved that such a sequence is identically distributed but now dependent. With an identically distributed but dependent sequence, a generalized multinomial distribution was derived in Section~\ref{sec: gen multinomial} and important properties of this distribution were provided. An efficient algorithm to generate a sequence of dependent categorical random variables was given.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

## References

- BISWAS, A. Generating correlated ordinal categorical random samples. Statistics and Probability Letters (2004), 25–35.
- HIGGS, M. D., AND HOETING, J. A. A clipped latent variable model for spatially correlated ordered categorical data. Computational Statistics and Data Analysis (2010), 1999–2011.
- IBRAHIM, N., AND SULIADI, S. Generating correlated discrete ordinal data using r and sas iml. Computer Methods and Programs in Biomedicine (2011), 122–132.
- KORZENIOWSKI, A. On correlated random graphs. Journal of Probability and Statistical Science (2013), 43–58.
- LEE, A. Some simple methods for generating correlated categorical variates. Computational Statistics and Data Analysis (1997), 133–148.

- NICODEMUS, K. K., AND MALLEY, J. D. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics (2009), 1884–1890.
- NISTOR GROZAVU, L. L., AND BENNANI, Y. Autonomous clustering characterization for categorical data. Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on (2010).
- S.J. TANNENBAUM, N.H.G. HOLFORD, H. L. E. A. Simulation of correlated continuous and categorical variables using a single multivariate distribution. Journal of Pharmacokinetics and Pharmacodynamics (2006), 773–794.
- TOLOSI, L., AND LENGAURER, T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics (2011), 1986–1994.