# A Generalized Multinomial Distribution from Dependent Categorical Random Variables

###### For the full paper, which includes all proofs, download the pdf here.

## Background

Korzeniowski defined the notion of dependence in a way we will refer to here as *dependence of the first kind* (FK dependence). Suppose (\epsilon_{1},...,\epsilon_{N}) is a sequence of Bernoulli random variables, and P(\epsilon_{1} = 1) = p. Then, for \epsilon_{i}, i \geq 2, we weight the probability of each binary outcome toward the outcome of \epsilon_{1}, adjusting the probabilities of the remaining outcomes accordingly.

Formally, let 0 \leq \delta \leq 1, and q = 1-p. Then define the following quantities

\begin{aligned}p^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 1) = p + \delta q, &\qquad p^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = p-\delta p \\ q^{+} := P(\epsilon_{i} = 1 | \epsilon_{1} = 0) = q+\delta p , &\qquad q^{-} := P(\epsilon_{i} = 0 | \epsilon_{1} = 0) = q -\delta q\end{aligned}

Given the outcome i of \epsilon_{1}, the probability of outcome i occurring in the subsequent Bernoulli variables \epsilon_{2}, \epsilon_{3},...,\epsilon_{n} is p^{+}, i = 1 or q^{+}, i=0. The probability of the opposite outcome is then decreased to q^{-} and p^{-}, respectively.

The above illustrates the possible outcomes of a sequence of such dependent Bernoulli variables. Korzeniowski showed that, despite this conditional dependency, P(\epsilon_{i} = 1) = p for all i. That is, the sequence of Bernoulli variables is identically distributed, with correlation shown to be

\text{Cor}(\epsilon_{i}, \epsilon_{j}) = \left\{\begin{array}{lr} \delta, & i=1 \\ \delta^{2}, & i \neq j;\,i,j \geq 2\end{array}\right.

These identically distributed but correlated Bernoulli random variables yield a Generalized Binomial distribution with a similar form to the standard binomial distribution. In our generalization, we use the same form of FK dependence, but for categorical random variables. We will construct a sequence of identically distributed but dependent categorical variables from which we will build a generalized multinomial distribution. When the number of categories K = 2, the distribution reverts back to the generalized binomial distribution of Korzeniowski [4]. When the sequence is fully independent, the distribution reverts back to the independent categorical model and the standard multinomial distribution, and when the sequence is independent and K=2, we recover the standard binomial distribution. Thus, this new distribution represents a much larger generalization than prior models.