# A Generalized Multinomial Distribution from Dependent Categorical Random Variables

###### For the full paper, which includes all proofs, download the pdf here.

## Conclusion

Categorical variables play a large role in many statistical and practical applications across disciplines. Correlations among categorical variables are also common, and they can cause problems with conventional assumptions such as independence. Different approaches have been taken to mitigate these effects because no mathematical framework was available to define a measure of dependency in a sequence of categorical variables. This paper formalized the notion of dependent categorical variables under a first-dependence scheme and proved that such a sequence is identically distributed but dependent. From this identically distributed but dependent sequence, a generalized multinomial distribution was derived, and important properties of this distribution were provided. An efficient algorithm to generate a sequence of dependent categorical random variables was also given.
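As a rough illustration of the generation idea, the following is a minimal sketch rather than the paper's algorithm as stated. It assumes that every variable after the first depends only on the first outcome, that a single dependency coefficient `delta` in [0, 1] shifts probability toward that first outcome, and that the conditional probabilities take the form `p_k + delta * (1 - p_k)` for the first outcome's category and `p_k * (1 - delta)` for the others. The function name and parameters are hypothetical.

```python
import random


def generate_dependent_sequence(probs, delta, n, rng=None):
    """Sketch: draw n categorical variables that all depend on the first draw.

    probs : marginal category probabilities (assumed to sum to 1)
    delta : assumed dependency coefficient in [0, 1]; 0 gives independence
    n     : sequence length
    rng   : optional random.Random instance for reproducibility
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if n < 1:
        return []
    rng = rng or random.Random()
    categories = list(range(len(probs)))

    # The first variable is drawn from the marginal distribution.
    first = rng.choices(categories, weights=probs, k=1)[0]

    # Assumed conditional distribution given the first outcome:
    # every category is scaled by (1 - delta), and the freed mass delta
    # is added to the category observed first.
    conditional = [(1.0 - delta) * p for p in probs]
    conditional[first] += delta

    # The remaining n - 1 variables are drawn from that conditional law.
    rest = rng.choices(categories, weights=conditional, k=n - 1)
    return [first] + rest


if __name__ == "__main__":
    print(generate_dependent_sequence([0.2, 0.3, 0.5], delta=0.4, n=10))
```

Under these assumptions each variable keeps the same marginal distribution `probs`, since averaging the conditional probabilities over the first outcome returns `p_k`, while `delta` controls how strongly the sequence clusters on the first category.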

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
