A Bayesian approach to restricted latent class models for scientifically structured clustering of multivariate binary outcomes

dc.contributor.author: Wu, Zhenke
dc.contributor.author: Casciola-Rosen, Livia
dc.contributor.author: Rosen, Antony
dc.contributor.author: Zeger, Scott L.
dc.date.accessioned: 2022-01-06T15:50:52Z
dc.date.available: 2023-01-06 10:50:50 (en)
dc.date.available: 2022-01-06T15:50:52Z
dc.date.issued: 2021-12
dc.identifier.citation: Wu, Zhenke; Casciola-Rosen, Livia; Rosen, Antony; Zeger, Scott L. (2021). "A Bayesian approach to restricted latent class models for scientifically structured clustering of multivariate binary outcomes." Biometrics 77(4): 1431-1444.
dc.identifier.issn: 0006-341X
dc.identifier.issn: 1541-0420
dc.identifier.uri: https://hdl.handle.net/2027.42/171212
dc.description.abstract: This paper presents a model-based method for clustering multivariate binary observations that incorporates constraints consistent with the scientific context. The approach is motivated by the precision medicine problem of identifying autoimmune disease patient subsets or classes who may require different treatments. We start with a family of restricted latent class models or RLCMs. However, in the motivating example and many others like it, the unknown number of classes and the definition of classes using binary states are among the targets of inference. We use a Bayesian approach to RLCMs in order to use informative prior assumptions on the number and definitions of latent classes to be consistent with scientific knowledge so that the posterior distribution tends to concentrate on smaller numbers of clusters and sparser binary patterns. The paper derives a posterior sampling algorithm based on Markov chain Monte Carlo with split-merge updates to efficiently explore the space of clustering allocations. Through simulations under the assumed model and realistic deviations from it, we demonstrate greater interpretability of results and superior finite-sample clustering performance for our method compared to common alternatives. The methods are illustrated with an analysis of protein data to detect clusters representing autoantibody classes among scleroderma patients.
dc.publisher: Cambridge University Press
dc.publisher: Wiley Periodicals, Inc.
dc.subject.other: latent class models
dc.subject.other: mixture of finite mixture models
dc.subject.other: autoimmune disease
dc.subject.other: Markov chain Monte Carlo
dc.subject.other: dependent binary data
dc.subject.other: clustering
dc.title: A Bayesian approach to restricted latent class models for scientifically structured clustering of multivariate binary outcomes
dc.type: Article
dc.rights.robots: IndexNoFollow
dc.subject.hlbsecondlevel: Mathematics
dc.subject.hlbtoplevel: Science
dc.description.peerreviewed: Peer Reviewed
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/171212/1/biom13388-sup-0002-SuppMat.pdf
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/171212/2/biom13388_am.pdf
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/171212/3/biom13388.pdf
dc.identifier.doi: 10.1111/biom.13388
dc.identifier.source: Biometrics
dc.identifier.citedreference: Ni, Y., Müller, P., Diesendruck, M., Williamson, S., Zhu, Y. and Ji, Y. (2020) Scalable Bayesian nonparametric clustering and classification. Journal of Computational and Graphical Statistics, 29, 53-65.
dc.identifier.citedreference: Hubert, L. and Arabie, P. (1985) Comparing partitions. Journal of Classification, 2, 193-218.
dc.identifier.citedreference: Jacob, P.E., Murray, L.M., Holmes, C.C. and Robert, C.P. (2017) Better together? statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
dc.identifier.citedreference: Jain, S. and Neal, R.M. (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13, 158-182.
dc.identifier.citedreference: Junker, B.W. and Sijtsma, K. (2001) Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258-272.
dc.identifier.citedreference: Kadane, J. (1975) The role of identification in Bayesian theory. In: Fienberg, S. and Zellner, A. (Eds.) Studies in Bayesian Econometrics and Statistics. chapter 5.2. Amsterdam: North-Holland, pp. 175-191.
dc.identifier.citedreference: Lazarsfeld, P.F. (1950) The logical and mathematical foundations of latent structure analysis. In: Stouffer, S. (Ed.) The American Soldier: Studies in Social Psychology in World War II, volume IV. Princeton, NJ: Princeton University Press, pp. 362-412.
dc.identifier.citedreference: Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791.
dc.identifier.citedreference: Liu, J.S. (1996) Peskun’s theorem and a modified discrete-state Gibbs sampler. Biometrika, 83, 681-682.
dc.identifier.citedreference: Meeds, E., Ghahramani, Z., Neal, R.M. and Roweis, S.T. (2007) Modeling dyadic data with binary latent factors. Advances in Neural Information Processing Systems, 977-984.
dc.identifier.citedreference: Miettinen, P., Mielikäinen, T., Gionis, A., Das, G. and Mannila, H. (2008) The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering, 20, 1348-1362.
dc.identifier.citedreference: Miller, J.W. and Harrison, M.T. (2018) Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113, 340-356.
dc.identifier.citedreference: Ni, Y., Müller, P. and Ji, Y. (2019) Bayesian double feature allocation for phenotyping with electronic health records. Journal of the American Statistical Association. To Appear.
dc.identifier.citedreference: Nobile, A. and Fearnside, A.T. (2007) Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing, 17, 147-162.
dc.identifier.citedreference: Rosen, A. and Casciola-Rosen, L. (2016) Autoantigens as partners in initiation and propagation of autoimmune rheumatic diseases. Annual Review of Immunology, 34, 395-420.
dc.identifier.citedreference: Rukat, T., Holmes, C.C., Titsias, M.K. and Yau, C. (2017) Bayesian boolean matrix factorisation. In International Conference on Machine Learning: Proceedings of the 34th International Conference on Machine Learning, 70. pp. 2969-2978.
dc.identifier.citedreference: Teh, Y.W., Görür, D. and Ghahramani, Z. (2007) Stick-breaking construction for the Indian buffet process. Artificial Intelligence and Statistics, 556-563.
dc.identifier.citedreference: Templin, J.L. and Henson, R.A. (2006) Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287-305.
dc.identifier.citedreference: Vermunt, J.K. and Magidson, J. (2002) Latent class cluster analysis. Applied Latent Class Analysis, 11, 89-106.
dc.identifier.citedreference: Wu, Z., Casciola-Rosen, L., Shah, A., Rosen, A. and Zeger, S.L. (2019) Estimating autoantibody signatures to detect autoimmune disease patient subsets. Biostatistics, 20, 30-47.
dc.identifier.citedreference: Wu, Z., Deloria-Knoll, M., Hammitt, L.L. and Zeger, S.L. (2016) Partially latent class models for case-control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65, 97-114.
dc.identifier.citedreference: Wu, Z., Deloria-Knoll, M. and Zeger, S.L. (2017) Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics, 18, 200-213.
dc.identifier.citedreference: Xu, G. (2017) Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675-707.
dc.identifier.citedreference: Xu, G. and Shang, Z. (2018) Identifying latent structures in restricted latent class models. Journal of the American Statistical Association, 113, 1284-1295.
dc.identifier.citedreference: Zhang, Z., Li, T., Ding, C. and Zhang, X. (2007) Binary matrix factorization with applications. In Seventh IEEE International Conference on Data Mining (ICDM 2007), 391-400.
dc.identifier.citedreference: McCullagh, P. and Yang, J. (2008) How many clusters? Bayesian Analysis, 3, 101-120.
dc.identifier.citedreference: Albert, J.H. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669-679.
dc.identifier.citedreference: Chiu, C.-Y., Douglas, J.A. and Li, X. (2009) Cluster analysis for cognitive diagnosis: theory and applications. Psychometrika, 74, 633-665.
dc.identifier.citedreference: Dahl, D.B. (2006) Model-based clustering for expression data via a Dirichlet process mixture model. In: Do, K.A., Müller, P. and Vannucci, M. (Eds.) Bayesian Inference for Gene Expression and Proteomics. New York: Cambridge University Press, pp. 201-218.
dc.identifier.citedreference: Dunson, D. and Xing, C. (2009) Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042-1051.
dc.identifier.citedreference: Garrett, E. and Zeger, S. (2000) Latent class model diagnosis. Biometrics, 56, 1055-1067.
dc.identifier.citedreference: Gelfand, A.E. and Smith, A.F. (1990) Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
dc.identifier.citedreference: Gelman, A., Meng, X.-L. and Stern, H. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-760.
dc.identifier.citedreference: Ghahramani, Z. and Griffiths, T.L. (2006) Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems, 475-482.
dc.identifier.citedreference: Goodman, L. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
dc.identifier.citedreference: Green, P.J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.
dc.identifier.citedreference: Gu, Y. and Xu, G. (2019a) Learning attribute patterns in high-dimensional structured latent attribute models. Journal of Machine Learning Research, 20, 1-58.
dc.identifier.citedreference: Gu, Y. and Xu, G. (2019b) The sufficient and necessary condition for the identifiability and estimability of the DINA model. Psychometrika, 84, 468-483.
dc.identifier.citedreference: Hoff, P.D. (2005) Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics, 61, 1027-1036.
dc.working.doi: NO (en)
dc.owningcollname: Interdisciplinary and Peer-Reviewed