Hierarchical Non-Parametric Bayesian Mixture Models and Applications on Big Data
Abstract
Among Bayesian nonparametric priors, the Dirichlet process (DP) lets a mixture model learn the number of clusters from the data, so the resulting mixture model is nonparametric in the number of clusters. Each cluster, however, is still represented by a single parametric distribution. Real-world applications call for more flexibility, because clusters with skewed or multimodal shapes cannot be modeled well by a single parametric distribution. In this dissertation, we show that introducing a hierarchy over the cluster distributions is an effective way to build more flexible generative models without significantly expanding the parameter space or the computational complexity. Referring to its two-layer structure, we name the method Infinite Mixtures of Infinite Gaussian Mixtures (I2GMM). We present a collapsed Gibbs sampler for inference in I2GMM, which is parallelized by exploiting the hierarchical structure. The collapsed sampler does not consider load balancing, however, and therefore underutilizes the resources of modern multi-core architectures. We then introduce a new sampling algorithm that combines the uncollapsed and collapsed samplers to improve the degree of parallelization. Our experiments cover flow cytometry and remote sensing data as well as several benchmark datasets. In clustering, I2GMM achieves a better mean F1 score than the parametric and nonparametric alternatives. We also apply the new parallel sampler to the IGMM and I2GMM models and observe a further reduction in computation time while maintaining clustering accuracy comparable to that of the collapsed Gibbs sampler.
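As an illustration of how a DP prior leaves the number of clusters open, the sketch below draws cluster assignments from a Chinese restaurant process, the partition distribution induced by a DP. It is a minimal sketch and not the dissertation's sampler: the function name sample_crp, the concentration parameter alpha, and the seed argument are assumptions introduced here, and no data likelihood or second mixture layer is modeled.

```python
import random


def sample_crp(n_points, alpha, seed=0):
    """Draw cluster labels for n_points from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter; larger values tend to
    produce more clusters. This is a prior-only illustration.
    """
    rng = random.Random(seed)
    assignments = []
    counts = []  # counts[k] = number of points currently in cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = sample_crp(n_points=200, alpha=1.0, seed=1)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is never passed in; it emerges from the draws and tends to grow slowly with the number of points.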
Degree
Ph.D.
Advisors
Dundar, Purdue University.
Subject Area
Statistics, Artificial intelligence, Computer science