Finite mixture models for clustering, dimension reduction and privacy preserving data mining

Xiaodong Lin, Purdue University

Abstract

Gaussian mixture models have been used extensively in model-based clustering. It is well known that the likelihood function for this model is unbounded, so the global MLE does not exist. Existing methods confine the parameter estimates to the interior of the parameter space, and the resulting MLE is shown to be consistent. We propose the artificially contaminated Gaussian mixture model (ACM), which achieves MLE consistency globally. Furthermore, the usual assumption that the true parameter lies in the interior of the parameter space can be dropped. The variance of the artificial contamination is allowed to decrease to zero at a polynomial rate while the MLE remains consistent. Empirical studies demonstrate that ACM outperforms competing methods such as the constrained Gaussian mixture model and the penalized Gaussian mixture model.
A degenerate EM (DEM) algorithm is proposed to handle degeneracy in multivariate Gaussian mixture models. By adaptively changing the singular covariance matrices, the DEM algorithm is able to identify the genuine degenerate components while discarding the spurious ones. At the same time, the DEM algorithm solves the EM breakdown problem.
Traditionally, clustering and dimension reduction are performed separately. The mixture of factor analyzers (MFA) model has been developed for simultaneous dimension reduction and clustering. We show that a constraint based on the total variation can drastically improve the performance of the MFA model. An EM algorithm is developed to solve the resulting optimization problem, which involves a quadratic constraint. A two-step procedure is proposed to speed up model selection. Comparative evaluations show that the constrained MFA (CMFA) model performs demonstrably better than methods including the MFA and the LPCA model on a number of problems. The CMFA model appears to have real potential for high-dimensional data analysis.
Privacy preserving data mining has received tremendous attention recently. With the imminent growth of distributed databases and data mining technology, individual privacy has become a very important issue. We propose a privacy preserving clustering algorithm based on finite mixture models that preserves the privacy of both the individual data items and the local summary results. Problems regarding privacy preserving statistical analysis are also discussed.
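
The unbounded likelihood mentioned in the abstract can be seen numerically. The following is a minimal sketch, not taken from the dissertation: the data grid, the two-component univariate setup, and the function name `mixture_loglik` are all assumptions made for illustration. It centers one component on a single observation and shrinks that component's standard deviation, so the log-likelihood grows without bound. It does not implement the ACM, DEM, or CMFA methods.

```python
import numpy as np

def mixture_loglik(x, weights, means, sigmas):
    """Log-likelihood of a univariate Gaussian mixture evaluated at data x."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (n, 1)
    w = np.asarray(weights, dtype=float)[None, :]      # shape (1, k)
    mu = np.asarray(means, dtype=float)[None, :]
    s = np.asarray(sigmas, dtype=float)[None, :]
    # Per-point, per-component log of (weight * normal density).
    log_comp = (np.log(w) - np.log(s) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * ((x - mu) / s) ** 2)
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    per_point = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(per_point.sum())

# Illustrative data (an assumption): an evenly spaced grid on [-2, 2].
x = np.linspace(-2.0, 2.0, 21)

# Component 1 sits exactly on the observation x[0]; component 2 is N(0, 1).
# Only the standard deviation of component 1 is varied.
shrinking = [1e-2, 1e-4, 1e-8, 1e-16]
values = [
    mixture_loglik(x, [0.5, 0.5], [x[0], 0.0], [s, 1.0])
    for s in shrinking
]

for s, v in zip(shrinking, values):
    print(f"sigma_1 = {s:g}: log-likelihood = {v:.3f}")
```

Each reduction of the standard deviation by a factor c raises the log-likelihood by roughly log(c), because the density at the single matched observation grows like 1/sigma while the remaining observations are still covered by the second component. This is the degenerate behavior that constrained, penalized, or contaminated formulations are meant to rule out.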

Degree

Ph.D.

Advisors

Zhu, Purdue University.

Subject Area

Statistics; Computer science
