High dimensional data clustering and correlation analysis

Xiaoli Zhang, Purdue University

Abstract

This thesis studies two unsupervised pattern discovery problems within the context of scientific applications. The goal is to identify the problems of existing techniques in a practical context and provide solutions. The first problem, clustering land cover data, poses challenges to existing techniques due to the high dimensionality of the data. Part I of this thesis investigates how to employ ensemble methods to solve high-dimensional data clustering problems. In particular, I explore the following two questions: (1) how to effectively generate an ensemble of clustering solutions, and (2) how to combine these clustering solutions. To address the first question, I investigate three different approaches to constructing ensembles based on randomized dimension reduction. The results demonstrate that random projection is an effective approach for generating cluster ensembles for high-dimensional data and that its efficacy is attributable to its ability to produce diverse base clusterings. For the second question, I design a graph-based approach, which transforms the problem of combining clusterings into a bipartite graph partitioning problem. Empirical comparisons of the bipartite approach to three existing approaches illustrate that the bipartite approach achieves the best overall performance.

Part II of this thesis addresses the problem of correlation pattern analysis, which examines how two related domains, X and Y, correlate with one another. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the two domains that is globally valid throughout the data. The approach presented in this thesis addresses this limitation by constructing a mixture of local CCA models through a process named correlation clustering. In correlation clustering, both data sets are clustered simultaneously according to the data's correlation structure such that, within a cluster, domain X and domain Y are linearly correlated in the same way. Each cluster is then analyzed using traditional CCA to construct a locally linear correlation model. Empirical evaluations on both artificial data sets and real-world data demonstrate that the proposed approach can detect useful correlation patterns that traditional CCA fails to discover.
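
The limitation of a single global linear model can be seen in a small toy case. The sketch below is only an illustration under stated assumptions, not the thesis's algorithm: X and Y are each one-dimensional (where the first canonical correlation reduces to the absolute Pearson correlation), the data and the names `pearson`, `x_a`, `y_a`, `x_b`, `y_b` are hypothetical, and the cluster memberships are given in advance, whereas correlation clustering would have to discover them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: two clusters with opposite linear relationships.
x_a = [float(v) for v in range(-5, 6)]
y_a = [2.0 * x for x in x_a]    # cluster A: Y = 2X
x_b = [float(v) for v in range(-5, 6)]
y_b = [-2.0 * x for x in x_b]   # cluster B: Y = -2X

# A single global model sees the pooled data only.
global_r = pearson(x_a + x_b, y_a + y_b)

# Local models are fitted per cluster (memberships assumed known here).
local_r_a = pearson(x_a, y_a)
local_r_b = pearson(x_b, y_b)

print(f"global correlation:    {global_r:+.2f}")
print(f"cluster A correlation: {local_r_a:+.2f}")
print(f"cluster B correlation: {local_r_b:+.2f}")
```

In this constructed case the two opposite trends cancel when pooled, so the global coefficient is zero even though each cluster is perfectly linearly correlated on its own. Real data would be multivariate and noisy, and would call for a full CCA fit within each cluster.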

Degree

Ph.D.

Advisors

Brodley, Purdue University.

Subject Area

Electrical engineering
