Null model methods for cluster analysis of gene expression data

Brian Munneke, Purdue University


Phenotype expression in organisms is strongly influenced by genes. Recent microarray innovations in molecular biology allow the biologist to quantify the amount of gene activity resulting from various experimental conditions. Microarrays can be used to monitor the expression of thousands of genes simultaneously in an experiment with several treatment conditions. The magnitude of the resulting data can be overwhelming. The vast amount of data, coupled with the inherent variation present in microarray technology, provides many opportunities for statisticians to contribute to both the analysis and interpretation of these data.

A common approach for exploratory analysis of these data is cluster analysis, where the intention of the biologist is to uncover regulatory relationships between genes. A wide variety of clustering algorithms exist and have been applied to this unsupervised learning problem. These algorithms cluster genes according to their quantified activity levels under experimental conditions, with the aim of identifying genes with similar patterns of expression and thus identifying co-regulated genes. The experimental variation present in microarray technology causes concern among researchers and calls into question the stability of the clustered expression profiles, as well as the inferred co-regulatory relationships. To address the statistical implications of this variation, a methodology for statistically validating cluster structure is investigated. This investigation includes a new dissimilarity measure for use in clustering algorithms, based on a penalized cosine of the angle, and two approaches that independently produce the null space of gene expression profiles. Incorporating these null model methodologies into cluster analysis of gene expression data provides an assessment of how close any suggested cluster is to what would be expected by random association. It thereby provides an assessment of any clustering algorithm, as well as the ability to compare results from different clustering methodologies.
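
The general idea of comparing a suggested cluster against random association can be illustrated with a minimal sketch. It is not the method of the dissertation. The abstract does not specify the penalty term or the two null space constructions, so the sketch assumes a plain (unpenalized) cosine dissimilarity and a simple permutation null in which each gene's values are shuffled across conditions. All function names and the example data are hypothetical.

```python
import math
import random


def cosine_dissimilarity(x, y):
    """1 - cos(angle) between two profiles (assumption: no penalty term)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0  # treat an all-zero profile as orthogonal to everything
    return 1.0 - dot / (norm_x * norm_y)


def mean_within_dissimilarity(profiles):
    """Average pairwise dissimilarity among the profiles of one cluster."""
    n = len(profiles)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine_dissimilarity(profiles[i], profiles[j]) for i, j in pairs)
    return total / len(pairs)


def permutation_p_value(cluster, n_permutations=1000, seed=0):
    """Fraction of shuffled clusters at least as tight as the observed one.

    Assumed null model: each gene's values are permuted independently
    across conditions, which breaks any shared expression pattern.
    """
    rng = random.Random(seed)
    observed = mean_within_dissimilarity(cluster)
    hits = 0
    for _ in range(n_permutations):
        shuffled = [rng.sample(profile, len(profile)) for profile in cluster]
        if mean_within_dissimilarity(shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical log-ratio profiles for three genes over six conditions.
example_cluster = [
    [-1.5, -0.5, 0.2, 0.9, 1.4, 2.0],
    [-1.4, -0.6, 0.3, 1.0, 1.3, 1.9],
    [-1.6, -0.4, 0.1, 0.8, 1.5, 2.1],
]
p_value = permutation_p_value(example_cluster)
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value here suggests the cluster is tighter than would be expected under this particular null. A different null model or dissimilarity measure could lead to a different conclusion, which is why the choice of null model matters.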




Major Professor: Rebecca W. Doerge, Purdue University.

Subject Area

Biology, Genetics|Statistics
