Statistical calibration and differential gene expression analysis of RNA-Seq data

Zhaonan Sun, Purdue University

Abstract

The rapid development of next-generation sequencing (NGS) technologies has revolutionized how genomic research is conducted. An important application of NGS, RNA-Seq, is the study of RNA transcripts. Despite the increasing popularity of RNA-Seq, statistical challenges remain in analyzing RNA-Seq data. This thesis focuses on three specific problems in gene expression quantification and differential gene expression analysis using RNA-Seq data.

Microarray technology has been a main platform for gene expression quantification for the last few decades, and RNA-Seq has emerged as a promising alternative to microarrays. Unfortunately, both microarray and RNA-Seq data are subject to various measurement errors. Using a third platform, qRT-PCR, which has low throughput but high measurement accuracy, we propose a system of functional measurement error models to model gene expression measurements and to calibrate the microarray and RNA-Seq platforms.

As new RNA-Seq normalization methods continue to appear, it is important to compare them systematically. Currently, normalization methods are compared and validated by their correlations with a chosen gold standard. Although this approach is intuitive and easy to implement, it becomes statistically inadequate when the gold standard is itself subject to measurement error. We use the system of measurement error models based on qRT-PCR, microarray, and RNA-Seq gene expression data to compare and validate RNA-Seq normalization methods.

Differential gene expression analysis is complicated in eukaryotes by the prevalence of alternative splicing events and the incompleteness of transcript annotations. We propose to use multivariate negative binomial distributions with certain covariance structures to model exon-level RNA-Seq data and perform differential expression analysis. Two specific models, referred to as the first-order adjacent model and the common covariance model, are discussed. The major barrier to using the multivariate negative binomial distribution is its computational burden. We generalize a recursive relation in the bivariate negative binomial distribution to the two proposed models. An application of the first-order adjacent model is presented and discussed.
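
The abstract's remark that correlation with a gold standard becomes inadequate when the gold standard itself carries measurement error can be illustrated with a small simulation. The sketch below is not the thesis's functional measurement error model. It assumes a simple additive-error setup in which a normalization method and a gold standard both observe the same true log-expression with independent Gaussian noise. All names (`pearson`, `simulate`) and the noise levels are hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_genes=5000, sd_method=0.3, sd_gold=0.5, seed=0):
    """Return (corr with truth, corr with noisy gold standard) for one method.

    Assumed model (illustrative only):
        truth  ~ N(0, 1)                      true log-expression per gene
        method = truth + N(0, sd_method^2)    normalized RNA-Seq measurement
        gold   = truth + N(0, sd_gold^2)      error-prone gold standard
    """
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
    method = [t + rng.gauss(0.0, sd_method) for t in truth]
    gold = [t + rng.gauss(0.0, sd_gold) for t in truth]
    return pearson(method, truth), pearson(method, gold)


corr_truth, corr_gold = simulate()
print(f"correlation with true expression: {corr_truth:.3f}")
print(f"correlation with noisy gold standard: {corr_gold:.3f}")
```

Under these assumed noise levels the theoretical correlations are 1/sqrt(1.09), roughly 0.96, with the truth and 1/sqrt(1.09 × 1.25), roughly 0.86, with the noisy gold standard. The observed correlation therefore understates how well the method tracks true expression, and the size of the gap depends on the gold standard's own error variance, which plain correlation cannot separate out.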

Degree

Ph.D.

Advisors

Zhu, Purdue University.

Subject Area

Statistics|Bioinformatics
