Date of Award

Fall 2013

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Statistics

First Advisor

Michael Yu Zhu

Committee Chair

Michael Yu Zhu

Committee Member 1

Rebecca Doerge

Committee Member 2

Jun Xie

Committee Member 3

Dabao Zhang

Abstract

RNA-Seq has emerged as a powerful technique for transcriptome study. Along with improved sensitivity and coverage, RNA-Seq also brings challenges for data analysis. The massive volume of sequence read data, excessive variability, uncertainty, and bias and noise stemming from multiple sources all make the analysis of RNA-Seq data difficult. Despite much progress, RNA-Seq data analysis still has much room for improvement, especially in the quantification of gene and transcript expression levels. Quantifying gene expression levels is a direct inference problem, whereas quantifying transcript expression levels is an indirect one, because the label of the transcript from which each short read is generated is missing. A number of methods have been proposed in the literature to quantify the expression levels of genes and transcripts. Although effective in many cases, these methods can become ineffective in others, and may even suffer from the non-identifiability problem. A key drawback of these existing methods is that they fail to use all the information in the RNA-Seq short read count data. In this thesis, we propose three model frameworks to address three important questions in RNA-Seq study. First, we propose to use finite Poisson mixture models (PMI) to characterize base pair-level RNA-Seq data and to quantify gene expression levels. Finite Poisson mixture models combine the strength of fully parametric models with the flexibility of fully nonparametric models, and are well suited to modeling heterogeneous count data such as those observed in RNA-Seq experiments. A unified quantification method based on the Poisson mixture models is developed to measure gene expression levels. Second, building on the Poisson mixture model framework, we propose the convolution of Poisson mixture models (CPM-Seq) to quantify the expression levels of transcripts. Maximum likelihood estimation via the EM algorithm is used to estimate model parameters and quantify transcript expression levels. Third, a penalized convolution Poisson mixture model (penCPM-Seq) is proposed to shrink transcripts with small expression levels to zero and to select transcripts that have high expression levels from the candidate set. Both simulation studies and real data applications demonstrate the effectiveness of PMI, CPM-Seq, and penCPM-Seq: they produce more accurate and consistent quantification results than existing methods. Thus, we believe that finite Poisson mixture models provide a flexible framework for modeling RNA-Seq data, and that methods developed from this thesis have the potential to become powerful tools for RNA-Seq data analysis.
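As a rough illustration of the general idea of fitting a finite Poisson mixture to count data by maximum likelihood with the EM algorithm, the following is a minimal sketch in Python. It is a textbook EM loop, not the PMI, CPM-Seq, or penCPM-Seq procedures of the thesis; the function names (`poisson_logpmf`, `fit_poisson_mixture`, `sample_poisson`), the initialisation, and the stopping rule are all assumptions made for this example.

```python
import math
import random


def poisson_logpmf(x, lam):
    """Log of the Poisson probability mass function at count x with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def fit_poisson_mixture(counts, k=2, n_iter=500, tol=1e-8):
    """Fit a k-component Poisson mixture to counts with a plain EM loop.

    Returns (weights, rates, log_likelihood), where log_likelihood is the
    value computed in the last E-step. Hypothetical helper for illustration.
    """
    n = len(counts)
    mean = sum(counts) / n
    # Assumed initialisation: equal weights, rates spread around the sample mean.
    weights = [1.0 / k] * k
    rates = [max(mean * (2 * j + 1) / k, 1e-3) for j in range(k)]
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: accumulate responsibilities using log-sum-exp for stability.
        resp_sum = [0.0] * k
        weighted_sum = [0.0] * k
        ll = 0.0
        for x in counts:
            logs = [math.log(weights[j]) + poisson_logpmf(x, rates[j])
                    for j in range(k)]
            m = max(logs)
            probs = [math.exp(v - m) for v in logs]
            total = sum(probs)
            ll += m + math.log(total)
            for j in range(k):
                r = probs[j] / total
                resp_sum[j] += r
                weighted_sum[j] += r * x
        # M-step: closed-form updates for mixing weights and Poisson rates.
        for j in range(k):
            nj = max(resp_sum[j], 1e-12)  # guard against an empty component
            weights[j] = nj / n
            rates[j] = max(weighted_sum[j] / nj, 1e-12)
        # Stop once the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, rates, ll


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic counts from two assumed rates; not real RNA-Seq data.
    data = ([sample_poisson(rng, 2.0) for _ in range(1200)]
            + [sample_poisson(rng, 20.0) for _ in range(800)])
    weights, rates, ll = fit_poisson_mixture(data, k=2)
    print("weights:", weights)
    print("rates:", rates)
    print("log-likelihood:", ll)
```

The sketch treats each count as an independent draw from the mixture, which is the simplest possible setting; the convolution and penalized models described in the abstract involve additional structure that this example does not attempt to reproduce.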
