Inference Using Multi-Level Genomic Features Sets and Models in RNA-Seq Experiments

Jeremiah Rounds, Purdue University

Abstract

The "Next Generation Sequencing" (NGS) scientific revolution continues to enable deeper exploration of genome biology. While NGS technologies are capable of sequencing both DNA and RNA, their utility for exploring changes in gene transcription via RNA-sequencing (RNA-seq) is two-fold: detecting differential gene expression between treatments or conditions, and detecting alternative RNA transcripts (isoforms) based on exons. Each of these phenomena can be of biological relevance if related to treatment or condition. Historically, sets of genomic features and exon features have been assessed independently to identify responses of interest. The motivation for this work is to knit these concepts together to form sets of related hypotheses that encompass a broader range of gene response to treatment than either does individually. Included as part of the methodological evolution of how hypotheses are tested and biological phenomena described are two fundamentally associated, though seemingly unrelated, issues: an investigation of the interplay between sample size, read depth, and statistical power in RNA-seq experimental design, and an improved approach, with software, for simulating genomic read counts from transcript data. A multi-level inference procedure is presented for incorporating exon usage into null hypotheses for testing differential expression. This procedure is novel, inventive, and statistically powerful compared to current standard genome methodology, such as a gene exact test, and, unlike transcriptome-based methods, does not require discovery of complete transcripts; it requires only that exon counting bins be described in order to incorporate differential usage of exons into experimental response.
This work also introduces Sub-Exon Principal Components Analysis MANOVA (SEPCAM), a novel method for describing exon expression via principal components (PCs) for the purpose of determining statistically significant response to treatment in experiments. SEPCAM summarizes the biology of exon usage in the transcriptome, uses that biology to justify principal components analysis on genome exon data, and then builds projected sub-exon data into statistical evidence of response to treatment. In the broader biological picture, this work demonstrates that most sub-exon expression lies along the first PC, and that the second PC assists in incorporating complex biological changes of transcript RNA expression into gene response.
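
As a rough illustration of the projection step described above, the following sketch projects one gene's sub-exon read counts onto their leading principal components and reports how much variance each component carries. It is not the dissertation's SEPCAM implementation: the function name `project_sub_exons`, the log2 transform with a pseudocount of 1, and the simulated Poisson counts are all assumptions made for the example, and the MANOVA step on the projected scores is omitted.

```python
import numpy as np


def project_sub_exons(counts, n_components=2):
    """Project samples onto the leading PCs of log-scale sub-exon counts.

    counts: (n_samples, n_sub_exons) array of read counts for one gene.
    Returns (scores, explained): scores has shape (n_samples, k) and
    explained holds the fraction of total variance on each kept PC.
    """
    # Hypothetical preprocessing: log2 with a pseudocount of 1 (an assumption).
    x = np.log2(np.asarray(counts, dtype=float) + 1.0)
    # Center each sub-exon bin across samples before PCA.
    x = x - x.mean(axis=0)
    # PCA via singular value decomposition of the centered matrix.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = min(n_components, len(s))
    scores = u[:, :k] * s[:k]
    total = np.sum(s ** 2)
    explained = (s[:k] ** 2) / total if total > 0 else np.zeros(k)
    return scores, explained


# Toy data (simulated, not from the dissertation): six samples, five
# sub-exon bins, with overall gene expression varying between samples.
rng = np.random.default_rng(0)
base = np.array([200, 150, 400, 120, 80])          # relative bin abundance
level = np.array([0.5, 1.0, 2.0, 0.7, 1.5, 3.0])   # per-sample expression
counts = rng.poisson(np.outer(level, base))

scores, explained = project_sub_exons(counts)
print(scores.shape)
print(explained)
```

In this toy setting the samples differ mainly in overall expression level, so the first component would be expected to carry most of the variance; changes in relative exon usage between samples would appear on later components.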

Degree

Ph.D.

Advisors

Doerge, Purdue University.

Subject Area

Statistics
