Estimation of variation for high-throughput molecular biological experiments with small sample size

Danni Yu, Purdue University

Abstract

Motivation: In the quantification of molecular components, large variation can affect and even mislead biological conclusions. High-throughput experiments often involve only a small number of samples because of cost and time constraints. In such cases, stochastic variation may dominate the outcome of an experiment, because there may not be enough samples to represent the true biological signal. It is therefore challenging to distinguish changes in phenotype from stochastic variation.

Methods: Because biological molecules are quantified with different technologies, different statistical methods are required. Focusing on three important types of high-throughput experiments, this thesis proposes novel solutions to reduce noise and increase the accuracy of molecular discovery.

i) In large-scale perturbation screens, thousands of mutant strains on hundreds of plates are profiled separately over hundreds of days (or batches). For each mutant strain, only a small number of samples are profiled. The artificial noise consists mainly of additive and multiplicative effects due to plates and batches. We propose a linear mixed-effects modeling framework based on experimental designs with at least two control samples. The controls are used in a normalization and variance estimation procedure that reduces the noise in the data and scores the true biological phenotype.

ii) In RNA-seq experiments, fragments from thousands of genes in 4 to 8 samples on a flow cell can be sequenced in one day. The additive and non-additive effects caused by a large number of plates are typically not present in these data. The gene-wise variance between samples depends on both the expectation and the dispersion of the gene counts. Because of stochastic noise, some gene-wise dispersions are under- or overestimated, which may lead to misinterpretation of the biological phenotype. We propose a shrinkage estimator of dispersion under Negative Binomial models that regularizes the gene-wise estimates towards a value calculated from information shared across genes.

iii) In MS/MS experiments with SWATH acquisition, more than 10,000 spectra can be acquired sequentially in a run of about 120 minutes. The summed intensity of all signals within a narrow m/z bin is used to identify the fragments of each peptide. As a result, interference noise within the m/z bins can go undetected and introduce misleading ambiguity into protein quantification. The solutions proposed above for perturbation screens and RNA-seq experiments cannot be used for SWATH acquisition because the data have different properties. To remedy this, a new approach is proposed to quantify the homogeneity (the opposite of interference) among the co-elution traces of molecules within the m/z bins. Since correct signals of a fragment share a homogeneous peak shape, we propose to use the p-value of a one-sided test on the second-order coefficient of a quadratic linear regression model. This coefficient captures the curvature of the trace, and the p-value represents the strength of the concave pattern across the peaks of a fragment.

Results: Evaluations on different experiments with each of the three technologies illustrate that the proposed solutions outperform several existing methods.
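The concavity test described for SWATH traces can be illustrated with a small sketch. It is not the thesis implementation. It assumes a model of the form intensity = b0 + b1·t + b2·t² + error fitted by ordinary least squares, and a one-sided test of H0: b2 ≥ 0 against H1: b2 < 0, so that a small p-value indicates a concave, peak-like trace. The function names (`invert_3x3`, `t_cdf`, `concavity_p_value`) and the example trace are hypothetical, and the Student's t CDF is obtained by numerical integration to keep the example free of external dependencies.

```python
import math

def invert_3x3(m):
    """Invert a 3x3 matrix by Gauss-Jordan elimination with partial pivoting."""
    n = 3
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:  # crude singularity check for a sketch
            raise ValueError("design matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[n:] for row in a]

def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t with df degrees of freedom.

    Uses the substitution u = sqrt(df) * tan(theta), which turns the density
    integral into an integral of cos(theta)**(df - 1) over a bounded interval,
    evaluated here with Simpson's rule.
    """
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    theta_max = math.atan(abs(t) / math.sqrt(df))
    h = theta_max / steps
    f = lambda th: math.cos(th) ** (df - 1)
    s = f(0.0) + f(theta_max)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(i * h)
    area = min(0.5, c * s * h / 3)  # probability mass between 0 and |t|
    return 0.5 + area if t > 0 else 0.5 - area

def concavity_p_value(times, intensities):
    """Return (b2, p) for the one-sided test H0: b2 >= 0 vs H1: b2 < 0."""
    n = len(times)
    if n < 4:
        raise ValueError("need at least 4 points for a quadratic fit with error df")
    mean_t = sum(times) / n
    x = [t - mean_t for t in times]  # centering leaves b2 unchanged
    rows = [(1.0, xi, xi * xi) for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, intensities)) for i in range(3)]
    inv = invert_3x3(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(3)) for i in range(3)]
    rss = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, y in zip(rows, intensities))
    df = n - 3
    se = math.sqrt(rss / df * inv[2][2])
    if se == 0.0:  # perfect fit: the sign of b2 decides the test
        return beta[2], (0.0 if beta[2] < 0 else 1.0)
    return beta[2], t_cdf(beta[2] / se, df)

# Hypothetical elution trace: a concave peak with small deterministic noise.
times = list(range(9))
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1]
peak = [100 - 5 * (t - 4) ** 2 + e for t, e in zip(times, noise)]
b2, p = concavity_p_value(times, peak)
print(f"b2 = {b2:.3f}, one-sided p = {p:.3g}")
```

Under these assumptions, traces that share a clean peak shape yield small p-values, while flat or convex traces, as might arise from interference, yield p-values near or above 0.5.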

Degree

Ph.D.

Advisors

Vitek, Purdue University.

Subject Area

Biostatistics; Statistics; Bioinformatics
