Parallel Construction of Large-Scale Gene Regulatory Networks
Constructing whole-genome gene regulatory networks from genetical genomics data is challenging because of the high dimensionality of the data, limited computer memory, and intensive computation. In this dissertation, we propose a two-stage penalized least squares method to study regulatory interactions among a large number of genes, building up large systems of structural equations based on the instrumental variables view of the classical two-stage least squares method. Fitting a single regression model for each endogenous variable at each stage, the method employs ridge regression at the first stage to obtain consistent estimates of a set of conditional expectations, and the adaptive lasso at the second stage to consistently identify regulatory effects among a huge number of candidates. The resultant estimates of the regulatory effects enjoy the oracle properties. This method is computationally efficient and permits parallel implementation. We demonstrate its effectiveness via both simulation studies and real data analysis.

When whole-genome gene regulatory networks are under consideration, the number of endogenous variables can increase to tens of thousands, and variable selection via the adaptive lasso at the second stage of the two-stage penalized least squares method may not work well in practice. To overcome this limitation, we incorporate iterative Sure Independence Screening into the second stage and combine it with the adaptive lasso to improve the accuracy of variable selection when the dimension of the model is ultra-high. Simulation studies show that the proposed method can accurately identify regulatory effects in large-scale networks across different simulation settings.

We also extend the proposed method to next-generation sequencing data, which include not only common variants but also rare variants that may be associated with gene expression. We build on recent advances in rare-variant association testing to identify rare variants associated with gene expression, and incorporate the results of rare-variant association tests into the construction of gene regulatory networks. By including rare variants from next-generation sequencing data, the accuracy of the first-stage estimation can be improved, which promotes the identification of gene regulatory effects at the second stage. We applied the proposed method to a human whole-genome sequencing data set and obtained some interesting results.
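The two-stage idea described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names, the coordinate-descent solver, the fixed tuning parameters (`ridge_lam`, `lasso_lam`, `gamma`), and the step that projects out each gene's own instruments before the second stage are all assumptions made for illustration. It assumes an expression matrix `Y` (samples by genes), a genotype matrix `X` (samples by variants), and a hypothetical list `instruments` giving, for each gene, the column indices of the variants assumed to act on it directly.

```python
import numpy as np

def ridge_fit_predict(X, Y, lam):
    """Stage 1: ridge regression of every column of Y on X; returns fitted values."""
    q = X.shape[1]
    coef = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ Y)
    return X @ coef

def weighted_lasso_cd(Z, y, lam, weights, n_iter=200):
    """Coordinate descent for (1/2n)||y - Zb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    r = y.copy()                      # current residual y - Z @ b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += Z[:, j] * b[j]       # remove coordinate j from the fit
            rho = Z[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / col_sq[j]
            r -= Z[:, j] * b[j]       # add the updated coordinate back
    return b

def two_stage_pls(Y, X, instruments, ridge_lam=1.0, lasso_lam=0.05, gamma=1.0):
    """Return B_hat where B_hat[j, k] is the estimated effect of gene j on gene k."""
    n, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Y_hat = ridge_fit_predict(Xc, Yc, ridge_lam)   # one ridge fit per gene
    B_hat = np.zeros((p, p))
    for k in range(p):                # each gene is an independent regression
        Xk = Xc[:, instruments[k]]

        def residualize(M):
            # Project out gene k's own instruments (an assumed simplification).
            return M - Xk @ np.linalg.lstsq(Xk, M, rcond=None)[0]

        others = [j for j in range(p) if j != k]
        Z = residualize(Y_hat[:, others])
        y = residualize(Yc[:, k])
        # Initial ridge estimate supplies the adaptive lasso weights.
        init = np.linalg.solve(Z.T @ Z + ridge_lam * np.eye(p - 1), Z.T @ y)
        weights = 1.0 / (np.abs(init) ** gamma + 1e-8)
        B_hat[others, k] = weighted_lasso_cd(Z, y, lasso_lam, weights)
    return B_hat
```

Because each pass through the loop over `k` uses only its own response and the shared first-stage predictions, the per-gene fits could in principle be distributed across workers, which is the sense in which the method permits parallel implementation. In practice the tuning parameters would be chosen by a data-driven criterion rather than fixed as they are here.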
Zhang, Purdue University.