Transcriptome Analysis Using de Novo Assembly in RNA-Seq

Sufang Wang, Purdue University

Abstract

Liver cancer is the fifth most common cancer and the third leading cause of cancer-related death worldwide. Hepatocellular carcinoma (HCC) is the major liver tumor type seen in adults and is often caused by chronic liver disease, such as hepatitis B or C infection. Liver transplantation is one of the most common treatments for patients. However, tumor recurrence after surgery is a major concern. Therefore, identifying early biomarkers that could predict recurrence is important. Transcriptome analysis, using RNA-Seq to identify differentially expressed genes (DEG), is providing insights into the etiology of cancer and corresponding diagnostics. Identifying DEG requires a reference transcriptome. When a reference genome is not available, a de novo transcriptome assembly can be used. Although a human reference genome is available, mutation and chromosomal rearrangement in tumors may have altered gene and transcript structure, so de novo transcriptome assembly is also important and beneficial in cancer studies. However, since there is limited systematic research evaluating the quality of de novo transcriptome assemblies, we first evaluated de novo assembly programs to investigate the importance of de novo assembly in transcriptome analysis. We used two authentic RNA-Seq datasets from Arabidopsis thaliana and produced transcriptome assemblies using eight programs (BinPacker, Bridger, IDBA-tran, Oases-Velvet, SOAPdenovo-Trans, SSP, Trans-ABySS and Trinity) with a series of k-mer sizes (from 25 to 71). We measured assembly quality in terms of reference genome base and gene coverage, transcriptome assembly base coverage, number of chimeras and number of recovered full-length transcripts. SOAPdenovo-Trans performed best in base coverage, while Trans-ABySS performed best in gene coverage and number of recovered full-length transcripts.
In terms of chimeric sequences, BinPacker and Oases-Velvet performed worst, while IDBA-tran, SOAPdenovo-Trans, Trans-ABySS and Trinity produced fewer chimeras across all single k-mer assemblies. In differential gene expression analysis, about 70% of the significant DEG were the same using the reference genome and the de novo assemblies. We further identified four reasons for the differences in significant DEG between the reference genome and the de novo transcriptome assemblies: incomplete annotation, exon-level differences, transcript fragmentation and incorrect gene annotation. These differences suggest that de novo assembly is beneficial even when a reference genome is available. From this analysis, we conclude that de novo assembly is important. We then used both the human reference genome and de novo assembly approaches to study gene expression in HCC, aiming to identify biomarkers that could predict tumor recurrence in HCC. We analyzed previously published RNA-Seq data that include 21 HCC samples, of which 9 were tumors that recurred after orthotopic liver transplantation and 12 were non-recurrent tumors, with their paired normal samples. We assembled a transcriptome using the Trinity program, aligned all reads to both the reference genome and the de novo transcriptome assembly, and identified DEG between tumor and normal samples. In the recurrent tumor analysis, 494 DEG were identified using the Reference approach and 573 DEG using the de novo assembly approach; in the non-recurrent tumor analysis, 548 DEG were identified using the Reference approach and 626 DEG using the de novo assembly approach. Half of the DEG were the same using the Reference approach and the de novo assembly approach. We further identified a group of 103 DEG that may be used as new biomarkers to predict recurrence in HCC. However, more analyses are needed to confirm and validate these genes.
In addition, cancer cell lines are often used in cancer research. In particular, the NCI-60 human tumor cell line panel is an invaluable resource for cancer researchers, providing drug sensitivity, molecular, and phenotypic data for a range of cancer types. CellMiner is a web resource that provides tools for the acquisition and analysis of quality-controlled NCI-60 data. CellMiner supports queries of up to 150 drugs or genes, but the output is an Excel file for each drug or gene, making it difficult for researchers who lack computer programming skills to interactively explore the data. Therefore, we developed a Shiny application that facilitates the exploration and visualization of output from CellMiner, further increasing the accessibility of the NCI-60 data. Overall, my research applied RNA-Seq analysis to cancer research across a wide range of studies. This research sheds light not only on the mechanisms of cancer but also on how to use various tools to analyze data.
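The comparison of DEG lists between the reference-based and de novo approaches described in the abstract can be illustrated with a small Python sketch. It is not the dissertation's code. It assumes that de novo transcripts have already been mapped to gene identifiers so that the two lists are comparable, and the function name `compare_deg_sets` and the gene identifiers are hypothetical. Because the abstract does not state which denominator defines the share of DEG that are "the same", the sketch reports the shared fraction relative to each list as well as the Jaccard index.

```python
def compare_deg_sets(reference_deg, de_novo_deg):
    """Summarize the overlap between two lists of DEG identifiers.

    Assumption: both inputs use the same gene identifier namespace
    (de novo transcripts already mapped to genes).
    """
    ref = set(reference_deg)
    novo = set(de_novo_deg)
    shared = ref & novo
    union = ref | novo
    return {
        "shared": shared,
        "reference_only": ref - novo,
        "de_novo_only": novo - ref,
        # Fraction of each list that is also found by the other approach
        "fraction_of_reference": len(shared) / len(ref) if ref else 0.0,
        "fraction_of_de_novo": len(shared) / len(novo) if novo else 0.0,
        # Shared genes relative to all genes called by either approach
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy identifiers for illustration only; these are not results from the study.
reference_deg = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
de_novo_deg = ["GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"]

summary = compare_deg_sets(reference_deg, de_novo_deg)
print(sorted(summary["shared"]))
print(summary["fraction_of_reference"], summary["fraction_of_de_novo"], summary["jaccard"])
```

Genes that appear in only one list (`reference_only`, `de_novo_only`) are the natural starting point for checking the kinds of causes named in the abstract, such as incomplete annotation or transcript fragmentation.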

Degree

Ph.D.

Advisors

Gribskov, Purdue University.

Subject Area

Molecular biology|Bioinformatics

