Techniques for Scaling Computational Genomics Applications

Kanak V Mahadik, Purdue University

Abstract

A revolution in personalized genomics will occur when scientists can sequence the genomes of millions of people cost-effectively, conclusively understand how genes influence diseases, and develop better drugs and treatments. Illumina's announcement of sequencing a human genome for $1000 is a stellar attempt to solve the first part of the puzzle. However, providing genetic treatments for diseases such as breast cancer, cystic fibrosis, Huntington’s disease, and others requires us to develop tools that can quickly analyze biological sequences and understand their structural and functional properties. Currently, such tools are designed in an ad hoc manner and require extensive programmer effort to develop and optimize. Existing tools also scale poorly with the exponentially increasing volume of genomic data generated by continuously improving sequencing technologies. In this dissertation, we take a holistic approach to enhancing the performance and scalability of genomic applications that handle large volumes of data. This approach comprises techniques at three levels: algorithm, compiler, and data structure. At the algorithm level, we identify opportunities for exploiting parallelism and efficient methods of data distribution. Our technique Orion exploits fine-grained parallelism to scale to long genomic sequences, and it achieves superior performance and better load balance than state-of-the-art distributed genomic sequence matching tools. ScalaDBG transforms the sequential and computationally intensive process of iterative de Bruijn graph construction into a parallel one. At the compiler level, we develop a domain-specific language called SARVAVID. SARVAVID provides commonly occurring modules in genomics applications as high-level language constructs and performs domain-specific optimizations well beyond the scope of libraries and generic compilers. At the data structure level, we identify opportunities to exploit cache locality and software prefetching to enhance the performance of indexing structures in genomic applications. We apply our approach to the major classes of genomic applications and demonstrate the benefits with relevant genomic datasets.
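
For readers unfamiliar with the de Bruijn graph mentioned above, the following is a minimal illustrative sketch in Python, not the ScalaDBG implementation. It builds a de Bruijn graph sequentially for a single value of k; an iterative construction would repeat a step like this for several values of k, which is presumably the kind of work that a parallel approach distributes. The function name `build_de_bruijn` and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph for one k (illustrative, sequential).

    Nodes are (k-1)-mers. Each distinct k-mer found in the reads adds
    one edge from its (k-1)-length prefix to its (k-1)-length suffix.
    """
    # Collect every distinct k-mer that occurs in any read.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Connect prefix -> suffix for each distinct k-mer.
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])

    # Sort successors so the result is deterministic.
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical toy reads, chosen so that two reads overlap.
graph = build_de_bruijn(["ACGTAC", "GTACGA"], k=4)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

Real assemblers also track k-mer multiplicities, handle reverse complements, and prune sequencing errors; none of that is modeled here.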

Degree

Ph.D.

Advisors

Bagchi, Purdue University.

Subject Area

Computer Engineering; Bioinformatics
