Design and implementation of a performance-scalable cloud-enabled sequence alignment search tool
Abstract
Commercial cloud providers are emerging as a cheap source of temporary computational resources that require no significant capital investment. This thesis examines the possibility of exploiting cloud computing to run existing bioinformatics algorithms and applications while improving flexibility and cost efficiency. We developed a parallelized sequence alignment search tool to be deployed on cloud infrastructure. Our design achieves performance scalability and reliability by building on scalable MapReduce middleware. By leveraging the elastic scaling capabilities of the external cloud provider, our implementation can scale resource allocation according to demand. Our implementation has two innovations. First, it is the first MapReduce-based BLAST implementation that also uses database fragmentation, a technique that is fundamental to exploiting parallelism under low loads. Second, we use a locality enhancement technique that minimizes file transfers and improves performance. Results show that our implementation outperforms comparable parallel sequence alignment algorithms (MPI-based implementations) for up to 32 nodes and remains competitive up to 128 nodes, with the drop in performance attributed to higher relative program overhead for shorter tasks. Our algorithm also exhibits superior non-performance attributes due to the MapReduce framework it is built on, such as the ability to recover from hardware crashes during operation and the ability to dynamically grow existing infrastructure with external third-party resources during runtime, allowing for operation-time performance scaling. Hybrid configuration tests successfully show the ability to create heterogeneous clouds out of multiple infrastructure sources with little loss in program performance. Our implementation sets the stage for further exploration of hybrid configurations for cost savings.
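The combination of database fragmentation with MapReduce described in the abstract can be illustrated with a minimal, single-process sketch. It is not the thesis implementation: it runs no BLAST and no MapReduce framework, the scoring is a toy shared k-mer count standing in for alignment, and every name (`fragment_database`, `map_task`, `reduce_task`, `search`) is hypothetical. It shows only the general pattern in which each (query, fragment) pair becomes an independent map task, so that even a single query can be spread over several workers, and a reduce step merges the per-fragment hits for each query.

```python
from collections import defaultdict
from itertools import product


def fragment_database(records, n_fragments):
    """Split (seq_id, sequence) records into n_fragments round-robin shards."""
    fragments = [[] for _ in range(n_fragments)]
    for i, record in enumerate(records):
        fragments[i % n_fragments].append(record)
    return fragments


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_task(query, fragment, k=3):
    """Map step: score one query against one fragment, emit (query_id, hit)."""
    query_id, query_seq = query
    query_kmers = kmers(query_seq, k)
    for seq_id, seq in fragment:
        # Toy score (shared k-mer count); an assumption standing in for BLAST.
        score = len(query_kmers & kmers(seq, k))
        if score > 0:
            yield query_id, (score, seq_id)


def reduce_task(hits, top_n=2):
    """Reduce step: merge hits from all fragments and keep the best top_n."""
    return sorted(hits, key=lambda h: (-h[0], h[1]))[:top_n]


def search(queries, records, n_fragments=2, top_n=2):
    """Run every (query, fragment) pair as a map task, then reduce per query."""
    fragments = fragment_database(records, n_fragments)
    grouped = defaultdict(list)
    for query, fragment in product(queries, fragments):
        for query_id, hit in map_task(query, fragment):
            grouped[query_id].append(hit)
    return {qid: reduce_task(hits, top_n) for qid, hits in grouped.items()}


if __name__ == "__main__":
    records = [("s1", "ACGTACGT"), ("s2", "TTTTTTTT"),
               ("s3", "ACGTTTTT"), ("s4", "GGGGCCCC")]
    print(search([("q1", "ACGTAC")], records, n_fragments=2))
```

In this sketch the merged result does not depend on the number of fragments; fragmentation changes only how many independent map tasks a single query produces, which is what lets a lightly loaded cluster still be used in parallel.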
Degree
M.S.E.C.E.
Advisors
Thottethodi, Purdue University.
Subject Area
Computer Engineering|Bioinformatics