Date of Award
5-2016
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Technology
First Advisor
John Springer
Committee Chair
John Springer
Committee Member 1
Kari L. Clase
Committee Member 2
Michael A. Kane
Committee Member 3
Dawn D. Laux
Committee Member 4
Eric Matson
Abstract
The project described in this dissertation proposal attempted to improve the efficiency, scalability, usability, and user experience of three bioinformatics applications (DNA/peptide sequence similarity comparison, digital DNA library subtraction, and DNA/peptide sequence de-duplication) by 1) adopting the Hadoop MapReduce algorithms and distributed file system and 2) integrating the fully automated Hadoop programs into a user-friendly graphical user interface (GUI). In addition, the researcher investigated the advantages and limitations of applying the Hadoop software framework as a general methodology for parallelizing bioinformatics programs.
After considering the original calculation algorithms in the serial versions of the programs, the available computational resources, the nature of the MapReduce framework, and the optimization of performance, a processing pipeline with one pre-processing step, three mappers, two reducers, and one post-processing step was developed. A GUI that enabled users to specify input/output files and program parameters was then created. Also built into the GUI were user-friendly features such as organized instructions, detailed log files, and multi-user accessibility.
The new and fully automated Hadoop bioinformatics toolkit showed execution efficiency comparable to that of its MPI counterparts on medium- to large-scale data, and better efficiency than MPI when an ultra-large dataset was provided. In addition, good scalability was observed with test datasets of up to 20 Gb.
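For readers unfamiliar with the MapReduce model, the sequence de-duplication task mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline developed in the dissertation: it uses a single map step and a single reduce step rather than the three mappers and two reducers described, it runs in one process rather than on Hadoop, and it assumes that "duplicate" means an exact match after trimming whitespace and upper-casing. The names `map_record`, `shuffle`, `reduce_group`, and `deduplicate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (sequence ID, sequence); layout assumed for illustration


def map_record(record: Record) -> Tuple[str, str]:
    """Map step: emit the normalized sequence as the key and its ID as the value."""
    seq_id, sequence = record
    # Assumption: duplicates are exact matches after trimming and upper-casing.
    return sequence.strip().upper(), seq_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, standing in for the framework's shuffle/sort phase."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(sequence: str, seq_ids: List[str]) -> Record:
    """Reduce step: keep one representative ID (the smallest) per distinct sequence."""
    return min(seq_ids), sequence


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Run map, shuffle, and reduce in-process and return the unique records, sorted."""
    groups = shuffle(map_record(r) for r in records)
    return sorted(reduce_group(seq, ids) for seq, ids in groups.items())


if __name__ == "__main__":
    reads = [("read3", "acgt"), ("read1", "ACGT"), ("read2", "TTGA")]
    for seq_id, sequence in deduplicate(reads):
        print(seq_id, sequence)
```

Because identical sequences share a key, the grouping step routes them to the same reduce call. In a distributed setting, this same property is what allows the work to be spread across many nodes.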
Recommended Citation
Wang, Dan, "The application of the Hadoop software framework in Bioinformatics programs" (2016). Open Access Dissertations. 725.
https://docs.lib.purdue.edu/open_access_dissertations/725