The application of the Hadoop software framework in Bioinformatics programs

Dan Wang, Purdue University


The project described in this dissertation proposal attempted to improve the efficiency, scalability, usability, and user experience of three Bioinformatics applications - DNA/peptide sequence similarity comparison, digital DNA library subtraction, and DNA/peptide sequence de-duplication - by 1) adopting the Hadoop MapReduce algorithms and distributed file system and 2) integrating the fully automated Hadoop programs into a user-friendly graphical user interface (GUI). The researcher was also interested in investigating the advantages and limitations of applying the Hadoop software framework as a general methodology for parallelizing Bioinformatics programs. After considering the original calculation algorithms in the serial versions of the programs, the available computational resources, the nature of the MapReduce framework, and the optimization of performance, a processing pipeline with one pre-processing step, three mappers, two reducers, and one post-processing step was developed. A GUI that enabled users to specify input/output files and program parameters was then created. Also built into the GUI were user-friendly features such as organized instructions, detailed log files, and multi-user accessibility. The new, fully automated Hadoop Bioinformatics toolkit showed execution efficiency comparable to that of its MPI counterparts on medium- to large-scale data, and better efficiency than MPI when an ultra-large dataset was provided. In addition, good scalability was observed with test datasets of up to 20 Gb.
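As a rough illustration of how one of the three tasks, sequence de-duplication, fits the map/reduce pattern, the sketch below simulates a single map, shuffle, and reduce pass in plain Python. It is not the dissertation's pipeline (which used three mappers and two reducers on Hadoop); the function names, the case-insensitive matching rule, and the choice to keep the smallest sequence ID are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence ID, sequence); hypothetical record layout


def map_record(record: Record) -> Iterator[Tuple[str, str]]:
    """Map step: emit the normalized sequence as the key, its ID as the value."""
    seq_id, seq = record
    # Assumption: sequences that differ only in letter case count as duplicates.
    yield seq.upper(), seq_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, standing in for the framework's shuffle phase."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(seq: str, ids: List[str]) -> Record:
    """Reduce step: keep one representative ID per distinct sequence."""
    # Assumption: the lexicographically smallest ID is kept.
    return min(ids), seq


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Run map, shuffle, and reduce in-process and return the unique records."""
    pairs = (pair for record in records for pair in map_record(record))
    groups = shuffle(pairs)
    return sorted(reduce_group(seq, ids) for seq, ids in groups.items())


if __name__ == "__main__":
    sample = [("s1", "ACGT"), ("s2", "acgt"), ("s3", "TTGA")]
    for seq_id, seq in deduplicate(sample):
        print(seq_id, seq)
```

Keying on the sequence itself means all copies of a sequence reach the same reducer, so each reducer can decide independently which copy to keep; this is the property that lets the task spread across many nodes.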




Springer, Purdue University.

Subject Area

Information Technology|Bioinformatics
