Abstract

The project described in this dissertation proposal sought to improve the efficiency, scalability, usability, and user experience of three Bioinformatics applications (DNA/peptide sequence similarity comparison, digital DNA library subtraction, and DNA/peptide sequence de-duplication) by 1) adopting Hadoop MapReduce algorithms and the Hadoop distributed file system and 2) integrating the fully automated Hadoop programs into a user-friendly graphical user interface (GUI). The researcher was also interested in investigating the advantages and limitations of applying the Hadoop software framework as a general methodology for parallelizing Bioinformatics programs.

Based on the original calculation algorithms in the serial versions of the programs, the available computational resources, the nature of the MapReduce framework, and performance optimization, a processing pipeline was developed with one pre-processing step, three mappers, two reducers, and one post-processing step. A GUI was then created that enables users to specify input/output files and program parameters. User-friendly features such as organized instructions, detailed log files, and multi-user accessibility were also built into the GUI.

The new, fully automated Hadoop Bioinformatics toolkit showed execution efficiency comparable to that of its MPI counterparts on medium- to large-scale data, and better efficiency than MPI when an ultra-large dataset was provided. In addition, good scalability was observed with test datasets of up to 20 Gb.

Keywords

Biological sciences, Applied sciences, Bioinformatics, Graphical user interface, Hadoop, Parallelization

Disciplines

Bioinformatics

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Technology

First Advisor

John Springer

Committee Chair

John Springer

Committee Member 1

Kari L. Clase

Committee Member 2

Michael A. Kane

Committee Member 3

Dawn D. Laux

Committee Member 4

Eric Matson

Date of Award

5-2016

Recommended Citation

Wang, Dan, "The application of the Hadoop software framework in Bioinformatics programs" (2016). Open Access Dissertations. 725.
https://docs.lib.purdue.edu/open_access_dissertations/725

Included in

Bioinformatics Commons