Knowledge discovery and hypothesis generation from biomedical literature using text mining

Harsha Gopal Goud Vaka, Purdue University


Automated extraction of knowledge from large document collections is a broad research area. Text mining is a promising automated approach for extracting knowledge from unstructured data such as text. The objective of this thesis is to mine documents pertaining to Ayurveda, retrieved from PubMed, and to find novel transitive associations among biological objects. The thesis discusses the extraction of biological objects from the collected documents (the databank) using an Automated Vocabulary Discovery (AVD) algorithm. An effective co-occurrence-based text mining algorithm for hypothesis generation was designed by combining the AVD algorithm with the tf-idf (term frequency-inverse document frequency) algorithm. This algorithm extracts novel binary associations and hypergraph-based ternary associations (object1 – object2 – object3) among various objects (genes, chemicals, drugs, etc.) using transitive text mining. The research established relationships between objects from modern medicine and from the traditional Indian medicine Ayurveda. The generated hypotheses (novel associations) were assigned a co-occurrence-based significance score, and a few highly significant novel associations were validated. Finally, the knowledge thus obtained (the ternary associations) was compared and analyzed against the binary associations (object1 – object2), which form the superset of the ternary associations.
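The idea of a transitive, co-occurrence-based association can be illustrated with a minimal sketch. It is not the thesis's algorithm: it omits the AVD and tf-idf steps, assumes each document has already been reduced to a list of biological objects, and uses a weakest-link score chosen purely for illustration. The function names and the toy object names (herb_X, gene_Y, disease_Z) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Count, for each unordered object pair, the documents containing both."""
    counts = defaultdict(int)
    for objects in docs:
        for a, b in combinations(sorted(set(objects)), 2):
            counts[(a, b)] += 1
    return counts


def transitive_hypotheses(docs):
    """Return {(a, b, c): score} where a-b and b-c co-occur but a-c never does.

    The score is the smaller of the two link counts. This is an assumption
    made for the sketch, not the significance score used in the thesis.
    """
    counts = cooccurrence(docs)

    # Build an adjacency map from the observed binary associations.
    neighbors = defaultdict(set)
    for a, b in counts:
        neighbors[a].add(b)
        neighbors[b].add(a)

    hypotheses = {}
    for b, linked in neighbors.items():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) in counts:
                continue  # a and c already co-occur directly, so not novel
            link_ab = counts[tuple(sorted((a, b)))]
            link_bc = counts[tuple(sorted((b, c)))]
            hypotheses[(a, b, c)] = min(link_ab, link_bc)
    return hypotheses


if __name__ == "__main__":
    # Toy databank: each inner list holds the objects found in one document.
    toy_docs = [
        ["herb_X", "gene_Y"],
        ["gene_Y", "disease_Z"],
        ["herb_X", "gene_Y"],
    ]
    for triple, score in transitive_hypotheses(toy_docs).items():
        print(triple, score)
```

In the toy databank, herb_X and disease_Z never appear in the same document but both co-occur with gene_Y, so the sketch proposes the pair as a candidate association with gene_Y as the intermediate object.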




Snehasis Mukhopadhyay, Purdue University.

Subject Area

Computer Science
