A machine learning approach to colorectal cancer screening

Sanmit Vanayak Ambedkar, Purdue University


Colorectal cancer (CRC) is one of the most commonly diagnosed cancers and one of the leading causes of cancer deaths in the United States. CRC is known to develop from adenomatous colon polyps. Colonoscopy is widely used to screen for and effectively prevent CRC, and it has to be repeated periodically to ensure complete prevention. Several patient characteristics are known to contribute to an increased risk of CRC, for example age, gender, race, and family history. The aim of this study is to determine how these factors influence a patient's potential risk, which factors are more influential than others, and how they affect risk levels when considered together. The conclusions can be used to group patients by potential risk of CRC, which can be useful in deciding the screening strategy for a patient. In this study, decision trees, a machine learning algorithm, were used to analyze a database consisting of patient characteristics, screening procedure information, and procedure outcomes. The algorithm maps the outcome of the procedure, which is the target variable, to the patient information. The database consisted of records of over 9400 colonoscopies reflecting more than 5000 patients. Based on the relations obtained, the patients were then grouped. Age, body mass index (BMI), outcome of the previous procedure, and gender were seen to contribute substantially to the outcomes of the procedures and thus, indirectly, to the risk of CRC.
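As a rough illustration of how a decision tree maps a procedure outcome to patient information, the sketch below finds the single best split by Gini impurity, which is the basic step a tree repeats at every node. It is not the study's implementation: the records are synthetic and made up for illustration, and the field names (age, bmi, prior_polyp_found), the binary outcome label, and the choice of Gini impurity as the split criterion are all assumptions.

```python
from collections import Counter

# Synthetic, made-up records for illustration only (NOT the study's data).
# Each record: (age, bmi, prior_polyp_found, outcome), where outcome is a
# hypothetical binary target (1 = polyp found in this procedure).
RECORDS = [
    (45, 22.0, 0, 0),
    (50, 24.5, 0, 0),
    (52, 31.0, 0, 0),
    (58, 27.0, 1, 1),
    (63, 33.5, 1, 1),
    (67, 29.0, 0, 1),
    (70, 26.0, 1, 1),
    (48, 35.0, 1, 0),
]
FEATURES = ["age", "bmi", "prior_polyp_found"]  # hypothetical feature names


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(records):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    for f in range(len(FEATURES)):
        values = sorted({r[f] for r in records})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in records if r[f] <= t]
            right = [r[-1] for r in records if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


feature_index, threshold, score = best_split(RECORDS)
print(FEATURES[feature_index], threshold, round(score, 3))
```

Applying the same split search recursively to each resulting group yields a tree whose leaves are groups of patients with similar outcome rates, which is one way the grouping described above could be derived. The threshold this toy data produces carries no clinical meaning.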




Yih, Purdue University.

Subject Area

Information Technology|Industrial engineering|Oncology
