Addressing Data Imbalance in Breast Cancer Prediction Using Supervised Machine Learning

Shuning Yin, Purdue University

Abstract

Every 12 minutes, 12 women are diagnosed with breast cancer in the US, and 1 dies of it. Globally, a woman dies of breast cancer every 46 seconds, which amounts to more than 1,800 deaths every day. These figures make breast cancer prediction very important. Supervised machine learning (ML) methods are used to predict breast cancer likelihood. However, because real-world data are imbalanced, with a very low proportion of positive cases, the prediction accuracy of ML models for positive cancer cases has been limited. This study addressed the issue in two steps. First, four supervised ML models, Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), were applied in WEKA, the industry-standard software, to the Breast Cancer Surveillance Consortium (BCSC) dataset to assess the impact of data imbalance on breast cancer prediction. Second, balanced (24,558 cases, 12,279 for each class, positive and negative) and unbalanced (99,000 negative cases) training datasets and a non-overlapping testing dataset (11,000 cases) were manually built from the same dataset, and a decision support system was developed for two ML models, NB and LR, to tackle the class imbalance issue in breast cancer prediction. Overall, the results indicate that MLP performed best on positive breast cancer prediction, with 0.959 sensitivity and 0.907 positive predictive value (PPV), and that the balanced dataset yielded better results than the unbalanced dataset for all ML models. Furthermore, the proposed method improved the sensitivity of positive cancer case prediction from 0.687 to 0.936 with the NB model and from 0.358 to 0.8306 with the LR model. The improvement demonstrated that the approach provided higher-confidence ML-based predictions and filtered out weaker ones, and that the technique could efficiently address the class imbalance issue in breast cancer likelihood prediction and be used in clinical practice.
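
For readers unfamiliar with the terms, the following minimal Python sketch illustrates two ideas the abstract relies on: building a class-balanced training set by randomly undersampling the majority class, and computing sensitivity and PPV for the positive class. It is an illustration only and not the thesis's implementation, which used WEKA. The function names, the `cancer` label key, and the toy records are all hypothetical, and random undersampling is an assumed balancing strategy, since the abstract states only that the datasets were built manually.

```python
import random


def undersample_to_balance(records, label_key="cancer", seed=0):
    """Return a shuffled subset with equal numbers of positive and negative records.

    Assumption: balancing is done by random undersampling of the larger class.
    `label_key` is a hypothetical field name (1 = positive, 0 = negative).
    """
    rng = random.Random(seed)
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


def sensitivity_and_ppv(y_true, y_pred):
    """Compute sensitivity (recall) and PPV (precision) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: share of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # PPV: share of positive predictions that were actually positive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy data: 3 positive and 10 negative records (made up for illustration).
    toy = [{"id": i, "cancer": 1 if i < 3 else 0} for i in range(13)]
    balanced = undersample_to_balance(toy)
    print(len(balanced), sum(r["cancer"] for r in balanced))
    print(sensitivity_and_ppv([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

On an imbalanced dataset a model can reach high overall accuracy while missing most positive cases, which is why the abstract reports sensitivity and PPV for the positive class rather than accuracy alone.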

Degree

M.Sc.

Advisors

Camarillo, Purdue University.

Subject Area

Artificial intelligence|Computer science|Oncology
