Examination and utilization of rare features in text classification of injury narratives

Hsin-Ying Huang, Purdue University


Thanks to advances in computing and information technology, the analysis of injury surveillance data with statistical machine learning methods has grown in popularity, complexity, and quality in recent years. Over the same period, researchers have recognized the limitations of statistical text analysis when training data are limited. In response to the two primary challenges for statistical text analysis, dimensionality reduction and sparse data, many studies have focused on improving machine learning algorithms. Less research has been done, though, to examine and improve statistical machine learning methods in text classification from a linguistic perspective. This study addresses this research gap by examining the importance of extreme-frequency words in classifying injury narratives. The results indicate that following the common practice of removing frequently occurring prepositions from the text significantly decreased classification performance for certain categories. Removing low-frequency words significantly improved classification performance for Multinomial Naive Bayes (MNB) and helped alleviate overfitting of small categories for Logistic Regression (LR), but had no significant effect for Support Vector Machine (SVM). To make use of low-frequency words, classic word normalization or grouping methods such as stemming and lemmatization are often applied in the text preprocessing stage. Despite their popularity, these classic grouping methods are not without limitations. The proposed "Type M+S Word Grouping Method" automatically groups rare and unseen words morphologically and semantically using unlabeled data. Several experiments were conducted to evaluate the grouping effect for three classifiers (MNB, SVM, LR) in three train-test scenarios (1:9, 1:1, 9:1) on injury surveillance data comprising a half-million narratives classified into 30 external cause categories.
The experimental results show that the proposed method, optionally paired with three add-on methods (two-word sequence tagging, reviewed tagging, Naive Bayes-weighted classifier), yielded better classification performance than stemming and lemmatization. The overall classification performance for small categories with limited training data improved for MNB (5.5%), SVM (4%), and LR (11.2%), to an extent comparable to increasing the size of the labeled training set by a factor of 3.6 for MNB, 2.3 for SVM, and 5.2 for LR. Some improvement was also observed for medium-sized categories (1.7%), while performance on large categories remained nearly unchanged (0.1%). The overall results support the conclusion that the proposed method of decision support is a promising approach for incorporating expert knowledge that improves machine learning for classifying injury narratives with reduced manual effort. The results also suggest that simply increasing the size of a training dataset would not achieve the level of performance that the proposed method can reach, because linear classifiers are inherently limited in acquiring from the narratives the fundamental concepts and classification rules that human experts know from the definitions of injuries.
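
As a rough illustration of the low-frequency word removal examined above, the following minimal Python sketch prunes rare words from the vocabulary before training a small Multinomial Naive Bayes classifier. It is not the dissertation's implementation and does not reproduce the Type M+S Word Grouping Method; the toy narratives, the labels, the min_count threshold, and all function and class names are assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(docs, min_count=2):
    """Keep only words that occur at least min_count times in the training docs."""
    counts = Counter(w for d in docs for w in tokenize(d))
    return {w for w, c in counts.items() if c >= min_count}


class MultinomialNB:
    """Tiny Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.n_docs = len(docs)
        self.class_docs = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            # Words outside the pruned vocabulary are ignored entirely.
            self.word_counts[label].update(w for w in tokenize(doc) if w in vocab)
        return self

    def predict(self, doc):
        words = [w for w in tokenize(doc) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_docs):
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_docs[label] / self.n_docs)
            for w in words:
                score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy narratives; real injury surveillance data are far larger.
train_docs = [
    "worker fell from ladder",
    "employee fell off roof",
    "fell from scaffold onto floor",
    "hand cut by knife",
    "finger cut on blade",
    "cut hand with box knife",
]
train_labels = ["fall", "fall", "fall", "cut", "cut", "cut"]

full_vocab = build_vocab(train_docs, min_count=1)  # every word kept
vocab = build_vocab(train_docs, min_count=2)       # rare words removed
model = MultinomialNB().fit(train_docs, train_labels, vocab)
prediction = model.predict("employee fell from a ladder")
```

In this sketch, raising min_count shrinks the vocabulary, so words seen only once no longer contribute to the class likelihoods. Whether that helps or hurts on real narratives depends on the classifier and on category size, as the results above indicate.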




Lehto, Purdue University.

Subject Area

Industrial engineering
