Improving the Autocoding of Injury Narratives Using a Combination of Machine Learning Methods and Natural Language Processing Techniques

Gaurav Nanda, Purdue University


The field “external cause of injury code (E-code)” in injury datasets indicates the specific cause of an injury, such as a fall, cut, burn, or electric shock. E-coded injury data is important for identifying the factors that cause the most serious injuries and for prioritizing prevention efforts. E-codes are typically assigned to injury records by trained human coders based on the injury narrative, a process that is expensive in terms of time and resources. Machine Learning (ML) models offer a promising alternative for quickly assigning E-codes (autocoding) based on the injury narrative, but they are not able to predict all categories with high accuracy. The primary reasons for low prediction accuracy include a large number of categories, poor quality of training data, a heavily skewed distribution of data, and the sparse and noisy nature of injury narratives. Apart from these data-related challenges, a fundamental reason for the low autocoding accuracy of classical ML models is that they use the bag-of-words approach, which considers the statistical distribution of words across categories but has no knowledge of the syntax, semantics, or pragmatics of the narrative text. Natural Language Processing (NLP) approaches can be used to extract deeper linguistic concepts from the narrative and supplement the ML models to improve autocoding performance. This study examined the use of “non-targeted” NLP approaches and proposed using “targeted” NLP approaches based on the causal model of E-codes for improving autocoding accuracy. Three methods of supplementing the ML model with causal concepts were examined: rule-based, narrative text transformation, and adding nodes to a Bayesian Network. The non-targeted NLP approaches, “Syntactic Tagging” and “Syntactic Tagging with Hypernym Mapping”, used with the Multinomial Naïve Bayes (MNB) model, resulted in lower prediction performance than plain narrative text.
The targeted NLP approaches improved classification performance for the target category. For the E-code “Electric Current”, co-occurrence rules based on causal elements identified cases with an extremely high Positive Predictive Value (PPV) of 98% and improved the prediction performance of the MNB, Support Vector Machine, and Logistic Regression models. The causal concept “Person Fell” was identified using syntactic parsing and word-sequence rules with a very high PPV (92%), and embedding it in the narrative improved classification performance for FALL-related categories. Adding causal concepts as nodes in the Bayesian Network resulted in minor improvements in prediction performance.
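
To make the rule-based and narrative-transformation ideas concrete, the following is a minimal illustrative sketch, not the dissertation's implementation. The word lists, the regular expression standing in for syntactic parsing, the marker token CONCEPT_PERSON_FELL, and the function names are all hypothetical assumptions chosen for illustration; the abstract does not specify the actual rules.

```python
import re

# Hypothetical causal elements for the "Electric Current" E-code; the
# actual word lists used in the study are not given in the abstract.
ELECTRIC_SOURCES = {"wire", "outlet", "socket", "cable", "circuit"}
ELECTRIC_EVENTS = {"shocked", "shock", "electrocuted", "zapped"}

# Hypothetical word-sequence pattern; a simple stand-in for the syntactic
# parsing used to detect the "Person Fell" concept.
PERSON_FELL = re.compile(
    r"\b(?:he|she|patient|pt|worker|employee)\s+(?:fell|slipped|tripped)\b"
)


def tokenize(narrative):
    """Lowercase the narrative and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", narrative.lower())


def electric_rule(narrative):
    """Fire when a source word and an event word co-occur in the narrative."""
    tokens = set(tokenize(narrative))
    return bool(tokens & ELECTRIC_SOURCES) and bool(tokens & ELECTRIC_EVENTS)


def add_concept_token(narrative):
    """Append a marker token when the 'Person Fell' pattern matches."""
    if PERSON_FELL.search(narrative.lower()):
        return narrative + " CONCEPT_PERSON_FELL"
    return narrative


def autocode(narrative, ml_predict):
    """Let the high-precision rule decide first; otherwise defer to the model.

    ml_predict is any callable mapping narrative text to an E-code label,
    for example a trained bag-of-words classifier (assumed, not shown).
    """
    if electric_rule(narrative):
        return "ELECTRIC_CURRENT"
    return ml_predict(add_concept_token(narrative))
```

In this sketch the co-occurrence rule takes precedence because it is assumed to be high-precision, while the appended marker token simply becomes one more feature for whatever bag-of-words classifier is passed in as ml_predict.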




Lehto, Purdue University.

Subject Area

Industrial engineering
