Secondary datasets are used in healthcare research because of their cost advantages, their convenience, and their large size. However, missing data can cause problems that are difficult to resolve. This manuscript reviews possible causes of missing data and how to address them. Many researchers address missing data with multiple imputation, which consists of three phases: (a) the imputation phase, (b) the analysis phase, and (c) the pooling phase. When data are missing because respondents refused to answer or lacked sufficient knowledge, multiple imputation works well. However, difficulties arise when the missing data involve screening questions. If a respondent does not answer a screening question, the true answer could be either “yes” or “no.” This paper suggests identifying “yes” responses on the screening question and setting them aside for use in the analysis. The reasons for this approach are the impossibility of conducting multiple imputation twice, the problem of basing imputation on the population after sample weights are applied, and the difficulty posed by logical errors in the estimates produced during the imputation phase. As an example, this manuscript describes the techniques used to address missing data from screening questions in a national US dataset. Researchers could apply these multiple imputation techniques, illustrated with examples from the dataset, in future healthcare research that relies on secondary datasets.
method, missing data, secondary analysis, multiple imputation, quantitative
Jo, Soojung, "The Use of Multiple Imputation to Handle Missing Data in Secondary Datasets: Suggested Approaches when Missing Data Results from the Survey Structure" (2022). School of Nursing Faculty Publications. Paper 56.