Data Classification for l-diversity
Corporations are retaining ever-larger corpora of personal data; the frequency of breaches and the corresponding privacy impact have been rising accordingly. One way to mitigate this risk is through the use of anonymized data, limiting the exposure of individual data to only where it is absolutely needed. This would seem particularly appropriate for data mining, where the goal is generalizable knowledge rather than data on specific individuals. In practice, corporate data miners often insist on original data, for fear that they might miss something with anonymized or differentially private approaches. This dissertation provides both empirical and theoretical justifications for the use of anonymized data, in particular for a specific anonymization scheme called anatomization (producing anatomized data). Anatomized data preserves all attribute values, but introduces uncertainty in the mapping between identifying and sensitive values, thus satisfying l-diversity. We first propose a decision tree learning algorithm for anatomized data. Empirical results show that this algorithm produces decision trees approaching the accuracy of non-private decision trees. We then show that a k-nearest neighbor classifier and a support vector classifier trained on anatomized data are theoretically expected to do as well as those trained on the original data under certain conditions. The theoretical effectiveness of the latter approaches is validated using several publicly available datasets, showing that they outperform the state of the art for nearest neighbor and support vector classification using training data protected by k-anonymity, and are comparable to learning on the original data.
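To make the idea of anatomization concrete, the following is a minimal sketch, not the dissertation's algorithm. It assumes a simplified greedy grouping in which each group receives records with at least l distinct sensitive values, and it releases two tables linked only by a group ID. The function name `anatomize`, the record layout, and the sample values are all hypothetical.

```python
from collections import Counter, defaultdict


def anatomize(records, l):
    """Split records into a quasi-identifier table and a sensitive table.

    records: list of (quasi_identifiers, sensitive_value) pairs (assumed layout).
    l: minimum number of distinct sensitive values per group.
    """
    # Bucket record indices by sensitive value.
    buckets = defaultdict(list)
    for idx, (_, sensitive) in enumerate(records):
        buckets[sensitive].append(idx)

    groups = []
    # While at least l non-empty buckets remain, draw one record
    # from each of the l largest buckets to form a new group.
    while sum(1 for b in buckets.values() if b) >= l:
        nonempty = [s for s in buckets if buckets[s]]
        largest = sorted(nonempty, key=lambda s: len(buckets[s]), reverse=True)[:l]
        groups.append([buckets[s].pop() for s in largest])

    # Leftover records join a group that lacks their sensitive value;
    # if no such group exists, the record is suppressed (a simplification).
    for sensitive, bucket in buckets.items():
        for idx in bucket:
            for group in groups:
                if all(records[j][1] != sensitive for j in group):
                    group.append(idx)
                    break

    # Quasi-identifiers are kept exactly; only the group ID links the tables.
    qi_table = [(records[i][0], gid) for gid, g in enumerate(groups) for i in g]
    # Sensitive values are released as per-group counts, so row order
    # cannot reveal which record holds which value.
    sens_table = []
    for gid, g in enumerate(groups):
        counts = Counter(records[i][1] for i in g)
        sens_table.extend((gid, s, c) for s, c in sorted(counts.items()))
    return qi_table, sens_table


# Hypothetical sample data: ((zip code, age), diagnosis).
sample = [
    (("47906", 25), "flu"), (("47906", 31), "cancer"), (("47907", 42), "flu"),
    (("47901", 58), "hiv"), (("47905", 37), "cancer"), (("47904", 29), "flu"),
]
qi_table, sens_table = anatomize(sample, l=2)
```

In this sketch every attribute value appears unchanged in one of the two tables, but within a group an observer can only guess which sensitive value belongs to which record, with probability at most 1/l.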
Clifton, Purdue University.