Statistical inference and data cleaning in relational database systems
Abstract
Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures available in modern DBMSs. We present an iterative statistical framework for inferring missing information and correcting such errors automatically. The key insight of our approach is to exploit dependencies not only within tuples, but also between attributes of related tuples. We draw on techniques from statistical relational learning to develop an efficient approximate inference algorithm that can be implemented in standard DBMSs using SQL and user-defined functions. The resulting framework performs the inference and data cleaning tasks in an integrated manner, using novel techniques to infer correct values accurately even in the presence of dirty data. We evaluate our methods empirically using multiple synthetic and real data sets. The results show that our algorithm infers missing values with accuracy comparable to that of baseline statistical methods, such as exact inference in Bayesian networks. However, our framework simultaneously identifies and corrects corrupted values with high precision, and is significantly more efficient because of its database-level implementation.
Degree
Ph.D.
Advisors
Neville, Purdue University.
Subject Area
Statistics; Computer science