Statistical inference and data cleaning in relational database systems

Christopher S Mayfield, Purdue University

Abstract

Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures available in modern DBMSs. We present an iterative statistical framework for inferring missing information and correcting such errors automatically. The key insight of our approach is to exploit dependencies not only within tuples, but also between attributes of related tuples. We draw on techniques from statistical relational learning to develop an efficient approximate inference algorithm that can be implemented in standard DBMSs using SQL and user-defined functions. The resulting framework performs the inference and data cleaning tasks in an integrated manner, using novel techniques to infer correct values accurately even in the presence of dirty data. We evaluate our methods empirically using multiple synthetic and real data sets. The results show that our algorithm infers missing values with accuracy comparable to that of baseline statistical methods, such as exact inference in Bayesian networks. However, our framework simultaneously identifies and corrects corrupted values with high precision, and is significantly more efficient because of its database-level implementation.

Degree

Ph.D.

Advisors

Neville, Purdue University.

Subject Area

Statistics; Computer science

