Guided data cleaning

Mohamed A Yakout, Purdue University

Abstract

Until recently, all data cleaning techniques focused on providing fully automated solutions, which are risky to rely on, and did not efficiently or effectively consider collaboration with data users and other available resources. This dissertation studies techniques to involve data users directly and indirectly, and to leverage the WWW, specifically web tables, for data cleaning tasks. In particular, the dissertation addresses four key challenges for guided data cleaning. The first challenge relates to directly involving users in the data cleaning process. The goal is to efficiently combine the best of both: the fidelity of the user in guiding the data cleaning process and the ability of existing automatic cleaning techniques to suggest cleaning updates. For this purpose, we develop the necessary principles to reason about which questions to forward to the user, using a novel combination of decision theory and active learning. The second challenge is scalability, as existing automatic cleaning techniques do not scale. We introduce a new approach based on statistical machine learning techniques. We achieve scalability by introducing a robust mechanism to partition the database and then aggregate the final cleaning decisions from the several partitions. The third challenge relates to involving users indirectly in a data cleaning task. We observe that users' actions (or behavior), as recorded in system logs, can be useful evidence for the task of deduplicating the users themselves. We develop the necessary pattern detection and modeling algorithms for this purpose. Finally, the fourth challenge relates to leveraging the WWW for data cleaning tasks. We address the problem of finding missing values (or entity augmentation) using web tables. Our solution relies on aggregating answers from several web tables that directly and indirectly match the user's entities. We model this problem as topic-sensitive PageRank, which captures the holistic semantic match of a web table to the topic of the list of entities. Our experimental evaluations using real-world datasets demonstrate the effectiveness and efficiency of our proposed approaches in improving the quality of dirty databases.
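To illustrate the general idea of topic-sensitive PageRank mentioned in the abstract, the following is a minimal sketch and not the dissertation's actual implementation. It assumes a hypothetical graph in which nodes are web tables, edge weights are table-to-table similarity scores, and the teleport distribution is concentrated on tables that directly match the user's entities. The function name, the table names, the weights, and the damping value are all illustrative assumptions.

```python
def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with a biased (topic) teleport vector.

    edges: dict mapping node -> {neighbor: weight} (hypothetical similarity graph)
    seeds: dict mapping node -> prior weight (tables that directly match entities)
    """
    nodes = set(edges) | set(seeds)
    for neighbors in edges.values():
        nodes |= set(neighbors)

    # Teleport only to seed nodes, in proportion to their prior weight.
    total_seed = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total_seed for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            out = edges.get(u, {})
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling node: send its mass back through the teleport vector.
                for n in nodes:
                    nxt[n] += damping * rank[u] * teleport[n]
            else:
                for v, w in out.items():
                    nxt[v] += damping * rank[u] * w / out_total
        rank = nxt
    return rank


# Hypothetical example: one directly matching table, one table reachable
# only through it, and one table nothing relevant points to.
edges = {
    "t_direct": {"t_indirect": 1.0},
    "t_indirect": {"t_direct": 1.0},
    "t_unrelated": {"t_indirect": 0.2},
}
seeds = {"t_direct": 1.0}
ranks = topic_sensitive_pagerank(edges, seeds)
for table, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(table, round(score, 3))
```

In this sketch, a table that matches the entities only indirectly still receives a score through its links to a directly matching table, while a table with no path from the seed set receives none. How the dissertation actually constructs the graph and the topic vector is not specified in the abstract.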

Degree

Ph.D.

Advisors

Elmagarmid, Purdue University.

Subject Area

Computer science
