Guided Data Fusion

Romila Pradhan, Purdue University

Abstract

While the volume and variety of data furnished by disparate data sources have grown rapidly over the years, there is often little to no control over the quality of data available on the Internet; data sources often provide conflicting information for the same data item (a real-world entity or event). Recent years have witnessed a number of data fusion systems that propose solutions to consolidate multiple instances of a data item, distinguish correct from incorrect information and present a unified, consistent and meaningful record to users. Most of these fusion systems focus on automatically identifying correct information for data items. Despite their remarkable effectiveness in resolving conflicts, these fusion systems are not error-free, and incorrect interpretations of certain data items quickly propagate as false judgments on other items. This dissertation studies techniques to incorporate user feedback and capitalize on knowledge of the relationships among claims of data items to improve the effectiveness of conflict resolution. In particular, the dissertation addresses two key challenges toward guided data fusion. The first challenge relates to integrating feedback from users to rapidly resolve conflicts. The objective is to integrate user feedback effectively and efficiently for maximum benefit to data fusion. For this purpose, we develop a novel framework built on the principles of decision theory and active learning to reason about the order in which claims should be validated by users. We propose approaches that exploit the structure of interactions between data items and sources and offer interactive validation times to users of a data fusion system. The second challenge relates to leveraging relations between claims of data items to identify multiple related correct claims. The objective is to recognize existing entity relationships among claims and integrate them with data fusion systems that are agnostic to data relationships. Toward this goal, we leverage knowledge representations that encapsulate a wide range of relationship semantics and introduce mechanisms to integrate these knowledge representations with data fusion models to retrieve multiple correct claims that are consistent with each other. Our experimental evaluations using real-world and synthetic datasets demonstrate the effectiveness and efficiency of our proposed approaches in improving conflict resolution for data integrated from multiple sources.

Degree

Ph.D.

Advisors

Prabhakar, Purdue University.

Subject Area

Computer science

