Similarity-aware query processing and optimization

Yasin Nilton Silva, Purdue University

Abstract

Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, require or can significantly benefit from the identification and processing of similarities in the data. Although some work has extended the semantics of individual operators, e.g., join and selection, to be aware of data similarities, there has been little study of the role, interaction, and implementation of similarity-aware operations as first-class database operators. This thesis proposes and studies several similarity-aware database operators and systematically analyzes their role as query operators, their interactions, optimizations, and implementation techniques. It presents a detailed study of two core similarity-aware operators: Similarity Group-by and Similarity Join. We describe multiple optimization techniques for the introduced operators. Specifically, we present: (1) multiple non-trivial equivalence rules that enable similarity query transformations, (2) Eager and Lazy aggregation transformations for Similarity Group-by and Similarity Join that allow pre-aggregation before potentially expensive joins, and (3) techniques that use materialized views to answer similarity-based queries. We also present the main guidelines for implementing the presented operators as integral components of a database system query engine, along with several key performance evaluation results of this implementation in an open source database system. We introduce a comprehensive conceptual evaluation model for similarity queries with multiple similarity-aware predicates, i.e., Similarity Selection, Similarity Join, and Similarity Group-by. This model clearly defines the expected correct result of a query with multiple similarity-aware predicates. Furthermore, we present multiple transformation rules that turn the initial evaluation plan into more efficient equivalent plans.
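
For readers unfamiliar with the two core operators, the following is a minimal illustrative sketch and not the dissertation's definitions or implementation. It assumes one-dimensional numeric data, a distance-threshold (epsilon) join, and a grouping rule that starts a new group whenever the gap between consecutive sorted values exceeds a limit. The function names and parameters (similarity_join, similarity_group_by, eps, max_gap) are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def similarity_join(left: List[float], right: List[float], eps: float) -> List[Tuple[float, float]]:
    """Naive nested-loop join: keep every pair whose values differ by at most eps.

    Assumption: similarity is plain absolute difference on 1-D values.
    """
    return [(l, r) for l in left for r in right if abs(l - r) <= eps]


def similarity_group_by(values: List[float], max_gap: float) -> List[List[float]]:
    """Group values so that consecutive sorted members are at most max_gap apart.

    Assumption: a new group starts whenever the gap to the previous
    sorted value exceeds max_gap.
    """
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)  # close enough to the previous value: same group
        else:
            groups.append([v])    # gap too large (or first value): open a new group
    return groups


if __name__ == "__main__":
    print(similarity_join([1.0, 5.0], [1.5, 9.0], eps=1.0))
    print(similarity_group_by([1, 2, 10, 11, 30], max_gap=2))
```

A regular equi-join or GROUP BY is the special case where the threshold is zero. The optimizations named in the abstract (equivalence rules, Eager and Lazy aggregation, materialized views) concern how such operators can be reordered and combined inside a query plan, which this sketch does not attempt to show.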

Degree

Ph.D.

Advisors

Aref, Purdue University.

Subject Area

Computer science
