Computer Science Technical Report


Many application scenarios, e.g., marketing analysis, sensor networks, and
medical and biological applications, require or can significantly benefit from the
identification and processing of similarities in the data. Even though some work
has been done to extend the semantics of some operators, e.g., join and
selection, to be aware of data similarities, there has not been much study on the
role, interaction, and implementation of similarity-aware operations as first-class
database operators. The focus of this thesis work is the proposal and study of
several similarity-aware database operators and a systematic analysis of their
role as query operators, interactions, optimizations, and implementation
techniques. This work presents a detailed study of two core similarity-aware
operators: Similarity Group-by and Similarity Join. We describe multiple
optimization techniques for the introduced operators. Specifically, we present: (1)
multiple non-trivial equivalence rules that enable similarity query transformations,
(2) Eager and Lazy aggregation transformations for Similarity Group-by and
Similarity Join to allow pre-aggregation before potentially expensive joins, and (3)
techniques to use materialized views to answer similarity-based queries. We also
present the main guidelines to implement the presented operators as integral
components of a database system query engine and several key performance
evaluation results of this implementation in an open-source database system. We
introduce a comprehensive conceptual evaluation model for similarity queries
with multiple similarity-aware predicates, i.e., Similarity Selection, Similarity Join,
and Similarity Group-by. This model clearly defines the expected correct result of a
query with multiple similarity-aware predicates. Furthermore, we present multiple
transformation rules to transform the initial evaluation plan into more efficient
equivalent plans.


similarities of data, similarity aware operations, systematic analysis, database operators, similarity group by, similarity join

Date of this Version