Abstract

Today, myriad data sources, from the Internet to business operations to scientific instruments, produce large volumes of diverse data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in "big" data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low dimensional data to high dimensional data, from single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, and analyzes and optimizes their performance.

The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the standard SQL Group-by, and we introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL).
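
To illustrate what grouping by "similar but not necessarily equal values" can mean, the following is a minimal one-dimensional sketch. It is not the dissertation's SGB operator or its PostgreSQL implementation: the function name `similarity_group_by`, the threshold parameter `eps`, and the rule "start a new group when the gap to the previous sorted value exceeds `eps`" are all assumptions chosen for illustration.

```python
from typing import List


def similarity_group_by(values: List[float], eps: float) -> List[List[float]]:
    """Hypothetical 1-D similarity grouping (illustrative assumption only).

    Values are sorted; a value joins the current group when it lies within
    eps of the previous value, otherwise it opens a new group.
    """
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= eps:
            groups[-1].append(v)  # close enough to the previous value
        else:
            groups.append([v])    # gap too large: open a new group
    return groups


ages = [21, 22, 23, 40, 41, 65]
groups = similarity_group_by(ages, eps=2)
# A standard Group-by would yield six groups here (all values differ);
# this similarity-based variant yields three.
```

An aggregate such as a count or an average could then be computed per group, just as with a standard Group-by.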

In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins).
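
For readers unfamiliar with the two query types, the sketch below shows a brute-force baseline for a Hamming-distance select and a Hamming-distance join over binary codes stored as integers. It is not the HA-Index; it only pins down what the queries return, under the assumption that both are range queries with a distance threshold `h`. The function names are hypothetical.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def hamming_select(query: int, codes: List[int], h: int) -> List[int]:
    """Return every code within Hamming distance h of the query (linear scan)."""
    return [c for c in codes if hamming(query, c) <= h]


def hamming_join(left: List[int], right: List[int], h: int) -> List[Tuple[int, int]]:
    """Return every (l, r) pair within Hamming distance h (nested-loop scan)."""
    return [(l, r) for l in left for r in right if hamming(l, r) <= h]


codes = [0b0000, 0b0001, 0b0111, 0b1111]
selected = hamming_select(0b0000, codes, h=1)
joined = hamming_join([0b0011], [0b0001, 0b1100], h=1)
```

The select compares the query against every code and the join compares every pair, which is the exhaustive cost that an index over the codes is meant to reduce.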

In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.

Keywords

Applied sciences, Database, Query processing, RDBMS, Similarity

Disciplines

Computer Sciences

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Walid G. Aref

Committee Chair

Walid G. Aref

Committee Member 1

Elisa Bertino

Committee Member 2

Sonia Fahmy

Committee Member 3

Sunil Prabhakar

Date of Award

12-2016

Recommended Citation

Tang, Mingjie, "Efficient processing of similarity queries with applications" (2016). Open Access Dissertations. 1014.
https://docs.lib.purdue.edu/open_access_dissertations/1014

Included in

Computer Sciences Commons
