Performance and Cost Optimization for Distributed Cloud-Native Systems

Ashraf Y Mahgoub, Purdue University

Abstract

We investigate the problem of performance and cost optimization for two types of cloud-native distributed systems: NoSQL data stores and serverless DAG applications. NoSQL data stores provide a set of features demanded by high-performance computing (HPC) applications, such as scalability, availability, and schema flexibility. HPC applications, such as metagenomics and other big-data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL data stores, and optimizing these databases is a challenging endeavor, with over 50 configuration parameters in Cassandra alone. As the application executes, database workloads can change rapidly over time (e.g., from read-heavy to write-heavy), and a system tuned for one phase of the workload becomes suboptimal when the workload changes.

We present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the most significant configuration parameters using ANOVA. Next, we train a neural network on these significant parameters and their workload-dependent mapping to predict database throughput, serving as a surrogate model. Finally, we search the configuration space with a genetic algorithm over the surrogate to maximize workload-dependent performance. Using this methodology in our first framework, Rafiki, we predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9-7.8% for ScyllaDB. Searching the configuration space using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration on a read-heavy workload, as well as significant improvements for mixed workloads.
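The surrogate-plus-search idea above can be sketched in a few lines. This is a minimal illustration, not Rafiki's actual model: the response surface, parameter names (`heap_gb`, `compaction_threads`), and bounds are all hypothetical stand-ins for a trained neural-network surrogate over ANOVA-selected parameters.

```python
import random

# Hypothetical surrogate: predicts throughput (ops/s) from two configuration
# parameters. In Rafiki this would be a trained neural network; here it is a
# made-up response surface with an interior optimum at (12, 6).
def surrogate_throughput(config):
    heap_gb, compaction_threads = config
    return 50_000 - 800 * (heap_gb - 12) ** 2 - 300 * (compaction_threads - 6) ** 2

BOUNDS = [(4, 32), (1, 16)]  # (min, max) per parameter, illustrative only

def random_config():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def mutate(config, rate=0.2):
    # Gaussian perturbation, clamped to the legal range of each parameter.
    return [
        min(hi, max(lo, g + random.gauss(0, (hi - lo) * rate)))
        if random.random() < rate else g
        for g, (lo, hi) in zip(config, BOUNDS)
    ]

def crossover(a, b):
    # Uniform crossover: each gene is taken from either parent.
    return [random.choice(pair) for pair in zip(a, b)]

def genetic_search(generations=50, pop_size=30, elite=5):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=surrogate_throughput, reverse=True)
        parents = pop[:elite]  # elitism: keep the best configurations
        pop = parents + [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - elite)
        ]
    return max(pop, key=surrogate_throughput)

best = genetic_search()
```

Because each fitness evaluation is a cheap surrogate prediction rather than a real benchmark run, the genetic algorithm can explore the configuration space many orders of magnitude faster than exhaustive measurement, which is the source of the 1/10,000th search-time figure reported for Rafiki.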
In terms of search speed, Rafiki uses only 1/10,000th of the time of exhaustive search, yet reaches within 15% and 9.5% of the theoretically best achievable performance for Cassandra and ScyllaDB, respectively, supporting optimization for highly dynamic workloads.

Next, we consider the problem of reconfiguring NoSQL databases under changing workload patterns. This is challenging because of the large configuration-parameter search space with complex interdependencies among the parameters. While state-of-the-art systems can automatically identify close-to-optimal configurations for static workloads, they suffer for dynamic workloads because they overlook three fundamental challenges: (1) estimating performance degradation during the reconfiguration process (e.g., due to database restarts); (2) predicting how transient the new workload pattern will be; and (3) respecting the application's availability requirements during reconfiguration. Our second framework, Sophia, addresses all these shortcomings using an optimization technique that combines workload prediction with a cost-benefit analyzer. Sophia computes the relative cost and benefit of each reconfiguration step and determines an optimal reconfiguration plan for a future time window. This plan specifies when to change configurations and to what values, to achieve the best performance without degrading data availability. We demonstrate its effectiveness for three different workloads: a multi-tenant, global-scale metagenomics repository (MG-RAST), a bus-tracking application (Tiramisu), and an HPC data-analytics system, all with varying levels of workload complexity and dynamic workload changes. We compare Sophia's throughput and tail latency against various baselines for two popular NoSQL databases, Cassandra and Redis.
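The core of the cost-benefit reasoning above can be illustrated with a simplified model. This sketch is not Sophia's actual analyzer; the function names and the single-downtime-window simplification are assumptions made for illustration. The idea is that a reconfiguration pays off only if the throughput gained over the predicted duration of the new workload outweighs the operations lost while the database restarts.

```python
def reconfiguration_benefit(gain_ops_s, window_s, downtime_s, current_ops_s):
    """Net operations gained by reconfiguring (hypothetical simplification).

    gain_ops_s    -- extra throughput of the new config once it is live
    window_s      -- predicted duration of the new workload pattern
    downtime_s    -- performance-degradation window (e.g., database restart)
    current_ops_s -- throughput sustained if we do not reconfigure
    """
    benefit = gain_ops_s * (window_s - downtime_s)  # ops gained after the switch
    cost = current_ops_s * downtime_s               # ops lost during the restart
    return benefit - cost

def should_reconfigure(gain_ops_s, window_s, downtime_s, current_ops_s):
    # Reconfigure only when the predicted net gain is positive.
    return reconfiguration_benefit(gain_ops_s, window_s, downtime_s, current_ops_s) > 0
```

Under this model, a long-lived workload shift (say, an hour-long window) justifies paying a 30-second restart, while a transient one-minute spike does not, which is why predicting how transient the new pattern will be (challenge 2 above) is essential to the plan.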

Degree

Ph.D.

Advisors

Grama, Purdue University.

Subject Area

Artificial intelligence|Computer science

