Time-varying network reservations for cloud data centers

Di Xie, Purdue University

Abstract

In multi-tenant data centers, jobs of different tenants compete for the shared data center network and can suffer poor performance and high cost from varying, unpredictable network performance. Recently, several virtual network abstractions have been proposed to provide explicit application programming interfaces (APIs) for tenant jobs to specify and reserve virtual clusters (VCs) with both an explicit number of virtual machines (VMs) and network bandwidth between the VMs. However, all of the existing proposals reserve a fixed bandwidth throughout the entire execution of a job. Our profiling study of a set of MapReduce benchmark applications shows that these popular cloud applications generate substantial traffic during only 30%-60% of their execution, suggesting that existing simple VC models waste precious networking resources. In this dissertation, we study the design and implementation of fine-grained virtual network abstractions to support efficient network isolation and predictable application performance in shared data centers. This dissertation makes the following four contributions: (1) We introduce the first fine-grained virtual network reservation abstraction, Temporally-Interleaved Virtual Clusters (TIVC), which captures the time-varying nature of the networking requirements of cloud applications. (2) We present a “black-box” approach to generate TIVC models automatically based on network traffic profiling of the applications. (3) To demonstrate the effectiveness of TIVC, we develop Proteus, a cloud resource management system that implements the new abstraction. Using large-scale simulations of cloud application workloads and a prototype implementation running actual cloud applications, we show that the new abstraction significantly increases the utilization of the entire data center and reduces the cost to cloud tenants, compared to previous fixed-bandwidth abstractions.
(4) We extend the current Infrastructure-as-a-Service (IaaS) cloud model, which requires users to specify resource configurations explicitly, to support a much more user-friendly interface that instead takes a service time objective as input and automatically translates it into the needed resource configuration. The new API is realized by extending Proteus with a projection model that is based on insights into the performance bottlenecks of MapReduce jobs and their scaling properties, and that is parameterized with component running times obtained by profiling on small clusters with sampled inputs. Evaluation results show that our projection model can predict job running times with 2.7% error when scaling to 32 nodes.
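
To make the contrast between a fixed-bandwidth reservation and a time-varying one concrete, the following is a minimal sketch and not the TIVC model or the Proteus implementation. It assumes a reservation can be written as a base bandwidth plus a list of higher-bandwidth windows; the `Reservation` class, the `fits` helper, and all numbers (a 100-second job that needs 500 Mbps during 40% of its run) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Reservation:
    """Hypothetical per-VM bandwidth reservation (not the dissertation's model)."""
    duration: int                      # job length in seconds
    base: float                        # Mbps held for the whole job
    # (start, end, bandwidth) windows with a higher reservation; end is exclusive
    pulses: List[Tuple[int, int, float]] = field(default_factory=list)

    def bandwidth_at(self, t: int) -> float:
        """Reserved bandwidth at second t, or 0 outside the job's lifetime."""
        if not 0 <= t < self.duration:
            return 0.0
        active = [bw for start, end, bw in self.pulses if start <= t < end]
        return max([self.base] + active)

    def volume(self) -> float:
        """Total reserved bandwidth-time (Mbps x seconds)."""
        return sum(self.bandwidth_at(t) for t in range(self.duration))


def fits(link_capacity: float, placed: List[Tuple[int, Reservation]]) -> bool:
    """Check that jobs started at the given offsets never exceed the link capacity."""
    horizon = max(offset + r.duration for offset, r in placed)
    return all(
        sum(r.bandwidth_at(t - offset) for offset, r in placed) <= link_capacity
        for t in range(horizon)
    )


# Assumed example: the same job described two ways.
fixed = Reservation(duration=100, base=500.0)
time_varying = Reservation(duration=100, base=50.0, pulses=[(0, 40, 500.0)])

fixed_volume = fixed.volume()                  # 500 Mbps held for all 100 s
time_varying_volume = time_varying.volume()    # 500 Mbps for 40 s, 50 Mbps after

# Two time-varying jobs share a 600 Mbps link if their peaks are interleaved,
# while starting them together would need 1000 Mbps during the first 40 s.
interleaved_ok = fits(600.0, [(0, time_varying), (40, time_varying)])
simultaneous_ok = fits(600.0, [(0, time_varying), (0, time_varying)])
```

In this toy setting the time-varying description reserves less than half the bandwidth-time of the fixed one, and offsetting the second job lets both fit on a link that could not carry two fixed reservations. This is the kind of saving the abstract attributes to temporally interleaving reservations, shown here under the stated assumptions only.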

Degree

Ph.D.

Advisors

Hu, Purdue University.

Subject Area

Computer science
