Improving MapReduce performance in large-scale clusters

Faraz Ahmad, Purdue University

Abstract

The evolution of big data has led enterprises to seek time-efficient and cost-effective solutions for processing large volumes of raw data on clusters of commodity hardware. MapReduce is a well-known programming model from Google for large-scale data processing that provides automatic data management and fault tolerance to improve the programmability of clusters. MapReductions are used extensively in clusters not only to provide up-to-date, organized data for interactive workloads such as search engines and social networks, but also to perform time-critical data analytics for retail enterprises and financial markets. Improving the performance of MapReductions is particularly important because of (i) the time-critical nature of MapReductions, (ii) the savings in valuable machine hours, and (iii) cost-effective cloud solutions for users and enterprises. The main thrust of the thesis is to address the MapReduce performance problems caused by the all-Map-to-all-Reduce communication, called the Shuffle, across the network bisection. Many MapReductions move large amounts of data (e.g., as much as the input data) during the Shuffle, stressing the bisection bandwidth and introducing significant runtime overhead. In this work, I make four contributions. First, I propose techniques (MaRCO) that overlap Shuffle communication with Reduce computation to improve MapReduce performance in homogeneous clusters. Second, I propose a suite of optimizations (Tarazu) that perform communication- and computation-aware load balancing to improve performance on heterogeneous clusters. Third, I identify performance bottlenecks in multi-tenant clusters due to the Shuffle, and exploit a key trade-off between intra-job concurrency and data locality (ShuffleWatcher) to shape and reduce Shuffle traffic in multi-tenant clusters. Finally, I establish a benchmark suite (PUMA) of real-world applications that represents a broad range of MapReductions with varying computation and communication demands.

Degree

Ph.D.

Advisors

Vijaykumar, Purdue University

Subject Area

Computer Engineering; Computer Science
