Methods to improve applicability and efficiency of distributed data-centric compute frameworks

Karthik Shashank Kambatla, Purdue University

Abstract

The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources such as web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks. While data-centric models like MapReduce allow applications to process large datasets on thousands of nodes, the lack of inter-task communication and high synchronization costs limit their applicability to data-parallel computations. In this dissertation, we (1) enable inter-task communication through transactional execution of tasks over a shared address space, and (2) lower the synchronization costs through relaxed synchronization techniques. These improvements extend the applicability of MapReduce and allow applications to exploit amorphous data parallelism and algorithmic asynchrony. We demonstrate this by improving the scalability and performance of many unstructured graph applications by an order of magnitude. The simplicity of data-centric models, which hide the complexity of distributed execution from end-users, is one of the main reasons for their widespread adoption and success. However, hiding this complexity makes it harder for end-users to efficiently utilize and share cluster resources. In this dissertation, we devise UBIS, a utilization-aware cluster scheduler, to improve cluster utilization and job throughput. UBIS demonstrates improvements of up to 30% for representative workloads. We also outline methods to automatically tune per-job configuration knobs for optimal performance.

Degree

Ph.D.

Advisors

Grama, Purdue University.

Subject Area

Computer science
