Date of Award

Spring 2015

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Engineering

First Advisor

Milind Kulkarni

Committee Chair

Milind Kulkarni

Committee Member 1

Arun Prakash

Committee Member 2

Samuel P. Midkiff

Committee Member 3

Vijay S. Pai

Abstract

Graphics Processing Units (GPUs) offer massive, highly efficient parallelism, making them an attractive target for computation-intensive applications. However, GPUs have a separate memory space, which introduces the complexity of manually managing explicit data movement between GPU and CPU memory. Although GPU kernels and libraries make it easy to improve application performance by offloading computation to GPUs, it is very difficult to manually optimize CPU-GPU communication across multiple kernel invocations to avoid redundant transfers when these kernels are used in complex applications.

In this thesis, we introduce SemCache, a semantics-aware GPU cache that automatically manages CPU-GPU communication and optimizes it by using caching to eliminate redundant transfers. SemCache uses library semantics to determine the appropriate caching granularity for a given offloaded library (e.g., matrices). Our caching technique is efficient: it tracks whole matrices rather than every memory access at fine granularity. We applied SemCache to the Basic Linear Algebra Subprograms (BLAS) library to provide a drop-in GPU replacement library that requires no program rewriting or annotations.

SemCache++ extends SemCache to support offloading to multiple GPUs. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. SemCache++ also enables new features such as asynchronous transfers, parallel execution, and overlapping communication with computation. Experimental results show that our system can dramatically reduce redundant communication for real-world computational science applications and deliver significant performance improvements, outperforming GPU-based implementations such as MAGMA, CULA, CUBLAS, StarPU, and CUBLASXT.
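To illustrate the core idea of a drop-in, semantics-aware caching layer, the following is a minimal sketch, not SemCache's actual implementation: the cache structure, the coherence states, and helper names such as `ensure_on_gpu` and `semcache_dgemm` are assumptions introduced here for illustration. It shows how a replacement `dgemm` could consult a matrix-granularity cache before issuing any CPU-GPU transfer, so a matrix already resident on the GPU is not copied again.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <map>

// Hypothetical matrix-granularity cache: one entry per host matrix,
// rather than tracking every memory access at fine granularity.
enum class State { CpuOnly, GpuOnly, Shared };

struct Entry {
    double* dev;   // device copy of the matrix
    size_t bytes;  // matrix size in bytes
    State state;   // which copies (CPU/GPU) are currently valid
};

static std::map<const double*, Entry> cache;  // keyed by host pointer

// Transfer a matrix to the GPU only if the device copy is missing or stale.
static double* ensure_on_gpu(const double* host, size_t bytes) {
    auto it = cache.find(host);
    if (it != cache.end() && it->second.state != State::CpuOnly)
        return it->second.dev;  // cache hit: redundant transfer avoided
    double* dev = (it != cache.end()) ? it->second.dev : nullptr;
    if (!dev) cudaMalloc(reinterpret_cast<void**>(&dev), bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cache[host] = {dev, bytes, State::Shared};
    return dev;
}

// Drop-in GEMM: C = alpha*A*B + beta*C on the GPU (column-major).
// The BLAS-style interface means the application needs no rewriting.
void semcache_dgemm(cublasHandle_t h, int m, int n, int k,
                    double alpha, const double* A, int lda,
                    const double* B, int ldb,
                    double beta, double* C, int ldc) {
    // Library semantics tell the cache that dgemm reads A and B and
    // writes C, so only these three matrices need to be tracked.
    double* dA = ensure_on_gpu(A, sizeof(double) * (size_t)lda * k);
    double* dB = ensure_on_gpu(B, sizeof(double) * (size_t)ldb * n);
    double* dC = ensure_on_gpu(C, sizeof(double) * (size_t)ldc * n);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    // C was written on the GPU, so the host copy is now stale; the result
    // is copied back lazily when the CPU next needs it, not eagerly.
    cache[C].state = State::GpuOnly;
}
```

In a chain of such calls, an output matrix left in the `GpuOnly` state is reused directly by the next kernel with no round trip through host memory; as the abstract notes, SemCache++ further uses virtual memory to manage this tracking automatically across multiple GPUs.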
