Open Access Dissertations

Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud

Hao Lin, Purdue University

Date of Award

8-2018

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

Committee Chair

Samuel P. Midkiff

Committee Member 1

Rudolf Eigenmann

Committee Member 2

Y. Charlie Hu

Committee Member 3

Milind Kulkarni

Abstract

Large-scale data management and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling, and they have large user bases. R is among the most widely used of these languages, but it is limited by a single-threaded execution model and by problem sizes that must fit on a single node. We propose a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like Spark framework, and achieves high performance and scaling across clusters. RABID preserves the R programming model by introducing R-compatible distributed data structures with overloaded functions. Optimizations such as memory footprint reduction, data pipelining and serialization, and operation merging are used to improve runtime performance. We compare RABID to several other frameworks.
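The idea of operation merging mentioned in the abstract can be illustrated with a minimal sketch. This is not RABID's implementation or API: the class LazySeq and its methods map and collect are hypothetical names chosen for illustration, the example is written in Python rather than R, and it runs on a single machine. It only shows the general technique of recording element-wise operations lazily and fusing them into one pass over the data.

```python
class LazySeq:
    """Hypothetical lazy sequence (illustration only, not RABID's API).

    Element-wise functions are recorded instead of being applied
    immediately, so consecutive maps can be merged into one pass.
    """

    def __init__(self, data, funcs=()):
        self.data = list(data)
        self.funcs = tuple(funcs)

    def map(self, f):
        # Record the function; no pass over the data happens here.
        return LazySeq(self.data, self.funcs + (f,))

    def collect(self):
        # Merge all recorded functions into a single fused function
        # and traverse the data exactly once.
        def fused(x):
            for f in self.funcs:
                x = f(x)
            return x

        return [fused(x) for x in self.data]


result = LazySeq([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2).collect()
```

Without merging, each map would produce an intermediate collection; with merging, the two functions are composed and applied together, which is one way a runtime can reduce passes over the data and temporary memory.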

In the era of cloud computing, batch data-processing workloads such as RABID applications are targeted to run in VMs or containers in a cloud-based data center. Efficient scheduling of data center VMs can reduce the number of physical servers needed and, in turn, reduce the energy and other capital costs of maintaining the virtualized data center. We propose an innovative data-driven approach to achieve efficient proactive VM scheduling. Our approach uses a multi-capacity bin-packing technique that efficiently places VMs onto physical servers. We use time-series analysis to extract not only low-frequency information about future VM workloads but also high-frequency information for VM workload correlations. This approach can also be implemented in RABID and leverage its high performance.
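Multi-capacity bin packing treats each physical server as a bin with several capacities (for example CPU and memory) and each VM as an item with a demand in every dimension. The sketch below uses the well-known first-fit decreasing heuristic as a stand-in; it is not the dissertation's algorithm, it ignores the workload forecasts and correlations described above, and the function name, VM names, and resource figures are made up for illustration.

```python
def first_fit_decreasing(vms, capacity):
    """Place VMs on identical servers with a first-fit decreasing heuristic.

    vms:      dict mapping VM name -> tuple of demands, e.g. (cpu, mem)
    capacity: tuple of per-server capacities in the same order
    Returns a list of servers, each a list of VM names.
    """
    # Order VMs by their largest demand relative to capacity, biggest first.
    order = sorted(
        vms,
        key=lambda name: max(d / c for d, c in zip(vms[name], capacity)),
        reverse=True,
    )

    servers = []  # each entry: {"free": [...], "vms": [...]}
    for name in order:
        demand = vms[name]
        if any(d > c for d, c in zip(demand, capacity)):
            raise ValueError(f"{name} does not fit on an empty server")

        # Use the first server with enough room in every dimension.
        for server in servers:
            if all(d <= f for d, f in zip(demand, server["free"])):
                break
        else:
            # No existing server fits, so open a new one.
            server = {"free": list(capacity), "vms": []}
            servers.append(server)

        server["free"] = [f - d for f, d in zip(server["free"], demand)]
        server["vms"].append(name)

    return [server["vms"] for server in servers]


# Hypothetical demands as (CPU cores, memory in GB) on 8-core, 32 GB servers.
vms = {"a": (4, 8), "b": (4, 8), "c": (2, 20), "d": (2, 4)}
placement = first_fit_decreasing(vms, capacity=(8, 32))
```

A VM fits only if every one of its demands fits, which is what distinguishes the multi-capacity problem from one-dimensional bin packing. A proactive scheduler in the spirit of the abstract would feed predicted rather than current demands into such a placement step.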

Recommended Citation

Lin, Hao, "Large Scale Data Analysis in Parallel R and Its Use in Efficiently Scheduling Batch Jobs in the Cloud" (2018). Open Access Dissertations. 1999.
https://docs.lib.purdue.edu/open_access_dissertations/1999
