Geo-distributed big data processing

Chamikara Madhusanka Jayalath, Purdue University

Abstract

Big data processing is one of the major challenges of this era. Big data arises for many reasons, including applications retaining more information to improve operation, monitoring, or auditing. Many systems have been proposed for handling big data efficiently. MapReduce, popularized by Google, is a widely used model in which data is processed in two essential phases, mapping and reducing. Many workflow systems have also been introduced for efficiently handling multiple big datasets, including Google's FlumeJava and Apache Pig. One major limitation of current big data processing systems is that they assume a single homogeneously addressable cluster of nodes. Most of them are not designed to operate across multiple data centers, and they perform poorly in such environments if they support them at all. Many analysis tasks involve several datasets that are not necessarily stored in the same data center, and some datasets themselves consist of several sub-datasets partitioned across several data centers. In other words, in contrast to the illusion of omnipresent uniform storage and computation resources promoted by cloud vendors, clouds are implemented by concrete data centers with specific locations, and big data is often geographically distributed.

In this dissertation, we present solutions for efficiently handling geographically distributed big data. First, we investigate ways of efficiently processing a single geographically distributed dataset and present G-MR, a tool for executing a sequence of tasks on such a dataset in an optimized manner. Second, we identify ways of efficiently handling multiple geographically distributed datasets using big data workflow systems. We present our languages Rout and DuctWork, and their corresponding systems, which extend the big data workflow languages Apache Pig and Google's FlumeJava, respectively, for defining and executing geographically distributed big data workflows. Third, we present Atmosphere, a distributed middleware system for efficiently communicating data across multiple cloud environments.

Degree

Ph.D.

Advisors

Eugster, Purdue University.

Subject Area

Computer science
