Abstract

The integration of computers into many facets of our lives has made the collection and storage of staggering amounts of data feasible. However, the data on its own is not so useful to us as the analysis and manipulation which allows manageable descriptive information to be extracted. New tools to extract this information from ever growing repositories of data are required.

Some of these analyses can take the form of a two phase problem which is easily distributed to take advantage of available computing power. The first phase involves computing some descriptive partial result from some subset of the original data, and the second phase involves aggregating all the partial results to create a combined output. We formalize this compute-aggregate model for a rigorous performance analysis in an effort to minimize the latency of the aggregation phase with minimal intrusive analysis or modification.

Based on our model we find an aggregation overlay attribute which highly affects aggregation latency and its dependence on an easily findable trait of aggregation. We rigorously prove the dependence and find optimal overlays for aggregation. We use the proven optima to create simple heuristics and build a system, NOAH, to take advantage of the findings. NOAH can be used by big data analysis systems.

We also study an individual problem, top-k matching, to explore the effects of optimizing the computation phase separately from aggregation and create a complete distributed system to fulfill an economically relevant task.

Comments

© 2015 IEEE. Reprinted, with permission, from Culhane, William and Kogan, Kirill and Jayalath, Chamikara and Eugster, Patrick. Optimal Communication Structures for Big Data Aggregation. INFOCOM '15. DOI: 10.1109/INFOCOM.2015.7218544

In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Purdue University's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go tohttp://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.

Keywords

Big data, Cloud computing, Aggregation, Top-k, Matching, Algorithms, Distributed Systems, Systems

Disciplines

Computer Sciences | Theory and Algorithms

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Philosophy

First Advisor

Patrick Eugster

Committee Member 1

Charles Killian

Committee Member 2

Sonia Fahmy

Committee Member 3

Dongyan Xu

Date of Award

5-2015

Recommended Citation

Culhane, William J. IV, "Optimal "Big Data" Aggregation Systems - From Theory to Practical Application" (2015). Open Access Dissertations. 216.
https://docs.lib.purdue.edu/open_access_dissertations/216

Download

Included in

Theory and Algorithms Commons

COinS

Open Access Dissertations

Optimal "Big Data" Aggregation Systems - From Theory to Practical Application

Abstract

Comments

Keywords

Disciplines

Degree Type

Degree Name

Department

First Advisor

Committee Member 1

Committee Member 2

Committee Member 3

Date of Award

Recommended Citation

Included in

Search

Links

Links for Authors

Browse

Open Access Dissertations

Optimal "Big Data" Aggregation Systems - From Theory to Practical Application

Author

Abstract

Comments

Keywords

Disciplines

Degree Type

Degree Name

Department

First Advisor

Committee Member 1

Committee Member 2

Committee Member 3

Date of Award

Recommended Citation

Included in

Share

Search

Links

Links for Authors

Browse