Optimal "big data" aggregation systems – From theory to practical application
Abstract
The integration of computers into many facets of our lives has made the collection and storage of staggering amounts of data feasible. However, the data on its own is less useful to us than the analysis and manipulation that allow manageable descriptive information to be extracted. New tools are required to extract this information from ever-growing repositories of data. Some of these analyses take the form of a two-phase problem that is easily distributed to take advantage of available computing power. The first phase computes a descriptive partial result from each subset of the original data, and the second phase aggregates all the partial results into a combined output. We formalize this compute-aggregate model for a rigorous performance analysis, aiming to minimize the latency of the aggregation phase while requiring minimal intrusive analysis or modification. Based on our model, we identify an attribute of the aggregation overlay that strongly affects aggregation latency, and we characterize its dependence on an easily determined trait of the aggregation. We rigorously prove this dependence and find optimal overlays for aggregation. We use the proven optima to create simple heuristics and build a system, NOAH, that takes advantage of these findings and can be used by big data analysis systems. We also study an individual problem, top-k matching, to explore the effects of optimizing the computation phase separately from aggregation, and we create a complete distributed system to fulfill an economically relevant task.
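The two-phase compute-aggregate model described above can be illustrated with a minimal single-process sketch. It is not NOAH or the dissertation's implementation. The function names (`compute_phase`, `aggregate_phase`), the word-count workload, and the `fan_in` parameter are all hypothetical choices made for illustration. In particular, treating fan-in as the overlay attribute is an assumption, since the abstract does not name the attribute.

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # type of a partial result


def compute_phase(chunks: Sequence[str], compute: Callable[[str], P]) -> List[P]:
    """Phase 1: compute one partial result per subset of the input data."""
    return [compute(chunk) for chunk in chunks]


def aggregate_phase(partials: Sequence[P], merge: Callable[[P, P], P], fan_in: int = 2) -> P:
    """Phase 2: merge partial results level by level over a tree overlay.

    `fan_in` (an illustrative assumption) is the number of children merged
    at each node of the tree.
    """
    if not partials:
        raise ValueError("at least one partial result is required")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            acc = group[0]
            for partial in group[1:]:
                acc = merge(acc, partial)
            next_level.append(acc)
        level = next_level
    return level[0]


def count_words(chunk: str) -> Counter:
    """Hypothetical compute function: word frequencies of one chunk."""
    return Counter(chunk.split())


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Hypothetical aggregation function: sum two frequency tables."""
    return a + b


if __name__ == "__main__":
    chunks = ["a b a", "b c", "a", "c c"]
    partials = compute_phase(chunks, count_words)
    print(aggregate_phase(partials, merge_counts, fan_in=2))
```

Because the merge function here is associative and commutative, the combined output is the same for any fan-in; only the shape of the overlay, and in a distributed setting the latency, would change.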
Degree
Ph.D.
Advisors
Eugster, Purdue University.
Subject Area
Computer science