Embarrassingly Parallel Statistics and its Applications: Divide & Recombine Methods for Parallel Computation of Quantiles and Construction of K-D Trees for Big-Data

Aritra Chakravorty, Purdue University

Abstract

In Divide & Recombine (D&R), data are divided into subsets, analytic methods are applied to each subset independently with no communication between processes, and the subset outputs for each method are then recombined. For big data, this provides almost all of the analytic tasking needed when data are analyzed. It also provides high computational performance because most of the computation is typically embarrassingly parallel, the simplest kind of parallel computation. Another kind of tasking must address both computational performance and numeric accuracy: the computing of functions of all of the data, or “statistics”. For data big and small, it is often important to compute such statistics for all of the data. These can be summaries of the data, such as sample quantiles of continuous variables, or they can process the data into a form that helps analysis, such as dividing the data into representative subsets. Developing computational methods for these statistics can be challenging. D&R can be a very effective framework for computing statistics. To support this, we introduce the concept of embarrassingly parallel (EP) statistics, both weak and strong. The concept of EP statistics is not entirely new, but it has had little development. The existing methodology is mainly sums of sums. For example, this is done when computing the necessary statistics for least squares, where sums of products and cross products are computed on subsets and then summed across subsets. Our treatment of EP statistics takes the concept much further. The outcome is the ability to use EP statistics in conjunction with a Fourier series that approximates an optimization criterion. The series terms, which are strongly EP statistics, are summed across subsets, and the result is optimized. These are EP-F computational methods. We have so far developed two EP-F computational methods for two widely used statistic computations: EP-F-Quantile for quantiles of big data, and EP-F-KDtree for KD-trees. The speed and accuracy of EP-F-Quantile are compared with those of the well-known binning method, which can also be formulated in terms of EP statistics. EP-F-KDtree is the first parallel KD-tree computational method of which we are aware. EP and EP-F computational methods potentially have many other applications to computing statistics.
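The "sums of sums" case mentioned in the abstract can be illustrated with a minimal sketch. It is not the dissertation's EP-F method; it only shows the existing least-squares example, assuming NumPy, simulated data, and an arbitrary division into four subsets. The function names `subset_statistics` and `recombine` are hypothetical and chosen for illustration.

```python
import numpy as np

def subset_statistics(X, y):
    """Per-subset sums of products and cross products: X'X and X'y.
    Each subset is processed with no communication with the others."""
    return X.T @ X, X.T @ y

def recombine(stats):
    """Sum the per-subset statistics, then solve the normal equations."""
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    return np.linalg.solve(xtx, xty)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

# Divide: split the rows into 4 subsets (the number is arbitrary here).
subsets = zip(np.array_split(X, 4), np.array_split(y, 4))

# Apply: compute the statistics on each subset independently.
stats = [subset_statistics(Xs, ys) for Xs, ys in subsets]

# Recombine: sum across subsets and solve.
beta_dr = recombine(stats)

# Reference fit on all of the data at once, for comparison.
beta_all = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the per-subset sums add up exactly to the all-data sums, `beta_dr` should agree with `beta_all` up to floating-point error in this well-conditioned example. Solving the normal equations directly is used here only for brevity and can be numerically less stable than a QR-based fit.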

Degree

Ph.D.

Advisors

Cleveland, Purdue University.

Subject Area

Statistics|Communication|Computer science|Mathematics
