Divide and recombine (D&R) for the analysis of large and complex data sets with application to VoIP

Jin Xia, Purdue University

Abstract

Large complex data sets, or big data, pervade academia, industry, and government. They present substantial challenges to our historical methodological approaches to data analysis, and to our historical computational methods and environments for data analysis. They challenge achieving deep analysis, which means comprehensive detailed analysis that does not lose important information through inappropriate data reductions. They challenge the computational feasibility of using an interactive, highly extensible, domain-specific language for data analysis, such as R, that provides time-efficient programming for the analyst and access to the thousands of methods of statistics and machine learning. Deep analysis using an interactive language is readily achievable today with small data sets; the challenge is to make this scale to big data. Divide and recombine (D&R) consists of statistical methods, computational methods, and computational environments for scaling analysis to big data. The data are divided into subsets in one or more ways. Numeric and visualization methods are applied to each subset, or are applied to each subset in a sample based on methods of statistical sampling and experimental design. Then the results of each method are recombined across subsets. Computational methods and computational environments are critical. One development in computational environments is RHIPE, the R and Hadoop Integrated Programming Environment. It enables D&R to be carried out wholly from within R. Statistical division methods and statistical recombination methods are very broad. They depend on the structure of the data being analyzed, and are an exciting area of research in statistical theory and methods.

NEW RESULTS. Network engineering for the quality-of-service (QoS) of VoIP (voice over the Internet) can benefit substantially from simulation study of VoIP packet traffic queueing on a network of routers. This requires accurate statistical models for the packet arrivals to the queue. D&R was applied with great success to a number of different big data sets with different data structures that arose in carrying out arrival process model building, in running the simulation, and in analyzing the simulation output. All analysis was carried out in R using RHIPE routines. The analyses also provided much insight into carrying out D&R in practice. First, D&R was used to achieve deep analysis to build arrival models. The live packet-level data were from a link of Global Crossing (GBLX), an international service provider. The models together with D&R were used to generate synthetic packet arrivals to simulate the first-in-first-out (FIFO) queue of a VoIP traffic router. Then D&R was used to study the statistical properties of two stochastic processes: packet-level queueing delay and jitter from the simulation. These are the critical QoS variables. Deep analysis through D&R provided substantial information for VoIP network engineering. It also provided the first comprehensive study of the statistical properties of the delay and jitter processes for VoIP.
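The divide, apply, and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in plain Python for illustration only; the dissertation's analyses were carried out in R using RHIPE, and nothing below reproduces RHIPE or its routines. The function names (`divide`, `fit_line`, `recombine`), the random division, the least-squares line as the per-subset method, and the averaging of coefficients as the recombination are all assumptions chosen to keep the example small, not the methods of the dissertation.

```python
import random
import statistics


def divide(records, n_subsets, seed=0):
    """Divide: shuffle the records, then deal them into n_subsets subsets.
    (Random division is an assumption made for this sketch.)"""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Apply: ordinary least squares for y = intercept + slope * x on one subset."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)


def recombine(estimates):
    """Recombine: average the per-subset coefficient estimates.
    (Simple averaging is an assumption made for this sketch.)"""
    intercepts, slopes = zip(*estimates)
    return (statistics.fmean(intercepts), statistics.fmean(slopes))


def divide_and_recombine(records, n_subsets):
    subsets = divide(records, n_subsets)
    estimates = [fit_line(s) for s in subsets]  # independent per-subset work
    return recombine(estimates)


# Synthetic data (hypothetical): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 0.5)))

intercept, slope = divide_and_recombine(data, n_subsets=20)
print(f"intercept={intercept:.3f} slope={slope:.3f}")
```

Because each subset is fitted independently of the others, the apply step is the part that a distributed environment can run in parallel; only the small per-subset results need to be brought together for recombination.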

Degree

Ph.D.

Advisors

Ward, Purdue University.

Subject Area

Statistics

