Distributed Bootstrap for Massive Data

Yang Yu, Purdue University

Abstract

Modern massive data, with enormous sample size and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework. First, we propose two generic distributed bootstrap methods, k-grad and n+k-1-grad, which apply the multiplier bootstrap at the master node to the gradients communicated across nodes. Based on them, we develop a communication-efficient method of producing an ℓ∞-norm confidence region using distributed data with dimensionality not exceeding the local sample size. Our theory establishes the communication efficiency by providing a lower bound τ_min on the number of communication rounds that warrants the statistical accuracy and efficiency, and by showing that τ_min increases only logarithmically with the number of workers and the dimensionality. Our simulation studies validate our theory. Then, we extend k-grad and n+k-1-grad to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an ℓ∞-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound τ_min on the number of communication rounds that warrants the statistical accuracy and efficiency. Furthermore, τ_min increases only logarithmically with the number of workers and the intrinsic dimensionality, while remaining nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.
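
The idea of applying a multiplier bootstrap at the master node to the gradients received from the workers can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `k_grad_sup_quantile`, its parameters, the Gaussian multipliers, and the simplifying assumption that the inverse Hessian is the identity (or has already been applied to each gradient) are all assumptions made here for illustration. The sketch perturbs the k centered worker gradients with random multipliers and returns an empirical quantile of the resulting sup-norm statistic.

```python
import math
import random


def k_grad_sup_quantile(worker_grads, n_local, alpha=0.05, n_boot=1000, seed=0):
    """Hypothetical sketch of a multiplier bootstrap over k worker gradients.

    worker_grads: list of k gradient vectors (one per worker), each of length d.
    n_local: local sample size on each worker (assumed equal across workers).
    Assumes the inverse Hessian is the identity or was already applied.
    Returns the empirical (1 - alpha) quantile of the sup-norm statistic.
    """
    rng = random.Random(seed)
    k = len(worker_grads)
    d = len(worker_grads[0])

    # Center each worker gradient around the average gradient.
    mean = [sum(g[l] for g in worker_grads) / k for l in range(d)]
    centered = [[g[l] - mean[l] for l in range(d)] for g in worker_grads]

    # sqrt(n / k) puts the perturbed sum on the scale of sqrt(N) * (estimate - truth).
    scale = math.sqrt(n_local / k)

    stats = []
    for _ in range(n_boot):
        # One independent standard normal multiplier per worker.
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        draw = [scale * sum(eps[j] * centered[j][l] for j in range(k))
                for l in range(d)]
        stats.append(max(abs(v) for v in draw))  # sup-norm of this draw

    stats.sort()
    idx = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    return stats[idx]
```

Under these assumptions, a simultaneous confidence region would take each coordinate of the estimate plus or minus the returned quantile divided by the square root of the total sample size n_local * k. Only the k gradient vectors need to reach the master node, and the bootstrap draws themselves require no further communication.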

Degree

Ph.D.

Advisors

Cheng, Purdue University.

Subject Area

Communication|Electrical engineering
