Divide and Recombine for Large and Complex Data: Model Likelihood Functions Using MCMC and TRMM Big Data Analysis

Abstract

Divide & Recombine (D&R) is a powerful and practical statistical framework for the analysis of large and complex data. In D&R, big data are divided into subsets, each analytic method is applied to the subsets with no communication among them, and the outputs are recombined to form a result of the analytic method for the entire data. This enables deep analysis and practical computational performance. The aim of this thesis is to provide an innovative D&R procedure to model the likelihood of the generalized linear model for large data sets using Markov chain Monte Carlo (MCMC) methods, and to present an analysis of Tropical Rainfall Measuring Mission (TRMM) data using the DeltaRho D&R computational environment. The first chapter briefly introduces the DeltaRho computational environment, followed by an introduction to the univariate and multivariate skew-normal distributions and the derivation of parameter estimates from sample moments. A basic introduction to MCMC sampling is then provided, because MCMC sampling can be used to characterize the posterior distribution in Chapter 3. Finally, the chapter closes with STL, a nonparametric procedure for decomposing a seasonal time series into seasonal, trend, and remainder components. In the second chapter, an innovative D&R procedure is proposed to compute likelihood functions of data-model (DM) parameters for big data. The likelihood-model (LM) is a parametric probability density function of the DM parameters. The density parameters are estimated by fitting the density to MCMC draws from each subset DM likelihood function, and the fitted densities are then recombined. The procedure is illustrated using normal and skew-normal LMs for the logistic regression DM on simulated data. In addition, a novel diagnostic method is developed to measure the degree of similarity between the fitted density and the true likelihood function, with a real-data application illustrated in a later section.
The last chapter presents an analysis of TRMM big data using the DeltaRho D&R computational environment. First, exploratory data analysis is conducted to investigate the spatial patterns of precipitation and the seasonal behavior of rain rates at different time scales. Then, spatio-temporal logistic models are constructed in an automated fashion for 460,800 locations to explain the variation of 3-hr precipitation occurrence, followed by model diagnostics and model inference. Furthermore, more advanced predictive models (a two-stage logistic regression model, a spatio-temporal autologistic regression model, and a neighbor recurrent logistic regression model) are developed to forecast the probability of 3-hr precipitation occurrence at all locations. Finally, the chapter ends with the application of spatio-temporal logistic models to daily heavy rainfall data.
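The D&R likelihood-modeling procedure described in the abstract (MCMC draws from each subset likelihood, a fitted density per subset, then recombination) can be illustrated with a minimal sketch for the normal LM and a logistic regression DM. This is not the thesis's implementation: the random-walk Metropolis sampler, its step size, the flat prior, and all function names (`log_lik`, `metropolis`, `fit_normal`, `recombine`, `simulate`, `dr_normal_likelihood`) are assumptions made for illustration. The recombination rule used here, multiplying the fitted normal densities, is likewise an assumption; it follows from the full-data likelihood being the product of the subset likelihoods when observations are independent.

```python
import numpy as np

def log_lik(beta, X, y):
    """Logistic regression log-likelihood for one subset."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def metropolis(X, y, n_draws=4000, burn=1000, step=0.08, seed=0):
    """Random-walk Metropolis draws from the normalized subset likelihood.

    Assumption: a flat prior, so the target is proportional to the likelihood.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    cur = log_lik(beta, X, y)
    draws = []
    for i in range(n_draws + burn):
        prop = beta + step * rng.standard_normal(beta.size)
        new = log_lik(prop, X, y)
        # Accept with probability min(1, L(prop) / L(beta)).
        if np.log(rng.uniform()) < new - cur:
            beta, cur = prop, new
        if i >= burn:
            draws.append(beta.copy())
    return np.array(draws)

def fit_normal(draws):
    """Fit the normal LM to the draws by sample mean and covariance."""
    return draws.mean(axis=0), np.cov(draws, rowvar=False)

def recombine(fits):
    """Multiply fitted normal densities: precisions add, means are precision-weighted."""
    precisions = [np.linalg.inv(S) for _, S in fits]
    cov = np.linalg.inv(sum(precisions))
    mean = cov @ sum(P @ m for P, (m, _) in zip(precisions, fits))
    return mean, cov

def simulate(n, beta, seed=0):
    """Simulate logistic regression data with an intercept and one covariate."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X, rng.binomial(1, p)

def dr_normal_likelihood(X, y, n_subsets=4):
    """Divide rows into subsets, fit a normal LM per subset, then recombine."""
    index_sets = np.array_split(np.arange(len(y)), n_subsets)
    fits = [fit_normal(metropolis(X[idx], y[idx], seed=k))
            for k, idx in enumerate(index_sets)]
    return recombine(fits)

if __name__ == "__main__":
    X, y = simulate(2000, np.array([-0.5, 1.0]), seed=1)
    mean, cov = dr_normal_likelihood(X, y)
    print("recombined mean:", mean)
    print("recombined covariance:\n", cov)
```

Each subset is processed independently, so the list comprehension in `dr_normal_likelihood` is the step that would run in parallel in a D&R environment. The skew-normal LM mentioned in the abstract needs a different fitting and recombination step and is not covered by this sketch.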

Degree

Ph.D.

Cleveland, Purdue University.

Subject Area

Statistics
