# Divide and Recombine for large complex data: Nonparametric-regression modelling of spatial and seasonal-temporal time series

12-2016

Dissertation

## Degree Name

Doctor of Philosophy (PhD)

Statistics

## First Advisor

William S. Cleveland

## Committee Chair

William S. Cleveland

## Committee Members

Anindya Bhadra

Ryan Hafen

Boyu Zhang

Hao Zhang

## Abstract

In the first chapter of this dissertation, I briefly introduce one type of nonparametric regression method, local polynomial regression (loess), with emphasis on a specific application of loess to time series decomposition, Seasonal Trend Loess (STL). The chapter closes with an introduction to the Divide and Recombine (D&R) statistical framework: the data are divided into subsets, and the same statistical analysis method is applied to each subset. This is an embarrassingly parallel procedure, since no communication between subsets is needed. The analysis results for the subsets are then combined to form the final analysis outcome for the whole dataset. The main purpose of this chapter is to lay the methodological foundation for the data analysis in later chapters, which depends crucially on these topics.
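
The divide-apply-recombine pattern described above can be sketched minimally. This is an illustrative sketch only: the records, division keys, and per-subset analysis (a simple mean) are hypothetical stand-ins, not the dissertation's actual data or fitting method.

```python
import statistics

# Hypothetical data: (subset_key, value) pairs for the whole dataset.
records = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 6.0)]

# Divide: group records into subsets by key.
subsets = {}
for key, value in records:
    subsets.setdefault(key, []).append(value)

# Apply: run the same analysis on each subset independently
# (embarrassingly parallel -- no communication between subsets).
per_subset = {key: statistics.mean(vals) for key, vals in subsets.items()}

# Recombine: merge the subset results into one overall outcome.
overall = statistics.mean(per_subset.values())
```

Because the apply step touches each subset in isolation, it can be distributed across workers; only the small per-subset results travel to the recombination step.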

In the second chapter, a new statistical method for the analysis of spatial seasonal-temporal datasets is proposed, named Spatial Seasonal Trend Loess (SSTL). The chapter starts by illustrating the main steps of the SSTL analysis routine. Next, a spatial-temporal dataset from the National Climatic Data Center (NCDC) is introduced and used as an example to walk through the routine step by step. The modeling results are evaluated with diagnostic visual displays in the third chapter.

In the third chapter, I illustrate the procedure for choosing the best smoothing parameters for the spatial and temporal smoothing, respectively. For the spatial smoothing, cross-validation is used to choose the best smoothing span and degree over all 576 months collectively. The training and testing datasets are determined with the near-exact-replicates framework, which splits the original dataset into subsets. For the temporal STL+ fitting, an experiment tuning the smoothing parameters is conducted on data after 1950 to choose the best smoothing parameters in terms of prediction ability.
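
The span-selection idea can be sketched with leave-one-out cross-validation. This is a toy stand-in, not the dissertation's procedure: the data are a synthetic noisy sine curve, the smoother is a kernel-weighted local mean rather than the actual loess fit, and the candidate spans are made up for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical 1-D stand-in for the smoothing problem: a noisy sine
# curve (not the actual NCDC data used in the dissertation).
xs = [i / 50 for i in range(50)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.2) for x in xs]

def local_mean(x0, xr, yr, span):
    # Kernel-weighted local mean at x0: a crude stand-in for the
    # local polynomial (loess) fit whose span is being tuned.
    w = [math.exp(-((x - x0) / span) ** 2) for x in xr]
    return sum(wi * yi for wi, yi in zip(w, yr)) / sum(w)

def cv_error(span):
    # Leave-one-out cross-validation: predict each held-out point
    # from the remaining data and accumulate squared error.
    err = 0.0
    for i in range(len(xs)):
        xr = xs[:i] + xs[i + 1:]
        yr = ys[:i] + ys[i + 1:]
        err += (ys[i] - local_mean(xs[i], xr, yr, span)) ** 2
    return err / len(xs)

# Choose the candidate span with the smallest cross-validation error.
spans = [0.02, 0.05, 0.1, 0.2]
best = min(spans, key=cv_error)
```

A small span tracks the signal closely but chases noise; a large span oversmooths. Cross-validation picks the span that balances the two in terms of prediction error, which is the same criterion used for the spatial smoothing above.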

In the fourth chapter, I discuss the generalization of the SSTL routine under the Divide and Recombine framework for big, complex spatial-temporal datasets. Because of the flexibility of SSTL for large and computationally heavy datasets, it is natural to embed the SSTL method into the Divide and Recombine framework; the result is called drSSTL. Either a by-month division or a by-station division of the dataset can be generated, and a corresponding analysis method is then applied to each. The drSSTL routine consists of a series of MapReduce jobs, and all fitting results are saved on HDFS. Some of the diagnostic procedures discussed in the third chapter are also illustrated here, with more detail about their parallel implementation. In the first section, I illustrate the MapReduce job that downloads the target dataset in parallel. Next, I explore the details of each step of drSSTL as a MapReduce job carrying out the smoothing procedure on the different divisions.
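
The two divisions can be sketched as the map/shuffle half of a MapReduce job. The records, station identifiers, and values here are hypothetical; in the actual routine, grouping happens across a cluster and the grouped subsets live on HDFS rather than in a local dict.

```python
from collections import defaultdict

# Hypothetical records: (station_id, month, observed value).
records = [
    ("s1", "1950-01", 3.2), ("s1", "1950-02", 4.1),
    ("s2", "1950-01", 1.8), ("s2", "1950-02", 2.5),
]

def divide(records, by):
    # Map step: emit (division key, record); the shuffle then groups
    # all records sharing a key into one subset.
    groups = defaultdict(list)
    for station, month, value in records:
        key = month if by == "month" else station
        groups[key].append((station, month, value))
    return dict(groups)

# By-month division: each subset holds every station for one month,
# the shape needed for spatial smoothing over station locations.
by_month = divide(records, "month")

# By-station division: each subset holds one station's full time
# series, the shape needed for the temporal STL+ fitting.
by_station = divide(records, "station")
```

Switching between the two divisions is itself a MapReduce job: re-keying the same records reshapes the data for whichever smoothing step comes next.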

In the last chapter, I analyze the performance of the drSSTL routine for large spatial-temporal datasets. There are two groups of tuning parameters with potential influence on the performance of the routine. One group consists of the tuning parameters of the statistical model, which control the complexity of the smoothing procedures of the spatial loess and STL+ fitting. The other group consists of user-tunable Hadoop parameters. In this chapter, I present several pilot experiments and full factorial experiments to study the effect of these tuning parameters on the elapsed time of the drSSTL routine.
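
A full factorial design over such tuning parameters can be enumerated as the cross product of all factor levels. The factor names and levels below are illustrative placeholders, not the factors or values used in the dissertation's experiments.

```python
from itertools import product

# Hypothetical factors: two model tuning parameters and two Hadoop
# parameters, each at two levels (illustrative values only).
factors = {
    "spatial_span": [0.015, 0.025],
    "temporal_window": [241, 481],
    "map_memory_mb": [2048, 4096],
    "hdfs_blocksize_mb": [64, 128],
}

# Full factorial design: one timed run of the routine per combination
# of factor levels, so effects and interactions can be estimated.
names = list(factors)
runs = [dict(zip(names, levels)) for levels in product(*factors.values())]
# 2 * 2 * 2 * 2 = 16 runs
```

Pilot experiments narrow each factor to a few plausible levels first, keeping the size of the full factorial (which grows multiplicatively with levels) manageable.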
