Improving Big Data Box-Cox Transformation on Spark

Huayi Fang, Purdue University


This study investigates how the Box-Cox Information Array can improve Spark computation when it is used to fit linear regression models. To find the linear regression model that best fits the data, traditional methods have to read the whole data set many times, which is time-consuming. Apache Spark can train linear regression models efficiently on distributed clusters because it processes data in memory. However, if the data set is very large or the computation produces a lot of temporary data, Spark has to spill data to disk and read it back later, and these frequent I/O operations slow the computation. With the method proposed by Zhang and Yang (2017), the information needed for linear regression can be held in memory in a small matrix called the Box-Cox Information Array. Building this array requires only a single scan of the raw data, and the best linear regression model can then be obtained at once from the array. This study applies the Box-Cox Information Array method in Spark to understand how it affects Spark's computation performance. The experiment shows that, when training 41 models, the Box-Cox Information Array method is about 8 times faster than the existing API provided in Apache Spark, and it has better prediction performance.
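The single-scan idea can be illustrated with a minimal sketch. It is not the Box-Cox Information Array as defined by Zhang and Yang (2017), whose exact layout is not given here. It only shows the general principle of accumulating, in one pass, the sums needed to fit an ordinary least squares model for every candidate Box-Cox parameter. The function names (`scan_once`, `best_model`), the 41-point grid from -2 to 2, and the synthetic data are all assumptions made for illustration, and plain Python stands in for Spark.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of a positive response value."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def scan_once(rows, lambdas):
    """One pass over (features, y) rows, accumulating sufficient statistics."""
    n, sum_log_y, xtx, xty, yty = 0, 0.0, None, None, None
    for x, y in rows:
        z = [1.0] + list(x)                      # prepend intercept column
        if xtx is None:
            p = len(z)
            xtx = [[0.0] * p for _ in range(p)]  # X^T X (shared by all lambdas)
            xty = [[0.0] * p for _ in lambdas]   # X^T y(lambda), one per lambda
            yty = [0.0 for _ in lambdas]         # y(lambda)^T y(lambda)
        n += 1
        sum_log_y += math.log(y)
        for i in range(p):
            for j in range(p):
                xtx[i][j] += z[i] * z[j]
        for k, lam in enumerate(lambdas):
            t = boxcox(y, lam)
            yty[k] += t * t
            for i in range(p):
                xty[k][i] += z[i] * t
    return n, sum_log_y, xtx, xty, yty

def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = m[r][p] - sum(m[r][k] * beta[k] for k in range(r + 1, p))
        beta[r] = s / m[r][r]
    return beta

def best_model(stats, lambdas):
    """Fit every candidate from the accumulated sums; no second data scan."""
    n, sum_log_y, xtx, xty, yty = stats
    best = None
    for k, lam in enumerate(lambdas):
        beta = solve(xtx, xty[k])
        # Residual sum of squares at the least squares solution.
        rss = yty[k] - sum(b * v for b, v in zip(beta, xty[k]))
        # Box-Cox profile log-likelihood, up to an additive constant.
        ll = -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum_log_y
        if best is None or ll > best[0]:
            best = (ll, lam, beta)
    return best

# Synthetic data (assumed): sqrt(y) is linear in x, so lambda = 0.5 fits best.
rows = []
for i in range(1, 51):
    x = i / 10.0
    noise = 0.01 * ((i * 7) % 11 - 5)
    rows.append(([x], (1.0 + 2.0 * x + noise) ** 2))

lambdas = [k / 10.0 for k in range(-20, 21)]     # 41 candidate models
stats = scan_once(rows, lambdas)
ll, lam, beta = best_model(stats, lambdas)
print(lam, beta)
```

Because every accumulated quantity is a plain sum over rows, partial results computed on separate partitions could be merged by addition, which is what makes this kind of summary a natural fit for a distributed engine such as Spark.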




Yang, Purdue University.

Subject Area

Computer science
