Hadoop Based Algorithm for Computing Linear Regression

Tian Wang, Purdue University

Abstract

As machine learning and big data analysis play an increasingly important role in both industry and academia, researchers spend a large amount of time searching for accurate models that can help predict the trend of a given phenomenon. Current packages and functions in R, Hadoop and RHadoop require accessing the entire data set each time a new set of parameters needs to be evaluated. This is extremely time-consuming when the data is big and disk I/O is slow. This study implemented a one-read-multiple-evaluation technique that can greatly reduce the time needed to find the best model from multiple sets of parameters. In the RHadoop test environment, finding the best Box-Cox transformed linear model from 41 potential parameters with the proposed approach was about 25 times faster than with the linear models on RHadoop when the training dataset was about 12.4 GB. Results also showed that the scheme scales as the data grows and as more sets of parameters need to be compared.
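
The abstract does not spell out how the one-read-multiple-evaluation technique is implemented, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes a single predictor, a Box-Cox transform applied to the response, and ordinary least squares fitted from running sums; the names box_cox and one_pass_fit are hypothetical. Each record is read once, and the sums needed by every candidate parameter are updated during that same read.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive response value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def one_pass_fit(rows, lambdas):
    """Fit z = a + b*x for every candidate lambda from one pass over rows.

    rows is any iterable of (x, y) pairs with y > 0 (assumed layout).
    Returns {lambda: (intercept, slope, residual_sum_of_squares)}.
    """
    n = 0
    sum_x = 0.0
    sum_xx = 0.0
    # Per-lambda running sums: [sum z, sum x*z, sum z*z].
    stats = {lam: [0.0, 0.0, 0.0] for lam in lambdas}

    for x, y in rows:  # the only read of the data
        n += 1
        sum_x += x
        sum_xx += x * x
        for lam, s in stats.items():
            z = box_cox(y, lam)
            s[0] += z
            s[1] += x * z
            s[2] += z * z

    fits = {}
    for lam, (sum_z, sum_xz, sum_zz) in stats.items():
        # Closed-form least squares for one predictor.
        slope = (n * sum_xz - sum_x * sum_z) / (n * sum_xx - sum_x * sum_x)
        intercept = (sum_z - slope * sum_x) / n
        # At the least-squares solution, RSS = z'z - a*sum(z) - b*sum(x*z).
        rss = sum_zz - intercept * sum_z - slope * sum_xz
        fits[lam] = (intercept, slope, rss)
    return fits


if __name__ == "__main__":
    data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
    for lam, fit in one_pass_fit(data, [0.0, 0.5, 1.0]).items():
        print(lam, fit)
```

In a MapReduce setting, sums of this kind could be computed per data block and then added together, because they are additive across records; that mapping is also an assumption here rather than something the abstract states.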

Degree

M.S.

Advisors

Zhang, Purdue University.

Subject Area

Computer science
