The Approach to Ridge Regression for Big Data: An Examination
Ridge regression is a technique for handling highly correlated predictors in regression analysis. Like other traditional techniques, it runs into a common limitation when the data set is larger than the available memory or occupies most of it: the analysis fails with a memory error, either while loading the data into memory or during the computation step. Sampling and adding memory are two possible workarounds. However, sampling may introduce unknown bias when the population is enormous, and adding memory incurs high hardware costs. The method proposed by Zhang and Yang (2017b) addresses both the memory limitation and the cost. It reads the whole data set only once and processes it in separate parts; unlike the traditional method, it does not need to read the entire data set repeatedly. This study aims to show that the new method provides a fast way to fit ridge regression and yields an exact result rather than an approximation. Three experiments examine (i) whether the new method produces results sooner than other methods, (ii) whether it can handle data sets too large for other methods, and (iii) whether its results have better predictive accuracy than those of other methods.
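The abstract does not spell out the algorithm, so the following is only a minimal sketch of one generic way to obtain an exact ridge solution from a single pass over data that arrives in parts. It is not necessarily the procedure of Zhang and Yang (2017b). It assumes the data can be supplied as an iterable of (X, y) chunks, that the number of predictors is small enough for a p-by-p matrix to fit in memory, and that no intercept or standardization is needed. The function name `ridge_one_pass` and the parameter `lam` are hypothetical names chosen for illustration.

```python
import numpy as np


def ridge_one_pass(chunks, lam):
    """Exact ridge coefficients from a single pass over (X, y) chunks.

    Assumption (not from the abstract): only the running sums X'X and X'y
    are kept in memory, so each chunk can be discarded after it is read.
    """
    xtx = None  # running sum of X'X, shape (p, p)
    xty = None  # running sum of X'y, shape (p,)
    for X, y in chunks:
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if xtx is None:
            p = X.shape[1]
            xtx = np.zeros((p, p))
            xty = np.zeros(p)
        xtx += X.T @ X
        xty += X.T @ y
    if xtx is None:
        raise ValueError("no data chunks were supplied")
    # Solve (X'X + lam * I) beta = X'y; no intercept, no standardization.
    return np.linalg.solve(xtx + lam * np.eye(xtx.shape[0]), xty)
```

Because the sums of X'X and X'y over the chunks equal the full-data X'X and X'y, this sketch returns the same coefficients as solving the ridge normal equations on the whole data set at once, up to floating-point rounding.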
Yang, Purdue University.
Information Technology|Computer science