Date of Award
5-2018
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer and Information Technology
Committee Chair
Baijian Yang
Committee Member 1
John A. Springer
Committee Member 2
Tonglin Zhang
Abstract
This research proposes a new fitting algorithm for logistic regression based on IRWLS (iteratively reweighted least squares) that scans the data row by row and obtains an exact result within only a few iterations. The research also implements a distributed, parallel version of the proposed method on Spark and conducts experiments to demonstrate its memory advantage over traditional methods such as the Spark MLlib package. The results show that the proposed method provides an exact rather than an approximate result within 5 or 6 iterations; achieves satisfactory accuracy for flight delay prediction within 1 or 2 iterations; has better potential for parallelization and, even without full optimization, runs 3-4x faster than MLlib; and maintains its performance as the data-to-memory ratio increases.
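The row-by-row idea in the abstract can be illustrated with a minimal sketch of textbook IRWLS, in which each iteration makes one pass over the rows and accumulates only the small matrices X^T W X and X^T W z. This is not the thesis's algorithm or its Spark implementation; the function names (`irwls_logistic`, `solve_linear_system`), the toy data, and the default iteration count are assumptions made for illustration.

```python
import math


def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def irwls_logistic(rows, n_features, iterations=6):
    """Fit logistic regression with IRWLS (hypothetical helper).

    rows: list of (x, y) pairs; x is a feature list that already
    contains the intercept term 1.0, and y is 0 or 1.
    """
    beta = [0.0] * n_features
    for _ in range(iterations):
        # Only these p x p and p x 1 accumulators are kept in memory.
        xtwx = [[0.0] * n_features for _ in range(n_features)]
        xtwz = [0.0] * n_features
        for x, y in rows:  # one scan of the data per iteration
            eta = sum(b * xi for b, xi in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            # w * z, where z = eta + (y - mu) / w is the working response
            wz = w * eta + (y - mu)
            for i in range(n_features):
                xtwz[i] += x[i] * wz
                for j in range(n_features):
                    xtwx[i][j] += w * x[i] * x[j]
        beta = solve_linear_system(xtwx, xtwz)
    return beta


# Toy, non-separable data: intercept plus one feature.
rows = [([1.0, float(v)], y) for v, y in zip(range(6), [0, 0, 1, 0, 1, 1])]
print(irwls_logistic(rows, n_features=2))
```

Because each row contributes an independent term to both accumulators, the per-row sums could in principle be computed on separate partitions and added together, which is the property a distributed version would rely on.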
Recommended Citation
Wang, Mengyao, "Performance Enhancement of Logistic Regression for Big Data on Spark" (2018). Open Access Theses. 1471.
https://docs.lib.purdue.edu/open_access_theses/1471