Performance Enhancement of Logistic Regression for Big Data on Spark

Mengyao Wang, Purdue University


This research proposes a new fitting algorithm for logistic regression based on iteratively reweighted least squares (IRWLS). The algorithm scans the data row by row and obtains an exact result within only a few iterations. The research also implements a distributed, parallel version of the proposed method on Spark and conducts experiments to demonstrate its memory advantage over traditional methods such as the Spark MLlib package. The results show that the proposed method provides an exact result, rather than an approximate one, within 5 or 6 iterations; achieves satisfactory accuracy for flight delay prediction within 1 or 2 iterations; has better potential for parallelization and runs 3 to 4 times faster than MLlib, even without full optimization; and maintains its performance as the data-to-memory ratio increases.
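The row-by-row idea can be illustrated with a minimal single-machine sketch. It is not the dissertation's implementation: it shows textbook IRWLS in which each iteration makes one pass over the rows and keeps only the p x p matrix X'WX and the p-vector X'Wz in memory. The names `sigmoid`, `solve`, and `irwls_logistic` are hypothetical, and the default of 6 iterations is an assumption taken from the iteration counts quoted above. Because the accumulators are plain sums over rows, per-partition sums could in principle be added together on a cluster, although how the author distributes the work on Spark is not stated here.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def irwls_logistic(rows, p, iterations=6):
    """Fit logistic regression by IRWLS.

    rows: re-scannable sequence of (x, y) pairs, x a list of p floats
          (include a leading 1.0 for an intercept), y in {0, 1}.
    Only O(p^2) state is held besides the row currently being read.
    """
    beta = [0.0] * p
    for _ in range(iterations):
        xtwx = [[0.0] * p for _ in range(p)]  # accumulates X' W X
        xtwz = [0.0] * p                      # accumulates X' W z
        for x, y in rows:                     # one pass, row by row
            eta = sum(b * v for b, v in zip(beta, x))
            mu = sigmoid(eta)
            w = mu * (1.0 - mu)               # IRWLS weight
            wz = w * eta + (y - mu)           # w * z, z = eta + (y - mu) / w
            for j in range(p):
                xtwz[j] += x[j] * wz
                for k in range(p):
                    xtwx[j][k] += w * x[j] * x[k]
        beta = solve(xtwx, xtwz)              # weighted least-squares update
    return beta
```

For example, `irwls_logistic([([1.0, 0.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 0), ([1.0, 3.0], 1)], p=2)` returns an intercept and a slope. The sketch assumes the classes are not perfectly separable, since otherwise the maximum likelihood estimate does not exist and the weights collapse toward zero.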




Yang, Purdue University.

Subject Area

Mathematics|Statistics|Computer science
