Performance Enhancement of Logistic Regression for Big Data on Spark

Abstract

This research proposes a new fitting algorithm for logistic regression, based on iteratively reweighted least squares (IRWLS), that scans the data row by row and obtains an exact result within only a few iterations. It also implements a distributed, parallel version of the proposed method on Spark and conducts various experiments to demonstrate its memory-wise advantage over traditional methods such as the Spark MLlib package. The results show that the proposed method provides an exact result, rather than an approximate one, within 5 or 6 iterations; achieves satisfactory accuracy for flight delay prediction within 1 or 2 iterations; has better potential for parallelization and better performance than MLlib, running 3-4x faster without full optimization; and does not lose performance as the data-to-memory ratio increases.
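
The row-by-row idea can be illustrated with a minimal sketch. This is not the thesis's implementation: it is a textbook IRWLS loop in plain Python, and the names `solve_linear`, `irwls_pass`, and `fit_logistic` are hypothetical. Each pass scans the rows once and accumulates the weighted normal equations, so only a p-by-p matrix and a length-p vector are held in memory, never the full design matrix. Because these accumulators are plain sums over rows, partial sums from separate data partitions could in principle be added together, which is one reason such a scheme lends itself to parallelization.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def irwls_pass(rows, beta):
    """One IRWLS iteration: a single scan accumulating X'WX and X'Wz."""
    p = len(beta)
    xtwx = [[0.0] * p for _ in range(p)]
    xtwz = [0.0] * p
    for x, y in rows:
        eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
        mu = 1.0 / (1.0 + math.exp(-eta))              # fitted probability
        w = mu * (1.0 - mu)                            # IRWLS weight
        wz = w * eta + (y - mu)                        # weight * working response
        for i in range(p):
            xtwz[i] += x[i] * wz
            for j in range(p):
                xtwx[i][j] += w * x[i] * x[j]
    return solve_linear(xtwx, xtwz)

def fit_logistic(rows, n_features, iterations=6):
    """Run a fixed number of IRWLS passes starting from beta = 0."""
    beta = [0.0] * n_features
    for _ in range(iterations):
        beta = irwls_pass(rows, beta)
    return beta

# Toy data (made up for illustration): intercept column plus one binary feature.
rows = ([([1.0, 0.0], 0)] * 3 + [([1.0, 0.0], 1)]
        + [([1.0, 1.0], 0)] + [([1.0, 1.0], 1)] * 3)
beta = fit_logistic(rows, n_features=2)
```

For this toy data the maximum-likelihood estimate is known in closed form (intercept log(1/3), slope 2 log 3), so the fitted `beta` can be checked against it directly.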

Degree

M.S.

Advisors

Yang, Purdue University.

Subject Area

Mathematics|Statistics|Computer science

