Studying the effect of multi-query functionality on a correlation-aware SQL-to-mapreduce translator in Hadoop version 2

Thivviyan Amirthalingam, Purdue University


The advent of big data has prompted both industry and research to seek solutions that meet the need for handling data with high volume, veracity, velocity, and variety. The notion of ever-increasing data was first publicized in 1944 by Fremont Rider, who argued that the libraries of American universities were doubling in size every sixteen years (Press, 2013). With the arrival of the digital storage era, it became easier than ever to store and manage large volumes of data. The need for efficient big data systems is now further fueled by the "Internet of Things," which opens the floodgates to flows of information never seen before. These phenomena have called for a simpler and more scalable environment with high fault tolerance and control over availability. With that motivation in mind, and as an alternative to relational databases, numerous Not-Only Structured Query Language (NoSQL) databases were conceived. Nonetheless, relational databases and their de facto language, Structured Query Language (SQL), remain prominent among wider user groups. This thesis project works toward bridging the gap between Hadoop and relational databases by adding multi-query functionality to a SQL-to-MapReduce translator. The research also upgrades the translator to a newer Hadoop version in order to use tools and features added since its original deployment, and analyzes the modified translator's behavior under different sets of conditions. A regression model was devised for each experiment and is presented as a means of understanding the data collected and of making future estimates.




Springer, Purdue University.

Subject Area

Information Technology
