Supervised precision ordinal clustering – A human-machine learning algorithm to create accurate clusters in big datasets: Application to Indiana water quality data with novel visualization techniques

Sarabjit Singh, Purdue University

Abstract

A new and more effective cluster analysis algorithm and novel visualization techniques have been developed for big datasets. Indiana multiparameter water quality data spanning multiple decades were used to demonstrate the effectiveness of the new methods. In the first phase, the water quality dataset was transformed from its native form into one suitable for machine-learning applications. Data from two EPA systems, the Legacy Data Center (LDC) and STORET, were merged to obtain all water quality measurements recorded in Indiana over six decades. Ten physical, ten metal, and seven nutrient parameters were selected because of their high coverage. Average daily measurements for all parameters at the same station were rearranged as a single 27-dimension tuple. A new missing value completion technique, "Expanding Temporal Horizon" averaging, was developed and reduced missing values from 68% to 33%. Measurements at least three times a parameter's population mean were considered significant, and a binary version of the data was created based on this criterion. In the second phase, all multiparameter tuples were clustered using two traditional clustering algorithms, K-Means and Expectation Maximization, by both standard and binary clustering. A new bottom-up clustering technique, "Supervised Precision Ordinal Clustering" (SPOC), was also developed to create clusters. SPOC is a hybrid human-machine or "humachine" learning algorithm that incorporates human subject-matter expertise into machine learning. It outperformed both traditional clustering algorithms on the number of precision clusters, the parameters and significant measurements represented, and consumer's accuracy. In the third phase, two new visualization techniques were developed using Microsoft Excel: (1) a chart form to represent temporal trends in cluster classification at individual stations, and (2) a geographically accurate, raster-map-like visualization to display and model spatiotemporal trends.
The latter technique was used to track changes in the spatiotemporal distribution of turbidity between the 1990s and 2000s by three indices: all measurements, significant measurements, and SPOC clusters. The results by all three indices were shown to be aligned as expected. Analysis of Indiana water quality using the above techniques shows that the metals Cadmium, Chromium, Copper, Lead, Manganese, Mercury, Nickel, and Zinc, and other parameters including Alkalinity, BOD, Arsenic, TOC, Cyanide, Nitrogen-Kjeldahl, and Phosphorus, had the highest levels of significant measurements and associated stations in the 1960s and 1970s. They gradually declined starting in the 1980s, likely due to the introduction of stricter water quality regulations. Parameters like COD, Conductance, E. coli, Hardness, TSS, Turbidity, Nitrogen-Nitrite & Nitrate, and Sulfate show an increase in recent decades, likely due to an increase in population and more extensive water quality monitoring.
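
The significance criterion described in the abstract (a measurement counts as significant when it is at least three times its parameter's population mean) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The function name `binarize_significant`, the use of `None` for missing values, the choice to compute each mean over observed values only, and the choice to flag missing values as 0 are all assumptions made for illustration, and the sample rows are made-up numbers rather than Indiana data.

```python
from statistics import mean


def binarize_significant(rows, factor=3.0):
    """Return 0/1 tuples flagging values >= factor * the parameter mean.

    rows: list of equal-length tuples, one per station-day;
    None marks a missing value (an assumption of this sketch).
    """
    n_params = len(rows[0])
    # Mean of each parameter over its observed (non-missing) values.
    means = []
    for j in range(n_params):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(mean(observed) if observed else None)
    # Flag 1 where a measurement reaches the threshold; missing values get 0.
    return [
        tuple(
            1 if v is not None and means[j] is not None and v >= factor * means[j] else 0
            for j, v in enumerate(r)
        )
        for r in rows
    ]


# Illustrative two-parameter tuples (made-up values).
rows = [
    (1.0, 10.0),
    (1.0, None),
    (1.0, 10.0),
    (1.0, 10.0),
    (16.0, 10.0),
]
flags = binarize_significant(rows)
print(flags)
```

In this toy input the first parameter has a mean of 4.0, so only the value 16.0 reaches the threshold of 12.0. The same function would apply unchanged to 27-dimension tuples.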

Degree

Ph.D.

Advisors

Engel, Purdue University.

Subject Area

Agricultural engineering|Computer science
