Novel techniques to increase classification accuracy in machine learning
Machine learning uses statistical and computational algorithms for classification and prediction. In this dissertation, several machine learning algorithms are developed to increase classification accuracy, with an emphasis on a number of real-world applications. Beginning with protein secondary structure prediction, a special technique for including hydrophobicity information is shown to improve classification accuracy with support vector machines (SVMs). These findings are then generalized into several frameworks for classifying other datasets more accurately. The resulting methods are discussed below.
Consensual subset classifiers involve segmenting the data, ranking classifiers, and aggregating the results. Different input features can be used to partition the data samples into two subsets in a two-level, decision-tree-like structure. Choosing a subset of the rank-ordered classifiers for consensus often yields better performance.
Building further on the subset framework, two different kinds of splitting rules can be integrated in a quad decision tree. After such a quad tree is built, a data sample may fall in several terminal nodes. The estimated distribution for the sample is obtained by averaging the distributions of the terminal-node regions in which it falls. Datasets can also be shuffled to create different decision boundaries in several ways, such as aggregating different splitting rules at one node, random subspaces, and bootstrapping. This smoothing-average method usually results in higher classification accuracy.
Boosting has been shown to improve classifier performance effectively. It is shown how a number of validation sets obtained from the original training data can be used to improve the performance of a classifier. Every time the training set is changed, new classification boundaries are generated. After iteratively adding the misclassified validation samples to the training set, a more robust classifier is obtained.
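The idea of taking a consensus over a subset of rank-ordered classifiers can be illustrated with a minimal sketch. It assumes that classifiers are ranked by accuracy on a held-out validation set and combined by majority vote; the abstract does not state the ranking criterion or the aggregation rule, so both are assumptions here. The names `accuracy` and `top_k_consensus`, the toy threshold classifiers, and the sample data are all hypothetical.

```python
from collections import Counter


def accuracy(clf, samples, labels):
    """Fraction of samples for which clf predicts the given label."""
    correct = sum(1 for x, y in zip(samples, labels) if clf(x) == y)
    return correct / len(samples)


def top_k_consensus(classifiers, val_x, val_y, k):
    """Rank classifiers by validation accuracy and majority-vote the best k.

    Ranking by validation accuracy and voting by simple majority are
    assumptions made for this sketch, not details taken from the abstract.
    """
    ranked = sorted(
        classifiers,
        key=lambda c: accuracy(c, val_x, val_y),
        reverse=True,
    )
    chosen = ranked[:k]

    def predict(x):
        votes = Counter(c(x) for c in chosen)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical base classifiers on two-feature samples.
classifiers = [
    lambda x: int(x[0] > 0.5),         # thresholds the first feature
    lambda x: int(x[1] > 0.5),         # thresholds the second feature
    lambda x: int(x[0] + x[1] > 1.0),  # thresholds the feature sum
    lambda x: 0,                       # deliberately weak: always predicts 0
]

# Hypothetical validation data, used only to rank the classifiers.
val_x = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.6), (0.3, 0.4), (0.6, 0.3), (0.4, 0.9)]
val_y = [1, 0, 1, 0, 0, 1]

# An odd k avoids ties for binary labels; the weakest classifier is dropped.
consensus = top_k_consensus(classifiers, val_x, val_y, k=3)
print(accuracy(consensus, val_x, val_y))
```

In this toy setup the weakest classifier is excluded before voting, which is the intuition behind using a subset of the ranked classifiers rather than all of them.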
Parallel computing uses the power of multi-core processors to reduce computation time. A parallel version of the consensual subset SVM has been implemented and tested. The speedup depends on the dataset and, for smaller datasets, does not increase linearly with the number of CPU cores.
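One common way to reason about sub-linear speedup on small datasets is a fixed-overhead model, in which part of the run time does not shrink as cores are added. The sketch below is only an illustration of that model, not the analysis or the measurements from the dissertation; the function name `modeled_speedup` and all timing values are hypothetical.

```python
def modeled_speedup(serial_seconds, parallel_seconds, cores):
    """Speedup under a simple fixed-overhead model (an assumption, not measured data).

    serial_seconds: time that does not shrink with more cores
    parallel_seconds: single-core time of the work that divides evenly across cores
    """
    one_core = serial_seconds + parallel_seconds
    many_core = serial_seconds + parallel_seconds / cores
    return one_core / many_core


core_counts = (1, 2, 4, 8)

# Hypothetical workloads with the same fixed overhead but different parallel work.
small = [modeled_speedup(1.0, 4.0, p) for p in core_counts]
large = [modeled_speedup(1.0, 400.0, p) for p in core_counts]

for p, s, l in zip(core_counts, small, large):
    print(f"{p} cores: small dataset {s:.2f}x, large dataset {l:.2f}x")
```

Under this model the smaller workload falls well short of the core count, while the larger one stays closer to linear, because the fixed overhead is a larger share of the total time when there is less parallel work.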
Gelfand, Purdue University.
Applied Mathematics|Computer Engineering