An important problem in pattern recognition is the effect of small design sample size on classification performance. When the ratio of the number of training samples to the number of feature measurements is small, the estimates of the discriminant functions are not accurate, and therefore the classification results might not be satisfactory. This problem is becoming increasingly important in remote sensing as the number of available spectral bands grows. In order to fully utilize the information contained in high-dimensional data, training samples are needed from all of the classes of interest. A large number of classes of interest and a large number of features necessitate a large number of training samples. Such training samples are usually very expensive and time-consuming to acquire. In this thesis, we study the use of unlabeled samples, which are usually available in large numbers and with no extra effort, in reducing the small sample size problems. It is shown that by adding the unlabeled samples to the classifier design process, better estimates of the discriminant functions can be obtained. Therefore, the peaking phenomenon that is observed in the performance versus dimensionality curves can be mitigated. Bounds on the expected amount of improvement in the classification performance are derived for the case of two multivariate Gaussian classes with a known common covariance matrix. These bounds explicitly show the relationship between dimensionality and sample size for the case when parameters are estimated by simultaneously using training and unlabeled samples. A semi-parametric method for estimating the parameters of the class density functions that uses both training and unlabeled samples is proposed, and its parallel implementation is discussed. The problem of density model selection for classification is studied.
An algorithm based on a backtrack search strategy is presented for generating candidate models for the density functions. The candidate models are evaluated by several criteria that are based on both training and unlabeled samples.
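As an illustration of how unlabeled samples can enter the estimation of class parameters, the following is a minimal sketch and not the method developed in the thesis. It assumes two Gaussian classes whose known common covariance matrix is the identity, and it uses a standard expectation-maximization (EM) update in which training samples keep their labels and unlabeled samples are weighted by their posterior class probabilities. The function name `semi_supervised_means` and all parameter names are hypothetical.

```python
import numpy as np

def semi_supervised_means(X_lab, y_lab, X_unl, n_iter=50):
    """EM estimate of two class means and priors from labeled and unlabeled samples.

    Assumption for this sketch: both classes are Gaussian with identity covariance.
    X_lab: (n, d) training samples, y_lab: (n,) labels in {0, 1}, X_unl: (m, d).
    """
    # Per-class counts and sums of the labeled (training) samples.
    n_lab = np.array([(y_lab == k).sum() for k in (0, 1)], dtype=float)
    sum_lab = np.stack([X_lab[y_lab == k].sum(axis=0) for k in (0, 1)])

    # Initial estimates use the training samples only.
    mu = sum_lab / n_lab[:, None]
    prior = n_lab / n_lab.sum()

    for _ in range(n_iter):
        # E-step: posterior class probabilities of each unlabeled sample.
        # With identity covariance the log-density is -0.5 * squared distance
        # up to a constant shared by both classes.
        d2 = ((X_unl[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_post = np.log(prior) - 0.5 * d2
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: labeled samples count with weight 1, unlabeled samples
        # with their posterior weights.
        weight = n_lab + post.sum(axis=0)
        mu = (sum_lab + post.T @ X_unl) / weight[:, None]
        prior = weight / weight.sum()

    return mu, prior
```

With only a few training samples per class, the initial means come from those samples alone; each iteration then refines them with the much larger unlabeled set, which is the general mechanism by which unlabeled samples can improve the parameter estimates.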

Date of this Version

January 1994