Abstract
We consider the simultaneous estimation of the outlier set and the regression parameters using the contaminated data set, A, which has many members obeying the linear model, but some which do not. A precise definition of outliers is given using set theory. Fixing the number of outliers to be L, the optimum set of inliers is chosen as the set having the highest log likelihood among all subsets of A of size N-L, N being the size of A. We define the concept of a valid partition of the data set A into inliers and outliers, and show that the local maxima of the likelihood function yield regression estimates which yield valid partitions of the data A. We show that the global maximum set S*, the estimate of the inlier set, has an interesting game-theoretic interpretation. The outlier set estimate given here is based on evaluating different partitions of the data, and it does not involve the arbitrary thresholds characteristic of the papers in the literature. We develop a new formula for computing the sum of minimal residual squares of any subset of A of size N-L as a quadratic form in the residuals computed from the LS coefficients obtained from all the data A. Only a particular case of this formula, when L = 1, has been [...]. Using this method one can compute S*, the optimal set of inliers of size N-L. We apply the method developed here to seven well known "difficult" multivariate data sets like the stackloss data, the simulated data set of Rousseeuw, the water salinity data, the engine knock data, the Hawkins-Bradu-Kass data, and the star data, and show that our method extracts the correct outliers from the simulated data sets and extracts the outliers from data sets like engine knock, where conventional methods like the least median of squares fail to do so.
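The subset search described in the abstract can be illustrated with a small sketch. It is not the report's algorithm. It assumes Gaussian errors with a common variance, so that for a fixed subset size the highest log likelihood corresponds to the smallest least-squares residual sum. It is restricted to a straight-line model y = a + b*x for brevity. The subset residual sum is obtained from the full-data residuals through the standard multiple-case deletion identity RSS(S) = RSS(A) - e_D' (I - H_DD)^{-1} e_D, where D is the deleted set and H is the hat matrix; the report's own formula may differ in form. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b * xbar, b

def rss(xs, ys):
    """Minimal residual sum of squares of a line refitted to (xs, ys)."""
    a, b = fit_line(xs, ys)
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

def solve(m, v):
    """Solve m z = v by Gaussian elimination with partial pivoting."""
    k = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * k
    for r in range(k - 1, -1, -1):
        tail = sum(aug[r][j] * z[j] for j in range(r + 1, k))
        z[r] = (aug[r][k] - tail) / aug[r][r]
    return z

def rss_by_deletion(xs, ys, deleted):
    """RSS of the data without `deleted`, using only the full-data fit."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    e = [y - a - b * x for x, y in zip(xs, ys)]  # full-data residuals
    d = list(deleted)
    # Hat matrix of a straight-line fit: h_ij = 1/n + (x_i - xbar)(x_j - xbar)/sxx
    m = [[(1.0 if i == j else 0.0)
          - (1.0 / n + (xs[i] - xbar) * (xs[j] - xbar) / sxx)
          for j in d] for i in d]
    e_d = [e[i] for i in d]
    z = solve(m, e_d)  # z = (I - H_DD)^{-1} e_D
    return sum(r * r for r in e) - sum(u * w for u, w in zip(e_d, z))

def best_inliers(xs, ys, n_out):
    """Exhaustive search: delete the n_out points giving the smallest subset RSS."""
    n = len(xs)
    best = min(combinations(range(n), n_out),
               key=lambda d: rss_by_deletion(xs, ys, d))
    inliers = [i for i in range(n) if i not in best]
    return inliers, list(best)

# Toy data (made up): points near y = 2x + 1, with indices 4 and 7 contaminated.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.1, 4.9, 7.0, 20.0, 11.1, 12.9, -5.0]
inliers, outliers = best_inliers(xs, ys, n_out=2)
print("estimated outlier indices:", outliers)
```

The exhaustive search costs one small L-by-L solve per candidate subset instead of a full refit, which is the practical appeal of expressing the subset residual sum in terms of the full-data residuals. The number of candidate subsets still grows combinatorially with N and L, so this sketch is only suitable for small examples.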
Date of this Version
February 1993
Comments
Page 39 is missing from the original document as well as from the copy held in the Siegesmund Engineering Library of Purdue University.