Outlier Detection for Covariance Estimation Via Feature Selection Algorithms
Rajiv S. Menjoge, Roy E. Welsch
Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, United States

We present an alternative way of viewing the robust covariance estimation problem by posing it as a feature selection problem for a multivariate regression. The literature most closely related to this paper is [1] and [3], where a connection is established between outlier detection in linear regression and feature selection on an augmented design matrix.

Let the data matrix be Y, construct X as a column of 1's with an identity matrix appended, and perform feature selection in the multivariate regression model of Y onto X, where Y is the matrix of response variables and X is the design matrix. The estimated covariance matrix of the error term of this regression is the algorithm's robust estimate of the covariance matrix of the original data.
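As a concrete illustration, the following minimal Python sketch (assuming numpy; the function name conditional_covariance and the normalization of the residual covariance by n are our choices for illustration, not specifications from the paper) builds X for a chosen set of dummy columns, fits the multivariate regression by least squares, and returns the residual covariance estimate:

    import numpy as np

    def conditional_covariance(Y, dummy_cols):
        """Regress Y on X = [column of 1's, selected identity (dummy) columns]
        and return a covariance estimate of the regression residuals."""
        n = Y.shape[0]
        X = np.hstack([np.ones((n, 1)), np.eye(n)[:, dummy_cols]])
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # multivariate least squares
        E = Y - X @ B                              # residual matrix
        return E.T @ E / n                         # one simple normalization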

The justification for our algorithm is twofold: 1. The estimated covariance matrix of multivariate data is the same as the estimated conditional covariance matrix of the multivariate regression of Y onto X when X is a single column of 1's. 2. By the mean-shift outlier model, performing a regression with some observations deleted yields the same results as performing a regression onto an augmented X matrix, where the augmented columns are dummy columns corresponding to the observations deleted in the first model.
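The second point is easy to check numerically. In the following Python sketch (numpy assumed; the data and the deleted indices are arbitrary, for illustration only), the mean model fit to the data with two observations deleted and the dummy-augmented model fit to the full data give the same mean estimate and the same residuals, and therefore the same covariance estimate:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 3
    Y = rng.normal(size=(n, p))
    drop = [4, 11]                                # observations to delete
    keep = [i for i in range(n) if i not in drop]

    # Model A: mean model fit to the reduced data (observations deleted).
    mu_A = Y[keep].mean(axis=0)
    E_A = Y[keep] - mu_A

    # Model B: full data, with a dummy column for each deleted observation.
    X = np.hstack([np.ones((n, 1)), np.eye(n)[:, drop]])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    E_B = (Y - X @ B)[keep]          # residuals at the dummied rows are zero

    assert np.allclose(B[0], mu_A)   # same mean estimate
    assert np.allclose(E_A, E_B)     # same residuals, hence same covariance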

We focus on backward elimination as the feature selection algorithm. In particular, we iteratively eliminate the least relevant feature of the X matrix, producing a ranked list of outliers. The criterion for a feature's relevance is the determinant of the conditional covariance matrix after that feature is eliminated. When there are ties, we use instead the product of the nonzero eigenvalues, and when ties remain, the L1 norm of the coefficient matrix. Since ties occur for the first few features eliminated, this specification is important and allows the initialization to be self-contained.
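A minimal sketch of the full procedure under our reading of the above (Python with numpy; the function names and the normalization of the covariance estimate by n are our illustrative choices) is:

    import numpy as np

    def criterion(Y, dummy_cols, tol=1e-10):
        """Relevance criterion for a candidate set of retained dummy columns:
        (determinant of the conditional covariance, product of its nonzero
        eigenvalues, L1 norm of the coefficients), compared lexicographically
        so that each entry breaks ties in the previous one."""
        n = Y.shape[0]
        X = np.hstack([np.ones((n, 1)), np.eye(n)[:, dummy_cols]])
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)
        E = Y - X @ B
        eig = np.clip(np.linalg.eigvalsh(E.T @ E / n), 0.0, None)
        det = float(np.prod(eig))            # zero when rank-deficient -> tie
        nz = eig[eig > tol]
        pseudo_det = float(np.prod(nz)) if nz.size else 0.0  # still tied -> L1 decides
        return (det, pseudo_det, float(np.abs(B).sum()))

    def rank_outliers(Y):
        """Backward elimination over the dummy columns of X = [1, I].
        Returns observation indices, most outlying first."""
        n = Y.shape[0]
        retained = list(range(n))
        order = []
        while retained:
            # Eliminate the dummy whose removal gives the smallest criterion.
            j = min(retained,
                    key=lambda c: criterion(Y, [k for k in retained if k != c]))
            retained.remove(j)
            order.append(j)
        return order[::-1]  # last-surviving dummies mark the strongest outliers

Each elimination step refits the regression once per remaining dummy column, so this naive version performs O(n^2) least-squares fits; it is meant to make the procedure concrete, not to be efficient.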

We applied our algorithm to four real data sets (Hertzsprung-Russell Star Data, Biochemical Data, Wine Data, and Swiss Heads Data) and obtained satisfactory results: in each case, our algorithm ranked the observations known to be outliers highly [2].

References:

[1] McCann, L. and Welsch, R. E., 2007. Robust Variable Selection Using Least Angle Regression and Elemental Set Sampling. Computational Statistics and Data Analysis, 52, 249-257.

[2] Menjoge, R. S., 2010. New Procedures for Visualizing Data and Diagnosing Regression Models. PhD Thesis, M.I.T.

[3] Morgenthaler, S., Welsch, R. E., and Zenide, A., 2003. Algorithms for Robust Model Selection in Linear Regression. Theory and Applications of Recent Robust Methods, eds. M. Hubert, G. Pison, A. Struyf, and S. Van Aelst, Basel (Switzerland): Birkhäuser-Verlag.

Keywords: robust statistics; mean shift outlier model; multivariate analysis; robust covariance estimation

Biography: Professor Roy Welsch is the Eastman Kodak Leaders for Global Operations Professor of Management, Statistics, and Engineering Systems at the MIT Sloan School of Management and Engineering Systems Division and Head of the Management Science Area of the Sloan School. He received his PhD in Mathematics from Stanford University. His major research areas have been regression diagnostics, robust estimation, nonlinear modeling, visualization, and, more recently, financial engineering and bioinformatics.