Subspace Methods for Anomaly Detection in High Dimensional Astronomical Databases
Marc Y.R. Henrion1, David J. Hand1, Axel Gandy1, Daniel J. Mortlock2
1Department of Mathematics, Imperial College London, London, United Kingdom; 2Department of Physics, Imperial College London, London, United Kingdom

Modern astronomical surveys, in particular cross-matched databases from virtual observatories, are very large datasets (hundreds of thousands to millions and even billions of objects) that are high-dimensional (from a dozen variables up to a few hundred) and often contain large numbers of missing values (because sources emit light at different wavelengths and faint sources are not detected in all filter passbands). The objects most interesting to astronomers are typically very rare, very faint and have one or several features that set them apart from the other sources in the survey. Indeed, common stars and galaxies are fairly well understood, and it is objects right at the detection limits of the different surveys, or objects with peculiar astrophysical properties, that drive much of the astrophysical research. Anomaly detection tools are therefore vital for finding such potentially interesting sources. However, the size of the datasets involved, the high dimensionality and, above all, the large numbers of missing values present severe challenges to existing anomaly detection methods. We propose a novel approach which works by computing, for each object, anomaly scores in lower-dimensional subspaces and then combining these scores into a single score for each source. Working in subspaces allows us to work around the curse of dimensionality and to deal very intuitively with missing values. As a result, our method allows direct comparisons of sources, even if they have been observed in quite different sets of variables. We will discuss several ways of combining anomaly scores and look at various properties of our approach. The proposed approach is very flexible and can be used with most anomaly score computation methods.
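The subspace idea described above can be illustrated with a minimal sketch. The abstract does not specify the base scorer, the choice of subspaces or the combination rule, so everything concrete here is an assumption made for illustration: the function names `knn_score` and `subspace_anomaly_scores` are hypothetical, the subspaces are taken to be all pairs of variables, the base score is the mean distance to the k nearest neighbours, scores are rank-normalised within each subspace, and the combination is a mean over the subspaces in which an object is fully observed. Missing values are assumed to be encoded as NaN.

```python
import itertools

import numpy as np


def knn_score(points, k=5):
    """Mean Euclidean distance to the k nearest neighbours (assumed base scorer)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    k = min(k, len(points) - 1)
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)


def subspace_anomaly_scores(data, subspace_dim=2, k=5, combine=np.nanmean):
    """Score each object in every subspace where it is observed, then combine.

    data: (n_objects, n_vars) array with NaN marking missing values.
    Returns one combined score per object (NaN if no subspace was usable).
    """
    n_objects, n_vars = data.shape
    subspaces = list(itertools.combinations(range(n_vars), subspace_dim))
    scores = np.full((n_objects, len(subspaces)), np.nan)

    for j, cols in enumerate(subspaces):
        sub = data[:, list(cols)]
        observed = ~np.isnan(sub).any(axis=1)  # objects with no gaps here
        n_obs = int(observed.sum())
        if n_obs <= k:
            continue  # too few objects to score this subspace
        raw = knn_score(sub[observed], k=k)
        # Rank-normalise to (0, 1] so scores are comparable across subspaces
        # (an assumed normalisation, not taken from the abstract).
        ranks = raw.argsort().argsort() + 1
        scores[observed, j] = ranks / n_obs

    # Combine only over the subspaces in which each object was scored.
    has_score = ~np.isnan(scores).all(axis=1)
    combined = np.full(n_objects, np.nan)
    combined[has_score] = combine(scores[has_score], axis=1)
    return combined
```

Because each object is scored only in subspaces where all of its variables are present, two sources observed in different sets of variables still receive comparable combined scores. Passing `combine=np.nanmax` instead would flag an object that is extreme in any single subspace. The pairwise distance matrix makes this sketch quadratic in the number of objects, so it is suitable only for small samples, not for survey-scale data.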

Keywords: Anomaly detection; Digital sky surveys; Data mining; Astrostatistics

Biography: Marc Henrion has been a research student in the Department of Mathematics at Imperial College London since October 2007. His research interests are classification and anomaly detection methods, in particular for digital sky surveys. Before starting his PhD, he studied for an undergraduate degree in Mathematics at Imperial College London and the École Normale Supérieure de Lyon (ENS Lyon).