Exploratory Non-Parametric and Graphical Procedures for Top-k Ranked Lists
Michael G. Schimek
Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria

Web search engines or microarray laboratory devices, among other new technologies, produce very long lists of distinct items or objects in rank order. The statistical task is to identify common top-ranking objects from two or more lists and to form sub-lists of consolidated items. In each list, the rank position might be due to a measure of strength of evidence, to a preference, or to an assessment either based on expert knowledge or a technical device. For each object, it is assumed that its rank assignment in one list is independent of its rank assignments in the other lists. The ranking is from 1 to N, without ties. Starting with work of Mallows (1957), there is a substantial model-based literature on combining rankings in problems where the number of items N is relatively small, and significantly less than the number L of assignment mechanisms. These parametric approaches cannot handle data of the type described above. Moreover, we are interested in problems where the reliability of rankings breaks down after the first (top), say, k objects due to error or lack of discriminatory information. Hence, we need distribution-free, and at the same time computationally highly efficient approaches, because list aggregation by means of brute force is limited to the situation where both N and L are impractically small.

Here, we present a non-parametric inference procedure that allows us to test for random degeneration of paired rankings (Hall and Schimek, 2010) even under m-dependence of the assignments. The size of a reliable consensual sub-list obtained in this manner depends on various technical parameters to cope with irregular and incomplete rankings, typical for real data. This exploratory inference tool can provide the necessary input for rank aggregation procedures (Schimek, Mysickova, and Budinska, 2010). Here, our focus is on statistical graphics for data integration based on the estimated k's. We implemented the inference and the graphical procedures and other procedures in the R package TopKLists. Several examples (real and simulated data) are used to illustrate the capabilities of this new methodology.

References:

Hall P. and Schimek M.G. (2010). Moderate deviation-based inference for random degeneration in paired rank lists. Under revision for JASA.

Mallows, C.L. (1957). Non null ranking models I. Biometrika 44, 114-130.

Schimek M.G., Mysickova, A. and Budinska, E. (2010). An inference and integration approach for the consolidation of ranked lists. Accepted for publication in JSPI.

Keywords: Non-parametric inference; Ordered list; Rank aggregation; Statistical graphics

Biography: Michael G. SCHIMEK, DPhil, PhD, is a researcher and educator in statistics and bioinformatics. He holds degrees from the Universities of Vienna (Austria), Innsbruck (Austria) and Bath (UK). His affiliations are the Medical University of Graz (Austria) and the Masaryk University in Brno (CZ). He had visiting positions at the Max Planck Institute for Molecular Genetics in Berlin (Germany), at Ludwig-Maximilians University in Munich (Germany), at Texas A&M University in College Station (USA), at the University of Sydney (Australia), at McGill University in Montreal (Canada), among others. He is past Vice President of the International Association for Statistical Computing. His research interests range from nonparametrics and statistical modelling to biocomputing and genomic data integration.