Visual Clustering of High-Dimensional Data by Navigating Low-Dimensional Spaces
Wayne Oldford, Adrian Waddell
Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada

The structure of a set of high-dimensional data objects (e.g. images, documents, molecules, genetic expressions, etc.) is notoriously difficult to visualize. In contrast, lower-dimensional structure (esp. 3 or fewer dimensions) is natural to us and easy to visualize. A not unreasonable approach, then, is to explore one low-dimensional visualization after another in the hope that, together, these will shed light on the higher-dimensional structure. A familiar example is the parallel coordinate plot, where a sequence of one-dimensional projections is connected to provide insight into the structure of a high-dimensional data set.

In this talk, we describe the graph-theoretic structure, recently proposed in Hurley and Oldford (2011, Comp. Stat.), that represents low-dimensional spaces as graph nodes and transitions between spaces as edges. Of interest are walks along these graphs that reveal meaningful structure. If the nodes are one-dimensional, a walk corresponds to a parallel coordinate plot; if they are two-dimensional and edges exist, say, only between 2d spaces that share a variate, then the walk could be represented dynamically as a series of scatterplots, one transitioning into the next via a 3d rigid transformation. We demonstrate how these graphs are constructed and dynamically explored via our R package, RnavGraph.
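As a minimal sketch of the two-dimensional case described above, the following Python fragment builds a graph whose nodes are pairs of variates and whose edges join pairs sharing exactly one variate, then checks whether a sequence of nodes is a walk. It is an illustration only and does not use the RnavGraph API; the function names transition_graph_3d and is_walk and the variate labels are hypothetical.

```python
from itertools import combinations


def transition_graph_3d(variates):
    """Build a graph of 2d spaces (hypothetical helper, not RnavGraph).

    Nodes are unordered pairs of variates. Two nodes are joined by an
    edge when they share exactly one variate, so moving along the edge
    can be shown as a rotation within the 3d space of the three
    variates involved.
    """
    nodes = list(combinations(variates, 2))
    edges = [
        (a, b)
        for a, b in combinations(nodes, 2)
        if len(set(a) & set(b)) == 1
    ]
    return nodes, edges


def is_walk(path, edges):
    """Return True if every consecutive pair of nodes in path is an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((u, v)) in edge_set for u, v in zip(path, path[1:]))


# Illustrative variate labels (assumed, not from any data set).
nodes, edges = transition_graph_3d(["A", "B", "C", "D"])

# A candidate walk: each scatterplot shares one variate with the next.
walk = [("A", "B"), ("B", "C"), ("C", "D")]
walk_ok = is_walk(walk, edges)
```

Each step of such a walk replaces one variate and keeps the other, which is what allows consecutive scatterplots to be linked by a transition through a three-variate space.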

These graphs, unfortunately, grow in size with the square of the number of dimensions. Fortunately, there are numerous means for constructing only the more interesting regions of each graph. Some restrictions are imposed by the statistical context, others by empirical measures on the data itself. Of the latter, scatterplot diagnostics (scagnostics) are especially valuable. When the objective is to visually cluster the data, nearest-neighbour-based dimension reduction methods are particularly effective in conjunction with appropriate scagnostics. We demonstrate these methods using RnavGraph.
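To illustrate the growth and one way of restricting a graph to its more interesting region, the sketch below counts the two-variate nodes for p variates, p(p-1)/2, and keeps only the highest-scoring nodes before connecting them. The scoring function is a placeholder standing in for a scagnostic measure; the names node_count and prune_by_score and the toy scores are assumptions made for illustration, not part of RnavGraph or of any scagnostics implementation.

```python
from itertools import combinations
from math import comb


def node_count(p):
    """Number of 2d spaces (variate pairs) available from p variates."""
    return comb(p, 2)


def prune_by_score(variates, score, keep):
    """Keep the `keep` highest-scoring 2d spaces and the edges among them.

    `score` is any function mapping a variate pair to a number; here it
    is a placeholder for a scagnostic-style measure (an assumption).
    """
    ranked = sorted(combinations(variates, 2), key=score, reverse=True)
    nodes = ranked[:keep]
    edges = [
        (a, b)
        for a, b in combinations(nodes, 2)
        if len(set(a) & set(b)) == 1
    ]
    return nodes, edges


# Toy scores, invented purely for illustration.
toy_scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("A", "D"): 0.2,
    ("B", "C"): 0.8, ("B", "D"): 0.3, ("C", "D"): 0.7,
}

kept_nodes, kept_edges = prune_by_score(
    ["A", "B", "C", "D"], score=toy_scores.get, keep=3
)
```

With 10 variates there are 45 such nodes and with 20 there are 190, so ranking and truncating before building edges keeps the navigable graph small.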

Keywords: Graph navigation; Transition graphs; RnavGraph; High-dimensional data; Visual clustering

Biography: Wayne Oldford is a Professor of Statistics and of Computer Science at the University of Waterloo, Canada. His research interests include data visualization, cluster analysis, quantitative programming environments, and statistical foundations. Prof. Oldford is a long-standing member of the International Statistical Institute, and an even longer-standing member of the International Association for Statistical Computing.