Clustering often takes place in subspaces of the full Euclidean space, and there may be shape restrictions on these subspaces.
Examples include the simplex, spherical subspaces and hypercubes. Here we consider clustering on the unit hypercube (of arbitrary dimension). Standard clustering methods often impose implicit spherical or elliptical shape constraints on the clusters they can discover; examples include k-means, complete-linkage hierarchical clustering, Ward's method and model-based clustering with finite Gaussian mixtures, among many others. While these should perform well for discovering groups in the centre of the hypercube, they may perform poorly when searching for groups at its corners. This paper compares a standard clustering method, k-means, with alternatives tailored to the shape of the space. We present results from a finite mixture of beta distributions (estimated by the EM algorithm), and from k-means and model-based clustering applied to arcsine-transformed data, discussing the pros and cons of each method. Results are given both for data simulated from various models and for a dataset from the cognitive diagnosis field, in which students are grouped on the basis of estimates (between 0 and 1) of various skills used in an electronic test.
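As an illustration of the arcsine-transform approach described above, the following sketch (not the authors' code; the data-generating settings and the minimal Lloyd's-algorithm k-means are assumptions for illustration) transforms data on the unit square via arcsin(sqrt(x)) before clustering, which pulls mass near the corners towards the interior:

```python
import numpy as np

def arcsine_transform(x):
    # Map [0, 1] data through arcsin(sqrt(x)), a classical
    # variance-stabilising transform for proportions.
    return np.arcsin(np.sqrt(np.clip(x, 0.0, 1.0)))

def kmeans(X, k, n_iter=100, seed=0):
    # Minimal Lloyd's algorithm; a stand-in for any standard k-means.
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest centre.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centres as cluster means (keep old centre if empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Two simulated groups near opposite corners of the unit square
# (hypothetical beta parameters chosen for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.beta(1, 8, size=(100, 2)),   # mass near (0, 0)
               rng.beta(8, 1, size=(100, 2))])  # mass near (1, 1)
labels, centres = kmeans(arcsine_transform(X), k=2)
```

Running k-means on the transformed data rather than on `X` directly reflects the comparison the abstract describes: the spherical cluster shapes that k-means prefers are a better fit after the transform than at the raw hypercube corners.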
Keywords: Cluster analysis; Hypercube; EM algorithm
Biography: Dr Nema Dean is a Lecturer in Statistics in the School of Mathematics and Statistics at the University of Glasgow. She received her PhD in Statistics from the University of Washington in Seattle, studying under Prof. Adrian Raftery. Her main areas of interest are unsupervised learning, variable selection and visualization.