Clustering High Dimension Low Sample-Size Data with Fuzzy Cluster-Based Principal Component Analysis
Mika Sato-Ilic
Faculty of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan

The aim of principal component analysis (PCA) is to summarize the latent similarity structure of data observed in a high-dimensional space by projecting the data into a lower-dimensional space.

However, classical PCA has the following problem. Classical PCA is based on orthogonal projection, and the metric projection onto a convex set in the data space is non-expansive. Therefore, the distance between two projected objects in the lower-dimensional space can never exceed the distance between the corresponding objects in the high-dimensional space. The root cause of this problem is that PCA focuses only on minimizing the sum of squared distances from the objects in the high-dimensional space to a lower-dimensional hyperplane, and does not consider the similarities among the objects in the high-dimensional space.
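The non-expansive property can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic Gaussian data; the helper names `pca_projection` and `pairwise_distances` and the data sizes are illustrative choices, not part of the proposed method.

```python
import numpy as np

def pca_projection(X, k):
    """Project the rows of X onto the first k principal directions."""
    Xc = X - X.mean(axis=0)                      # center each variable
    # Rows of Vt are orthonormal principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def pairwise_distances(Z):
    """Euclidean distances between all pairs of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Synthetic high dimension low sample-size data: 10 objects, 200 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))

Y = pca_projection(X, k=2)
d_high = pairwise_distances(X)   # distances before projection
d_low = pairwise_distances(Y)    # distances after projection

# Orthogonal projection never increases a pairwise distance
# (a small tolerance absorbs floating-point error).
never_expands = bool(np.all(d_low <= d_high + 1e-9))
```

Every pairwise distance in the projected space is bounded by its high-dimensional counterpart, so objects that are well separated in the original space may appear close after projection.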

In order to solve this problem, we extract the similarity structure of objects in the high-dimensional space by using a fuzzy clustering method and incorporate the result into the PCA. We thereby propose a new PCA that considers the similarity structure of objects in the high-dimensional space. When the projected space is obtained using a covariance matrix, we use a weighted covariance matrix that incorporates the degree of contribution to the fuzzy classification structure of the objects, based on the dissimilarity of objects in the high-dimensional space.
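The general idea of a PCA driven by a fuzzy-clustering-weighted covariance matrix can be sketched as follows. This is an illustration under stated assumptions, not the weighting scheme of the proposed method, which is not given here: the clustering is a plain fuzzy c-means, and each object's weight is assumed to be the sum of its squared membership degrees. The function names, the fuzzifier `m = 2`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns memberships U (n x K) and centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        d2 = np.maximum(d2, 1e-12)               # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)  # standard membership update
    return U, centers

def weighted_pca(X, w, k):
    """PCA from a covariance matrix in which object i carries weight w[i]."""
    w = w / w.sum()
    Xc = X - w @ X                                # subtract the weighted mean
    C = (Xc * w[:, None]).T @ Xc                  # weighted covariance matrix
    vals, vecs = np.linalg.eigh(C)                # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return Xc @ vecs[:, order], vals[order]

# Synthetic data: 12 objects, 50 variables, two groups shifted on 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 50))
X[:6, :5] += 3.0

U, _ = fuzzy_c_means(X, n_clusters=2)
# ASSUMPTION: the object weight is the sum of squared memberships, so objects
# that belong clearly to one cluster contribute more to the covariance.
weights = (U ** 2).sum(axis=1)
scores, eigvals = weighted_pca(X, weights, k=2)
```

Setting all weights equal recovers the classical covariance matrix up to a scaling constant, so the classical PCA is the special case in which the classification structure is ignored.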

The target data in this study are high dimension low sample-size data. In order to obtain an adaptable classification structure for this type of data, we use a method that detects subgroups of objects that cluster on subsets of variables. The merit of this method is that it selects the significant variables by simultaneously considering two structures: the structure of the degree of classification and the structure of the objects. Since our goal is to extract the classification structure of the data, it is crucial that the classification structure is taken into account at each variable.

References:

J.H. Friedman and J.J. Meulman, Clustering Objects on Subsets of Attributes, Journal of the Royal Statistical Society, Series B, vol. 66, pp. 815-849, 2004.

M. Sato-Ilic, A Cluster-Target Similarity Based Principal Component Analysis for Interval-Valued Data, 19th International Conference on Computational Statistics, Physica-Verlag, pp. 1605-1612, 2010.

Keywords: Principal Component Analysis; Fuzzy Clustering; Variable Selection; High Dimension Low Sample-Size Data

Biography: I received my Doctorate of Engineering from Hokkaido University, specializing in data mining including fuzzy data analysis. I currently hold the position of Associate Professor in the Faculty of Systems and Information Engineering, at the University of Tsukuba.

I am a Council member of the International Association for Statistical Computing and a Director of the Japan Statistical Society. I am a Senior Member of the IEEE and a vice chair of the IEEE Computational Intelligence Society Fuzzy Systems Technical Committee. In addition, I have held several other positions in the IEEE, including vice program chair, special sessions co-chair, and publicity chair for several international conferences. Presently, I am the editor-in-chief of the International Journal of Knowledge Engineering and Soft Data Paradigms. My research interests include the development of methods for multi-dimensional data analysis, pattern classification, and data mining based on soft computing, for which I have received several academic awards.