Principal Component Analysis on Interval-Valued Data Based on Complete Information in Hypercubes
Huiwen Wang1,2, Rong Guan1,2
1School of Economics and Management, Beihang University, Beijing, China; 2Research Center of Complex Data Analysis, Beihang University, Beijing, China

Principal component analysis (PCA) is mainly used for dimension reduction: it transforms a number of correlated variables into fewer uncorrelated variables called principal components. It thus helps visualize the original high-dimensional observations in a lower-dimensional space from the most informative viewpoint. With the explosive growth of databases, analysts often find it hard to discover knowledge from massive numbers of observations, even in a low-dimensional space. Symbolic Data Analysis (SDA) offers an innovative way to address this problem. It aims to generalize large-scale data into conceptual observations described by symbolic data, and either to extend classical statistical methods or to develop new approaches for multivariate analysis of symbolic data.

This paper focuses on PCA for interval-valued data, one of the most widely used types of symbolic data. The well-known methods proposed in the literature, such as VPCA, CPCA and MRPCA, share a common defect: they use only part of the information in the hypercubes (vertices, centers, or midpoints and radii). Further study is therefore needed on a PCA that can capture the complete information in interval-valued observations, which are pictured as hypercubes. To this end, we propose a novel method, Complete Information Principal Component Analysis (CIPCA). By dividing hypercubes into informative grid data, CIPCA defines the inner product of interval-valued variables as the integral form of the inner product of the grid data. Following an analytical rather than a geometrical approach, CIPCA derives interval-valued principal components and reduces PCA modeling to the computation of a set of inner products. Experiments on synthetic data sets demonstrate the merits of CIPCA in modeling interval-valued data. CIPCA leads to more accurate analytical results because the defined inner product captures the complete information in the hypercubes, rather than only the vertices, centers, or midpoints and radii.
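The relation between the grid inner product and its integral form can be illustrated with a minimal sketch. It assumes that points are uniformly distributed inside each interval, which the abstract does not state, and all function names and sample intervals are hypothetical. Under that assumption, the average grid product of a variable with itself over [a, b] tends to (a^2 + ab + b^2)/3, and the average grid product of two different variables over [a, b] x [c, d] equals (a + b)(c + d)/4.

```python
# Sketch only: assumes points are uniformly distributed inside each interval
# (an assumption made here, not taken from the abstract).
# All function names and the sample intervals are hypothetical.

def grid_points(low, high, m):
    """Return the midpoints of m equal sub-intervals of [low, high]."""
    step = (high - low) / m
    return [low + (k + 0.5) * step for k in range(m)]

def grid_self_product(x, m):
    """Grid approximation of <X, X>: per observation, average p*p over the grid."""
    return sum(sum(p * p for p in grid_points(a, b, m)) / m for a, b in x)

def grid_cross_product(x, y, m):
    """Grid approximation of <X, Y>: per observation, average p*q over an m-by-m grid."""
    total = 0.0
    for (a, b), (c, d) in zip(x, y):
        xs = grid_points(a, b, m)
        ys = grid_points(c, d, m)
        total += sum(p * q for p in xs for q in ys) / (m * m)
    return total

def integral_self_product(x):
    """Integral form of <X, X> under the uniform assumption."""
    return sum((a * a + a * b + b * b) / 3.0 for a, b in x)

def integral_cross_product(x, y):
    """Integral form of <X, Y> under the uniform assumption."""
    return sum((a + b) * (c + d) / 4.0 for (a, b), (c, d) in zip(x, y))

# Two interval-valued variables observed on two units, as (lower, upper) pairs.
X = [(1.0, 3.0), (2.0, 6.0)]
Y = [(0.0, 2.0), (4.0, 8.0)]

if __name__ == "__main__":
    for m in (2, 20, 200):
        print(m, grid_self_product(X, m), grid_cross_product(X, Y, m))
    print("integral", integral_self_product(X), integral_cross_product(X, Y))
```

The sketch uses uncentered inner products for brevity. A full PCA would first center the variables and then take the eigen-decomposition of the matrix of such inner products, a step that is omitted here.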

Keywords: Principal component analysis; Interval-valued data; Complete information; Hypercubes

Biography: Professor Huiwen Wang is with Beihang University, China. She is the dean of the School of Economics and Management, a doctoral advisor, and the president of the Research Center of Complex Data Analysis at Beihang University. Her research focuses on complex data analysis, symbolic data analysis and partial least squares. To date, she has published 3 books and more than 60 papers.