Variable Selections and Their Applications for Data Analysis in Principal Canonical Correlation Analysis
Toru Ogura1, Yasunori Fujikoshi2, Takakazu Sugiyama3
1Industrial and Systems Engineering, Chuo University, Bunkyo-ku, Tokyo, Japan; 2Hiroshima University, HigashiHiroshima-shi, Japan; 3Soka University, Hachiouji-shi, Tokyo, Japan

Canonical correlation analysis (CCA) is often used to analyze the correlation between two random vectors (see, for example, Anderson (2003), Siotani et al. (1985), Sugiura (1976)). However, the canonical variables calculated using CCA are often of limited practical use because they are not easy to interpret, and it is difficult to reduce the number of meaningful canonical variables. To address these difficulties, Sugiyama and Takeda (1999) proposed principal canonical correlation analysis (PCCA). PCCA is CCA between two sets of principal component (PC) scores: each set of PC scores is calculated from the corresponding random vector by principal component analysis (PCA), and PCCA uses these sets of PC scores in place of the original random vectors. PCA transforms a given data set of correlated variables into a new data set of uncorrelated variables, the PC scores.

Each PC score is a linear combination of the original variables and retains a certain percentage of the inherent variability, with successive PC scores accounting for a decreasing proportion of the total variance in the data. PCCA is therefore expected to have some merit.
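The two-step construction described above (PCA on each random vector, then CCA between the resulting PC scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pc_scores`, `canonical_correlations` and `pcca` are hypothetical, PCA is taken on the sample covariance matrix, and both data matrices are assumed to have full column rank.

```python
import numpy as np


def pc_scores(data, n_components):
    """First n_components PC scores of data (rows are observations).

    Assumption: PCA is based on the sample covariance matrix.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the largest ones first.
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]


def canonical_correlations(u, v):
    """Sample canonical correlations between the columns of u and v.

    Computed as the singular values of Qu^T Qv, where Qu and Qv are
    orthonormal bases of the centered column spaces (full column rank assumed).
    """
    u = u - u.mean(axis=0)
    v = v - v.mean(axis=0)
    qu, _ = np.linalg.qr(u)
    qv, _ = np.linalg.qr(v)
    s = np.linalg.svd(qu.T @ qv, compute_uv=False)
    # Clip to [0, 1] to guard against rounding error.
    return np.clip(s, 0.0, 1.0)


def pcca(x, y, p_components, q_components):
    """Hypothetical helper: CCA between the leading PC scores of x and of y."""
    return canonical_correlations(
        pc_scores(x, p_components), pc_scores(y, q_components)
    )
```

When all PC scores of both sets are retained, the values returned by `pcca` coincide with the ordinary canonical correlations of the original variables, because the full set of PC scores is a nonsingular linear transformation of the centered data. Differences appear only when some PC scores are dropped.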

Because PC scores are ordered by decreasing amount of information, it is important to select useful PC scores in PCCA. Using only the selected PC scores makes the CCA easier to interpret. Some procedures for selecting variables in CCA of random vectors have been proposed. Several authors (Fujikoshi (1985), Ichikawa and Konishi (1999), Fujikoshi and Kurata (2008), Konishi and Kitagawa (2008)) have proposed the use of a method based on Akaike's (1973) idea. Ogura (2010) formally investigated the same approach in PCCA, proposing a variable selection criterion for one set of PC scores, and pointed out some advantages of this procedure. For example, the principal canonical correlation coefficients from the selected PC scores provide almost the same information as those from all PC scores. Furthermore, the canonical variables are easier to interpret. The effectiveness of this procedure was demonstrated using an example.

In this paper we propose a variable selection criterion for two sets of PC scores in PCCA that is an extension of Ogura's (2010) approach, based on a reasonable derivation. Furthermore, we demonstrate the effectiveness of this criterion using a simulation and an example. We also compare variable selection for two sets of PC scores in PCCA with variable selection for one set of PC scores.

We also investigate some applications to real data.

Keywords: Canonical correlation analysis; Principal component analysis; Variable selection

Biography: I received a Ph.D. from Chuo University, Japan, in March 2010.

I have been working as an Assistant Professor at Chuo University since April 2010.

My area of interest is canonical correlation analysis.

My recent studies include “Distribution of some statistics in principal canonical correlation analysis”, “A variable selection in principal canonical correlation analysis” and “Canonical correlation coefficients of dataset with missing observations”.