The present proposal deals with high-dimensional binary data collected on different occasions in time or space. When studying the associations among data collected on different occasions, a primary aim is to detect changes in the association structure from one occasion to another. A suitable exploratory technique for the analysis of multiple associations in high-dimensional data is multiple correspondence analysis (MCA; Greenacre, 2007). However, MCA factorial displays obtained separately for different occasions cannot be meaningfully compared. A possible solution for linking the association structures of different data batches is to start from the MCA display of a reference batch and incrementally update the solution as further batches arrive (Iodice D'Enza and Greenacre, 2010). This approach, however, does not take into account the presence of a cluster structure in the set of statistical units.
This contribution presents an approach that, by combining clustering and factorial techniques, aims to visualize the evolution of the association structure of binary attributes over different data batches. The proposal is to introduce a latent categorical variable that is determined and updated at each incoming batch; in other words, this variable is determined according to the association structure and represents the 'link' among the solutions. The latent categorical variable is determined endogenously by the procedure; in particular, it reflects the cluster structure characterizing the data set in question. A starting solution is updated incrementally as new data sets are analysed. The factorial display describes the patterns of change in the multiple associations when the analysis shifts from one occasion to the next.
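The idea of a cluster-based latent variable linking successive batches can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes plain k-means as the clustering step, simulated binary batches, and a simple correspondence analysis of the cumulated cluster-by-attribute count table. All function names (`ca_coordinates`, `kmeans`, `assign`, `make_batch`) and all parameter values are hypothetical.

```python
import numpy as np

def ca_coordinates(F, n_dims=2):
    """Simple correspondence analysis of a count table F (rows x columns)."""
    P = F / F.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)                  # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    rows = U[:, :n_dims] * s[:n_dims] / np.sqrt(r)[:, None]     # principal coordinates
    cols = Vt.T[:, :n_dims] * s[:n_dims] / np.sqrt(c)[:, None]
    return rows, cols

def assign(X, centroids):
    """Label each unit with its nearest centroid (squared Euclidean distance)."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point start (an assumption)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - m) ** 2).sum(axis=1) for m in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for g in range(k):
            if np.any(labels == g):      # keep the old centroid if a cluster empties
                centroids[g] = X[labels == g].mean(axis=0)
    return centroids

# Simulated data (assumption): 3 groups of units, 6 binary attributes.
rng = np.random.default_rng(0)
k, p = 3, 6
base = np.full((k, p), 0.05)
base[0, 0:2] = 0.95
base[1, 2:4] = 0.95
base[2, 4:6] = 0.95
shifted = base.copy()
shifted[0, 4] = 0.95                     # an association that changes in the last batch

def make_batch(n, profiles):
    g = rng.integers(0, k, size=n)
    return (rng.random((n, p)) < profiles[g]).astype(float)

batches = [make_batch(300, base), make_batch(300, base), make_batch(300, shifted)]

# Starting solution on the reference batch, then incremental updates.
centroids = kmeans(batches[0], k)
counts = np.zeros(k)                     # cumulated cluster sizes
F = np.zeros((k, p))                     # cumulated cluster-by-attribute counts
displays = []
for X in batches:
    labels = assign(X, centroids)        # latent categorical variable for this batch
    Z = np.eye(k)[labels]                # n x k indicator matrix of the latent variable
    F += Z.T @ X
    counts += Z.sum(axis=0)
    nonempty = counts > 0
    centroids[nonempty] = F[nonempty] / counts[nonempty][:, None]  # running means
    displays.append(ca_coordinates(F))   # factorial display after this batch
```

Each element of `displays` holds the cluster and attribute coordinates after one more batch has been absorbed, so attribute positions can be followed from one occasion to the next. Axes obtained from successive SVDs are defined only up to reflection, and this sketch does not align them.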
Procedures that combine clustering with factorial analysis techniques have already been proposed. Vichi and Kiers (2001) combine principal component analysis (PCA) with k-means clustering. In the framework of categorical data, another interesting approach combining clustering and MCA is proposed by Hwang et al. (2006). Similarly, but dealing with binary data, Palumbo and Iodice D'Enza (2010) propose a suitable dimension reduction and clustering procedure. The present proposal extends the latter approach to the comparative analysis of multiple batches.
Greenacre M. J. (2007). 'Correspondence Analysis in Practice', second edition. CRC Press.
Hwang H., Dillon W. R. and Takane Y. (2006). 'An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents'. Psychometrika, 71, 161.
Iodice D'Enza A. and Greenacre M. J. (2010). 'Multiple correspondence analysis for the quantification and visualization of large categorical data sets'. In Proc. of SIS09 Statistical Methods for the Analysis of Large Data-Sets (in press).
Palumbo F. and Iodice D'Enza A. (2010). 'A two-step iterative procedure for clustering of binary sequences'. Data Analysis and Classification. Springer, 50.
Vichi M. and Kiers H. (2001). 'Factorial k-means analysis for two-way data'. Computational Statistics & Data Analysis, 37(1), 49.
Keywords: Correspondence analysis; Evolutionary associations analysis
Biography: Full Professor of Statistics, University of Naples Federico II; Ph.D. in Computational Statistics, University of Naples Federico II, Italy; Post-Doctoral Fellowship, University of Naples Federico II, Italy. He teaches Psychometrics and Data Analysis in intermediate and advanced courses. Associate Editor of Computational Statistics and of the Italian Journal of Applied Statistics. Participant in many European projects and associate investigator in several Italian research projects. Scientific Secretary of the IASC (2007-2009); IASC Council Member (2009-2013). President-Elect of ClaDAG (Classification and Data Analysis Group of the Italian Statistical Society). Member of many national and international conference Programme Committees, including the IASC SPC for the ISI conference in Durban (2009). Invited speaker at several international conferences on Complex Data and Data Mining. His main research interests are multivariate analysis of complex data and data mining.