Principal Component Regression with Survey Data: Application on the French Media Audience
Camelia Goga, Muhammad Ahmed Shehzad, Aurelie Vanheuverzwyn
IMB, Universite de Bourgogne, Dijon, France; IMB, Universite de Bourgogne, Dijon, France; Mediametrie, Levallois, France

Large data sets of auxiliary variables are becoming increasingly common in practice. When a linear regression model is fitted between the variable of interest and the auxiliary variables, numerical problems may occur because of multicollinearity or ill-conditioning of the data. Under such conditions, the estimation of finite population totals by means of the generalized regression (GREG) estimator (Särndal, Swensson and Wretman, 1992) may perform poorly because the ordinary least squares estimator of the regression coefficients is unstable. If a calibration-type approach is used, extremely large or negative weights may result. A further difficulty is that the usual software for computing calibration estimates, such as CALMAR, may not be able to handle such large data sets.

In this paper, we suggest reducing the dimension by using a new class of model-assisted estimators based on the well-known principal component regression (Jolliffe, 2002). For this purpose, we consider a superpopulation model that describes the relationship between the variable of interest and the p auxiliary variables. Let X be the matrix of the p auxiliary variables and let Z be the matrix of the p principal components of X. We retain the first r principal components, with r << p, which correspond to the largest eigenvalues of X and carry most of the information (variation) available in the auxiliary data. We then fit a model between the variable of interest and these r principal components instead of the original X-variables. Working with a small number of principal components rather than a large number of auxiliary variables simplifies the construction of the GREG estimator, and since each principal component is a linear combination of all the auxiliary variables, information from all of them is retained. Results concerning the asymptotic bias and variance under the sampling design are derived.
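
The construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the estimator exactly as derived in the paper: the function name `pc_greg_total` and its arguments are hypothetical, the auxiliary matrix is assumed to be known for the whole population, the columns of X are standardized before the components are extracted, and an intercept is added to the working model.

```python
import numpy as np

def pc_greg_total(X_pop, y_s, sample_idx, d, r):
    """GREG-type estimate of the total of y using the first r principal components.

    X_pop      : (N, p) auxiliary matrix, assumed known for the whole population
    y_s        : (n,) study variable observed on the sample
    sample_idx : (n,) population row indices of the sampled units
    d          : (n,) design weights, d_k = 1 / pi_k
    r          : number of principal components retained (r <= p)
    """
    # Assumption: standardize each column (requires non-constant columns).
    Xc = (X_pop - X_pop.mean(axis=0)) / X_pop.std(axis=0)
    N = Xc.shape[0]

    # Eigen-decomposition of X'X / N; eigh returns eigenvalues in ascending order.
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / N)
    top = np.argsort(eigval)[::-1][:r]

    # Scores on the first r components, plus an intercept column (assumption).
    Z = np.column_stack([np.ones(N), Xc @ eigvec[:, top]])
    Z_s = Z[sample_idx]

    # Design-weighted least squares fit of y on the retained components.
    A = Z_s.T @ (d[:, None] * Z_s)
    b = Z_s.T @ (d * y_s)
    gamma = np.linalg.solve(A, b)

    # Horvitz-Thompson estimate of y plus a regression adjustment on Z.
    ht_y = np.sum(d * y_s)
    ht_Z = Z_s.T @ d
    t_Z = Z.sum(axis=0)
    return float(ht_y + (t_Z - ht_Z) @ gamma)
```

The weighted fit only requires solving a system of size r + 1 rather than p + 1, which is where the numerical benefit in ill-conditioned settings would come from.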

A calibration-type approach may also yield the same kind of estimator. The objective is then to construct a weighted estimator of the population total that involves the principal components Z instead of X. More precisely, we suggest deriving weights that satisfy the calibration equations on the first r principal components of X and that are, on average, as close as possible to the initial sampling weights with respect to the chi-squared distance.
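
For the chi-squared distance, the calibration problem has a closed-form solution, which the following minimal sketch illustrates. The function name `pc_calibration_weights` and its arguments are hypothetical; `Z_s` is assumed to hold the sample rows of the first r principal component scores (optionally with an intercept column) and `t_Z` their known population totals.

```python
import numpy as np

def pc_calibration_weights(Z_s, d, t_Z):
    """Chi-squared calibration weights on principal component scores.

    Minimizes sum_k (w_k - d_k)^2 / d_k over the sample,
    subject to the calibration equations sum_k w_k z_k = t_Z.

    Z_s : (n, q) principal component scores for the sampled units
    d   : (n,) initial design weights
    t_Z : (q,) known population totals of the scores
    """
    # Design-weighted cross-product matrix of the scores.
    T = Z_s.T @ (d[:, None] * Z_s)
    # Lagrange multiplier: T^{-1} (t_Z - Horvitz-Thompson estimate of t_Z).
    lam = np.linalg.solve(T, t_Z - Z_s.T @ d)
    # Closed-form solution w_k = d_k (1 + z_k' lam).
    return d * (1.0 + Z_s @ lam)
```

With this distance, the weighted sum of the sampled y-values using these weights is algebraically the regression-type estimator built on the same components. Nothing in this sketch prevents negative weights, so range restrictions would have to be handled separately.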

These results are then illustrated on a real data set from the French media audience measurement company Médiamétrie, and comparisons with ridge-type estimators (Rao and Singh, 2009) are given.


1. Jolliffe, I.T. (2002). Principal Component Analysis, Second Edition. New York: Springer-Verlag.

2. Rao, J.N.K. and Singh, A.C. (2009). Range restricted weight calibration for survey data using ridge regression. Pakistan Journal of Statistics, 25, 371-384.

3. Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

Keywords: Model-assisted estimators; Calibration estimators; Penalized regression

Biography: I am of Pakistani origin and am in the second year of my PhD at the University of Burgundy, France. My work covers survey sampling techniques and the development of new methods for incorporating large data sets of auxiliary variables into regression estimation under survey sampling designs.