Fitting Regression Models to Complex Survey Data – Gelman's Estimator Revisited
Moshe Feder1, Danny Pfeffermann1,2
1Southampton Statistical Sciences Research Institute, University of Southampton, Southampton, United Kingdom; 2Department of Statistics, Hebrew University, Jerusalem, Israel

Fitting models to survey data is complicated by a number of factors, typically unequal selection probabilities, item and/or unit non-response, and clustering. As a result of these factors, the sample distribution of the variables of interest usually differs from their distribution in the target population, and this can bias the inference if it is ignored in the analysis. Proper fitting of models to complex survey data therefore generally requires accounting for variables in addition to the dependent variable Y and the explanatory variables X in the regression model of interest. These include the design variables used for the sample selection (denoted by Z), the sample inclusion indicator I and the response indicator R. Some of these variables may overlap or even coincide. For example, Y may also be a design variable, as in case-control studies. In almost every application, even primary data analysts (i.e., those affiliated with the data producers) do not have a complete set of data that includes the population values of Z, R, etc. Secondary data analysts typically have even less access to population-level variables.

Some of the methods proposed in the literature for analyzing complex survey data require only sample-level data, and only a subset of the variables mentioned above, while other methods require some population-level information. Standing out in the latter group is an approach recently advocated by Gelman (2007), which requires knowledge of the design variables at the population level. The basic idea behind this approach is first to include the design variables in the model along with the explanatory variables and the interactions between the two sets of variables, and then to average the augmented model over the population distribution of the design variables. In this presentation I shall discuss the advantages and limitations of this approach for estimating population-level models and propose a variation that overcomes the limitations. I shall compare the original approach and the proposed modifications using simulated and real data sets.
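The two steps described above (augment the model with the design variables and their interactions, then average over the population distribution of the design variables) can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from Gelman (2007) or from the presentation: a single binary design variable Z that is independent of X in the population, known population shares of Z, linear models within each level of Z, and the hypothetical helper names `fit_line` and `averaged_fit`. Fitting Y on X separately within each level of Z gives the same coefficients as one model with full X-by-Z interactions.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns [intercept, slope]."""
    design = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def averaged_fit(x, y, z, pop_shares):
    """Fit y on x within each level of z, then average the coefficients
    over the population shares of z (hypothetical helper, assumes z is
    independent of x in the population)."""
    return sum(share * fit_line(x[z == level], y[z == level])
               for level, share in pop_shares.items())


rng = np.random.default_rng(0)

# Simulated population: the regression of Y on X differs by design stratum.
N = 200_000
z_pop = rng.choice([0, 1], size=N, p=[0.8, 0.2])
x_pop = rng.normal(size=N)
intercepts = np.array([1.0, 3.0])
slopes = np.array([2.0, 0.5])
y_pop = intercepts[z_pop] + slopes[z_pop] * x_pop + rng.normal(size=N)

# Unequal selection probabilities: stratum 1 is heavily oversampled.
pi = np.where(z_pop == 1, 0.10, 0.01)
sampled = rng.random(N) < pi
x_s, y_s, z_s = x_pop[sampled], y_pop[sampled], z_pop[sampled]

# Population-level information assumed available: the shares of Z.
pop_shares = {0: float(np.mean(z_pop == 0)), 1: float(np.mean(z_pop == 1))}

census = fit_line(x_pop, y_pop)        # target: fit on the whole population
naive = fit_line(x_s, y_s)             # ignores the sample design
averaged = averaged_fit(x_s, y_s, z_s, pop_shares)

print("census  :", census)
print("naive   :", naive)
print("averaged:", averaged)
```

By construction the population slope in this setup is about 0.8 × 2.0 + 0.2 × 0.5 = 1.7. The naive sample fit is pulled toward the oversampled stratum, whereas the averaged fit should land close to the census fit. When Z is not independent of X, simple averaging of coefficients over the marginal shares of Z is no longer adequate, so this sketch covers only the simplest case.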

Keywords: Complex survey data; Modelling; Unequal selection probabilities; Selection bias

Biography: Moshe Feder started his statistical career at Statistics Canada in Ottawa, after which he taught at the University of Southampton in the U.K. He later joined the Research Triangle Institute (RTI) in North Carolina. Last year he returned to Southampton as a Research Fellow.