An Imputation Method for Missing Covariate Data
Chiara Bocci, Emilia Rocco
Department of Statistics, University of Florence, Florence, Italy

This paper addresses the problem of applying a geoadditive model to produce estimates for geographical domains in the absence of point-referenced geographical data. Geoadditive models, introduced by Kammann and Wand (2003), make it possible to analyze the spatial distribution of the study variable while accounting for possible linear or non-linear covariate effects. They do so by merging an additive model (Hastie and Tibshirani, 1990) and a kriging model (Cressie, 1993) and expressing both as a linear mixed model. Therefore, when data are spatially located and the analysis explicitly considers the possible importance of their spatial distribution, geoadditive models represent a powerful geostatistical methodology.
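As a reading aid, the mixed-model form of a geoadditive model can be sketched as follows. The notation is an assumption made here for illustration, not taken from the paper: a single continuous covariate t_i, a location s_i, a smooth function f for the covariate effect, a bivariate smooth h for the spatial effect, and low-rank spline and kriging bases collected in the matrices Z_t and Z_s.

```latex
\[
\begin{aligned}
y_i &= \beta_0 + f(t_i) + h(\mathbf{s}_i) + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma_\varepsilon^2), \\
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta}
  + \mathbf{Z}_t \mathbf{u}_t + \mathbf{Z}_s \mathbf{u}_s
  + \boldsymbol{\varepsilon}, \\
\mathbf{u}_t &\sim N(\mathbf{0}, \sigma_t^2 \mathbf{I}), \qquad
\mathbf{u}_s \sim N(\mathbf{0}, \sigma_s^2 \mathbf{I}), \qquad
\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma_\varepsilon^2 \mathbf{I}).
\end{aligned}
\]
```

In this sketch the fixed-effects matrix X holds the intercept and the linear terms in t_i and s_i, while the random effects u_t and u_s carry the non-linear covariate effect and the spatial effect respectively. The rows of Z_s depend on the locations s_i, which is why the model needs coordinates for every unit.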

Implementing the model requires the statistical units to be referenced at point locations, and if the model is used to produce model-based estimates of a parameter of interest for some geographical domains, the spatial location is required for all population units. Often, however, the exact location of all population units is not known, especially when socio-economic data are involved. Typically, we know the coordinates of the sampled units (which may be collected specifically for the analysis) but not those of the non-sampled population units. For the non-sampled units we know only the areas to which they belong, such as census districts, blocks, or municipalities. In this situation, the classic approach is to assign to all the units belonging to the same area the coordinates (latitude and longitude) of the geographical center, or centroid, of that area. This is obviously an approximation, motivated by nothing more than a geometrical property, and its effect on the estimates can be strong, depending on the degree of nonlinearity in the spatial pattern and on the size of the area.

In this paper we propose to fill the gaps in the geographical information with a stochastic imputation approach instead of the classic deterministic one based on centroids. In particular, we suggest handling the lack of geographical information by imposing a distribution on the locations inside each area. This is achieved through a hierarchical Bayesian formulation of the geoadditive model in which a prior distribution on the spatial coordinates is defined. The performance of our imputation approach is evaluated through various Markov chain Monte Carlo (MCMC) experiments.
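The contrast between the two imputation strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions made only for illustration: each area is an axis-aligned rectangle, the prior on a unit's location is uniform over its area, and the function names (`centroid`, `draw_location`, `impute_locations`) are hypothetical. The sketch draws locations from the prior only. In the hierarchical Bayesian model, the unknown coordinates would instead be updated together with the other parameters inside the MCMC sampler.

```python
import random

# Assumption: an area is an axis-aligned rectangle (xmin, ymin, xmax, ymax).
# Real census districts or municipalities are irregular polygons.


def centroid(area):
    """Deterministic imputation: the geometric center of the rectangle."""
    xmin, ymin, xmax, ymax = area
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)


def draw_location(area, rng):
    """Stochastic imputation: one draw from a uniform prior over the area."""
    xmin, ymin, xmax, ymax = area
    return (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))


def impute_locations(unit_areas, areas, rng=None, stochastic=True):
    """Assign coordinates to units for which only the area label is known.

    unit_areas: list of area labels, one per non-sampled unit
    areas:      dict mapping each area label to its rectangle
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    if stochastic:
        return [draw_location(areas[label], rng) for label in unit_areas]
    return [centroid(areas[label]) for label in unit_areas]


if __name__ == "__main__":
    areas = {"A": (0.0, 0.0, 2.0, 4.0), "B": (2.0, 0.0, 5.0, 1.0)}
    units = ["A", "A", "B"]
    print(impute_locations(units, areas, stochastic=False))
    print(impute_locations(units, areas, rng=random.Random(42)))
```

With centroid imputation every unit in an area collapses onto the same point, so the spatial term of the model cannot vary within that area. With stochastic imputation the units are spread over the area according to the assumed distribution.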


Cressie, N., 1993. Statistics for Spatial Data (revised edition). Wiley, New York.

Hastie, T.J., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall, London.

Kammann, E.E., Wand, M.P., 2003. Geoadditive Models. Applied Statistics 52, 1–18.

Keywords: Hierarchical Bayesian Models; Imputation; Penalized Splines; Linear Mixed Model

Biography: Emilia Rocco is Assistant Professor in Statistics at the University of Firenze (Italy). She graduated in Statistics (summa cum laude) from the University of Firenze in 1995, was a visiting fellow at the Department of Statistics of the Pennsylvania State University in 1998, and received her PhD in Applied Statistics from the University of Firenze in 1999. Her main scientific interests concern sample surveys. In particular, she investigates informative and non-informative sample designs for studying rare and clustered populations; capture–recapture methodologies; methods for treating nonresponse; small area estimation methods; and ranked set sampling.