Census takers around the world often seek to deliver anonymised microdata files that can easily be used for tabulation and secondary analysis. To satisfy this requirement, such files must be complete, i.e. have no missing values, and coherent, i.e. have no invalid or implausible values (e.g. a person record with Marital Status = 'Married' and Age = 5). Most census operations therefore involve some editing and imputation (E&I) during data processing.
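As a minimal illustration of the completeness and coherence requirements, the following Python sketch flags missing values and one implausible combination in a single person record. The field names, the minimum age threshold and the function name are hypothetical and chosen only for this example; they are not taken from any census edit specification.

```python
# Hypothetical threshold, for illustration only; actual edit rules are country-specific.
MIN_MARRIAGE_AGE = 16


def check_record(record, required=("age", "marital_status")):
    """Return a list of problems found in one person record (a dict)."""
    problems = []

    # Completeness: every required field must have a value.
    for field in required:
        if record.get(field) is None:
            problems.append(f"missing:{field}")

    # Coherence: a single edit rule linking marital status and age.
    age = record.get("age")
    status = record.get("marital_status")
    if age is not None and status == "Married" and age < MIN_MARRIAGE_AGE:
        problems.append("edit_failed:married_under_min_age")

    return problems
```

A record that returns an empty list passes both checks; any other record would be a candidate for imputation.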
The detailed planning of a census E&I operation requires investigating alternative imputation methods and parameter settings (sets of edit rules and other parameters driving imputation procedures). For comparison and decision making, objective criteria are needed to assess how well each imputation method and parameter setting performs. Similar criteria are later required to evaluate how well the chosen imputation approach performed on the data processed during the live census E&I operation.
Chambers (2001) proposed criteria for measuring 'distributional homogeneity', exploiting the idea that imputation should preserve the distributions of the variables undergoing imputation. These criteria are based on a Wald-type statistic that compares the distribution of a categorical variable obtained after imputation with its 'true distribution', which is assumed known in simulation scenarios where missing (erroneous) values are introduced to test the imputation approach. This statistic has been used by the UK Office for National Statistics for preliminary analyses in their imputation evaluation studies in preparation for the 2011 Census.
In this paper we consider several alternative measures of imputation quality based on chi-square type distance measures, on the family of power-divergence statistics (Cressie and Read, 1984), and on the Minkowski distance. These measures provide alternative criteria for assessing the impact of imputation on the distributions of categorical variables, and proved useful in overcoming some observed limitations of the Wald-type statistic mentioned above.
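To make these measures concrete, the sketch below computes a power-divergence statistic and a Minkowski distance between the category counts of a variable before and after imputation. It is an illustration under stated assumptions, not the indicators proposed in the paper: the function names are hypothetical, both count vectors are assumed to have the same total, every 'true' count is assumed positive, and only index values lam >= 0 are handled. With lam = 1 the power-divergence statistic reduces to the Pearson chi-square distance, and lam = 0 is taken as the likelihood-ratio limit.

```python
import math


def power_divergence(imputed, true, lam=2 / 3):
    """Cressie-Read power-divergence statistic between two count vectors.

    Assumes equal totals, strictly positive `true` counts and lam >= 0.
    """
    if len(imputed) != len(true):
        raise ValueError("count vectors must have the same length")
    if any(t <= 0 for t in true):
        raise ValueError("all 'true' counts must be positive")
    if lam < 0:
        raise ValueError("this sketch only handles lam >= 0")

    if lam == 0:
        # Limit as lam -> 0: likelihood-ratio statistic (empty cells contribute 0).
        return 2 * sum(o * math.log(o / t) for o, t in zip(imputed, true) if o > 0)

    total = sum(o * ((o / t) ** lam - 1) for o, t in zip(imputed, true))
    return 2 / (lam * (lam + 1)) * total


def minkowski_distance(imputed, true, r=2):
    """Minkowski distance of order r between the two vectors of proportions."""
    n_imp, n_true = sum(imputed), sum(true)
    diffs = (abs(o / n_imp - t / n_true) for o, t in zip(imputed, true))
    return sum(d ** r for d in diffs) ** (1 / r)


# Illustrative counts for a three-category variable (made-up numbers).
true_counts = [50, 30, 20]
imputed_counts = [45, 35, 20]
pearson = power_divergence(imputed_counts, true_counts, lam=1)
manhattan = minkowski_distance(imputed_counts, true_counts, r=1)
```

Both functions return zero when the imputed and true distributions coincide, and larger values indicate a larger distortion of the distribution by imputation.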
We propose a set of imputation quality indicators, based on some of these measures of distributional homogeneity, that can also be used to monitor the live imputation process during census E&I processing. Simulation was used to illustrate the performance of the proposed indicators.
References:
Chambers, R. L. (2001). Evaluation criteria for statistical editing and imputation. In National Statistics Methodology Series, J. Charlton (ed), 41. Newport: Office for National Statistics.
Cressie, N., and Read, T. R. C. (1984). Multinomial Goodness-of-Fit Tests. Journal of the Royal Statistical Society. Series B (Methodological) 46, 440-464.
Keywords: Goodness-of-fit; Categorical data; Marginal distribution; Evaluation
Biography: Principal investigator at ENCE; PhD in Social Statistics from Southampton University; President of the IASS, 2007-2009; currently associate editor of Survey Methodology and International Statistical Review.