An Assessment of Deviations from Conditional Independence in Binary Data Fusion
Elsabe Smit
Statistics and Actuarial Science, University of the Witwatersrand, Johannesburg, Gauteng, South Africa

Data fusion is a data integration technique that allows us to combine information from different sources through a set of common characteristics, thereby creating a single, all-inclusive data source. The success of a fusion largely depends on the accuracy of the underlying assumptions about the relationship between the variables unique to each individual data source. The most common single imputation model used to fuse data is based on the assumption of conditional independence.
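As a minimal sketch of what fusion under conditional independence implies for binary data, suppose one source gives the joint distribution of a unique variable X with a common variable Z, and the other gives the joint distribution of a unique variable Y with the same Z. The function name `fuse_under_cia`, the array layout, and the example probabilities below are illustrative assumptions, not taken from the study.

```python
import numpy as np

def fuse_under_cia(p_xz, p_yz):
    """Estimate P(X, Y, Z) from P(X, Z) and P(Y, Z), assuming X and Y
    are independent given Z (hypothetical helper for illustration).

    p_xz is indexed [x, z], p_yz is indexed [y, z]; both sources are
    assumed to agree on the marginal distribution of Z.
    """
    p_xz = np.asarray(p_xz, dtype=float)
    p_yz = np.asarray(p_yz, dtype=float)
    p_z = p_xz.sum(axis=0)  # marginal of Z, taken from the first source
    # Under conditional independence: P(x, y, z) = P(x, z) * P(y, z) / P(z)
    with np.errstate(divide="ignore", invalid="ignore"):
        joint = p_xz[:, None, :] * p_yz[None, :, :] / p_z[None, None, :]
    return np.nan_to_num(joint)  # cells with P(z) = 0 are set to 0

# Made-up example distributions (each sums to 1, both give P(Z) = [0.5, 0.5])
p_xz = np.array([[0.30, 0.10], [0.20, 0.40]])
p_yz = np.array([[0.25, 0.15], [0.25, 0.35]])
fused = fuse_under_cia(p_xz, p_yz)  # array indexed [x, y, z]
```

The fused table reproduces both observed margins exactly; only the X-Y association, which neither source observes, is determined by the assumption.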

The objective of this analysis is to evaluate data fusion procedures for binary data under the assumption of conditional independence, and to assess how deviations from this assumption influence the success of the fusion. A large number of datasets consisting of binary variables are simulated from pre-defined marginal and correlation structures. For each dataset, the degree of departure from conditional independence is quantified by conditional mutual information, a measure defined as a function of entropy.
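To illustrate the measure, the sketch below computes the conditional mutual information I(X; Y | Z) from a joint probability table of three binary variables. The function name, the [x, y, z] array layout, and the choice of base-2 logarithms (bits) are assumptions made for this example; the abstract does not specify how the quantity is computed in the study.

```python
import numpy as np

def conditional_mutual_information(p_xyz):
    """Return I(X; Y | Z) in bits for a joint table indexed [x, y, z].

    Uses I(X; Y | Z) = sum p(x,y,z) * log2( p(x,y,z) p(z) / (p(x,z) p(y,z)) ),
    skipping cells with zero probability.
    """
    p = np.asarray(p_xyz, dtype=float)
    p_z = p.sum(axis=(0, 1), keepdims=True)   # shape (1, 1, 2)
    p_xz = p.sum(axis=1, keepdims=True)       # shape (2, 1, 2)
    p_yz = p.sum(axis=0, keepdims=True)       # shape (1, 2, 2)
    mask = p > 0
    ratio = np.ones_like(p)
    ratio[mask] = (p * p_z)[mask] / (p_xz * p_yz)[mask]
    return float(np.sum(p[mask] * np.log2(ratio[mask])))

# Case 1: X and Y conditionally independent given Z (made-up probabilities)
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.7, 0.2], [0.3, 0.8]])  # indexed [x, z]
p_y_given_z = np.array([[0.5, 0.1], [0.5, 0.9]])  # indexed [y, z]
independent = np.einsum("xz,yz,z->xyz", p_x_given_z, p_y_given_z, p_z)
cmi_independent = conditional_mutual_information(independent)

# Case 2: X always equals Y, regardless of Z (maximal dependence)
dependent = np.zeros((2, 2, 2))
dependent[0, 0, :] = 0.25
dependent[1, 1, :] = 0.25
cmi_dependent = conditional_mutual_information(dependent)
```

A value of zero corresponds to exact conditional independence, and larger values indicate stronger dependence between X and Y after accounting for Z.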

It is expected that data for which the conditional independence assumption does not hold will not produce accurate fusion results. The question to be answered is whether the validity of the results declines gradually as the data deviate from conditional independence, or whether the deterioration is more abrupt.


Keywords: Data fusion; Conditional independence assumption; Binary data

Biography: Elsabe Smit is a lecturer in the Department of Statistics and Actuarial Science at the University of the Witwatersrand in South Africa. Prior to this, she worked as a statistician in the market research industry for a number of years. This motivated her to undertake an M.Sc in Mathematical Statistics to investigate data fusion applications in market research.