Multiple Record Linkage: Generalizing the Fellegi-Sunter Approach to K > 2 Datafiles
Mauricio Sadinle1, Rob Hall2, Stephen E. Fienberg1,2
1Statistics, Carnegie Mellon University, Pittsburgh, PA, United States; 2Machine Learning, Carnegie Mellon University, Pittsburgh, PA, United States

Virtually all methods for probabilistic record linkage are variations on an approach developed in the 1960s by Fellegi and Sunter, for linking two files when unique identi_ers are unavailable. When the task involves linking K > 2 files, the most common approach consists of performing all possible linkages of pairs of datafiles using a Fellegi-Sunter-like approach and then somehow reconciling the discrepancies in an ad hoc fashion. Reconciliation is typically necessary because the transitivity of the pairwise linkage decisions fails, especially when the probabilities of linkage are far from 0 and 1. In this paper we generalize the Fellegi-Sunter approach by incorporating transitivity through a direct consideration of the pairwise match probabilities of the data used to model matching probabilities and propose a variation on the EM-algorithm for carrying out the linkage estimation. Our approach has potential utility in a variety of contexts and we consider an application integrating three homicide record systems from Colombia. We also discuss how to achieve a privacy-preserving approach to multiple record linkage and its potential application to the problem of multi-part privacy preserving statistical computation.

Keywords: EM algorithm; Mixture models; Multidimensional assignment problem; Undirected graphs

Biography: Mauricio Sadinle received his bachelors degree in Statistics at National University of Colombia in Bogota. He is now a PhD student in the Department of Statistics at Carnegie Mellon University under the supervision of Stephen E. Fienberg. Some of his research interests are related to the integration of multiple samples in the context of population size estimation.