Machine Learning and Record Linkage
William E. Winkler
Center for Statistical Methods Research, U.S. Census Bureau, Washington, DC, United States

Although terminology differs, there is considerable overlap between record linkage (entity resolution) methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many practical situations (Winkler 2000). When no data are missing in the identifying fields and training data are available, both can efficiently estimate the parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian networks when training data are available (Friedman 1997) and in record linkage when no training data are available (unsupervised learning; Winkler 1988, 1989; Ravikumar and Cohen 2004; Bhattacharya and Getoor 2006). Extended EM methods can even estimate false match rates (one minus precision) without training data in a narrow range of situations (Winkler 2006). This talk describes current methods in four areas: approximate string comparison that accounts for typographical error between strings, hidden Markov models for adaptive name and address parsing, semi-supervised learning, and fast indexing and retrieval for comparing records from files with hundreds of millions or billions of records (Winkler, Yancey, and Porter 2010).
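
As an illustration of the unsupervised case, the following is a minimal sketch of EM estimation for a two-class Fellegi-Sunter model, assuming binary agree/disagree comparisons and conditional independence of the fields given match status. It is not the estimation code used in the cited work. The function names (em_fellegi_sunter, match_weight, simulate_patterns), the starting values, and the simulated parameters are all hypothetical and chosen only for the example.

```python
import math
import random


def em_fellegi_sunter(patterns, n_iter=100, p=0.1, eps=1e-6):
    """Estimate (p, m, u) from 0/1 agreement patterns without training data.

    p    = proportion of record pairs that are matches
    m[i] = P(field i agrees | pair is a match)
    u[i] = P(field i agrees | pair is a nonmatch)
    Assumes conditional independence of fields given match status.
    """
    k, n = len(patterns[0]), len(patterns)
    m = [0.9] * k  # hypothetical starting values
    u = [0.1] * k

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) to avoid division by zero.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for i, agree in enumerate(gamma):
                pm *= m[i] if agree else 1.0 - m[i]
                pu *= u[i] if agree else 1.0 - u[i]
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the posterior weights.
        total = sum(post)
        p = clamp(total / n)
        for i in range(k):
            agree_m = sum(w for w, g in zip(post, patterns) if g[i])
            agree_u = sum(1.0 - w for w, g in zip(post, patterns) if g[i])
            m[i] = clamp(agree_m / max(total, eps))
            u[i] = clamp(agree_u / max(n - total, eps))
    return p, m, u


def match_weight(gamma, m, u):
    """Log2 likelihood ratio of an agreement pattern (match vs. nonmatch)."""
    return sum(
        math.log2(m[i] / u[i]) if agree else math.log2((1.0 - m[i]) / (1.0 - u[i]))
        for i, agree in enumerate(gamma)
    )


def simulate_patterns(n, p, m, u, seed=0):
    """Draw n synthetic agreement patterns from known (p, m, u)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
    return patterns


if __name__ == "__main__":
    pats = simulate_patterns(2000, 0.1, [0.95, 0.9, 0.9, 0.85], [0.05, 0.1, 0.1, 0.2])
    p_hat, m_hat, u_hat = em_fellegi_sunter(pats)
    print("estimated match proportion:", round(p_hat, 3))
    print("weight, all fields agree:", round(match_weight((1, 1, 1, 1), m_hat, u_hat), 2))
```

Pairs whose match weight exceeds an upper cutoff would be designated matches and those below a lower cutoff nonmatches, with the remainder held for clerical review. Choosing the cutoffs is outside the scope of this sketch.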

Keywords: Likelihood ratio; Bayesian networks; EM algorithm

Biography: Ph.D. in Probability Theory; Principal Researcher at the U.S. Census Bureau; Fellow of the American Statistical Association; member of the U.S. National Academies of Science committee on Voter Registration Databases