American Political Science Review (2019) 113, 2, 353–371
doi:10.1017/S0003055418000783 © American Political Science Association 2019
Using a Probabilistic Model to Assist Merging of Large-Scale
Administrative Records
TED ENAMORADO Princeton University
BENJAMIN FIFIELD Princeton University
KOSUKE IMAI Harvard University
Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
INTRODUCTION
As the amount and diversity of available data sets rapidly increase, social scientists often harness multiple data sources to answer substantive questions. Indeed, merging data sets, in particular large-scale administrative records, is an essential part of cutting-edge empirical research in many disciplines (e.g., Ansolabehere and Hersh 2012; Einav and Levin 2014; Jutte, Roos, and Browne 2011). Data merging can be consequential. For example, the American National Election Studies (ANES) and Cooperative Congressional Election Study (CCES) validate self-reported turnout by merging their survey data with a nationwide voter file, where only the matched respondents are treated as registered voters. Although Ansolabehere and Hersh (2012) advocate the use of such a validation procedure, Berent, Krosnick, and Lupia (2016) argue that the discrepancy between self-reported and validated turnout is due to the failure of the merge procedure rather than social desirability and nonresponse bias.
Merging data sets is straightforward if there exists a unique identifier that unambiguously links records from different data sets. Unfortunately, such a unique identifier is often unavailable. Under these circumstances, some researchers have used a deterministic algorithm to automate the merge process (e.g., Adena et al. 2015; Ansolabehere and Hersh 2017; Berent, Krosnick, and Lupia 2016; Bolsen, Ferraro, and Miranda 2014; Cesarini et al. 2016; Figlio and Guryan 2014; Giraud-Carrier et al. 2015; Hill 2017; Meredith and Morse 2014), whereas others have relied on a proprietary algorithm (e.g., Ansolabehere and Hersh 2012; Engbom and Moser 2017; Figlio and Guryan 2014; Hersh 2015; Hill and Huber 2017; Richman, Chattha, and Earnest 2014). However, these methods are not robust to measurement error (e.g., misspelling) and missing data, which are common to social science data. Furthermore, deterministic merge methods cannot quantify the uncertainty of the merging procedure and instead typically rely on arbitrary thresholds to determine the degree of similarity sufficient for matches.^1 This means that post-merge data analyses fail to account for the uncertainty of the merging procedure, yielding a bias due to measurement error. These methodological challenges are amplified especially when merging large-scale administrative records.
Ted Enamorado, Ph.D. Candidate, Department of Politics, Princeton University, [email protected], http://www.tedenamorado.com.
Benjamin Fifield, Ph.D. Candidate, Department of Politics, Princeton University, [email protected], http://www.benfifield.com.
Kosuke Imai, Professor, Department of Government and Department of Statistics, Harvard University, [email protected], https://imai.fas.harvard.edu.
The proposed methodology is implemented through an open-source R package, fastLink: Fast Probabilistic Record Linkage, which is freely available for download at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink).
We thank Bruce Willsie of L2 and Steffen Weiss of YouGov for data and technical assistance; Jake Bowers, Seth Hill, Johan Lim, Marc Ratkovic, Mauricio Sadinle, five anonymous reviewers, and audiences at the 2017 Annual Meeting of the American Political Science Association, Columbia University (Political Science), the Fifth Asian Political Methodology Meeting, Gakusyuin University (Law), Hong Kong University of Science and Technology, the Institute for Quantitative Social Science (IQSS) at Harvard University, the Quantitative Social Science (QSS) colloquium at Princeton University, Universidad de Chile (Economics), Universidad del Desarrollo, Chile (Government), the 2017 Summer Meeting of the Society for Political Methodology, and the Center for Statistics and the Social Sciences (CSSS) at the University of Washington for useful comments and suggestions. Replication materials can be found on Dataverse at https://doi.org/10.7910/DVN/YGUHTD.
Received: August 8, 2017; revised: May 13, 2018; accepted: October 17, 2018. First published online: January 2, 2019.
1. These thresholds are highly dependent on data. For example, Ansolabehere and Hersh (2017) find that using three fields with exact matches as the threshold works well for the Texas voter file, but the same threshold may not work for other data. In contrast, probabilistic methods can automatically weight observations.
We demonstrate that social scientists should use probabilistic models rather than deterministic methods when merging large data sets. Probabilistic models can quantify the uncertainty inherent in many merge procedures, offering a principled way to calibrate and account for false positives and false negatives. Unfortunately, although there exists a well-known statistics literature on probabilistic record linkage (e.g., Harron, Goldstein, and Dibben 2015; Herzog, Scheuren, and Winkler 2007; Winkler 2006b), the current open-source implementation does not scale to the large data sets commonly used in today's social science research. We address this challenge by developing a fast and scalable implementation of the canonical probabilistic record linkage model originally proposed by Fellegi and Sunter (1969). Together with parallelization, this algorithm, which we call fastLink, can be used to merge data sets with millions of records in a reasonable amount of time using one's laptop computer. Additionally, building on the previous methodological literature (e.g., Lahiri and Larsen 2005), we show (1) how to incorporate auxiliary information such as population name frequency and migration rates into the merge procedure and (2) how to conduct post-merge analyses while accounting for the uncertainty about the merge process. We describe these methodological developments in the following section.
We then describe comprehensive simulation studies that evaluate the robustness of fastLink to several factors, including the size of the data sets, the proportion of true matches, measurement error, and the proportion and mechanisms of missing data. A total of 270 simulation settings consistently show that fastLink significantly outperforms the deterministic methods. Although the proposed methodology produces high-quality matches in most situations, the lack of overlap between two data sets often leads to large error rates, suggesting that effective blocking is essential when the expected number of matches is relatively small. Furthermore, fastLink appears to perform at least as well as recently proposed probabilistic approaches (Sadinle 2017; Steorts 2015). Importantly, our merge method is faster and scales to larger data sets than these state-of-the-art methods.
Next, we present two empirical applications. First, we revisit Hill and Huber (2017), who examine the ideological differences between donors and nondonors by merging the CCES data of more than 50,000 survey respondents with a campaign contribution database of over five million donor records (Bonica 2013). We find that the matches identified by fastLink are at least as high quality as those identified by the proprietary method used by the original authors. We also improve the original analysis by incorporating the uncertainty of the merge process in the post-merge analysis. We show that although the overall conclusion remains unchanged, the magnitude of the estimated effects is substantially smaller.

As the second application, we merge two nationwide voter files of over 160 million voter records each, representing one of the largest data merges ever conducted in social science research.^2 By merging voter files over time, scholars can study the causes and consequences of partisan residential segregation (e.g., Tam Cho, Gimpel, and Hui 2013; Mummolo and Nall 2016), and political analytics professionals can develop effective microtargeting strategies (e.g., Hersh 2015). We show how to incorporate available within-state and across-state migration rates in the merge process. Given the enormous size of the data sets, we propose a two-step procedure where we first conduct a within-state merge for each state followed by across-state merges for every pair of states. The proposed methodology is able to match about 95% of voters, which is about 30 percentage points greater than the exact matching method. Although it is more difficult to find across-state movers, we are able to find 20 times as many such voters as the existing matching method.
Finally, we give concluding remarks. We provide an open-source R software package, fastLink: Fast Probabilistic Record Linkage, which is freely available at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink), for implementing our methodology so that other researchers can effectively merge data sets in their own projects.
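To make the workflow concrete, the following is a minimal sketch of how such a merge might be set up with the package. The data frames dfA and dfB and the field names are placeholders, and the argument names follow the package's documented interface at the time of writing; they may differ across package versions.

```r
## A hedged sketch of a typical fastLink call; dfA, dfB, and the linkage
## fields are placeholders for the user's own data.
library(fastLink)

out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames         = c("firstname", "middleinit", "lastname",
                       "housenum", "streetname", "birthyear"),
  stringdist.match = c("firstname", "lastname", "streetname"),  # Jaro-Winkler comparisons
  partial.match    = c("firstname", "lastname", "streetname")   # allow a "similar" level
)

summary(out)                                                    # match counts and estimated error rates
matched <- getMatches(dfA, dfB, fl.out = out, threshold.match = 0.85)
```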
THE PROPOSED METHODOLOGY
In this section, we first introduce the canonical model of probabilistic record linkage originally proposed by Fellegi and Sunter (1969). We describe several improvements we make to this model, including a fast and scalable implementation, the use of auxiliary information to inform parameter estimation, and the incorporation of uncertainty about the merge process in post-merge analyses.
The Setup
Suppose that we wish to merge two data sets, A and B, which have sample sizes of $N_A$ and $N_B$, respectively. We use $K$ variables, which are common to both data sets, to conduct the merge. We consider all possible pairwise comparisons between these two data sets. For each of these $N_A \times N_B$ distinct pairs, we define an agreement vector of length $K$, denoted by $\gamma(i,j)$, whose $k$th element $\gamma_k(i,j)$ represents the discrete level of within-pair similarity for the $k$th variable between the $i$th observation of data set A and the $j$th observation of data set B. Specifically, if we have a total of $L_k$ similarity levels for the $k$th variable, then the corresponding element of the agreement vector can be defined as,
2. Although Hersh (2015) conducted a large-scale data merge, he relied on a proprietary algorithm. Others, such as Ansolabehere and Hersh (2017) and Tam Cho, Gimpel, and Hui (2013), match data sets of several million voters each, but neither of these studies approaches the scale of our applications. Note that the US Census Bureau routinely conducts large-scale data merges for the decennial census (Winkler, Yancey, and Porter 2010).
$$
\gamma_k(i,j) \;=\;
\begin{cases}
0 & \text{different} \\
1, \ldots, L_k - 2 & \text{similar} \\
L_k - 1 & \text{identical}
\end{cases}
\qquad (1)
$$
The proposed methodology allows for the existence of missing data. We define a missingness vector of length $K$, denoted by $\delta(i,j)$, for each pair $(i,j)$, whose $k$th element $\delta_k(i,j)$ equals 1 if at least one record in the pair has a missing value in the $k$th variable and equals 0 otherwise.
Table 1 presents an illustrative example of agreement patterns based on two artificial data sets, A and B, each of which has two records. In this example, we consider three possible values of $\gamma_k(i,j)$ for first name, last name, and street name, i.e., $L_k = 3$ (different, similar, nearly identical), whereas a binary variable is used for the other fields, i.e., $L_k = 2$ (different, nearly identical). The former set of variables requires a similarity measure and threshold values. We use the Jaro–Winkler string distance (Jaro 1989; Winkler 1990), which is a commonly used measure in the literature (e.g., Cohen, Ravikumar, and Fienberg 2003; Yancey 2005).^3 Because the Jaro–Winkler distance is a continuous measure whose values range from 0 (different) to 1 (identical), we discretize it so that $\gamma_k(i,j)$ takes an integer value between 0 and $L_k - 1$ as defined in equation (1). Suppose that we use three levels (i.e., different, similar, and nearly identical) based on the threshold values of 0.88 and 0.94, as recommended by Winkler (1990). Then, when comparing the last names in Table 1, we find that, for example, Smith and Smithson are similar (a Jaro–Winkler distance of 0.88) whereas Smith and Martinez are different (a Jaro–Winkler distance of 0.55).^4
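As a concrete illustration of this discretization (not fastLink's internal implementation), the sketch below uses the stringdist R package, whose "jw" method with a nonzero prefix weight returns the Jaro–Winkler distance; the similarity is one minus that distance, and the 0.88/0.94 cutoffs follow the text.

```r
## Discretize Jaro-Winkler similarity into the three agreement levels of
## equation (1); uses the 'stringdist' package as a stand-in implementation.
library(stringdist)

jw_similarity <- function(a, b) {
  ## 1 = identical, 0 = completely different; lowercase to ignore case
  1 - stringdist(tolower(a), tolower(b), method = "jw", p = 0.1)
}

agreement_level <- function(a, b, cuts = c(0.88, 0.94)) {
  s <- jw_similarity(a, b)
  ifelse(s >= cuts[2], 2L,         # nearly identical
         ifelse(s >= cuts[1], 1L,  # similar
                0L))               # different
}

agreement_level("Smith", "Smithson")   # falls in the "similar" band
agreement_level("Smith", "Martinez")   # "different"
```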
The above setup implies a total of $N_A \times N_B$ comparisons for each of the $K$ fields. Thus, the number of comparisons grows quickly as the size of the data sets increases. One solution is to use blocking and avoid comparisons that should not be made. For example, we may make comparisons within gender groups only, as sketched below. While blocking is appealing because of computational efficiency gains, Winkler (2005) notes that it often involves ad hoc decisions by researchers and faces difficulties when variables have missing values and measurement error. Here, we focus on the data merge within a block and refer interested readers to Christen (2012) and Steorts et al. (2014) for comprehensive reviews of blocking techniques.^5 We also note a related technique, called filtering, which has the potential to overcome the weaknesses of traditional blocking methods by discarding pairs that are unlikely to be matches when fitting a probabilistic model (Murray 2016).
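For instance, blocking on a fully observed field such as gender can be done directly in base R before fitting the model within each block; the variable names below are illustrative, and fastLink also provides a blockData() helper for this step.

```r
## A base-R sketch of exact blocking on gender: the model is then fit within
## each (A-block, B-block) pair that shares the same gender value.
dfA <- data.frame(id = 1:4, gender = c("F", "F", "M", "M"))
dfB <- data.frame(id = 5:8, gender = c("F", "M", "M", "F"))

blocksA <- split(dfA, dfA$gender)
blocksB <- split(dfB, dfB$gender)

## Candidate pairs are formed only within blocksA[["F"]] x blocksB[["F"]] and
## blocksA[["M"]] x blocksB[["M"]], never across genders.
```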
The Canonical Model of Probabilistic Record Linkage
The Model and Assumptions
We first describe the most commonly used probabilistic model of record linkage (Fellegi and Sunter 1969). Let a latent mixing variable $M_{ij}$ indicate whether a pair of records (the $i$th record in data set A and the $j$th record in data set B) represents a match. The model has the following simple finite mixture structure (e.g., Imai and Tingley 2012; McLachlan and Peel 2000):

$$
\gamma_k(i,j) \mid M_{ij} = m \;\overset{\text{indep.}}{\sim}\; \text{Discrete}(\pi_{km}), \qquad (2)
$$

$$
M_{ij} \;\overset{\text{i.i.d.}}{\sim}\; \text{Bernoulli}(\lambda), \qquad (3)
$$
TABLE 1. An Illustrative Example of Agreement Patterns

                         First     Middle   Last       Date of birth   House   Street
  Data set A
    1                    James     V        Smith      12-12-1927      780     Devereux St.
    2                    Robert    NA       Martines   01-15-1942      60      16th St.
  Data set B
    1                    Michael   F        Martinez   02-03-1956      4       16th St.
    2                    James     D        Smithson   12-12-1927      780     Dvereuux St.
  Agreement patterns
    A:1-B:1              Different Different Different Different       Different Different
    A:1-B:2              Identical Different Similar   Identical       Identical Similar
    A:2-B:1              Different NA        Similar   Different       Different Identical
    A:2-B:2              Different NA        Different Different       Different Different

Note: The top panel of the table shows two artificial data sets, A and B, each of which has two records. The bottom panel shows the agreement patterns for all possible pairs of these records. For example, the second line of the agreement patterns compares the first record of data set A with the second record of data set B. These two records have identical information for first name, date of birth, and house number; similar information for last name and street name; and different information for middle name. A comparison involving at least one missing value is indicated by NA.
3. Online Supplementary Information (SI) S1 describes how the Jaro–Winkler string distance is calculated.
4. As shown in the subsection Computational Efficiency and Appendix A, the discretization of the distance measure leads to substantial computational efficiency when making pairwise comparisons for each linkage field.
5. The parameters of record linkage models must be interpreted separately for each block (Murray 2016).
In equations (2) and (3), $\pi_{km}$ is a vector of length $L_k$ containing the probability of each agreement level for the $k$th variable given that the pair is a match ($m = 1$) or a nonmatch ($m = 0$), and $\lambda$ represents the probability of a match across all pairwise comparisons. Through $\pi_{k0}$, the model allows for the possibility that two records can have identical values for some variables even when they do not represent a match.
This model is based on two key independence assumptions. First, the latent variable $M_{ij}$ is assumed to be independently and identically distributed. Such an assumption is necessarily violated if, for example, each record in data set A should be matched with no more than one record in data set B. In theory, this assumption can be relaxed (e.g., Sadinle 2017), but doing so makes the estimation significantly more complex and reduces its scalability (see Online SI S8). Later in the paper, we discuss how to impose such a constraint without sacrificing computational efficiency. Second, conditional independence among the linkage variables is assumed given the match status. Some studies find that the violation of this assumption leads to unsatisfactory performance (e.g., Belin and Rubin 1995; Herzog, Scheuren, and Winkler 2010; Larsen and Rubin 2001; Thibaudeau 1993; Winkler and Yancey 2006). In Online SI S4, we show how to relax the conditional independence assumption while keeping our scalable implementation.
In the literature, researchers often treat missing data as disagreements, i.e., they set $\gamma_k(i,j) = 0$ whenever $\delta_k(i,j) = 1$ (e.g., Goldstein and Harron 2015; Ong et al. 2014; Sariyar, Borg, and Pommerening 2012). This procedure is problematic because a true match can contain missing values. Other imputation procedures also exist, but none of them has a theoretical justification or appears to perform well in practice.^6 To address this problem, following Sadinle (2014, 2017), we assume that data are missing at random (MAR) conditional on the latent variable $M_{ij}$:

$$
\delta_k(i,j) \;\perp\!\!\!\perp\; \gamma_k(i,j) \;\mid\; M_{ij},
$$

for each $i = 1, 2, \ldots, N_A$, $j = 1, 2, \ldots, N_B$, and $k = 1, 2, \ldots, K$. Under this MAR assumption, we can simply ignore missing data. The observed-data likelihood function of the model defined in equations (2) and (3) is given by
$$
L_{\text{obs}}(\lambda, \pi \mid \delta, \gamma) \;\propto\;
\prod_{i=1}^{N_A} \prod_{j=1}^{N_B} \left\{ \sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m}
\prod_{k=1}^{K} \prod_{\ell=0}^{L_k - 1} \pi_{km\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))} \right\},
$$

where $\pi_{km\ell}$ represents the $\ell$th element of the probability vector $\pi_{km}$, i.e., $\pi_{km\ell} = \Pr(\gamma_k(i,j) = \ell \mid M_{ij} = m)$. Because the direct maximization of the observed-data log-likelihood function is difficult, we estimate the model parameters using the Expectation–Maximization (EM) algorithm (see Online SI S2).
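To make the estimation logic transparent, here is a compact, unoptimized EM sketch for this mixture model under the MAR assumption; missing comparisons are simply skipped rather than treated as disagreements. It is meant only as an illustration of the E- and M-steps, not as fastLink's optimized implementation (which exploits hashing and parallelization), and all variable names are ours.

```r
## gamma: n-by-K matrix of agreement levels (0, ..., L[k] - 1), NA where
##        delta_k(i, j) = 1 (the comparison involves a missing value).
## L:     vector giving the number of agreement levels for each field.
fs_em <- function(gamma, L, max_iter = 500, tol = 1e-5) {
  n <- nrow(gamma); K <- ncol(gamma)
  lambda <- 0.1
  ## pi_km[[k]] is a 2 x L[k] matrix: row 1 = nonmatch (m = 0), row 2 = match (m = 1);
  ## start values put more mass on high agreement levels for matches to break symmetry
  pi_km <- lapply(L, function(l) {
    rbind(rev(seq_len(l)) / sum(seq_len(l)),
          seq_len(l)      / sum(seq_len(l)))
  })
  for (iter in seq_len(max_iter)) {
    ## E-step: posterior match probability xi_ij for every pair
    log1 <- rep(log(lambda), n); log0 <- rep(log(1 - lambda), n)
    for (k in seq_len(K)) {
      obs <- !is.na(gamma[, k])
      log1[obs] <- log1[obs] + log(pi_km[[k]][2, gamma[obs, k] + 1])
      log0[obs] <- log0[obs] + log(pi_km[[k]][1, gamma[obs, k] + 1])
    }
    xi <- 1 / (1 + exp(log0 - log1))
    ## M-step: update lambda and the agreement-level probabilities
    lambda_new <- mean(xi)
    for (k in seq_len(K)) {
      obs <- !is.na(gamma[, k])
      for (l in 0:(L[k] - 1)) {
        hit <- obs & gamma[, k] == l
        pi_km[[k]][2, l + 1] <- sum(xi[hit]) / sum(xi[obs])
        pi_km[[k]][1, l + 1] <- sum(1 - xi[hit]) / sum(1 - xi[obs])
      }
    }
    ## simple convergence check on lambda only (a simplification for this sketch)
    if (abs(lambda_new - lambda) < tol) { lambda <- lambda_new; break }
    lambda <- lambda_new
  }
  list(lambda = lambda, pi = pi_km, xi = xi)
}
```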
The Uncertainty of the Merge Process
The advantage of probabilistic models is their ability to quantify the uncertainty inherent in merging. Once the model parameters are estimated, we can compute the match probability for each pair using Bayes' rule:^7

$$
\xi_{ij} \;=\; \Pr\bigl(M_{ij} = 1 \mid \delta(i,j), \gamma(i,j)\bigr)
\;=\;
\frac{\lambda \displaystyle\prod_{k=1}^{K} \prod_{\ell=0}^{L_k-1} \pi_{k1\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))}}
{\displaystyle\sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m} \prod_{k=1}^{K} \prod_{\ell=0}^{L_k-1} \pi_{km\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))}}.
\qquad (4)
$$
In the subsection Post-Merge Analysis, we show how to incorporate this match probability into post-merge regression analysis to account for the uncertainty of the merge process.

Although in theory a post-merge analysis can use all pairs with nonzero match probabilities, it is often more convenient to determine a threshold $S$ when creating a merged data set. Such an approach is useful especially when the data sets are large. Specifically, we call a pair $(i,j)$ a match if the match probability $\xi_{ij}$ exceeds $S$. There is a clear trade-off in the choice of this threshold value. A large value of $S$ will ensure that most of the selected pairs are correct matches but may fail to identify many true matches. In contrast, if we lower $S$ too much, we will select more pairs, but many of them may be false matches. Therefore, it is important to quantify the degree of these matching errors in the merging process.
One advantage of probabilistic models over deterministic methods is that we can estimate the false discovery rate (FDR) and the false negative rate (FNR). The FDR represents the proportion of false matches among the selected pairs, i.e., those whose match probability is greater than or equal to the threshold. We estimate the FDR using our model parameters as follows:

$$
\widehat{\Pr}\bigl(M_{ij} = 0 \mid \xi_{ij} \ge S\bigr) \;=\;
\frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathbf{1}\{\xi_{ij} \ge S\}\,(1 - \xi_{ij})}
{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathbf{1}\{\xi_{ij} \ge S\}},
\qquad (5)
$$

whereas the FNR, which represents the proportion of true matches that are not selected, is estimated as

$$
\widehat{\Pr}\bigl(M_{ij} = 1 \mid \xi_{ij} < S\bigr) \;=\;
\frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \xi_{ij}\,\mathbf{1}\{\xi_{ij} < S\}}
{\lambda N_A N_B}.
\qquad (6)
$$
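In code, once the posterior match probabilities are in hand, these two estimators are one-line summaries. The sketch below is our own illustration; xi, lambda, and the sample sizes are assumed to come from the fitted model.

```r
## Estimated FDR and FNR for a given threshold S, following equations (5)-(6);
## xi contains the posterior match probability of every candidate pair.
estimate_error_rates <- function(xi, lambda, n_A, n_B, S = 0.85) {
  selected <- xi >= S
  fdr <- sum((1 - xi)[selected]) / sum(selected)    # equation (5)
  fnr <- sum(xi[!selected]) / (lambda * n_A * n_B)  # equation (6)
  c(FDR = fdr, FNR = fnr)
}

## e.g., with toy posterior probabilities for the 2 x 3 candidate pairs:
estimate_error_rates(xi = c(0.98, 0.02, 0.01, 0.03, 0.91, 0.05),
                     lambda = 2 / 6, n_A = 2, n_B = 3)
```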
Researchers typically select, at their own discretion, the value of $S$ such that the FDR is sufficiently small. But we also emphasize the FNR because a strict threshold can lead to many false negatives.^8
6. For example, Goldstein and Harron (2015) suggest the possibility of treating a comparison that involves a missing value as a separate agreement value, but Sariyar, Borg, and Pommerening (2012) find that this approach does not outperform the standard method of treating missing values as disagreements.
7. This is known as the maximum a posteriori estimate.
8. A more principled solution to the problem of selecting the threshold $S$ would require data for which the true matching status $M_{ij}$ is known, so that one can select the value of $S$ to minimize the classification error. However, in record linkage problems, only on rare occasions do labeled data sets exist. See Larsen and Rubin (2001), Feigenbaum (2016), and Enamorado (2018) for approaches that directly incorporate labeled data.
In our simulations and empirical studies, we find that the results are not particularly sensitive to the choice of threshold value, although in other applications scholars have found that ex post adjustments are necessary for obtaining good estimates of the error rates (e.g., Belin and Rubin 1995; Larsen and Rubin 2001; Murray 2016; Thibaudeau 1993; Winkler 1993; Winkler 2006a).

In the merging process, for a given record in data set A, it is possible to find multiple records in data set B that have high match probabilities. In some cases, multiple observations have an identical value of the match probability, i.e., $\xi_{ij} = \xi_{ij'}$ with $j \ne j'$. Following the literature (e.g., McVeigh and Murray 2017; Sadinle 2017; Tancredi and Liseo 2011), we recommend that researchers analyze all matched observations by weighting them according to the match probability (see the subsection Post-Merge Analysis). If researchers wish to enforce a constraint that each record in one data set is matched with at most one record in the other data set, they may follow the procedure described in Online SI S5.
Incorporating Auxiliary Information
Another advantage of the probabilistic model introduced above is that we can incorporate auxiliary information in parameter estimation. This point has not been emphasized enough in the literature. Here, we briefly discuss two adjustments using auxiliary data: first, how to adjust for the fact that some names are more common than others, and second, how to incorporate aggregate information about migration. More details can be found in Online SI S6.

Because some first names are more common than others, they may be more likely to produce false matches. To adjust for this possibility without increasing the computational burden, we formalize the conditions under which the ex post correction originally proposed by Winkler (2000) is well suited for this purpose. Briefly, the probability of being a match is up-weighted or down-weighted given the true frequencies of different first names (obtained, for instance, from Census data) or the observed frequencies of each unique first name in the data (see Online SI S6.3.1).

Furthermore, we may know a priori how many matches we should find in two data sets because of knowledge and data on over-time migration. For instance, the Internal Revenue Service (IRS) publishes detailed information on migration in the United States from tax records (see https://www.irs.gov/uac/soi-tax-stats-migration-data). An estimate of the share of individuals who moved out of a state or who moved in-state can be easily reformulated as a prior on the relevant parameters in the Fellegi–Sunter model and incorporated into parameter estimation (see Online SI S6.3.2).
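As a stylized illustration of the second adjustment (a simplification of the derivation in Online SI S6.3.2, with hypothetical numbers), aggregate migration data that imply an expected number of matches can be translated into a value around which a prior for $\lambda$ is centered.

```r
## If auxiliary data suggest that roughly 'expected_matches' records of data
## set A should also appear in data set B, a natural prior mean for lambda is
## the implied share of matches among all N_A x N_B candidate pairs.
prior_lambda_mean <- function(expected_matches, n_A, n_B) {
  expected_matches / (n_A * n_B)
}

## Hypothetical example: two state voter files with 5.0 and 5.2 million
## records, where migration tables suggest about 4.6 million people appear
## in both files.
prior_lambda_mean(4.6e6, 5.0e6, 5.2e6)
```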
Post-Merge Analysis
Finally, we discuss how to conduct a statistical analysis once merging is complete. One advantage of probabilistic models is that we can directly incorporate the uncertainty inherent to the merging process in the post-merge analysis. This is important because researchers often use the merged variable either as the outcome or as an explanatory variable in the post-merge analysis. For example, when the ANES validates self-reported turnout by merging the survey data with a nationwide voter file, respondents who cannot be merged are coded as nonregistered voters. Given the uncertainty inherent to the merging process, it is possible that a merging algorithm fails to find some respondents in the voter file even though they are actually registered voters. Similarly, we may incorrectly merge survey respondents with other registered voters. These mismatches, if ignored, can adversely affect the properties of post-merge analyses (e.g., Neter, Maynes, and Ramanathan 1965; Scheuren and Winkler 1993).

Unfortunately, most of the record linkage literature has focused on the linkage process itself without considering how to conduct subsequent statistical analyses after merging data sets.^9 Here, we build on a small literature about post-merge regression analysis, the goal of which is to eliminate possible biases due to the linkage process within the Fellegi–Sunter framework (e.g., Hof and Zwinderman 2012; Kim and Chambers 2012; Lahiri and Larsen 2005; Scheuren and Winkler 1993, 1997). We also clarify the assumptions under which a valid post-merge analysis can be conducted.
The Merged Variable as an Outcome Variable
We first consider the scenario in which researchers wish to use the variable $Z$, merged from data set B, as a proxy for the outcome variable in a regression analysis. We assume that this regression analysis is applied to all observations of data set A and uses a set of explanatory variables $X$ taken from this data set. These explanatory variables may or may not include the variables used for merging. In the ANES application mentioned above, for example, we may be interested in regressing the validated turnout measure merged from the nationwide voter file on a variety of demographic variables measured in the survey.
For each observation $i$ in data set A, we obtain the mean of the merged variable, $z_i^{*} = E(Z_i \mid \gamma, \delta)$, where $Z_i$ represents the true value of the merged variable. This quantity can be computed as the weighted average of the variable $Z$ merged from data set B, where the weights are proportional to the match probabilities, i.e., $z_i^{*} = \sum_{j=1}^{N_B} \xi_{ij} Z_j / \sum_{j=1}^{N_B} \xi_{ij}$. In the ANES application, for example, $z_i^{*}$ represents the probability of turnout for survey respondent $i$ in data set A and can be computed as the weighted average of turnout among the registered voters in the voter file merged with respondent $i$.
9. An important exception is a fully Bayesian approach outside of the Fellegi–Sunter framework, which we do not pursue here because of its limited scalability (see Tancredi and Liseo 2011; Gutman, Afendulis, and Zaslavsky 2013; Gutman et al. 2016; Dalzell and Reiter 2018).
If we use thresholding and one-to-one match assignment so that each record in data set A is matched with at most one record in data set B (see the subsection The Canonical Model of Probabilistic Record Linkage), then we compute the mean of the merged variable as $z_i^{*} = \sum_{j=1}^{N_B} \widehat{M}_{ij} \xi_{ij} Z_j$, where $\widehat{M}_{ij}$ is a binary variable indicating whether record $i$ in data set A is matched with record $j$ in data set B, subject to the constraint $\sum_{j=1}^{N_B} \widehat{M}_{ij} \le 1$.
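As a toy numerical sketch of this construction (with made-up probabilities; in practice $\xi$ would be stored sparsely rather than as a dense $N_A \times N_B$ matrix), the weighted average $z_i^{*}$ can be computed as follows.

```r
## Posterior match probabilities for N_A = 3 survey-style records against
## N_B = 4 candidate records, and a merged 0/1 variable (e.g., turnout) from B.
xi <- rbind(c(0.02, 0.90, 0.01, 0.03),
            c(0.01, 0.02, 0.02, 0.80),
            c(0.95, 0.01, 0.01, 0.02))
Z_B <- c(1, 0, 1, 1)

## z*_i: weighted average of Z over candidate matches, weights proportional to xi
z_star <- as.vector(xi %*% Z_B) / rowSums(xi)
z_star
```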
Under this setting, we assume that the true value of the outcome variable is independent of the explanatory variables in the regression conditional on the information used for merging, i.e.,

$$
Z_i \;\perp\!\!\!\perp\; X_i \;\mid\; (\delta, \gamma), \qquad (7)
$$

for each $i = 1, 2, \ldots, N_A$. The assumption implies that the merging process is based on all relevant information. Specifically, within an agreement pattern, the true value of the merged variable $Z_i$ is not correlated with the explanatory variables $X_i$. Under this assumption, the law of iterated expectations implies that regressing $z_i^{*}$ on $X_i$ gives, in expectation, results equivalent to those based on the regression of $Z_i$ on $X_i$:

$$
E(Z_i \mid X_i) \;=\; E\bigl\{E(Z_i \mid \gamma, \delta, X_i) \mid X_i\bigr\} \;=\; E(z_i^{*} \mid X_i). \qquad (8)
$$

The conditional independence assumption may be violated if, for example, within the same agreement pattern, a variable correlated with the explanatory variables is associated with merging error. Without this assumption, however, only the bounds can be identified (Cross and Manski 2002). Thus, alternative assumptions such as parametric assumptions and exclusion restrictions are needed to achieve identification (see Ridder and Moffitt 2007, for a review).
The Merged Variable as an Explanatory Variable
The second scenario we consider is the case where we use the merged variable as an explanatory variable. Suppose that we are interested in fitting the following linear regression model:

$$
Y_i \;=\; \alpha + \beta Z_i + \eta^{\top} X_i + \epsilon_i, \qquad (9)
$$

where $Y_i$ is a scalar outcome variable and strict exogeneity is assumed, i.e., $E(\epsilon_i \mid Z, X) = 0$ for all $i$. We follow the analysis strategy first proposed by Lahiri and Larsen (2005) but clarify the assumptions required for their approach to be valid (see also Hof and Zwinderman 2012). Specifically, we maintain the assumption of no omitted variable for merging given in equation (7). Additionally, we assume that the merging variables are independent of the outcome variable conditional on the explanatory variables $Z$ and $X$, i.e.,

$$
Y_i \;\perp\!\!\!\perp\; (\gamma, \delta) \;\mid\; Z, X. \qquad (10)
$$
Under these two assumptions, we can consistently estimate the coefficients by regressing $Y_i$ on $z_i^{*}$ and $X_i$:

$$
E(Y_i \mid \gamma, \delta, X_i) \;=\; \alpha + \beta\, E(Z_i \mid \gamma, \delta, X_i) + \eta^{\top} X_i + E(\epsilon_i \mid \gamma, \delta, X_i)
\;=\; \alpha + \beta z_i^{*} + \eta^{\top} X_i, \qquad (11)
$$

where the second equality follows from the assumptions and the law of iterated expectations.
We generalize this strategy to maximum likelihood (ML) estimation, which, to the best of our knowledge, has not been considered in the literature (but see Kim and Chambers (2012) for an estimating-equations approach):

$$
Y_i \mid Z_i, X_i \;\overset{\text{indep.}}{\sim}\; P_{\theta}(Y_i \mid Z_i, X_i), \qquad (12)
$$

where $\theta$ is a vector of model parameters. To estimate the parameters of this model, we maximize the following weighted log-likelihood function:

$$
\hat{\theta} \;=\; \operatorname*{argmax}_{\theta} \;
\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \xi_{ij}^{*} \log P_{\theta}(Y_i \mid Z_i = Z_j, X_i), \qquad (13)
$$

where $\xi_{ij}^{*} = \xi_{ij} / \sum_{j'=1}^{N_B} \xi_{ij'}$. Online SI S7 shows that under the two assumptions described earlier and mild regularity conditions, the weighted ML estimator given in equation (13) is consistent and asymptotically normal. Note that because we are considering large data sets, we ignore the uncertainty about $\xi_{ij}$.
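For a Gaussian linear model, the weighted ML estimator in equation (13) amounts to a weighted least-squares fit on the stacked candidate pairs, with the normalized probabilities $\xi_{ij}^{*}$ as weights. The sketch below uses simulated toy data; the column names and data layout are ours, not the paper's replication code.

```r
## One row per candidate pair (i, j): Y and X come from data set A (constant
## within i), Z comes from the candidate record j in data set B, and xi is
## the posterior match probability of the pair.
set.seed(2)
n_A <- 200; cand <- 3
X_A <- rnorm(n_A)
Z_true <- rbinom(n_A, 1, 0.5)                    # value of Z on the true match in B
Y_A <- 0.5 + 1.0 * Z_true + 0.3 * X_A + rnorm(n_A)

## Stack the candidate pairs: the first candidate for each i is the true match
## (high xi), the other two are spurious candidates (low xi).
pairs <- data.frame(
  i  = rep(seq_len(n_A), each = cand),
  Y  = rep(Y_A, each = cand),
  X  = rep(X_A, each = cand),
  Z  = as.vector(rbind(Z_true,
                       rbinom(n_A, 1, 0.5),
                       rbinom(n_A, 1, 0.5))),
  xi = rep(c(0.90, 0.05, 0.05), times = n_A)
)

pairs$w <- ave(pairs$xi, pairs$i, FUN = function(p) p / sum(p))  # xi*_ij in equation (13)
fit <- lm(Y ~ Z + X, data = pairs, weights = w)                  # weighted ML, Gaussian case
summary(fit)
```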
SIMULATION STUDIES
We conduct a comprehensive set of simulation studies to evaluate the statistical accuracy and computational efficiency of our probabilistic modeling approach and to compare them with those of deterministic methods. Specifically, we assess the ability of the proposed methodology to control estimation error, false positives, and false negatives, and its robustness to missing values and noise in the linkage fields, as well as to the degree of overlap between the two data sets to be merged. We do so by systematically varying the amount and structure of missing data and measurement error.
The Setup
To make our simulation studies realistic, we use a data set taken from the 2006 California voter file. Because merging voter files is often done by blocking on gender, we subset the data to include female voters only, reducing the number of observations from approximately 17 million to 8.3 million. To create a base data set for the simulations, we further subset the data by removing all observations that have at least one missing value in the following variables: first name, middle initial, last name, date of birth, registration date, address, zip code, and turnout in the 2004 presidential election. After listwise deletion, we obtain a final data set of 341,160 voters, from which we generate two data sets of various characteristics to be merged. From this data set, we independently and randomly select two subsamples to be merged under a variety of scenarios.
We design our simulation studies by varying the values of five parameters, as summarized below. Online SI S9.1 describes in detail the precise setups of these simulations.

1. Degree of overlap: the proportion of records in the smaller data set that are also in the larger data set. We consider three scenarios: 20% (small), 50% (medium), and 80% (large).
2. Size balance: the balance of sample sizes between the two data sets to be merged. We consider three ratios: 1:1 (equally sized), 1:10 (imbalanced), and 1:100 (lopsided).
3. Missing data: we consider five different mechanisms: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). For MAR and NMAR, we consider independent and dependent missingness patterns across linkage fields.
4. Amount of missing data: the proportion of missing values in each linkage variable other than year of birth. We consider three scenarios: 5% (small), 10% (medium), and 15% (large).
5. Measurement error: the proportion of records (6%) for which the first name, last name, and street name contain a classical measurement error.

Together, we conduct a total of 135 (= 3 × 3 × 5 × 3) simulation studies in which missing data are the main concern. We also conduct another set of 135 simulations with various types of nonclassical measurement errors, while keeping the amount of missing values fixed (see Online SI S9.2).
Results
Figure 1 compares the performance of fastLink (blue solid bars) to the two deterministic methods often used by social scientists. The first is a merging method based on exact matches (red shaded bars), whereas the second is the recently proposed partial-match algorithm (ADGN; light green solid bars), which considers two records a match if at least three fields of their address, date of birth, gender, and name are identical (Ansolabehere and Hersh 2017). The top panel of Figure 1 presents the FNR, whereas the bottom panel presents the absolute error for estimating the 2004 turnout rate. We merge two data sets of equal size (100,000 records each) after introducing the classical measurement error and the medium amount of missing data, as explained above. For fastLink, only pairs with a match probability of at least 0.85 are considered to be matches, but the results remain qualitatively similar if we change the threshold to 0.75 or 0.95.
We find that fastLink significantly outperforms the two deterministic methods.^10 Although all three methods are designed to control the FDR, only fastLink is able to keep the FNR low (less than five percentage points in all cases considered here). The deterministic algorithms are not robust to missing data and measurement error, yielding an FNR of much greater magnitude.
FIGURE 1. Accuracy of Data Merge
The top and bottom panels present the false negative rate (FNR) and the absolute estimation error (for estimating the turnout rate), respectively, when merging data sets of 100,000 records each with different levels of overlap (measured as a percentage of a data set). Three missing data mechanisms are studied with a missing data proportion of 10% for each linkage field other than year of birth: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Classical measurement error is introduced to several linkage fields. The proposed probabilistic methodology (fastLink; blue solid bars) significantly outperforms the two deterministic algorithms, i.e., exact match (red shaded bars) and partial match (ADGN; light green solid bars), across simulation settings.
10. In Online SI S8, we compare fastLink to the state-of-the-art probabilistic methods and find that fastLink performs as well as these methods.
Additionally, we observe that the deterministic methods yield a substantially greater estimation bias than fastLink unless the data are MCAR. Under the other two missing data mechanisms, the magnitude of their bias is substantially greater than that of fastLink. Although fastLink has an absolute estimation error of less than 1.5 percentage points even under MNAR, the other two methods have an absolute estimation error of more than 7.5 percentage points under both MAR and MNAR. Finally, the performance of fastLink worsens as the size of the overlap is reduced and the missing data mechanism becomes less random.

We next evaluate the accuracy of the FDR and FNR estimates, shown in the top and bottom panels of Figure 2, respectively. Because the deterministic methods do not give such error estimates, we compare the performance of the proposed methodology (indicated by blue solid circles) with that of the same probabilistic modeling approach that treats missing values as disagreements, following a common practice in the literature (indicated by black solid triangles). Figure 2 presents the results of merging two data sets of equal size where the medium amount of data are assumed to be MAR and some noise is added as described earlier. In the top panel of the figure, we find that the true FDR is low and its estimate is accurate unless the degree of overlap is small. With a small degree of overlap, both methods significantly underestimate the FDR. A similar finding is obtained for the FNR in the bottom panel of the figure, where the estimated FNR is biased upward.
One way to address the problem of small overlap would be to use blocking based on a set of fully observed covariates. For example, in our simulations, because the year of birth is observed for each record in both data sets, we block the data by making comparisons only across individuals within a window of $\pm 1$ year around each birth year.^11 Then, we apply fastLink to each block separately. As shown in the rightmost column of Figure 2, blocking significantly improves the estimation accuracy of the FDR and FNR estimates as well as their true values, although the bias is not eliminated. The reason for this improvement is that traditional blocking increases the degree of overlap. For example, in this simulation setting, for each of the 94 blocks under consideration, the ratio of true matches to all possible pairs is at least $8 \times 10^{-5}$, which is more than 15 times as large as the corresponding ratio with no blocking and is comparable to the case of 50% overlap.
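A base-R sketch of this kind of window blocking (independent of any particular package; variable names are ours) pairs each birth-year block in one file with the records in the other file whose birth year falls within one year.

```r
## For each birth year present in data set A, find the indices of records in
## A born that year and of records in B born within +/- 1 year of it.
window_block_pairs <- function(yearsA, yearsB, width = 1) {
  lapply(sort(unique(yearsA)), function(y) {
    list(year = y,
         A = which(yearsA == y),
         B = which(abs(yearsB - y) <= width))
  })
}

blocks <- window_block_pairs(yearsA = c(1950, 1950, 1962),
                             yearsB = c(1949, 1951, 1963, 1980))
blocks[[1]]  # A records born in 1950 paired with B records born 1949-1951
```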
We present the results of the remaining simulation studies in the Online Simulation Appendix.
FIGURE 2. Accuracy of FDR and FNR Estimates
The top panel compares the estimated FDR (x-axis) with its true value (y-axis), whereas the bottom panel compares the estimated FNR against its true value. We consider the medium amount of missing data generated under MAR as the missingness mechanism and add measurement error to some linkage fields. The blue solid circles represent the estimates based on fastLink, whereas the black solid triangles represent the estimates obtained by treating missing data as disagreements. The FDR and FNR estimates are accurate when the overlap is high. Additionally, fastLink gives lower FDR and FNR than the same algorithm that treats missing values as disagreements. Note that in cases where the overlap is small (20%), blocking improves the precision of our estimates.
11. In Online SI S12, we also present results using a clustering method, i.e., k-means, to group similar observations.
Two major patterns discussed above are also found under these other simulation scenarios. First, regardless of the missing data mechanism and the amount of missing observations, fastLink controls the FDR, FNR, and estimation error well. Second, a greater degree of overlap between the data sets leads to better merging results in terms of FDR and FNR as well as the accuracy of their estimates. Blocking can ameliorate the problems caused by small overlap to some extent. These empirical patterns are consistently found across simulations even when the two data sets have unequal sizes.
Computational Efficiency
We compare the computational performance of fastLink with that of the RecordLinkage package in R (Sariyar and Borg 2016) and the Record Linkage package in Python (de Bruin 2017) in terms of running time. The latter two are the only other open-source packages in R and Python that implement a probabilistic model of record linkage under the Fellegi–Sunter framework. To mimic a standard computing environment of applied researchers, all calculations are performed on a Macintosh laptop computer with a 2.8 GHz Intel Core i7 processor and 8 GB of RAM. Although fastLink takes advantage of a multicore machine via OpenMP-based parallelization (the other two packages do not have a parallelization feature), we perform the comparison in a single-core computing environment so that we can assess the computational efficiency of our algorithm itself. Additionally, we include runtime results where we parallelize the computation across eight cores. For all implementations, we set the convergence threshold to $1 \times 10^{-5}$.^12

We consider the setup in which we merge two data sets of equal size with 50% overlap, a 10% missing-data proportion under MCAR, and no measurement error. Our linkage variables are first name, middle initial, last name, house number, street name, and year of birth. We vary the size of each data set from 1,000 to 300,000 records. As in the earlier simulations, each data set is based on the sample of 341,160 female registered voters in California for whom we have complete information in each linkage field. To build the agreement patterns, we use the Jaro–Winkler string distance with a cutoff of 0.94 for first name, last name, and street name. For the remaining fields, we only consider exact matches as agreements.
Figure 3 presents the results of this running time comparison. We find that although all three packages take a similar amount of time for data sets of 1,000 records, the running time increases exponentially for the other packages, in contrast to fastLink (black solid triangles connected by a dashed line for a single core; blue solid circles connected by a solid line for eight cores), which exhibits a near-linear increase. When matching data sets of 150,000 records each, fastLink takes less than six hours to merge using a single core (under three hours when parallelized across eight cores). In contrast, it takes more than 24 hours for Record Linkage (Python; solid purple squares connected by a dotted line) to merge two data sets of only 20,000 observations each. The performance is not as bad for RecordLinkage (R; red crosses connected by a dashed line), but it still takes over six hours to merge data sets of 40,000 records each. Moreover, an approximation based on an exponential regression model suggests that RecordLinkage (R) would take around 22 hours to merge two data sets of 50,000 records each, while Record Linkage (Python) would take about 900 days to accomplish the same merge.

In Online SI S3.1, we further decompose the runtime comparison to provide more detail on the sources of our computational improvements. We detail the choices we make in the computational implementation that yield these substantial efficiency gains in Appendix A.
EMPIRICAL APPLICATIONS
In this section, we present two empirical applications of the proposed methodology. First, we merge election survey data (about 55,000 observations) with political contribution data (about five million observations). The major challenge of this merge is that the expected number of matches between the two data sets is small. Therefore, we utilize blocking and conduct the data merge within each block. The second application is to merge two nationwide voter files, each of which has more than 160 million records. This may, therefore, represent the largest data merge ever conducted in the social sciences. We show how to use auxiliary information about within-state and across-state migration rates to inform the match.
Merging Election Survey Data with Political Contribution Data
Hill and Huber (2017) study differences between donors and nondonors by merging the 2012 CCES survey with the Database on Ideology, Money in Politics, and Elections (DIME; Bonica 2013). The 2012 CCES is based on a nationally representative sample of 54,535 individuals recruited from the voting-age population in the United States. The DIME data, on the other hand, provide information about individual donations to political campaigns. For the 2010 and 2012 elections, the DIME contains over five million donors.

The original authors asked YouGov, the company that conducted the survey, to merge the two data sets using a proprietary algorithm. This yielded a total of 4,432 CCES respondents matched to a donor in the DIME data. After the merge, Hill and Huber (2017) treat each matched CCES respondent as a donor and conduct various analyses by comparing these matched respondents with those who are not matched with a donor in the DIME data and hence are treated as nondonors.
12. Starting values differ across methods because the other methods do not allow us to change their default starting values. However, the EM algorithm converges quickly regardless of the choice of starting values. In fact, it is well known that the bottleneck is the large number of required comparisons (e.g., Christen 2012; Jaro 1972), for which we use a hashing technique as described in Appendix A.
Below, we apply the proposed methodology to merge these two data sets and conduct a post-merge analysis by incorporating the uncertainty about the merge process.
Merge Procedure
We use the name, address, and gender information to merge the two data sets. To protect the anonymity of CCES respondents, YouGov used fastLink to merge the data sets on our behalf. Moreover, because of contractual obligations, the merge was conducted only for 51,184 YouGov panelists, a subset of the 2012 CCES respondents. We block based on gender and state of residence, resulting in 102 blocks (50 states plus Washington DC, times two gender categories). The size of each block ranges from 175,861 pairs (CCES = 49, DIME = 3,589) to 790,372,071 pairs (CCES = 2,367, DIME = 333,913), with a median of 14,048,151 pairs (CCES = 377, DIME = 37,263). Within each block, we merge the data sets using first name, middle initial, last name, house number, street name, and postal code. As done in the simulations, we use three levels of agreement for the string-valued variables based on the Jaro–Winkler distance, with 0.85 and 0.92 as the thresholds. For the remaining variables (i.e., middle initial, house number, and postal code), we use a binary comparison indicating whether they have an identical value.
To construct our set of matched pairs between the CCES and DIME, we first use the one-to-one matching assignment algorithm described in Online SI S5 to find the best match in the DIME data for each CCES respondent. We then declare as a match any pair whose match probability is above a certain threshold. We use three thresholds, i.e., 0.75, 0.85, and 0.95, and examine the sensitivity of the empirical results to the choice of threshold value.^13 Finally, in the original study of Hill and Huber (2017), noise was added to the amount of contributions to protect the anonymity of matched CCES respondents. However, we signed a nondisclosure agreement with YouGov for our analysis so that we can make a precise comparison between the proposed methodology and the proprietary merge method used by YouGov.
Merge Results
Table 2 presents the merge results. We begin by assessing the match rates, which represent the proportion of CCES respondents who are matched with donors in the DIME data. Although the match rates are similar between the two methods, fastLink appears to find slightly more (fewer) matches for male (female) respondents than the proprietary method, regardless of the threshold used. However, this does not mean that both methods find identical matches. In fact, of the 4,797 matches identified by fastLink (using the threshold of 0.85), the proprietary method does not identify 861, or 18%, of them as matches.

As discussed in the subsection The Canonical Model of Probabilistic Record Linkage, one important advantage of the probabilistic modeling approach is that we can estimate the FDR and FNR, which are shown in the table.
FIGURE 3. Running Time Comparison
The plot presents the results of merging data sets of equal size using different implementations of the Fellegi–Sunter model. The data sets were constructed from a sample of female registered voters in California. The amount of overlap between the data sets is 50%, and, for each data set, there are 10% missing observations in each linkage variable: first name, middle initial, last name, house number, street name, and year of birth. The missing data mechanism is missing completely at random (MCAR). The computation is performed on a Macintosh laptop computer with a 2.8 GHz Intel Core i7 processor and 8 GB of RAM. The proposed implementation, fastLink (single-core runtime as black solid triangles connected by a dashed line, and parallelized over eight cores as blue solid dots connected by a solid line), is significantly faster than the other open-source packages.
13. In Online SI S10.3, instead of the one-to-one matching restriction used here, we present the results of the weighted approach described in the subsection The Uncertainty of the Merge Process. As shown in Figure S7 of Online SI S10.3, there is no distinguishable difference in the results obtained from either approach.
Such error rates are not available for the proprietary method. As expected, the overall estimated FDR is controlled to less than 1.5% for both male and female respondents. The estimated FNR, on the other hand, is large, illustrating the difficulty of finding some donors. In particular, we find that female donors are much more difficult to find than male donors.

Specifically, there are 12,803 CCES respondents who said they made a campaign contribution during the 12 months before the 2012 election. Among them, 5,206 respondents claimed to have donated at least 200 dollars. Interestingly, both fastLink and the proprietary method matched an essentially identical number of self-reported donors with a contribution of over 200 dollars (2,431 and 2,434, or approximately 47%, respectively), whereas among the self-reported small donors both methods can only match approximately 16% of them.
Next, we examine the quality of matches for the two methods (see also Online SI S13). We begin by comparing the self-reported donation amount of matched CCES respondents with their actual donation amount recorded in the DIME data. Although only donations greater than 200 dollars are recorded at the federal level, the DIME data include some donations of smaller amounts, if not all, at the state level. Thus, although we do not expect a perfect correlation between the self-reported and actual donation amounts, under the assumption that donors do not systematically under- or over-report the amount of campaign contributions, a high correlation between the two measures implies a more accurate merging process.

The upper panel of Figure 4 presents the results, where for fastLink we use one-to-one matching with the threshold of 0.85.^14 We find that for the respondents who are matched by both methods, the correlation between the self-reported and matched donation amounts is reasonably high (0.73). In the case of respondents who are matched by fastLink only, we observe that the correlation is low (0.57) but greater than the correlation for those matches identified by the proprietary method alone (0.42). We also examine the distribution of match probabilities for these three groups of matches. The bottom panel of the figure presents the results, which are consistent with the patterns of correlation identified in the top panel. That is, those matches identified by the two methods have the highest match probability, whereas most of the matches identified only by the proprietary method have extremely low match probabilities. In Online SI S13, we also examine the quality of the agreement patterns separately for the matches identified by both methods, by fastLink only, and by the proprietary method only. Overall, our results indicate that fastLink produces matches whose quality is often better than that based on the proprietary method.
TABLE 2. The Results of Merging the 2012 Cooperative Congressional Election Study (CCES) with the 2010 and 2012 Database on Ideology, Money in Politics, and Elections (DIME) Data

                                                  fastLink                      Proprietary
                                         0.75      0.85      0.95               method
  Number of matches          All         4,948     4,797     4,576              4,534
                             Female      2,198     2,156     2,067              2,210
                             Male        2,750     2,641     2,524              2,324
  Overlap between fastLink   All         3,959     3,936     3,881
  and proprietary method     Female      1,877     1,866     1,844
                             Male        2,082     2,070     2,037
  Match rate (%)             All         9.67      9.37      8.94               8.85
                             Female      8.12      7.96      7.63               8.16
                             Male        11.40     10.95     10.40              9.64
  FDR (%)                    All         1.24      0.65      0.21
                             Female      0.92      0.53      0.14
                             Male        1.50      0.75      0.28
  FNR (%)                    All         15.25     17.35     20.81
                             Female      5.35      6.80      10.30
                             Male        21.83     24.36     27.79

Note: The table presents the merging results for both fastLink and the proprietary method used by YouGov. The results of fastLink are presented for one-to-one matching with three different thresholds (0.75, 0.85, and 0.95) for the match probability required to declare a pair of observations a successful match. The number of matches, the amount of overlap, and the overall match rates are similar between the two methods. The table also presents the estimated false discovery and false negative rates (FDR and FNR, respectively) obtained using fastLink. These statistics are not available for the proprietary method.
[14] Figures S5 and S6 in Online SI S10 present the results under two different thresholds: 0.75 and 0.95, respectively. The results under those thresholds are similar to those with the threshold of 0.85 presented here.
Post-Merge Analysis
An important advantage of the probabilistic modeling approach is its ability to account for the uncertainty of the merge process in post-merge analyses. We illustrate this feature by revisiting the post-merge analysis of Hill and Huber (2017). The original authors are interested in the comparison of donors (defined as those who are matched with records in the DIME data) and nondonors (defined as those who are not matched) among CCES respondents. Using the matches identified by a proprietary method, Hill and Huber (2017) regress policy ideology on the matching indicator variable, which is interpreted as a donation indicator variable, the turnout indicator variables for the 2012 general election and 2012 congressional primary elections, as well as several demographic variables. Policy ideology, which ranges from -1 (most liberal) to 1 (most conservative), is constructed by applying a factor analysis to a series of questions on various issues.[15] The demographic control variables include income, education, gender, household union membership, race, age in decades, and importance of religion. The same model is fitted separately for Democrats and Republicans.
To account for the uncertainty of the merge process, as explained in the subsection Post-Merge Analysis, we fit the same linear regression except that we use the mean of the match indicator variable as the main explanatory variable rather than the match indicator variable itself. Table 3 presents the estimated coefficients of the aforementioned linear regression models with the corresponding heteroskedasticity-robust standard errors in parentheses. Generally, the results of our improved analysis agree with those of the original analysis, showing that donors tend to be more ideologically extreme than nondonors.
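To make this adjustment concrete, the following sketch fits a regression of policy ideology on the posterior match probability (the mean of the match indicator) with heteroskedasticity-robust standard errors. It is only an illustration of the approach, not replication code: the file name, the column names (ideology, match_prob, turnout_general, turnout_primary), and the use of Python's statsmodels are our assumptions.

```python
# A minimal sketch (not replication code): regress policy ideology on the
# posterior match probability instead of a 0/1 match indicator, with
# heteroskedasticity-robust standard errors. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cces_dime_merged.csv")  # hypothetical merged CCES-DIME file

# match_prob is the posterior match probability produced by the probabilistic merge
# (demographic controls omitted for brevity)
X = sm.add_constant(df[["match_prob", "turnout_general", "turnout_primary"]])
y = df["ideology"]

fit = sm.OLS(y, X).fit(cov_type="HC1")  # robust SEs, analogous to Table 3
print(fit.summary())
```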
Although the overall conclusion is similar, the estimated coefficients are smaller in magnitude when accounting for the uncertainty of the merge process. In particular, according to fastLink, for Republican respondents, the estimated coefficient of being a donor represents only 12% of the standard deviation of their ideological positions (instead of 21% given by the proprietary method). Indeed, the difference in the estimated coefficients between fastLink and the proprietary method is statistically significant for both Republicans (0.035, s.e. = 0.014) and Democrats
(-0.015, s.e. = 0.007). Moreover, although the original analysis finds that the partisan mean ideological difference for donors (1.108, s.e. = 0.018) is 31 percent larger than that for nondonors (0.848, s.e. = 0.001), the results based on fastLink show that this difference is only 25 percent larger for donors (1.058, s.e. = 0.018). Thus, although the proprietary method suggests that the partisan gap for donors is similar to the partisan gap for those with a college degree or higher (1.100, s.e. = 0.036), fastLink shows that it is closer to the partisan gap for those with just some college education but without a degree (1.036, s.e. = 0.035).

FIGURE 4. Comparison of fastLink and the Proprietary Method
The top panel compares the self-reported donations (y-axis) by matched CCES respondents with their donation amount recorded in the DIME data (x-axis) for the three different groups of observations: those declared as matches by both fastLink and the proprietary method (left), those identified by fastLink only (middle), and those matched by the proprietary method only (right). The bottom panel presents histograms of the match probability for each group. For fastLink, we use one-to-one match with the threshold of 0.85.

[15] These include gun control, climate change, immigration, abortion, jobs versus the environment, same-sex marriage, affirmative action, and fiscal policy.
Merging Two Nationwide Voter Files Over Time
Our second application is what might be the largest data merging exercise ever conducted in the social sciences. Specifically, we merge the 2014 nationwide voter file to the 2015 nationwide voter file, each of which has over 160 million records. The data sets are provided by L2, Inc., a leading national nonpartisan firm and the oldest organization in the United States that supplies voter data and related technology to candidates, political parties, pollsters, and consultants for use in campaigns. In addition to the sheer size of the data sets, merging these nationwide voter files is methodologically challenging because some voters change their residence over time, making addresses uninformative for matching these voters.
Merge Procedure
When merging data sets of this scale, we must drastically reduce the number of comparisons. In fact, if we examined all possible pairwise comparisons between the two voter files, the total number of such pairs would exceed 2.5 × 10^16. It is also important to incorporate auxiliary information about movers because the address variable is uninformative when matching these voters. We use the IRS Statistics of Income (SOI) data to calibrate match rates for within-state and across-state movers. Details on incorporating migration rates into parameter estimation can be found in the subsection Incorporating Auxiliary Information and Online SI S6.2. The IRS SOI data are the definitive source of migration data in the United States, tracking individual residences year to year across all states through tax returns.
We develop the following two-step procedure that utilizes random sampling and blocking of voter records to reduce the computational burden of the merge (see Online SI S3.2 and S6.2). Our merge is based on first name, middle initial, last name, house number, street name, date/year/month of birth, date/year/month of registration, and gender. The first step uses each of these fields to inform the merge, whereas the second step uses only first name, middle initial, last name, date/year/month of birth, and gender. For both first name and last name, we include a partial match category based on the Jaro-Winkler string distance calculation, setting the cutoff for a full match at 0.92 and for a partial match at 0.88.
As described in Online SI S6.2, we incorporate auxiliary information into the model by moving from the likelihood framework to a fully Bayesian approach. Because of the conjugacy of our priors, we can obtain the estimated parameters by maximizing the log posterior distribution via the EM algorithm. This approach allows us to maintain computational efficiency.[16]
Step 1: Matching within-state movers and nonmovers for each state.
(a) Obtain a random sample of voter records from each state file.
(b) Fit the model to this sample, using the within-state migration rates from the IRS data to specify prior parameters.
(c) Create blocks by first stratifying on gender and then applying the k-means algorithm to the first name.
(d) Using the estimated model parameters, conduct the data merge within each block.
Step 2: Matching across-state movers for each pair of states.
(a) Set aside voters who are identified as successful matches in Step 1.
(b) Obtain a random sample of voter records from each state file as done in Step 1(a).
(c) Fit the model, using the across-state migration rates from the IRS data to specify prior parameters.
(d) Create blocks by first stratifying on gender and then applying the k-means algorithm to the first name as done in Step 1(c).
(e) Using the estimated model parameters, conduct the data merge within each block as done in Step 1(d).

TABLE 3. Predicting Policy Ideology Using Contributor Status

                                          Republicans                Democrats
                                      Original    fastLink     Original    fastLink
Contributor                           0.080***    0.046***    -0.180***   -0.165***
                                      (0.016)     (0.015)      (0.008)     (0.009)
Turnout for 2012 general election     0.095***    0.095***    -0.060***   -0.060***
                                      (0.013)     (0.013)      (0.010)     (0.010)
Turnout for 2012 primary election     0.094***    0.095***    -0.019**    -0.022***
                                      (0.009)     (0.009)      (0.009)     (0.009)
Demographic controls                  Yes         Yes          Yes         Yes
Number of observations                17,386      17,386       20,925      20,925

Note: The estimated coefficients from the linear regression of policy ideology score on the contributor indicator variable and a set of demographic controls. Along with the original analysis, the table presents the results of the improved analysis based on fastLink, which accounts for the uncertainty of the merge process. *** p < 0.001, ** p < 0.01, * p < 0.05. Robust standard errors in parentheses.

[16] Specifically, we set prior parameters on the expected match rate and the expected within-state mover rate using the IRS data, giving 75% weight to the prior estimate and 25% weight to the maximum likelihood estimate. For the first step, we set priors on both π_{address,1,0} (the probability of a voter's address not matching conditional on being in the matched set, which is equivalent to the share of in-state movers in the matched set) and λ. For the second step, we set a prior on λ.
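Footnote 16 describes giving 75% weight to the IRS-based prior estimate and 25% weight to the maximum likelihood estimate. The snippet below is only a minimal sketch of that weighting idea; the actual implementation places conjugate priors on λ and π_{address,1,0} and maximizes the log posterior via the EM algorithm (Online SI S6.2), and the numbers used here are placeholders.

```python
# Illustrative 75%/25% blend of a prior estimate with a maximum likelihood
# estimate, as described in footnote 16. The numbers are placeholders only.
PRIOR_WEIGHT = 0.75

def blend(prior_estimate, mle_estimate, w=PRIOR_WEIGHT):
    """Weighted combination of a prior-based estimate and an MLE."""
    return w * prior_estimate + (1.0 - w) * mle_estimate

# e.g., the expected match rate (lambda) informed by IRS migration data
lambda_hat = blend(prior_estimate=0.90, mle_estimate=0.86)
print(round(lambda_hat, 3))  # 0.89
```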
In Step 1, we apply a random sampling, rather than blocking, strategy so that we can use the within-state migration rates from the IRS data and fit the model to a representative sample for each state. For the same reason, we use a random sampling strategy in Step 2 to exploit the availability of IRS across-state migration rates. We obtain a random sample of 800,000 voter records for files with over 800,000 voters and use the entire state file for states with fewer than 800,000 voter records on file. Online SI S11 shows through simulation studies that for data sets as small as 100,000 records, a 5% random sample leads to parameter estimates nearly indistinguishable from those obtained using the full data set. Based on this finding, we choose 800,000 records as the size of the random samples, corresponding to approximately 5% of the records from California, the largest state in the United States.
Second, within each step, we conduct the merge by creating blocks to reduce the number of pairs under consideration. We block based on gender, first name, and state, and we select the number of blocks so that the average size of each blocked data set is approximately 250,000 records. To block by first name, we rank order the first names alphabetically and run the k-means algorithm on this ranking in order to create clusters of maximally similar names.[17]
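The sketch below illustrates this blocking idea: first names are assigned their alphabetical rank, and k-means is run on the ranks so that alphabetically adjacent names fall into the same block. The example names, the number of clusters, and the use of scikit-learn are assumptions made for illustration; this is not the fastLink implementation.

```python
# Sketch of blocking by k-means on the alphabetical rank of first names.
# The names and the number of clusters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

first_names = np.array(["ALAN", "ALICE", "BRUCE", "CARL", "CARLA", "DAVID", "DEBRA", "ZOE"])

# Alphabetical rank of each name
order = np.argsort(first_names)
ranks = np.empty(len(first_names))
ranks[order] = np.arange(len(first_names))

# Cluster the ranks so that alphabetically adjacent names share a block
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ranks.reshape(-1, 1))
for block in range(3):
    print(block, first_names[kmeans.labels_ == block])
```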
Finally, the entire merge procedure is computationally intensive. The reason is that we need to repeat Step 1 for each of the 50 states plus Washington, DC, and apply Step 2 to each of 1,275 pairs of states. Thus, as explained in Online SI S2, we use parallelization whenever possible. All merges were run on a Linux cluster with 16 2.4-GHz Broadwell 28-core nodes with 128 GB of RAM per node.
Merge Results
Table 4 presents the overall match rate, FDR, and FNR obtained from fastLink. We assess the performance of the match at three separate matching probability thresholds for declaring a pair of observations a successful match: 0.75, 0.85, and 0.95. We also break out the matches into within-state matches only and across-state matches only. Across the three thresholds, the overall match rate remains very high, at 93.04% under the 0.95 threshold, while the estimated FDR and FNR remain controlled at 0.03% and 3.86%, respectively. All three thresholds yield match rates that are significantly higher than the corresponding match rates of the exact matching technique.
In Figure 5, we examine the quality of the merge separately for the within-state merge (top panel) and the across-state merge (bottom panel). The first column plots the distribution of the matching probability across all potential match pairs. For both the within-state and across-state merges, we observe a clear separation between the successful matches and the unsuccessful matches, with very few matches falling in the middle. This suggests that the true and false matches are identified reasonably well. In the second column, we examine the distribution of the match rate by state. Here, we see that most states are tightly clustered between 88% and 96%. Only Ohio, with a match rate of 85%, has a lower match rate. For the across-state merge, the match rate is clustered tightly between 0% and 5%.
In the third column, we plot the estimated FDR against the estimated FNR for each state. For the within-state merge, the FDR is controlled well: every state other than Minnesota has an FDR below 0.1%. Additionally, there are only two states, Mississippi and New Mexico, where fastLink seems to have trouble identifying true matches, as measured by the FNR. In the across-state merge, the FDR for every state is below 0.1%, suggesting that the resulting matches are of high quality. Furthermore, fastLink appears to be finding a high share of true movers across voter files, as the FNR for all but three states falls under 2%.

FIGURE 5. Graphical Diagnostics from Merging the 2014 Nationwide Voter File with the 2015 Nationwide Voter File
This figure presents graphical diagnostics for fastLink for within-state matches (top panel) and across-state matches (bottom panel). The first column plots the distribution of the matching probability across all patterns. The second column plots the distribution of the match rate for each state. Lastly, the third column compares the FNR against the FDR for each state separately.

TABLE 4. The Results of Merging the 2014 Nationwide Voter File with the 2015 Nationwide Voter File

                                              fastLink
                                     0.75      0.85      0.95      Exact
Match count (millions)  All         135.60    129.69    128.73     91.62
                        Within-state 127.38    127.12    126.80     91.36
                        Across-state   8.22      2.57      1.93      0.27
Match rate (%)          All          97.25     93.67     93.04     66.24
                        Within-state  92.06     91.87     91.66     66.05
                        Across-state   5.19      1.80      1.38      0.19
FDR (%)                 All           1.02      0.10      0.03
                        Within-state   0.08      0.04      0.01
                        Across-state   0.95      0.06      0.02
FNR (%)                 All           3.35      3.63      3.86
                        Within-state   2.63      2.83      3.05
                        Across-state   0.72      0.80      0.81

Note: This table presents the merging results for fastLink for three different thresholds (i.e., 0.75, 0.85, 0.95) for the matching probability to declare a pair of observations a successful match. Across the different thresholds, the match rates do not change substantially and are significantly greater than the corresponding match rates of the exact matching technique.

[17] See Online SI S14 for evidence that this blocking strategy performs similarly to a blocking strategy based on age windowing.
Finally, we examine the across-state migration patterns recovered from our matching procedure. Figure 6 displays a heatmap of the migration patterns obtained from fastLink, with darker purple colors indicating a higher match rate when merging the 2014 nationwide voter file for a given state (origin state) to the 2015 nationwide voter file for a given state (destination state). We uncover several regional migration patterns. First, we find a migration cluster in New England, where voters from New Hampshire and Rhode Island migrated to Massachusetts between 2014 and 2015. Another strong migration cluster exists between New Jersey, Delaware, and Pennsylvania in the mid-Atlantic region. Both patterns suggest that most migration occurs between clusters of adjacent states and urban centers. Finally, we find a large volume of out-migration to Florida from across the United States, and this out-migration is particularly concentrated in states on the Eastern seaboard such as Virginia, New Hampshire, New Jersey, and Connecticut. This pattern possibly reflects the flow of older voters and retirees to a more temperate climate.

FIGURE 6. Across-State Match Rates for the 2014 Nationwide Voter File to 2015 Nationwide Voter File Merge
We plot the match rates from each across-state match pair as a heatmap, where darker colors indicate a higher match rate.
CONCLUDING REMARKS
With the advance of the Internet, the last two decades have witnessed a data revolution in the social sciences, where diverse and large data sets have become electronically available to researchers. Much of today's cutting-edge quantitative social science research results from researchers' creativity in linking multiple data sets that are collected separately. In many cases, however, a unique identifier that can be used to merge multiple data sources does not exist. Currently, most social scientists rely on either deterministic or proprietary methods. Yet deterministic methods are not robust to measurement errors and missing data, cannot quantify the uncertainty inherent in the merge process, and often require arbitrary decisions from researchers. Proprietary methods, many of which are also deterministic, lack transparency and hence are not suitable for academic and policy research, where reproducibility and transparency play an essential role.
Here, we advocate the use of probabilistic modeling to assist the merging of large-scale data sets. The main advantage of probabilistic models is their ability to estimate the false positive and false negative rates that arise when linking multiple data sets. We contribute to the statistical literature on record linkage by developing a fast and scalable implementation of the canonical model. Through simulation and empirical studies, we demonstrate that the proposed methodology can quickly and reliably merge data sets even when they have millions of records.
Like any method, however, the proposed record linkage technology has important limitations of which researchers must be aware. Most importantly, the proposed methodology is likely to have a difficult time producing high-quality matches when the overlap between two data sets is expected to be small. As shown in our simulation studies, for these difficult merge problems, effective blocking is essential. Blocking is even more important when linking many data sets at once. Other important research questions are how to merge more than two files at the same time and how to efficiently use a small amount of hand-coded data to improve the quality of record linkage. We leave these methodological challenges to future research.
Appendix A. COMPUTATIONALLY EFFICIENT IMPLEMENTATION
In this appendix, we describe the details of our computationally efficient implementation of the canonical model of probabilistic record linkage.
Reverse Data Structures for Field Comparisons
The critical step in record linkage is to compare pairs of records across the K fields used to link two data sets, which is often regarded as the most expensive step in terms of computational time (Christen 2012). To do so, for each linkage field k, we first compare observation i of data set A and observation j of data set B via a predefined distance metric (e.g., Jaro-Winkler for string-valued fields) and obtain a value, which we call S_k(i, j). However, comparisons in the Fellegi-Sunter model are represented in terms of a discrete agreement level per linkage field, not a continuous measure of agreement such as the one implied by the distance metric. In other words, we need a discrete representation of S_k(i, j).
Specifically, if we have a total of L_k agreement levels for the kth variable, then

$$
\gamma_k(i, j) =
\begin{cases}
0 & \text{if } S_k(i, j) \le \tau_0 \\
1 & \text{if } \tau_0 < S_k(i, j) \le \tau_1 \\
\;\vdots & \\
L_k - 1 & \text{if } \tau_{L_k - 2} < S_k(i, j) \le \tau_{L_k - 1},
\end{cases}
\tag{14}
$$
where γ_k(i, j) represents the agreement level between the values of variable k for the pair (i, j), and τ = {τ_0, τ_1, ..., τ_{L_k - 1}} is the set of predetermined thresholds used to define the agreement levels. For example, to compare first names and last names, some authors such as Winkler (1990) argue in favor of using the Jaro-Winkler string distance to produce S_k, where one could use τ = {0.88, 0.94} to construct γ_k for three agreement levels.
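As a concrete illustration of equation (14), the snippet below maps continuous similarity scores to three agreement levels using the thresholds τ = {0.88, 0.94} mentioned above. The similarity values are supplied directly for illustration; in practice they would come from a string comparator such as Jaro-Winkler. This is a sketch, not the fastLink code.

```python
# Sketch: map continuous similarity scores S_k(i, j) to agreement levels
# gamma_k(i, j) in {0, 1, 2} using the thresholds tau = {0.88, 0.94}.
import numpy as np

similarities = np.array([0.30, 0.87, 0.88, 0.93, 0.94, 1.00])  # illustrative S_k values
thresholds = [0.88, 0.94]

# right=True mirrors equation (14): level 0 if S <= 0.88, level 1 if 0.88 < S <= 0.94, else level 2
agreement_levels = np.digitize(similarities, thresholds, right=True)
print(agreement_levels)  # [0 0 0 1 1 2]
```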
Still, the problem with constructing γ_k is that the number of comparisons we have to make is often large. In our proposed implementation we exploit the following characteristics of typical record linkage problems in the social sciences (a minimal sketch of the resulting data structure follows this list):

• The number of unique values observed in each linkage field is often less than the number of observations in each data set. For example, consider a variable such as first name. Naively, one may compare the first name of each observation in data set A with that of every observation in B. In practice, however, we can reduce the number of comparisons by considering only the unique first names that appear in each data set. The same trick can be used for all linkage fields by focusing on the comparison of the unique values of each variable.

• For each comparison between two unique first names (name_{1,A} and name_{1,B}), for example, we only keep the indices of the original data sets and store them using what is often referred to as a reverse data structure in the literature (Christen 2012). In such an arrangement, a pair of names (name_{1,A}, name_{1,B}) becomes a key with two lists, one containing the indices from data set A whose first name equals name_{1,A}, and another containing the corresponding indices for name_{1,B} in data set B.

• Comparisons involving a missing value need not be made. Instead, we only need to store the indices of the observations in A and B that contain missing information for field k.

• Because the agreement levels are mutually exclusive, we use the lowest agreement level as the base category. Once a set of threshold values has been defined, a pair of names can be categorized into only one of the L_k agreement levels. The indices for the pairs of values that are categorized as disagreements (or near disagreements) do not need to be stored. For most variables, disagreement is the category that encompasses the largest number of pairs. Thus, our reverse data structure lists become quite sparse. This sparsity can be exploited by the use of sparse matrices, yielding a substantially more memory-efficient implementation.
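The sketch below gives a minimal illustration of such a reverse data structure for a single field with exact agreement only: comparisons happen between unique values, each compared value pair keys the lists of record indices from the two data sets, disagreements are never stored, and missing values are set aside. The toy names are invented; partial agreements (e.g., above a Jaro-Winkler cutoff) would be added as additional keys in the same way.

```python
# Minimal sketch of a reverse data structure for one linkage field (e.g., first name),
# with exact agreement only. All values are invented for illustration.
from collections import defaultdict

field_A = ["JOHN", "JON", "MARY", "JOHN", None]  # data set A (None = missing)
field_B = ["JOHN", "MARY", "MARIA", None]        # data set B

idx_A, idx_B = defaultdict(list), defaultdict(list)
for i, v in enumerate(field_A):
    idx_A[v].append(i)
for j, v in enumerate(field_B):
    idx_B[v].append(j)

# Store only value pairs that agree; disagreements are never stored,
# and records with missing values are tracked separately.
reverse_structure = {
    (v, v): (idx_A[v], idx_B[v])
    for v in idx_A
    if v is not None and v in idx_B
}
missing_A, missing_B = idx_A[None], idx_B[None]

print(reverse_structure)  # {('JOHN', 'JOHN'): ([0, 3], [0]), ('MARY', 'MARY'): ([2], [1])}
print(missing_A, missing_B)  # [4] [3]
```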
Sparse Matrix Representation of Hash Tables to Count Agreement Patterns
Next, we describe our computationally efficient implementation of the Fellegi-Sunter model via the EM algorithm (see Online SI S2 for the exact algorithm we use). First, for implementing the E step, notice that the match probability given in equation (5) takes the same value for two pairs if their agreement patterns are identical. For the sake of illustration, consider a simple example where two variables are used for merging, i.e., K = 2, and a binary comparison is made for each variable, i.e., L_k = 2. Under this setting, there are a total of nine agreement patterns: (0, 0), (0, 1), (1, 0), (1, 1), (NA, 0), (NA, 1), (0, NA), (1, NA), and (NA, NA), where 1 and 0 represent agreement and disagreement, respectively, whereas NA represents a missing value. Then, for instance, the match probability for (0, 1) is given by λπ_{110}π_{211} / {λπ_{110}π_{211} + (1 - λ)π_{100}π_{201}}, whereas that for (1, NA) is equal to λπ_{111} / {λπ_{111} + (1 - λ)π_{101}}. If all comparison values are missing, e.g., (NA, NA), then we set the match probability to λ. Thus, the E step can be implemented by computing the match probability for each of the realized agreement patterns. Often, the total number of realized agreement patterns is much smaller than the number of all possible agreement patterns.
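The snippet below sketches this computation for the two example patterns, (0, 1) and (1, NA); fields with missing comparisons are simply dropped from the products. The λ and π values are made-up illustrations, not estimates from any application in the paper.

```python
# Sketch of the E step for K = 2 fields with binary comparisons. The match
# probability depends only on the agreement pattern; missing fields (None)
# are dropped from the products. Parameter values are illustrative only.
lam = 0.2                                              # probability of a match
pi_match = [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}]        # pi_{k,1,l}: P(level l | match)
pi_unmatch = [{0: 0.9, 1: 0.1}, {0: 0.95, 1: 0.05}]    # pi_{k,0,l}: P(level l | non-match)

def match_probability(pattern):
    """Posterior match probability for one agreement pattern, e.g. (0, 1) or (1, None)."""
    observed = [(k, level) for k, level in enumerate(pattern) if level is not None]
    if not observed:                  # all comparisons missing -> probability lambda
        return lam
    num, den = lam, 1.0 - lam
    for k, level in observed:
        num *= pi_match[k][level]
        den *= pi_unmatch[k][level]
    return num / (num + den)

print(round(match_probability((0, 1)), 3))     # ~0.308
print(round(match_probability((1, None)), 3))  # ~0.692
```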
Second, the M step defined in equations (S1) and (S2) requires the summation of match probabilities across all pairs or a subset of them. Because this probability is identical within each agreement pattern, all we have to do is count the total number of pairs that exhibit each agreement pattern. In other words, the number of pairs per agreement pattern becomes our sufficient statistic. We use the following hash function for efficient counting:[18]
$$
H = \sum_{k=1}^{K} H_k, \qquad
H_k =
\begin{bmatrix}
h_k^{(1,1)} & h_k^{(1,2)} & \cdots & h_k^{(1,N_B)} \\
\vdots & \vdots & \ddots & \vdots \\
h_k^{(N_A,1)} & h_k^{(N_A,2)} & \cdots & h_k^{(N_A,N_B)}
\end{bmatrix},
\tag{15}
$$

where $h_k^{(i,j)} = \mathbf{1}\{\gamma_k(i, j) > 0\}\, 2^{\,\gamma_k(i, j) + \mathbf{1}\{k > 1\} \sum_{e=1}^{k-1}(L_e - 1)}$. The matrix H_k maps each pair of records to a corresponding agreement pattern in the kth variable, represented by a unique hash value based on the powers of 2. These hash values are chosen such that the matrix H links each pair to the corresponding agreement pattern across all K variables.
Because an overwhelming majority of pairs do not agree on any of the linkage fields, most elements of the H_k matrix are zero. As a result, the H matrix also has many zeros. In our implementation, we utilize sparse matrices, whose lookup time is O(P), where P is the number of unique agreement patterns observed. In most applications, P is much smaller than the total number of possible agreement patterns, i.e., $\prod_{k=1}^{K} L_k$. This hashing technique is applicable when the number of variables used for the merge is moderate. If many variables are used for the merge, approximate hashing techniques such as min hashing and locality-sensitive hashing become necessary.
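To make the bookkeeping concrete, the sketch below computes the per-field hash values h_k^{(i,j)} for a toy example, sums them across fields to obtain H, and counts how many pairs fall into each unique agreement pattern; those counts are the sufficient statistics used in the M step. The tiny γ_k arrays are invented for illustration, missing values are ignored, and a plain counter stands in for the sparse-matrix machinery used in the actual implementation.

```python
# Sketch of hashing agreement patterns for K = 2 fields with L_1 = L_2 = 3 levels.
# gamma[k][i, j] is the agreement level of pair (i, j) on field k (0 = disagreement).
import numpy as np
from collections import Counter

gamma = [
    np.array([[0, 2, 0], [1, 0, 0]]),  # field 1 agreement levels (illustrative)
    np.array([[0, 2, 1], [0, 0, 0]]),  # field 2 agreement levels (illustrative)
]
L = [3, 3]

H = np.zeros_like(gamma[0])
offset = 0
for k, g in enumerate(gamma):
    # h_k = 1{gamma_k > 0} * 2^(gamma_k + offset); disagreements hash to zero, so H stays sparse
    H = H + (g > 0) * (2 ** (g + offset))
    offset += L[k] - 1

# Each distinct hash value identifies one agreement pattern across the K fields;
# the counts per pattern are the sufficient statistics for the M step.
pattern_counts = Counter(H.ravel().tolist())
print(pattern_counts)  # Counter({0: 3, 20: 1, 8: 1, 2: 1})
```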
[18] Since the work of Jaro (1972), the use of table-like objects to store agreement patterns has been recognized as an important step toward improving computational efficiency. Our contribution goes beyond this by tying together, under a unified framework, reverse data structures and a novel use of a sparse matrix representation of a hash table to store agreement patterns.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit
https://doi.org/10.1017/S0003055418000783.
Replication materials can be found on Dataverse at: https://
doi.org/10.7910/DVN/YGUHTD.
REFERENCES
Adena, Maja, Ruben Enikolopov, Maria Petrova, Veronica Santarosa, and Ekaterina Zhuravskaya. 2015. Radio and the Rise of the Nazis in Prewar Germany. Quarterly Journal of Economics 130: 1885–939.
Ansolabehere, Stephen, and Eitan Hersh. 2012. Validation: What Big Data Reveal about Survey Misreporting and the Real Electorate. Political Analysis 20: 437–59.
Ansolabehere, Stephen, and Eitan Hersh. 2017. ADGN: An Algorithm for Record Linkage Using Address, Date of Birth, Gender and Name. Statistics and Public Policy 4: 1–10.
Belin, Thomas R., and Donald B. Rubin. 1995. A Method for Calibrating False-Match Rates in Record Linkage. Journal of the American Statistical Association 90: 694–707.
Berent, Matthew K., Jon Arthur Krosnick, and A. Lupia. 2016. Measuring Voter Registration and Turnout in Surveys: Do Official Government Records Yield More Accurate Assessments? Public Opinion Quarterly 80: 597–621.
Bolsen, Toby, Paul J. Ferraro, and Juan Jose Miranda. 2014. Are Voters More Likely to Contribute to Other Public Goods? Evidence from a Large-Scale Randomized Policy Experiment. American Journal of Political Science 58: 17–30.
Bonica, Adam. 2013. Database on Ideology, Money in Politics, and Elections: Public Version 1.0 [Computer File]. Stanford, CA: Stanford University Libraries.
Cesarini, David, Erik Lindqvist, Robert Ostling, and Bjorn Wallace. 2016. Wealth, Health, and Child Development: Evidence from Administrative Data on Swedish Lottery Players. Quarterly Journal of Economics 131: 687–738.
Christen, Peter. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Heidelberg, Germany: Springer.
Cohen, William W., Pradeep Ravikumar, and Stephen Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In International Joint Conference on Artificial Intelligence (IJCAI) 18.
Cross, Philip J., and Charles F. Manski. 2002. Regressions, Short and Long. Econometrica 70: 357–68.
Dalzell, Nicole M., and Jerome P. Reiter. 2018. Regression Modeling and File Matching Using Possibly Erroneous Matching Variables. Journal of Computational and Graphical Statistics 1–11. Published online July 11, 2018.
de Bruin, Jonathan. 2017. Record Linkage. Python library. Version 0.8.1. https://recordlinkage.readthedocs.io/.
Einav, Liran, and Jonathan Levin. 2014. Economics in the Age of Big Data. Science 346 (6210): 1243089-1–6.
Enamorado, Ted. 2018. Active Learning for Probabilistic Record Linkage. Social Science Research Network (SSRN). URL: https://ssrn.com/abstract=3257638.
Engbom, Niklas, and Christian Moser. 2017. Returns to Education Through Access to Higher-Paying Firms: Evidence from US Matched Employer-Employee Data. American Economic Review: Papers and Proceedings 107: 374–78.
Feigenbaum, James. 2016. Automated Census Record Linking: A Machine Learning Approach. Boston University. Technical Report. https://jamesfeigenbaum.github.io/research/pdf/census-link-ml.pdf
Fellegi, Ivan P., and Alan B. Sunter. 1969. A Theory of Record Linkage. Journal of the American Statistical Association 64: 1183–210.
Figlio, David, and Jonathan Guryan. 2014. The Effects of Poor Neonatal Health on Children's Cognitive Development. American Economic Review 104: 3921–55.
Giraud-Carrier, Christophe, Jay Goodlife, Bradley M. Jones, and Stacy Cueva. 2015. Effective Record Linkage for Mining Campaign Contribution Data. Knowledge and Information Systems 45: 389–416.
Goldstein, Harvey, and Katie Harron. 2015. Methodological Developments in Data Linkage. John Wiley & Sons, Ltd. Chapter 6: Record Linkage: A Missing Data Problem, pp. 109–24.
Gutman, Roee, Christopher C. Afendulis, and Alan M. Zaslavsky. 2013. A Bayesian Procedure for File Linking to End-of-Life Medical Costs. Journal of the American Medical Informatics Association 103: 34–47.
Gutman, Roee, Cara J. Sammartino, Traci C. Green, and Brian T. Montague. 2016. Error Adjustments for File Linking Methods Using Encrypted Unique Client Identifier (eUCI) with Application to Recently Released Prisoners Who Are HIV+. Statistics in Medicine 35: 115–29.
Harron, Katie, Harvey Goldstein, and Chris Dibben, eds. 2015. Methodological Developments in Data Linkage. West Sussex: John Wiley & Sons.
Hersh, Eitan D. 2015. Hacking the Electorate: How Campaigns Perceive Voters. Cambridge, UK: Cambridge University Press.
Herzog, Thomas H., Fritz Scheuren, and William E. Winkler. 2010. Record Linkage. Wiley Interdisciplinary Reviews: Computational Statistics 2: 535–43.
Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. 2007. Data Quality and Record Linkage Techniques. New York: Springer.
Hill, Seth. 2017. Changing Votes or Changing Voters: How Candidates and Election Context Swing Voters and Mobilize the Base. Electoral Studies 48: 131–48.
Hill, Seth J., and Gregory A. Huber. 2017. Representativeness and Motivations of the Contemporary Donorate: Results from Merged Survey and Administrative Records. Political Behavior 39: 3–29.
Hof, Michel H. P., and Aeilko H. Zwinderman. 2012. Methods for Analyzing Data from Probabilistic Linkage Strategies Based on Partially Identifying Variables. Statistics in Medicine 31: 4231–42.
Imai, Kosuke, and Dustin Tingley. 2012. A Statistical Method for Empirical Testing of Competing Theories. American Journal of Political Science 56: 218–36.
Jaro, Matthew. 1972. UNIMATCH: A Computer System for Generalized Record Linkage Under Conditions of Uncertainty. Technical Report, Spring Joint Computer Conference.
Jaro, Matthew. 1989. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414–20.
Jutte, Douglas P., Leslie L. Roos, and Marni D. Browne. 2011. Administrative Record Linkage as a Tool for Public Health Research. Annual Review of Public Health 32: 91–108.
Kim, Gunky, and Raymond Chambers. 2012. Regression Analysis under Incomplete Linkage. Computational Statistics and Data Analysis 56: 2756–70.
Lahiri, Partha, and Michael D. Larsen. 2005. Regression Analysis with Linked Data. Journal of the American Statistical Association 100: 222–30.
Larsen, Michael D., and Donald B. Rubin. 2001. Iterative Automated Record Linkage Using Mixture Models. Journal of the American Statistical Association 96: 32–41.
McLachlan, Geoffrey, and David Peel. 2000. Finite Mixture Models. New York: John Wiley & Sons.
McVeigh, Brendan S., and Jared S. Murray. 2017. Practical Bayesian Inference for Record Linkage. Technical Report, Carnegie Mellon University.
Meredith, Marc, and Michael Morse. 2014. Do Voting Rights Notification Laws Increase Ex-Felon Turnout? The ANNALS of the American Academy of Political and Social Science 651: 220–49.
Mummolo, Jonathan, and Clayton Nall. 2016. Why Partisans Don't Sort: The Constraints on Political Segregation. Journal of Politics 79: 45–59.
Murray, Jared S. 2016. Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering. Journal of Privacy and Confidentiality 7: 3–24.
Neter, John, E. Scott Maynes, and R. Ramanathan. 1965. The Effect of Mismatching on the Measurement of Response Errors. Journal of the American Statistical Association 60: 1005–27.
Ong, Toan C., Michael V. Mannino, Lisa M. Schilling, and Michael G. Kahn. 2014. Improving Record Linkage Performance in the Presence of Missing Linkage Data. Journal of Biomedical Informatics 52: 43–54.
Richman, Jesse T., Gulshan A. Chattha, and David C. Earnest. 2014. Do Non-Citizens Vote in U.S. Elections? Electoral Studies 36: 149–57.
Ridder, Geert, and Robert Moffitt. 2007. Handbook of Econometrics. Vol. 6. Elsevier. Chapter: The Econometrics of Data Combination, pp. 5469–547.
Sadinle, Mauricio. 2014. Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach. Annals of Applied Statistics 8: 2404–34.
Sadinle, Mauricio. 2017. Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112: 600–12.
Sariyar, Murat, and Andreas Borg. 2016. Record Linkage in R. R package. Version 0.4-10. http://cran.r-project.org/package=RecordLinkage.
Sariyar, Murat, Andreas Borg, and Klaus Pommerening. 2012. Missing Values in Deduplication of Electronic Patient Data. Journal of the American Medical Informatics Association 19: e76–82.
Scheuren, Fritz, and William E. Winkler. 1993. Regression Analysis of Data Files that Are Computer Matched. Survey Methodology 19: 39–58.
Scheuren, Fritz, and William E. Winkler. 1997. Regression Analysis of Data Files that Are Computer Matched II. Survey Methodology 23: 157–65.
Steorts, Rebecca C. 2015. Entity Resolution with Empirically Motivated Priors. Bayesian Analysis 10: 849–75.
Steorts, Rebecca C., Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A Comparison of Blocking Methods for Record Linkage. In Privacy in Statistical Databases, ed. Josep Domingo-Ferrer. Springer, 253–68.
Tam Cho, Wendy, James Gimpel, and Iris Hui. 2013. Voter Migration and the Geographic Sorting of the American Electorate. Annals of the American Association of Geographers 103: 856–70.
Tancredi, Andrea, and Brunero Liseo. 2011. A Hierarchical Bayesian Approach to Record Linkage and Population Size Problems. Annals of Applied Statistics 5: 1553–85.
Thibaudeau, Yves. 1993. The Discrimination Power of Dependency Structures in Record Linkage. Survey Methodology 19.
Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Section on Survey Research Methods. American Statistical Association. https://www.iser.essex.ac.uk/research/publications/501361.
Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Survey Research Methods Section. American Statistical Association. http://ww2.amstat.org/sections/srms/Proceedings/papers/1993_042.pdf.
Winkler, William E. 2000. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Technical Report No. RR2000/05, Statistical Research Division, Methodology and Standards Directorate, U.S. Bureau of the Census.
Winkler, William E. 2005. Approximate String Comparator Search Strategies for Very Large Administrative Lists. Research Report Series (Statistics) No. 2005-02, Statistical Research Division, U.S. Census Bureau.
Winkler, William E. 2006a. Automatic Estimation of Record Linkage False Match Rates. In Proceedings of the Section on Survey Research Methods. American Statistical Association.
Winkler, William E. 2006b. Overview of Record Linkage and Current Research Directions. Technical Report, United States Bureau of the Census.
Winkler, William E., and William Yancey. 2006. Record Linkage Error-Rate Estimation without Training Data. In Proceedings of the Section on Survey Research Methods. American Statistical Association.
Winkler, William E., William Yancey, and E. H. Porter. 2010. Fast Record Linkage of Very Large Files in Support of the Decennial and Administrative Record Projects. In Proceedings of the Section on Survey Research Methods.
Yancey, William. 2005. Evaluating String Comparator Performance for Record Linkage. Research Report Series, Statistical Research Division, U.S. Census Bureau.