American Political Science Review (2019) 113, 2, 353–371
doi:10.1017/S0003055418000783 © American Political Science Association 2019
Using a Probabilistic Model to Assist Merging of Large-Scale
Administrative Records
TED ENAMORADO Princeton University
BENJAMIN FIFIELD Princeton University
KOSUKE IMAI Harvard University
Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
INTRODUCTION
As the amount and diversity of available data sets rapidly increase, social scientists often harness multiple data sources to answer substantive questions. Indeed, merging data sets, in particular large-scale administrative records, is an essential part of cutting-edge empirical research in many disciplines (e.g., Ansolabehere and Hersh 2012; Einav and Levin 2014; Jutte, Roos, and Browne 2011). Data merging can be consequential. For example, the American National Election Studies (ANES) and Cooperative Congressional Election Study (CCES) validate self-reported turnout by merging their survey data with a nationwide voter file, where only the matched respondents are treated as registered voters. Although Ansolabehere and Hersh (2012) advocate the use of such a validation procedure, Berent, Krosnick, and Lupia (2016) argue that the discrepancy between self-reported and validated turnout is due to the failure of the merge procedure rather than social desirability and nonresponse bias.
Merging data sets is straightforward if there exists a unique identifier that unambiguously links records from different data sets. Unfortunately, such a unique identifier is often unavailable. Under these circumstances, some researchers have used a deterministic algorithm to automate the merge process (e.g., Adena et al. 2015; Ansolabehere and Hersh 2017; Berent, Krosnick, and Lupia 2016; Bolsen, Ferraro, and Miranda 2014; Cesarini et al. 2016; Figlio and Guryan 2014; Giraud-Carrier et al. 2015; Hill 2017; Meredith and Morse 2014), whereas others have relied on a proprietary algorithm (e.g., Ansolabehere and Hersh 2012; Engbom and Moser 2017; Figlio and Guryan 2014; Hersh 2015; Hill and Huber 2017; Richman, Chattha, and Earnest 2014). However, these methods are not robust to measurement error (e.g., misspelling) and missing data, which are common to social science data. Furthermore, deterministic merge methods cannot quantify the uncertainty of the merging procedure and instead typically rely on arbitrary thresholds to determine the degree of similarity sufficient for matches.^1 This means that post-merge data analyses fail to account for the uncertainty of the merging procedure, yielding a bias due to measurement error. These methodological challenges are amplified especially when merging large-scale administrative records.
Ted Enamorado, Ph.D. Candidate, Department of Politics, Princeton University, [email protected], http://www.tedenamorado.com.
Benjamin Fifield, Ph.D. Candidate, Department of Politics, Princeton University, [email protected], http://www.benfifield.com.
Kosuke Imai, Professor, Department of Government and Department of Statistics, Harvard University, [email protected], https://imai.fas.harvard.edu.
The proposed methodology is implemented through an open-source R package, fastLink: Fast Probabilistic Record Linkage, which is freely available for download at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink).
We thank Bruce Willsie of L2 and Steffen Weiss of YouGov for data and technical assistance; Jake Bowers, Seth Hill, Johan Lim, Marc Ratkovic, Mauricio Sadinle, five anonymous reviewers, and audiences at the 2017 Annual Meeting of the American Political Science Association, Columbia University (Political Science), the Fifth Asian Political Methodology Meeting, Gakusyuin University (Law), Hong Kong University of Science and Technology, the Institute for Quantitative Social Science (IQSS) at Harvard University, the Quantitative Social Science (QSS) colloquium at Princeton University, Universidad de Chile (Economics), Universidad del Desarrollo, Chile (Government), the 2017 Summer Meeting of the Society for Political Methodology, and the Center for Statistics and the Social Sciences (CSSS) at the University of Washington for useful comments and suggestions. Replication materials can be found on Dataverse at https://doi.org/10.7910/DVN/YGUHTD.
Received: August 8, 2017; revised: May 13, 2018; accepted: October 17, 2018. First published online: January 2, 2019.
1. These thresholds are highly dependent on data. For example, Ansolabehere and Hersh (2017) find that using three fields with exact matches as the threshold works well for the Texas voter file, but the same threshold may not work for other data. In contrast, probabilistic methods can automatically weight observations.
We demonstrate that social scientists should use probabilistic models rather than deterministic methods when merging large data sets. Probabilistic models can quantify the uncertainty inherent in many merge procedures, offering a principled way to calibrate and account for false positives and false negatives. Unfortunately, although there exists a well-known statistics literature on probabilistic record linkage (e.g., Harron, Goldstein, and Dibben 2015; Herzog, Scheuren, and Winkler 2007; Winkler 2006b), the current open-source implementation does not scale to the large data sets commonly used in today's social science research. We address this challenge by developing a fast and scalable implementation of the canonical probabilistic record linkage model originally proposed by Fellegi and Sunter (1969). Together with parallelization, this algorithm, which we call fastLink, can be used to merge data sets with millions of records in a reasonable amount of time using one's laptop computer. Additionally, building on the previous methodological literature (e.g., Lahiri and Larsen 2005), we show (1) how to incorporate auxiliary information such as population name frequency and migration rates into the merge procedure and (2) how to conduct post-merge analyses while accounting for the uncertainty about the merge process. We describe these methodological developments in the following section.
We then describe comprehensive simulation studies that evaluate the robustness of fastLink to several factors, including the size of the data sets, the proportion of true matches, measurement error, and the proportion and mechanisms of missing data. A total of 270 simulation settings consistently show that fastLink significantly outperforms the deterministic methods. Although the proposed methodology produces high-quality matches in most situations, the lack of overlap between two data sets often leads to large error rates, suggesting that effective blocking is essential when the expected number of matches is relatively small. Furthermore, fastLink appears to perform at least as well as recently proposed probabilistic approaches (Sadinle 2017; Steorts 2015). Importantly, our merge method is faster and scales to larger data sets than these state-of-the-art methods.
Next, we present two empirical applications. First, we revisit Hill and Huber (2017), who examine the ideological differences between donors and nondonors by merging the CCES data of more than 50,000 survey respondents with a campaign contribution database of over five million donor records (Bonica 2013). We find that the matches identified by fastLink are at least as high quality as those identified by the proprietary method used by the original authors. We also improve the original analysis by incorporating the uncertainty of the merge process in the post-merge analysis. We show that although the overall conclusion remains unchanged, the magnitude of the estimated effects is substantially smaller.

As the second application, we merge two nationwide voter files of over 160 million voter records each, representing one of the largest data merges ever conducted in social science research.^2 By merging voter files over time, scholars can study the causes and consequences of partisan residential segregation (e.g., Tam Cho, Gimpel, and Hui 2013; Mummolo and Nall 2016), and political analytics professionals can develop effective microtargeting strategies (e.g., Hersh 2015). We show how to incorporate available within-state and across-state migration rates in the merge process. Given the enormous size of the data sets, we propose a two-step procedure where we first conduct a within-state merge for each state followed by across-state merges for every pair of states. The proposed methodology is able to match about 95% of voters, which is about 30 percentage points greater than the exact matching method. Although it is more difficult to find across-state movers, we are able to find 20 times as many such voters as the existing matching method.
Finally, we give concluding remarks. We provide an open-source R software package, fastLink: Fast Probabilistic Record Linkage, which is freely available at the Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=fastLink), for implementing our methodology so that other researchers can effectively merge data sets in their own projects.
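To make the workflow concrete, the following is a minimal sketch of how such a merge might be set up with the package. The data frames dfA and dfB and the field names are placeholders, and the argument names follow the package's documented interface at the time of writing; they may differ across package versions.

```r
## A hedged sketch of a typical fastLink call; dfA, dfB, and the linkage
## fields are placeholders for the user's own data.
library(fastLink)

out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames         = c("firstname", "middleinit", "lastname",
                       "housenum", "streetname", "birthyear"),
  stringdist.match = c("firstname", "lastname", "streetname"),  # Jaro-Winkler comparisons
  partial.match    = c("firstname", "lastname", "streetname")   # allow a "similar" level
)

summary(out)                                                    # match counts and estimated error rates
matched <- getMatches(dfA, dfB, fl.out = out, threshold.match = 0.85)
```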
THE PROPOSED METHODOLOGY
In this section, we first introduce the canonical model of probabilistic record linkage originally proposed by Fellegi and Sunter (1969). We describe several improvements we make to this model, including a fast and scalable implementation, the use of auxiliary information to inform parameter estimation, and the incorporation of uncertainty about the merge process in post-merge analyses.
The Setup
Suppose that we wish to merge two data sets, A and B, which have sample sizes of $N_A$ and $N_B$, respectively. We use $K$ variables, which are common to both data sets, to conduct the merge. We consider all possible pairwise comparisons between these two data sets. For each of these $N_A \times N_B$ distinct pairs, we define an agreement vector of length $K$, denoted by $\gamma(i,j)$, whose $k$th element $\gamma_k(i,j)$ represents the discrete level of within-pair similarity for the $k$th variable between the $i$th observation of data set A and the $j$th observation of data set B. Specifically, if we have a total of $L_k$ similarity levels for the $k$th variable, then the corresponding element of the agreement vector can be defined as,
2. Although Hersh (2015) conducted a large-scale data merge, he relied on a proprietary algorithm. Others, such as Ansolabehere and Hersh (2017) and Tam Cho, Gimpel, and Hui (2013), match data sets of several million voters each, but neither of these studies approaches the scale of our applications. Note that the US Census Bureau routinely conducts large-scale data merges for the decennial census (Winkler, Yancey, and Porter 2010).
$$
\gamma_k(i,j) \;=\;
\begin{cases}
0 & \text{different} \\
1, \ldots, L_k - 2 & \text{similar} \\
L_k - 1 & \text{identical}
\end{cases}
\qquad (1)
$$
The proposed methodology allows for the existence of missing data. We define a missingness vector of length $K$, denoted by $\delta(i,j)$, for each pair $(i,j)$, whose $k$th element $\delta_k(i,j)$ equals 1 if at least one record in the pair has a missing value in the $k$th variable and equals 0 otherwise.
Table 1 presents an illustrative example of agreement patterns based on two artificial data sets, A and B, each of which has two records. In this example, we consider three possible values of $\gamma_k(i,j)$ for first name, last name, and street name, i.e., $L_k = 3$ (different, similar, nearly identical), whereas a binary variable is used for the other fields, i.e., $L_k = 2$ (different, nearly identical). The former set of variables requires a similarity measure and threshold values. We use the Jaro–Winkler string distance (Jaro 1989; Winkler 1990), which is a commonly used measure in the literature (e.g., Cohen, Ravikumar, and Fienberg 2003; Yancey 2005).^3 Because the Jaro–Winkler distance is a continuous measure whose values range from 0 (different) to 1 (identical), we discretize it so that $\gamma_k(i,j)$ takes an integer value between 0 and $L_k - 1$ as defined in equation (1). Suppose that we use three levels (i.e., different, similar, and nearly identical) based on the threshold values of 0.88 and 0.94, as recommended by Winkler (1990). Then, when comparing the last names in Table 1, we find that, for example, Smith and Smithson are similar (a Jaro–Winkler distance of 0.88) whereas Smith and Martinez are different (a Jaro–Winkler distance of 0.55).^4
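As a concrete illustration of this discretization (not fastLink's internal implementation), the sketch below uses the stringdist R package, whose "jw" method with a nonzero prefix weight returns the Jaro–Winkler distance; the similarity is one minus that distance, and the 0.88/0.94 cutoffs follow the text.

```r
## Discretize Jaro-Winkler similarity into the three agreement levels of
## equation (1); uses the 'stringdist' package as a stand-in implementation.
library(stringdist)

jw_similarity <- function(a, b) {
  ## 1 = identical, 0 = completely different; lowercase to ignore case
  1 - stringdist(tolower(a), tolower(b), method = "jw", p = 0.1)
}

agreement_level <- function(a, b, cuts = c(0.88, 0.94)) {
  s <- jw_similarity(a, b)
  ifelse(s >= cuts[2], 2L,         # nearly identical
         ifelse(s >= cuts[1], 1L,  # similar
                0L))               # different
}

agreement_level("Smith", "Smithson")   # falls in the "similar" band
agreement_level("Smith", "Martinez")   # "different"
```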
The above setup implies a total of $N_A \times N_B$ comparisons for each of the $K$ fields. Thus, the number of comparisons grows quickly as the size of the data sets increases. One solution is to use blocking and avoid comparisons that should not be made. For example, we may make comparisons within gender groups only, as sketched below. While blocking is appealing because of computational efficiency gains, Winkler (2005) notes that it often involves ad hoc decisions by researchers and faces difficulties when variables have missing values and measurement error. Here, we focus on the data merge within a block and refer interested readers to Christen (2012) and Steorts et al. (2014) for comprehensive reviews of blocking techniques.^5 We also note a related technique, called filtering, which has the potential to overcome the weaknesses of traditional blocking methods by discarding pairs that are unlikely to be matches when fitting a probabilistic model (Murray 2016).
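For instance, blocking on a fully observed field such as gender can be done directly in base R before fitting the model within each block; the variable names below are illustrative, and fastLink also provides a blockData() helper for this step.

```r
## A base-R sketch of exact blocking on gender: the model is then fit within
## each (A-block, B-block) pair that shares the same gender value.
dfA <- data.frame(id = 1:4, gender = c("F", "F", "M", "M"))
dfB <- data.frame(id = 5:8, gender = c("F", "M", "M", "F"))

blocksA <- split(dfA, dfA$gender)
blocksB <- split(dfB, dfB$gender)

## Candidate pairs are formed only within blocksA[["F"]] x blocksB[["F"]] and
## blocksA[["M"]] x blocksB[["M"]], never across genders.
```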
The Canonical Model of Probabilistic Record Linkage
The Model and Assumptions
We first describe the most commonly used probabilistic model of record linkage (Fellegi and Sunter 1969). Let a latent mixing variable $M_{ij}$ indicate whether a pair of records (the $i$th record in data set A and the $j$th record in data set B) represents a match. The model has the following simple finite mixture structure (e.g., Imai and Tingley 2012; McLachlan and Peel 2000):

$$
\gamma_k(i,j) \mid M_{ij} = m \;\overset{\text{indep.}}{\sim}\; \text{Discrete}(\pi_{km}), \qquad (2)
$$

$$
M_{ij} \;\overset{\text{i.i.d.}}{\sim}\; \text{Bernoulli}(\lambda), \qquad (3)
$$
TABLE 1. An Illustrative Example of Agreement Patterns

                         First     Middle   Last       Date of birth   House   Street
  Data set A
    1                    James     V        Smith      12-12-1927      780     Devereux St.
    2                    Robert    NA       Martines   01-15-1942      60      16th St.
  Data set B
    1                    Michael   F        Martinez   02-03-1956      4       16th St.
    2                    James     D        Smithson   12-12-1927      780     Dvereuux St.
  Agreement patterns
    A:1-B:1              Different Different Different Different       Different Different
    A:1-B:2              Identical Different Similar   Identical       Identical Similar
    A:2-B:1              Different NA        Similar   Different       Different Identical
    A:2-B:2              Different NA        Different Different       Different Different

Note: The top panel of the table shows two artificial data sets, A and B, each of which has two records. The bottom panel shows the agreement patterns for all possible pairs of these records. For example, the second line of the agreement patterns compares the first record of data set A with the second record of data set B. These two records have identical information for first name, date of birth, and house number; similar information for last name and street name; and different information for middle name. A comparison involving at least one missing value is indicated by NA.
3. Online Supplementary Information (SI) S1 describes how the Jaro–Winkler string distance is calculated.
4. As shown in the subsection Computational Efficiency and Appendix A, the discretization of the distance measure leads to substantial computational efficiency when making pairwise comparisons for each linkage field.
5. The parameters of record linkage models must be interpreted separately for each block (Murray 2016).
In equations (2) and (3), $\pi_{km}$ is a vector of length $L_k$ containing the probability of each agreement level for the $k$th variable given that the pair is a match ($m = 1$) or a nonmatch ($m = 0$), and $\lambda$ represents the probability of a match across all pairwise comparisons. Through $\pi_{k0}$, the model allows for the possibility that two records can have identical values for some variables even when they do not represent a match.
This model is based on two key independence assumptions. First, the latent variable $M_{ij}$ is assumed to be independently and identically distributed. Such an assumption is necessarily violated if, for example, each record in data set A should be matched with no more than one record in data set B. In theory, this assumption can be relaxed (e.g., Sadinle 2017), but doing so makes the estimation significantly more complex and reduces its scalability (see Online SI S8). Later in the paper, we discuss how to impose such a constraint without sacrificing computational efficiency. Second, conditional independence among the linkage variables is assumed given the match status. Some studies find that the violation of this assumption leads to unsatisfactory performance (e.g., Belin and Rubin 1995; Herzog, Scheuren, and Winkler 2010; Larsen and Rubin 2001; Thibaudeau 1993; Winkler and Yancey 2006). In Online SI S4, we show how to relax the conditional independence assumption while keeping our scalable implementation.
In the literature, researchers often treat missing data as disagreements, i.e., they set $\gamma_k(i,j) = 0$ whenever $\delta_k(i,j) = 1$ (e.g., Goldstein and Harron 2015; Ong et al. 2014; Sariyar, Borg, and Pommerening 2012). This procedure is problematic because a true match can contain missing values. Other imputation procedures also exist, but none of them has a theoretical justification or appears to perform well in practice.^6 To address this problem, following Sadinle (2014, 2017), we assume that data are missing at random (MAR) conditional on the latent variable $M_{ij}$:

$$
\delta_k(i,j) \;\perp\!\!\!\perp\; \gamma_k(i,j) \;\mid\; M_{ij},
$$

for each $i = 1, 2, \ldots, N_A$, $j = 1, 2, \ldots, N_B$, and $k = 1, 2, \ldots, K$. Under this MAR assumption, we can simply ignore missing data. The observed-data likelihood function of the model defined in equations (2) and (3) is given by
$$
L_{\text{obs}}(\lambda, \pi \mid \delta, \gamma) \;\propto\;
\prod_{i=1}^{N_A} \prod_{j=1}^{N_B} \left\{ \sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m}
\prod_{k=1}^{K} \prod_{\ell=0}^{L_k - 1} \pi_{km\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))} \right\},
$$

where $\pi_{km\ell}$ represents the $\ell$th element of the probability vector $\pi_{km}$, i.e., $\pi_{km\ell} = \Pr(\gamma_k(i,j) = \ell \mid M_{ij} = m)$. Because the direct maximization of the observed-data log-likelihood function is difficult, we estimate the model parameters using the Expectation–Maximization (EM) algorithm (see Online SI S2).
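To make the estimation logic transparent, here is a compact, unoptimized EM sketch for this mixture model under the MAR assumption; missing comparisons are simply skipped rather than treated as disagreements. It is meant only as an illustration of the E- and M-steps, not as fastLink's optimized implementation (which exploits hashing and parallelization), and all variable names are ours.

```r
## gamma: n-by-K matrix of agreement levels (0, ..., L[k] - 1), NA where
##        delta_k(i, j) = 1 (the comparison involves a missing value).
## L:     vector giving the number of agreement levels for each field.
fs_em <- function(gamma, L, max_iter = 500, tol = 1e-5) {
  n <- nrow(gamma); K <- ncol(gamma)
  lambda <- 0.1
  ## pi_km[[k]] is a 2 x L[k] matrix: row 1 = nonmatch (m = 0), row 2 = match (m = 1);
  ## start values put more mass on high agreement levels for matches to break symmetry
  pi_km <- lapply(L, function(l) {
    rbind(rev(seq_len(l)) / sum(seq_len(l)),
          seq_len(l)      / sum(seq_len(l)))
  })
  for (iter in seq_len(max_iter)) {
    ## E-step: posterior match probability xi_ij for every pair
    log1 <- rep(log(lambda), n); log0 <- rep(log(1 - lambda), n)
    for (k in seq_len(K)) {
      obs <- !is.na(gamma[, k])
      log1[obs] <- log1[obs] + log(pi_km[[k]][2, gamma[obs, k] + 1])
      log0[obs] <- log0[obs] + log(pi_km[[k]][1, gamma[obs, k] + 1])
    }
    xi <- 1 / (1 + exp(log0 - log1))
    ## M-step: update lambda and the agreement-level probabilities
    lambda_new <- mean(xi)
    for (k in seq_len(K)) {
      obs <- !is.na(gamma[, k])
      for (l in 0:(L[k] - 1)) {
        hit <- obs & gamma[, k] == l
        pi_km[[k]][2, l + 1] <- sum(xi[hit]) / sum(xi[obs])
        pi_km[[k]][1, l + 1] <- sum(1 - xi[hit]) / sum(1 - xi[obs])
      }
    }
    ## simple convergence check on lambda only (a simplification for this sketch)
    if (abs(lambda_new - lambda) < tol) { lambda <- lambda_new; break }
    lambda <- lambda_new
  }
  list(lambda = lambda, pi = pi_km, xi = xi)
}
```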
The Uncertainty of the Merge Process
The advantage of probabilistic models is their ability to quantify the uncertainty inherent in merging. Once the model parameters are estimated, we can compute the match probability for each pair using Bayes' rule:^7

$$
\xi_{ij} \;=\; \Pr\bigl(M_{ij} = 1 \mid \delta(i,j), \gamma(i,j)\bigr)
\;=\;
\frac{\lambda \displaystyle\prod_{k=1}^{K} \prod_{\ell=0}^{L_k-1} \pi_{k1\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))}}
{\displaystyle\sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m} \prod_{k=1}^{K} \prod_{\ell=0}^{L_k-1} \pi_{km\ell}^{\,\mathbf{1}\{\gamma_k(i,j)=\ell\}\,(1-\delta_k(i,j))}}.
\qquad (4)
$$
In the subsection Post-Merge Analysis, we show how to incorporate this match probability into post-merge regression analysis to account for the uncertainty of the merge process.

Although in theory a post-merge analysis can use all pairs with nonzero match probabilities, it is often more convenient to determine a threshold $S$ when creating a merged data set. Such an approach is useful especially when the data sets are large. Specifically, we call a pair $(i,j)$ a match if the match probability $\xi_{ij}$ exceeds $S$. There is a clear trade-off in the choice of this threshold value. A large value of $S$ will ensure that most of the selected pairs are correct matches but may fail to identify many true matches. In contrast, if we lower $S$ too much, we will select more pairs, but many of them may be false matches. Therefore, it is important to quantify the degree of these matching errors in the merging process.
One advantage of probabilistic models over deterministic methods is that we can estimate the false discovery rate (FDR) and the false negative rate (FNR). The FDR represents the proportion of false matches among the selected pairs, i.e., those whose match probability is greater than or equal to the threshold. We estimate the FDR using our model parameters as follows:

$$
\widehat{\Pr}\bigl(M_{ij} = 0 \mid \xi_{ij} \ge S\bigr) \;=\;
\frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathbf{1}\{\xi_{ij} \ge S\}\,(1 - \xi_{ij})}
{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathbf{1}\{\xi_{ij} \ge S\}},
\qquad (5)
$$

whereas the FNR, which represents the proportion of true matches that are not selected, is estimated as

$$
\widehat{\Pr}\bigl(M_{ij} = 1 \mid \xi_{ij} < S\bigr) \;=\;
\frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \xi_{ij}\,\mathbf{1}\{\xi_{ij} < S\}}
{\lambda N_A N_B}.
\qquad (6)
$$
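In code, once the posterior match probabilities are in hand, these two estimators are one-line summaries. The sketch below is our own illustration; xi, lambda, and the sample sizes are assumed to come from the fitted model.

```r
## Estimated FDR and FNR for a given threshold S, following equations (5)-(6);
## xi contains the posterior match probability of every candidate pair.
estimate_error_rates <- function(xi, lambda, n_A, n_B, S = 0.85) {
  selected <- xi >= S
  fdr <- sum((1 - xi)[selected]) / sum(selected)    # equation (5)
  fnr <- sum(xi[!selected]) / (lambda * n_A * n_B)  # equation (6)
  c(FDR = fdr, FNR = fnr)
}

## e.g., with toy posterior probabilities for the 2 x 3 candidate pairs:
estimate_error_rates(xi = c(0.98, 0.02, 0.01, 0.03, 0.91, 0.05),
                     lambda = 2 / 6, n_A = 2, n_B = 3)
```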
Researchers typically select, at their own discretion, the value of $S$ such that the FDR is sufficiently small. But we also emphasize the FNR because a strict threshold can lead to many false negatives.^8
6. For example, Goldstein and Harron (2015) suggest the possibility of treating a comparison that involves a missing value as a separate agreement value, but Sariyar, Borg, and Pommerening (2012) find that this approach does not outperform the standard method of treating missing values as disagreements.
7. This is known as the maximum a posteriori estimate.
8. A more principled solution to the problem of selecting the threshold $S$ would require data for which the true matching status $M_{ij}$ is known, so that one can select the value of $S$ to minimize the classification error. However, in record linkage problems, only on rare occasions do labeled data sets exist. See Larsen and Rubin (2001), Feigenbaum (2016), and Enamorado (2018) for approaches that directly incorporate labeled data.
In our simulations and empirical studies, we find that the results are not particularly sensitive to the choice of threshold value, although in other applications scholars have found that ex post adjustments are necessary for obtaining good estimates of the error rates (e.g., Belin and Rubin 1995; Larsen and Rubin 2001; Murray 2016; Thibaudeau 1993; Winkler 1993; Winkler 2006a).

In the merging process, for a given record in data set A, it is possible to find multiple records in data set B that have high match probabilities. In some cases, multiple observations have an identical value of the match probability, i.e., $\xi_{ij} = \xi_{ij'}$ with $j \ne j'$. Following the literature (e.g., McVeigh and Murray 2017; Sadinle 2017; Tancredi and Liseo 2011), we recommend that researchers analyze all matched observations by weighting them according to the match probability (see the subsection Post-Merge Analysis). If researchers wish to enforce a constraint that each record in one data set is matched with at most one record in the other data set, they may follow the procedure described in Online SI S5.
Incorporating Auxiliary Information
Another advantage of the probabilistic model introduced above is that we can incorporate auxiliary information in parameter estimation. This point has not been emphasized enough in the literature. Here, we briefly discuss two adjustments using auxiliary data: first, how to adjust for the fact that some names are more common than others, and second, how to incorporate aggregate information about migration. More details can be found in Online SI S6.

Because some first names are more common than others, they may be more likely to produce false matches. To adjust for this possibility without increasing the computational burden, we formalize the conditions under which the ex post correction originally proposed by Winkler (2000) is well suited for this purpose. Briefly, the probability of being a match is up-weighted or down-weighted given the true frequencies of different first names (obtained, for instance, from Census data) or the observed frequencies of each unique first name in the data (see Online SI S6.3.1).

Furthermore, we may know a priori how many matches we should find in two data sets because of knowledge and data on over-time migration. For instance, the Internal Revenue Service (IRS) publishes detailed information on migration in the United States from tax records (see https://www.irs.gov/uac/soi-tax-stats-migration-data). An estimate of the share of individuals who moved out of a state or who moved in-state can be easily reformulated as a prior on the relevant parameters in the Fellegi–Sunter model and incorporated into parameter estimation (see Online SI S6.3.2).
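As a stylized illustration of the second adjustment (a simplification of the derivation in Online SI S6.3.2, with hypothetical numbers), aggregate migration data that imply an expected number of matches can be translated into a value around which a prior for $\lambda$ is centered.

```r
## If auxiliary data suggest that roughly 'expected_matches' records of data
## set A should also appear in data set B, a natural prior mean for lambda is
## the implied share of matches among all N_A x N_B candidate pairs.
prior_lambda_mean <- function(expected_matches, n_A, n_B) {
  expected_matches / (n_A * n_B)
}

## Hypothetical example: two state voter files with 5.0 and 5.2 million
## records, where migration tables suggest about 4.6 million people appear
## in both files.
prior_lambda_mean(4.6e6, 5.0e6, 5.2e6)
```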
Post-Merge Analysis
Finally, we discuss how to conduct a statistical analysis once merging is complete. One advantage of probabilistic models is that we can directly incorporate the uncertainty inherent to the merging process in the post-merge analysis. This is important because researchers often use the merged variable either as the outcome or as an explanatory variable in the post-merge analysis. For example, when the ANES validates self-reported turnout by merging the survey data with a nationwide voter file, respondents who cannot be merged are coded as nonregistered voters. Given the uncertainty inherent to the merging process, it is possible that a merging algorithm fails to find some respondents in the voter file even though they are actually registered voters. Similarly, we may incorrectly merge survey respondents with other registered voters. These mismatches, if ignored, can adversely affect the properties of post-merge analyses (e.g., Neter, Maynes, and Ramanathan 1965; Scheuren and Winkler 1993).

Unfortunately, most of the record linkage literature has focused on the linkage process itself without considering how to conduct subsequent statistical analyses after merging data sets.^9 Here, we build on a small literature about post-merge regression analysis, the goal of which is to eliminate possible biases due to the linkage process within the Fellegi–Sunter framework (e.g., Hof and Zwinderman 2012; Kim and Chambers 2012; Lahiri and Larsen 2005; Scheuren and Winkler 1993, 1997). We also clarify the assumptions under which a valid post-merge analysis can be conducted.
The Merged Variable as an Outcome Variable
We first consider the scenario in which researchers wish to use the variable $Z$, merged from data set B, as a proxy for the outcome variable in a regression analysis. We assume that this regression analysis is applied to all observations of data set A and uses a set of explanatory variables $X$ taken from this data set. These explanatory variables may or may not include the variables used for merging. In the ANES application mentioned above, for example, we may be interested in regressing the validated turnout measure merged from the nationwide voter file on a variety of demographic variables measured in the survey.
For each observation $i$ in data set A, we obtain the mean of the merged variable, $z_i^{*} = E(Z_i \mid \gamma, \delta)$, where $Z_i$ represents the true value of the merged variable. This quantity can be computed as the weighted average of the variable $Z$ merged from data set B, where the weights are proportional to the match probabilities, i.e., $z_i^{*} = \sum_{j=1}^{N_B} \xi_{ij} Z_j / \sum_{j=1}^{N_B} \xi_{ij}$. In the ANES application, for example, $z_i^{*}$ represents the probability of turnout for survey respondent $i$ in data set A and can be computed as the weighted average of turnout among the registered voters in the voter file merged with respondent $i$.
9. An important exception is a fully Bayesian approach outside of the Fellegi–Sunter framework, which we do not pursue here because of its limited scalability (see Tancredi and Liseo 2011; Gutman, Afendulis, and Zaslavsky 2013; Gutman et al. 2016; Dalzell and Reiter 2018).
If we use thresholding and one-to-one match assignment so that each record in data set A is matched with at most one record in data set B (see the subsection The Canonical Model of Probabilistic Record Linkage), then we compute the mean of the merged variable as $z_i^{*} = \sum_{j=1}^{N_B} \widehat{M}_{ij} \xi_{ij} Z_j$, where $\widehat{M}_{ij}$ is a binary variable indicating whether record $i$ in data set A is matched with record $j$ in data set B, subject to the constraint $\sum_{j=1}^{N_B} \widehat{M}_{ij} \le 1$.
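As a toy numerical sketch of this construction (with made-up probabilities; in practice $\xi$ would be stored sparsely rather than as a dense $N_A \times N_B$ matrix), the weighted average $z_i^{*}$ can be computed as follows.

```r
## Posterior match probabilities for N_A = 3 survey-style records against
## N_B = 4 candidate records, and a merged 0/1 variable (e.g., turnout) from B.
xi <- rbind(c(0.02, 0.90, 0.01, 0.03),
            c(0.01, 0.02, 0.02, 0.80),
            c(0.95, 0.01, 0.01, 0.02))
Z_B <- c(1, 0, 1, 1)

## z*_i: weighted average of Z over candidate matches, weights proportional to xi
z_star <- as.vector(xi %*% Z_B) / rowSums(xi)
z_star
```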
Under this setting, we assume that the true value of the outcome variable is independent of the explanatory variables in the regression conditional on the information used for merging, i.e.,

$$
Z_i \;\perp\!\!\!\perp\; X_i \;\mid\; (\delta, \gamma), \qquad (7)
$$

for each $i = 1, 2, \ldots, N_A$. The assumption implies that the merging process is based on all relevant information. Specifically, within an agreement pattern, the true value of the merged variable $Z_i$ is not correlated with the explanatory variables $X_i$. Under this assumption, the law of iterated expectations implies that regressing $z_i^{*}$ on $X_i$ gives, in expectation, results equivalent to those based on the regression of $Z_i$ on $X_i$:

$$
E(Z_i \mid X_i) \;=\; E\bigl\{E(Z_i \mid \gamma, \delta, X_i) \mid X_i\bigr\} \;=\; E(z_i^{*} \mid X_i). \qquad (8)
$$

The conditional independence assumption may be violated if, for example, within the same agreement pattern, a variable correlated with the explanatory variables is associated with merging error. Without this assumption, however, only the bounds can be identified (Cross and Manski 2002). Thus, alternative assumptions such as parametric assumptions and exclusion restrictions are needed to achieve identification (see Ridder and Moffitt 2007, for a review).
The Merged Variable as an Explanatory Variable
The second scenario we consider is the case where we use the merged variable as an explanatory variable. Suppose that we are interested in fitting the following linear regression model:

$$
Y_i \;=\; \alpha + \beta Z_i + \eta^{\top} X_i + \epsilon_i, \qquad (9)
$$

where $Y_i$ is a scalar outcome variable and strict exogeneity is assumed, i.e., $E(\epsilon_i \mid Z, X) = 0$ for all $i$. We follow the analysis strategy first proposed by Lahiri and Larsen (2005) but clarify the assumptions required for their approach to be valid (see also Hof and Zwinderman 2012). Specifically, we maintain the assumption of no omitted variable for merging given in equation (7). Additionally, we assume that the merging variables are independent of the outcome variable conditional on the explanatory variables $Z$ and $X$, i.e.,

$$
Y_i \;\perp\!\!\!\perp\; (\gamma, \delta) \;\mid\; Z, X. \qquad (10)
$$
Under these two assumptions, we can consistently estimate the coefficients by regressing $Y_i$ on $z_i^{*}$ and $X_i$:

$$
E(Y_i \mid \gamma, \delta, X_i) \;=\; \alpha + \beta\, E(Z_i \mid \gamma, \delta, X_i) + \eta^{\top} X_i + E(\epsilon_i \mid \gamma, \delta, X_i)
\;=\; \alpha + \beta z_i^{*} + \eta^{\top} X_i, \qquad (11)
$$

where the second equality follows from the assumptions and the law of iterated expectations.
We generalize this strategy to maximum likelihood (ML) estimation, which, to the best of our knowledge, has not been considered in the literature (but see Kim and Chambers (2012) for an estimating-equations approach):

$$
Y_i \mid Z_i, X_i \;\overset{\text{indep.}}{\sim}\; P_{\theta}(Y_i \mid Z_i, X_i), \qquad (12)
$$

where $\theta$ is a vector of model parameters. To estimate the parameters of this model, we maximize the following weighted log-likelihood function:

$$
\hat{\theta} \;=\; \operatorname*{argmax}_{\theta} \;
\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \xi_{ij}^{*} \log P_{\theta}(Y_i \mid Z_i = Z_j, X_i), \qquad (13)
$$

where $\xi_{ij}^{*} = \xi_{ij} / \sum_{j'=1}^{N_B} \xi_{ij'}$. Online SI S7 shows that under the two assumptions described earlier and mild regularity conditions, the weighted ML estimator given in equation (13) is consistent and asymptotically normal. Note that because we are considering large data sets, we ignore the uncertainty about $\xi_{ij}$.
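For a Gaussian linear model, the weighted ML estimator in equation (13) amounts to a weighted least-squares fit on the stacked candidate pairs, with the normalized probabilities $\xi_{ij}^{*}$ as weights. The sketch below uses simulated toy data; the column names and data layout are ours, not the paper's replication code.

```r
## One row per candidate pair (i, j): Y and X come from data set A (constant
## within i), Z comes from the candidate record j in data set B, and xi is
## the posterior match probability of the pair.
set.seed(2)
n_A <- 200; cand <- 3
X_A <- rnorm(n_A)
Z_true <- rbinom(n_A, 1, 0.5)                    # value of Z on the true match in B
Y_A <- 0.5 + 1.0 * Z_true + 0.3 * X_A + rnorm(n_A)

## Stack the candidate pairs: the first candidate for each i is the true match
## (high xi), the other two are spurious candidates (low xi).
pairs <- data.frame(
  i  = rep(seq_len(n_A), each = cand),
  Y  = rep(Y_A, each = cand),
  X  = rep(X_A, each = cand),
  Z  = as.vector(rbind(Z_true,
                       rbinom(n_A, 1, 0.5),
                       rbinom(n_A, 1, 0.5))),
  xi = rep(c(0.90, 0.05, 0.05), times = n_A)
)

pairs$w <- ave(pairs$xi, pairs$i, FUN = function(p) p / sum(p))  # xi*_ij in equation (13)
fit <- lm(Y ~ Z + X, data = pairs, weights = w)                  # weighted ML, Gaussian case
summary(fit)
```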
SIMULATION STUDIES
We conduct a comprehensive set of simulation studies to evaluate the statistical accuracy and computational efficiency of our probabilistic modeling approach and to compare them with those of deterministic methods. Specifically, we assess the ability of the proposed methodology to control estimation error, false positives, and false negatives, and its robustness to missing values and noise in the linkage fields, as well as to the degree of overlap between the two data sets to be merged. We do so by systematically varying the amount and structure of missing data and measurement error.
The Setup
To make our simulation studies realistic, we use a data set taken from the 2006 California voter file. Because merging voter files is often done by blocking on gender, we subset the data to include female voters only, reducing the number of observations from approximately 17 million to 8.3 million. To create a base data set for the simulations, we further subset the data by removing all observations that have at least one missing value in the following variables: first name, middle initial, last name, date of birth, registration date, address, zip code, and turnout in the 2004 presidential election. After listwise deletion, we obtain a final data set of 341,160 voters, from which we generate two data sets of various characteristics to be merged. From this data set, we independently and randomly select two subsamples to be merged under a variety of scenarios.
We design our simulation studies by varying the values of five parameters, as summarized below. Online SI S9.1 describes in detail the precise setups of these simulations.

1. Degree of overlap: the proportion of records in the smaller data set that are also in the larger data set. We consider three scenarios: 20% (small), 50% (medium), and 80% (large).
2. Size balance: the balance of sample sizes between the two data sets to be merged. We consider three ratios: 1:1 (equally sized), 1:10 (imbalanced), and 1:100 (lopsided).
3. Missing data: we consider five different mechanisms: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). For MAR and NMAR, we consider independent and dependent missingness patterns across linkage fields.
4. Amount of missing data: the proportion of missing values in each linkage variable other than year of birth. We consider three scenarios: 5% (small), 10% (medium), and 15% (large).
5. Measurement error: the proportion of records (6%) for which the first name, last name, and street name contain a classical measurement error.

Together, we conduct a total of 135 (= 3 × 3 × 5 × 3) simulation studies in which missing data are the main concern. We also conduct another set of 135 simulations with various types of nonclassical measurement errors, while keeping the amount of missing values fixed (see Online SI S9.2).
Results
Figure 1 compares the performance of fastLink (blue solid bars) to the two deterministic methods often used by social scientists. The first is a merging method based on exact matches (red shaded bars), whereas the second is the recently proposed partial-match algorithm (ADGN; light green solid bars), which considers two records a match if at least three fields of their address, date of birth, gender, and name are identical (Ansolabehere and Hersh 2017). The top panel of Figure 1 presents the FNR, whereas the bottom panel presents the absolute error for estimating the 2004 turnout rate. We merge two data sets of equal size (100,000 records each) after introducing the classical measurement error and the medium amount of missing data, as explained above. For fastLink, only pairs with a match probability of at least 0.85 are considered to be matches, but the results remain qualitatively similar if we change the threshold to 0.75 or 0.95.
We find that fastLink significantly outperforms the two deterministic methods.^10 Although all three methods are designed to control the FDR, only fastLink is able to keep the FNR low (less than five percentage points in all cases considered here). The deterministic algorithms are not robust to missing data and measurement error, yielding an FNR of much greater magnitude.
FIGURE 1. Accuracy of Data Merge
The top and bottom panels present the false negative rate (FNR) and the absolute estimation error (for estimating the turnout rate), respectively, when merging data sets of 100,000 records each with different levels of overlap (measured as a percentage of a data set). Three missing data mechanisms are studied with a missing data proportion of 10% for each linkage field other than year of birth: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Classical measurement error is introduced to several linkage fields. The proposed probabilistic methodology (fastLink; blue solid bars) significantly outperforms the two deterministic algorithms, i.e., exact match (red shaded bars) and partial match (ADGN; light green solid bars), across simulation settings.
10. In Online SI S8, we compare fastLink to the state-of-the-art probabilistic methods and find that fastLink performs as well as these methods.
Additionally, we observe that the deterministic methods yield a substantially greater estimation bias than fastLink unless the data are MCAR. Under the other two missing data mechanisms, the magnitude of their bias is substantially greater than that of fastLink. Although fastLink has an absolute estimation error of less than 1.5 percentage points even under MNAR, the other two methods have an absolute estimation error of more than 7.5 percentage points under both MAR and MNAR. Finally, the performance of fastLink worsens as the size of the overlap is reduced and the missing data mechanism becomes less random.

We next evaluate the accuracy of the FDR and FNR estimates, shown in the top and bottom panels of Figure 2, respectively. Because the deterministic methods do not give such error estimates, we compare the performance of the proposed methodology (indicated by blue solid circles) with that of the same probabilistic modeling approach that treats missing values as disagreements, following a common practice in the literature (indicated by black solid triangles). Figure 2 presents the results of merging two data sets of equal size where the medium amount of data are assumed to be MAR and some noise is added as described earlier. In the top panel of the figure, we find that the true FDR is low and its estimate is accurate unless the degree of overlap is small. With a small degree of overlap, both methods significantly underestimate the FDR. A similar finding is obtained for the FNR in the bottom panel of the figure, where the estimated FNR is biased upward.
One way to address the problem of small overlap would be to use blocking based on a set of fully observed covariates. For example, in our simulations, because the year of birth is observed for each record in both data sets, we block the data by making comparisons only across individuals within a window of $\pm 1$ year around each birth year.^11 Then, we apply fastLink to each block separately. As shown in the rightmost column of Figure 2, blocking significantly improves the estimation accuracy of the FDR and FNR estimates as well as their true values, although the bias is not eliminated. The reason for this improvement is that traditional blocking increases the degree of overlap. For example, in this simulation setting, for each of the 94 blocks under consideration, the ratio of true matches to all possible pairs is at least $8 \times 10^{-5}$, which is more than 15 times as large as the corresponding ratio with no blocking and is comparable to the case of 50% overlap.
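A base-R sketch of this kind of window blocking (independent of any particular package; variable names are ours) pairs each birth-year block in one file with the records in the other file whose birth year falls within one year.

```r
## For each birth year present in data set A, find the indices of records in
## A born that year and of records in B born within +/- 1 year of it.
window_block_pairs <- function(yearsA, yearsB, width = 1) {
  lapply(sort(unique(yearsA)), function(y) {
    list(year = y,
         A = which(yearsA == y),
         B = which(abs(yearsB - y) <= width))
  })
}

blocks <- window_block_pairs(yearsA = c(1950, 1950, 1962),
                             yearsB = c(1949, 1951, 1963, 1980))
blocks[[1]]  # A records born in 1950 paired with B records born 1949-1951
```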
We present the results of the remaining simulation studies in the Online Simulation Appendix.
FIGURE 2. Accuracy of FDR and FNR Estimates
The top panel compares the estimated FDR (x-axis) with its true value (y-axis), whereas the bottom panel compares the estimated FNR against its true value. We consider the medium amount of missing data generated under MAR as the missingness mechanism and add measurement error to some linkage fields. The blue solid circles represent the estimates based on fastLink, whereas the black solid triangles represent the estimates obtained by treating missing data as disagreements. The FDR and FNR estimates are accurate when the overlap is high. Additionally, fastLink gives lower FDR and FNR than the same algorithm that treats missing values as disagreements. Note that in cases where the overlap is small (20%), blocking improves the precision of our estimates.
11. In Online SI S12, we also present results using a clustering method, i.e., k-means, to group similar observations.
Two major patterns discussed above are also found under these other simulation scenarios. First, regardless of the missing data mechanism and the amount of missing observations, fastLink controls the FDR, FNR, and estimation error well. Second, a greater degree of overlap between the data sets leads to better merging results in terms of FDR and FNR as well as the accuracy of their estimates. Blocking can ameliorate the problems caused by small overlap to some extent. These empirical patterns are consistently found across simulations even when the two data sets have unequal sizes.
Computational Efficiency
We compare the computational performance of fastLink with that of the RecordLinkage package in R (Sariyar and Borg 2016) and the Record Linkage package in Python (de Bruin 2017) in terms of running time. The latter two are the only other open-source packages in R and Python that implement a probabilistic model of record linkage under the Fellegi–Sunter framework. To mimic a standard computing environment of applied researchers, all calculations are performed on a Macintosh laptop computer with a 2.8 GHz Intel Core i7 processor and 8 GB of RAM. Although fastLink takes advantage of a multicore machine via OpenMP-based parallelization (the other two packages do not have a parallelization feature), we perform the comparison in a single-core computing environment so that we can assess the computational efficiency of our algorithm itself. Additionally, we include runtime results where we parallelize the computation across eight cores. For all implementations, we set the convergence threshold to $1 \times 10^{-5}$.^12

We consider the setup in which we merge two data sets of equal size with 50% overlap, a 10% missing-data proportion under MCAR, and no measurement error. Our linkage variables are first name, middle initial, last name, house number, street name, and year of birth. We vary the size of each data set from 1,000 to 300,000 records. As in the earlier simulations, each data set is based on the sample of 341,160 female registered voters in California for whom we have complete information in each linkage field. To build the agreement patterns, we use the Jaro–Winkler string distance with a cutoff of 0.94 for first name, last name, and street name. For the remaining fields, we only consider exact matches as agreements.
Figure 3 presents the results of this running time comparison. We find that although all three packages take a similar amount of time for data sets of 1,000 records, the running time increases exponentially for the other packages, in contrast to fastLink (black solid triangles connected by a dashed line for a single core; blue solid circles connected by a solid line for eight cores), which exhibits a near-linear increase. When matching data sets of 150,000 records each, fastLink takes less than six hours to merge using a single core (under three hours when parallelized across eight cores). In contrast, it takes more than 24 hours for Record Linkage (Python; solid purple squares connected by a dotted line) to merge two data sets of only 20,000 observations each. The performance is not as bad for RecordLinkage (R; red crosses connected by a dashed line), but it still takes over six hours to merge data sets of 40,000 records each. Moreover, an approximation based on an exponential regression model suggests that RecordLinkage (R) would take around 22 hours to merge two data sets of 50,000 records each, while Record Linkage (Python) would take about 900 days to accomplish the same merge.

In Online SI S3.1, we further decompose the runtime comparison to provide more detail on the sources of our computational improvements. We detail the choices we make in the computational implementation that yield these substantial efficiency gains in Appendix A.
EMPIRICAL APPLICATIONS
In this section, we present two empirical applications of the proposed methodology. First, we merge election survey data (about 55,000 observations) with political contribution data (about five million observations). The major challenge of this merge is that the expected number of matches between the two data sets is small. Therefore, we utilize blocking and conduct the data merge within each block. The second application is to merge two nationwide voter files, each of which has more than 160 million records. This may, therefore, represent the largest data merge ever conducted in the social sciences. We show how to use auxiliary information about within-state and across-state migration rates to inform the match.
Merging Election Survey Data with Political Contribution Data
Hill and Huber (2017) study differences between donors and nondonors by merging the 2012 CCES survey with the Database on Ideology, Money in Politics, and Elections (DIME; Bonica 2013). The 2012 CCES is based on a nationally representative sample of 54,535 individuals recruited from the voting-age population in the United States. The DIME data, on the other hand, provide information about individual donations to political campaigns. For the 2010 and 2012 elections, the DIME contains over five million donors.

The original authors asked YouGov, the company that conducted the survey, to merge the two data sets using a proprietary algorithm. This yielded a total of 4,432 CCES respondents matched to a donor in the DIME data. After the merge, Hill and Huber (2017) treat each matched CCES respondent as a donor and conduct various analyses by comparing these matched respondents with those who are not matched with a donor in the DIME data and hence are treated as nondonors.
12. Starting values differ across methods because the other methods do not allow us to change their default starting values. However, the EM algorithm converges quickly regardless of the choice of starting values. In fact, it is well known that the bottleneck is the large number of required comparisons (e.g., Christen 2012; Jaro 1972), for which we use a hashing technique as described in Appendix A.
Below, we apply the proposed methodology to merge these two data sets and conduct a post-merge analysis by incorporating the uncertainty about the merge process.
Merge Procedure
We use the name, address, and gender information to merge the two data sets. To protect the anonymity of CCES respondents, YouGov used fastLink to merge the data sets on our behalf. Moreover, because of contractual obligations, the merge was conducted only for 51,184 YouGov panelists, a subset of the 2012 CCES respondents. We block based on gender and state of residence, resulting in 102 blocks (50 states plus Washington DC, times two gender categories). The size of each block ranges from 175,861 pairs (CCES = 49, DIME = 3,589) to 790,372,071 pairs (CCES = 2,367, DIME = 333,913), with a median of 14,048,151 pairs (CCES = 377, DIME = 37,263). Within each block, we merge the data sets using first name, middle initial, last name, house number, street name, and postal code. As done in the simulations, we use three levels of agreement for the string-valued variables based on the Jaro–Winkler distance, with 0.85 and 0.92 as the thresholds. For the remaining variables (i.e., middle initial, house number, and postal code), we use a binary comparison indicating whether they have an identical value.
To construct our set of matched pairs between the CCES and DIME, we first use the one-to-one matching assignment algorithm described in Online SI S5 to find the best match in the DIME data for each CCES respondent. We then declare as a match any pair whose match probability is above a certain threshold. We use three thresholds, i.e., 0.75, 0.85, and 0.95, and examine the sensitivity of the empirical results to the choice of threshold value.^13 Finally, in the original study of Hill and Huber (2017), noise was added to the amount of contributions to protect the anonymity of matched CCES respondents. However, we signed a nondisclosure agreement with YouGov for our analysis so that we can make a precise comparison between the proposed methodology and the proprietary merge method used by YouGov.
Merge Results
Table 2 presents the merge results. We begin by assessing the match rates, which represent the proportion of CCES respondents who are matched with donors in the DIME data. Although the match rates are similar between the two methods, fastLink appears to find slightly more (fewer) matches for male (female) respondents than the proprietary method, regardless of the threshold used. However, this does not mean that both methods find identical matches. In fact, of the 4,797 matches identified by fastLink (using the threshold of 0.85), the proprietary method does not identify 861, or 18%, of them as matches.

As discussed in the subsection The Canonical Model of Probabilistic Record Linkage, one important advantage of the probabilistic modeling approach is that we can estimate the FDR and FNR, which are shown in the table.
FIGURE 3. Running Time Comparison
The plot presents the results of merging data sets of equal size using different implementations of the Fellegi–Sunter model. The data sets were constructed from a sample of female registered voters in California. The amount of overlap between the data sets is 50%, and, for each data set, there are 10% missing observations in each linkage variable: first name, middle initial, last name, house number, street name, and year of birth. The missing data mechanism is missing completely at random (MCAR). The computation is performed on a Macintosh laptop computer with a 2.8 GHz Intel Core i7 processor and 8 GB of RAM. The proposed implementation, fastLink (single-core runtime as black solid triangles connected by a dashed line, and parallelized over eight cores as blue solid dots connected by a solid line), is significantly faster than the other open-source packages.
13. In Online SI S10.3, instead of the one-to-one matching restriction used here, we present the results of the weighted approach described in the subsection The Uncertainty of the Merge Process. As shown in Figure S7 of Online SI S10.3, there is no distinguishable difference in the results obtained from either approach.
Such error rates are not available for the proprietary method. As expected, the overall estimated FDR is controlled to less than 1.5% for both male and female respondents. The estimated FNR, on the other hand, is large, illustrating the difficulty of finding some donors. In particular, we find that female donors are much more difficult to find than male donors.

Specifically, there are 12,803 CCES respondents who said they made a campaign contribution during the 12 months before the 2012 election. Among them, 5,206 respondents claimed to have donated at least 200 dollars. Interestingly, both fastLink and the proprietary method matched an essentially identical number of self-reported donors with a contribution of over 200 dollars (2,431 and 2,434, or approximately 47%, respectively), whereas among the self-reported small donors both methods can only match approximately 16% of them.
Next, we examine the quality of matches for the two methods (see also Online SI S13). We begin by comparing the self-reported donation amount of matched CCES respondents with their actual donation amount recorded in the DIME data. Although only donations greater than 200 dollars are recorded at the federal level, the DIME data include some donations of smaller amounts, if not all, at the state level. Thus, although we do not expect a perfect correlation between the self-reported and actual donation amounts, under the assumption that donors do not systematically under- or over-report the amount of campaign contributions, a high correlation between the two measures implies a more accurate merging process.

The upper panel of Figure 4 presents the results, where for fastLink we use one-to-one matching with the threshold of 0.85.^14 We find that for the respondents who are matched by both methods, the correlation between the self-reported and matched donation amounts is reasonably high (0.73). In the case of respondents who are matched by fastLink only, we observe that the correlation is low (0.57) but greater than the correlation for those matches identified by the proprietary method alone (0.42). We also examine the distribution of match probabilities for these three groups of matches. The bottom panel of the figure presents the results, which are consistent with the patterns of correlation identified in the top panel. That is, those matches identified by the two methods have the highest match probability, whereas most of the matches identified only by the proprietary method have extremely low match probabilities. In Online SI S13, we also examine the quality of the agreement patterns separately for the matches identified by both methods, by fastLink only, and by the proprietary method only. Overall, our results indicate that fastLink produces matches whose quality is often better than that based on the proprietary method.
TABLE 2. The Results of Merging the 2012 Cooperative Congressional Election Study (CCES) with the 2010 and 2012 Database on Ideology, Money in Politics, and Elections (DIME) Data

                                                  fastLink                      Proprietary
                                         0.75      0.85      0.95               method
  Number of matches          All         4,948     4,797     4,576              4,534
                             Female      2,198     2,156     2,067              2,210
                             Male        2,750     2,641     2,524              2,324
  Overlap between fastLink   All         3,959     3,936     3,881
  and proprietary method     Female      1,877     1,866     1,844
                             Male        2,082     2,070     2,037
  Match rate (%)             All         9.67      9.37      8.94               8.85
                             Female      8.12      7.96      7.63               8.16
                             Male        11.40     10.95     10.40              9.64
  FDR (%)                    All         1.24      0.65      0.21
                             Female      0.92      0.53      0.14
                             Male        1.50      0.75      0.28
  FNR (%)                    All         15.25     17.35     20.81
                             Female      5.35      6.80      10.30
                             Male        21.83     24.36     27.79

Note: The table presents the merging results for both fastLink and the proprietary method used by YouGov. The results of fastLink are presented for one-to-one matching with three different thresholds (0.75, 0.85, and 0.95) for the match probability required to declare a pair of observations a successful match. The number of matches, the amount of overlap, and the overall match rates are similar between the two methods. The table also presents the estimated false discovery and false negative rates (FDR and FNR, respectively) obtained using fastLink. These statistics are not available for the proprietary method.
[14] Figures S5 and S6 in Online SI S10 present the results under two different thresholds: 0.75 and 0.95, respectively. The results under those thresholds are similar to those with the threshold of 0.85 presented here.
Post-Merge Analysis
An important advantage of the probabilistic modeling approach is its ability to account for the uncertainty of the merge process in post-merge analyses. We illustrate this feature by revisiting the post-merge analysis of Hill and Huber (2017). The original authors are interested in the comparison of donors (defined as those who are matched with records in the DIME data) and nondonors (defined as those who are not matched) among CCES respondents. Using the matches identified by a proprietary method, Hill and Huber (2017) regress policy ideology on the matching indicator variable, which is interpreted as a donation indicator variable, the turnout indicator variables for the 2012 general election and 2012 congressional primary elections, as well as several demographic variables. Policy ideology, which ranges from -1 (most liberal) to 1 (most conservative), is constructed by applying a factor analysis to a series of questions on various issues.[15] The demographic control variables include income, education, gender, household union membership, race, age in decades, and importance of religion. The same model is fitted separately for Democrats and Republicans.
To account for the uncertainty of the merge process, as explained in the subsection Post-Merge Analysis, we fit the same linear regression except that we use the mean of the match indicator variable as the main explanatory variable rather than the match indicator variable itself. Table 3 presents the estimated coefficients of the aforementioned linear regression models with the corresponding heteroskedasticity-robust standard errors in parentheses. Generally, the results of our improved analysis agree with those of the original analysis, showing that donors tend to be more ideologically extreme than nondonors.
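To make this adjustment concrete, the following sketch fits a regression of policy ideology on the posterior match probability (the mean of the match indicator) with heteroskedasticity-robust standard errors. It is only an illustration of the approach, not replication code: the file name, the column names (ideology, match_prob, turnout_general, turnout_primary), and the use of Python's statsmodels are our assumptions.

```python
# A minimal sketch (not replication code): regress policy ideology on the
# posterior match probability instead of a 0/1 match indicator, with
# heteroskedasticity-robust standard errors. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cces_dime_merged.csv")  # hypothetical merged CCES-DIME file

# match_prob is the posterior match probability produced by the probabilistic merge
# (demographic controls omitted for brevity)
X = sm.add_constant(df[["match_prob", "turnout_general", "turnout_primary"]])
y = df["ideology"]

fit = sm.OLS(y, X).fit(cov_type="HC1")  # robust SEs, analogous to Table 3
print(fit.summary())
```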
Although the overall conclusion is similar, the estimated coefficients are smaller in magnitude when accounting for the uncertainty of the merge process. In particular, according to fastLink, for Republican respondents, the estimated coefficient of being a donor represents only 12% of the standard deviation of their ideological positions (instead of 21% given by the proprietary method). Indeed, the difference in the estimated coefficients between fastLink and the proprietary method is statistically significant for both Republicans (0.035, s.e. = 0.014) and Democrats
(-0.015, s.e. = 0.007). Moreover, although the original analysis finds that the partisan mean ideological difference for donors (1.108, s.e. = 0.018) is 31 percent larger than that for nondonors (0.848, s.e. = 0.001), the results based on fastLink show that this difference is only 25 percent larger for donors (1.058, s.e. = 0.018). Thus, although the proprietary method suggests that the partisan gap for donors is similar to the partisan gap for those with a college degree or higher (1.100, s.e. = 0.036), fastLink shows that it is closer to the partisan gap for those with just some college education but without a degree (1.036, s.e. = 0.035).

FIGURE 4. Comparison of fastLink and the Proprietary Method
The top panel compares the self-reported donations (y-axis) by matched CCES respondents with their donation amount recorded in the DIME data (x-axis) for the three different groups of observations: those declared as matches by both fastLink and the proprietary method (left), those identified by fastLink only (middle), and those matched by the proprietary method only (right). The bottom panel presents histograms of the match probability for each group. For fastLink, we use one-to-one match with the threshold of 0.85.

[15] These include gun control, climate change, immigration, abortion, jobs versus the environment, same-sex marriage, affirmative action, and fiscal policy.
Merging Two Nationwide Voter Files Over Time
Our second application is what might be the largest data merging exercise ever conducted in the social sciences. Specifically, we merge the 2014 nationwide voter file to the 2015 nationwide voter file, each of which has over 160 million records. The data sets are provided by L2, Inc., a leading national nonpartisan firm and the oldest organization in the United States that supplies voter data and related technology to candidates, political parties, pollsters, and consultants for use in campaigns. In addition to the sheer size of the data sets, merging these nationwide voter files is methodologically challenging because some voters change their residence over time, making addresses uninformative for matching these voters.
Merge Procedure
When merging data sets of this scale, we must drastically reduce the number of comparisons. In fact, if we examined all possible pairwise comparisons between the two voter files, the total number of such pairs would exceed 2.5 × 10^16. It is also important to incorporate auxiliary information about movers because the address variable is uninformative when matching these voters. We use the IRS Statistics of Income (SOI) data to calibrate match rates for within-state and across-state movers. Details on incorporating migration rates into parameter estimation can be found in the subsection Incorporating Auxiliary Information and Online SI S6.2. The IRS SOI data are the definitive source of migration data in the United States, tracking individual residences year to year across all states through tax returns.
We develop the following two-step procedure that utilizes random sampling and blocking of voter records to reduce the computational burden of the merge (see Online SI S3.2 and S6.2). Our merge is based on first name, middle initial, last name, house number, street name, date/year/month of birth, date/year/month of registration, and gender. The first step uses each of these fields to inform the merge, whereas the second step uses only first name, middle initial, last name, date/year/month of birth, and gender. For both first name and last name, we include a partial match category based on the Jaro-Winkler string distance calculation, setting the cutoff for a full match at 0.92 and for a partial match at 0.88.
As described in Online SI S6.2, we incorporate auxiliary information into the model by moving from the likelihood framework to a fully Bayesian approach. Because of the conjugacy of our priors, we can obtain the estimated parameters by maximizing the log posterior distribution via the EM algorithm. This approach allows us to maintain computational efficiency.[16]
Step 1: Matching within-state movers and nonmovers for each state.
(a) Obtain a random sample of voter records from each state file.
(b) Fit the model to this sample, using the within-state migration rates from the IRS data to specify prior parameters.
(c) Create blocks by first stratifying on gender and then applying the k-means algorithm to the first name.
(d) Using the estimated model parameters, conduct the data merge within each block.
Step 2: Matching across-state movers for each pair of states.
(a) Set aside voters who are identified as successful matches in Step 1.
(b) Obtain a random sample of voter records from each state file as done in Step 1(a).
(c) Fit the model, using the across-state migration rates from the IRS data to specify prior parameters.
(d) Create blocks by first stratifying on gender and then applying the k-means algorithm to the first name as done in Step 1(c).
(e) Using the estimated model parameters, conduct the data merge within each block as done in Step 1(d).

TABLE 3. Predicting Policy Ideology Using Contributor Status

                                          Republicans                Democrats
                                      Original    fastLink     Original    fastLink
Contributor                           0.080***    0.046***    -0.180***   -0.165***
                                      (0.016)     (0.015)      (0.008)     (0.009)
Turnout for 2012 general election     0.095***    0.095***    -0.060***   -0.060***
                                      (0.013)     (0.013)      (0.010)     (0.010)
Turnout for 2012 primary election     0.094***    0.095***    -0.019**    -0.022***
                                      (0.009)     (0.009)      (0.009)     (0.009)
Demographic controls                  Yes         Yes          Yes         Yes
Number of observations                17,386      17,386       20,925      20,925

Note: The estimated coefficients from the linear regression of policy ideology score on the contributor indicator variable and a set of demographic controls. Along with the original analysis, the table presents the results of the improved analysis based on fastLink, which accounts for the uncertainty of the merge process. *** p < 0.001, ** p < 0.01, * p < 0.05. Robust standard errors in parentheses.

[16] Specifically, we set prior parameters on the expected match rate and the expected within-state mover rate using the IRS data, giving 75% weight to the prior estimate and 25% weight to the maximum likelihood estimate. For the first step, we set priors on both π_{address,1,0} (the probability of a voter's address not matching conditional on being in the matched set, which is equivalent to the share of in-state movers in the matched set) and λ. For the second step, we set a prior on λ.
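Footnote 16 describes giving 75% weight to the IRS-based prior estimate and 25% weight to the maximum likelihood estimate. The snippet below is only a minimal sketch of that weighting idea; the actual implementation places conjugate priors on λ and π_{address,1,0} and maximizes the log posterior via the EM algorithm (Online SI S6.2), and the numbers used here are placeholders.

```python
# Illustrative 75%/25% blend of a prior estimate with a maximum likelihood
# estimate, as described in footnote 16. The numbers are placeholders only.
PRIOR_WEIGHT = 0.75

def blend(prior_estimate, mle_estimate, w=PRIOR_WEIGHT):
    """Weighted combination of a prior-based estimate and an MLE."""
    return w * prior_estimate + (1.0 - w) * mle_estimate

# e.g., the expected match rate (lambda) informed by IRS migration data
lambda_hat = blend(prior_estimate=0.90, mle_estimate=0.86)
print(round(lambda_hat, 3))  # 0.89
```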
In Step 1, we apply a random sampling, rather than blocking, strategy so that we can use the within-state migration rates from the IRS data and fit the model to a representative sample for each state. For the same reason, we use a random sampling strategy in Step 2 to exploit the availability of IRS across-state migration rates. We obtain a random sample of 800,000 voter records for files with over 800,000 voters and use the entire state file for states with fewer than 800,000 voter records on file. Online SI S11 shows through simulation studies that for data sets as small as 100,000 records, a 5% random sample leads to parameter estimates nearly indistinguishable from those obtained using the full data set. Based on this finding, we choose 800,000 records as the size of the random samples, corresponding to approximately 5% of the records from California, the largest state in the United States.
Second, within each step, we conduct the merge by creating blocks to reduce the number of pairs under consideration. We block based on gender, first name, and state, and we select the number of blocks so that the average size of each blocked data set is approximately 250,000 records. To block by first name, we rank order the first names alphabetically and run the k-means algorithm on this ranking in order to create clusters of maximally similar names.[17]
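The sketch below illustrates this blocking idea: first names are assigned their alphabetical rank, and k-means is run on the ranks so that alphabetically adjacent names fall into the same block. The example names, the number of clusters, and the use of scikit-learn are assumptions made for illustration; this is not the fastLink implementation.

```python
# Sketch of blocking by k-means on the alphabetical rank of first names.
# The names and the number of clusters are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

first_names = np.array(["ALAN", "ALICE", "BRUCE", "CARL", "CARLA", "DAVID", "DEBRA", "ZOE"])

# Alphabetical rank of each name
order = np.argsort(first_names)
ranks = np.empty(len(first_names))
ranks[order] = np.arange(len(first_names))

# Cluster the ranks so that alphabetically adjacent names share a block
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ranks.reshape(-1, 1))
for block in range(3):
    print(block, first_names[kmeans.labels_ == block])
```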
Finally, the entire merge procedure is computationally intensive. The reason is that we need to repeat Step 1 for each of the 50 states plus Washington, DC, and apply Step 2 to each of 1,275 pairs of states. Thus, as explained in Online SI S2, we use parallelization whenever possible. All merges were run on a Linux cluster with 16 2.4-GHz Broadwell 28-core nodes with 128 GB of RAM per node.
Merge Results
Table 4 presents the overall match rate, FDR, and FNR obtained from fastLink. We assess the performance of the match at three separate matching probability thresholds for declaring a pair of observations a successful match: 0.75, 0.85, and 0.95. We also break out the matches into within-state matches only and across-state matches only. Across the three thresholds, the overall match rate remains very high, at 93.04% under the 0.95 threshold, while the estimated FDR and FNR remain controlled at 0.03% and 3.86%, respectively. All three thresholds yield match rates that are significantly higher than the corresponding match rates of the exact matching technique.
In Figure 5, we examine the quality of the merge separately for the within-state merge (top panel) and the across-state merge (bottom panel). The first column plots the distribution of the matching probability across all potential match pairs. For both the within-state and across-state merges, we observe a clear separation between the successful matches and the unsuccessful matches, with very few matches falling in the middle. This suggests that the true and false matches are identified reasonably well. In the second column, we examine the distribution of the match rate by state. Here, we see that most states are tightly clustered between 88% and 96%. Only Ohio, with a match rate of 85%, has a lower match rate. For the across-state merge, the match rate is clustered tightly between 0% and 5%.
In the third column, we plot the estimated FDR against the estimated FNR for each state. For the within-state merge, the FDR is controlled well: every state other than Minnesota has an FDR below 0.1%. Additionally, there are only two states, Mississippi and New Mexico, where fastLink seems to have trouble identifying true matches, as measured by the FNR. In the across-state merge, the FDR for every state is below 0.1%, suggesting that the resulting matches are of high quality. Furthermore, fastLink appears to be finding a high share of true movers across voter files, as the FNR for all but three states falls under 2%.

FIGURE 5. Graphical Diagnostics from Merging the 2014 Nationwide Voter File with the 2015 Nationwide Voter File
This figure presents graphical diagnostics for fastLink for within-state matches (top panel) and across-state matches (bottom panel). The first column plots the distribution of the matching probability across all patterns. The second column plots the distribution of the match rate for each state. Lastly, the third column compares the FNR against the FDR for each state separately.

TABLE 4. The Results of Merging the 2014 Nationwide Voter File with the 2015 Nationwide Voter File

                                              fastLink
                                     0.75      0.85      0.95      Exact
Match count (millions)  All         135.60    129.69    128.73     91.62
                        Within-state 127.38    127.12    126.80     91.36
                        Across-state   8.22      2.57      1.93      0.27
Match rate (%)          All          97.25     93.67     93.04     66.24
                        Within-state  92.06     91.87     91.66     66.05
                        Across-state   5.19      1.80      1.38      0.19
FDR (%)                 All           1.02      0.10      0.03
                        Within-state   0.08      0.04      0.01
                        Across-state   0.95      0.06      0.02
FNR (%)                 All           3.35      3.63      3.86
                        Within-state   2.63      2.83      3.05
                        Across-state   0.72      0.80      0.81

Note: This table presents the merging results for fastLink for three different thresholds (i.e., 0.75, 0.85, 0.95) for the matching probability to declare a pair of observations a successful match. Across the different thresholds, the match rates do not change substantially and are significantly greater than the corresponding match rates of the exact matching technique.

[17] See Online SI S14 for evidence that this blocking strategy performs similarly to a blocking strategy based on age windowing.
Finally, we examine the across-state migration patterns recovered from our matching procedure. Figure 6 displays a heatmap of the migration patterns obtained from fastLink, with darker purple colors indicating a higher match rate when merging the 2014 nationwide voter file for a given state (origin state) to the 2015 nationwide voter file for a given state (destination state). We uncover several regional migration patterns. First, we find a migration cluster in New England, where voters from New Hampshire and Rhode Island migrated to Massachusetts between 2014 and 2015. Another strong migration cluster exists between New Jersey, Delaware, and Pennsylvania in the mid-Atlantic region. Both patterns suggest that most migration occurs between clusters of adjacent states and urban centers. Finally, we find a large volume of out-migration to Florida from across the United States, and this out-migration is particularly concentrated in states on the Eastern seaboard such as Virginia, New Hampshire, New Jersey, and Connecticut. This pattern possibly reflects the flow of older voters and retirees to a more temperate climate.

FIGURE 6. Across-State Match Rates for the 2014 Nationwide Voter File to 2015 Nationwide Voter File Merge
We plot the match rates from each across-state match pair as a heatmap, where darker colors indicate a higher match rate.
CONCLUDING REMARKS
With the advance of the Internet, the last two decades have witnessed a data revolution in the social sciences, where diverse and large data sets have become electronically available to researchers. Much of today's cutting-edge quantitative social science research results from researchers' creativity in linking multiple data sets that are collected separately. In many cases, however, a unique identifier that can be used to merge multiple data sources does not exist. Currently, most social scientists rely on either deterministic or proprietary methods. Yet deterministic methods are not robust to measurement errors and missing data, cannot quantify the uncertainty inherent in the merge process, and often require arbitrary decisions from researchers. Proprietary methods, many of which are also deterministic, lack transparency and hence are not suitable for academic and policy research, where reproducibility and transparency play an essential role.
Here, we advocate the use of probabilistic modeling to assist the merging of large-scale data sets. The main advantage of probabilistic models is their ability to estimate the false positive and false negative rates that arise when linking multiple data sets. We contribute to the statistical literature on record linkage by developing a fast and scalable implementation of the canonical model. Through simulation and empirical studies, we demonstrate that the proposed methodology can quickly and reliably merge data sets even when they have millions of records.
Like any method, however, the proposed record linkage technology has important limitations of which researchers must be aware. Most importantly, the proposed methodology is likely to have a difficult time producing high-quality matches when the overlap between two data sets is expected to be small. As shown in our simulation studies, for these difficult merge problems, effective blocking is essential. Blocking is even more important when linking many data sets at once. Other important research questions are how to merge more than two files at the same time and how to efficiently use a small amount of hand-coded data to improve the quality of record linkage. We leave these methodological challenges to future research.
Appendix A. COMPUTATIONALLY EFFICIENT IMPLEMENTATION
In this appendix, we describe the details of our computationally efficient implementation of the canonical model of probabilistic record linkage.
Reverse Data Structures for Field Comparisons
The critical step in record linkage is to compare pairs of records across the K fields used to link two data sets, which is often regarded as the most expensive step in terms of computational time (Christen 2012). To do so, for each linkage field k, we first compare observation i of data set A and observation j of data set B via a predefined distance metric (e.g., Jaro-Winkler for string-valued fields) and obtain a value, which we call S_k(i, j). However, comparisons in the Fellegi-Sunter model are represented in terms of a discrete agreement level per linkage field, not a continuous measure of agreement such as the one implied by the distance metric. In other words, we need a discrete representation of S_k(i, j).
Specifically, if we have a total of L_k agreement levels for the kth variable, then

$$
\gamma_k(i, j) =
\begin{cases}
0 & \text{if } S_k(i, j) \le \tau_0 \\
1 & \text{if } \tau_0 < S_k(i, j) \le \tau_1 \\
\;\vdots & \\
L_k - 1 & \text{if } \tau_{L_k - 2} < S_k(i, j) \le \tau_{L_k - 1},
\end{cases}
\tag{14}
$$
where γ_k(i, j) represents the agreement level between the values of variable k for the pair (i, j), and τ = {τ_0, τ_1, ..., τ_{L_k - 1}} is the set of predetermined thresholds used to define the agreement levels. For example, to compare first names and last names, some authors such as Winkler (1990) argue in favor of using the Jaro-Winkler string distance to produce S_k, where one could use τ = {0.88, 0.94} to construct γ_k for three agreement levels.
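As a concrete illustration of equation (14), the snippet below maps continuous similarity scores to three agreement levels using the thresholds τ = {0.88, 0.94} mentioned above. The similarity values are supplied directly for illustration; in practice they would come from a string comparator such as Jaro-Winkler. This is a sketch, not the fastLink code.

```python
# Sketch: map continuous similarity scores S_k(i, j) to agreement levels
# gamma_k(i, j) in {0, 1, 2} using the thresholds tau = {0.88, 0.94}.
import numpy as np

similarities = np.array([0.30, 0.87, 0.88, 0.93, 0.94, 1.00])  # illustrative S_k values
thresholds = [0.88, 0.94]

# right=True mirrors equation (14): level 0 if S <= 0.88, level 1 if 0.88 < S <= 0.94, else level 2
agreement_levels = np.digitize(similarities, thresholds, right=True)
print(agreement_levels)  # [0 0 0 1 1 2]
```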
Still, the problem with constructing γ_k is that the number of comparisons we have to make is often large. In our proposed implementation we exploit the following characteristics of typical record linkage problems in the social sciences (a minimal sketch of the resulting data structure follows this list):

• The number of unique values observed in each linkage field is often less than the number of observations in each data set. For example, consider a variable such as first name. Naively, one may compare the first name of each observation in data set A with that of every observation in B. In practice, however, we can reduce the number of comparisons by considering only the unique first names that appear in each data set. The same trick can be used for all linkage fields by focusing on the comparison of the unique values of each variable.

• For each comparison between two unique first names (name_{1,A} and name_{1,B}), for example, we only keep the indices of the original data sets and store them using what is often referred to as a reverse data structure in the literature (Christen 2012). In such an arrangement, a pair of names (name_{1,A}, name_{1,B}) becomes a key with two lists, one containing the indices from data set A whose first name equals name_{1,A}, and another containing the corresponding indices for name_{1,B} in data set B.

• Comparisons involving a missing value need not be made. Instead, we only need to store the indices of the observations in A and B that contain missing information for field k.

• Because the agreement levels are mutually exclusive, we use the lowest agreement level as the base category. Once a set of threshold values has been defined, a pair of names can be categorized into only one of the L_k agreement levels. The indices for the pairs of values that are categorized as disagreements (or near disagreements) do not need to be stored. For most variables, disagreement is the category that encompasses the largest number of pairs. Thus, our reverse data structure lists become quite sparse. This sparsity can be exploited by the use of sparse matrices, yielding a substantially more memory-efficient implementation.
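The sketch below gives a minimal illustration of such a reverse data structure for a single field with exact agreement only: comparisons happen between unique values, each compared value pair keys the lists of record indices from the two data sets, disagreements are never stored, and missing values are set aside. The toy names are invented; partial agreements (e.g., above a Jaro-Winkler cutoff) would be added as additional keys in the same way.

```python
# Minimal sketch of a reverse data structure for one linkage field (e.g., first name),
# with exact agreement only. All values are invented for illustration.
from collections import defaultdict

field_A = ["JOHN", "JON", "MARY", "JOHN", None]  # data set A (None = missing)
field_B = ["JOHN", "MARY", "MARIA", None]        # data set B

idx_A, idx_B = defaultdict(list), defaultdict(list)
for i, v in enumerate(field_A):
    idx_A[v].append(i)
for j, v in enumerate(field_B):
    idx_B[v].append(j)

# Store only value pairs that agree; disagreements are never stored,
# and records with missing values are tracked separately.
reverse_structure = {
    (v, v): (idx_A[v], idx_B[v])
    for v in idx_A
    if v is not None and v in idx_B
}
missing_A, missing_B = idx_A[None], idx_B[None]

print(reverse_structure)  # {('JOHN', 'JOHN'): ([0, 3], [0]), ('MARY', 'MARY'): ([2], [1])}
print(missing_A, missing_B)  # [4] [3]
```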
Sparse Matrix Representation of Hash Tables to Count Agreement Patterns
Next, we describe our computationally efficient implementation of the Fellegi-Sunter model via the EM algorithm (see Online SI S2 for the exact algorithm we use). First, for implementing the E step, notice that the match probability given in equation (5) takes the same value for two pairs if their agreement patterns are identical. For the sake of illustration, consider a simple example where two variables are used for merging, i.e., K = 2, and a binary comparison is made for each variable, i.e., L_k = 2. Under this setting, there are a total of nine agreement patterns: (0, 0), (0, 1), (1, 0), (1, 1), (NA, 0), (NA, 1), (0, NA), (1, NA), and (NA, NA), where 1 and 0 represent agreement and disagreement, respectively, whereas NA represents a missing value. Then, for instance, the match probability for (0, 1) is given by λπ_{110}π_{211} / {λπ_{110}π_{211} + (1 - λ)π_{100}π_{201}}, whereas that for (1, NA) is equal to λπ_{111} / {λπ_{111} + (1 - λ)π_{101}}. If all comparison values are missing, e.g., (NA, NA), then we set the match probability to λ. Thus, the E step can be implemented by computing the match probability for each of the realized agreement patterns. Often, the total number of realized agreement patterns is much smaller than the number of all possible agreement patterns.
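The snippet below sketches this computation for the two example patterns, (0, 1) and (1, NA); fields with missing comparisons are simply dropped from the products. The λ and π values are made-up illustrations, not estimates from any application in the paper.

```python
# Sketch of the E step for K = 2 fields with binary comparisons. The match
# probability depends only on the agreement pattern; missing fields (None)
# are dropped from the products. Parameter values are illustrative only.
lam = 0.2                                              # probability of a match
pi_match = [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}]        # pi_{k,1,l}: P(level l | match)
pi_unmatch = [{0: 0.9, 1: 0.1}, {0: 0.95, 1: 0.05}]    # pi_{k,0,l}: P(level l | non-match)

def match_probability(pattern):
    """Posterior match probability for one agreement pattern, e.g. (0, 1) or (1, None)."""
    observed = [(k, level) for k, level in enumerate(pattern) if level is not None]
    if not observed:                  # all comparisons missing -> probability lambda
        return lam
    num, den = lam, 1.0 - lam
    for k, level in observed:
        num *= pi_match[k][level]
        den *= pi_unmatch[k][level]
    return num / (num + den)

print(round(match_probability((0, 1)), 3))     # ~0.308
print(round(match_probability((1, None)), 3))  # ~0.692
```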
Second, the M step defined in equations (S1) and (S2) requires the summation of match probabilities across all pairs or a subset of them. Because this probability is identical within each agreement pattern, all we have to do is count the total number of pairs that exhibit each agreement pattern. In other words, the number of pairs per agreement pattern becomes our sufficient statistic. We use the following hash function for efficient counting:[18]
$$
H = \sum_{k=1}^{K} H_k, \qquad
H_k =
\begin{bmatrix}
h_k^{(1,1)} & h_k^{(1,2)} & \cdots & h_k^{(1,N_B)} \\
\vdots & \vdots & \ddots & \vdots \\
h_k^{(N_A,1)} & h_k^{(N_A,2)} & \cdots & h_k^{(N_A,N_B)}
\end{bmatrix},
\tag{15}
$$

where $h_k^{(i,j)} = \mathbf{1}\{\gamma_k(i, j) > 0\}\, 2^{\,\gamma_k(i, j) + \mathbf{1}\{k > 1\} \sum_{e=1}^{k-1}(L_e - 1)}$. The matrix H_k maps each pair of records to a corresponding agreement pattern in the kth variable, represented by a unique hash value based on the powers of 2. These hash values are chosen such that the matrix H links each pair to the corresponding agreement pattern across all K variables.
Because an overwhelming majority of pairs do not agree on any of the linkage fields, most elements of the H_k matrix are zero. As a result, the H matrix also has many zeros. In our implementation, we utilize sparse matrices, whose lookup time is O(P), where P is the number of unique agreement patterns observed. In most applications, P is much smaller than the total number of possible agreement patterns, i.e., $\prod_{k=1}^{K} L_k$. This hashing technique is applicable when the number of variables used for the merge is moderate. If many variables are used for the merge, approximate hashing techniques such as min hashing and locality-sensitive hashing become necessary.
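To make the bookkeeping concrete, the sketch below computes the per-field hash values h_k^{(i,j)} for a toy example, sums them across fields to obtain H, and counts how many pairs fall into each unique agreement pattern; those counts are the sufficient statistics used in the M step. The tiny γ_k arrays are invented for illustration, missing values are ignored, and a plain counter stands in for the sparse-matrix machinery used in the actual implementation.

```python
# Sketch of hashing agreement patterns for K = 2 fields with L_1 = L_2 = 3 levels.
# gamma[k][i, j] is the agreement level of pair (i, j) on field k (0 = disagreement).
import numpy as np
from collections import Counter

gamma = [
    np.array([[0, 2, 0], [1, 0, 0]]),  # field 1 agreement levels (illustrative)
    np.array([[0, 2, 1], [0, 0, 0]]),  # field 2 agreement levels (illustrative)
]
L = [3, 3]

H = np.zeros_like(gamma[0])
offset = 0
for k, g in enumerate(gamma):
    # h_k = 1{gamma_k > 0} * 2^(gamma_k + offset); disagreements hash to zero, so H stays sparse
    H = H + (g > 0) * (2 ** (g + offset))
    offset += L[k] - 1

# Each distinct hash value identifies one agreement pattern across the K fields;
# the counts per pattern are the sufficient statistics for the M step.
pattern_counts = Counter(H.ravel().tolist())
print(pattern_counts)  # Counter({0: 3, 20: 1, 8: 1, 2: 1})
```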
[18] Since the work of Jaro (1972), the use of table-like objects to store agreement patterns has been recognized as an important step toward improving computational efficiency. Our contribution goes beyond this by tying together, under a unified framework, reverse data structures and a novel use of a sparse matrix representation of a hash table to store agreement patterns.
SUPPLEMENTARY MATERIAL
To view supplementary material for this article, please visit
https://doi.org/10.1017/S0003055418000783.
Replication materials can be found on Dataverse at: https://
doi.org/10.7910/DVN/YGUHTD.
REFERENCES
Adena, Maja, Ruben Enikolopov, Maria Petrova, Veronica Santarosa, and Ekaterina Zhuravskaya. 2015. Radio and the Rise of the Nazis in Prewar Germany. Quarterly Journal of Economics 130: 1885–939.
Ansolabehere, Stephen, and Eitan Hersh. 2012. Validation: What Big Data Reveal about Survey Misreporting and the Real Electorate. Political Analysis 20: 437–59.
Ansolabehere, Stephen, and Eitan Hersh. 2017. ADGN: An Algorithm for Record Linkage Using Address, Date of Birth, Gender and Name. Statistics and Public Policy 4: 1–10.
Belin, Thomas R., and Donald B. Rubin. 1995. A Method for Calibrating False-Match Rates in Record Linkage. Journal of the American Statistical Association 90: 694–707.
Berent, Matthew K., Jon Arthur Krosnick, and A. Lupia. 2016. Measuring Voter Registration and Turnout in Surveys: Do Official Government Records Yield More Accurate Assessments? Public Opinion Quarterly 80: 597–621.
Bolsen, Toby, Paul J. Ferraro, and Juan Jose Miranda. 2014. Are Voters More Likely to Contribute to Other Public Goods? Evidence from a Large-Scale Randomized Policy Experiment. American Journal of Political Science 58: 17–30.
Bonica, Adam. 2013. Database on Ideology, Money in Politics, and Elections: Public Version 1.0 [Computer File]. Stanford, CA: Stanford University Libraries.
Cesarini, David, Erik Lindqvist, Robert Ostling, and Bjorn Wallace. 2016. Wealth, Health, and Child Development: Evidence from Administrative Data on Swedish Lottery Players. Quarterly Journal of Economics 131: 687–738.
Christen, Peter. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Heidelberg, Germany: Springer.
Cohen, William W., Pradeep Ravikumar, and Stephen Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In International Joint Conference on Artificial Intelligence (IJCAI) 18.
Cross, Philip J., and Charles F. Manski. 2002. Regressions, Short and Long. Econometrica 70: 357–68.
Dalzell, Nicole M., and Jerome P. Reiter. 2018. Regression Modeling and File Matching Using Possibly Erroneous Matching Variables. Journal of Computational and Graphical Statistics 1–11. Published online July 11, 2018.
de Bruin, Jonathan. 2017. Record Linkage. Python library. Version 0.8.1. https://recordlinkage.readthedocs.io/.
Einav, Liran, and Jonathan Levin. 2014. Economics in the Age of Big Data. Science 346 (6210): 1243089-1–6.
Enamorado, Ted. 2018. Active Learning for Probabilistic Record Linkage. Social Science Research Network (SSRN). URL: https://ssrn.com/abstract=3257638.
Engbom, Niklas, and Christian Moser. 2017. Returns to Education Through Access to Higher-Paying Firms: Evidence from US Matched Employer-Employee Data. American Economic Review: Papers and Proceedings 107: 374–78.
Feigenbaum, James. 2016. Automated Census Record Linking: A Machine Learning Approach. Boston University. Technical Report. https://jamesfeigenbaum.github.io/research/pdf/census-link-ml.pdf
Fellegi, Ivan P., and Alan B. Sunter. 1969. A Theory of Record Linkage. Journal of the American Statistical Association 64: 1183–210.
Figlio, David, and Jonathan Guryan. 2014. The Effects of Poor Neonatal Health on Children's Cognitive Development. American Economic Review 104: 3921–55.
Giraud-Carrier, Christophe, Jay Goodlife, Bradley M. Jones, and Stacy Cueva. 2015. Effective Record Linkage for Mining Campaign Contribution Data. Knowledge and Information Systems 45: 389–416.
Goldstein, Harvey, and Katie Harron. 2015. Methodological Developments in Data Linkage. John Wiley & Sons, Ltd. Chapter 6: Record Linkage: A Missing Data Problem, pp. 109–24.
Gutman, Roee, Christopher C. Afendulis, and Alan M. Zaslavsky. 2013. A Bayesian Procedure for File Linking to End-of-Life Medical Costs. Journal of the American Medical Informatics Association 103: 34–47.
Gutman, Roee, Cara J. Sammartino, Traci C. Green, and Brian T. Montague. 2016. Error Adjustments for File Linking Methods Using Encrypted Unique Client Identifier (eUCI) with Application to Recently Released Prisoners Who Are HIV+. Statistics in Medicine 35: 115–29.
Harron, Katie, Harvey Goldstein, and Chris Dibben, eds. 2015. Methodological Developments in Data Linkage. West Sussex: John Wiley & Sons.
Hersh, Eitan D. 2015. Hacking the Electorate: How Campaigns Perceive Voters. Cambridge, UK: Cambridge University Press.
Herzog, Thomas H., Fritz Scheuren, and William E. Winkler. 2010. Record Linkage. Wiley Interdisciplinary Reviews: Computational Statistics 2: 535–43.
Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. 2007. Data Quality and Record Linkage Techniques. New York: Springer.
Hill, Seth. 2017. Changing Votes or Changing Voters: How Candidates and Election Context Swing Voters and Mobilize the Base. Electoral Studies 48: 131–48.
Hill, Seth J., and Gregory A. Huber. 2017. Representativeness and Motivations of the Contemporary Donorate: Results from Merged Survey and Administrative Records. Political Behavior 39: 3–29.
Hof, Michel H. P., and Aeilko H. Zwinderman. 2012. Methods for Analyzing Data from Probabilistic Linkage Strategies Based on Partially Identifying Variables. Statistics in Medicine 31: 4231–42.
Imai, Kosuke, and Dustin Tingley. 2012. A Statistical Method for Empirical Testing of Competing Theories. American Journal of Political Science 56: 218–36.
Jaro, Matthew. 1972. UNIMATCH: A Computer System for Generalized Record Linkage Under Conditions of Uncertainty. Technical Report, Spring Joint Computer Conference.
Jaro, Matthew. 1989. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414–20.
Jutte, Douglas P., Leslie L. Roos, and Marni D. Browne. 2011. Administrative Record Linkage as a Tool for Public Health Research. Annual Review of Public Health 32: 91–108.
Kim, Gunky, and Raymond Chambers. 2012. Regression Analysis under Incomplete Linkage. Computational Statistics and Data Analysis 56: 2756–70.
Lahiri, Partha, and Michael D. Larsen. 2005. Regression Analysis with Linked Data. Journal of the American Statistical Association 100: 222–30.
Larsen, Michael D., and Donald B. Rubin. 2001. Iterative Automated Record Linkage Using Mixture Models. Journal of the American Statistical Association 96: 32–41.
McLachlan, Geoffrey, and David Peel. 2000. Finite Mixture Models. New York: John Wiley & Sons.
McVeigh, Brendan S., and Jared S. Murray. 2017. Practical Bayesian Inference for Record Linkage. Technical Report, Carnegie Mellon University.
Meredith, Marc, and Michael Morse. 2014. Do Voting Rights Notification Laws Increase Ex-Felon Turnout? The ANNALS of the American Academy of Political and Social Science 651: 220–49.
Mummolo, Jonathan, and Clayton Nall. 2016. Why Partisans Don't Sort: The Constraints on Political Segregation. Journal of Politics 79: 45–59.
Murray, Jared S. 2016. Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering. Journal of Privacy and Confidentiality 7: 3–24.
Neter, John, E. Scott Maynes, and R. Ramanathan. 1965. The Effect of Mismatching on the Measurement of Response Errors. Journal of the American Statistical Association 60: 1005–27.
Ong, Toan C., Michael V. Mannino, Lisa M. Schilling, and Michael G. Kahn. 2014. Improving Record Linkage Performance in the Presence of Missing Linkage Data. Journal of Biomedical Informatics 52: 43–54.
Richman, Jesse T., Gulshan A. Chattha, and David C. Earnest. 2014. Do Non-Citizens Vote in U.S. Elections? Electoral Studies 36: 149–57.
Ridder, Geert, and Robert Moffitt. 2007. Handbook of Econometrics. Vol. 6. Elsevier. Chapter: The Econometrics of Data Combination, pp. 5469–547.
Sadinle, Mauricio. 2014. Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach. Annals of Applied Statistics 8: 2404–34.
Sadinle, Mauricio. 2017. Bayesian Estimation of Bipartite Matchings for Record Linkage. Journal of the American Statistical Association 112: 600–12.
Sariyar, Murat, and Andreas Borg. 2016. Record Linkage in R. R package. Version 0.4-10. http://cran.r-project.org/package=RecordLinkage.
Sariyar, Murat, Andreas Borg, and Klaus Pommerening. 2012. Missing Values in Deduplication of Electronic Patient Data. Journal of the American Medical Informatics Association 19: e76–82.
Scheuren, Fritz, and William E. Winkler. 1993. Regression Analysis of Data Files that Are Computer Matched. Survey Methodology 19: 39–58.
Scheuren, Fritz, and William E. Winkler. 1997. Regression Analysis of Data Files that Are Computer Matched II. Survey Methodology 23: 157–65.
Steorts, Rebecca C. 2015. Entity Resolution with Empirically Motivated Priors. Bayesian Analysis 10: 849–75.
Steorts, Rebecca C., Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A Comparison of Blocking Methods for Record Linkage. In Privacy in Statistical Databases, ed. Josep Domingo-Ferrer. Springer, 253–68.
Tam Cho, Wendy, James Gimpel, and Iris Hui. 2013. Voter Migration and the Geographic Sorting of the American Electorate. Annals of the American Association of Geographers 103: 856–70.
Tancredi, Andrea, and Brunero Liseo. 2011. A Hierarchical Bayesian Approach to Record Linkage and Population Size Problems. Annals of Applied Statistics 5: 1553–85.
Thibaudeau, Yves. 1993. The Discrimination Power of Dependency Structures in Record Linkage. Survey Methodology 19.
Winkler, William E. 1990. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Section on Survey Research Methods. American Statistical Association. https://www.iser.essex.ac.uk/research/publications/501361.
Winkler, William E. 1993. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Survey Research Methods Section. American Statistical Association. http://ww2.amstat.org/sections/srms/Proceedings/papers/1993_042.pdf.
Winkler, William E. 2000. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Technical Report No. RR2000/05, Statistical Research Division, Methodology and Standards Directorate, U.S. Bureau of the Census.
Winkler, William E. 2005. Approximate String Comparator Search Strategies for Very Large Administrative Lists. Research Report Series (Statistics) No. 2005-02, Statistical Research Division, U.S. Census Bureau.
Winkler, William E. 2006a. Automatic Estimation of Record Linkage False Match Rates. In Proceedings of the Section on Survey Research Methods. American Statistical Association.
Winkler, William E. 2006b. Overview of Record Linkage and Current Research Directions. Technical Report, United States Bureau of the Census.
Winkler, William E., and William Yancey. 2006. Record Linkage Error-Rate Estimation without Training Data. In Proceedings of the Section on Survey Research Methods. American Statistical Association.
Winkler, William E., William Yancey, and E. H. Porter. 2010. Fast Record Linkage of Very Large Files in Support of the Decennial and Administrative Record Projects. In Proceedings of the Section on Survey Research Methods.
Yancey, William. 2005. Evaluating String Comparator Performance for Record Linkage. Research Report Series, Statistical Research Division, U.S. Census Bureau.