Linking Temporal Records

Pei Li

University of Milan-Bicocca

[email protected].it

Xin Luna Dong

AT&T Labs-Research

[email protected]

Andrea Maurino

University of Milan-Bicocca

[email protected].it

Divesh Srivastava

AT&T Labs-Research

div[email protected]

ABSTRACT

Many data sets contain temporal records over a long period of time; each

record is associated with a time stamp and describes some aspects of a real-

world entity at that particular time (e.g., author information in DBLP). In

such cases, we often wish to identify records that describe the same entity

over time and so be able to enable interesting longitudinal data analysis.

However, existing record linkage techniques ignore the temporal informa-

tion and can fall short for temporal data.

This paper studies linking temporal records. First, we apply time decay to

capture the effect of elapsed time on entity value evolution. Second, instead

of comparing each pair of records locally, we propose clustering methods

that consider time order of the records and make global decisions. Experi-

mental results show that our algorithms signiﬁcantly outperform traditional

linkage methods on various temporal data sets.

1. INTRODUCTION

Record linkage takes a set of records as input and discovers which

records refer to the same real-world entity. It plays an important

role in data integration, data aggregation, and personal information

management, and has been extensively studied in recent years (see

[7, 12] for recent surveys). Existing techniques typically proceed

in two steps: the ﬁrst step compares similarity between each pair

of records, deciding if they match or do not match; the second step

clusters the records accordingly, with the goal that records in the

same cluster refer to the same real-world entity and records in dif-

ferent clusters refer to different ones.

In practice, a data set may contain temporal records over a long

period of time; each record is associated with a time stamp and de-

scribes some aspects of a real-world entity at that particular time.

In such cases, we often wish to identify records that describe the

same real-world entity over time and so be able to trace the history

of that entity. For example, DBLP

lists research papers over many

decades; we wish to identify individual authors such that we can

list all publications by each author. Other examples include medi-

cal data that keep patient information over tens of years, customer-

relationship data that contain customer information over years, and

http://www.informatik.uni-trier.de/∼ley/db/.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for proﬁt or commercial advantage and that copies

bear this notice and the full citation on the ﬁrst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior speciﬁc

permission and/or a fee. Articles from this volume were invited to present

their results at The 37th International Conference on Very Large Data Bases,

August 29th - September 3rd 2011, Seattle, Washington.

Proceedings of the VLDB Endowment, Vol. 4, No. 7

Table 1: Records from DBLP

ID name afﬁliation co-authors year

Xin Dong R. Polytechnic Institute Wozny 1991

Xin Dong Univ of Washington Halevy, Tatarinov 2004

Xin Dong Univ of Washington Halevy 2005

Xin Luna Dong Univ of Washington Halevy, Yu 2007

Xin Luna Dong AT&T Labs-Research Das Sarma, Halevy 2009

Xin Luna Dong AT&T Labs-Research Naumann 2010

Dong Xin Univ of Illinois Han, Wah 2004

Dong Xin Univ of Illinois Wah 2007

Dong Xin Microsoft Research Wu, Han 2008

Dong Xin Univ of Illinois Ling, He 2009

Dong Xin Microsoft Research Chaudhuri, Ganti 2009

Dong Xin Microsoft Research Ganti 2010

so on; identifying records that refer to the same entity enables in-

teresting longitudinal data analysis over such data [17].

Although linking temporal records is important, to the best of

our knowledge, traditional techniques ignore the temporal infor-

mation in linkage. Thus, they can fall short for such data sets for

two reasons. First, the same real-world entity can evolve over time

(e.g., a person can change her phone number and address) and so

records that describe the same real-world entity at different times

can contain different values; blindly requiring value consistency of

the linked records can thus cause false negatives. Second, it is more

likely to ﬁnd highly similar entities over a long time period than at

the same time (e.g., having two persons with highly similar names

in the same university over the past 30 years is more likely than

at the same time) and so records that describe different entities at

different times can share common values; blindly matching records

that have similar attribute values can thus cause false positives. We

illustrate the challenges by the following example.

EXAMPLE 1.1. Consider records that describe paper authors

in Table 1; each record is derived from a publication record at

DBLP (we may skip some co-authors for space reason). These

records describe 3 real-world persons: r

describes E

: Xin Dong,

who was at R. Polytechnic in 1991; r

− r

describe E

: Xin

Luna Dong, who moved from Univ of Washington to AT&T Labs;

− r

describe E

: Dong Xin, who moved from Univ of Illinois

to Microsoft Research.

If we require high similarity on both name and afﬁliation, we

may split entities E

and E

, as records for each of them can have

different values for afﬁliation. If we require only high similarity of

name, we may merge E

with E

as they share the same name,

and may even merge all of the three entities. 2

Despite the challenges, temporal information does present addi-

tional evidence for linkage. First, record values typically transition

smoothly. In the motivating example, person E

moved to a new

afﬁliation in 2008, but still had similar co-authors from previous

years. Second, record values seldom change erratically. In our

example, r

, r

are very unlikely to refer to the same

person, as a person rarely moves between two afﬁliations back and

forth over many years. (However, this can happen around transition

time; for example, entity E

has a paper with the old afﬁliation in-

formation after he moved to a new afﬁliation, as shown by record

.) Third, in case we have a fairly complete data set such as

DBLP, records that refer to the same real-world entity often (but

not necessarily) observe continuity; for example, one is less conﬁ-

dent that r

and r

− r

refer to the same person given the big time

gap between them. Exploring such evidence would require a global

view of the records with the time factor in mind.

This paper studies linking temporal records and makes three con-

tributions. First, we apply time decay, which aims to capture the

effect of time elapse on entity value evolution (Section 3). In par-

ticular, we deﬁne disagreement decay, with which value difference

over a long time is not necessarily taken as a strong indicator of re-

ferring to different real-world entities; we deﬁne agreement decay,

with which the same value with a long time gap is not necessarily

taken as a strong indicator of referring to the same entity. We de-

scribe how we learn decay from labeled data and how we apply it

when computing similarity between records.

Second, instead of comparing each pair of records locally and

then clustering, we describe three temporal clustering methods that

consider records in time order and accumulate evidence over time

to enable global decision making (Section 4). Among them, early

binding makes eager decisions and merges a record with an already

created cluster once it computes a high similarity; late binding in-

stead keeps all evidence and makes decisions at the end; and ad-

justed binding in addition compares a record with clusters that are

created for records with later time stamps.

Finally, we applied our methods on a European patent data set

and two subsets of the DBLP data set. Our experimental results

show that applying decay in traditional methods can already im-

prove linkage results, and applying our clustering methods can ob-

tain results with high precision and recall (Section 5).

This paper focuses on improving quality (precision and recall) of

linking temporal records. We can improve efﬁciency of linkage by

applying previous techniques such as canopy [13] to create small

blocks of records that are candidates for temporal linkage.

2. OVERVIEW

Consider a domain D of object entities (not known a-priori) where

each entity is described by a set of attributes A = {A

, . . . , A

}

and values of an attribute can change over time (e.g., person afﬁli-

ation, business addresses). We distinguish single-valued and multi-

valued attributes, where the difference is whether for an attribute an

entity can have a single or multiple values at any time. Consider a

set R of records, each associated with a time stamp and describing

an entity in D at that particular time. Given a record r ∈ R, we

denote by r.t the time stamp of r and by r.A the value of attribute

A ∈ A from r (we allow null as a value). Our goal is to decide

which records in R refer to the same entity in D.

DEFINITION 2.1 (TEMPORAL RECORD LINKAGE). Let R be

a set of records, each in the form of (x

, . . . , x

, t), where t is the

time stamp associated with the record, and x

, i ∈ [1, n], is the

value of attribute

at time

for the referred entity.

The temporal record linkage problem clusters the records in R

such that records in the same cluster refer to the same entity over

time and records in different clusters refer to different entities. 2

EXAMPLE 2.2. Consider the records in Table 1, where each

record describes an author by her name, afﬁliation, and co-authors

(co-authors is a multi-valued attribute) and is associated with a

time stamp (year). The ideal linkage solution contains 3 clusters:

}, {r

, . . . , r

}, {r

, . . . , r

}. 2

Overview of our solution: Our record-linkage techniques leverage

the temporal information in two ways.

First, when computing record similarity, traditional linkage tech-

niques reward high value similarity and penalize low value similar-

ity. However, as time elapses, values of a particular entity may

evolve; for example, a researcher may change afﬁliation, email,

and even name over time (see entities E

and E

in Example 1.1).

Meanwhile, different entities are more likely to share the same

value(s) with a long time gap; for example, it is more likely that

we observe two persons with the same name within 30 years than

at the same time. We thus deﬁne decay (Section 3), with which we

can reduce penalty for value disagreement and reduce reward for

value agreement over a long period. Our experimental results show

that applying decay in similarity computation can already improve

over traditional linkage techniques.

Second, when clustering records according to record similarity,

traditional techniques do not consider time order of the records.

However, time order can often provide important clues. In Ex-

ample 1.1, records r

− r

and r

− r

may refer to the same

person even though the decayed similarity between r

and r

low, because the time period of r

− r

(year 2004-2007) and that

of r

− r

(year 2009-2010) do not overlap; on the other hand,

records r

− r

and r

, r

are very likely to refer to different

persons even though the decayed similarity between r

and r

high, because the records interleave and their occurrence periods

highly overlap. We propose temporal clustering algorithms (Sec-

tion 4) that consider time order of records and can further improve

linkage results.

3. TIME DECAY

This section introduces time decay, an important concept that

aims at capturing the effect of time elapsing on value evolution.

Section 3.1 deﬁnes decay, Section 3.2 describes how we learn de-

cay, and Section 3.3 describes how we apply decay in similarity

computation. Experimental results show that by applying decay in

traditional linkage techniques, we can already improve the results.

3.1 Deﬁnition

As time goes by, the value of an entity may evolve; for example,

entity E

in Example 1.1 was at UW from 2004 to 2007, and moved

to AT&T Labs afterwards. Thus, different values for a single-valued

attribute over a long period of time should not be considered as

strong indicator of referring to different entities. We deﬁne dis-

agreement decay to capture this intuition.

DEFINITION 3.1 (DISAGREEMENT DECAY). Let ∆t be a time

distance and A ∈ A be a single-valued attribute. Disagreement

decay of A over time ∆t, denoted by d

̸=

(A, ∆t), is the probability

that an entity changes its A-value within time ∆t. 2

On the other hand, as time goes by, we are more likely to observe

two entities with the same attribute value; for example, in Exam-

ple 1.1 entity E

occurred in 1991 and E

occurred in 2004-2010,

and they share the same name. Thus, the same value over a long

period of time should not be considered as strong indicator of re-

ferring to the same entity. We deﬁne agreement decay accordingly.

DEFINITION 3.2 (AGREEMENT DECAY). Let ∆t be a time dis-

tance and A ∈ A be an attribute. The agreement decay of A over

time ∆t, denoted by d

(A, ∆t), is the probability that two different

entities share the same A-value within time ∆t. 2

Figure 1: Address decay curves. Figure 2: Learning d

̸=

(aﬀ, ∆t) Figure 3: Learning d

(name, ∆t)

According to the deﬁnitions, decay satisﬁes two properties. First,

decay is in the range of [0,1]; however, d

̸=

(A, 0) and d

(A, 0) are

not necessarily 0. Second, decay observes monotonicity; that is, for

any ∆t < ∆t

′

and any attribute A, d

̸=

(A, ∆t) ≤ d

̸=

(A, ∆t

′

) and

(A, ∆t) ≤ d

(A, ∆t

′

). Whereas our deﬁnition of decay ap-

plies to all attributes, for attributes whose values always remain sta-

ble (e.g., birth-date), the disagreement decay is always 0, and for

those whose values change rapidly (e.g., bank-account-balance),

the disagreement decay is always 1.

EXAMPLE 3.3. Figure 1 shows the curves of disagreement de-

cay and agreement decay on attribute address learned from a Eu-

ropean patent data set (described in detail in Section 5).

We observe that (1) the disagreement decay increases from 0 to 1

when time elapses, showing that two records differing in afﬁliation

over a long time is not a strong indicator of referring to different

entities; (2) the agreement decay is close to 0 everywhere, showing

that in this data set, sharing the same address is a strong indicator

of referring to the same entity even over a long time; (3) even when

∆t = 0, neither the disagreement nor the agreement decay is ex-

actly 0, meaning that even at the same time address match does not

correspond to record match and vice versa. 2

3.2 Learning decay

Decay can be speciﬁed by domain experts or learnt from a la-

beled data set, for which we know if two records refer to the same

entity and if two strings represent the same value.

For simpli-

ﬁcation of computation, we make three assumptions. 1. Value

uniqueness: at each time point an entity has a single value for a

single-valued attribute. 2. Closed-world: for each entity described

in the labeled data set, during the time period when records that

describe this entity are present, each of its ever-existing values is

reﬂected by some record and the change is reﬂected at the transi-

tion point. 3. Correctness: values in each record reﬂect the truth in

real world. The data sets in practice often violate the assumptions.

In our learning we remove records that violate the ﬁrst assumption.

Our experimental results show that the learned decay does lead to

good linkage results even when the latter two assumptions are vio-

lated, as various kinds of violations in the real data often cancel out

each other in affecting the learned curves.

Consider attribute A and time period ∆t. We next describe how

we calculate the decay for A and ∆t according to the labels. Ap-

pendix A describes alternative ways of learning decay, whose re-

sults lead to similar linkage results in our experiments.

Disagreement decay: By deﬁnition, disagreement decay for ∆t is

the probability that an entity changes its A-value within time ∆t.

So we need to ﬁnd the valid period of each A-value of an entity.

Consider an entity E and its records in increasing time order,

denoted by r

, . . . , r

, n ≥ 1. We call a time point t a change point

if r

is at time t or if at time t there exists a record r

, i ∈ [2, n],

whose A-value is different from r

i−1

. For each change point t

(associated with a new value), we can compute a life span: if t is

not the last change point of E, we call the life span a full time span

In case there is no label for whether two strings represent the same value,

we can easily extend our techniques by considering value similarity.

and denote it by [t, t

), where t

is the next change point;

otherwise, we call the life span a partial time span and denote it by

[t, t

end

+ δ), where t

end

is the last time stamp for this value and δ

denotes one time unit. A life span [t, t

′

) has length t

′

−t, indicating

that the corresponding value lasts for time t

′

− t before any change

in case of a full life span, and that the value lasts at least for time

′

− t in case of a partial life span. We denote by

the bag of

lengths of full life spans, and by

that for partial life spans.

To learn d

̸=

(A, ∆t), we consider all full life spans and the partial

life spans with at least length ∆t (we do not know for others if the

value will change in time ∆t). We compute the decay as

̸=

(A, ∆t) =

|{l ∈

|l ≤ ∆t}|

| + |{l ∈

|l ≥ ∆t}|

. (1)

We give details in Algorithm LEARNDISAGREEDECAY (Appendix A).

We can prove that the decay it learns satisﬁes the monotonicity

property (proofs of all results are in the appendix).

PROPOSITION 3.4. Let A be an attribute. For any ∆t < ∆t

′

the decay learned by Algorithm LEARNDISAGREEDECAY satisﬁes

̸=

(A, ∆t) ≤ d

̸=

(A, ∆t

′

). 2

EXAMPLE 3.5. Consider learning disagreement decay for af-

ﬁliation from the data in Example 1.1. For illustrative purpose, we

remove record r

as its afﬁliation information is incorrect. Take

as an example. As shown in Figure 2, it has two change points:

2004 and 2009. So there are two life spans: [2004, 2009) has

length 5 and is full, and [2009, 2011) has length 2 and is partial.

After considering other entities, we have

= {4, 5} and

{1, 2, 3}. Accordingly, d

̸=

(aﬀ, ∆t ∈ [0, 1]) =

2+3

= 0, d

̸=

(aﬀ,

∆t = 2) =

2+2

= 0, d

̸=

(aﬀ, ∆t = 3) =

2+1

= 0, d

̸=

(aﬀ, ∆t =

4) =

= 0.5, and d

̸=

(aﬀ, ∆t ≥ 5) =

= 1. 2

Agreement decay: By deﬁnition, agreement decay for ∆t is the

probability that two different entities share the same value within

time period ∆t. Consider a value v of attribute A. Assume entity

has value v with life span [t

, t

) and E

has value v with life

span [t

, t

). Without losing generality, we assume t

≤ t

. Then,

for any ∆t ≥ max{0, t

− t

+ δ}, E

and E

share the same

value v within a period of ∆t. We call max{0, t

− t

+ δ} the

span distance for v between E

and E

For any pair of entities, we ﬁnd the shared values and compute

the corresponding span distance for each of them. If two entities

never share any value, we use ∞ as the span distance between

them. We denote by

L the bag of span distances and compute the

agreement decay as

(A, ∆t) =

|{l ∈

L|l ≤ ∆t}|

. (2)

Algorithm LEARNAGREEDECAY (Appendix A) describes the

details and we next show monotonicity of its results.

PROPOSITION 3.6. Let A be an attribute. For any ∆t < ∆t

′

the decay learned by Algorithm LEARN AGREEDECAY satisﬁes

(A, ∆t) ≤ d

(A, ∆t

′

). 2

We can easily extend to the case where v has multiple life spans for the

same entity.

EXAMPLE 3.7. Consider learning agreement decay for name

from data in Example 1.1. As shown in Figure 3, entities E

and E

share value Xin Dong, for which the life span for E

is [1991, 1992)

and that for E

is [2004, 2007). Thus, the span distance between

and E

is 2004 − 1992 + 1 = 13. No other pair of enti-

ties shares the same value; thus,

L = {13, ∞, ∞}. Accordingly,

(name, ∆t ∈ [0, 12]) =

= 0, and d

(name, ∆t ≥ 13) =

= 0.33. 2

3.3 Applying decay

We now describe how we apply decay in record-similarity com-

putation. We focus on single-valued attributes and extend our method

for multi-valued attributes in Appendix B.

When computing similarity between two records with a big time

gap, we often wish to reduce the penalty if they have different val-

ues and reduce the reward if they share the same value. Thus, we as-

sign weights to the attributes according to the decay; the lower the

weight, the less important is an attribute in record-similarity com-

putation, so the less penalty for value disagreement or the less re-

ward for value agreement. This weight is decided both by the time

gap and by the similarity between the values (to decide whether to

apply agreement or disagreement decay). We denote by w

(s, ∆t)

the weight of attribute A with value similarity s and time difference

∆t. Given records r and r

′

, we compute their similarity as

sim(r, r

′

) =



A∈A

(sim(r.A, r

′

.A), |r.t − r

′

.t|) · sim(r.A, r

′

.A)



A∈A

(sim(r.A, r

′

.A), |r.t − r

′

.t|)

(3)

Next we describe how we set w

(s, ∆t). With probability s, the

two values are the same and we shall use the complement of the

agreement decay; with probability 1 − s, they are different and we

shall use the complement of the disagreement decay. Thus, we set

(s, ∆t) = 1 −s ·d

(A, ∆t)−(1−s) · d

̸=

(A, ∆t). In practice,

we use thresholds θ

and θ

to indicate high similarity and low

similarity respectively, and set w

(s, ∆t) = 1 − d

(A, ∆t) if s >

and w

(s, ∆t) = 1 − d

̸=

(A, ∆t)) if s < θ

. Our experiments

show robustness of our techniques with respect to different settings

of the thresholds.

EXAMPLE 3.8. Consider records r

and r

in Example 1.1 and

we focus on single-valued attributes name and afﬁliation. Assume

the name similarity between r

and r

is .9 and the afﬁliation sim-

ilarity is 0. Suppose d

(name, ∆t = 5) = .05, d

̸=

(aﬀ, ∆t =

5) = .9, and θ

= .8. Then, the weight for name is 1 − .05 =

.95 and that for afﬁliation is 1 − .9 = .1. So the similarity is

sim(r

, r

) =

.95∗.9+.1∗0

.95+.1

= .81. Note that if we do not apply

decay and assign the same weight to each attribute, the similarity

would become

.5∗.9+.5∗0

.5+.5

= .45.

Thus, by applying decay, we are able to merge r

− r

, despite

the afﬁliation change of the entity. Note however that we will also

incorrectly merge all records together because each record has a

high decayed similarity with r

. 2

4. TEMPORAL CLUSTERING

As shown in Example 3.8, even when we apply decay in similar-

ity computation, traditional clustering methods do not necessarily

lead to good results as they ignore the time order of the records.

This section proposes three clustering methods, all processing the

records in increasing time order. Early binding (Section 4.1) makes

eager decisions and merges a record with an already created clus-

ter once it computes a high similarity between them. Late binding

(Section 4.2) compares a record with each already created clus-

ter and keeps the probability, and makes clustering decision at the

end. Adjusted binding (Section 4.3) is applied after early binding

or late binding, and improves over them by comparing a record

also with clusters created later and adjusting the clustering results.

Our experimental results show that adjusted binding signiﬁcantly

outperforms traditional clustering methods on temporal data.

4.1 Early binding

Algorithm: Early binding considers the records in time order; for

each record it eagerly creates its own cluster or merges it with an

already created cluster. In particular, consider record r and already

created clusters C

, . . . , C

. EARLY (details in Appendix C) pro-

ceeds in three steps.

1. Compute the similarity between r and each C

, i ∈ [1, n].

2. Choose the cluster C with the highest similarity. Merge r

with C if sim(r, C) > θ, where θ is a threshold indicating

high similarity; create a new cluster C

n+1

for r otherwise.

3. Update signature for the cluster with r accordingly.

Cluster signature: When we merge record r with cluster C, we

need to update the signature of C accordingly (step 3). As we

consider r as the latest record of C, we take r’s values as the lat-

est values of C. For the purpose of similarity computation, which

we describe shortly, for each latest value v we wish to keep 1) its

various representations, denoted by

R(v), and 2) its earliest and

latest time stamps in the current period of occurrence, denoted by

(v) and t

(v) respectively. The latest occurrence of v is obvi-

ously r.t. We maintain the earliest time stamp and various rep-

resentations recursively as follows. Let v

′

be the previous value

of C, and let s

max

be the highest similarity between v and the

values in

R(v

′

). (1) If s

max

> θ

, we consider the two values

as the same and set t

(v) = t

′

) and

R(v) =

R(v

′

) ∪ {v}.

(2) If s

max

< θ

, we consider the two values as different and set

(v) = r.t and

R(v) = {v}. (3) Otherwise, we consider that with

probability s

max

the two values are the same, so we set t

(v) =

sim(v, v

′

) + (1 − sim(v, v

′

))r.t and

R(v) =

R(v

′

) ∪ {v}.

Similarity computation: When we compare r with a cluster C

(step 1), for each attribute A, we compare r’s A-value r.A with the

A-value in C’s signature, denoted by C .A . We make two changes

in this process: ﬁrst, we compare r.A with each value in

R(C.A)

and take the maximum similarity; second, when we compute the

weight for A, we use t

(C.A) for disagreement decay as C.A

starts from time t

(C.A), and use t

(C.A) for agreement decay

as t

(C.A) is the last time we see C.A.

EARLY runs in time O(|R|

); the quadratic time is in the number

of records in each block after preprocessing.

EXAMPLE 4.1. Consider applying early binding to records in

Table 1. We start with creating C

for r

. Then we merge r

with C

because of the high record similarity (same name and

high disagreement decay on afﬁliation with time difference 2004-

1991=13). The new signature of C

contains address UW from

2004 to 2004. We then create a new cluster C

for r

, as r

differs

signiﬁcantly from C

. Next, we merge r

and r

with C

and merge

and r

with C

. The signature of C

then contains address UW

from 2004 to 2007, and the signature of C

contains address MSR

from 2008 to 2008.

Now consider r

. It has a low similarity with C

and r

have a short time distance but different afﬁliations), but a high sim-

ilarity with C

(fairly similar name and high disagreement decay

on afﬁliation with time difference 2009-2004=5). We thus wrongly

merge r

with C

. This eager decision further prevents merging

and r

with C

and we create C

for them separately. 2

4.2 Late binding

Instead of making eager decisions and comparing a record with

a cluster based on such eager decisions, late binding keeps all ev-

idence, considers them in record-cluster comparison, and makes a

global decision at the end.

Late binding is facilitated by a bi-partite graph (N

, N

, E),

where each node in N

represents a record, each node in N

rep-

resents a cluster, and each edge (n

, n

) ∈ E is marked with the

probability that record r belongs to cluster C (see Figure 4 for an

example). Late binding clusters the records in two stages: ﬁrst,

evidence collection creates the bi-partite graph and computes the

weight for each edge; then, decision making removes edges such

that each record belongs to a single cluster.

Evidence collection: Late binding behaves similarly to early bind-

ing at the evidence collection stage, except that it keeps all possi-

bilities rather than making eager decisions. For each record r and

already created clusters C

, . . . , C

, it proceeds in three steps.

1. Compute the similarity between r and each C

, i ∈ [1, n].

2. Create a new cluster C

n+1

and assign similarity as follows.

(1) If for each i ∈ [1, n], sim(r, C

) ≤ θ, we consider that

r is unlikely to belong to any C

and set sim(r, C

n+1

) = θ.

(2) If there exists i ∈ [1, n], such that not only sim(r, C

) >

θ , but also sim

′

(r, C

) > θ, where sim

′

(r, C

) is computed

by ignoring decay, we consider that r is very likely to belong

to C

and set sim(r, C

n+1

) = 0. (3) Otherwise, we set

sim(r, C

n+1

) = max

i∈[1,n]

sim(r, C

3. Normalize the similarities such that they sum up to 1 and

use the results as probabilities of r belonging to each cluster.

Update the signature of each cluster accordingly.

In the last step, we normalize the similarities such that the higher

the similarity, the higher the result probability. Note that in contrast

to early binding, late binding is conservative when the record sim-

ilarity without decay is low (Step 2(3)); this may lead to splitting

records that have different values but refer to the same entity, and

we show later how adjusted binding can beneﬁt from the conserva-

tiveness. Note also that in practice, we may set low similarities to

0 to improve efﬁciency; we discuss the details in Appendix D.

Cluster signature: For each cluster, the signature consists of all

records that may belong to the cluster along with the probabilities.

For each value of each record, we maintain the earliest time stamp,

the latest time stamp, and similar values, as we do in early binding.

Similarity computation: When we compare r with a cluster C,

we need to consider the probability that a record in C’s signature

belongs to C. Let r

, . . . , r

be the records of C in increasing

time order, and let P (r

), i ∈ [1, m], be the probability that r

belongs to C. Then, with probability P (r

), record r

is the lat-

est record of C and we should compare r with it; with probability

(1 − P (r

))P (r

m−1

), record r

m−1

is the latest record of C and

we should compare r with it; and so on. Note that the cluster is

valid only when r

, for which we create the cluster, belongs to

the cluster, so we use P (r

) = 1 in the computation (the original

P (r

) is used in the decision-making stage). Formally, the similar-

ity is computed as

sim(r, C) =



i=1

sim(r, r

)P (r

)Π

j=i+1

(1 − P (r

)). (4)

EXAMPLE 4.2. Consider applying late binding to the records

in Table 1 and let θ = .8. Figure 4 shows a part of the bi-partite

graph. At the beginning, we create an edge between r

and C

with weight 1. We then compare r

with C

: the similarity with

decay (.89 > θ) is high but that without decay (.5 < θ) is low.

Figure 4: Ex. 4.2. A part

of the bi-partite graph.

Figure 5: Continuity between record

r and cluster C.

We thus create a new cluster C

and set sim(r

, C

) = .89. After

normalization, each edge from r

has a weight of .5.

Now consider r

. For C

, with probability .5 we shall compare

with r

(suppose sim(r

, r

) = .4) and with probability 1-.5=.5

we shall compare r

with r

(suppose sim(r

, r

) = .8). Thus,

sim(r

, C

) = .8∗.5+.4∗.5 = .6 < θ. For C

, we shall compare

only with r

and the similarity is .4 < θ. Because of the low

similarities, we create a new cluster C

and set sim(r

, C

) = .8.

After normalization, the probabilities from r

to C

, C

and C

are

.33, .22 and .45 respectively. 2

Decision making: The second stage makes clustering decisions ac-

cording to the evidence we have collected. We consider only valid

clusterings, where each non-empty cluster contains the record for

which we create the cluster. Let C be a clustering and we denote by

C(r) the cluster to which r belongs in C. We can compute the prob-

ability of C as Π

r∈R

P (r ∈ C(r )), where P (r ∈ C(r)) denotes the

probability that r belongs to C(r). We wish to choose the valid

clustering with the highest probability. Enumerating all clusterings

and computing the probability for each of them can take exponen-

tial time. We next propose an algorithm that takes only polynomial

time and is guaranteed to ﬁnd the optimal solution.

1. Select the edge (n

, n

) with the highest weight.

2. Remove other edges connected to n

3. If n

is the ﬁrst selected edge to n

but C is created for

record r

′

̸= r, select the edge (n

′

, n

) and remove all other

edges connected to n

′

(so the result clustering is valid).

4. Go to Step 1 until all edges are either selected or removed.

We describe LATE algorithm in Appendix D and next state the

optimality of the decision-making stage.

PROPOSITION 4.3. LATE runs in time O(|R|

) and chooses

the clustering with the highest probability among all valid clus-

terings. 2

EXAMPLE 4.4. Continue with Example 4.2. After evidence col-

lection, we created 5 clusters and the weight of each record-cluster

pair is shown in Table 2. Weights of selected edges are in bold.

We ﬁrst select edge (n

, n

) with weight 1. We then choose

, n

) with weight .5 (there is a tie between C

and C

; even

if we choose C

at the beginning, we will change back to C

when

we select edge (n

, n

)), and (n

, n

) with weight .48. As C

is created for record r

, we also select edge (n

, n

) and remove

other edges from r

. We choose edges for the rest of the records

similarly and the ﬁnal result contains 5 clusters: {r

}, {r

, . . . , r

, r

}, {r

, r

}, {r

Note that despite the error made for r

, we are still able to

correctly merge r

and r

with C

because we make the decision

at the end. Note however that we did not merge r

, r

and r

with C

, because of the conservativeness of late binding. 2

Table 2: Example 4.4. Weights on the bipartite graph.

1 .5 .33 .37 .27 .38 .16 .13 .18 .24 .12 .22

0 .5 .22 .4 .25 .4 .16 .12 .17 .27 .1 .24

0 0 .45 .23 .48 .22 .24 .26 .2 .17 .23 .18

0 0 0 0 0 0 .44 .19 .29 .16 .33 .18

0 0 0 0 0 0 0 .3 .16 .16 .22 .18

4.3 Adjusted binding

Neither early binding nor late binding compares a record with

a cluster created later. However, evidence from later records may

ﬁx early errors; in Example 1.1, after observing r

and r

, we

are more conﬁdent that r

− r

refer to the same entity but record

has out-of-date afﬁliation information. Adjusted binding allows

comparison between a record and clusters that are created later.

Adjusted binding can start with the results from either early or

late binding and iteratively adjust the clustering (deterministic ad-

justing), or start with the bi-partite graph created from evidence

collection of late binding, and iteratively adjust the probabilities

(probabilistic adjusting). We next describe the deterministic algo-

rithm and Appendix E describes the probabilistic one.

Algorithm: Deterministic adjusting proceeds in EM-style.

1. Initialization: Set the initial assignment as the result of early

or late binding.

2. Estimation (E-step): Compute the similarity of each record-

cluster pair and normalize the similarities as in late binding.

3. Maximization (M-step): Choose the clustering with the max-

imum probability, as in late binding.

4. Termination: Repeat Steps 2-3 until the results converge or

oscillate.

Similarity computation: The E-step compares a record r with a

cluster C, whose signature may contain records that occur later

than r. Our similarity computation takes advantage of this com-

plete view of value evolution as follows.

First, we consider consistency of the records, including consis-

tency in evolution of the values, in occurrence frequency, and so

on. We describe how we compute value consistency next and dis-

cuss occurrence frequency in Appendix E. Consider the value con-

sistency between r and C = {r

, . . . , r

} (if r ∈ C, we re-

move r from C), denoted by cons(r, C) ∈ [0, 1]. Assume the

records of C are in time order and r

.t < r.t < r

k+1

.t.

In-

serting r into C can affect the consistency of C in two ways: 1)

r may be inconsistent with r

, so the similarity between r and

the sub-cluster C

= {r

, . . . , r

} is low; 2) r

k+1

may be incon-

sistent with r, so the similarity between r

k+1

and the sub-cluster

= {r

, . . . , r

, r} is low. We take the minimum as cons(r, C):

cons(r, C) = min(sim(r, C

), sim(r

k+1

, C

)). (5)

Second, we consider continuity of r and C’s other records in

time, denoted by cont(r, C) ∈ [0, 1]. Consider the ﬁve cases

in Figure 5 and assume the same consistency between r and C.

Record r is farther away in time from C’s records in cases 1 and 5

than in cases 2-4, so it is less likely to belong to C in cases 1 and

5. Let C.early denote the earliest time stamp of records in C and

C.late denote the latest one. We compute the continuity as follows.

cont(r, C) = e

−λy

; (6)

y =

|r.t − C .early| + α

C.late − C.early + α

. (7)

We can extend our techniques to the case when r has the same time stamp

as some record in C.

Here, λ > 0 is a parameter that controls the level of continuity

we require; α > 0 is a small number such that when the denom-

inator (resp., numerator) is 0, the numerator (resp., denominator)

can still affect the result.

Under this deﬁnition, the higher the time

difference between r and the earliest record in C compared with

the length of C, the lower the continuity. In Figure 5, cont(r, C) is

close to 0 in cases 1, 5, close to 1 in cases 2, 3, and close to e

−λ

case 4. Note that we favor time points close to C.early more than

those close to C.late; thus, when we shall merge two clusters that

are close in time, we will gradually move the latest record of the

early cluster into the late cluster, as it has a higher continuity with

the late cluster.

Finally, the similarity of r and C considers both consistency and

continuity, and is computed by

sim(r, C) = cons(r, C) · cont(r, C). (8)

Recall that late binding is conservative for records whose simi-

larity without decay is low and may split them. Adjusted binding

re-examines them and merges them only when they have both high

consistency and high continuity, and thus avoids aggressive merg-

ing of records with long time gap.

We describe the detailed algorithm, ADJUST, in Appendix E.

Our experiments show that ADJUST does not necessarily converge,

but the quality measures of the results at the oscillating rounds are

very similar.

EXAMPLE 4.5. Consider r

and C

in the results of Exam-

ple 4.4. For value consistency, inserting r

into C

results in

, . . . , r

}. Assume sim(r

, {r

}) = sim(r

, {r

, r

}) =

.6. Then, cons(r

, C

) = .6. For continuity, if we set λ = 2 and

α = 1, we obtain l(r

, C

) = e

−2·

1+1

2+1

= 0.26. Thus, the simi-

larity is .26 ∗ .6 = .16. On the other hand, the similarity between

and C

is 1 · e

−2·

= .14. We thus merge r

with C

Similarly, we then merge r

with C

and in turn r

with C

leading to the correct result. Note that we do not merge r

with C

because of the long time gap and thus a low continuity. 2

5. EXPERIMENTAL EVALUATION

This section describes experimental results on two real data sets.

We show that (1) our technique signiﬁcantly improves over tradi-

tional methods on various data sets; (2) the two key components of

our strategy, namely, decay and temporal clustering, are both im-

portant for obtaining good results; (3) our technique is robust with

respect to various data sets and reasonable parameter settings; (4)

our techniques are efﬁcient and scalable (see Appendix F).

5.1 Experiment settings

Data and golden standard: We experimented on two real-world

data sets: a benchmark of European patent data set

and the DBLP

data set. From the patent data we extracted Inventor records with

attributes name and address; the time stamp of each record is

the patent ﬁling date. The benchmark involves 359 inventors from

French patents, where different inventors rarely share similar names;

we thus increased the hardness by deriving a data set with only ﬁrst

name and last name initial for each inventor. We call the original

data set the full set and the derived one the partial set.

In practice, we set α to one time unit, and set λ = − ln cons where cons

is the minimum consistency we require for merging a record with a cluster.

When we shall merge two clusters C

and C

where C

.late = C

.early,

with such λ the latest record r

of C

has continuity cons with C

and

continuity 1 with C

, so can be merged with C

if cons(r

, C

) > cons.

http://www.esf-ape-inv.eu/

Table 3: Statistics of the experimental data sets.

#Records #Entities Years

Patent (full or partial) 1871 359 1978-2003

DBLP-XD 72 8 1991-2010

DBLP-WW 738 18+potpourri 1992-2011

From the DBLP data we considered two subsets: the XD data set

contains 72 records for authors named Xin Dong, Luna Dong, Xin

Luna Dong, or Dong Xin, for which we manually identiﬁed 8 au-

thors; the WW data set contains 738 records for authors named Wei

Wang, for 302 of which DBLP has manually identiﬁed 18 authors

(the rest is left in a potpourri). For each subset we extracted Au-

thor records with attributes name, afﬁliation, and co-author (we

extracted afﬁliation information from the papers) on 2/1/2011; the

time stamp of each record is the paper publication year. Table 3

shows statistics of the data sets.

Implementation: We learned decay from both Patent data sets.

The decay we learned for the address attribute is shown in Fig-

ure 1; for name both agreement and disagreement decay are close

to 0 on both data sets. We observed similar linkage results when

we learned the decays from half of the data and applied them on

the other half. We also applied the decay learned from the partial

data set on linking DBLP records.

We pre-partitioned the records by the initial of the last name, and

implemented the following methods on each partition.

• Traditional methods include PARTITION, CENTER, and MERGE

[10]. They all compute pairwise record similarity but apply

different clustering strategies. Appendix F.1 gives the details.

• Decayed traditional methods include DECAYEDPARTITION,

DECAYEDCENTER, and DECAYEDMERGE, each modifying

the corresponding traditional method by applying decays in

record-similarity computation.

• Temporal clustering methods include NODECAYADJUST, ap-

plying ADJUST without using decay.

• Full methods include EARLY, LATE, and ADJUST, each ap-

plying both decay and the corresponding clustering algorithm.

We present details of similarity computation in Appendix F.1.

By default, when we computed the record similarity without ap-

plying decay, we used weight .5 for both name and address (or

afﬁliation). No matter whether we applied decay, we used weight

.3 for co-author. We applied threshold .8 for deciding if a similar-

ity is high in various contexts. In addition, we set θ

= .8, θ

.6, λ = .5, α = 1 in our methods. We vary these parameters and

present some results in Appendix F.2 to demonstrate robustness.

We implemented the algorithms in Java. We used a WindowsXP

machine with 2.66 GHz Intel CPU and 1 GB of RAM.

Measure: We compared pairwise linking decisions with the golden

standard and measured the quality of the results by precision (P),

recall (R), and F-measure (F). We denote the set of false posi-

tive pairs by F P , the set of false negative pairs by F N, and the

set of true positive pairs by T P . Then, P =

|T P |

|T P |+|F P |

, R =

|T P |

|T P |+|F N|

, F =

2P R

P +R

5.2 Results on Patent data

Figure 6 compares ADJUST with the traditional methods. AD-

JUST obtains slightly lower precision (but still above .9) but much

higher recall (above .8) on both data sets; it improves the F-measure

over traditional methods by 15%-27% on the full data set, and by

11%-22% on the partial data set. The full data set is simpler as very

few inventors share similar full names; as a result, ADJUST obtains

higher precision and recall on this data set. The slightly lower recall

on the partial data set is because early false matching can prevent

Figure 6: Results on the patent data set.

Figure 7: Contribution of dif-

ferent components.

Figure 8: Different clustering

methods on patent partial data.

correct later matching. We next give a detailed comparison on the

partial data set, which is harder. Among the traditional methods,

PARTITION obtains the best results on the Patent data set and we

next show results only on it. Results for the other two traditional

methods follow the same pattern.

Different components: Figure 7 shows the contribution of ap-

plying decay and applying temporal clustering. We observe that

DECAYEDPARTITION and NODECAYADJUST both improve over

PARTITION, and ADJUST obtains the best result. Applying decay

on traditional methods increases the recall a lot, but it is at the price

of a big drop in precision. Temporal clustering, on the other hand,

considers the time information in clustering and in continuity com-

putation, so it increases the recall quite a bit without reducing the

precision much. Finally, we observe that applying agreement de-

cay does not change the results much, since the agreement decays

of both attributes are close to 0.

Different clustering methods: Figure 8 compares early, late, and

adjusted binding. We observe that they all improve the recall over

PARTITION, and reduce the precision only slightly. Between EARLY

and LATE, EARLY has a lower precision as it makes local deci-

sions, while LATE has a lower recall as it is conservative in merging

records with similar names but different addresses (high decayed

similarity but low non-decayed similarity). ADJUST signiﬁcantly

improves the recall over both methods by comparing early records

with clusters formed later, without sacriﬁcing the precision much.

5.3 Results on DBLP data

XD data set: The golden standard contains 8 clusters: the Xin clus-

ter has 36 records in years 2003-2010, including name Dong Xin

and 2 afﬁliations ( UIUC, MSR); the Dong cluster has 29 records

in years 2003-2010, including 3 names (Xin Dong, Luna Dong,

Xin Luna Dong) and 3 afﬁliations (UW, Google, AT&T); the rest of

them each has 1 or 2 records, including 1 name Xin Dong and 1 af-

ﬁliation. ADJUST results in 9 clusters and makes only one mistake:

it splits the Xin records in 2009 with afﬁliation UIUC from the rest

of Xin records. This is because Xin moved to MSR in 2008, so

ADJUST considers the two afﬁliations as conﬂicting. We highlight

that (1) ADJUST ﬁxes an error in DBLP: it (correctly) separates the

records with afﬁliation UNL from the Dong cluster; and (2) AD-

JUST is able to distinguish the various people, even though their

names are exactly the same or very similar (the similarity between

Xin Dong and Dong Xin is set to .8).

Figure 9: Results on XD set. Figure 10: Results on WW set.

Figure 9 shows the results of various methods on this data set.

ADJUST improves over traditional methods by 37%-43%. Other

observations are similar to those on the Patent data set, except

that applying decay to some traditional methods (PARTITION and

MERGE) can considerably reduce the precision and result in a low

F-measure, as this data set is small and extremely difﬁcult.

WW data set: We ﬁrst report results on the 302 records for which

DBLP has identiﬁed 18 clusters, among which (1) 3 involve 2 af-

ﬁliations, 2 involve 3 afﬁliations, and 1 involves 4 afﬁliations, so

in total 10 afﬁliation transitions; (2) two authors share the same

afﬁliation Fudan; (3) the largest cluster contains 92 records, the

smallest one contains 1 record, and 6 clusters contain more than 10

records. ADJUST obtains both high precision (.98) and high recall

(.97). We highlight that (1) ADJUST is able to distinguish the dif-

ferent authors in most cases; (2) among the 10 transitions, ADJUST

identiﬁes 5 of them. ADJUST makes four types of mistakes: (1)

it merges the two Fudan clusters, as one of them contains a single

record with year in the middle of the time period of the other clus-

ter; (2) it merges the big Fudan cluster with another record, whose

afﬁliation appears different from the rest in its own cluster, and time

stamp is one year before the earliest record in the big Fudan cluster,

so makes a hard case for the adjusting step; (3) it does not identify

one of the transitions for the same reason as in the XD data set; and

(4) it does not identify the other 4 transitions because there are very

few records for one of the afﬁliations and so not enough evidence

for merging. Finally, Figure 10 shows that ADJUST is signiﬁcantly

better than all other methods.

In the complete DBLP WW data set, 124 other WW records are

merged with these 18 clusters and we manually veriﬁed the cor-

rectness. Among them, 63 are correctly merged, ﬁxing errors from

DBLP; 26 are wrongly merged but can be correctly separated if

we have department information for afﬁliation; and 35 are wrongly

merged mainly because of high similarity of afﬁliations (e.g., many

records with “technology” in afﬁliation are wrongly merged be-

cause the IDF of “technology” is not so low on this small data set).

If we count these additional records, we are still able to obtain a

precision of .94 and a recall of .94.

6. RELATED WORK AND CONCLUSIONS

Related work: We have compared our techniques with traditional

record-linkage techniques in Section 2 and in experiments; Ap-

pendix G gives more details. We next compare with related works

regarding temporal information.

A suite of temporal data models [15] and temporal knowledge

discovery paradigms [16] have been proposed in the past; however,

we are not aware of any work focusing on linking temporal records.

Behavior based linkage [20] leverages periodical behavior patterns

of each entity in linking pairs of records and learns such patterns

from transaction logs. Their behavior pattern is different from the

decay in our techniques in that decay learns the probability of value

changes over time for all entities. In addition, we do not require a

ﬁxed and repeated value change pattern of particular entities and

apply decay in a global fashion (rather than just between pairs of

records) such that we can handle value evolution over time.

The notion of decay has been proposed in the context of data

warehouses and streaming data recently [3, 4]. They use decay to

reduce the effect of older tuples on data analysis. Among them,

backward decay [3] measures time difference backward from the

latest time and forward decay [4] measures time difference forward

from a ﬁxed landmark. Their decay function is either binary or a

ﬁxed (exponential or polynomial) function. We differ in that 1) we

consider time difference between two records rather than from a

ﬁxed point, and 2) we learn the decay curves purely from the data

rather than using a ﬁxed function.

Conclusions and future work: This paper studied linking records

with temporal information. We apply decay in record-similarity

computation and consider the time order of records in clustering;

thus, our linkage technique is tolerant of entity evolution over time

and can glean evidence globally for decision making. Future work

includes combining temporal information with other dimensions of

information such as spatial information to achieve better results,

and considering erroneous data especially erroneous time stamps.

7. REFERENCES

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy

duplicates in data warehouses. In VLDB, pages 586–597, 2002.

[2] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Exploiting

relationships for object consolidation. In IQIS, pages 47–58, 2005.

[3] E. Cohen and M. Strauss. Maintaining time-decaying stream

aggregates. In PODS, pages 223–233, 2003.

[4] G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward

decay: A practical time decay model for streaming systems. ICDE,

pages 138–149, 2009.

[5] D. Dey. Entity matching in heterogeneous databases: A logistic

regression approach. Decis. Support Syst., 44:740–747, 2008.

[6] P. Domingos. Multi-relational record linkage. In In Proceedings of

the KDD Workshop on Multi-Relational Data Mining, 2004.

[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate

record detection: A survey. IEEE Trans. Knowl. Data Eng.,

19(1):1–16, 2007.

[8] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of

the American Statistical Association, 64(328):1183–1210, 1969.

[9] G. Flake, R. Tarjan, and K. Tsioutsiouliklis. Graph clustering and

minimum cut trees. Internet Mathematics, 1:385–408, 2004.

[10] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller. Framework

for evaluating clustering algorithms in duplicate detection. PVLDB,

pages 1282–1293, 2009.

[11] M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data

cleansing and the merge/purge problem. Data Mining and

Knowledge Discovery, 2:9–37, 1998.

[12] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage:

similarity measures and algorithms. In SIGMOD, 2006.

[13] A. K. McCallum, K. Nigam, and L. H. Ungar. Efﬁcient clustering of

high-dimensional data sets with application to reference matching. In

Proc. of SIGKDD, pages 169–178, 2000.

[14] B. W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In

ICDE, pages 496–505, 2007.

[15] G. Ozsoyoglu and R. Snodgrass. Temporal and real-time databases: a

survey. TKDE, 7:513–532, 1995.

[16] J. F. Roddick and M. Spiliopoulou. A survey of temporal knowledge

discovery paradigms and methods. TKDE, 14:750–767, 2002.

[17] G. Weikum, N. Ntarmos, M. Spaniol, P. Triantaﬁllou, A. Bencz

ur,

S. Kirkpatrick, P. Rigaux, and M. Williamson. Longitudinal analytics

on web archive data: It’s about time! In CIDR, pages 199–202, 2011.

[18] D. T. Wijaya, S. Bressan, J. Joxan, and D. T. Wijaya. Ricochet: A

family of unconstrained algorithms for graph clustering. In DASFAA,

pages 153–167, 2009.

[19] W. E. Winkler. Methods for record linkage and bayesian networks.

Technical report, U.S. Bureau of the Census, 2002.

[20] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and

A. Qi. Behavior based record linkage. PVLDB, 3:439–448, 2010.

APPENDIX

A. DETAILS OF LEARNING DECAY

We describe Algorithms LEARNDISAGREEDECAY and LEAR-

NAGREEDECAY and show the monotonicity of their results.

Algorithm 1 LEARNDISAGREEDECAY(C, A)

Input: C Clusters of records in the sample data set, where records

in the same cluster refer to the same entity and records in dif-

ferent clusters refer to different entities.

A Attribute for learning decay.

Output: Disagreement decay d

̸=

(∆t, A).

= ϕ;

2: for each C ∈ C do

3: sort records in C in increasing time order to r

, . . . , r

|C|

;

4: // Find life spans

5: start = 1;

6: while start ≤ |C| do

7: end = start + 1;

8: while r

start

.A = r

end

.A and end ≤ |C| do

9: end + +;

10: end while

11: if end > |C| then

12: insert r

|C|

.t − r

start

.t + δ into

; // partial life span

13: else

14: insert r

end

.t − r

start

.t into

; // full life span

15: end if

16: start = end;

17: end while

18: end for

19: // Learn decay

20: for ∆t = 1, . . . , max

l∈

∪

{l} do

21: d

̸=

(A, ∆t) =

|{l∈

|l≤∆t}|

|+|{l∈

|l≥∆t}|

22: end for

PROOF MONOTONICITY. In LEARNDISAGREEDECAY, as ∆t

increases, |{l ∈

|l ≤ ∆t}| increases while |{l ∈

|l ≥ ∆t}|

decreases; thus,

|{l∈

|l≤∆t}|

|+|{l∈

|l≥∆t}|

increases.

In LEARNAGREEDECAY, as ∆t increases, |{l ∈

L|l ≤ ∆t}|

increases; thus,

|{l∈

L|l≤∆t}|

increases.

We next describe two alternatives of decay learning, namely,

single-count decay and probabilistic decay for disagreement decay

(similar for agreement decay). Experimental results show that they

learn similar (but more or less smooth) decay curves as LEARNDIS-

AGREEDECAY and LEARNAGREEDECAY .

Single-count decay considers an entity at most once and learns the

disagreement decay d

̸=

(A, ∆t) as the fraction of entities that have

changed their A-value within time ∆t. In particular, if an entity E

has full life spans, we choose the shortest one and insert its length

l to

, indicating that E has changed its A-value in time l; other-

wise, we consider E’s partial life span and insert its length l to

indicating that E has not changed its A-value in time l, but we do

not know if it will change its A-value in any longer time. We learn

disagreement decay using Eq.(1).

Probabilistic decay removes the timeliness assumption; that is, each

value change is reﬂected by a record at the change point. In partic-

ular, consider a full life span [t, t

) and we assume the last time

we see the same value is t

end

, t ≤ t

end

≤ t

. We assume the

value can change at any time from t

end

to t

with equal prob-

ability

−t

end

. Thus, for each t

′

∈ [t

end

, t

], we insert

length t

′

−t into

and annotate it with probability

−t

end

If we denote by p(l) the probability for a particular length l in

we compute the disagreement decay as

̸=

(A, ∆t) =



l∈

,l≤∆t

p(l)



l∈

p(l) + |{l ∈

|l ≥ ∆t}|

. (9)

Algorithm 2 LEARN AGREEDECAY(C, A)

Input: C Clusters of records in the sample data set, where records

in the same cluster refer to the same entity and records in dif-

ferent clusters refer to different entities.

A An attribute for decay learning.

Output: Agreement decay d

(∆t, A).

1: //Find life spans

2: for each C ∈ C do

3: sort records in C in increasing time order to r

, . . . , r

|C|

;

4: start = 1;

5: while start ≤ |C| do

6: end = start + 1;

7: while r

start

.A = r

end

.A and end ≤ |C| do

8: end + +;

9: end while

10: if end > |C| then

11: r

start

= r

|C|−1

.t + δ; // partial life span

12: else

13: r

start

= r

end

.t; // full life span

14: end if

15: star t

end

;

16: end while

17: end for

18: //Learn agreement decay

19:

L = ϕ

20: for each C, C

′

∈ C do

21: same = false;

22: for each r ∈ C s.t. r.t

̸= null do

23: for each r

′

∈ C

′

s.t. r.t

̸= null do

24: if r.A = r

′

.A then

25: same = true;

26: if r.t ≤ r

′

.t then

27: insert max{0, r

′

.t − r.t

+ 1} into

28: else

29: insert max{0, r.t − r

′

+ 1} into

30: end if

31: end if

32: end for

33: end for

34: if !same then

35: insert ∞ into

36: end if

37: end for

38: for ∆t = 1, . . . , max

l∈

{l} do

39: d

(A, ∆t) =

|{l∈

L|l≤∆t}|

40: end for

B. DECAY FOR MULTI-VALUED ATTRIBUTES

In this section we consider multi-valued attributes such as co-

authors. We start with describing record-similarity computation

with such attributes, and then describe how we learn and apply de-

cay for such attributes.

Multi-valued attributes differ from single-valued attributes in that

the same entity can have multiple values for such attributes even

at the same time; therefore, (1) having different values for such at-

tributes does not indicate record un-match; and (2) sharing the same

value for such attributes is additional evidence for record match.

Consider a multi-valued attribute A. Consider records r and r

′

;

r.A and r

′

.A each is a set of values. Then, the similarity of r.A

and r

′

.A, denoted by sim(r.A, r

′

.A), is computed by a variant of

Jaccard distance between the two sets.

sim(r.A, r

′

.A) =



v∈r.A,v

′

∈r

′

.A,sim(v,v

′

)>θ

sim(v, v

′

)

min{|r.A|, |r

′

.A|}

. (10)

If the relationship between the entities and the A-values are one-

to-many, we add the attribute similarity (with a certain weight) to

the record similarity between r and r

′

. In particular, let sim

′

(r, r

′

)

be the similarity between r and r

′

when we consider all attributes

and w

be the weight for attribute A, then,

sim

′

(r, r

′

) = min{1, sim(r, r

′



multi−valued attr A

sim(r.A, r

′

.A)}.

(11)

On the other hand, if the relationship between the entities and the

A-values are many-to-many, we apply Eq.(11) only when sim(r, r

′

)

> θ

, where θ

is a threshold for high similarity on values of single-

valued attributes.

Now consider decay on such multi-valued attributes. First, we do

not learn disagreement decay on multi-valued attributes and learn

agreement decay in the same way as for single-valued attributes.

Second, we apply agreement decay when we compute the similar-

ity between values of a multi-valued attribute, so if the time gap

between two similar values is large, we discount the similarity. In

particular, we revise Eq.(10) as follows.

sim(r.A, r

′

.A)



v∈r.A,v

′

∈r

′

.A,sim(v,v

′

)>θ

(1 − d

(A, |r.t − r

′

.t|))sim(v, v

′

)

min{|r.A|, |r

′

.A|}

.(12)

EXAMPLE B.1. Consider records r

and r

and the multi-valued

attribute co-authors (many-to-many relationship) in Example 1.1.

Let θ

= .8 and w

= .3. Record r

and r

share one co-author

with string similarity 1. Suppose d

(co, ∆t = 5) = .05. Then,

sim(r

.co, r

.co) =

(1−.05)∗1

min{2,2}

= .475. Recall from Example 3.8

that sim(r

, r

) = .81 > θ

; therefore, the overall similarity is

sim

′

, r

) = min{1, .81 + .475 ∗ .3} = .95. 2

C. DETAILS OF EARLY BINDING

We describe algorithm EARLY in Algorithm 3.

D. DETAILS OF LATE BINDING

We describe algorithm LATE in Algorithm 4.

In Line 24 of the algorithm, we remove edges with low similarity

scores for each record before we normalize the edge weights. We

next describe several edge-deletion strategies. Our experimental

results (Appendix F.2) show that they obtain similar results, while

they all improve over not deleting edges in both efﬁciency and ac-

curacy of the results.

• Thresholding removes all edges whose associated similarity

scores are less or equal to a threshold θ .

• Top-K keeps top-k edges whose associated similarity scores

are above threshold θ.

Algorithm 3 EARLY(R )

Input: R records in increasing time order

Output: C clustering of records in R

1: for each record r ∈ R do

2: for each C ∈ C do

3: compute record-cluster similarity sim(r, C);

4: end for

5: if max

C∈C

sim(r, C) ≥ θ then

6: C = Argmax

C∈C

sim(r, C);

7: insert r into C;

8: update signature of C;

9: else

10: insert cluster {r} into C;

11: end if

12: end for

13: return C;

• Gap orders the edges in descending order of the associated

similarity scores to e

, e

, . . . , e

, and selects the edges in de-

creasing order until reaching an edge e

where (1) the scores

for e

and e

i+1

have a gap larger than a given threshold θ

gap

or (2) the score for e

i+1

is less than threshold θ .

PROOF PROPOSITION 4.3. Our evidence collection step guar-

antees that if C

is created for record r, then the edge (N

, N

)

has the highest weight among edges from N

. Thus, the deci-

sion making step chooses the edge with the highest weight for each

record and obtains the optimal solution.

E. DETAILS OF ADJUST BINDING

We describe Algorithm ADJUST in Algorithm 5. Next, we de-

scribe frequency consistency and probabilistic adjusting.

Frequency consistency: Consider a cluster C. The occurrence

frequency of C, denoted by freq(C), is computed by

freq(C) =

late

− C

early

|C|

. (13)

Let C

′

be the cluster after inserting record r into C. The fre-

quency consistency between r and C, denoted by cons

(r, C) is

computed by

cons

(r, C) = 1 −

|freq(C) − f req(C

′

max{freq(C), freq(C

′

)}

. (14)

In practise, we can also apply a window of size n ≤ |C|, so that

occurrence frequency is only considered within n records.

Probabilistic adjusting: Probabilistic adjusted binding proceeds

in three steps.

• The algorithm starts with the bi-partite graph created from

evidence collection in late binding.

• It iteratively adjusts the weight of each edge and keeps all

edges, until the weights converge or oscillate. In each iter-

ation, it (1) re-computes the similarity between each record

and each cluster, (2) normalizes the weights of edges from

the same record node, and (3) re-computes the signature of

each cluster.

• It selects the possible world with the highest probability as in

late binding.

In the second step, when we compute the similarity between a

record and a cluster, we compute consistency cons(r, C) and con-

tinuity cont (r, C) similarly as in deterministic adjusted binding,

Algorithm 4 LATE(R)

Input: R records in increasing time order

Output: C clustering of records in R

1: Initialize a bi-partite graph (N

, N

, E) where N

= N

E = ∅;

2: //Evidence collection

3: for each record r ∈ R do

4: insert node n

into N

;

5: for each n

∈ N

6: compute decayed record-cluster similarity sim(r, C);

7: end for

8: if max

∈N

sim(r, C) ≤ θ then

9: insert node n

into N

;

10: insert edge (n

, n

) with weight θ into E;

11: else

12: newCluster = tr ue ;

13: for each n

∈ N

, where sim(r, C) > θ do

14: compute no decayed similarity sim

′

(r, C);

15: if sim

′

(r, C) > θ then

16: newCluster = false; break;

17: end if

18: end for

19: if newCluster then

20: insert node n

into N

;

21: insert edge (n

, n

) with weight

max

∈N

,sim

′

(r,C)>θ

(sim(r, C)) into E;

22: end if

23: end if

24: delete edges with low weights;

25: normalize weights for all edges from n

;

26: end for

27: // Decision making

28: while |E| > |N

| do

29: select edge (n

, n

) with maximal edge weight;

30: remove edges (n

, n

′

) for all C

′

̸= C;

31: if cluster C is created for record r

′

̸= r then

32: select edge (n

′

, n

);

33: remove edges (n

′

, n

′

) for all C

′

̸= C;

34: end if

35: end while

36: return C;

except that we need to consider the probability of a record belong-

ing to a cluster. For value consistency, we consider probability in

the same way as in late binding. For continuity, we compute the

probabilistic earliest and latest time stamps of a cluster as follows.

Suppose cluster C is connected to m records r

, . . . , r

where

.t ≤ r

.t ≤ · · · ≤ r

.t, each with probability p

, i ∈ [1, m].

We compute C

early

and C

late

as follows.

early



i=1

.t · (p

i−1

k=1

(1 − p

)); (15)

late



i=1

.t · (p

k=i+1

(1 − p

)). (16)

F. DETAILS OF EXPERIMENTS

F.1 Details of implementations

Traditional methods: We implemented three traditional methods:

PARTITION, CENTER, and MERGE [10]. All methods compute the

similarity between a pair of records as the weighted sum of the

attribute similarities. They differ in clustering methods:

Algorithm 5 ADJUST(R, C)

Input: R records in increasing time order.

C pre-clustering of records in R.

Output: C new clustering of records in R.

1: repeat

2: //E-step

3: for each record r ∈ R do

4: for each cluster C ∈ C do

5: compute sim(r, C) = cons(r, C) ∗ cont(r, C);

6: end for

7: end for

8: //M-step

9: Choose the possible world with the highest probability as in

Ln.28-35 of LATE;

10: until C is not changed

11: return C;

• PARTITION starts with single-record clusters and merges two

clusters if they contain similar records (i.e., applying the tran-

sitive rule).

• CENTER scans the records, merging a record r with a cluster

if it is similar to the center of the cluster, and creates a new

cluster with r as its center if r is not similar to any existing

center.

• MERGE starts from the result of CENTER and merges two

clusters if a record from one cluster is similar to the center of

the other cluster.

Since CENTER and MERGE are order sensitive, we run each of

them 5 times and report the best results.

Similarity computation: We compute similarity between a pair of

attribute values as follows.

• name: We used Levenshtein metric except that if the Leven-

shtein similarity is above .5 and the Soundex similarity is 1,

we set similarity 1.

• address: We used TF/IDF metric, where token similarity is

measured by Jaro-Winker distance with threshold .9. If the

TF/IDF similarity is above .5 and the Soundex similarity is 1,

we set similarity 1.

• co-author: We used a variant of Jaccard metric (see Eq.(12)),

where name similarity is measured by Levenshtein distance

and θ

= .8. We apply this similarity only when the record

similarity w.r.t single-valued attributes is above θ

= .5.

F.2 Additional experimental results

We compared several implementation details and reported efﬁ-

ciency on the partial Patent data. We used the default setting unless

mentioned otherwise and reported results for ADJUST.

Decay learning methods: We learned the decay in three ways:

DETERMINISTIC learns the decay as described in Section 3; SIN-

GLECOUNT learns single-count decay; and PROBABILISTIC learns

the probabilistic decay (Appendix A). We observed that (1) these

three methods learn similar curves, and (2) applying the three dif-

ferent curves lead to very similar results (Figure 11).

Edge deletion strategies: We tried various edge-deletion strate-

gies for late binding: NODELETE keeps all edges; THRESHOLD-

ING keeps edges with similarity over .8; TOPK keeps only the top-k

edges with similarity over .8; GAP keeps the top edges with weights

above .8 and gap within .1 (Appendix D). Figure 12 shows that (1)

NODELETE keeps all edges, which often have low weights after

Figure 11: Comparing different decay

learning methods.

Figure 12: Comparing different edge

deletion strategies.

Figure 13: Comparing different ad-

justed binding methods.

Figure 14: Results of varying

attribute weights.

Figure 15: Scalability of AD-

JUST.

normalization, and can thus split many clusters and obtain a very

low recall; and (2) different edge-deletion strategies lead to very

similar F-measures and improve both efﬁciency and result quality

over NODELETE.

Cluster adjusting strategies: We implemented three versions of

adjusted binding: LATEADJUST applies deterministic binding on

the results of late binding; EARLYADJUST applies deterministic

binding on the results of early binding; and PROBADJUST applies

probabilistic binding on the bi-partite graph created in late binding.

Figure 13 shows their results. First, we observe that PROBADJUST

obtains similar results to LAT EADJUST while the running time is

50% longer (not shown in the ﬁgure); showing that it does not have

obvious advantage. Second, we observe that EARLYADJUST and

LATEADJUST obtain similar results on the patent data set; how-

ever, LATEADJUST improves over EARLYADJUST by 26% on the

DBLP WW data set.

Different parameters: We ran two experiments to test robust-

ness against parameter settings. We ﬁrst changed thresholds θ

and θ

for string similarity and observed very similar results (vary-

ing within .4%) when θ

∈ [.7, .9] and θ

∈ [.5, .7]. Second,

we applied different attribute weights (w

name

∈ [0.4, 1], w

aﬀ

1 − w

name

) to compute no-decayed similarity. Figure 14 shows

that (1) ADJUST is robust against attribute weights; and (2) AD-

JUST always outperforms PARTITION.

Scalability: To test scalability of our techniques, we randomly di-

vided the partial patent data set into 10 subsets with similar sizes

without splitting entities. We started with one subset and gradually

added more, and reported the execution time in Figure 15. We

observe that (1) ADJUST terminated in 10.3 minutes on all 1871

records and is reasonably fast given that this is an off-line process;

and (2) the execution time grows nearly linearly in the size of the

data (though can be quadratic in the size of a partition after pre-

processing), showing scalability of our techniques.

G. COMPARING WITH EXISTING RECORD

LINKAGE TECHNIQUES

Record linkage has been extensively studied in recent years [7,

12]. To the best of our knowledge, existing techniques do not con-

sider evolution of entities over time and treat the data as snapshot

data. Our techniques differ from them in two aspects: the way we

compute record similarity and the way we cluster records.

For record-similarity computation, existing works can be divided

into three categories: classiﬁcation-based approaches [8] classify a

pair of records as match, unmatch and maybe; distance-based ap-

proaches [5] apply distance metrics to compute similarity of each

attribute, and take the weighted sum as the record similarity; rule-

based approaches [11] apply domain knowledge to match records.

Our work falls in the distance-based category; however, we apply

decay such that the weights we use for combining attribute simi-

larities are functions of the time difference between the records, so

we are tolerant of value evolution over time.

Many linkage techniques, especially classiﬁcation-based approaches,

require learning parameters or classiﬁcation models from training

data [6, 19, 8]. Their learning techniques all assume that record val-

ues do not change over time and value differences are due to differ-

ent representations of the same value (e.g., “Google” and “Google,

Inc.”). We learn parameters from training data as well, but we are

different in that we take into account possible value change over

time; the decay curves we learn can be considered as consisting of

parameters learned for different time gaps.

Relational entity resolution techniques take entity relationships

(e.g., co-author, co-citation) into account when computing record

similarity [1, 2, 14]. Our techniques also consider such multi-

valued attributes, but we apply agreement decay and give less re-

ward to similar values of such attributes in case of a big time gap.

For record clustering, there exists a wealth of literature on clus-

tering algorithms for record linkage [10]. Among them, uncon-

strained and unsupervised algorithms that result in disjoint clus-

ters are closest to ours. These algorithms may apply the transitive

rule (e.g., Partition algorithm [11]), may iteratively specify seeds

of clusters and assign vertexes to the seeds (e.g., Ricochet algo-

rithm [18]), and may perform clustering by solving an optimiza-

tion problem (e.g., Cut clustering [9]). These methods typically

consider the records in decreasing order of record similarity while

we consider the records in time order and collect evidence glob-

ally. Thus, our techniques do not necessarily merge records with

high value similarity if the resulting entity shows erratic changes in

a time period, and do not necessarily split records with low value

similarity if value evolution over time is likely.