Figure 9: Results on the XD data set. Figure 10: Results on the WW data set.
Figure 9 shows the results of various methods on this data set.
ADJUST improves over traditional methods by 37%–43%. Other
observations are similar to those on the Patent data set, except
that applying decay to some traditional methods (PARTITION and
MERGE) can considerably reduce the precision and result in a low
F-measure, as this data set is small and extremely difficult.
WW data set: We first report results on the 302 records for which
DBLP has identified 18 clusters, among which (1) 3 involve 2 af-
filiations, 2 involve 3 affiliations, and 1 involves 4 affiliations, so
in total 10 affiliation transitions; (2) two authors share the same
affiliation Fudan; (3) the largest cluster contains 92 records, the
smallest one contains 1 record, and 6 clusters contain more than 10
records. ADJUST obtains both high precision (.98) and high recall
(.97). We highlight that (1) ADJUST is able to distinguish the dif-
ferent authors in most cases; (2) among the 10 transitions, ADJUST
identifies 5 of them. ADJUST makes four types of mistakes: (1)
it merges the two Fudan clusters, as one of them contains a single
record with year in the middle of the time period of the other clus-
ter; (2) it merges the big Fudan cluster with another record, whose
affiliation appears different from the rest of its own cluster and whose
time stamp is one year before the earliest record in the big Fudan cluster,
making it a hard case for the adjusting step; (3) it does not identify
one of the transitions for the same reason as in the XD data set; and
(4) it does not identify the other 4 transitions because there are very
few records for one of the affiliations and so not enough evidence
for merging. Finally, Figure 10 shows that ADJUST is significantly
better than all other methods.
In the complete DBLP WW data set, 124 other WW records are
merged with these 18 clusters; we manually verified the correctness
of these merges. Among them, 63 are correctly merged, fixing errors
from DBLP; 26 are wrongly merged but could be correctly separated
if department information were available for the affiliations; and 35 are wrongly
merged mainly because of high similarity of affiliations (e.g., many
records with “technology” in affiliation are wrongly merged be-
cause the IDF of “technology” is not so low on this small data set).
If we count these additional records, we are still able to obtain a
precision of .94 and a recall of .94.
6. RELATED WORK AND CONCLUSIONS
Related work: We have compared our techniques with traditional
record-linkage techniques in Section 2 and in our experiments;
Appendix G gives more details. We next compare with related work
regarding temporal information.
A suite of temporal data models [15] and temporal knowledge
discovery paradigms [16] have been proposed in the past; however,
we are not aware of any work focusing on linking temporal records.
Behavior-based linkage [20] leverages the periodic behavior patterns
of each entity when linking pairs of records, learning such patterns
from transaction logs. Their behavior patterns differ from the
decay in our techniques in that decay captures the probability of value
changes over time for all entities. In addition, we do not require a
fixed, repeated value-change pattern for particular entities, and we
apply decay in a global fashion (rather than just between pairs of
records), so we can handle value evolution over time.
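The idea of learning decay from data, rather than fixing it in advance, can be sketched concretely. The following is a minimal illustration, not the paper's actual estimator: given records already grouped by their true entity (the toy `records` map with year/affiliation observations is hypothetical), the disagreement decay at a time gap can be estimated as the fraction of same-entity record pairs that far apart whose attribute value differs.

```python
from itertools import combinations
from collections import defaultdict

def learn_decay(records):
    """Estimate disagreement decay d(gap): the probability that an
    entity's attribute value has changed after a given time gap.
    `records` maps entity id -> list of (year, value) observations.
    (Field names and structure are illustrative, not from the paper.)"""
    changed = defaultdict(int)   # pairs with different values, per gap
    total = defaultdict(int)     # all same-entity pairs, per gap
    for obs in records.values():
        for (y1, v1), (y2, v2) in combinations(sorted(obs), 2):
            gap = y2 - y1
            total[gap] += 1
            if v1 != v2:
                changed[gap] += 1
    return {gap: changed[gap] / total[gap] for gap in total}

# Toy example: one author whose affiliation changes after 2 years.
recs = {
    "e1": [(2000, "UW"), (2001, "UW"), (2002, "AT&T")],
    "e2": [(2005, "Fudan"), (2006, "Fudan")],
}
decay = learn_decay(recs)  # larger gaps yield higher change probability
```

Because the curve is estimated per gap from all entities jointly, it reflects how likely values are to evolve over time rather than assuming any single entity's pattern.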
The notion of decay has recently been proposed in the context of data
warehouses and streaming data [3, 4], where decay is used to
reduce the effect of older tuples on data analysis. Backward
decay [3] measures the time difference backward from the latest time,
while forward decay [4] measures it forward from a fixed landmark;
in both cases the decay function is either binary or a
fixed (exponential or polynomial) function. We differ in that (1) we
consider the time difference between two records rather than from a
fixed point, and (2) we learn the decay curves purely from the data
rather than using a fixed function.
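To make the contrast concrete, here is a minimal sketch (the exponential rate `LAMBDA` is illustrative, not taken from either paper): backward and forward decay weight each tuple by its distance from a single reference time, whereas our decay takes the time difference between two records as its input, and the curve itself is learned from data.

```python
import math

LAMBDA = 0.1  # illustrative decay rate, not from [3] or [4]

def backward_weight(t, t_latest):
    # Backward decay [3]: age measured back from the latest time seen.
    return math.exp(-LAMBDA * (t_latest - t))

def forward_weight(t, landmark):
    # Forward decay [4]: weight grows with distance past a fixed
    # landmark L (shown unnormalized; [4] divides by g(now - L)).
    return math.exp(LAMBDA * (t - landmark))

def pairwise_gap(t1, t2):
    # Our decay is instead a function of the time difference between
    # two records; the function itself is learned, not fixed.
    return abs(t1 - t2)
```

The key structural difference is the argument: a fixed reference point in [3, 4] versus a record pair here.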
Conclusions and future work: This paper studied linking records
with temporal information. We apply decay in record-similarity
computation and consider the time order of records in clustering;
thus, our linkage technique is tolerant of entity evolution over time
and can glean evidence globally for decision making. Future work
includes combining temporal information with other dimensions of
information, such as spatial information, to achieve better results,
and handling erroneous data, especially erroneous time stamps.
7. REFERENCES
[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy
duplicates in data warehouses. In VLDB, pages 586–597, 2002.
[2] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Exploiting
relationships for object consolidation. In IQIS, pages 47–58, 2005.
[3] E. Cohen and M. Strauss. Maintaining time-decaying stream
aggregates. In PODS, pages 223–233, 2003.
[4] G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward
decay: A practical time decay model for streaming systems. In ICDE,
pages 138–149, 2009.
[5] D. Dey. Entity matching in heterogeneous databases: A logistic
regression approach. Decis. Support Syst., 44:740–747, 2008.
[6] P. Domingos. Multi-relational record linkage. In Proceedings of
the KDD Workshop on Multi-Relational Data Mining, 2004.
[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate
record detection: A survey. IEEE Trans. Knowl. Data Eng.,
19(1):1–16, 2007.
[8] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of
the American Statistical Association, 64(328):1183–1210, 1969.
[9] G. Flake, R. Tarjan, and K. Tsioutsiouliklis. Graph clustering and
minimum cut trees. Internet Mathematics, 1:385–408, 2004.
[10] O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller. Framework
for evaluating clustering algorithms in duplicate detection. PVLDB,
pages 1282–1293, 2009.
[11] M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data
cleansing and the merge/purge problem. Data Mining and
Knowledge Discovery, 2:9–37, 1998.
[12] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage:
similarity measures and algorithms. In SIGMOD, 2006.
[13] A. K. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of
high-dimensional data sets with application to reference matching. In
Proc. of SIGKDD, pages 169–178, 2000.
[14] B. W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In
ICDE, pages 496–505, 2007.
[15] G. Ozsoyoglu and R. Snodgrass. Temporal and real-time databases: a
survey. TKDE, 7:513–532, 1995.
[16] J. F. Roddick and M. Spiliopoulou. A survey of temporal knowledge
discovery paradigms and methods. TKDE, 14:750–767, 2002.
[17] G. Weikum, N. Ntarmos, M. Spaniol, P. Triantafillou, A. Benczúr,
S. Kirkpatrick, P. Rigaux, and M. Williamson. Longitudinal analytics
on web archive data: It’s about time! In CIDR, pages 199–202, 2011.
[18] D. T. Wijaya, S. Bressan, and J. Joxan. Ricochet: A
family of unconstrained algorithms for graph clustering. In DASFAA,
pages 153–167, 2009.
[19] W. E. Winkler. Methods for record linkage and Bayesian networks.
Technical report, U.S. Bureau of the Census, 2002.
[20] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and
A. Qi. Behavior based record linkage. PVLDB, 3:439–448, 2010.