Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets

Overview of attention for article published in BMC Medical Informatics and Decision Making, June 2017

Citations
24 Dimensions

Readers on
39 Mendeley
Title
Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets
Published in
BMC Medical Informatics and Decision Making, June 2017
DOI 10.1186/s12911-017-0478-5
Pubmed ID
Authors

Adrian P. Brown, Christian Borgs, Sean M. Randall, Rainer Schnell

Abstract

Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data is unknown up to now.

Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated.

Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time, but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine-tuning of parameters.

We argue that increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
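
To make the terms in the abstract concrete, the following is a minimal Python sketch of the general idea behind a CLK: the character bigrams of several identifier fields are hashed with a secret key into one shared Bloom filter, two filters are compared with the Dice coefficient, and linkage quality is scored with precision, recall and F-measure. It is an illustration under stated assumptions, not the authors' implementation. The function names, the filter length of 1000 bits, the 10 hash functions per bigram, the double-hashing scheme with HMAC-SHA1 and HMAC-SHA256, and the padding character are all assumptions chosen for illustration, and the multibit tree search used in the paper is not shown.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into its character bigrams."""
    padded = f"_{value.strip().lower()}_"  # padding character is an assumption
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk(fields: list[str], key: bytes, length: int = 1000, k: int = 10) -> set[int]:
    """Return the positions of the bits set to 1 in a single composite filter.

    All fields are hashed into the same filter, which is what makes the
    result a composite Bloom filter. `length` and `k` are illustrative values.
    """
    bits: set[int] = set()
    for field in fields:
        for gram in bigrams(field):
            data = gram.encode("utf-8")
            # Keyed double hashing: position_i = (h1 + i * h2) mod length.
            h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key, data, hashlib.sha256).digest(), "big")
            for i in range(k):
                bits.add((h1 + i * h2) % length)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two filters given as sets of set-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def precision_recall_f(predicted: set, truth: set) -> tuple[float, float, float]:
    """Score predicted record pairs against gold-standard pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    secret = b"shared-secret"  # hypothetical key agreed between data custodians
    a = clk(["smith", "john"], secret)
    b = clk(["smyth", "john"], secret)
    c = clk(["brown", "adrian"], secret)
    print("similar records:  ", round(dice(a, b), 3))
    print("different records:", round(dice(a, c), 3))
```

In this sketch a pair of records would be declared a match when its Dice coefficient exceeds a chosen threshold. That threshold, together with the filter length and the number of hash functions, is the kind of parameter the abstract says needed fine-tuning.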

Mendeley readers

The data shown below were compiled from readership statistics for 39 Mendeley readers of this research output.

Geographical breakdown

Country          Count   As %
United States        1     3%
Unknown             38    97%

Demographic breakdown

Readers by professional status                  Count   As %
Researcher                                          8    21%
Student > Master                                    6    15%
Student > Ph. D. Student                            6    15%
Student > Bachelor                                  3     8%
Other                                               2     5%
Other                                               3     8%
Unknown                                            11    28%

Readers by discipline                           Count   As %
Computer Science                                   13    33%
Biochemistry, Genetics and Molecular Biology        3     8%
Nursing and Health Professions                      2     5%
Medicine and Dentistry                              2     5%
Business, Management and Accounting                 1     3%
Other                                               4    10%
Unknown                                            14    36%