
Estimating parameters for probabilistic linkage of privacy-preserved datasets

Overview of attention for article published in BMC Medical Research Methodology, July 2017

Citations: 10 (Dimensions)
Readers: 21 (Mendeley)
Title: Estimating parameters for probabilistic linkage of privacy-preserved datasets
Published in: BMC Medical Research Methodology, July 2017
DOI: 10.1186/s12874-017-0370-0
Pubmed ID:
Authors: Adrian P. Brown, Sean M. Randall, Anna M. Ferrante, James B. Semmens, James H. Boyd

Abstract

Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters.

Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data.

Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities.

The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
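
The parameter estimation summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each field comparison has already been reduced to a binary agree/disagree outcome (for example, by thresholding a similarity score between two Bloom filters), that fields are conditionally independent, and that the standard two-class EM for match (m) and non-match (u) probabilities applies. The function names `posteriors`, `em_estimate` and `estimate_threshold` are hypothetical, and the threshold step is only one plausible reading of "linkage quality estimated for each possible threshold": it uses the EM posterior probabilities as soft counts of true and false matches.

```python
from collections import Counter
import math


def _clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid log(0) and 0/0."""
    return min(max(x, eps), 1.0 - eps)


def posteriors(counts, m, u, p):
    """Posterior probability of being a match for each agreement pattern."""
    post = {}
    for g in counts:
        pm, pu = p, 1.0 - p
        for agrees, mj, uj in zip(g, m, u):
            pm *= mj if agrees else 1.0 - mj
            pu *= uj if agrees else 1.0 - uj
        post[g] = pm / (pm + pu)
    return post


def em_estimate(counts, n_iter=200, m_init=0.9, u_init=0.1, p_init=0.1):
    """Estimate m, u and the match proportion p from pattern counts.

    counts: Counter mapping agreement patterns (tuples of 0/1, one entry
    per field) to the number of record pairs showing that pattern.
    Starting values are assumptions chosen for illustration.
    """
    k = len(next(iter(counts)))
    n = sum(counts.values())
    m, u, p = [m_init] * k, [u_init] * k, p_init
    for _ in range(n_iter):
        post = posteriors(counts, m, u, p)                      # E-step
        n_match = sum(c * post[g] for g, c in counts.items())
        n_non = n - n_match
        for j in range(k):                                      # M-step
            agree_m = sum(c * post[g] for g, c in counts.items() if g[j])
            agree_u = sum(c * (1.0 - post[g]) for g, c in counts.items() if g[j])
            m[j] = _clamp(agree_m / max(n_match, 1e-12))
            u[j] = _clamp(agree_u / max(n_non, 1e-12))
        p = _clamp(n_match / n)
    return m, u, p


def estimate_threshold(counts, m, u, p):
    """Pick the weight cut-off with the highest *estimated* F-measure.

    Expected true/false positives are taken from the posteriors, so no
    labelled pairs are needed. Returns (threshold, estimated F-measure).
    """
    post = posteriors(counts, m, u, p)

    def weight(g):
        return sum(
            math.log2(mj / uj) if agrees else math.log2((1.0 - mj) / (1.0 - uj))
            for agrees, mj, uj in zip(g, m, u)
        )

    ranked = sorted(counts, key=weight, reverse=True)
    total_matches = sum(counts[g] * post[g] for g in counts)
    tp = fp = 0.0
    best_t, best_f = None, 0.0
    for i, g in enumerate(ranked):
        tp += counts[g] * post[g]
        fp += counts[g] * (1.0 - post[g])
        # Patterns with identical weights fall on the same side of any cut-off.
        if i + 1 < len(ranked) and weight(ranked[i + 1]) == weight(g):
            continue
        fn = total_matches - tp
        f = 2.0 * tp / (2.0 * tp + fp + fn)
        if f > best_f:
            best_t, best_f = weight(g), f
    return best_t, best_f
```

A caller would build `counts = Counter(patterns)` from the compared record pairs, run `m, u, p = em_estimate(counts)`, and then pass the result to `estimate_threshold(counts, m, u, p)`. Aggregating pairs into pattern counts keeps each EM iteration proportional to the number of distinct patterns rather than the number of pairs.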

Mendeley readers

The data shown below were compiled from readership statistics for 21 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
Unknown    21       100%

Demographic breakdown

Readers by professional status                       Count  As %
Student > Bachelor                                   4      19%
Student > Ph. D. Student                             3      14%
Student > Master                                     3      14%
Researcher                                           2      10%
Other                                                2      10%
Other                                                4      19%
Unknown                                              3      14%

Readers by discipline                                Count  As %
Computer Science                                     5      24%
Pharmacology, Toxicology and Pharmaceutical Science  2      10%
Medicine and Dentistry                               2      10%
Nursing and Health Professions                       1      5%
Agricultural and Biological Sciences                 1      5%
Other                                                4      19%
Unknown                                              6      29%