Estimating parameters for probabilistic linkage of privacy-preserved datasets

Overview of attention for article published in BMC Medical Research Methodology, July 2017

Citations	10 (Dimensions)
Readers	21 (Mendeley)

Title	Estimating parameters for probabilistic linkage of privacy-preserved datasets
Published in	BMC Medical Research Methodology, July 2017
DOI	10.1186/s12874-017-0370-0
Pubmed ID	28693507
Authors	Adrian P. Brown, Sean M. Randall, Anna M. Ferrante, James B. Semmens, James H. Boyd
Abstract	Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates ranging from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
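
The EM estimation of match probabilities mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each record pair has already been reduced to a binary agree/disagree outcome per field (the paper compares Bloom-filter encodings instead) and that fields are conditionally independent, as in the classic Fellegi-Sunter model. The function names `simulate_pairs` and `em_estimate`, the starting values, and the simulated probabilities are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def simulate_pairs(n, p, m, u, seed=0):
    """Draw n synthetic agreement vectors (illustrative data, not from the paper).

    p is the share of true matches; m[j] and u[j] are the chances that
    field j agrees for a match and for a non-match respectively.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


def em_estimate(pairs, n_iter=200, p=0.1, eps=1e-6):
    """Estimate (p, m, u) from unlabelled agreement vectors with EM."""
    counts = Counter(pairs)            # work on distinct patterns only
    n = sum(counts.values())
    k = len(next(iter(counts)))
    m, u = [0.9] * k, [0.1] * k        # assumed starting values

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) to avoid division by zero.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pattern comes from a match.
        w = {}
        for g in counts:
            pm, pu = p, 1.0 - p
            for j, agree in enumerate(g):
                pm *= m[j] if agree else 1.0 - m[j]
                pu *= u[j] if agree else 1.0 - u[j]
            w[g] = pm / (pm + pu)

        # M-step: re-estimate parameters from the weighted pattern counts.
        n_match = sum(w[g] * c for g, c in counts.items())
        m = [clamp(sum(w[g] * c * g[j] for g, c in counts.items()) / n_match)
             for j in range(k)]
        u = [clamp(sum((1.0 - w[g]) * c * g[j] for g, c in counts.items())
                   / (n - n_match))
             for j in range(k)]
        p = n_match / n
    return p, m, u


# Usage: recover the parameters of a simulated de-duplication comparison set.
true_m = [0.95, 0.90, 0.85, 0.90]
true_u = [0.05, 0.10, 0.20, 0.10]
pairs = simulate_pairs(20000, 0.1, true_m, true_u, seed=1)
p_hat, m_hat, u_hat = em_estimate(pairs)
print(round(p_hat, 3), [round(x, 3) for x in m_hat], [round(x, 3) for x in u_hat])
```

Counting identical agreement patterns first keeps each EM iteration cheap, because with four fields there are at most 16 distinct patterns regardless of how many pairs are compared. The threshold-estimation extension described in the abstract is not covered by this sketch.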

Mendeley readers

The data shown below were compiled from readership statistics for 21 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	21	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Bachelor	4	19%
Student > Ph.D. Student	3	14%
Student > Master	3	14%
Researcher	2	10%
Other	2	10%
Other	4	19%
Unknown	3	14%

Readers by discipline	Count	As %
Computer Science	5	24%
Pharmacology, Toxicology and Pharmaceutical Science	2	10%
Medicine and Dentistry	2	10%
Nursing and Health Professions	1	5%
Agricultural and Biological Sciences	1	5%
Other	4	19%
Unknown	6	29%