Automatic identification of variables in epidemiological datasets using logic regression

Overview of attention for article published in BMC Medical Informatics and Decision Making, April 2017

Citations: 1 (Dimensions)

Readers: 54 (Mendeley)

Title	Automatic identification of variables in epidemiological datasets using logic regression
Published in	BMC Medical Informatics and Decision Making, April 2017
DOI	10.1186/s12911-017-0429-1
Pubmed ID	28407816
Authors	Matthias W. Lorenz, Negin Ashtiani Abdi, Frank Scheckenbach, Anja Pflug, Alpaslan Bülbül, Alberico L. Catapano, Stefan Agewall, Marat Ezhov, Michiel L. Bots, Stefan Kiechl, Andreas Orth, on behalf of the PROG-IMT study group
Abstract	For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. by using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important: it makes it possible to build software that presents, for each target variable, a short list of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list. For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. We demonstrated that the application of logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
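To make the abstract's terms concrete, the following minimal Python sketch shows how a Boolean combination of simple rules can flag candidate source variables for one target variable, and how PPV and NPV are then computed. Everything in it is an illustrative assumption: the rules, the sample variables, and the fixed combination `(name rule OR mean rule) AND integer rule` are hypothetical and not taken from the article. It does not perform logic regression, which searches for such a combination automatically; it only evaluates one hand-picked combination.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical source variables with the metadata the rules inspect.
# "is_age" is the ground truth for the (assumed) target variable "age".
VARIABLES: List[Dict] = [
    {"name": "AGE",            "mean": 54.2,  "is_integer": True,  "is_age": True},
    {"name": "alter",          "mean": 61.0,  "is_integer": True,  "is_age": True},
    {"name": "sbp",            "mean": 132.5, "is_integer": True,  "is_age": False},
    {"name": "bmi",            "mean": 26.1,  "is_integer": False, "is_age": False},
    {"name": "hdl_mgdl",       "mean": 52.0,  "is_integer": True,  "is_age": False},
    {"name": "age_years_frac", "mean": 57.3,  "is_integer": False, "is_age": True},
    {"name": "stage",          "mean": 2.1,   "is_integer": True,  "is_age": False},
]

# Simple hand-written rules (assumptions for illustration only).
def rule_name(v: Dict) -> bool:
    """The variable name contains the substring 'age'."""
    return "age" in v["name"].lower()

def rule_mean(v: Dict) -> bool:
    """The mean lies in a range plausible for adult age in years."""
    return 18 <= v["mean"] <= 100

def rule_integer(v: Dict) -> bool:
    """All values are whole numbers."""
    return v["is_integer"]

def combined_rule(v: Dict) -> bool:
    """One fixed Boolean combination; logic regression would search for this."""
    return (rule_name(v) or rule_mean(v)) and rule_integer(v)

def ppv_npv(predictions: List[bool], truths: List[bool]) -> Tuple[float, float]:
    """Return (PPV, NPV); a value is NaN when its denominator is zero."""
    tp = sum(p and t for p, t in zip(predictions, truths))
    fp = sum(p and not t for p, t in zip(predictions, truths))
    tn = sum(not p and not t for p, t in zip(predictions, truths))
    fn = sum(not p and t for p, t in zip(predictions, truths))
    ppv = tp / (tp + fp) if (tp + fp) else math.nan
    npv = tn / (tn + fn) if (tn + fn) else math.nan
    return ppv, npv

predictions = [combined_rule(v) for v in VARIABLES]
truths = [v["is_age"] for v in VARIABLES]
ppv, npv = ppv_npv(predictions, truths)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In this toy data the combined rule flags four variables, two of which are truly age (the false positives are `hdl_mgdl` and `stage`), and misses one true age variable. In a semi-automated workflow, the flagged variables would be shown to a user as candidates, which is why a missed match can matter more than a spurious candidate.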

Mendeley readers

The data shown below were compiled from readership statistics for 54 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	54	100%

Demographic breakdown

Readers by professional status	Count	As %
Researcher	9	17%
Professor	8	15%
Student > Master	6	11%
Student > Ph.D. Student	4	7%
Librarian	3	6%
Other	11	20%
Unknown	13	24%

Readers by discipline	Count	As %
Medicine and Dentistry	23	43%
Economics, Econometrics and Finance	5	9%
Social Sciences	3	6%
Computer Science	2	4%
Nursing and Health Professions	2	4%
Other	2	4%
Unknown	17	31%