Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp

Overview of attention for article published in BMC Bioinformatics, February 2017
Citations: 6 (Dimensions)
Readers on Mendeley: 31

Published in: BMC Bioinformatics, February 2017
DOI: 10.1186/s12859-017-1480-8
Authors: Mari van Reenen, Johan A. Westerhuis, Carolus J. Reinecke, J Hendrik Venter

Abstract

ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control and experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the associated p-values are significant they indicate discriminatory variables (i.e. informative metabolites). The p-values are calculated assuming a common continuous strictly increasing cumulative distribution under the null hypothesis. This assumption is violated when zero-valued observations can occur with positive probability, a characteristic of GC-MS metabolomics data, disqualifying ERp in this context.

This paper extends ERp to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. This is achieved by allowing the null cumulative distribution function to take the form of a mixture between a jump at zero and a continuous strictly increasing function. The extended ERp approach is referred to as XERp. XERp is no longer non-parametric, but its null distributions depend only on one parameter, the true proportion of zeros. Under the null hypothesis this parameter can be estimated by the proportion of zeros in the available data.

XERp is shown to perform well with regard to bias and power. To demonstrate the utility of XERp, it is applied to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp is able to provide an informative shortlist of discriminatory variables, while attaining satisfactory classification accuracy for new subjects in a leave-one-out cross-validation context.

XERp takes into account the distributional structure of data with a probability mass at zero without requiring any knowledge of the detection limit of the metabolomics platform. XERp is able to identify variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and shifts in the distributions of the non-zero observations. XERp uses simple rules to classify new subjects and a weight pair to adjust for unequal sample sizes or sensitivity and specificity requirements.
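
To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single-cutoff rule on one variable and a weighted sum of the two groups' misclassification rates; the function names `zero_proportion` and `min_error_rate`, the default weights, and the toy data are all hypothetical. The XERp null distribution and p-value calculation are not reproduced here.

```python
import numpy as np


def zero_proportion(control, experimental):
    """Pooled share of zero-valued observations.

    Under the null hypothesis the abstract says the single parameter
    (the true proportion of zeros) can be estimated this way.
    """
    pooled = np.concatenate([control, experimental]).astype(float)
    return float(np.mean(pooled == 0))


def min_error_rate(control, experimental, weights=(0.5, 0.5)):
    """Smallest weighted error rate over all single-cutoff rules.

    Assumption: a subject is assigned to the experimental group when its
    value lies on one side of a cutoff; both directions are tried.
    Returns (error_rate, cutoff, direction).
    """
    control = np.asarray(control, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    w_control, w_experimental = weights

    best = (np.inf, None, None)
    # Every observed value is a candidate cutoff. A cutoff of 0 separates
    # zeros from non-zeros, so a difference in zero proportions is captured
    # by the same search as a shift in the non-zero values.
    for cutoff in np.unique(np.concatenate([control, experimental])):
        # "up": classify as experimental when x > cutoff
        err_up = (w_control * np.mean(control > cutoff)
                  + w_experimental * np.mean(experimental <= cutoff))
        # "down": classify as experimental when x <= cutoff
        err_down = (w_control * np.mean(control <= cutoff)
                    + w_experimental * np.mean(experimental > cutoff))
        for err, direction in ((err_up, "up"), (err_down, "down")):
            if err < best[0]:
                best = (float(err), float(cutoff), direction)
    return best


if __name__ == "__main__":
    # Toy data (invented for illustration only).
    control = [0, 0, 0, 1.0, 1.2]
    experimental = [0, 2.0, 2.5, 3.0, 3.1]
    print(zero_proportion(control, experimental))
    print(min_error_rate(control, experimental))
```

Changing `weights` is one way to reflect the weight pair mentioned in the abstract, for example to penalise errors in the smaller group more heavily; how XERp itself defines and applies the pair is described in the paper, not in this sketch.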

Mendeley readers

The data shown below were compiled from readership statistics for 31 Mendeley readers of this research output.

Geographical breakdown

Country           Count   As %
United Kingdom        1     3%
South Africa          1     3%
Unknown              29    94%

Demographic breakdown

Readers by professional status                  Count   As %
Researcher                                          8    26%
Student > Master                                    4    13%
Student > Ph. D. Student                            3    10%
Student > Doctoral Student                          1     3%
Student > Bachelor                                  1     3%
Other                                               5    16%
Unknown                                             9    29%

Readers by discipline                           Count   As %
Biochemistry, Genetics and Molecular Biology        4    13%
Agricultural and Biological Sciences                4    13%
Medicine and Dentistry                              3    10%
Chemistry                                           3    10%
Neuroscience                                        3    10%
Other                                               6    19%
Unknown                                             8    26%