Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models

Overview of attention for article published in BMC Bioinformatics, November 2015

About this Attention Score

  • Average Attention Score compared to outputs of the same age
  • Average Attention Score compared to outputs of the same age and source

Mentioned by

  • X: 4 users
  • Facebook: 1 page

Citations

  • Dimensions: 85

Readers on

  • Mendeley: 126
  • CiteULike: 1
Title
Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models
Published in
BMC Bioinformatics, November 2015
DOI 10.1186/s12859-015-0784-9
Pubmed ID
Authors

Rok Blagus, Lara Lusa

Abstract

Prediction models are used in clinical research to develop rules that accurately predict patient outcomes from some of the patients' characteristics. They are a valuable tool in the decision-making process of clinicians and health policy makers, as they make it possible to estimate the probability that patients have or will develop a disease, will respond to a treatment, or will experience a recurrence of their disease. Interest in prediction models has been growing in the biomedical community in the last few years. Often the data used to develop prediction models are class-imbalanced, as only a few patients experience the event (and therefore belong to the minority class). Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include undersampling, where only a fraction of the majority class samples is retained in the analysis, and oversampling, where new samples from the minority class are generated.
The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided.
We identify some results from the biomedical literature where cross-validation was performed incorrectly and where we expect that the performance of oversampling techniques was heavily overestimated.
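
The pitfall described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the 1-nearest-neighbour classifier, the random-duplication oversampler, the pure-noise data and all function names (`make_noise_data`, `oversample`, `minority_accuracy_cv`) are assumptions chosen for illustration. Because the features carry no signal, an honest estimate of minority-class accuracy should be low; oversampling before splitting lets copies of the same patient land in both training and test folds, which inflates the estimate.

```python
import random


def make_noise_data(n, n_minority, n_features, rng):
    """Pure-noise features: the class labels carry no real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [1] * n_minority + [0] * (n - n_minority)  # class 1 is the minority
    return X, y


def oversample(X, y, rng):
    """Duplicate randomly chosen minority rows until the classes are balanced."""
    idx0 = [i for i, c in enumerate(y) if c == 0]
    idx1 = [i for i, c in enumerate(y) if c == 1]
    minority, majority = (idx1, idx0) if len(idx1) < len(idx0) else (idx0, idx1)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def predict_1nn(X_train, y_train, x):
    """Return the label of the nearest training point (squared Euclidean distance)."""
    nearest = min(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return y_train[nearest]


def folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def minority_accuracy_cv(X, y, k, rng, oversample_first):
    """k-fold cross-validated accuracy on class 1 (the minority class)."""
    if oversample_first:
        # INCORRECT: duplicates can end up in both training and test folds.
        X, y = oversample(X, y, rng)
    hits = total = 0
    for test in folds(len(y), k, rng):
        held_out = set(test)
        X_train = [X[i] for i in range(len(y)) if i not in held_out]
        y_train = [y[i] for i in range(len(y)) if i not in held_out]
        if not oversample_first:
            # CORRECT: oversample the training folds only.
            X_train, y_train = oversample(X_train, y_train, rng)
        for i in test:
            if y[i] == 1:
                total += 1
                hits += predict_1nn(X_train, y_train, X[i]) == 1
    return hits / total


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_noise_data(200, 20, 5, rng)
    print("oversample, then cross-validate:", minority_accuracy_cv(X, y, 5, rng, True))
    print("oversample inside each fold:    ", minority_accuracy_cv(X, y, 5, rng, False))
```

In this toy setting the first estimate is typically close to 1 and the second close to the minority-class fraction, even though the two procedures use the same data and the same classifier. The size of the gap depends on the classifier and the oversampling method, so the sketch shows the direction of the bias rather than its magnitude in real studies.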

X Demographics

Altmetric collected demographic data from the profiles of the 4 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 126 Mendeley readers of this research output.

Geographical breakdown

Country          Count   As %
United States        1    <1%
Belgium              1    <1%
Unknown            124    98%

Demographic breakdown

Readers by professional status                  Count   As %
Student > Ph. D. Student                           25    20%
Student > Master                                   20    16%
Researcher                                         14    11%
Student > Bachelor                                 13    10%
Professor > Associate Professor                     5     4%
Other                                              19    15%
Unknown                                            30    24%

Readers by discipline                           Count   As %
Computer Science                                   26    21%
Engineering                                        15    12%
Agricultural and Biological Sciences                8     6%
Medicine and Dentistry                              8     6%
Biochemistry, Genetics and Molecular Biology        5     4%
Other                                              29    23%
Unknown                                            35    28%
Attention Score in Context

This research output has an Altmetric Attention Score of 2. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 27 October 2021.
All research outputs: #15,376,383 of 24,840,108 outputs
Outputs from BMC Bioinformatics: #4,684 of 7,595 outputs
Outputs of similar age: #149,069 of 291,301 outputs
Outputs of similar age from BMC Bioinformatics: #92 of 155 outputs
Altmetric has tracked 24,840,108 research outputs across all sources so far. This one is in the 37th percentile – i.e., 37% of other outputs scored the same or lower than it.
So far Altmetric has tracked 7,595 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.5. This one is in the 35th percentile – i.e., 35% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 291,301 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 47th percentile – i.e., 47% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 155 others from the same source and published within six weeks on either side of this one. This one is in the 39th percentile – i.e., 39% of its contemporaries scored the same or lower than it.