
A robust data scaling algorithm to improve classification accuracies in biomedical data

Overview of attention for article published in BMC Bioinformatics, September 2016

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • Good Attention Score compared to outputs of the same age (76th percentile)
  • Good Attention Score compared to outputs of the same age and source (76th percentile)

Mentioned by

13 tweeters

Citations

35 Dimensions

Readers on

111 Mendeley
1 CiteULike
Title
A robust data scaling algorithm to improve classification accuracies in biomedical data
Published in
BMC Bioinformatics, September 2016
DOI 10.1186/s12859-016-1236-x
Authors

Xi Hang Cao, Ivan Stojkovic, Zoran Obradovic

Abstract

Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, the importance of data preprocessing receives less attention. We propose the Generalized Logistic (GL) algorithm, which scales data uniformly to an appropriate interval by learning a generalized logistic function to fit the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective. It is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications, where the number of samples is usually small. It scales the data in a nonlinear fashion, which leads to potential improvement in accuracy. To evaluate the effectiveness of the proposed algorithm, we conducted experiments on 16 binary classification tasks that span different variable types and cover a wide range of applications. The resultant performance in terms of area under the receiver operating characteristic curve (AUROC) and percentage of correct classification showed that models learned using data scaled by the GL algorithm outperform the ones using data scaled by the Min-max and the Z-score algorithms, which are the most commonly used data scaling algorithms. The proposed GL algorithm is simple and effective. It is robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing. Empirical results also show that models learned from data scaled by the GL algorithm have higher accuracy than models learned from data scaled by the commonly used data scaling algorithms.
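The idea in the abstract, fitting a logistic-type curve to the empirical cumulative distribution function (ECDF) and using the fitted curve as the scaler, can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a plain two-parameter logistic (a special case of a generalized logistic), a least-squares fit by gradient descent, and made-up data with one outlier. The function names `fit_logistic_to_ecdf` and `gl_scale` are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quantile(sorted_xs, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_xs) - 1)
    low = int(math.floor(pos))
    high = min(low + 1, len(sorted_xs) - 1)
    return sorted_xs[low] + (pos - low) * (sorted_xs[high] - sorted_xs[low])


def fit_logistic_to_ecdf(values, steps=2000, lr=1.0):
    """Fit s(b * (x - m)) to the ECDF of `values` by least squares.

    Assumption: a two-parameter logistic stands in for the generalized
    logistic function named in the abstract. Returns (b, m).
    """
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 0.5) / n for i in range(n)]  # ECDF heights, midpoint convention
    # Work in robustly standardized units so one step size suits any scale.
    med = quantile(xs, 0.5)
    iqr = (quantile(xs, 0.75) - quantile(xs, 0.25)) or 1.0
    us = [(x - med) / iqr for x in xs]
    b, m = 2.0 * math.log(3.0), 0.0  # logistic whose quartiles match the data
    for _ in range(steps):
        grad_b = grad_m = 0.0
        for u, y in zip(us, ys):
            s = sigmoid(b * (u - m))
            g = 2.0 * (s - y) * s * (1.0 - s)  # d(squared error)/d(argument)
            grad_b += g * (u - m)
            grad_m += g * (-b)
        b -= lr * grad_b / n
        m -= lr * grad_m / n
    # Convert the parameters back to the original units of x.
    return b / iqr, med + m * iqr


def gl_scale(values, b, m):
    """Map each value into (0, 1) through the fitted logistic curve."""
    return [sigmoid(b * (x - m)) for x in values]


# Made-up sample: nine similar values and one extreme outlier.
values = [4.8, 5.0, 5.1, 5.3, 5.5, 5.6, 5.8, 6.0, 6.2, 100.0]
b, m = fit_logistic_to_ecdf(values)
scaled = gl_scale(values, b, m)

# Min-max scaling of the same sample, for comparison.
vmin, vmax = min(values), max(values)
minmax = [(x - vmin) / (vmax - vmin) for x in values]
```

In this sketch the nine typical values are spread over a large part of the unit interval, while Min-max scaling compresses them into a narrow band near 0 because the outlier sets the range. The logistic curve saturates for extreme inputs, so the outlier contributes almost nothing to the fit.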

Twitter Demographics

The data shown below were collected from the profiles of 13 tweeters who shared this research output.

Mendeley readers

The data shown below were compiled from readership statistics for 111 Mendeley readers of this research output.

Geographical breakdown

Country           Count   As %
France                2     2%
United Kingdom        1    <1%
Brazil                1    <1%
Canada                1    <1%
Russia                1    <1%
Unknown             105    95%

Demographic breakdown

Readers by professional status                  Count   As %
Student > Ph. D. Student                           33    30%
Student > Master                                   21    19%
Researcher                                         17    15%
Student > Bachelor                                 13    12%
Student > Doctoral Student                          6     5%
Other                                               9     8%
Unknown                                            12    11%

Readers by discipline                           Count   As %
Engineering                                        24    22%
Computer Science                                   22    20%
Agricultural and Biological Sciences               10     9%
Biochemistry, Genetics and Molecular Biology       10     9%
Medicine and Dentistry                              7     6%
Other                                              23    21%
Unknown                                            15    14%

Attention Score in Context

This research output has an Altmetric Attention Score of 7. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 04 April 2017.
All research outputs: #1,971,379 of 12,167,776 outputs
Outputs from BMC Bioinformatics: #859 of 4,424 outputs
Outputs of similar age: #60,472 of 260,378 outputs
Outputs of similar age from BMC Bioinformatics: #30 of 128 outputs
Altmetric has tracked 12,167,776 research outputs across all sources so far. Compared to these, this one has done well and is in the 83rd percentile: it's in the top 25% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 4,424 research outputs from this source. They receive a mean Attention Score of 4.9. This one has done well, scoring higher than 80% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 260,378 tracked outputs that were published within six weeks on either side of this one in any source. This one has done well, scoring higher than 76% of its contemporaries.
We're also able to compare this research output to 128 others from the same source and published within six weeks on either side of this one. This one has done well, scoring higher than 76% of its contemporaries.
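The percentile figures quoted above are consistent with a simple rank-to-percentile conversion. The sketch below assumes the percentile is the share of outputs ranked below this one, rounded down. Altmetric's exact method, including how ties are handled, is not stated on this page, and `percentile_below` is a hypothetical helper.

```python
def percentile_below(rank, total):
    """Whole-number percentage of outputs ranked below position `rank`.

    Assumption: rank 1 is the best-scoring output and the result is
    rounded down. Integer arithmetic avoids floating-point rounding.
    """
    return (total - rank) * 100 // total


# Rank and group size for each comparison reported on this page.
comparisons = {
    "All research outputs": (1_971_379, 12_167_776),
    "Outputs from BMC Bioinformatics": (859, 4_424),
    "Outputs of similar age": (60_472, 260_378),
    "Outputs of similar age from BMC Bioinformatics": (30, 128),
}

percentiles = {
    name: percentile_below(rank, total)
    for name, (rank, total) in comparisons.items()
}
```

Under that assumption the four comparisons give 83, 80, 76 and 76, matching the percentages stated in the text.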