A robust data scaling algorithm to improve classification accuracies in biomedical data

Overview of attention for article published in BMC Bioinformatics, September 2016

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • Good Attention Score compared to outputs of the same age (74th percentile)
  • Good Attention Score compared to outputs of the same age and source (73rd percentile)

Mentioned by

  • 11 X users

Citations

  • 98 Dimensions citations

Readers on

  • 181 Mendeley readers
  • 1 CiteULike reader
Title
A robust data scaling algorithm to improve classification accuracies in biomedical data
Published in
BMC Bioinformatics, September 2016
DOI 10.1186/s12859-016-1236-x
Pubmed ID
Authors

Xi Hang Cao, Ivan Stojkovic, Zoran Obradovic

Abstract

Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, the importance of data preprocessing draws less attention. We propose the Generalized Logistic (GL) algorithm that scales data uniformly to an appropriate interval by learning a generalized logistic function to fit the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective; it is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications where the number of samples is usually small; it scales the data in a nonlinear fashion, which leads to potential improvement in accuracy. To evaluate the effectiveness of the proposed algorithm, we conducted experiments on 16 binary classification tasks that have different variable types and cover a wide range of applications. The resultant performance in terms of area under the receiver operating characteristic curve (AUROC) and percentage of correct classification showed that models learned using data scaled by the GL algorithm outperform the ones using data scaled by the Min-max and the Z-score algorithms, which are the most commonly used data scaling algorithms. The proposed GL algorithm is simple and effective. It is robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing. Empirical results also show that models learned from data scaled by the GL algorithm have higher accuracy than models learned from data scaled by the commonly used algorithms.
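The idea in the abstract, scaling each value by a logistic-type curve fitted to the empirical CDF, can be illustrated with a small sketch. This is not the authors' implementation. The curve form (asymptotes fixed at 0 and 1, with growth rate `b`, location `m`, and shape `nu`), the ECDF heights `i / (n + 1)`, the coordinate-search fitting routine, and all function names are assumptions made here for illustration; the sample data are hypothetical.

```python
import math


def ecdf(values):
    """Sorted values paired with empirical CDF heights strictly inside (0, 1)."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: plotting positions i / (n + 1); other conventions exist.
    return xs, [(i + 1) / (n + 1) for i in range(n)]


def gl(x, b, m, nu):
    """Generalized logistic curve with asymptotes fixed at 0 and 1 (assumed form)."""
    z = max(min(-b * (x - m), 700.0), -700.0)  # clamp so exp() stays finite
    # Equivalent to (1 + exp(z)) ** (-1 / nu), written in log space.
    return math.exp(-math.log1p(math.exp(z)) / nu)


def fit_gl(values, iters=200):
    """Least-squares fit of gl() to the ECDF using a simple coordinate search."""
    xs, ps = ecdf(values)
    spread = (xs[-1] - xs[0]) or 1.0
    # Search over (log b, m, log nu) so that b and nu stay positive.
    params = [math.log(4.0 / spread), xs[len(xs) // 2], 0.0]
    steps = [1.0, spread / 4.0, 1.0]

    def loss(p):
        b, m, nu = math.exp(p[0]), p[1], math.exp(p[2])
        return sum((gl(x, b, m, nu) - q) ** 2 for x, q in zip(xs, ps))

    best = loss(params)
    for _ in range(iters):
        improved = False
        for i in range(3):
            for sign in (1.0, -1.0):
                trial = list(params)
                trial[i] += sign * steps[i]
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            steps = [s / 2.0 for s in steps]  # refine once no move helps
    return math.exp(params[0]), params[1], math.exp(params[2])


def gl_scale(values):
    """Scale values into (0, 1) through the fitted curve."""
    b, m, nu = fit_gl(values)
    return [gl(x, b, m, nu) for x in values]


def minmax_scale(values):
    """Plain Min-max scaling, for comparison."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # hypothetical feature with one outlier
gl_scaled = gl_scale(data)
mm_scaled = minmax_scale(data)
```

With Min-max scaling, the single outlier forces the nine ordinary values into less than 1% of the unit interval (8/999). The fitted curve instead follows the bulk of the data, so those values are spread over a much larger part of (0, 1), which is the robustness property the abstract describes.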

X Demographics

The data shown below were collected from the profiles of 11 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 181 Mendeley readers of this research output.

Geographical breakdown

Country         Count  As %
France          2      1%
Brazil          1      <1%
United Kingdom  1      <1%
Canada          1      <1%
Russia          1      <1%
Unknown         175    97%

Demographic breakdown

Readers by professional status  Count  As %
Student > Ph. D. Student        37     20%
Student > Master                29     16%
Student > Bachelor              25     14%
Researcher                      17     9%
Student > Doctoral Student      7      4%
Other                           13     7%
Unknown                         53     29%

Readers by discipline                         Count  As %
Computer Science                              33     18%
Engineering                                   29     16%
Agricultural and Biological Sciences          11     6%
Biochemistry, Genetics and Molecular Biology  9      5%
Medicine and Dentistry                        8      4%
Other                                         35     19%
Unknown                                       56     31%
Attention Score in Context

This research output has an Altmetric Attention Score of 6. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 04 April 2017.

Comparison group                                Rank        Out of
All research outputs                            #5,954,816  24,558,777 outputs
Outputs from BMC Bioinformatics                 #1,903      7,553 outputs
Outputs of similar age                          #82,510     336,627 outputs
Outputs of similar age from BMC Bioinformatics  #34         126 outputs

Altmetric has tracked 24,558,777 research outputs across all sources so far. Compared to these, this one has done well and is in the 75th percentile: it's in the top 25% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 7,553 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.5. This one has gotten more attention than average, scoring higher than 73% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 336,627 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 74% of its contemporaries.
We're also able to compare this research output to 126 others from the same source and published within six weeks on either side of this one. This one has gotten more attention than average, scoring higher than 73% of its contemporaries.