
Boosting for high-dimensional two-class prediction

Overview of attention for article published in BMC Bioinformatics, September 2015

About this Attention Score

  • Above-average Attention Score compared to outputs of the same age (51st percentile)
  • Average Attention Score compared to outputs of the same age and source

Mentioned by

4 X users

Citations

23 Dimensions

Readers on

45 Mendeley
1 CiteULike
Title
Boosting for high-dimensional two-class prediction
Published in
BMC Bioinformatics, September 2015
DOI 10.1186/s12859-015-0723-9
Pubmed ID
Authors

Rok Blagus, Lara Lusa

Abstract

In clinical research, prediction models are used to predict patient outcomes from a set of patient characteristics. For high-dimensional prediction models, where the number of variables greatly exceeds the number of samples, the choice of classifier is crucial, as no single classification algorithm performs optimally for all types of data. Boosting was proposed as a method that combines the classification results of base classifiers, sequentially adjusting the sample weights based on performance in previous iterations. Boosting generally outperforms any individual classifier, but studies with high-dimensional data showed that the best-known boosting algorithm, AdaBoost.M1, cannot significantly improve the performance of its base classifier. Other boosting algorithms have since been proposed (Gradient boosting, Stochastic Gradient boosting, LogitBoost); they were shown to perform better than AdaBoost.M1, but their performance had not been evaluated for high-dimensional data. In this paper we use simulation studies and real gene-expression data sets to evaluate the performance of boosting algorithms when data are high-dimensional. Our results confirm that AdaBoost.M1 can perform poorly in this setting, often failing to improve the performance of its base classifier. We explain why this happens and propose a modification, AdaBoost.M1.ICV, which uses cross-validated estimates of the prediction errors and outperforms the original algorithm when data are high-dimensional. The use of AdaBoost.M1.ICV is advisable when the base classifier overfits the training data: the number of variables is large, the number of samples is small, and/or the difference between the classes is large. To a lesser extent, Gradient boosting suffers from similar problems. Contrary to the findings for low-dimensional data, shrinkage does not improve the performance of Gradient boosting when data are high-dimensional; it is, however, beneficial for Stochastic Gradient boosting, which outperformed the other boosting algorithms in our analyses. LogitBoost suffers from overfitting and generally performs poorly. The results show that boosting can substantially improve the performance of its base classifier even when data are high-dimensional, but not all boosting algorithms perform equally well: LogitBoost, AdaBoost.M1 and Gradient boosting seem less useful for this type of data. Overall, Stochastic Gradient boosting with shrinkage and AdaBoost.M1.ICV seem to be the preferable choices for high-dimensional class prediction.
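The abstract sketches the boosting mechanics only briefly, so a worked example may help. The Python sketch below implements the standard two-class AdaBoost.M1 weight-update loop the abstract alludes to; the function names, the stump base learner, and the use of scikit-learn are illustrative assumptions, not the authors' code, and a comment marks the resubstitution error estimate that the proposed AdaBoost.M1.ICV would replace with a cross-validated one.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=50, base=None):
    """Minimal two-class AdaBoost.M1; expects labels coded as -1/+1."""
    base = base if base is not None else DecisionTreeClassifier(max_depth=1)
    y = np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))      # start from uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        clf = clone(base).fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        # Resubstitution (training-set) error estimate. AdaBoost.M1.ICV
        # replaces this with a cross-validated estimate, which matters when
        # the base classifier overfits high-dimensional data.
        err = np.sum(w[pred != y]) / np.sum(w)
        if err >= 0.5:                     # no better than chance: stop
            break
        err = max(err, 1e-10)              # guard against a perfect fit
        alpha = 0.5 * np.log((1.0 - err) / err)
        w *= np.exp(-alpha * y * pred)     # up-weight misclassified samples
        w /= w.sum()
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Weighted majority vote over the committee of base classifiers.
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)
```

For comparison, the Stochastic Gradient boosting with shrinkage that performed best in the authors' analyses corresponds, expressed with scikit-learn (a tooling assumption; the paper does not prescribe this library), to subsampling the training data at each round (`subsample < 1`) combined with a damped learning rate (`learning_rate < 1`):

```python
from sklearn.ensemble import GradientBoostingClassifier

# subsample < 1 makes boosting "stochastic"; learning_rate < 1 is shrinkage.
sgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                 subsample=0.5, max_depth=1)
```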

X Demographics

The data shown below were collected from the profiles of 4 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 45 Mendeley readers of this research output.

Geographical breakdown

Country    Count   As %
Japan          1     2%
Unknown       44    98%

Demographic breakdown

Readers by professional status    Count   As %
Student > Ph.D. Student              10    22%
Student > Master                      9    20%
Student > Bachelor                    5    11%
Student > Doctoral Student            3     7%
Researcher                            3     7%
Other                                 9    20%
Unknown                               6    13%

Readers by discipline                           Count   As %
Computer Science                                   15    33%
Agricultural and Biological Sciences                5    11%
Biochemistry, Genetics and Molecular Biology        4     9%
Mathematics                                         4     9%
Neuroscience                                        3     7%
Other                                               7    16%
Unknown                                             7    16%
Attention Score in Context

This research output has an Altmetric Attention Score of 2. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 10 May 2017.
All research outputs: #13,448,315 of 22,829,083 outputs
Outputs from BMC Bioinformatics: #4,198 of 7,287 outputs
Outputs of similar age: #129,518 of 274,256 outputs
Outputs of similar age from BMC Bioinformatics: #75 of 149 outputs
Altmetric has tracked 22,829,083 research outputs across all sources so far. This one is in the 39th percentile – i.e., 39% of other outputs scored the same or lower than it.
So far Altmetric has tracked 7,287 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.4. This one is in the 39th percentile – i.e., 39% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 274,256 tracked outputs that were published within six weeks on either side of this one in any source. This one has received more attention than average, scoring higher than 51% of its contemporaries.
We're also able to compare this research output to 149 others from the same source and published within six weeks on either side of this one. This one is in the 43rd percentile – i.e., 43% of its contemporaries scored the same or lower than it.