CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

Overview of attention for article published in BMC Bioinformatics, March 2017
Citations

163 Dimensions

Readers on

125 Mendeley
Title
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests
Published in
BMC Bioinformatics, March 2017
DOI 10.1186/s12859-017-1578-z
Pubmed ID
Authors

Li Ma, Suohai Fan

Abstract

The random forests algorithm is a classifier with broad applicability and robustness against overfitting, but it still has some drawbacks. To improve the performance of random forests, this paper addresses imbalanced data processing, feature selection and parameter optimization. We propose the CURE-SMOTE algorithm for the imbalanced data classification problem.

Experiments on imbalanced UCI data reveal that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) effectively improves on the classification results obtained on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, a hybrid RF (random forests) algorithm is proposed for feature selection and parameter optimization, which uses the minimum out-of-bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (the hybrid genetic-random forests algorithm, the hybrid particle swarm-random forests algorithm and the hybrid fish swarm-random forests algorithm) can achieve the minimum OOB error and show the best generalization ability.

The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise, so this feasible and effective algorithm yields better classification results. Moreover, the hybrid algorithms' F-value, G-mean, AUC and OOB scores surpass those of the original RF algorithm. Hence, the hybrid algorithm provides a new way to perform feature selection and parameter optimization.
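The abstract does not describe the steps of CURE-SMOTE, so the following is only a minimal sketch of the plain SMOTE interpolation it builds on, not the authors' algorithm. It assumes the standard SMOTE idea: pick a minority sample, pick one of its k nearest minority neighbours, and create a synthetic point at a random position on the segment between them. The function name `smote_sketch`, the parameters `n_new`, `k` and `seed`, and the toy data are hypothetical and chosen purely for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative only: plain SMOTE-style interpolation, without the CURE
    clustering step that the paper adds on top.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (Euclidean), excluding x itself
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical toy minority class with two features per sample
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the bounding box of the minority class; noisy or outlying minority samples can still drag synthetic points toward them, which is the kind of issue a clustering step is meant to reduce.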

Mendeley readers

The data shown below were compiled from readership statistics for 125 Mendeley readers of this research output.

Geographical breakdown

| Country | Count | As % |
|---|---|---|
| Unknown | 125 | 100% |

Demographic breakdown

| Readers by professional status | Count | As % |
|---|---|---|
| Student > Ph. D. Student | 24 | 19% |
| Student > Master | 11 | 9% |
| Lecturer | 10 | 8% |
| Student > Bachelor | 9 | 7% |
| Researcher | 9 | 7% |
| Other | 22 | 18% |
| Unknown | 40 | 32% |

| Readers by discipline | Count | As % |
|---|---|---|
| Computer Science | 45 | 36% |
| Engineering | 13 | 10% |
| Mathematics | 6 | 5% |
| Business, Management and Accounting | 5 | 4% |
| Arts and Humanities | 2 | 2% |
| Other | 13 | 10% |
| Unknown | 41 | 33% |