Collective feature selection to identify crucial epistatic variants

Overview of attention for article published in BioData Mining, April 2018

Mentioned by
3 X users

Citations
24 Dimensions

Readers on
50 Mendeley
Title
Collective feature selection to identify crucial epistatic variants
Published in
BioData Mining, April 2018
DOI 10.1186/s13040-018-0168-6
Authors

Shefali S. Verma, Anastasia Lucas, Xinyuan Zhang, Yogasudha Veturi, Scott Dudek, Binglan Li, Ruowang Li, Ryan Urbanowicz, Jason H. Moore, Dokyoon Kim, Marylyn D. Ritchie

Abstract

Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases/traits. Detection of epistatic interactions remains a challenge because the input has a large number of features and a relatively small sample size, the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We then take the union of the variables selected by our top performing methods, where each method contributes a user-defined percentage of its variants, and carry that union forward to downstream analysis.

Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, a collective approach proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

In this study, we showed via simulation studies that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection. We demonstrated the effectiveness of collective feature selection, along with a comparison of many methods, in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
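
The "union" step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`top_fraction`, `collective_selection`), the method labels, the SNP names, and the scores are all hypothetical. It also assumes that each method yields one importance score per variable and that a higher score means a more important variable.

```python
import math
from typing import Dict, Set


def top_fraction(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Return the highest-scoring `fraction` of features (at least one if any exist)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(scores) * fraction))
    # Sort by descending score; break ties by feature name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return set(ranked[:n_keep])


def collective_selection(
    method_scores: Dict[str, Dict[str, float]], fraction: float
) -> Set[str]:
    """Union of the top `fraction` of features chosen by each method."""
    selected: Set[str] = set()
    for scores in method_scores.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Hypothetical importance scores from two feature selection methods
# (assumption: higher score = more important variable).
method_scores = {
    "method_a": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.4, "snp4": 0.2},
    "method_b": {"snp1": 0.2, "snp2": 0.3, "snp3": 0.1, "snp4": 0.8},
}

# Each method contributes its top 25% (one of four variables here).
selected = collective_selection(method_scores, fraction=0.25)
print(sorted(selected))
```

In this toy input the two methods rank different variables first, so the union keeps both. That is the behaviour the abstract argues for when no single method works best in all scenarios.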

X Demographics

Demographic data were collected from the profiles of the 3 X users who shared this research output.

Mendeley readers

The data shown below were compiled from readership statistics for 50 Mendeley readers of this research output.

Geographical breakdown

| Country | Count | As % |
|---|---|---|
| Unknown | 50 | 100% |

Demographic breakdown

| Readers by professional status | Count | As % |
|---|---|---|
| Student > Ph.D. Student | 13 | 26% |
| Student > Master | 7 | 14% |
| Student > Bachelor | 4 | 8% |
| Researcher | 4 | 8% |
| Professor | 3 | 6% |
| Other | 7 | 14% |
| Unknown | 12 | 24% |

| Readers by discipline | Count | As % |
|---|---|---|
| Biochemistry, Genetics and Molecular Biology | 7 | 14% |
| Computer Science | 7 | 14% |
| Agricultural and Biological Sciences | 5 | 10% |
| Medicine and Dentistry | 3 | 6% |
| Engineering | 3 | 6% |
| Other | 11 | 22% |
| Unknown | 14 | 28% |

Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 27 March 2019.
All research outputs: #15,505,836 of 23,043,346 outputs
Outputs from BioData Mining: #226 of 309 outputs
Outputs of similar age: #208,691 of 327,380 outputs
Outputs of similar age from BioData Mining: #4 of 5 outputs
Altmetric has tracked 23,043,346 research outputs across all sources so far. This one is in the 22nd percentile – i.e., 22% of other outputs scored the same or lower than it.
So far Altmetric has tracked 309 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 7.7. This one is in the 19th percentile – i.e., 19% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 327,380 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 27th percentile – i.e., 27% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 5 others from the same source that were published within six weeks on either side of this one.