Report for: Collective feature selection to identify crucial epistatic variants


Title	Collective feature selection to identify crucial epistatic variants
Published in	BioData Mining, April 2018
DOI	10.1186/s13040-018-0168-6
Pubmed ID	29713383
Authors	Shefali S. Verma, Anastasia Lucas, Xinyuan Zhang, Yogasudha Veturi, Scott Dudek, Binglan Li, Ruowang Li, Ryan Urbanowicz, Jason H. Moore, Dokyoon Kim, Marylyn D. Ritchie
Abstract	Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach. Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. 
Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~ 44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration). In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
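The "union" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the method labels, and the importance scores are hypothetical, and the sketch assumes each feature selection method yields one score per variant where higher means more important.

```python
import math


def top_fraction(scores, fraction):
    """Return the names of the top-scoring features, keeping ceil(fraction * n) of them."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def collective_selection(scores_by_method, fraction):
    """Union of the top `fraction` of features chosen by each method."""
    selected = set()
    for scores in scores_by_method.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Made-up scores for four variants from two hypothetical methods.
scores_by_method = {
    "method_a": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.4, "snp4": 0.2},
    "method_b": {"snp1": 0.2, "snp2": 0.8, "snp3": 0.3, "snp4": 0.1},
}

selected = collective_selection(scores_by_method, fraction=0.25)
print(sorted(selected))
```

In this toy input each method ranks a different variant first, so the union keeps both, which is the behaviour the abstract argues for when no single method works best in all scenarios.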


X Demographics

The data shown below were collected from the profiles of 3 X users who shared this research output.

Geographical breakdown

Country	Count	As %
United States	2	67%
France	1	33%

Demographic breakdown

Type	Count	As %
Members of the public	2	67%
Scientists	1	33%

Mendeley readers

The data shown below were compiled from readership statistics for 50 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	50	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Ph. D. Student	13	26%
Student > Master	7	14%
Student > Bachelor	4	8%
Researcher	4	8%
Professor	3	6%
Other	7	14%
Unknown	12	24%

Readers by discipline	Count	As %
Biochemistry, Genetics and Molecular Biology	7	14%
Computer Science	7	14%
Agricultural and Biological Sciences	5	10%
Medicine and Dentistry	3	6%
Engineering	3	6%
Other	11	22%
Unknown	14	28%

Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 27 March 2019.

All research outputs

#15,505,836

of 23,043,346 outputs

Outputs from BioData Mining

#226

of 309 outputs

Outputs of similar age

#208,691

of 327,380 outputs

Outputs of similar age from BioData Mining

of 5 outputs

Altmetric has tracked 23,043,346 research outputs across all sources so far. This one is in the 22nd percentile – i.e., 22% of other outputs scored the same or lower than it.
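The percentile wording used here ("scored the same or lower") can be read as the share of other outputs whose score is less than or equal to this one. The sketch below illustrates that reading only; the function name and the list of scores are invented for illustration, and this is not a description of how Altmetric actually computes its figures.

```python
def percentile_same_or_lower(score, other_scores):
    """Percentage of other outputs whose score is the same as or lower than `score`."""
    same_or_lower = sum(1 for s in other_scores if s <= score)
    return 100.0 * same_or_lower / len(other_scores)


# Made-up Attention Scores for ten other outputs.
others = [0, 0, 1, 1, 3, 5, 8, 12, 20, 40]
pct = percentile_same_or_lower(1, others)
print(pct)
```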

So far Altmetric has tracked 309 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 7.7. This one is in the 19th percentile – i.e., 19% of its peers scored the same or lower than it.

Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 327,380 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 27th percentile – i.e., 27% of its contemporaries scored the same or lower than it.

We're also able to compare this research output to 5 others from the same source and published within six weeks on either side of this one.
