
Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis

Overview of attention for article published in BMC Bioinformatics, November 2015

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • High Attention Score compared to outputs of the same age (80th percentile)
  • High Attention Score compared to outputs of the same age and source (81st percentile)

Mentioned by

9 X users
1 patent

Citations

52 Dimensions

Readers on

151 Mendeley
Title
Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis
Published in
BMC Bioinformatics, November 2015
DOI 10.1186/s12859-015-0808-5
Pubmed ID
Authors

Andrew E. Jaffe, Thomas Hyde, Joel Kleinman, Daniel R. Weinberger, Joshua G. Chenoweth, Ronald D. McKay, Jeffrey T. Leek, Carlo Colantuoni

Abstract

Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for widespread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature. We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272). Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior. Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed.
With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272.
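To make the idea of supervised "cleaning" concrete, the following is a minimal sketch, not the authors' R code and not the full SVA algorithm. It assumes a simplified residual-based variant: regress each gene on the effects of interest, take the leading singular vectors of the residuals as surrogate variables, and subtract only their fitted contribution. The function names, the simulated data and all parameter values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def residual_surrogate_variables(expr, design, n_sv):
    """Estimate surrogate variables from residuals (simplified, illustrative).

    expr:   genes x samples expression matrix
    design: samples x covariates matrix of effects of interest (with intercept)
    n_sv:   number of surrogate variables to keep
    """
    # Per-gene least squares fit of the effects of interest.
    beta, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    residuals = expr.T - design @ beta  # samples x genes
    # Leading left singular vectors capture dominant unmodelled sample patterns.
    u, _, _ = np.linalg.svd(residuals, full_matrices=False)
    return u[:, :n_sv]


def remove_surrogate_effects(expr, design, sv):
    """Subtract the fitted surrogate-variable contribution, keeping modelled effects."""
    full = np.hstack([design, sv])
    beta, *_ = np.linalg.lstsq(full, expr.T, rcond=None)
    k = design.shape[1]
    return expr - (sv @ beta[k:]).T


# Simulated example: a known group effect plus a hidden batch artifact.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 20
group = np.repeat([0.0, 1.0], 10)   # effect of interest
batch = np.tile([0.0, 1.0], 10)     # hidden artifact, balanced across groups
design = np.column_stack([np.ones(n_samples), group])

expr = rng.normal(scale=0.1, size=(n_genes, n_samples))
expr[:20] += 2.0 * group                            # 20 genes respond to group
expr += np.outer(rng.normal(size=n_genes), batch)   # every gene carries batch signal

sv = residual_surrogate_variables(expr, design, n_sv=1)
cleaned = remove_surrogate_effects(expr, design, sv)
```

In this sketch only the group effect is protected by the design matrix. Any other structure in the data, biological or technical, is eligible to be absorbed into the surrogate variables and removed, which is the trade-off the abstract cautions about.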

X Demographics


The data shown below were collected from the profiles of 9 X users who shared this research output.
Mendeley readers


The data shown below were compiled from readership statistics for 151 Mendeley readers of this research output.

Geographical breakdown

Country Count As %
France 1 <1%
Sweden 1 <1%
United Kingdom 1 <1%
Spain 1 <1%
United States 1 <1%
Luxembourg 1 <1%
Unknown 145 96%

Demographic breakdown

Readers by professional status Count As %
Student > Ph. D. Student 42 28%
Researcher 38 25%
Student > Master 16 11%
Student > Bachelor 11 7%
Student > Doctoral Student 9 6%
Other 14 9%
Unknown 21 14%

Readers by discipline Count As %
Biochemistry, Genetics and Molecular Biology 43 28%
Agricultural and Biological Sciences 33 22%
Computer Science 10 7%
Medicine and Dentistry 8 5%
Neuroscience 6 4%
Other 24 16%
Unknown 27 18%
Attention Score in Context


This research output has an Altmetric Attention Score of 8. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 11 May 2023.

All research outputs: #4,300,399 of 24,220,739 outputs
Outputs from BMC Bioinformatics: #1,562 of 7,512 outputs
Outputs of similar age: #56,578 of 290,681 outputs
Outputs of similar age from BMC Bioinformatics: #27 of 150 outputs

Altmetric has tracked 24,220,739 research outputs across all sources so far. Compared to these, this one has done well and is in the 82nd percentile: it's in the top 25% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 7,512 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.5. This one has done well, scoring higher than 79% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 290,681 tracked outputs that were published within six weeks on either side of this one in any source. This one has done well, scoring higher than 80% of its contemporaries.
We're also able to compare this research output to 150 others from the same source and published within six weeks on either side of this one. This one has done well, scoring higher than 81% of its contemporaries.