Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO

Overview of attention for article published in Clinical Epigenetics, June 2018
About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • Good Attention Score compared to outputs of the same age (73rd percentile)
  • Good Attention Score compared to outputs of the same age and source (70th percentile)

Mentioned by
11 X users

Citations
97 Dimensions

Readers on
83 Mendeley
Title
Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO
Published in
Clinical Epigenetics, June 2018
DOI 10.1186/s13148-018-0504-1
Authors

Jonathan A. Heiss, Allan C. Just

Abstract

Mislabeled, contaminated or poorly performing samples can threaten power in methylation microarray analyses or even result in spurious associations. We describe a set of quality checks for the popular Illumina 450K and EPIC microarrays to identify problematic samples and demonstrate their application in publicly available datasets.

Quality checks implemented here include 17 control metrics defined by the manufacturer, a sex check to detect mislabeled sex-discordant samples, and both an identity check for fingerprinting sample donors and a measure of sample contamination based on probes querying high-frequency SNPs. These checks were tested on 80 datasets comprising 8327 samples run on the 450K microarray from the GEO repository. Nine hundred forty samples were flagged by at least one control metric and 133 samples from 20 datasets were assigned the wrong sex. In a dataset in which a subset of samples appear contaminated with a single source of DNA, we demonstrate that our measure based on outliers among SNP probes was strongly correlated (> 0.95) with another independent measure of contamination.

A more complete examination of samples that may be mislabeled, contaminated, or have poor performance due to technical problems will improve downstream analyses and replication of findings. We demonstrate that quality control problems are prevalent in a public repository of DNA methylation data. We advocate for a more thorough quality control workflow in epigenome-wide association studies and provide a software package to perform the checks described in this work. Reproducible code and supplementary material are available at 10.5281/zenodo.1172730.
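
To make the idea of a contamination measure "based on outliers among SNP probes" more concrete, here is a minimal Python sketch. It is not the authors' method or their software package. It assumes, as a simplification, that beta values of SNP probes in a clean sample sit near one of three idealised genotype levels (0, 0.5, 1), and it reports the share of probes that fall outside a tolerance band around those levels. The function name `outlier_fraction`, the centre values, the tolerance and the example data are all hypothetical.

```python
from typing import Sequence

# Assumed, idealised beta-value levels for the three genotypes of a
# biallelic SNP. Real arrays show shifted and noisier clusters.
GENOTYPE_CENTRES = (0.0, 0.5, 1.0)


def outlier_fraction(snp_betas: Sequence[float], tolerance: float = 0.1) -> float:
    """Return the share of SNP-probe beta values that lie further than
    `tolerance` from every idealised genotype centre."""
    if not snp_betas:
        raise ValueError("need at least one SNP probe value")
    outliers = sum(
        1
        for beta in snp_betas
        if min(abs(beta - centre) for centre in GENOTYPE_CENTRES) > tolerance
    )
    return outliers / len(snp_betas)


# Made-up example values, for illustration only.
clean = [0.02, 0.48, 0.97, 0.51, 0.03, 0.99]  # all close to a genotype level
mixed = [0.02, 0.30, 0.80, 0.51, 0.22, 0.99]  # several intermediate values

print(outlier_fraction(clean))
print(outlier_fraction(mixed))
```

A sample that mixes DNA from two donors would be expected to show more intermediate values, and therefore a higher outlier fraction, than a sample from a single donor. The choice of a fixed tolerance is an assumption of this sketch.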

X Demographics

The data shown below were collected from the profiles of 11 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 83 Mendeley readers of this research output.

Geographical breakdown

Country   Count   As %
Unknown   83      100%

Demographic breakdown

Readers by professional status   Count   As %
Researcher                       19      23%
Student > Ph. D. Student         16      19%
Student > Master                 8       10%
Student > Bachelor               5       6%
Student > Postgraduate           5       6%
Other                            8       10%
Unknown                          22      27%

Readers by discipline                          Count   As %
Biochemistry, Genetics and Molecular Biology   22      27%
Agricultural and Biological Sciences           13      16%
Medicine and Dentistry                         7       8%
Computer Science                               4       5%
Environmental Science                          2       2%
Other                                          6       7%
Unknown                                        29      35%

Attention Score in Context

This research output has an Altmetric Attention Score of 7. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 20 December 2019.
All research outputs
#5,182,858 of 25,563,770 outputs
Outputs from Clinical Epigenetics
#371 of 1,442 outputs
Outputs of similar age
#91,618 of 343,300 outputs
Outputs of similar age from Clinical Epigenetics
#10 of 31 outputs
Altmetric has tracked 25,563,770 research outputs across all sources so far. Compared to these, this one has done well and is in the 79th percentile: it's in the top 25% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 1,442 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 8.4. This one has gotten more attention than average, scoring higher than 74% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 343,300 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 73% of its contemporaries.
We're also able to compare this research output to 31 others from the same source and published within six weeks on either side of this one. This one has gotten more attention than average, scoring higher than 70% of its contemporaries.
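
The rank and percentage figures above can be cross-checked with a little arithmetic. The Python sketch below assumes that each percentage is the share of the other outputs in a group that rank below this one, rounded down to a whole percent. That formula is a guess that is not documented on this page, and the function name `percent_outscored` is hypothetical. Under that assumption it reproduces the four percentages quoted above.

```python
def percent_outscored(rank: int, total: int) -> int:
    """Whole-percent share (rounded down) of the other outputs in a group
    that rank below the output at position `rank` (1 = most attention)."""
    if total < 2 or not 1 <= rank <= total:
        raise ValueError("need total >= 2 and 1 <= rank <= total")
    # Integer arithmetic avoids floating-point rounding surprises.
    return 100 * (total - rank) // (total - 1)


# Rank and group size as quoted on this page.
groups = {
    "All research outputs": (5_182_858, 25_563_770),
    "Outputs from Clinical Epigenetics": (371, 1_442),
    "Outputs of similar age": (91_618, 343_300),
    "Outputs of similar age from Clinical Epigenetics": (10, 31),
}

for label, (rank, total) in groups.items():
    print(f"{label}: {percent_outscored(rank, total)}%")
```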