ALE: automated label extraction from GEO metadata

Overview of attention for article published in BMC Bioinformatics, December 2017

Mentioned by: 1 X user
Citations: 24 Dimensions
Readers on: 40 Mendeley
Title
ALE: automated label extraction from GEO metadata
Published in
BMC Bioinformatics, December 2017
DOI 10.1186/s12859-017-1888-1
Pubmed ID
Authors

Cory B. Giles, Chase A. Brown, Michael Ripperger, Zane Dennis, Xiavan Roopnarinesingh, Hunter Porter, Aleksandra Perz, Jonathan D. Wren

Abstract

NCBI's Gene Expression Omnibus (GEO) is a rich community resource containing millions of gene expression experiments from human, mouse, rat, and other model organisms. However, information about each experiment (metadata) takes the form of an open-ended, non-standardized textual description provided by the depositor. Thus, classification of experiments for meta-analysis by factors such as gender, age of the sample donor, and tissue of origin is not feasible without assigning labels to the experiments. Automated approaches are preferable for this, primarily because of the size and volume of the data to be processed, but also because they ensure standardization and consistency. While some of these labels can be extracted directly from the textual metadata, much of the available data lacks explicit text stating the age and gender of the subjects in the study. To bridge this gap, machine-learning methods can be trained to use the gene expression patterns associated with the text-derived labels to refine label-prediction confidence. Our analysis shows that only 26% of metadata text contains information about gender and 21% about age. To ameliorate the lack of available labels for these data sets, we first extract labels from the textual metadata for each GEO RNA dataset and evaluate the performance against a gold standard of manually curated labels. We then use machine-learning methods to predict labels based upon gene expression of the samples, and compare this to the text-based method. Here we present an automated method to extract labels for age, gender, and tissue from textual metadata and GEO data using both a heuristic approach and machine learning. We show that the two methods together improve accuracy of label assignment to GEO samples.
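
The text-based half of the approach described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular expressions, the function name `extract_labels`, the `key: value` metadata layout, and the choice to treat unit-less ages as years are all assumptions made for illustration. Real GEO sample descriptions are far less regular than the strings used here.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, written for this example only.
_GENDER_RE = re.compile(
    r"\b(?:gender|sex)\s*[:=]\s*(female|male|f|m)\b", re.IGNORECASE
)
_AGE_RE = re.compile(
    r"\bage\s*[:=]\s*(\d+(?:\.\d+)?)\s*(years?|yrs?|y|months?|weeks?)?",
    re.IGNORECASE,
)
_TISSUE_RE = re.compile(r"\btissue\s*[:=]\s*([^;,\n]+)", re.IGNORECASE)


def extract_labels(metadata: str) -> Dict[str, Optional[object]]:
    """Pull gender, age (in years), and tissue out of free-text metadata.

    A label is left as None when no pattern matches, which is the case
    that an expression-based predictor would then have to cover.
    """
    labels: Dict[str, Optional[object]] = {
        "gender": None,
        "age_years": None,
        "tissue": None,
    }

    match = _GENDER_RE.search(metadata)
    if match:
        token = match.group(1).lower()
        labels["gender"] = "female" if token.startswith("f") else "male"

    match = _AGE_RE.search(metadata)
    if match:
        value = float(match.group(1))
        # Assumption: an age with no unit is given in years.
        unit = (match.group(2) or "years").lower()
        if unit.startswith("m"):
            value /= 12  # months to years
        elif unit.startswith("w"):
            value /= 52  # weeks to years (approximate)
        labels["age_years"] = value

    match = _TISSUE_RE.search(metadata)
    if match:
        labels["tissue"] = match.group(1).strip().lower()

    return labels


if __name__ == "__main__":
    print(extract_labels("gender: Female; age: 45 years; tissue: Liver"))
    print(extract_labels("total RNA from cell line"))
```

Samples for which a label comes back as `None` correspond to the gap the abstract describes, where the text carries no explicit age or gender and a model trained on gene expression would be needed instead.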

Mendeley readers

The data shown below were compiled from readership statistics for 40 Mendeley readers of this research output.

Geographical breakdown

Country | Count | As %
Unknown | 40 | 100%

Demographic breakdown

Readers by professional status | Count | As %
Researcher | 9 | 23%
Student > Ph. D. Student | 9 | 23%
Other | 4 | 10%
Student > Bachelor | 3 | 8%
Student > Master | 3 | 8%
Other | 1 | 3%
Unknown | 11 | 28%

Readers by discipline | Count | As %
Agricultural and Biological Sciences | 7 | 18%
Computer Science | 6 | 15%
Biochemistry, Genetics and Molecular Biology | 4 | 10%
Medicine and Dentistry | 4 | 10%
Engineering | 3 | 8%
Other | 4 | 10%
Unknown | 12 | 30%
Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 02 January 2018.
All research outputs: #20,458,307 of 23,015,156 outputs
Outputs from BMC Bioinformatics: #6,890 of 7,315 outputs
Outputs of similar age: #377,608 of 441,976 outputs
Outputs of similar age from BMC Bioinformatics: #121 of 143 outputs
Altmetric has tracked 23,015,156 research outputs across all sources so far. This one is in the 1st percentile – i.e., 1% of other outputs scored the same or lower than it.
So far Altmetric has tracked 7,315 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.4. This one is in the 1st percentile – i.e., 1% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 441,976 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 1st percentile – i.e., 1% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 143 others from the same source and published within six weeks on either side of this one. This one is in the 1st percentile – i.e., 1% of its contemporaries scored the same or lower than it.