Using random forests for assistance in the curation of G-protein coupled receptor databases

Overview of attention for article published in BioMedical Engineering OnLine, August 2017

Mentioned by: 3 X users
Citations: 7 Dimensions
Readers on: 14 Mendeley
Title
Using random forests for assistance in the curation of G-protein coupled receptor databases
Published in
BioMedical Engineering OnLine, August 2017
DOI 10.1186/s12938-017-0357-4
Authors

Aleksei Shkurin, Alfredo Vellido

Abstract

Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structures, such discrimination must rely on their primary amino acid sequences. We are interested not so much in achieving maximum sub-family discrimination accuracy using quantitative methods as in exploring sequence misclassification behaviour. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family.

Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers.

Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each sub-family and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task.

The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviours. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
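
The abstract defines misclassification consistency through the votes of the base trees but does not give the exact formula. As a minimal sketch of the idea, assuming that consistency can be measured as the fraction of trees voting for the single most-voted wrong sub-family, the Python below flags sequences whose trees mostly agree on the same wrong label. The function names, the 0.8 threshold, the placeholder labels (`sub_A`, `sub_B`, ...) and the toy votes are all illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Return (most-voted wrong label, fraction of all trees voting for it).

    tree_votes: list of predicted labels, one per base tree.
    Returns (None, 0.0) when no tree voted for a wrong label.
    """
    wrong = Counter(v for v in tree_votes if v != true_label)
    if not wrong:
        return None, 0.0
    label, count = wrong.most_common(1)[0]
    return label, count / len(tree_votes)


def shortlist(votes_by_sequence, true_labels, threshold=0.8):
    """List sequences whose trees agree on one wrong label (assumed threshold)."""
    flagged = []
    for seq_id, votes in votes_by_sequence.items():
        label, frac = misclassification_consistency(votes, true_labels[seq_id])
        if label is not None and frac >= threshold:
            flagged.append((seq_id, label, frac))
    # Most consistently misclassified sequences first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Toy votes from 10 hypothetical trees per sequence (not real data).
votes = {
    "seq_a": ["sub_A"] * 9 + ["sub_C"],                    # mostly correct
    "seq_b": ["sub_B"] * 9 + ["sub_A"],                    # consistently wrong
    "seq_c": ["sub_A"] * 4 + ["sub_B"] * 3 + ["sub_C"] * 3,  # wrong but scattered
}
labels = {"seq_a": "sub_A", "seq_b": "sub_A", "seq_c": "sub_D"}

result = shortlist(votes, labels)
print(result)
```

In this toy case only `seq_b` is shortlisted: `seq_c` is also misclassified by every tree, but its votes are spread over several wrong labels, so it does not count as consistent under the assumed definition. With a fitted ensemble, the per-tree votes would come from the individual trees' predictions on held-out or out-of-bag samples.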

X Demographics

The data shown below were collected from the profiles of 3 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 14 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
Unknown    14       100%

Demographic breakdown

Readers by professional status                  Count    As %
Student > Bachelor                              4        29%
Researcher                                      4        29%
Student > Master                                2        14%
Professor > Associate Professor                 1        7%
Unknown                                         3        21%

Readers by discipline                           Count    As %
Computer Science                                3        21%
Biochemistry, Genetics and Molecular Biology    2        14%
Agricultural and Biological Sciences            2        14%
Engineering                                     2        14%
Medicine and Dentistry                          1        7%
Other                                           1        7%
Unknown                                         3        21%
Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 23 October 2017.
All research outputs: #15,477,045 of 22,999,744 outputs
Outputs from BioMedical Engineering OnLine: #425 of 824 outputs
Outputs of similar age: #200,103 of 318,830 outputs
Outputs of similar age from BioMedical Engineering OnLine: #12 of 20 outputs
Altmetric has tracked 22,999,744 research outputs across all sources so far. This one is in the 22nd percentile – i.e., 22% of other outputs scored the same or lower than it.
So far Altmetric has tracked 824 research outputs from this source. They receive a mean Attention Score of 4.7. This one is in the 35th percentile – i.e., 35% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 318,830 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 28th percentile – i.e., 28% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 20 others from the same source and published within six weeks on either side of this one. This one is in the 15th percentile – i.e., 15% of its contemporaries scored the same or lower than it.