Report for: Using random forests for assistance in the curation of G-protein coupled receptor databases


Title	Using random forests for assistance in the curation of G-protein coupled receptor databases
Published in	BioMedical Engineering OnLine, August 2017
DOI	10.1186/s12938-017-0357-4
Pubmed ID	28830426
Authors	Aleksei Shkurin, Alfredo Vellido
Abstract	Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences. We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers. Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task. The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
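
The abstract defines misclassification consistency through the votes of the base trees in the ensemble. A minimal sketch of that idea follows, assuming the per-tree predictions for one sequence are already available as a list of labels. The function name, the return values, and the sub-family labels "sub_A", "sub_B", and "sub_C" are hypothetical illustrations, not the authors' implementation or data.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Summarise how consistently the base trees misclassify one sequence.

    tree_votes: one predicted label per base tree (hypothetical input).
    Returns (wrong_fraction, top_wrong_label, top_wrong_share), where
    wrong_fraction is the share of trees voting for any wrong sub-family
    and top_wrong_share is the share of those wrong votes that agree on
    the single most frequent wrong sub-family.
    """
    wrong = [vote for vote in tree_votes if vote != true_label]
    if not wrong:
        # No tree disagreed with the database label.
        return 0.0, None, 0.0
    wrong_fraction = len(wrong) / len(tree_votes)
    top_label, top_count = Counter(wrong).most_common(1)[0]
    return wrong_fraction, top_label, top_count / len(wrong)


# Illustrative votes from a 10-tree ensemble for a sequence labelled "sub_A".
votes = ["sub_A"] * 2 + ["sub_B"] * 7 + ["sub_C"] * 1
frac, label, share = misclassification_consistency(votes, "sub_A")
print(frac, label, share)
```

Under this reading, a sequence with both a high wrong-vote fraction and a high share for a single wrong sub-family would be a candidate for a curator shortlist. Any thresholds for "high" are left open here because the abstract does not state them.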


X Demographics

The data shown below were collected from the profiles of 3 X users who shared this research output.

Geographical breakdown

Country	Count	As %
Spain	1	33%
United States	1	33%
Unknown	1	33%

Demographic breakdown

Type	Count	As %
Members of the public	2	67%
Scientists	1	33%

Mendeley readers

The data shown below were compiled from readership statistics for 14 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	14	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Bachelor	4	29%
Researcher	4	29%
Student > Master	2	14%
Professor > Associate Professor	1	7%
Unknown	3	21%

Readers by discipline	Count	As %
Computer Science	3	21%
Biochemistry, Genetics and Molecular Biology	2	14%
Agricultural and Biological Sciences	2	14%
Engineering	2	14%
Medicine and Dentistry	1	7%
Other	1	7%
Unknown	3	21%

Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 23 October 2017.

Context	Rank	Out of
All research outputs	#15,477,045	22,999,744 outputs
Outputs from BioMedical Engineering OnLine	#425	824 outputs
Outputs of similar age	#200,103	318,830 outputs
Outputs of similar age from BioMedical Engineering OnLine	#12	20 outputs

Altmetric has tracked 22,999,744 research outputs across all sources so far. This one is in the 22nd percentile – i.e., 22% of other outputs scored the same or lower than it.

So far Altmetric has tracked 824 research outputs from this source. They receive a mean Attention Score of 4.7. This one is in the 35th percentile – i.e., 35% of its peers scored the same or lower than it.

Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 318,830 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 28th percentile – i.e., 28% of its contemporaries scored the same or lower than it.

We're also able to compare this research output to 20 others from the same source and published within six weeks on either side of this one. This one is in the 15th percentile – i.e., 15% of its contemporaries scored the same or lower than it.
