
PubChem chemical structure standardization

Overview of attention for article published in Journal of Cheminformatics, August 2018

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • High Attention Score compared to outputs of the same age (92nd percentile)
  • High Attention Score compared to outputs of the same age and source (94th percentile)

Mentioned by

2 blogs
30 X users
2 Facebook pages

Citations

91 Dimensions

Readers on

185 Mendeley
1 CiteULike
Title
PubChem chemical structure standardization
Published in
Journal of Cheminformatics, August 2018
DOI 10.1186/s13321-018-0293-8
Pubmed ID
Authors

Volker D. Hähnke, Sunghwan Kim, Evan E. Bolton

Abstract

PubChem is a chemical information repository consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored in Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes their success rates, the reasons structures are rejected, and the modifications applied to structures during standardization. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.

The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in Substance to 45,808,881 in Compound, as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have an individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem Substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form).

Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid or incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource ( https://pubchem.ncbi.nlm.nih.gov/standardize ), and via programmatic interfaces.
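
The unique-structure counts quoted in the abstract can be turned into an absolute and relative reduction with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical and not taken from the paper, and the only inputs are the two counts stated above. Note that the resulting reduction is a different measure from the 44% of structures reported as modified, since a modified structure does not necessarily merge with another one.

```python
# Counts of unique structures quoted in the abstract
# (identified by de-aromatized canonical isomeric SMILES).
UNIQUE_IN_SUBSTANCE = 53_574_724
UNIQUE_IN_COMPOUND = 45_808_881


def reduction(before: int, after: int) -> tuple[int, float]:
    """Return the absolute and fractional drop from `before` to `after`."""
    removed = before - after
    return removed, removed / before


removed, fraction = reduction(UNIQUE_IN_SUBSTANCE, UNIQUE_IN_COMPOUND)
print(f"{removed:,} fewer unique structures ({fraction:.1%} reduction)")
# prints: 7,765,843 fewer unique structures (14.5% reduction)
```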

X Demographics

The data shown below were collected from the profiles of 30 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 185 Mendeley readers of this research output.

Geographical breakdown

Country   Count   As %
Unknown   185     100%

Demographic breakdown

Readers by professional status   Count   As %
Researcher                       33      18%
Student > Bachelor               19      10%
Student > Master                 16      9%
Student > Ph. D. Student         12      6%
Other                            9       5%
Other                            23      12%
Unknown                          73      39%

Readers by discipline                                 Count   As %
Chemistry                                             44      24%
Pharmacology, Toxicology and Pharmaceutical Science   16      9%
Biochemistry, Genetics and Molecular Biology          12      6%
Agricultural and Biological Sciences                  9       5%
Unspecified                                           5       3%
Other                                                 17      9%
Unknown                                               82      44%

Attention Score in Context

This research output has an Altmetric Attention Score of 31. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 01 September 2023.
All research outputs: #1,286,505 of 25,713,737 outputs
Outputs from Journal of Cheminformatics: #59 of 981 outputs
Outputs of similar age: #26,440 of 342,290 outputs
Outputs of similar age from Journal of Cheminformatics: #1 of 19 outputs
Altmetric has tracked 25,713,737 research outputs across all sources so far. Compared to these, this one has done particularly well and is in the 94th percentile: it's in the top 10% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 981 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 10.0. This one has done particularly well, scoring higher than 93% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 342,290 tracked outputs that were published within six weeks on either side of this one in any source. This one has done particularly well, scoring higher than 92% of its contemporaries.
We're also able to compare this research output to 19 others from the same source and published within six weeks on either side of this one. This one has done particularly well, scoring higher than 94% of its contemporaries.
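
The percentages quoted above can be reproduced from the rank and pool sizes listed on this page. The sketch below assumes, without confirmation from Altmetric, that each figure is the share of outputs ranked below this one, rounded down to a whole percent. The function name `percent_outscored` is hypothetical, and the inputs are the four rank/total pairs stated above.

```python
def percent_outscored(rank: int, total: int) -> int:
    """Whole-percent share of `total` outputs ranked below position `rank`.

    Assumption: the value is floored, not rounded; integer arithmetic
    avoids floating-point edge cases.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must lie between 1 and total")
    return 100 * (total - rank) // total


# (rank, total) pairs as listed on this page.
contexts = {
    "All research outputs": (1_286_505, 25_713_737),
    "Journal of Cheminformatics": (59, 981),
    "Similar age": (26_440, 342_290),
    "Similar age, same source": (1, 19),
}

for label, (rank, total) in contexts.items():
    print(f"{label}: higher than {percent_outscored(rank, total)}%")
```

Under that flooring assumption the four results are 94, 93, 92, and 94, which match the percentages stated in the text.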