
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Overview of attention for an article published in BMC Medical Informatics and Decision Making, July 2017

Mentioned by

3 X users

Citations

55 Dimensions

Readers on

108 Mendeley
Published in
BMC Medical Informatics and Decision Making, July 2017
DOI 10.1186/s12911-017-0498-1
Authors

Yongjun Zhu, Erjia Yan, Fei Wang

Abstract

Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec's ability to derive semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of the recency, size, and section of biomedical publication data on the performance of word2vec.

We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section, and word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards, and the performance of models trained on different subsets is compared to examine recency, size, and section effects.

Training on more recent datasets did not boost performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets, in both the relatedness task (from 368 pairs at the 10% level to 494 at the 100% level) and the similarity task (from 374 pairs at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that correlate more strongly with the reference standards than the one trained on article bodies (0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the model trained on article bodies identified more pairs of biomedical terms than the one trained on abstracts (498 vs. 344 in the similarity task and 503 vs. 339 in the relatedness task).

Increasing the size of a dataset does not always enhance performance: larger datasets can identify more relations between biomedical terms, but this does not guarantee better precision. As summaries of research articles, abstracts excel over article bodies in accuracy but lose in coverage of identifiable relations.
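To make the evaluation pipeline described in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code) using gensim's Word2Vec: train a skip-gram model on pre-tokenized abstracts, score human-rated term pairs by cosine similarity, and rank-correlate the two. The file name, reference pairs, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the pipeline the abstract describes. Assumes gensim >= 4.0
# and a corpus file with one pre-tokenized, lowercased abstract per line.
from gensim.models import Word2Vec
from scipy.stats import spearmanr


class AbstractCorpus:
    """Streams one pre-tokenized, lowercased abstract per line (assumed format)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()


# Hyperparameters are placeholders, not the authors' configuration.
model = Word2Vec(
    sentences=AbstractCorpus("pubmed_abstracts.txt"),  # hypothetical file
    vector_size=200,
    window=5,
    min_count=5,  # terms rarer than this get no vector, so their pairs go unscored
    sg=1,         # skip-gram
    workers=4,
)

# Reference standard: (term1, term2, human rating) triples. Dummy values here;
# real reference sets for this task contain hundreds of rated pairs.
reference = [
    ("diabetes", "insulin", 7.2),
    ("aspirin", "headache", 5.1),
    ("kidney", "renal", 9.0),
]

human, cosine = [], []
for t1, t2, rating in reference:
    # A pair counts as "identified" only if both terms are in the vocabulary,
    # which is why larger corpora can cover more pairs.
    if t1 in model.wv and t2 in model.wv:
        human.append(rating)
        cosine.append(model.wv.similarity(t1, t2))

rho, _ = spearmanr(human, cosine)
print(f"identified pairs: {len(cosine)} of {len(reference)}; Spearman rho = {rho:.2f}")
```

The Spearman correlation between model cosine similarities and human ratings corresponds to the 0.59–0.66 figures reported in the abstract, while the count of scoreable pairs corresponds to the "identified pairs" counts.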

X Demographics

The data shown below were collected from the profiles of the 3 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for the 108 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
Unknown    108      100%

Demographic breakdown

Readers by professional status    Count    As %
Student > Ph.D. Student           20       19%
Researcher                        16       15%
Student > Master                  15       14%
Student > Doctoral Student        7        6%
Professor                         6        6%
Other                             22       20%
Unknown                           22       20%
Readers by discipline                           Count    As %
Computer Science                                37       34%
Medicine and Dentistry                          10       9%
Engineering                                     5        5%
Nursing and Health Professions                  3        3%
Biochemistry, Genetics and Molecular Biology    3        3%
Other                                           14       13%
Unknown                                         36       33%
Attention Score in Context

This research output has an Altmetric Attention Score of 2. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 23 June 2018.
All research outputs: #15,867,545 of 23,577,761 outputs
Outputs from BMC Medical Informatics and Decision Making: #1,343 of 2,027 outputs
Outputs of similar age: #198,878 of 314,967 outputs
Outputs of similar age from BMC Medical Informatics and Decision Making: #26 of 39 outputs
Altmetric has tracked 23,577,761 research outputs across all sources so far. This one is in the 22nd percentile – i.e., 22% of other outputs scored the same or lower than it.
So far Altmetric has tracked 2,027 research outputs from this source. They receive a mean Attention Score of 4.9. This one is in the 24th percentile – i.e., 24% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 314,967 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 28th percentile – i.e., 28% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 39 others from the same source and published within six weeks on either side of this one. This one is in the 23rd percentile – i.e., 23% of its contemporaries scored the same or lower than it.
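As a rough illustration of the percentile arithmetic used throughout this section (my own sketch, not Altmetric's code): the percentile is the share of comparison outputs whose score is the same as or lower than this one's. The peer scores below are invented.

```python
def percentile_rank(score, peer_scores):
    """Percentage of peers scoring the same or lower, rounded down."""
    at_or_below = sum(1 for s in peer_scores if s <= score)
    return 100 * at_or_below // len(peer_scores)


# Hypothetical peer scores: an output scoring 2 ties or beats 6 of 10 peers.
peers = [0, 0, 1, 1, 2, 2, 3, 5, 8, 13]
print(percentile_rank(2, peers))  # -> 60
```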