Expansion of medical vocabularies using distributional semantics on Japanese patient blogs

Overview of attention for article published in Journal of Biomedical Semantics, September 2016

About this Attention Score

  • Above-average Attention Score compared to outputs of the same age (51st percentile)

Mentioned by

1 patent

Readers on

19 Mendeley
Title
Expansion of medical vocabularies using distributional semantics on Japanese patient blogs
Published in
Journal of Biomedical Semantics, September 2016
DOI 10.1186/s13326-016-0093-x
Authors

Magnus Ahltorp, Maria Skeppstedt, Shiho Kitajima, Aron Henriksson, Rafal Rzepka, Kenji Araki

Abstract

Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to the limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, also essential for text mining from corpora written in languages other than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography as well as text genre.

This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs. Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.

Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a candidate list of length 10n, the choice of cluster size affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25% for a candidate list of top n terms and a recall of 68% for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58%, compared to 46% for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23% for Body Part, compared to 16% for Medical Finding.

Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
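
The pipeline described in the abstract (random indexing, clustering of seed terms, ranking unknown terms by closeness to cluster centroids, and recall over a candidate list) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the vector dimensionality `DIM`, the number of non-zero entries `NONZERO`, the use of cosine similarity with centroid-based merging, and the tiny English toy corpus are all assumptions made for illustration; the study itself used tokenized Japanese blog text and its own parameter choices.

```python
import math
import random
from collections import defaultdict

DIM = 200      # dimensionality of the index space (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)


def index_vector(word):
    """Sparse ternary index vector, generated deterministically per word."""
    rng = random.Random(word)
    positions = rng.sample(range(DIM), NONZERO)
    return {p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)}


def context_vectors(sentences, window=1):
    """Random indexing: a term's vector is the sum of the index vectors
    of the words found within `window` tokens on each side of it."""
    vectors = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        vectors[term][pos] += sign
    return dict(vectors)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vecs):
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cluster_seeds(seed_vecs, max_size=2):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    most similar centroids, as long as the result has at most `max_size`
    members. Returns the centroids of the final clusters."""
    clusters = [[v] for v in seed_vecs]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > max_size:
                    continue
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best is None:
            return [centroid(c) for c in clusters]
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid


def rank_candidates(vectors, seeds, max_size=2):
    """Rank non-seed terms by similarity to their closest cluster centroid."""
    centroids = cluster_seeds([vectors[s] for s in seeds], max_size)
    scores = {
        term: max(cosine(vec, c) for c in centroids)
        for term, vec in vectors.items() if term not in seeds
    }
    return sorted(scores, key=scores.get, reverse=True)


def recall_at(ranked, held_out, k):
    """Share of held-out vocabulary terms found among the top k candidates."""
    return len(set(ranked[:k]) & set(held_out)) / len(held_out)


# Invented toy corpus; "codeine" plays the role of a held-out vocabulary term.
corpus = [
    ["took", "aspirin", "daily"],
    ["took", "ibuprofen", "daily"],
    ["took", "codeine", "daily"],
    ["my", "knee", "hurts"],
    ["my", "elbow", "hurts"],
]
seeds = ["aspirin", "ibuprofen"]
held_out = ["codeine"]  # n = 1 remaining term

vectors = context_vectors(corpus, window=1)  # the "1+1" window setting
ranked = rank_candidates(vectors, seeds, max_size=2)
print(ranked[:3], recall_at(ranked, held_out, k=len(held_out)))
```

In this toy setup "codeine" occurs in exactly the same contexts as the two seed terms, so it is expected to rank first. Changing `window` or `max_size` corresponds to the window-size and cluster-size settings the abstract reports as category-dependent; evaluating at `k = n` and `k = 10 * n` mirrors the two candidate-list lengths.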

Mendeley readers

The data shown below were compiled from readership statistics for 19 Mendeley readers of this research output.

Geographical breakdown

Country Count As %
Unknown 19 100%

Demographic breakdown

Readers by professional status Count As %
Student > Master 4 21%
Researcher 3 16%
Student > Ph. D. Student 3 16%
Other 2 11%
Librarian 1 5%
Other 2 11%
Unknown 4 21%
Readers by discipline Count As %
Computer Science 6 32%
Medicine and Dentistry 3 16%
Social Sciences 2 11%
Agricultural and Biological Sciences 2 11%
Linguistics 1 5%
Other 3 16%
Unknown 2 11%
Attention Score in Context

This research output has an Altmetric Attention Score of 3. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 14 June 2022.
All research outputs
#7,413,245
of 22,663,969 outputs
Outputs from Journal of Biomedical Semantics
#145
of 364 outputs
Outputs of similar age
#115,084
of 322,207 outputs
Outputs of similar age from Journal of Biomedical Semantics
#7
of 9 outputs
Altmetric has tracked 22,663,969 research outputs across all sources so far. This one is in the 44th percentile – i.e., 44% of other outputs scored the same or lower than it.
So far Altmetric has tracked 364 research outputs from this source. They receive a mean Attention Score of 4.6. This one has gotten more attention than average, scoring higher than 53% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 322,207 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 51% of its contemporaries.
We're also able to compare this research output to 9 others from the same source and published within six weeks on either side of this one. This one has scored higher than 2 of them.