
A Maximum-Entropy approach for accurate document annotation in the biomedical domain

Overview of attention for article published in Journal of Biomedical Semantics, April 2012

Citations
10 Dimensions

Readers on
28 Mendeley
Title
A Maximum-Entropy approach for accurate document annotation in the biomedical domain
Published in
Journal of Biomedical Semantics, April 2012
DOI 10.1186/2041-1480-3-s1-s2
Pubmed ID
Authors

George Tsatsaronis, Natalia Macari, Sunna Torge, Heiko Dietze, Michael Schroeder

Abstract

The increasing volume of scientific literature on the Web and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of search and the quality of the results. Previous studies have shown that ontologies make it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and moves one step closer to the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with terms from the Medical Subject Headings (MeSH).

The experimental evaluation shows that the suggested Maximum Entropy approach for annotating biomedical documents with MeSH terms is highly accurate, robust to term ambiguity, and performs very well even when only a very small number of training documents is used. More precisely, we show that the proposed algorithm obtained an average F-measure of 92.4% (precision 99.41%, recall 86.77%) over the full range of explored terms (4,078 MeSH terms), and that its performance is resilient to term ambiguity, achieving an average F-measure of 92.42% (precision 99.32%, recall 86.87%) on the explored MeSH terms that were found to be ambiguous according to the Unified Medical Language System (UMLS) thesaurus. Finally, we compared the results of the suggested methodology with a Naive Bayes and a Decision Trees classification approach, and we show that the Maximum Entropy based approach achieved a higher F-measure on both ambiguous and monosemous MeSH terms.
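To make the idea of annotation as classification concrete, here is a minimal sketch of how a single MeSH term could be treated as a binary Maximum Entropy problem. It is not the authors' implementation: the abstract does not describe their features or training procedure, so the bag-of-words features, the stochastic gradient training, the toy documents, the target term ("Neoplasms") and all function names are assumptions chosen for illustration. The sketch relies only on the standard fact that a binary Maximum Entropy model is equivalent to logistic regression, and it computes the F-measure as the harmonic mean of precision and recall.

```python
import math
from collections import defaultdict


def featurize(text):
    """Binary bag-of-words features: the set of lowercase tokens (assumed feature set)."""
    return set(text.lower().split())


def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit a binary MaxEnt (logistic regression) model with stochastic
    gradient ascent on the log-likelihood. Returns (weights, bias)."""
    weights = defaultdict(float)
    bias = 0.0
    features = [featurize(d) for d in docs]
    for _ in range(epochs):
        for feats, y in zip(features, labels):
            score = bias + sum(weights[w] for w in feats)
            p = 1.0 / (1.0 + math.exp(-score))
            err = y - p  # gradient of the log-likelihood w.r.t. the score
            bias += lr * err
            for w in feats:
                weights[w] += lr * err
    return dict(weights), bias


def annotate(model, text, threshold=0.5):
    """Return True if the model assigns the term to the document."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in featurize(text))
    return 1.0 / (1.0 + math.exp(-score)) >= threshold


def precision_recall_f(gold, predicted):
    """Precision, recall and F-measure (harmonic mean) for binary labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy corpus: 1 = document should carry the term "Neoplasms".
DOCS = [
    "tumor growth in breast cancer patients",
    "metastatic tumor cells and cancer therapy",
    "cancer cells show tumor suppressor mutation",
    "protein folding in yeast cells",
    "gene expression in plant roots",
    "yeast growth under heat stress",
]
LABELS = [1, 1, 1, 0, 0, 0]

model = train_maxent(DOCS, LABELS)
predictions = [int(annotate(model, d)) for d in DOCS]
print(precision_recall_f(LABELS, predictions))
print(annotate(model, "novel cancer therapy targets tumor"))
```

In a full annotator, one such classifier would presumably be trained per MeSH term. Note also that an F-measure averaged over many terms need not equal the F-measure computed from the averaged precision and recall, so the reported averages cannot be derived from one another directly.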

Mendeley readers

The data shown below were compiled from readership statistics for 28 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
Unknown    28       100%

Demographic breakdown

Readers by professional status          Count    As %
Researcher                              7        25%
Student > Ph.D. Student                 4        14%
Professor                               2        7%
Student > Bachelor                      2        7%
Student > Master                        2        7%
Other                                   2        7%
Unknown                                 9        32%

Readers by discipline                   Count    As %
Agricultural and Biological Sciences    4        14%
Medicine and Dentistry                  3        11%
Computer Science                        2        7%
Engineering                             2        7%
Energy                                  1        4%
Other                                   3        11%
Unknown                                 13       46%