Thematic clustering of text documents using an EM-based approach

Overview of attention for article published in Journal of Biomedical Semantics, October 2012

Citations	9 (Dimensions)
Readers	20 (Mendeley)

Title	Thematic clustering of text documents using an EM-based approach
Published in	Journal of Biomedical Semantics, October 2012
DOI	10.1186/2041-1480-3-s3-s6
Pubmed ID	23046528
Authors	Sun Kim, W John Wilbur
Abstract	Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well.

Mendeley readers

The data shown below were compiled from readership statistics for the 20 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	20	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Ph. D. Student	5	25%
Researcher	5	25%
Student > Doctoral Student	2	10%
Professor	2	10%
Student > Master	2	10%
Other	2	10%
Unknown	2	10%

Readers by discipline	Count	As %
Computer Science	6	30%
Agricultural and Biological Sciences	3	15%
Engineering	2	10%
Biochemistry, Genetics and Molecular Biology	2	10%
Arts and Humanities	1	5%
Other	2	10%
Unknown	4	20%