Word2Vec inversion and traditional text classifiers for phenotyping lupus

Overview of attention for article published in BMC Medical Informatics and Decision Making, August 2017
Mentioned by
1 X user
Readers on
137 Mendeley
Title
Word2Vec inversion and traditional text classifiers for phenotyping lupus
Published in
BMC Medical Informatics and Decision Making, August 2017
DOI
10.1186/s12911-017-0518-1
Pubmed ID
Authors

Clayton A. Turner, Alexander D. Jacobs, Cassios K. Marques, James C. Oates, Diane L. Kamen, Paul E. Anderson, Jihad S. Obeid

Abstract

Identifying patients with certain clinical criteria based on manual chart review of doctors' notes is a daunting task given the massive amounts of text notes in the electronic health records (EHR). This task can be automated using text classifiers based on Natural Language Processing (NLP) techniques along with pattern recognition machine learning (ML) algorithms. The aim of this research is to evaluate the performance of traditional classifiers for identifying patients with Systemic Lupus Erythematosus (SLE) in comparison with a newer Bayesian word vector method.

We obtained clinical notes for patients with SLE diagnosis along with controls from the Rheumatology Clinic (662 total patients). Sparse bag-of-words (BOWs) and Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) matrices were produced using NLP pipelines. These matrices were subjected to several different NLP classifiers: neural networks, random forests, naïve Bayes, support vector machines, and Word2Vec inversion, a Bayesian inversion method. Performance was measured by calculating accuracy and area under the Receiver Operating Characteristic (ROC) curve (AUC) of a cross-validated (CV) set and a separate testing set.

We calculated the accuracy of the ICD-9 billing codes as a baseline to be 90.00% with an AUC of 0.900, the shallow neural network with CUIs to be 92.10% with an AUC of 0.970, the random forest with BOWs to be 95.25% with an AUC of 0.994, the random forest with CUIs to be 95.00% with an AUC of 0.979, and the Word2Vec inversion to be 90.03% with an AUC of 0.905.

Our results suggest that a shallow neural network with CUIs and random forests with both CUIs and BOWs are the best classifiers for this lupus phenotyping task. The Word2Vec inversion method failed to significantly beat the ICD-9 code classification, but yielded promising results. This method does not require explicit features and is more adaptable to non-binary classification tasks. The Word2Vec inversion is hypothesized to become more powerful with access to more data. Therefore, currently, the shallow neural networks and random forests are the desirable classifiers.
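The abstract reports two metrics for every classifier: accuracy and AUC. As a rough illustration of how those two numbers differ, the sketch below computes both for a toy set of labels and scores. The data, the 0.5 decision threshold, and the function names are assumptions made for this example; they are not taken from the study, and the pairwise AUC shown here is one standard formulation rather than the authors' actual evaluation code.

```python
def accuracy(labels, predictions):
    """Fraction of cases where the predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = case, 0 = control (not from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(labels, predictions):.3f}")
print(f"AUC      = {roc_auc(labels, scores):.3f}")
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two can move independently in a comparison like the one above.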

X Demographics

The data shown below were collected from the profile of 1 X user who shared this research output. Click here to find out more about how the information was compiled.
Mendeley readers

The data shown below were compiled from readership statistics for 137 Mendeley readers of this research output. Click here to see the associated Mendeley record.

Geographical breakdown

Country    Count    As %
Unknown    137      100%

Demographic breakdown

Readers by professional status    Count    As %
Researcher                        21       15%
Student > Ph. D. Student          16       12%
Student > Master                  12       9%
Student > Bachelor                8        6%
Other                             8        6%
Other                             22       16%
Unknown                           50       36%

Readers by discipline                           Count    As %
Medicine and Dentistry                          24       18%
Computer Science                                19       14%
Engineering                                     10       7%
Agricultural and Biological Sciences            5        4%
Biochemistry, Genetics and Molecular Biology    4        3%
Other                                           16       12%
Unknown                                         59       43%

Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 15 December 2017.
All research outputs: #18,578,649 of 23,011,300 outputs
Outputs from BMC Medical Informatics and Decision Making: #1,583 of 2,008 outputs
Outputs of similar age: #243,401 of 317,360 outputs
Outputs of similar age from BMC Medical Informatics and Decision Making: #22 of 30 outputs
Altmetric has tracked 23,011,300 research outputs across all sources so far. This one is in the 11th percentile – i.e., 11% of other outputs scored the same or lower than it.
So far Altmetric has tracked 2,008 research outputs from this source. They receive a mean Attention Score of 4.9. This one is in the 9th percentile – i.e., 9% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 317,360 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 12th percentile – i.e., 12% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 30 others from the same source and published within six weeks on either side of this one. This one is in the 10th percentile – i.e., 10% of its contemporaries scored the same or lower than it.
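The percentile statements above are all phrased the same way: the share of other outputs that scored the same or lower. A minimal sketch of that definition, read literally from the wording on this page, might look like the following. The function name and the comparison scores are hypothetical, and Altmetric's actual calculation may differ in details that the page does not describe.

```python
def percentile_same_or_lower(score, other_scores):
    """Percentage of the other outputs whose score is the same or lower."""
    if not other_scores:
        raise ValueError("need at least one other output to compare against")
    same_or_lower = sum(1 for s in other_scores if s <= score)
    return 100.0 * same_or_lower / len(other_scores)


# Hypothetical Attention Scores for ten other outputs (not real data).
others = [0, 1, 2, 3, 5, 8, 12, 20, 40, 75]

# An output with a score of 1 is matched or beaten by only two of them.
print(percentile_same_or_lower(1, others))
```

Because outputs with the same score count toward the percentile, a comparison set with many tied low scores can place an output at a different percentile than its rank position alone would suggest.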