Title | Recognizing chemicals in patents: a comparative analysis |
---|---|
Published in | Journal of Cheminformatics, October 2016 |
DOI | 10.1186/s13321-016-0172-0 |
Pubmed ID | |
Authors | Maryam Habibi, David Luis Wiegandt, Florian Schmedding, Ulf Leser |
Abstract | Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly important due to the high economic value of pharmaceutical findings. However, NER on patents has long been neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results. |
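
The abstract mentions high-recall and high-precision ensembles of the two tools without saying how they are built. A minimal sketch of one common construction follows, assuming (this is not confirmed by the abstract) that the high-recall ensemble is the union of both tools' predicted mentions and the high-precision ensemble is their intersection. Mentions are represented as hypothetical `(start, end)` character offsets, and all data below is made up for illustration; it does not reproduce any result from the paper.

```python
from typing import Set, Tuple

# A mention is a (start, end) character-offset pair; exact-match comparison is assumed.
Span = Tuple[int, int]


def union_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep a mention if either tool found it (favours recall)."""
    return a | b


def intersection_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep a mention only if both tools found it (favours precision)."""
    return a & b


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted mentions against gold annotations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative, invented annotations for one document.
gold = {(0, 7), (12, 20), (25, 31)}
tool_a = {(0, 7), (12, 20), (40, 45)}   # stand-in for one tool's output
tool_b = {(0, 7), (25, 31), (50, 55)}   # stand-in for the other tool's output

high_recall = union_ensemble(tool_a, tool_b)
high_precision = intersection_ensemble(tool_a, tool_b)

for name, spans in [("union", high_recall), ("intersection", high_precision)]:
    p, r = precision_recall(spans, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setup the union recovers every gold mention at the cost of extra false positives, while the intersection keeps only mentions both tools agree on. Real evaluations may also use relaxed (overlap-based) matching, which this sketch does not cover.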
Twitter Demographics
Geographical breakdown
Country | Count | As % |
---|---|---|
Switzerland | 1 | 17% |
Netherlands | 1 | 17% |
Canada | 1 | 17% |
United States | 1 | 17% |
Unknown | 2 | 33% |
Demographic breakdown
Type | Count | As % |
---|---|---|
Members of the public | 4 | 67% |
Scientists | 1 | 17% |
Science communicators (journalists, bloggers, editors) | 1 | 17% |
Mendeley readers
Geographical breakdown
Country | Count | As % |
---|---|---|
United Kingdom | 1 | 3% |
Netherlands | 1 | 3% |
Denmark | 1 | 3% |
Germany | 1 | 3% |
Unknown | 27 | 87% |
Demographic breakdown
Readers by professional status | Count | As % |
---|---|---|
Researcher | 10 | 32% |
Student > Master | 7 | 23% |
Student > Ph. D. Student | 4 | 13% |
Other | 3 | 10% |
Librarian | 1 | 3% |
Other | 2 | 6% |
Unknown | 4 | 13% |
Readers by discipline | Count | As % |
---|---|---|
Computer Science | 10 | 32% |
Chemistry | 5 | 16% |
Pharmacology, Toxicology and Pharmaceutical Science | 3 | 10% |
Agricultural and Biological Sciences | 3 | 10% |
Business, Management and Accounting | 1 | 3% |
Other | 6 | 19% |
Unknown | 3 | 10% |