Title | Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses |
---|---|
Published in | BMC Bioinformatics, February 2008 |
DOI | 10.1186/1471-2105-9-s1-s7 |
Pubmed ID | |
Authors | Olivo Miotto, Tin Wee Tan, Vladimir Brusic |
Abstract | The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation. |
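
To make the idea of combining structural and semantic rules more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that a record carries a strain name in the conventional influenza form `A/host/location/isolate/year(subtype)`, with the host field omitted for human isolates. The function name `parse_strain`, the `HOST_SYNONYMS` table, and the output field names are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical semantic rule: map free-text host terms to a canonical label.
# A real curation effort would need a far larger, reviewed vocabulary.
HOST_SYNONYMS = {
    "chicken": "avian",
    "duck": "avian",
    "goose": "avian",
    "swine": "swine",
    "pig": "swine",
    "equine": "equine",
    "horse": "equine",
}

# Structural rule: a strain name starting with "A/", optionally followed
# by a subtype such as (H5N1) in parentheses.
STRAIN_RE = re.compile(
    r"^\s*(?P<name>A/[^()]+?)\s*(?:\((?P<subtype>H\d+N\d+)\))?\s*$",
    re.IGNORECASE,
)


def parse_strain(text: str) -> Optional[dict]:
    """Extract subtype, host, location, isolate, and year from a strain name.

    Returns None when the text does not fit the assumed structure.
    """
    match = STRAIN_RE.match(text)
    if match is None:
        return None

    fields = [field.strip() for field in match.group("name").split("/")]
    if len(fields) == 5:
        _, host_raw, location, isolate, year_raw = fields
    elif len(fields) == 4:
        # Assumption: a missing host field denotes a human isolate.
        _, location, isolate, year_raw = fields
        host_raw = "human"
    else:
        return None

    # Semantic rule: normalise the host term, falling back to the raw value.
    host = HOST_SYNONYMS.get(host_raw.lower(), host_raw.lower())

    # Accept only unambiguous four-digit years; leave anything else unset.
    year = int(year_raw) if re.fullmatch(r"(?:19|20)\d{2}", year_raw) else None

    subtype = match.group("subtype")
    return {
        "subtype": subtype.upper() if subtype else None,
        "host": host,
        "location": location,
        "isolate": isolate,
        "year": year,
    }
```

For example, `parse_strain("A/chicken/Hong Kong/220/1997(H5N1)")` yields the subtype `H5N1`, the host `avian`, the location `Hong Kong`, the isolate `220`, and the year `1997`. Ambiguous values, such as a two-digit year, are left as `None` rather than guessed, which keeps questionable records visible for later review.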
Mendeley readers
Geographical breakdown
Country | Count | As % |
---|---|---|
United States | 1 | 4% |
Netherlands | 1 | 4% |
Tanzania, United Republic of | 1 | 4% |
Unknown | 24 | 89% |
Demographic breakdown
Readers by professional status | Count | As % |
---|---|---|
Researcher | 5 | 19% |
Student > Bachelor | 4 | 15% |
Student > Master | 4 | 15% |
Student > Ph. D. Student | 4 | 15% |
Student > Doctoral Student | 1 | 4% |
Other | 3 | 11% |
Unknown | 6 | 22% |
Readers by discipline | Count | As % |
---|---|---|
Agricultural and Biological Sciences | 7 | 26% |
Computer Science | 4 | 15% |
Immunology and Microbiology | 2 | 7% |
Engineering | 2 | 7% |
Medicine and Dentistry | 2 | 7% |
Other | 4 | 15% |
Unknown | 6 | 22% |