Title | Improving protein function prediction methods with integrated literature data |
---|---|
Published in | BMC Bioinformatics, April 2008 |
DOI | 10.1186/1471-2105-9-198 |
Pubmed ID | |
Authors | Aaron P Gabow, Sonia M Leach, William A Baumgartner, Lawrence E Hunter, Debra S Goldberg |
Abstract | Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes, which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability outperforms the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial.
Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit. |
Mendeley readers
Geographical breakdown
Country | Count | As % |
---|---|---|
United States | 4 | 9% |
Israel | 1 | 2% |
Italy | 1 | 2% |
Canada | 1 | 2% |
Unknown | 38 | 84% |
Demographic breakdown
Readers by professional status | Count | As % |
---|---|---|
Researcher | 10 | 22% |
Student > Master | 10 | 22% |
Student > Ph. D. Student | 7 | 16% |
Professor | 3 | 7% |
Other | 3 | 7% |
Other | 7 | 16% |
Unknown | 5 | 11% |
Readers by discipline | Count | As % |
---|---|---|
Agricultural and Biological Sciences | 17 | 38% |
Computer Science | 10 | 22% |
Biochemistry, Genetics and Molecular Biology | 5 | 11% |
Medicine and Dentistry | 2 | 4% |
Sports and Recreations | 1 | 2% |
Other | 2 | 4% |
Unknown | 8 | 18% |