Title | MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification |
---|---|
Published in | BioData Mining, December 2016 |
DOI | 10.1186/s13040-016-0116-2 |
Pubmed ID | |
Authors | Giulia Fiscon, Emanuel Weitschek, Eleonora Cella, Alessandra Lo Presti, Marta Giovanetti, Muhammed Babakir-Mina, Marco Ciotti, Massimo Ciccozzi, Alessandra Pierangeli, Paola Bertolazzi, Giovanni Felici |
Abstract | Continuous improvements in next generation sequencing technologies led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple alignments-, phylogenetic trees-, statistical- and character-based methods. We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences. We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived by the relative location and abundance of such subsequences along the different regions. |
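
To make the idea of a species-specific subsequence of bounded length concrete, here is a minimal Python sketch. It is not the genetic algorithm described in the abstract: it swaps in a brute-force enumeration that only makes sense for tiny inputs. The function names, the `min_len`/`max_len` parameters, and the toy sequences are all assumptions made for illustration and do not come from the paper.

```python
from typing import Dict, List, Set


def substrings(seq: str, min_len: int, max_len: int) -> Set[str]:
    """Return every substring of seq whose length lies in [min_len, max_len]."""
    found: Set[str] = set()
    for k in range(min_len, max_len + 1):
        for i in range(len(seq) - k + 1):
            found.add(seq[i:i + k])
    return found


def discriminating_subsequences(
    samples: Dict[str, List[str]],
    target: str,
    min_len: int = 3,
    max_len: int = 20,
) -> Set[str]:
    """Substrings found in every sequence of `target` and in no other class.

    Brute-force illustration only (hypothetical helper, not the MISSEL method).
    """
    # Candidates: substrings shared by all sequences of the target class.
    shared = set.intersection(
        *(substrings(s, min_len, max_len) for s in samples[target])
    )
    # Discard any candidate that also occurs in a sequence of another class.
    for label, seqs in samples.items():
        if label == target:
            continue
        for s in seqs:
            shared = {sub for sub in shared if sub not in s}
    return shared


# Made-up toy data: two classes with two short sequences each.
toy_samples = {
    "A": ["ACGTACGGA", "TTACGGAC"],
    "B": ["ACGTTTGCA", "GGTACGTAA"],
}

if __name__ == "__main__":
    print(sorted(discriminating_subsequences(toy_samples, "A", 4, 5)))
```

On this toy input, several short substrings separate class A from class B, which mirrors the abstract's point that many small subsequences can carry the same discriminating information. Exhaustive enumeration grows quickly with sequence length and count, which is presumably one reason a heuristic search such as a genetic algorithm is attractive on real genomic data.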
Mendeley readers
Geographical breakdown
Country | Count | As % |
---|---|---|
Germany | 1 | 6% |
Unknown | 17 | 94% |
Demographic breakdown
Readers by professional status | Count | As % |
---|---|---|
Student > Ph. D. Student | 5 | 28% |
Researcher | 3 | 17% |
Student > Doctoral Student | 2 | 11% |
Student > Master | 2 | 11% |
Professor | 1 | 6% |
Other | 1 | 6% |
Unknown | 4 | 22% |
Readers by discipline | Count | As % |
---|---|---|
Agricultural and Biological Sciences | 4 | 22% |
Computer Science | 3 | 17% |
Medicine and Dentistry | 2 | 11% |
Biochemistry, Genetics and Molecular Biology | 1 | 6% |
Immunology and Microbiology | 1 | 6% |
Other | 1 | 6% |
Unknown | 6 | 33% |