Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads

Overview of attention for article published in Algorithms for Molecular Biology, February 2017
Mentioned by: 2 X users
Citations: 20 (Dimensions)
Readers: 57 (Mendeley)
Title: Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads
Published in: Algorithms for Molecular Biology, February 2017
DOI: 10.1186/s13015-017-0091-2
Authors: Leandro Lima, Blerina Sinaimeri, Gustavo Sacomoto, Helene Lopez-Maestre, Camille Marchet, Vincent Miele, Marie-France Sagot, Vincent Lacroix

Abstract

The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99-111), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644-652), and Oases (Schulz et al. in Bioinformatics 28(8):1086-1092), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and that assemblers can infer erroneous transcripts when they try to traverse these regions, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134-1144) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.
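
The abstract's point that repeats create complicated regions in a DBG can be illustrated with a toy example. The sketch below is not the authors' model or measure: it builds a plain de Bruijn graph from two made-up sequences that share the repeat ACGTAC and lists the nodes where the graph branches. The function names, the sequences, and the choice k = 4 are all assumptions made for illustration, and each sequence is treated as a single error-free read.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Return nodes with more than one successor or more than one predecessor."""
    preds = defaultdict(set)
    for node, successors in graph.items():
        for succ in successors:
            preds[succ].add(node)
    nodes = set(graph) | set(preds)
    return {
        n for n in nodes
        if len(graph.get(n, ())) > 1 or len(preds.get(n, ())) > 1
    }

# Hypothetical toy sequences: different flanks around the shared repeat ACGTAC.
READ_A = "GGA" + "ACGTAC" + "TTG"
READ_B = "TTC" + "ACGTAC" + "CAG"

alone = branching_nodes(build_dbg([READ_A], 4))
together = branching_nodes(build_dbg([READ_A, READ_B], 4))

print(sorted(alone))     # [] : a single sequence forms a simple path
print(sorted(together))  # ['ACG', 'TAC'] : the graph branches at both ends of the repeat
```

With one sequence the graph is a simple path. Once the second sequence is added, the two paths merge on entering the repeat and split on leaving it, so an assembler walking the graph has to choose between flanks at those nodes. That choice is where a chimeric transcript could arise.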

X Demographics

The data for this section were collected from the profiles of the 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 57 Mendeley readers of this research output.

Geographical breakdown

Country    Count    As %
France         1      2%
Unknown       56     98%

Demographic breakdown

Readers by professional status                  Count    As %
Student > Master                                   13     23%
Student > Ph. D. Student                            8     14%
Researcher                                          7     12%
Unspecified                                         5      9%
Student > Bachelor                                  4      7%
Other                                               8     14%
Unknown                                            12     21%

Readers by discipline                           Count    As %
Biochemistry, Genetics and Molecular Biology       14     25%
Agricultural and Biological Sciences                9     16%
Computer Science                                    9     16%
Unspecified                                         5      9%
Immunology and Microbiology                         2      4%
Other                                               6     11%
Unknown                                            12     21%
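
The "As %" column appears to be each count divided by the 57 readers, rounded to a whole percent. The sketch below reproduces that arithmetic for the professional-status rows, assuming plain rounding to the nearest integer; the function and variable names are hypothetical, and the rounding rule is inferred from the figures rather than documented on the page.

```python
def percent_shares(counts):
    """Map each label to its share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

# Counts copied from the "Readers by professional status" rows.
STATUS_COUNTS = {
    "Student > Master": 13,
    "Student > Ph. D. Student": 8,
    "Researcher": 7,
    "Unspecified": 5,
    "Student > Bachelor": 4,
    "Other": 8,
    "Unknown": 12,
}

for label, pct in percent_shares(STATUS_COUNTS).items():
    print(f"{label}: {pct}%")
```

The counts sum to 57, and the rounded shares match the percentages listed for each row.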
Attention Score in Context

This research output has an Altmetric Attention Score of 1. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 15 March 2017.
All research outputs: #18,536,772 of 22,958,253 outputs
Outputs from Algorithms for Molecular Biology: #197 of 264 outputs
Outputs of similar age: #238,146 of 311,178 outputs
Outputs of similar age from Algorithms for Molecular Biology: #7 of 7 outputs
Altmetric has tracked 22,958,253 research outputs across all sources so far. This one is in the 11th percentile – i.e., 11% of other outputs scored the same or lower than it.
So far Altmetric has tracked 264 research outputs from this source. They receive a mean Attention Score of 3.1. This one is in the 12th percentile – i.e., 12% of its peers scored the same or lower than it.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 311,178 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 12th percentile – i.e., 12% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 7 others from the same source and published within six weeks on either side of this one.