
NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly

Overview of attention for article published in BMC Bioinformatics, November 2014

About this Attention Score

  • In the top 25% of all research outputs scored by Altmetric
  • High Attention Score compared to outputs of the same age (91st percentile)
  • High Attention Score compared to outputs of the same age and source (89th percentile)

Mentioned by

25 X users

Citations

17 Dimensions

Readers on

86 Mendeley
1 CiteULike
Title
NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly
Published in
BMC Bioinformatics, November 2014
DOI 10.1186/s12859-014-0357-3
Pubmed ID
Authors

Jamison M McCorrison, Pratap Venepally, Indresh Singh, Derrick E Fouts, Roger S Lasken, Barbara A Methé

Abstract

Background

Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

Results

Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median kmer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest coverage, segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input using sequence before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.

Conclusions

The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.
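
The idea of binning reads by their median kmer frequency (RMKF), as described in the Results, can be illustrated with a minimal Python sketch. This is not NeatFreq's implementation: the function names, the kmer size `k`, the `target` coverage parameter, and the rule of keeping roughly a `target / RMKF` fraction of each high-coverage bin are all assumptions made for illustration. The uniqueness-based selection and paired-end handling mentioned in the abstract are left out.

```python
import math
from collections import Counter, defaultdict
from statistics import median


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each kmer occurs across the whole read set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_median_kmer_frequency(read, counts, k):
    """Median of the data-set-wide frequencies of the kmers in one read."""
    freqs = [counts[km] for km in kmers(read, k)]
    return median(freqs) if freqs else 0


def reduce_by_rmkf(reads, k, target):
    """Bin reads by RMKF and thin out bins above the target coverage.

    Hypothetical selection rule: bins at or below `target` are kept whole;
    a bin with RMKF f > target keeps its first ceil(n * target / f) reads.
    """
    counts = count_kmers(reads, k)
    bins = defaultdict(list)
    for read in reads:
        bins[read_median_kmer_frequency(read, counts, k)].append(read)

    kept = []
    for freq, members in sorted(bins.items()):
        if freq <= target:
            kept.extend(members)  # low-coverage reads are always retained
        else:
            quota = math.ceil(len(members) * target / freq)
            kept.extend(members[:quota])
    return kept


if __name__ == "__main__":
    # Ten copies of one read simulate a coverage spike; one read is unique.
    reads = ["ACGTACGT"] * 10 + ["TTTTGGGG"]
    reduced = reduce_by_rmkf(reads, k=4, target=2)
    print(len(reads), "reads reduced to", len(reduced))
```

In this toy input the repeated read falls into a high-RMKF bin and is thinned, while the unique low-coverage read is always retained, which mirrors the abstract's emphasis on preserving the lowest coverage segments.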

X Demographics


The data shown below were collected from the profiles of 25 X users who shared this research output.
Mendeley readers


The data shown below were compiled from readership statistics for 86 Mendeley readers of this research output.

Geographical breakdown

Country Count As %
Germany 2 2%
United States 2 2%
France 1 1%
Netherlands 1 1%
Japan 1 1%
Sweden 1 1%
Unknown 78 91%

Demographic breakdown

Readers by professional status Count As %
Researcher 26 30%
Student > Ph. D. Student 17 20%
Student > Master 13 15%
Student > Bachelor 5 6%
Other 3 3%
Other 10 12%
Unknown 12 14%
Readers by discipline Count As %
Agricultural and Biological Sciences 43 50%
Biochemistry, Genetics and Molecular Biology 12 14%
Computer Science 10 12%
Environmental Science 4 5%
Mathematics 1 1%
Other 3 3%
Unknown 13 15%
Attention Score in Context


This research output has an Altmetric Attention Score of 15. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 27 August 2015.
All research outputs
#2,169,578
of 23,498,099 outputs
Outputs from BMC Bioinformatics
#562
of 7,400 outputs
Outputs of similar age
#31,335
of 366,454 outputs
Outputs of similar age from BMC Bioinformatics
#15
of 137 outputs
Altmetric has tracked 23,498,099 research outputs across all sources so far. Compared to these, this one has done particularly well and is in the 90th percentile: it's in the top 10% of all research outputs ever tracked by Altmetric.
So far Altmetric has tracked 7,400 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.4. This one has done particularly well, scoring higher than 92% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 366,454 tracked outputs that were published within six weeks on either side of this one in any source. This one has done particularly well, scoring higher than 91% of its contemporaries.
We're also able to compare this research output to 137 others from the same source and published within six weeks on either side of this one. This one has done well, scoring higher than 89% of its contemporaries.
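
The percentages quoted above are consistent with the listed ranks if each percentage is taken as the share of outputs ranked below this one, rounded down. The short Python sketch below checks that arithmetic. The formula and the function name `percent_outscored` are assumptions for illustration, not Altmetric's documented method.

```python
def percent_outscored(rank, total):
    """Whole-number percentage of outputs ranked below `rank` out of `total`.

    Assumed formula: floor(100 * (total - rank) / total), computed with
    integer arithmetic to avoid floating-point rounding.
    """
    return (total - rank) * 100 // total


# (rank, total) pairs taken from the figures listed on this page.
contexts = {
    "All research outputs": (2_169_578, 23_498_099),
    "Outputs from BMC Bioinformatics": (562, 7_400),
    "Outputs of similar age": (31_335, 366_454),
    "Outputs of similar age from BMC Bioinformatics": (15, 137),
}

for label, (rank, total) in contexts.items():
    print(f"{label}: higher than {percent_outscored(rank, total)}% of outputs")
```

Under this assumption the four contexts give 90, 92, 91 and 89 percent, matching the figures in the text.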