A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis

Overview of attention for article published in Human Genomics, July 2016

About this Attention Score

  • Average Attention Score compared to outputs of the same age
  • Average Attention Score compared to outputs of the same age and source

Mentioned by

  • 7 X users

Citations

  • 21 Dimensions

Readers on

  • 52 Mendeley
Title
A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis
Published in
Human Genomics, July 2016
DOI 10.1186/s40246-016-0068-0
Pubmed ID
Authors

Isaac Akogwu, Nan Wang, Chaoyang Zhang, Ping Gong

Abstract

Advances in high-throughput next-generation sequencing (NGS) have opened innumerable opportunities for new genomic research. However, the abundance of NGS data makes it harder to distinguish true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of how dataset features such as read length, genome size, and coverage depth affect their performance is lacking. This comparative study aims to investigate the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and to provide recommendations for users selecting suitable methods for specific NGS datasets. Six k-spectrum-based methods (Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel) were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). The Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. Results from computational experiments indicate that Musket had the best overall performance across the spectrum of variations represented in the six datasets. Musket's lowest accuracy (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50×), and a small genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. This study demonstrates that factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods.
Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.
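
The abstract lists recall, precision, gain, and F-score without defining them. The sketch below shows how such metrics are conventionally derived from true-positive, false-positive, and false-negative counts in the error correction literature. It is an illustration only: the abstract does not describe how ECET computes them, the names `CorrectionCounts` and `correction_metrics` are hypothetical, and the counts in the usage example are made up rather than taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionCounts:
    """Base-level outcome counts for one error correction run (hypothetical container)."""
    tp: int  # erroneous bases that were correctly fixed
    fp: int  # correct bases that were wrongly changed
    fn: int  # erroneous bases that were left uncorrected


def correction_metrics(c: CorrectionCounts) -> dict:
    """Compute recall, precision, gain, and F-score under their conventional definitions.

    Assumed definitions (not confirmed by the abstract):
      recall    = TP / (TP + FN)
      precision = TP / (TP + FP)
      gain      = (TP - FP) / (TP + FN)
      F-score   = 2 * precision * recall / (precision + recall)
    Ratios with a zero denominator are reported as 0.0.
    """
    total_errors = c.tp + c.fn      # all bases that needed correcting
    total_changes = c.tp + c.fp     # all bases the tool modified

    recall = c.tp / total_errors if total_errors else 0.0
    precision = c.tp / total_changes if total_changes else 0.0
    gain = (c.tp - c.fp) / total_errors if total_errors else 0.0

    pr_sum = precision + recall
    f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0

    return {
        "recall": recall,
        "precision": precision,
        "gain": gain,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    example = CorrectionCounts(tp=80, fp=10, fn=20)
    for name, value in correction_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, gain can be negative: a tool that introduces more errors than it removes (FP > TP) leaves the data worse than it found it, which recall and precision alone do not make obvious.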

X Demographics

The data shown below were collected from the profiles of 7 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 52 Mendeley readers of this research output.

Geographical breakdown

Country          Count   As %
United States    1       2%
Netherlands      1       2%
Sweden           1       2%
France           1       2%
Unknown          48      92%

Demographic breakdown

Readers by professional status      Count   As %
Researcher                          14      27%
Student > Ph. D. Student            9       17%
Student > Bachelor                  7       13%
Student > Master                    7       13%
Professor > Associate Professor     3       6%
Other                               7       13%
Unknown                             5       10%

Readers by discipline                           Count   As %
Biochemistry, Genetics and Molecular Biology    13      25%
Agricultural and Biological Sciences            12      23%
Computer Science                                8       15%
Medicine and Dentistry                          2       4%
Engineering                                     2       4%
Other                                           7       13%
Unknown                                         8       15%

Attention Score in Context

This research output has an Altmetric Attention Score of 3. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 01 August 2016.

  • All research outputs: #8,534,528 of 25,373,627 outputs
  • Outputs from Human Genomics: #211 of 564 outputs
  • Outputs of similar age: #136,977 of 379,928 outputs
  • Outputs of similar age from Human Genomics: #7 of 12 outputs

Altmetric has tracked 25,373,627 research outputs across all sources so far. This one is in the 43rd percentile – i.e., 43% of other outputs scored the same or lower than it.
So far Altmetric has tracked 564 research outputs from this source. They typically receive more attention than average, with a mean Attention Score of 7.6. This one has gotten more attention than average, scoring higher than 53% of its peers.
Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 379,928 tracked outputs that were published within six weeks on either side of this one in any source. This one is in the 47th percentile – i.e., 47% of its contemporaries scored the same or lower than it.
We're also able to compare this research output to 12 others from the same source and published within six weeks on either side of this one. This one is in the 41st percentile – i.e., 41% of its contemporaries scored the same or lower than it.