Report for: Fast randomized approximate string matching with succinct hash data structures


Title	Fast randomized approximate string matching with succinct hash data structures
Published in	BMC Bioinformatics, June 2015
DOI	10.1186/1471-2105-16-s9-s4
Pubmed ID	26051265
Authors	Alberto Policriti, Nicola Prezza
Abstract	The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever greater algorithmic challenges for aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used be light. Available modern solutions are usually a compromise between these constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. In this work we describe a technique that attains the advantages of both classes of indexes. This is achieved using Hamming-aware hash functions (hash functions designed to search the entire Hamming sphere in reduced time), which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space while introducing only a logarithmic slowdown (in the query length) for the lookup operation. Our data structure reaches its goals without compressing its input, which is another positive feature, since in biological applications data is often very close to incompressible. The new data structure introduced in this work is called dB-hash, and we show how its implementation, BW-ERNE, maintains the high sensitivity and speed of its hash-based predecessor ERNE while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data show that BW-ERNE attains the positive features of both succinct data structures (small space) and hash indexes (sensitivity).
In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughput and memory footprints. In this work we show that, by combining hashing and succinct indexing techniques, we can attain good performance and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
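
The abstract does not spell out the hash function itself, so the following is only an illustrative sketch of what "Hamming-aware" can mean, under an assumed XOR-folding scheme: the pattern is cut into blocks of `w` characters (2 bits each) and the blocks are XOR-ed together. Because such a hash is linear with respect to XOR, a single substitution changes exactly one slot of the hash, so the hashes of every string within Hamming distance `k` can be enumerated from the hash of the pattern alone. The names `xor_fold_hash` and `ball_hashes`, the 2-bit encoding, and the parameters `w` and `k` are all assumptions made for this example, not the paper's API.

```python
from itertools import product

# Assumed 2-bit encoding of DNA characters (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def xor_fold_hash(s, w):
    """XOR-fold s into w slots: character i is XOR-ed into slot i % w."""
    slots = [0] * w
    for i, ch in enumerate(s):
        slots[i % w] ^= CODE[ch]
    return tuple(slots)


def ball_hashes(s, w, k):
    """Return a set containing the hash of every string within Hamming
    distance k of s, computed without enumerating those strings.

    Each substitution XORs a nonzero 2-bit value into one slot, so it is
    enough to apply up to k such slot changes to the hash of s. For short
    patterns this may be a superset of the hashes actually reachable.
    """
    base = xor_fold_hash(s, w)
    result = {base}
    for j in range(1, k + 1):
        for slots in product(range(w), repeat=j):
            for deltas in product((1, 2, 3), repeat=j):
                h = list(base)
                for slot, delta in zip(slots, deltas):
                    h[slot] ^= delta
                result.add(tuple(h))
    return result


if __name__ == "__main__":
    pattern = "ACGTTGCA"
    candidates = ball_hashes(pattern, w=4, k=1)
    print(len(candidates), "candidate hash values for distance <= 1")
```

In this sketch the number of candidate hash values depends on `w` and `k` but not on the pattern length, which is the sense in which the search over the Hamming sphere is reduced; the candidates would still need to be verified against the text.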


X Demographics

The data shown below were collected from the profiles of 5 X users who shared this research output.

Geographical breakdown

Country	Count	As %
Russia	1	20%
United Kingdom	1	20%
United States	1	20%
Unknown	2	40%

Demographic breakdown

Type	Count	As %
Members of the public	3	60%
Scientists	1	20%
Practitioners (doctors, other healthcare professionals)	1	20%

Mendeley readers

The data shown below were compiled from readership statistics for 7 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
Unknown	7	100%

Demographic breakdown

Readers by professional status	Count	As %
Student > Ph. D. Student	3	43%
Student > Master	2	29%
Researcher	1	14%
Lecturer	1	14%

Readers by discipline	Count	As %
Computer Science	4	57%
Agricultural and Biological Sciences	3	43%

Attention Score in Context

This research output has an Altmetric Attention Score of 3. This is our high-level measure of the quality and quantity of online attention that it has received. This Attention Score, as well as the ranking and number of research outputs shown below, was calculated when the research output was last mentioned on 31 December 2016.

Context	Rank	Out of
All research outputs	#13,202,980	22,808,725 outputs
Outputs from BMC Bioinformatics	#4,001	7,284 outputs
Outputs of similar age	#123,365	267,542 outputs
Outputs of similar age from BMC Bioinformatics	#77	127 outputs

Altmetric has tracked 22,808,725 research outputs across all sources so far. This one is in the 41st percentile – i.e., 41% of other outputs scored the same or lower than it.

So far Altmetric has tracked 7,284 research outputs from this source. They typically receive a little more attention than average, with a mean Attention Score of 5.4. This one is in the 42nd percentile – i.e., 42% of its peers scored the same or lower than it.

Older research outputs will score higher simply because they've had more time to accumulate mentions. To account for age we can compare this Altmetric Attention Score to the 267,542 tracked outputs that were published within six weeks on either side of this one in any source. This one has gotten more attention than average, scoring higher than 53% of its contemporaries.

We're also able to compare this research output to 127 others from the same source and published within six weeks on either side of this one. This one is in the 36th percentile – i.e., 36% of its contemporaries scored the same or lower than it.
