Zsuzsanna Lipták, Martina Lucà, Francesco Masillo and Simon J. Puglisi
Fast Matching Statistics for Sets of Long Similar Strings
Abstract:
Matching statistics (MS) computation is at the heart of numerous bioinformatics applications, from read alignment to computing phylogenies of a set of genomes or even speeding up the computation of core data structures on collections of genomes. Many of these datasets have the property of being highly similar to the reference, which itself, however, may not be very repetitive. Some heuristics based on sequence-to-sequence similarity have already been studied in [Lipták et al., Alg. Mol. Biol. 2024], leading to a significant speedup in the computation of the matching statistics. In this paper, we introduce a new heuristic that further speeds MS computation. The core idea is to take advantage of existing similarities between the input sequences and the reference. We give an implementation making use of this heuristic, which also allows the use of multiple threads to parallelize MS computation. We give an experimental evaluation of our tool, LRF-ms, comparing it to other MS computation tools, on publicly available genomic datasets, and show that it is the fastest when the collection of genomes is highly similar to the reference string, while keeping a comparably low memory footprint.