Ondřej Sladký, Pavel Veselý and Karel Břinda
Towards Efficient k-Mer Set Operations via Function-Assigned Masked Superstrings
Abstract: |
The design of efficient dynamic data structures for large $k$-mer sets belongs to central challenges of sequence bioinformatics. Recent advances in compact $k$-mer set representations via Spectrum-Preserving String Sets (SPSS), culminating with the masked superstring framework, have provided data structures of remarkable space efficiency for wide ranges of $k$-mer sets. However, the possibility to perform set operations with the resulting indexes has remained limited due to the static nature of the underlying compact representations. Here, we develop $f$-masked superstrings, a concept combining masked superstrings with custom demasking functions $f$ to enable $k$-mer set operations based on index merging. Combined with the FMSI index for masked superstrings, we obtain a memory-efficient $k$-mer membership index and compressed dictionary supporting set operations via Burrows-Wheeler Transform merging. The framework provides a promising theoretical solution to a pressing bioinformatics problem and highlights the potential of $f$-masked superstrings to become an elementary data type for $k$-mer sets. |
Download paper: | ![]() |
![]() |
![]() |
PostScript | BibTeX reference |