The Prague Stringology Conference 2008

Manolis Christodoulakis and Gerhard Brey

Edit Distance with Single-Symbol Combinations and Splits

Abstract:
In this article we introduce new variants of the edit distance string similarity measure, where apart from the traditional insertion, deletion and substitution operations, two new operations are supported. The first one is called a combination and it allows two or more symbols from one string, to be ``matched'' against one symbol of the other. The dual of a combination, is the operation of a split, where one symbol from the first string is broken down into a sequence of two or more other symbols, that can then be matched against an equal number of symbols from the second string. The notions of combining and splitting symbols can be defined in a variety of ways, depending on how the application in hand defines similarity. Here we introduce three different possible definitions, and we provide an algorithm that deals with one of them. Our algorithm requires O(L) time for preprocessing, and O(m n k) time for computing the edit distance, where L is the total length of all the valid combinations/splits, and k is an upper bound on the number of valid splits of any single symbol.

Download paper: Article in PostScript Article in PDF BibTeX Reference
 PostScript   PDF   BibTeX reference 
Download presentation: Presentation