The Prague Stringology Club Workshop 2000

Miroslav Balík

Condensation Principle

Abstract:
The most important contribution of this paper is the discovery and description of an effective implementation of a finite automaton that accepts all substrings of a given text. The ratio of the size of this automaton with respect to the size of the input text is in standard implementations 10 or more, see [AndNil95], in [Bal98] is described implementation with ratio less than 4. This implementation does not increase the time to search for a pattern, which is proportional to the length of the pattern.

The major approach introduced in this paper uses transformation of a text into a new alphabet that contains more symbols than the original alphabet and decreases text size. It operates over alphabets that contain small number of symbols. Such an automaton is suitable for example for processing of DNA sequences.

A new automaton and new search algorithms are presented for such a transformed text (patterns are still input in the original alphabet). Space requirements of such condensed automata are as low as the size of the input text.

This paper further introduces a new type of an automaton, a so-called Identifier DAWG, which uses special properties of the DAWG automata and further decreases space requirements by introducing associations between states of an automaton and positions in a text. It is designed better searching algorithm using this association.

Download paper: Article in PostScript Article in PDF
 PostScript   PDF