Condensation Principle
Abstract: |
The most important contribution of this paper is the discovery and
description of an effective implementation of a finite automaton that
accepts all substrings of a given text. The ratio of the size of this
automaton with respect to the size of the input text is in standard
implementations 10 or more, see [AndNil95], in [Bal98] is described
implementation with ratio less than 4. This implementation does not increase
the time to search for a pattern, which is proportional to the length of the
pattern.
The major approach introduced in this paper uses transformation of a text into a new alphabet that contains more symbols than the original alphabet and decreases text size. It operates over alphabets that contain small number of symbols. Such an automaton is suitable for example for processing of DNA sequences. A new automaton and new search algorithms are presented for such a transformed text (patterns are still input in the original alphabet). Space requirements of such condensed automata are as low as the size of the input text. This paper further introduces a new type of an automaton, a so-called Identifier DAWG, which uses special properties of the DAWG automata and further decreases space requirements by introducing associations between states of an automaton and positions in a text. It is designed better searching algorithm using this association. |
Download paper: | ||
PostScript |