Prague Stringology Conference 2013

Loek Cleophas, Derrick G. Kourie and Bruce W. Watson

Weak Factor Automata: Comparing (Failure) Oracles and Storacles

Abstract:
The factor oracle (C. Allauzen, M. Crochemore, and M. Raffinot: Efficient Experimental String Matching by Weak Factor Recognition, in Proceedings of the 12th conference on Combinatorial Pattern Matching, vol. 2089 of LNCS, 2001, pp. 51–72.) is a data structure for weak factor recognition. It is a deterministic finite automaton (DFA) built on a string p of length m that is acyclic, recognizes at least all factors of p, has m+1 states which are all final, is homogeneous, and has m to 2m-1 transitions. The factor storacle (L. Cleophas and B. W. Watson: On Factor Storacles: an Alternative to Factor Oracles?, in Festschrift for Boŕivoj Melichar, Department of Theoretical Computer Science, Czech Technical University, Prague, August 2012.) is an alternative automaton that satisfies the same properties, except that its number of transitions may be larger than 2m-1, although it is conjectured to be linear with an upper bound of at most 3m. In (D. G. Kourie, B. W. Watson, L. Cleophas, and F. Venter: Failure Deterministic Finite Automata, in Proceedings of the Prague Stringology Conference 2012, Department of Theoretical Computer Science, Czech Technical University, Prague, September 2012.) (among others), we described the concept of a failure automaton i.e. a failure DFA (FDFA), in which so-called failure transitions are used to reduce the total number of transitions and thus reduce representation space compared to the use of a DFA. We modify factor oracle and storacle construction algorithms to introduce failure arcs during the respective automata's construction. We thus end up with four deterministic automata types for weak factor recognition: factor oracle, factor storacle, failure factor oracle, and failure factor storacle. We compare them empirically in terms of size. The results show that despite the relative simplicity of (failure) factor (st)oracles, the failure versions show additional savings of 2–7% in number of transitions, for generated keywords of length 5–9, and of e.g. 5–9% for English words of lengths around 9–15. This may already be substantial in memory-restricted settings such as hardware implementations of automata. The results indicate the gains increase for longer keywords, which seems promising for applications in DNA processing and intrusion detection. Furthermore, our results provide a rather negative result on storacles: apart from rare cases, factor storacles do not have fewer transitions than factor oracles, and similarly for failure factor storacles versus failure factor oracles.

Download paper: Article in PostScript Article in PDF BibTeX Reference
   PDF   BibTeX reference 
Download presentation: Presentation