The Prague Stringology Club Workshop '99

Fei Shi and Dong-Guk Shin

Centroid Trees with Application to String Processing

Abstract:
A centroid of a tree T is a node v which minimizes over all nodes the largest connected component of T induced by removing v from T. A centroid tree U of another tree T is defined on the same set of nodes of T: the root v of U is a centroid of T and the subtrees of v (in U) are the centroid trees of the connected components of T - v. We describe some interesting properties of the centroid and of the centroid tree. Our linear algorithm to find a centroid of a tree improves on the previously known algorithms either in terms of space requirement or in terms of time requirement. From the algorithm for finding a centroid it is easy to obtain an O(n log(n)) time algorithm to construct a centroid tree of a gi ven tree with n nodes. However, we do not know whether this is the best that one can achieve. By exploiting the properties of the centroid tree, we devise an efficient algorithm for the longest common substring problem (LCS). Given two strings S (the text) of length n and P (the pattern) of length m, the LCS problem is to find the longest substring that appears in both the text and the pattern. Our algorithm requires O(n log(n)) time and O(n) space to preprocess the text. After preprocessing of the text, the algorithm takes O(m log(n)) time using O(m) extra space to find the solution. The algorithm may be used in the DNA applications in which the text is very large and fixed and is to be searched with many different patterns (n >> m).

Download paper: Article in PostScript Article in PDF
 PostScript   PDF