The Prague Stringology Club Workshop '96

Takashi Sato

Fast Full Text Search Using Tree Structured[TS] File

The author proposes a new data structure (TS-file) in order to make a fast search for an arbitrary string in a large full text stored in secondary storage. The TS-file stores the location of every string of length L (the level) in the text. Using this, we can efficiently search for, not only strings of length L but also those shorter than or longer than L. From an analysis of search cost, the number of accesses to secondary storage in order to find the first match to a key is two when the key length lk is shorter than or equal to L, and 2(L-lk+1) otherwise. And the time required to find all matching patterns is proportional to the number of matches, which is the lowest rate of increase for these kind of searches. Because of the high storage cost of the basic TS-file, a compressed TS-file is introduced in order to lower storage costs for practical use without losing search speed. The experimental results on compression using UNIX online manuals and network news show that the space overhead of the TS-file against the text searched is from 17% (when L=3) to 212% (when L=12) which is small enough for practical use.

Download paper: Article in PostScript Article in PDF
 PostScript   PDF