DCA (Data Compression using Antidictionaries)
Main features
- context compression method
- contextual method
- works with binary alphabet
History
Method was developed in 1998 by M. Crochemore, F. Mignosi, A. Restivo, S. Salemi.
Description
Algorithm uses antidictionary (AD) - a set of words (of certain maximal length), which are not contained in input text. AD doesn't contain all such words, but just minimal forbidden words - MFW. MFW is word u[1..n], u doesn't occur in input text, but both u[1..n-1] and u[2..n] do.
When processing i-th letter of input and there is a word v[1..n] in antidictionary, that v[1..n-1] is a suffix of previously processed input, we don't need to code i-th letter into output, because (when decoding) we can find that letter with a help of the suffix andd the antidictionary. Other input letters are normally copied to output.
Applet description
When building a factor automaton, the node being just processed is blue-colored and newly inserted node is colored red. After finishing the automaton all nodes that have just one child are marked blue. Then we add new nodes (to pair with "lonely children") and mark them red.
During the process of encoding we can see all suffixes of processed input and each of these suffixes has additional "0" or "1" at the end (marked red).
Pseudo-code
1: function Create AD (input word w, AD's word's max length l)
2: begin
3: create factor automaton matching all factors (of maximal length l) of input text
4: each node has value equal to corresponding factor
5: for each (node U) do
6: begin
7: if (U has just one child) then
8: begin
9: let h is value of edge leading from U (h ∈ {0,1})
10: create node U' and mark it as forbidden word
11: create edge from U to U' with value ¬h
12: create suffix-link - edge from U (with value s[1..n])
13: to U'' with value s[2..n] (if U'' doesn't exist,
14: try s[3..n] etc.)
15: end
16: end
17: for each (node U marked as forbidden word) do
18: begin
19: let P is direct ancestor of U
20: let S is node, where suffix-link from P leads
21: let h is value of edge from P to U
22: if ((exists node T linked by edge with value h from S)
23: and (T is not forbidden word)) then
24: begin
25: mark U as minimal forbidden word - MFW
26: end
27: end
28: return values of all nodes marked as MFW - antidictionary AD
29: end
30:
31: function Process Input (antidictionary AD, input word w)
32: begin
33: let v = ε {processed input}
34: let g = ε {output}
35: for (i := 1 to |w|) do
36: begin
37: let M0 = set of all suffixes of v with additional 0 as last letter
38: let M0 = set of all suffixes of v with additional 1 as last letter
39: if ((M0 ∪ M1) ∩ AD ≠ Ø) then
40: g := g.wi
41: v := v.wi
42: end
43: output(g)
44: end
References
- M. Crochemore, F. Mignosi, A. Restiva and S. Salemi: Text Compression Using Antidictionaries. 1998.