Arithmetic coding - implementation with unlimited precision

Main features

statistic compression method
two-pass compression method
semi-adaptive compression method
symetric compression method
the compress ratio is very close to message entropy
unlimited precission required

Time and space complexities

Time complexity of algorithm :O(|Σ|+n)
Memory complexity of algorithm : O(|Σ|)

Description

Arithmetic coders produce near-optimal output for a given set of symbols and probabilities. Compression algorithms that use arithmetic coding start by determining a model of the data - basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimality the output will be.

Example: a simple, static model of input data:

60% chance of symbol 'a' -> the interval would be [0, 0.6)
20% chance of symbol 'b' -> the interval would be [0.6, 0.8)
10% chance of symbol 'c' -> the interval would be [0.8, 0.9)
10% chance of symbol END-OF-DATA. -> the interval would be [0.9, 1)

The presence of END-OF-DATA symbol means that the stream will be 'internally terminated', as is fairly common in data compression; the first and only time this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.

The encoder has basically just three pieces of data to consider: the next symbol that needs to be encoded, the current interval, the probabilities of symbols. Due to this, is quite easy to modify the algorithm to adaptive model.

The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponding to the actual symbol that is next to be encoded becomes the interval used in the next step.

When all symbols have been encoded, the resulting interval identifies, unambiguously, the sequence of symbols that produced it. Anyone who has the final interval and the model used can reconstruct the symbol sequence that must have been entered the encoder to result in that final interval.

It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval.

Pseudo-code

 
 1:  begin 
 2:     count source units
 3:     interval I := new interval 0..1
 4:     divide I acording to rate of units
 5:     readSymbol(X)
 6:     while (X!=EOF) do 
 7:      begin
 8:        new I := subinterval I matching X
 9:        divide I acording to rate of units
10:        readSymbol(X)
11:      end
12:     output(best number from I)
13:  end

References

Arithmetic coding. Wikipedia, the free encyclopedia.
Arturo San Emeterio Campos: Arithmetic coding.