Arithmetic coding - integer implementation

Main features

statistic compression method
two-pass compression method
semi-adaptive compression method
symetric compression method
The whole stream of input symbols is replaced with a single floating point output number [0,1).
The infinite floating point number is replaced by sequence of integers.

Time and space complexities

Time complexity of the algorithm: O(|Σ|+n)
Space complexity of the algorithm: O(|Σ|)

Memory complexity depends on a number of different input symbols, at maximum O(n), where n is length of a message. Time complexity depends on a number of different input symbols and length of a message. So n+n*|Σ|, where |Σ| is a number of different input symbols, at maximum O(n*|Σ|).

History

The first arithmetic code was devised by Peter Elias sometime before 1963 in unpublished work. Although the Elias code encodes at an optimal rate, it is impractical because it requires infinite precision and therefore an infinite buffer size. However, the Elias code can be modified to remove these drawbacks. Although practical schemes for arithmetic coding now abound, the first practical arithmetic coding schemes did not become available until 1976, in work by Jorma Rissanen and Richard Pasco.

Description

Primary idea

On the base of probability occurrance of symbols of input alphabet every single symbol is assigned to a relative part of interval <0, 1). During encoding is the interval limited step by step according to actually read input symbol. Every input symbol determines adequate part from the actual interval and it becomes a new base for the next symbol. Coded value is represented by a random value from the resulting interval after reading all input symbols. By reason that during decoding the end of the coded message can't be recognized, the special symbol is added to the message or the length of the message has to be known.

Modification

The interval <0,1) (<Low, High)) is replaced by interval of integers, e.g. <0,9999). If the most significant digits are equal, the digit is written to the output because it can't be changed. Both numbers are then left shifted and one digit (0 for low, 9 for high) is filled in a released position.

Underflow problem

Underflow occurs when both high and low get close to each other but their most significant bits don't match: High = 3001 Low = 2997. If we ever have such numbers and they continue getting closer and closer we'll not be able to output the most significant bit and then in a few iterations bit length of our numbers will not be enough. We have to shift the second digit in this situation and when finally both msb are equal the digits discarded are also written to output.

Formula

Low=Low + (High - Low + 1)*LowCumFreq(x)/TotalFreq
High=Low + (High - Low + 1)*HighCumFreq(x)/TotalFreq - 1

Pseudo-code


 1:  begin
 2:     count frequency of input symbols
 3:     output(symbol's frequency)
 4:     interval I := new interval 0..9999
 5:     split I according frequency of symbols
 6:     readSymbol(X)
 7:     while (X!=EOF) do
 8:      begin
 9:        if (MSB(Low) == MSB(High) then 
10:         begin
11:           output(MSB)
12:           in case of need output discarded digits
13:           shift left High and Low
14:         end
15:        else 
16:         begin
17:           if (underflow danger) then 
18:            begin
19:              shift left High and Low from second position
20:            end
21:         end
22:        new I := interval I accordant with X
23:        split I according frequency of symbols
24:        readSymbol(X)
25:      end
26:     output(remainder)
27:  end

References

Richard C. Pasco: Source Coding Algorithms for Fast Data Compression. Ph.D. Dissertation, Stanford University, CA, May 1976.
Jorma Rissanen, Glen G. Langdon Jr.: Arithmetic Coding. IBM Journal of Research and Development 23(2), 149-162 (1979).
Arturo San Emeterio Campos: Arithmetic coding.
Mark Nelson: Arithmetic Coding + Statistical Modeling = Data Compression.
Arithmetic coding. Wikipedia, the free encyclopedia.