Arithmetic coding - integer implementation
Main features
- statistic compression method
- two-pass compression method
- semi-adaptive compression method
- symetric compression method
- The whole stream of input symbols is replaced with a single floating point output number [0,1).
- The infinite floating point number is replaced by sequence of integers.
Time and space complexities
Time complexity of the algorithm: O(|Σ|+n)
Space complexity of the algorithm: O(|Σ|)
Memory complexity depends on a number of different input symbols, at maximum O(n), where n is length of a message. Time complexity depends on a number of different input symbols and length of a message. So n+n*|Σ|, where |Σ| is a number of different input symbols, at maximum O(n*|Σ|).
History
The first arithmetic code was devised by Peter Elias sometime before 1963 in unpublished work. Although the Elias code encodes at an optimal rate, it is impractical because it requires infinite precision and therefore an infinite buffer size. However, the Elias code can be modified to remove these drawbacks. Although practical schemes for arithmetic coding now abound, the first practical arithmetic coding schemes did not become available until 1976, in work by Jorma Rissanen and Richard Pasco.
Description
Primary idea
On the base of probability occurrance of symbols of input alphabet every single symbol is assigned to a relative part of interval <0, 1). During encoding is the interval limited step by step according to actually read input symbol. Every input symbol determines adequate part from the actual interval and it becomes a new base for the next symbol. Coded value is represented by a random value from the resulting interval after reading all input symbols. By reason that during decoding the end of the coded message can't be recognized, the special symbol is added to the message or the length of the message has to be known.Modification
The interval <0,1) (<Low, High)) is replaced by interval of integers, e.g. <0,9999). If the most significant digits are equal, the digit is written to the output because it can't be changed. Both numbers are then left shifted and one digit (0 for low, 9 for high) is filled in a released position.Underflow problem
Underflow occurs when both high and low get close to each other but their most significant bits don't match: High = 3001 Low = 2997. If we ever have such numbers and they continue getting closer and closer we'll not be able to output the most significant bit and then in a few iterations bit length of our numbers will not be enough. We have to shift the second digit in this situation and when finally both msb are equal the digits discarded are also written to output.Formula
Low=Low + (High - Low + 1)*LowCumFreq(x)/TotalFreqHigh=Low + (High - Low + 1)*HighCumFreq(x)/TotalFreq - 1
Pseudo-code
1: begin
2: count frequency of input symbols
3: output(symbol's frequency)
4: interval I := new interval 0..9999
5: split I according frequency of symbols
6: readSymbol(X)
7: while (X!=EOF) do
8: begin
9: if (MSB(Low) == MSB(High) then
10: begin
11: output(MSB)
12: in case of need output discarded digits
13: shift left High and Low
14: end
15: else
16: begin
17: if (underflow danger) then
18: begin
19: shift left High and Low from second position
20: end
21: end
22: new I := interval I accordant with X
23: split I according frequency of symbols
24: readSymbol(X)
25: end
26: output(remainder)
27: end
References
- Richard C. Pasco: Source Coding Algorithms for Fast Data Compression. Ph.D. Dissertation, Stanford University, CA, May 1976.
- Jorma Rissanen, Glen G. Langdon Jr.: Arithmetic Coding. IBM Journal of Research and Development 23(2), 149-162 (1979).
- Arturo San Emeterio Campos: Arithmetic coding.
- Mark Nelson: Arithmetic Coding + Statistical Modeling = Data Compression.
- Arithmetic coding. Wikipedia, the free encyclopedia.