In a nutshell:
Is the Shannon-Fano coding as described in Fano's paper The Transmission of Information (1949) really ambiguous?
In Detail:
3 papers
Claude E. Shannon published his famous paper A Mathematical Theory of Communication in July 1948. In this paper he introduced the term bit as we know it today, and he also defined what we now call Shannon entropy. He also proposed an entropy-based data compression algorithm in this paper. But Shannon's algorithm was so weak that under certain circumstances the "compressed" messages could even be longer than in fixed-length coding. A few months later (March 1949) Robert M. Fano published an improved version of Shannon's algorithm in the paper The Transmission of Information. Three years after Fano (in September 1952) his student David A. Huffman published an even better version in his paper A Method for the Construction of Minimum-Redundancy Codes. Huffman coding is more efficient than its two predecessors and it is still used today. But my question is about the algorithm published by Fano, which is usually called Shannon-Fano coding.
The algorithm
This description is based on the description from Wikipedia. Sorry, I did not fully read Fano's paper. I only browsed through it. It is 37 pages long and I really tried hard to find a passage where he talks about the topic of my question, but I could not find it. So, here is how Shannon-Fano encoding works:
- Count how often each character appears in the message.
- Sort all characters by frequency, characters with highest frequency on top of the list
- Divide the list into two parts, such that the sums of the frequencies in both parts are as equal as possible. Add the bit 0 to one part and the bit 1 to the other part.
- Repeat step 3 on each part that contains 2 or more characters, until all parts consist of only 1 character.
- Concatenate all the bits a character collected over all rounds. This is the Shannon-Fano code of that character. (A runnable sketch of these steps follows this list.)
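To make the steps concrete, here is a minimal Python sketch of the Wikipedia description. The prefer_top flag is my own addition (it appears in none of the papers): it controls how the tie in step 3 is broken, which is exactly the ambiguity this question is about.

from collections import Counter

def shannon_fano(freqs, prefer_top=False):
    # freqs: dict mapping each character to its frequency.
    # prefer_top: how to break a tie in step 3. False puts the character
    # at the splitting point into the lower part (possibility 1 below),
    # True puts it into the upper part (possibility 2 below). This flag
    # is my own addition; neither Wikipedia nor Fano specifies it.

    # Step 2: sort characters by frequency, highest first.
    symbols = sorted(freqs, key=lambda s: freqs[s], reverse=True)
    codes = {}

    def split(group, prefix):
        if len(group) == 1:                  # part with only 1 character
            codes[group[0]] = prefix or "0"
            return
        total = sum(freqs[s] for s in group)
        # Step 3: find the split point that makes both sums as equal as
        # possible; on a tie, prefer_top decides which split point wins.
        best_i, best_diff, running = 1, None, 0
        for i in range(1, len(group)):
            running += freqs[group[i - 1]]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff or (diff == best_diff and prefer_top):
                best_i, best_diff = i, diff
        split(group[:best_i], prefix + "0")  # bit 0 for one part
        split(group[best_i:], prefix + "1")  # bit 1 for the other part

    split(symbols, "")
    return codes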
An example
Let's execute this on a really tiny example (I think it's the smallest message where the problem appears). Here is the message to encode:
aaabcde
Steps 1 and 2 produce the first 2 columns of both tables shown below. But if Wikipedia's explanation of Fano's algorithm is correct, then step 3 is ambiguous. If you apply this step to my example, you have two possibilities to split the list into 2 parts (see below). These possibilities produce different codes, which by itself would not be worth mentioning. But the point is: the two possibilities produce codes of different lengths.
possibility 1
If there are 2 ways to split the list such that both parts are as equal as possible, put the character that stands at the splitting point (character b in my example) into the part containing the low-frequency characters:
-----+------+-----------+-----------+-----------+------
char | freq | round 1   | round 2   | round 3   | code
     |      | sum   bit | sum   bit | sum   bit |
-----+------+-----------+-----------+-----------+------
  a  |  3   |  3    0   |           |           |   0
  b  |  1   |  4    1   |  2    0   |  1    0   |  100
  c  |  1   |  4    1   |  2    0   |  1    1   |  101
  d  |  1   |  4    1   |  2    1   |  1    0   |  110
  e  |  1   |  4    1   |  2    1   |  1    1   |  111
-----+------+-----------+-----------+-----------+------
The encoded message is
0 0 0 100 101 110 111   (length = 15 bits)
a a a  b   c   d   e
possibility 2
If there are 2 ways to split the list such that both parts are as equal as possible, put the character that stands at the splitting point into the part containing the high-frequency characters:
-----+------+-----------+-----------+-----------+------
char | freq | round 1   | round 2   | round 3   | code
     |      | sum   bit | sum   bit | sum   bit |
-----+------+-----------+-----------+-----------+------
  a  |  3   |  4    0   |  3    0   |           |   00
  b  |  1   |  4    0   |  1    1   |           |   01
  c  |  1   |  3    1   |  2    0   |  1    0   |  100
  d  |  1   |  3    1   |  2    0   |  1    1   |  101
  e  |  1   |  3    1   |  1    1   |           |   11
-----+------+-----------+-----------+-----------+------
The encoded message is
00 00 00 01 100 101 11   (length = 16 bits)
a  a  a  b   c   d   e
So, it is one bit longer.
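For what it's worth, both tables can be reproduced with the shannon_fano sketch shown after the algorithm description above; the prefer_top flag selects the possibility:

from collections import Counter

message = "aaabcde"
freqs = Counter(message)
for prefer_top in (False, True):             # possibility 1, possibility 2
    codes = shannon_fano(freqs, prefer_top)  # defined in the sketch above
    encoded = "".join(codes[ch] for ch in message)
    print(prefer_top, codes, len(encoded), "bits")
# prints 15 bits for possibility 1 and 16 bits for possibility 2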
So, here are my questions:
- Is Wikipedia's description of Shannon-Fano coding really correct and complete? If so, then Shannon-Fano coding is ambiguous.
- Or did Fano describe another step in his paper that is missing from Wikipedia's description? If so: how did Fano solve the problem described here, and which of the two versions is compatible with Fano's original description?
Answer:
It is true that there is an ambiguity in the algorithm. I could not find any hint about how to resolve it in either the Wikipedia page or the algorithm part of the original paper.
In practice, why do you get a solution with degraded performance? Because in possibility 2, the second split produces two buckets with frequencies 3 and 1, i.e. rather different frequencies.
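For comparison, the per-symbol entropy of the message aaabcde is

H = (3/7) * log2(7/3) + 4 * (1/7) * log2(7) ≈ 2.13 bits/symbol

so possibility 1's average of 15/7 ≈ 2.14 bits/symbol is very close to that bound, while possibility 2's 16/7 ≈ 2.29 bits/symbol is noticeably worse.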
This issue is addressed in the paper (emphasis mine), page 8:
One may observe, however, that it will not generally be possible to form groups equally likely to contain the desired message, because shifting any one of the messages from one group to the other will change, by finite amounts, the probabilities corresponding to the two groups.
But for Fano, this problem is not so important. His ambition is not to define a very simple and practical algorithm to compress small messages consisting of a few characters. He is more interested in the theoretical aspects. For that, he considers very long individual messages (these individual messages are the characters in your example):
On the other hand, if the length of the messages is increased indefinitely, the accuracy with which the probabilities of the two groups can be made equal becomes better and better since the probability of each individual message approaches zero.
Under this hypothesis, the phenomenon that you observe is unlikely to cause a significant performance degradation.
Answer:
To directly answer your question, without further elaboration about how to break ties, two different implementations of Shannon-Fano could produce different codes of different lengths for the same inputs.
As @MattTimmermans noted in the comments, Shannon-Fano does not always produce optimal prefix-free codings the way that, say, Huffman coding does. It might therefore be helpful to think of it less as an algorithm and more as a heuristic - something that will likely produce a good code but isn't guaranteed to give an optimal solution. Many heuristics suffer from similar issues, where minor tweaks to the input or to how ties are broken can produce very different results. A good example of this is the greedy coloring algorithm for finding vertex colorings of graphs. The linked Wikipedia article includes an example in which changing the order in which nodes are visited by the same basic algorithm yields wildly different results.
Even algorithms that produce optimal results, however, can sometimes produce different optimal results based on tiebreaks. Take Huffman coding, for example, which works by repeatedly finding the two lowest-weight trees assembled so far and merging them together. In the event that there are three or more trees at some intermediary step that are all tied for the same weight, different implementations of Huffman coding could produce different prefix-free codes based on which two they join together. The resulting trees would all be equally "good," though, in that they'd all produce outputs of the same length. (That's largely because, unlike Shannon-Fano, Huffman coding is guaranteed to produce an optimal encoding.)
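To make that concrete, here is a small sketch (my own construction, not from the question's example) with frequencies chosen so that three trees tie at weight 2 after the first merge. A hypothetical tie-break knob changes the individual code lengths, but the total stays the same:

import heapq
import itertools

def huffman_lengths(freqs, fresh_trees_win_ties=False):
    # Returns {symbol: code length}. The flag is a hypothetical knob that
    # only matters when three or more trees tie for the lowest weights:
    # it decides whether freshly merged trees win or lose those ties.
    ranks = itertools.count(-1, -1) if fresh_trees_win_ties else itertools.count(1)
    # Heap entries: (weight, tie_rank, symbols_in_this_tree)
    heap = [(w, next(ranks), [s]) for s, w in freqs.items()]
    heapq.heapify(heap)
    depth = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        w1, _, g1 = heapq.heappop(heap)   # the two lowest-weight trees
        w2, _, g2 = heapq.heappop(heap)
        for s in g1 + g2:                 # merging deepens every leaf by 1
            depth[s] += 1
        heapq.heappush(heap, (w1 + w2, next(ranks), g1 + g2))
    return depth

# Chosen so that three trees of weight 2 exist after one merge.
freqs = {"a": 2, "b": 2, "c": 1, "d": 1}
for flag in (False, True):
    depth = huffman_lengths(freqs, flag)
    total = sum(freqs[s] * depth[s] for s in freqs)
    print(depth, "total =", total, "bits")
# One run gives lengths {2, 2, 2, 2}, the other {1, 2, 3, 3};
# both encode the message in 12 bits total.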
That being said, it's easy to adjust Shannon-Fano so that it always produces a consistent result. For example, you could say "in the event of a tie, choose the partition that puts fewer items into the top group," at which point you would always consistently produce the same coding. It wouldn't necessarily be an optimal encoding, but, then again, since Shannon-Fano was never guaranteed to do so, this is probably not a major concern.
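A sketch of that rule, assuming the group is already sorted by descending frequency (so the earliest best split point is the one that puts fewer items into the top group):

def pick_split(group, freqs):
    # Deterministic version of step 3: scan split points top-down and keep
    # the FIRST one with the best balance, so on a tie the top group is
    # the smaller one. One arbitrary but consistent convention.
    total = sum(freqs[s] for s in group)
    running, best_i, best_diff = 0, 1, None
    for i in range(1, len(group)):
        running += freqs[group[i - 1]]
        diff = abs(total - 2 * running)
        if best_diff is None or diff < best_diff:  # strict '<': first win stands
            best_i, best_diff = i, diff
    return best_i  # number of symbols in the top group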
If, on the other hand, you're interested in the question of "when Shannon-Fano has to break a tie, how do I decide how to break the tie to produce the optimal solution?," then I'm not sure of a way to do this other than recursively trying both options and seeing which one is better, which in the worst case leads to exponentially slow runtimes. But perhaps someone else here can find a way to do that!
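If someone wants to experiment, a brute-force sketch of that idea might look like this (best_shannon_fano is a hypothetical name; it branches only on tied split points, so it is still exponential in the worst case):

from collections import Counter

def best_shannon_fano(freqs):
    # Recursively try every tied split point and keep whichever choice
    # yields the shortest total encoded length.
    symbols = sorted(freqs, key=lambda s: freqs[s], reverse=True)

    def cost(codes):
        return sum(freqs[s] * len(codes[s]) for s in codes)

    def solve(group, prefix):
        if len(group) == 1:
            return {group[0]: prefix or "0"}
        total = sum(freqs[s] for s in group)
        # Collect every split point that balances the two sums best.
        diffs, running = [], 0
        for i in range(1, len(group)):
            running += freqs[group[i - 1]]
            diffs.append((abs(total - 2 * running), i))
        best = min(d for d, _ in diffs)
        candidates = [
            {**solve(group[:i], prefix + "0"), **solve(group[i:], prefix + "1")}
            for d, i in diffs if d == best
        ]
        return min(candidates, key=cost)

    return solve(symbols, "")

print(best_shannon_fano(Counter("aaabcde")))  # picks possibility 1 (15 bits)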