So, I'm trying to solve a challenge regarding decoding some huffman compressed message, without knowing the code tree used to compress with.
I do however know the alphabet that was used in the message.
So my idea was to try to bruteforce it but I am a bit lacking in my algoritm skills.
I imagined I would try to generate the codes for the letters, in all possible combinations. The issue is though, that the codes (in binary) can never be able to hide eachother.
So an example could be:
Letter | Code |
---|---|
A | 0100 |
B | 1111 |
C | 1011 |
But then there couldn't be any other codes, that start with any of the above, as they would end up hiding eachother.
So, for a 40 character alphabet, I would like create unique, non-hiding bit codes.
I have no idea where to start though. Any tips are appreciated.
- Are there any smart algoritms I'm not aware of (very likely)?
- Is it called something I don't know, which could help me search?
- Any tips on how to actually create this, in any way?
CodePudding user response:
I don't think you are going to be able to just enumerate through all possible encodings. I can easily come up with a scheme to generate over 2^39 different encodings, and each of those encodings will have 40! different ways of assigning codes to letters.
Let x
be a random 40-bit string. Look at
~x[0]
x[0:1] ~x[1]
x[0:2] ~x[2]
x[0:3] ~x[3]
....
x[0:39] ~x[39]
x[0:40]
This is a prefix code for any value x
.
CodePudding user response:
Please let me know if I'm misunderstanding your question.
Binary numbers map one-to-one with decimal numbers, so you can cover a two-character alphabet with binary numbers of length 1, four-characters with length 2, etc.
So for a 40-character alphabet you'll need binary codes of length 6. Then if you have them in a list:
alphabet = ['A', 'B', 'C', ...]
You could get your mapping with
mapping = {alphabet[i]: format(i, "b").zfill(6) for i in range(len(alphabet))}
{'A': '000000', 'B': '000001', 'C': '000010', ...}