Loop over string based on keys in dictionary

Instead of looping over each separate character of a string, I want to loop over parts of a string (multiple characters). Those parts are defined by the keys of a dictionary.

Example:

my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"
output = ""

What I've tried (looping over each character separately, indeed):

for i in word:
   letter = my_dict[i]
   output += letter
   word = word.lstrip(letter)

My output:

"KeyError: '1'"

But I want to get key "1000" and its value "i", and then continue with key "0010" and get its value "n", etc...

Expected output:

# Expected output:
output = "internet"

CodePudding user response：

Assuming it's a prefix code (otherwise you'd need to define how to deal with ambiguities), accumulate the bits until you have a match, then output the letter and clear the bits:

output = ""
bits = ""
for bit in word:
    bits += bit  # accumulate one more bit
    if bits in my_dict:
        letter = my_dict[bits]
        output += letter
        bits = ""  # start collecting the next code
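The prefix-code assumption can be checked before decoding. The following is a minimal sketch, not part of the original answer; is_prefix_free is a hypothetical helper name chosen for illustration:

```python
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i',
           '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l',
           '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}


def is_prefix_free(codes):
    """Return True if no code is a prefix of another code (hypothetical helper)."""
    ordered = sorted(codes)
    # In sorted order, a code that is a prefix of any other code is
    # immediately followed by a code that starts with it, so comparing
    # neighbours is enough.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


print(is_prefix_free(my_dict))               # True
print(is_prefix_free({'0': 'a', '01': 'b'}))  # False
```

If the check fails, the accumulate-until-match loop always emits the shorter code and can never reach the longer one.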


A slight variation of the lookup, which Jnevill's answer reminded me of:

    if letter := my_dict.get(bits):
        output += letter
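Put together, the loop with the walrus lookup might look like the sketch below. The function name decode and the error raised for leftover bits are my additions, not part of the answer. It assumes Python 3.8+ and that every value in the dictionary is a non-empty string, since the walrus test relies on truthiness:

```python
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i',
           '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l',
           '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"


def decode(word, table):
    """Decode a bit string by accumulating bits until they match a key."""
    output = ""
    bits = ""
    for bit in word:
        bits += bit
        # .get returns None for an unknown key, which is falsy
        if letter := table.get(bits):
            output += letter
            bits = ""
    if bits:
        # Assumption: leftover bits are treated as an error here
        raise ValueError(f"undecodable trailing bits: {bits!r}")
    return output


print(decode(word, my_dict))  # internet
```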

CodePudding user response：

You could use a regular expression to substitute the patterns with the corresponding letters. re.sub accepts a function as the replacement, which can look up each match in the dictionary to get the letter. The search pattern needs to list the longer keys first so that they are "consumed" in priority over shorter patterns that start with the same bits:

my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"

import re

pattern = "|".join(sorted(my_dict.keys(), key=len, reverse=True))
output = re.sub(pattern, lambda m: my_dict[m.group(0)], word)

print(output) # internet
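To see why the longest-first ordering matters, here is a small sketch with a made-up table in which one key is a prefix of another. The table is hypothetical and unrelated to the question's data. Alternation in Python's re module takes the first alternative that matches, not the longest:

```python
import re

# Hypothetical table that is NOT a prefix code: '0' is a prefix of '01'
table = {'0': 'a', '01': 'b'}

unsorted_pattern = "|".join(table)                               # '0|01'
sorted_pattern = "|".join(sorted(table, key=len, reverse=True))  # '01|0'

# With '0' listed first, the '1' is never consumed and stays in the output
first_wins = re.sub(unsorted_pattern, lambda m: table[m[0]], "01")
# With the longer key listed first, the whole input is consumed
longest_wins = re.sub(sorted_pattern, lambda m: table[m[0]], "01")

print(first_wins)    # a1
print(longest_wins)  # b
```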

[EDIT]

If there are no conflicts between short and long bit patterns (that is, no key is a prefix of another), the sort is not needed, as Kelly pointed out, and the solution can be a single line:

output = re.sub('|'.join(my_dict), lambda m: my_dict[m[0]], word)
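One thing to keep in mind with re.sub is that bits which match no key are silently left in the result. If that matters, the input can be validated first. This is a sketch under the assumption that the keys form a prefix code; decode_strict is a hypothetical name:

```python
import re

my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i',
           '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l',
           '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"

# re.escape is a no-op for plain digits but keeps the pattern safe for other keys
pattern = "|".join(map(re.escape, my_dict))


def decode_strict(bits):
    """Decode with re.sub, but reject input that is not a sequence of keys."""
    if not re.fullmatch(f"(?:{pattern})*", bits):
        raise ValueError("input cannot be split into known codes")
    return re.sub(pattern, lambda m: my_dict[m[0]], bits)


print(decode_strict(word))  # internet

# Without the check, the stray trailing bit is passed through unchanged
print(re.sub(pattern, lambda m: my_dict[m[0]], "10000"))  # i0
```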

CodePudding user response：

Issue with your code:

for i in word:  # here, i is a single character
   # so this lookup fails: every key in my_dict is several characters long
   letter = my_dict[i]
   output += letter  # this would work fine
   word = word.lstrip(letter)
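Two details of that loop can be checked in isolation. This is a small illustrative sketch, not part of the original answer: iterating a string yields single characters, and str.lstrip treats its argument as a set of characters to strip rather than as a prefix.

```python
word = "1000001001100001100000100000110"

# Iterating a string yields one character at a time,
# so the loop variable is only ever '0' or '1'
chars = sorted(set(word))
print(chars)  # ['0', '1']

# lstrip("00") strips every leading '0', not just the prefix "00"
stripped = "00010".lstrip("00")
print(stripped)  # 10

# Slicing removes exactly the matched prefix
sliced = "00010"[len("00"):]
print(sliced)  # 010
```

On Python 3.9 and later, str.removeprefix does the same job as the slice.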

You can use a while loop on word and remove the part you found in the dict each time. When word is empty, the loop stops and the program ends.

Inside the loop, iterate over each key in the dict and test whether it matches the beginning of the word. If it does, you have the letter you are looking for. Do whatever you want in place of the print, and repeat.

translate_table = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
message = "1000001001100001100000100000110"

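while message:
    for code, letter in translate_table.items():
        if message.startswith(code):
            # replace this with whatever you want to do with the letter
            print(letter, end="")

            # "Cut" the word to keep the remaining characters
            message = message[len(code):]
            break  # restart the search from the first code
    else:
        # no code matches: stop instead of looping forever
        raise ValueError(f"cannot decode remaining bits: {message!r}")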

CodePudding user response：

While iterating over my_dict (as DorianTurba suggests) feels like a more elegant solution, your gut was suggesting that you should iterate over word. To do this, you can use a while loop and control how many characters you advance in each iteration, depending on the length of the my_dict key that matches the next 3, 4, or 5 characters of word.

Consider:

my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i', '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l', '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"

i = 0

while len(word) > i:
    for size in [3, 4, 5]:
        if my_dict.get(word[i:i + size]):
            print(my_dict[word[i:i + size]])
            i += size
            break
    else:
        # no key matches at position i: stop instead of looping forever
        break
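The same idea can be wrapped in a function that derives the candidate sizes from the dictionary instead of hard-coding [3, 4, 5] and returns the decoded string. This is only a sketch; decode_by_slices is a hypothetical name, and raising on an unmatched position is my assumption about the desired behaviour:

```python
my_dict = {'010': 'a', '000': 'e', '1101': 'f', '1010': 'h', '1000': 'i',
           '0111': 'm', '0010': 'n', '1011': 's', '0110': 't', '11001': 'l',
           '00110': 'o', '10011': 'p', '11000': 'r', '00111': 'u', '10010': 'x'}
word = "1000001001100001100000100000110"


def decode_by_slices(word, table):
    """Decode by trying each possible key length at the current index."""
    sizes = sorted({len(code) for code in table})  # [3, 4, 5] for my_dict
    letters = []
    i = 0
    while i < len(word):
        for size in sizes:
            chunk = word[i:i + size]
            if chunk in table:
                letters.append(table[chunk])
                i += size  # jump past the code that was just decoded
                break
        else:
            # Assumption: an undecodable position is an error
            raise ValueError(f"no code matches at index {i}")
    return "".join(letters)


print(decode_by_slices(word, my_dict))  # internet
```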