Python: use re.sub with a dict to replace the first substring in each line

In Python, I use re.sub with a dict to replace multiple "exact" substrings:

import re

words = " apple pineapple cat category data old_data"
dic = {"apple":"apple_new", "cat":"cat_new", "data":"data_new"}

#pattern = re.compile("|".join(dic.keys()))
pattern = r'\b(' + r'|'.join([re.escape(x) for x in list(dic.keys())]) + r')\b'
new_words = re.sub(pattern, lambda m: dic[m.group(0)], words)

print(new_words)

Now I want to replace only the first substring in each line. The original words string is:

words = " apple cat data old_data\n     pineapple category data old_data\n   data old_data"

The result I expected is:

words = " apple_new cat data old_data\n     pineapple category data old_data\n   data_new old_data"

It keeps the spaces and \n and replaces only the first substring in each line. I tried \s* to match zero or more leading spaces on each line, but it doesn't work:

pattern = r'\s*\b(' + r'|'.join([re.escape(x) for x in list(dic.keys())]) + r')\b'

How do I fix it?

CodePudding user response：

Use re.sub with a callback function to match the first word on every line, then look that word up in your dictionary. If it is found, make the replacement; otherwise leave the word unchanged.

import re

words = " apple cat data old_data\n     pineapple category data old_data\n   data old_data"
dic = {"apple":"apple_new", "cat":"cat_new", "data":"data_new"}

def repl(m):
    if m.group(1) and m.group(1) in dic:
        return dic[m.group(1)]
    else:
        return m.group(1)

output = re.sub(r'^\s*(\w+)', repl, words, flags=re.M)
print(output)

This prints:

apple_new cat data old_data
pineapple category data old_data
data_new old_data
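
The output above loses the leading spaces, which the question asks to keep. A minimal sketch of one way to preserve them, assuming the indentation consists only of spaces and tabs: capture the indentation in its own group and put it back in the callback. The two-group pattern and the dic.get fallback are illustrative choices, not part of the original answer.

```python
import re

words = " apple cat data old_data\n     pineapple category data old_data\n   data old_data"
dic = {"apple": "apple_new", "cat": "cat_new", "data": "data_new"}

def repl(m):
    indent, word = m.group(1), m.group(2)
    # Put the captured indentation back; swap the word only if it is a key.
    return indent + dic.get(word, word)

# [ \t]* instead of \s* so the match cannot run across a line break.
output = re.sub(r'^([ \t]*)(\w+)', repl, words, flags=re.M)
print(output)
```

With the sample input this should leave the second line untouched, because its first word, pineapple, is not a key.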

CodePudding user response：

You can match from the start of each line, lazily capture everything up to the first key on that line, and replace only that key:

pattern = r'(?m)^(.*?)\b({})\b'.format(r'|'.join([re.escape(x) for x in list(dic.keys())]))
new_words = re.sub(pattern, lambda m: f'{m.group(1)}{dic[m.group(2)]}', words)

Output:

>>> print(new_words)
 apple_new cat data old_data
     pineapple category data_new old_data
   data_new old_data

The (?m)^(.*?) part means: match the start of a line ((?m)^), then capture into Group 1 zero or more chars other than line break chars, as few as possible.

The f'{m.group(1)}{dic[m.group(2)]}' replacement is a concatenation of Group 1 and the value of the found key.

Note that \b word boundaries won't work in case your keys start or end with special chars. In that case, use the adaptive dynamic word boundaries.
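
To illustrate the boundary problem, here is a rough sketch of one way to approximate that idea: add \b only on the side of a key that actually starts or ends with a word character. The helper key_pattern and the "c++" key are hypothetical, made up for this example, and this is not necessarily the exact construct the answer refers to.

```python
import re

def key_pattern(key):
    # Add \b only next to a word character; a key edge that is a
    # symbol (like the trailing "+" in "c++") gets no boundary check.
    left = r'\b' if re.match(r'\w', key) else ''
    right = r'\b' if re.search(r'\w\Z', key) else ''
    return left + re.escape(key) + right

dic = {"c++": "cpp", "data": "data_new"}  # hypothetical keys
pattern = re.compile('|'.join(key_pattern(k) for k in dic))
result = pattern.sub(lambda m: dic[m.group(0)], "c++ data old_data")
print(result)  # cpp data_new old_data
```

With a plain \b(...)\b wrapper, "c++" followed by a space would not match, since there is no word boundary between "+" and the space.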

If you have thousands of keys, use a regex trie rather than '|'.join(...).
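
A minimal sketch of the trie idea, assuming non-empty keys: build a character trie from the keys and turn it into a single pattern in which shared prefixes are factored out. The function trie_regex is a hypothetical helper written for this example, not a standard library API; the sketch only shows the shape of the approach.

```python
import re

def trie_regex(keys):
    # Build a nested-dict trie; the '' entry marks the end of a key.
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[''] = {}

    def to_pattern(node):
        is_end = '' in node
        branches = [re.escape(ch) + to_pattern(node[ch])
                    for ch in sorted(node) if ch]
        if not branches:
            return ''
        if len(branches) == 1 and not is_end:
            return branches[0]
        group = '(?:' + '|'.join(branches) + ')'
        # A key may end here, so the remaining branches are optional.
        return group + '?' if is_end else group

    return to_pattern(trie)

words = " apple cat data old_data\n     pineapple category data old_data\n   data old_data"
dic = {"apple": "apple_new", "cat": "cat_new", "data": "data_new"}

pattern = r'(?m)^(.*?)\b(' + trie_regex(dic) + r')\b'
new_words = re.sub(pattern, lambda m: m.group(1) + dic[m.group(2)], words)
print(new_words)
```

For the three keys here the generated pattern is just a plain alternation. The factoring only pays off when many keys share prefixes: for example, cat and category collapse into cat(?:egory)?.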