Search/replace words between custom tags by asterisks with python regex

In Python 3, I have a string:

CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service

I want to replace each character of the words between the <DO_NOT_PRINT> and </DO_NOT_PRINT> tags with an asterisk (and remove the tags), i.e.:

CONN ****/********@//host:port/service

The user and (especially) password strings can contain any characters.

What I have so far is:

import re
z = "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service"
REPLACEME = re.compile(r'<DO_NOT_PRINT>(.+)<\/DO_NOT_PRINT>')
found = REPLACEME.search(z)
print(found)
if found:
    old_text = found.group(1)
    new_z = z.replace(old_text, '*' * len(old_text))
    print(new_z)
else:
    print(z)

but it doesn't work correctly, as it prints:

CONN <DO_NOT_PRINT>******************************************</DO_NOT_PRINT>@//host:port/service

instead of:

CONN ****/********@//host:port/service

CodePudding user response：

Regex quantifiers are greedy by default, i.e. they match the longest string possible, so the (.+) captures:

user</DO_NOT_PRINT>/<DO_NOT_PRINT>password

You should add the non-greedy (lazy) modifier ? after the plus:

REPLACEME = re.compile(r'<DO_NOT_PRINT>(.+?)<\/DO_NOT_PRINT>')
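
A quick way to see the difference is to run both quantifiers against the sample string. This is only a minimal sketch; the variable names greedy and lazy are illustrative:

```python
import re

z = "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service"

# Greedy: .+ runs to the LAST closing tag in the string
greedy = re.search(r'<DO_NOT_PRINT>(.+)</DO_NOT_PRINT>', z)
# Lazy: .+? stops at the FIRST closing tag it can reach
lazy = re.search(r'<DO_NOT_PRINT>(.+?)</DO_NOT_PRINT>', z)

print(greedy.group(1))  # user</DO_NOT_PRINT>/<DO_NOT_PRINT>password
print(lazy.group(1))    # user
```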

Also, group(1) contains only the text between the tags, not the <DO_NOT_PRINT> tags themselves. If you want the tags to disappear as well, use group(0) to get the entire matched string. Try:

z.replace(found.group(0), '*' * len(old_text))
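
Put together with the lazy pattern, the single-match version might look like the sketch below. Note that search() only finds the first tagged section, so the password is still left untouched at this point:

```python
import re

z = "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service"
REPLACEME = re.compile(r'<DO_NOT_PRINT>(.+?)</DO_NOT_PRINT>')

found = REPLACEME.search(z)
new_z = z
if found:
    # group(0) is the tags plus the text; group(1) is the text only
    new_z = z.replace(found.group(0), '*' * len(found.group(1)))
print(new_z)
# CONN ****/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service
```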

Edit:

If you want to replace multiple occurrences, you can use re.finditer() and do one .replace() for each match: https://docs.python.org/3/library/re.html#re.finditer

import re
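z = "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service"
REPLACEME = re.compile(r'<DO_NOT_PRINT>(.+?)<\/DO_NOT_PRINT>')
# finditer() returns an iterator yielding one match object per tagged section
founds = REPLACEME.finditer(z)
print(founds)
for found in founds:
    old_text = found.group(1)
    # group(0) includes the tags, so they are removed together with the text
    z = z.replace(found.group(0), '*' * len(old_text))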
print(z)
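
One edge case to be aware of with (.+?): it needs at least one character between the tags, so an empty tagged value makes the match run on to the next closing tag. The input below is made up for illustration; swapping + for * (i.e. .*?) avoids the problem:

```python
import re

# Hypothetical input where the first tagged value is empty
s = "<DO_NOT_PRINT></DO_NOT_PRINT>/<DO_NOT_PRINT>pw</DO_NOT_PRINT>"

plus = re.compile(r'<DO_NOT_PRINT>(.+?)</DO_NOT_PRINT>')
star = re.compile(r'<DO_NOT_PRINT>(.*?)</DO_NOT_PRINT>')

# .+? must consume at least one character, so it swallows the first
# closing tag and only stops at the second one
print(plus.search(s).group(1))  # </DO_NOT_PRINT>/<DO_NOT_PRINT>pw

# .*? may match nothing, so each pair of tags is handled separately
print([m.group(1) for m in star.finditer(s)])  # ['', 'pw']
```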

Or, use Viktor's answer which looks more elegant.

CodePudding user response：

You can use

import re
z = "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service"
REPLACEME = re.compile(r'<DO_NOT_PRINT>(.*?)</DO_NOT_PRINT>', re.DOTALL)
print(REPLACEME.sub(lambda x: '*' * len(x.group(1)), z))
# => CONN ****/********@//host:port/service

NOTES:

re.compile(r'<DO_NOT_PRINT>(.*?)</DO_NOT_PRINT>', re.DOTALL) - the lazy quantifier *? makes sure the match stops at the first occurrence of the closing tag, and re.DOTALL makes sure . matches line break characters, too.
lambda x: '*' * len(x.group(1)) is the re.sub replacement argument, where x is the match object and x.group(1) is the Group 1 captured value, i.e. the text between the two tags.
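
To illustrate the two notes above, here is a small sketch that wraps the substitution in a helper and feeds it a value containing a line break. The name mask_tagged and the sample input are hypothetical, not part of the original answer:

```python
import re

REPLACEME = re.compile(r'<DO_NOT_PRINT>(.*?)</DO_NOT_PRINT>', re.DOTALL)

def mask_tagged(text):
    """Replace each tagged section with asterisks and drop the tags."""
    # The callable receives a match object for every non-overlapping match
    return REPLACEME.sub(lambda m: '*' * len(m.group(1)), text)

# The line break inside the tags counts as one character
print(mask_tagged("a<DO_NOT_PRINT>x\ny</DO_NOT_PRINT>b"))  # a***b
```

Without re.DOTALL, the . would not match the line break, so this particular input would come back unchanged with its tags still in place.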

If you are concerned with performance, unroll the lazy dot pattern:

REPLACEME = re.compile(r'<DO_NOT_PRINT>([^<]*(?:<(?!/DO_NOT_PRINT>)[^<]*)*)</DO_NOT_PRINT>')

re.DOTALL is not needed here: the pattern contains no ., and [^<] already matches line break characters.
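
As a sanity check, the sketch below compares the unrolled pattern with the lazy-dot version on a few made-up inputs, including a value that contains < and one that spans two lines. The helper name mask and the sample strings are assumptions for illustration. The timing call is included only so you can measure on your own data; no particular speed-up is claimed here:

```python
import re
import timeit

LAZY = re.compile(r'<DO_NOT_PRINT>(.*?)</DO_NOT_PRINT>', re.DOTALL)
UNROLLED = re.compile(r'<DO_NOT_PRINT>([^<]*(?:<(?!/DO_NOT_PRINT>)[^<]*)*)</DO_NOT_PRINT>')

def mask(pattern, text):
    """Replace each tagged section with asterisks using the given pattern."""
    return pattern.sub(lambda m: '*' * len(m.group(1)), text)

samples = [
    "CONN <DO_NOT_PRINT>user</DO_NOT_PRINT>/<DO_NOT_PRINT>password</DO_NOT_PRINT>@//host:port/service",
    "<DO_NOT_PRINT>pa<ss</DO_NOT_PRINT>",         # value containing '<'
    "<DO_NOT_PRINT>line1\nline2</DO_NOT_PRINT>",  # value spanning two lines
]

# Both patterns should produce identical output on every sample
for s in samples:
    assert mask(LAZY, s) == mask(UNROLLED, s)

print(mask(UNROLLED, samples[1]))  # *****

# Rough timing; the numbers depend on your machine and input
for name, pat in (("lazy", LAZY), ("unrolled", UNROLLED)):
    print(name, timeit.timeit(lambda: mask(pat, samples[0]), number=10000))
```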