Python regex with number of occurrences

Hi, I'm looking for a regular expression that would let me not only replace characters but also annotate how many times each kind occurs in a row.

For example, I would like to replace all special characters with "s", all letters with "c" and all digits with "d", and annotate the length of each run between "{}".

If I have "123-45AB-78!£", I would like to get d{3}s{1}d{2}c{2}s{1}d{2}s{2}.

Is there a way to do that with regex?

Many thanks

CodePudding user response：

Here is one approach using re.sub with a callback function:

import re

def repl(m):
    c = m.group()
    # Label each run by its class and append its length in braces.
    if re.search(r'^[A-Za-z]+$', c):
        return 'c{' + str(len(c)) + '}'
    elif re.search(r'^\d+$', c):
        return 'd{' + str(len(c)) + '}'
    else:
        return 's{' + str(len(c)) + '}'

x = "123-45AB-78!£"
# [^A-Za-z\d]+ rather than \D+, so a run of specials cannot swallow adjacent letters
print(re.sub(r'[A-Za-z]+|\d+|[^A-Za-z\d]+', repl, x))

# d{3}s{1}d{2}c{2}s{1}d{2}s{2}

Note that your input string contains a non-ASCII character (£). In Python 3, a str is a sequence of Unicode code points, so len(c) already counts characters and the code above works as is. In Python 2, where a plain string literal is a byte string, len() would count bytes instead. Assuming a UTF-8 encoding and a byte string s, you would decode first:

len(s.decode('utf8'))
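
The callback above runs a second regex on every match to work out which class it belongs to. As a variant sketch (not part of the answer as posted), each alternative can be given a named group so that the match object's lastgroup attribute reports the class directly. The names TOKEN and annotate are made up for illustration, and the code assumes Python 3.

```python
import re

# TOKEN and annotate are hypothetical names used only for this sketch.
# Each alternative is a named group; the group name doubles as the label.
TOKEN = re.compile(r"(?P<c>[A-Za-z]+)|(?P<d>\d+)|(?P<s>[^A-Za-z\d]+)")

def annotate(text):
    # m.lastgroup is the name of the alternative that matched: "c", "d" or "s".
    return TOKEN.sub(lambda m: f"{m.lastgroup}{{{len(m.group())}}}", text)

print(annotate("123-45AB-78!£"))  # d{3}s{1}d{2}c{2}s{1}d{2}s{2}
```

Because the three alternatives are mutually exclusive, exactly one named group participates in each match, so the label never has to be re-derived from the matched text.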

CodePudding user response：

Here is a method that first replaces each character by its type-character, then counts them with itertools.groupby. I'm not sure it is any faster than the good answer given by Tim, but it should be comparable.

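import re
from itertools import groupby

x = "123-45AB-78!£"
# Order matters: letters first, so the "c"/"d" inserted here are never re-mapped;
# the last pattern maps everything else (including "_") to "s".
for pat, sub in [(r"[A-Za-z]", "c"), (r"\d", "d"), (r"[^A-Za-z\d]", "s")]:
    x = re.sub(pat, sub, x)
print(x)  # dddsddccsddss
# Collapse each run of identical class letters into letter{count}.
y = "".join([f"{k}{{{len(list(g))}}}" for k, g in groupby(x)])
print(y)  # d{3}s{1}d{2}c{2}s{1}d{2}s{2}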
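
The same grouping idea can probably be done in one pass without the intermediate substitutions, by handing groupby a key function that classifies each character. This is only a sketch: char_class and annotate are hypothetical helper names, and "letter" and "digit" are assumed to mean ASCII letters and ASCII digits.

```python
from itertools import groupby
import string

def char_class(ch):
    # Hypothetical helper: ASCII letters -> "c", ASCII digits -> "d", else "s".
    if ch in string.ascii_letters:
        return "c"
    if ch in string.digits:
        return "d"
    return "s"

def annotate(text):
    # groupby batches consecutive characters that share the same class;
    # sum(1 for _ in g) counts the run without building a list.
    return "".join(
        f"{k}{{{sum(1 for _ in g)}}}" for k, g in groupby(text, key=char_class)
    )

print(annotate("123-45AB-78!£"))  # d{3}s{1}d{2}c{2}s{1}d{2}s{2}
```

Keeping the classification in a plain function makes it easy to change what counts as a letter or a digit without touching the grouping logic.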