Python: Regex to remove more than N consecutive letters

Let's say I have this string: Sayy Hellooooooo

If N = 2, I want the result (using regex) to be: Sayy Helloo

Thank you in advance.

CodePudding user response：

Another option is to use re.sub with a callback:

import re

N = 2
# (.)\1+ matches a run of 2+ identical characters; keep only the first N of each run
result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)
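The callback approach can be wrapped in a small function to try it on a few inputs. This is only a minimal sketch; the name cap_runs is illustrative and not part of the answer above.

```python
import re

def cap_runs(text: str, n: int) -> str:
    """Truncate every run of identical characters to at most n characters."""
    # (.)\1+ matches a character followed by one or more copies of itself;
    # the callback keeps only the first n characters of each matched run.
    return re.sub(r"(.)\1+", lambda m: m.group(0)[:n], text)

print(cap_runs("Sayy Hellooooooo", 2))  # Sayy Helloo
print(cap_runs("Sayy Hellooooooo", 1))  # Say Helo
```

Runs that are already short enough are matched too, but slicing them to n characters leaves them unchanged.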

CodePudding user response：

You could build the regex dynamically for a given n, and then call sub without callback:

import re

n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")

s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s))  # Sayy Helloo

Explanation:

{{: this double brace represents a literal brace in an f-string
{n-1} injects the value of n-1, so together with the additional (double) brace-wrap, this {{{n-1}}} produces {2} when n is 3.
The outer capture group captures the maximum allowed repetition of a character
The additional \2+ matches any further occurrences of that same character; these are the characters that need removal.
The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
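To see what the brace escaping produces, it may help to print the generated pattern for a few values of n. A small sketch, where build_pattern is a hypothetical helper name introduced only for this illustration:

```python
import re

def build_pattern(n: int) -> str:
    # {{ and }} become literal braces; {n-1} is replaced by the number.
    return rf"((.)\2{{{n-1}}})\2+"

for n in (1, 2, 3):
    print(n, build_pattern(n))
# 1 ((.)\2{0})\2+
# 2 ((.)\2{1})\2+
# 3 ((.)\2{2})\2+

print(re.sub(build_pattern(3), r"\1", "Sayy Hellooooooo"))  # Sayy Hellooo
```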

CodePudding user response：

You could use backreferences to match the previous character. So (a|b)\1 would match aa or bb. In your case you probably want any letter and any number of repetitions, so ([a-zA-Z])\1{n,} for n or more repetitions. Then substitute it with one occurrence using \1 again. Putting it all together:

import re

n = 2

expression = r"([a-zA-Z])\1{" + str(n) + ",}"
print(re.sub(expression, r"\1", "hellooooo friiiiiend"))
# Outputs: hello friend

Note this only matches runs of at least N + 1 characters, like your test cases: one character followed by N or more copies of it. If you also want to match runs of exactly N, subtract 1 from the count.

Remember to use r in front of regular expressions so you don't need to double escape backslashes.
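The snippet above collapses each long run to a single letter. If the goal is to keep N copies, as in the question, one possible variation is to repeat the backreference in the replacement. A sketch under that assumption, with limit_letter_runs as an illustrative name:

```python
import re

def limit_letter_runs(text: str, n: int) -> str:
    # Match a letter followed by n or more copies of itself (n + 1 or more in total).
    pattern = r"([a-zA-Z])\1{" + str(n) + ",}"
    # r"\1" * n repeats the backreference, so the replacement is n copies of the letter.
    return re.sub(pattern, r"\1" * n, text)

print(limit_letter_runs("hellooooo friiiiiend", 2))  # helloo friiend
print(limit_letter_runs("1111 aaaa", 2))             # 1111 aa
```

Because the character class is [a-zA-Z], runs of digits, spaces or punctuation are left alone.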

Learn more about backreferences: https://www.regular-expressions.info/backref.html
Learn more about repetition: https://www.regular-expressions.info/repeat.html

CodePudding user response：

You need a regex that searches for multiple occurrences of the same character. That is done with (.)\1, where \1 matches whatever group 1 (the part in parentheses) captured.

To match:

2 occurrences: (.)\1
3 occurrences: (.)\1\1 or (.)\1{2}
4 occurrences: (.)\1\1\1 or (.)\1{3}
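These patterns can be checked quickly with re.fullmatch. A short sketch, using made-up sample strings:

```python
import re

# (.)\1{k} matches k + 1 identical characters in a row.
assert re.fullmatch(r"(.)\1", "aa")
assert re.fullmatch(r"(.)\1{2}", "aaa")
assert re.fullmatch(r"(.)\1\1", "aaa")           # same as \1{2}
assert re.fullmatch(r"(.)\1{3}", "aaaa")
assert re.fullmatch(r"(.)\1{2}", "aab") is None  # characters differ

first_triple = re.search(r"(.)\1{2}", "Hellooooooo")
print(first_triple.group(0))  # ooo
```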

So you can build it with an f-string and the value you want. That's a bit ugly, because the literal braces need to be escaped by doubling them, with one more pair of braces inside to interpolate the value itself.

import re

def remove_letters(value: str, count: int):
    return re.sub(rf"(.)\1{{{count}}}", "", value)  # deletes each block of count + 1 identical characters


print(remove_letters("Sayy Hellooooooo", 1))  # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2))  # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3))  # Sayy Hellooo

The pattern construction may be easier to follow when written as plain concatenation:

r"(.)\1{" + str(count) + "}"

CodePudding user response：

This seems to work:

When N=2, the regex pattern is compiled to: ((\w)\2{2,})
When N=3, the regex pattern is compiled to: ((\w)\2{3,})

Code:

import re
N = 2
p = re.compile(r"((\w)\2{" + str(N) + r",})")

text = "Sayy Hellooooooo"
matches = p.findall(text)

for match in matches:
    text = re.sub(match[0], match[1]*N, text)

print(text)

Output:

Sayy Helloo

Note:

Also tested with N=3, N=4 and other text inputs.
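One thing to watch for in the loop above: each matched substring is reused as a pattern over the whole text, so a short run can also match inside a longer run of the same character. A single call to sub with a callback avoids that. The following is a sketch, with limit_word_runs as an illustrative name and a made-up second input chosen to show the difference:

```python
import re

def limit_word_runs(text: str, n: int) -> str:
    # (\w)\1{n,} matches a word character followed by n or more copies of itself.
    pattern = re.compile(r"(\w)\1{" + str(n) + r",}")
    # Each run is replaced exactly once, by n copies of its character.
    return pattern.sub(lambda m: m.group(1) * n, text)

print(limit_word_runs("Sayy Hellooooooo", 2))  # Sayy Helloo
print(limit_word_runs("aaa aaaaaa", 2))        # aa aa
```

With the findall loop, the second input would come out as aa aaaa, because the pattern aaa from the first match is also applied twice inside the longer run.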