Python: Regex to remove more than N consecutive letters

Time:12-11

Let's say I have this string: Sayy Hellooooooo

If N = 2,

I want the result (using regex) to be: Sayy Helloo

Thank you in advance.

CodePudding user response:

Another option is to use re.sub with a callback:

import re

N = 2

result = re.sub(r'(.)\1+', lambda m: m.group(0)[:N], your_string)
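For reference, here is a self-contained sketch of the callback approach. The function name limit_runs and the sample calls are only illustrative assumptions, not part of the answer above.

```python
import re

def limit_runs(text: str, n: int) -> str:
    """Truncate every run of one repeated character to at most n copies."""
    # (.)\1+ matches a character followed by one or more copies of itself,
    # i.e. a whole run of length >= 2; the callback keeps its first n characters.
    return re.sub(r"(.)\1+", lambda m: m.group(0)[:n], text)

print(limit_runs("Sayy Hellooooooo", 2))  # Sayy Helloo
print(limit_runs("Sayy Hellooooooo", 1))  # Say Helo
```

Since the pattern does not depend on N, it can be written once and reused for any limit. Note that . does not match a newline by default, so runs of blank lines are left alone.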

CodePudding user response:

You could build the regex dynamically for a given n, and then call sub without callback:

import re

n = 2
regex = re.compile(rf"((.)\2{{{n-1}}})\2+")

s = "Sayy Hellooooooo"
print(regex.sub(r"\1", s))  # Sayy Helloo

Explanation:

  • {{: this double brace represents a literal brace in an f-string
  • {n-1} injects the value of n-1, so together with the additional (double) brace-wrap, this {{{n-1}}} produces {2} when n is 3.
  • The outer capture group captures the maximum allowed repetition of a character
  • The additional \2+ matches any further occurrences of that same character; these are the characters that need removal.
  • The replacement with \1 thus reproduces the allowed repetition, but omits the additional repetition of that same character.
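To see what the f-string actually produces, a small sketch like the following may help. build_pattern is a hypothetical helper name, and the sketch assumes n is at least 1.

```python
import re

def build_pattern(n: int) -> str:
    # Doubled braces become literal braces; {n - 1} is the interpolated value.
    # Assumes n >= 1.
    return rf"((.)\2{{{n - 1}}})\2+"

for n in (1, 2, 3):
    pattern = build_pattern(n)
    print(n, pattern, re.sub(pattern, r"\1", "Sayy Hellooooooo"))
# 1 ((.)\2{0})\2+ Say Helo
# 2 ((.)\2{1})\2+ Sayy Helloo
# 3 ((.)\2{2})\2+ Sayy Hellooo
```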

CodePudding user response:

You could use backreferences to match the previous character. So (a|b)\1 would match aa or bb. In your case you probably want any letter and any number of repetitions, so ([a-zA-Z])\1{n,} matches a letter followed by n or more copies of itself. Then substitute the match with a single occurrence using \1 again. Putting it all together:

import re

n = 2

expression = r"([a-zA-Z])\1{" + str(n) + ",}"
print(re.sub(expression, r"\1", "hellooooo friiiiiend"))
# Outputs: hello friend


Note that this only matches runs of N+1 or more identical letters, like your test cases: one letter, then at least N copies of it. If you also want to match runs of exactly N, subtract 1 from n in the quantifier.
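The difference between the two quantifiers can be checked with a short sketch. collapse_runs is a hypothetical helper, and the sample text is made up for illustration.

```python
import re

def collapse_runs(text: str, min_copies: int) -> str:
    # Hypothetical helper: matches a letter followed by at least `min_copies`
    # copies of itself and replaces the whole run with a single letter.
    pattern = r"([a-zA-Z])\1{" + str(min_copies) + ",}"
    return re.sub(pattern, r"\1", text)

n = 2
text = "aa aaa aaaa"
print(collapse_runs(text, n))      # aa a a  (runs of n + 1 or more)
print(collapse_runs(text, n - 1))  # a a a   (runs of n or more)
```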

Remember to use r in front of regular expressions so you don't need to double escape backslashes.

Learn more about backreferences: https://www.regular-expressions.info/backref.html
Learn more about repetition: https://www.regular-expressions.info/repeat.html

CodePudding user response:

You need a regex that searches for multiple occurrences of the same character. That is done with (.)\1, where \1 matches whatever group 1 (the part in parentheses) captured.

To match

  • 2 occurrences : (.)\1
  • 3 occurrences : (.)\1\1 or (.)\1{2}
  • 4 occurrences : (.)\1\1\1 or (.)\1{3}
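The patterns in the list above can be sanity-checked with re.fullmatch. This is only a sketch, and run_pattern is a hypothetical helper name.

```python
import re

def run_pattern(total: int) -> str:
    # Hypothetical helper: one captured character plus (total - 1) backreferences
    return r"(.)\1{" + str(total - 1) + "}"

for total in (2, 3, 4):
    # fullmatch succeeds only if the entire string fits the pattern
    exact = re.fullmatch(run_pattern(total), "o" * total) is not None
    longer = re.fullmatch(run_pattern(total), "o" * (total + 1)) is not None
    print(total, run_pattern(total), exact, longer)
```

Each printed line ends with True False: the pattern covers exactly that many identical characters and no more.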

So you can build it with an f-string and the value you want. That's a bit ugly, because the literal braces have to be escaped by doubling them, with a third pair inside to interpolate the value itself.

import re

def remove_letters(value: str, count: int):
    # Deletes every block of count + 1 identical characters
    return re.sub(rf"(.)\1{{{count}}}", "", value)


print(remove_letters("Sayy Hellooooooo", 1))  # Sa Heo
print(remove_letters("Sayy Hellooooooo", 2))  # Sayy Hello
print(remove_letters("Sayy Hellooooooo", 3))  # Sayy Hellooo

You may find the pattern construction easier to follow written like this:

r"(.)\1{" + str(count) + "}"
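As a quick check that the f-string and the concatenation build the same pattern, here is a minimal sketch. The variable names are made up for illustration.

```python
count = 2

# Triple braces: two escaped literal braces around one interpolated value
from_fstring = rf"(.)\1{{{count}}}"
# Same pattern assembled from raw string pieces
from_concat = r"(.)\1{" + str(count) + "}"

print(from_fstring)                 # (.)\1{2}
print(from_fstring == from_concat)  # True
```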

CodePudding user response:

This seems to work:

  • When N=2, the regex pattern is compiled to: ((\w)\2{2,})
  • When N=3, the regex pattern is compiled to: ((\w)\2{3,})

Code:

import re
N = 2
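p = re.compile(r"((\w)\2{" + str(N) + r",})")

text = "Sayy Hellooooooo"
# Replace each run of more than N identical characters with exactly N copies,
# in a single pass (substituting match by match could alter other runs)
text = p.sub(lambda m: m.group(2) * N, text)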

print(text)

Output:

Sayy Helloo

Note:

Also tested with N=3, N=4 and other text inputs.
