With regex only, how can I match an exact number of consecutive repetitions of an arbitrary single token? For example, matching the "aaa" in “ttaaabbb” but not the "aaaa" in “ttaaaabbb”, given that the desired number of repetitions is 3.
Clarification: I used "a" only as an example; the token can be any character, digit, or symbol. That is, given that the desired number of repetitions is 3, matching against "aaaa**!!!cccc333**" should give only "!!!" and "333".
In short, I want to find the list of tokens "X" for which YXXXY appears in the given string (Y is any token different from X; Y can also be the start or the end of the string). Note that the list can contain repeated tokens, e.g., "aaabbbbaaa" should give ["a", "a"].
Some other examples:
Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]
Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"]
What I have tried: I tried (.)\1{2} but unfortunately it also matches three characters inside longer runs such as "aaaa" and "cccc" in the example above. I then changed it to (?!\1)(.)\1{2}(?!\1) so that the characters before and after the repeated run differ from it. However, this raises an error, because the first \1 is referenced before capture group 1 has been defined.
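For reference, the two attempts can be reproduced with a minimal sketch like the one below. It assumes Python's built-in re module; the names first, second_error, and sample are only illustrative.

```python
import re

sample = "aaaa!!!cccc333"

# First attempt: finds three equal characters, but also inside longer runs.
first = re.compile(r"(.)\1{2}")
first_matches = [m.group() for m in first.finditer(sample)]
print(first_matches)  # ['aaa', '!!!', 'ccc', '333']

# Second attempt: \1 is used before group 1 exists, so compiling fails.
try:
    re.compile(r"(?!\1)(.)\1{2}(?!\1)")
    second_error = None
except re.error as exc:
    second_error = exc
print(second_error)
```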
CodePudding user response:
You might use a pattern with 2 capture groups and a repeated backreference.
The first alternative matches, and thereby consumes, any run of 4 or more of the same character, which is what you want to skip. The second alternative then matches exactly 3 of the same character.
The single characters that you want are in capture group 2, which you can get using re.finditer for example.
(\S)\1{3,}|(\S)\2{2}
The pattern matches:
(\S)\1{3,}
  Capture group 1: match a non-whitespace character, then repeat the backreference 3 or more times
|
  Or
(\S)\2{2}
  Capture group 2: match a non-whitespace character, then repeat the backreference 2 times
For example:
import re

strings = [
    "aaaa**!!!cccc333**",
    "aaabbbbaaa",
    "aaabbbbbbaaa****ccc",
    "!!! aaaabbbaaa ccc!!!"
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"
for s in strings:
    result = []
    for match in re.finditer(pattern, s):
        # Group 2 is set only when the second alternative (exactly 3) matched.
        if match.group(2):
            result.append(match.group(2))
    print(result)
Output
['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
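If the repetition count should be a parameter rather than hard-coded as 3, the same two-alternative idea can be built from a variable. This is only a sketch: the function name exact_run_tokens and its parameters are made up for illustration, and it keeps the \S restriction, so runs of whitespace are ignored.

```python
import re

def exact_run_tokens(text, n):
    """Return the character of each run of exactly n equal non-whitespace chars."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Alternative 1 consumes runs of n + 1 or more characters (discarded later).
    # Alternative 2 can then only match runs of exactly n characters.
    pattern = re.compile(rf"(\S)\1{{{n},}}|(\S)\2{{{n - 1}}}")
    return [m.group(2) for m in pattern.finditer(text) if m.group(2)]

print(exact_run_tokens("aaabbbbbbaaa****ccc", 3))    # ['a', 'a', 'c']
print(exact_run_tokens("!!! aaaabbbaaa ccc!!!", 3))  # ['!', 'b', 'a', 'c', '!']
```

The doubled braces in the f-string produce literal { and } in the pattern, so for n = 3 the compiled pattern is the same as the one shown above.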
CodePudding user response:
You can do something like this using a regex and a loop:
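import re

def exact_re_match(string, length):
    # Match every maximal run of one repeated character.
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        # Keep only the runs whose length is exactly the requested one.
        if len(elm) == length:
            yield elm

string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']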
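If the length check in the loop is not acceptable because a single pattern is required, one possible alternative is to put the "previous character differs" test after the capture group, as a lookbehind. This is a sketch that assumes Python 3.5 or later, where re accepts fixed-width group references inside a lookbehind; the function name exact_runs_single_pattern is hypothetical.

```python
import re

def exact_runs_single_pattern(text, n):
    """Return every run of exactly n equal characters, using one pattern."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (.)        capture the first character of a candidate run
    # (?<!\1.)   the character before it must not be the same character
    # \1{n-1}    the run continues for n - 1 more copies
    # (?!\1)     and must not continue after that
    pattern = re.compile(rf"(.)(?<!\1.)\1{{{n - 1}}}(?!\1)")
    return [m.group() for m in pattern.finditer(text)]

print(exact_runs_single_pattern("aaaa!!!cccc333", 3))  # ['!!!', '333']
print(exact_runs_single_pattern("aaabbbbaaa", 3))      # ['aaa', 'aaa']
```

Use m.group(1) instead of m.group() if only the single token is wanted rather than the whole run.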