How to find all possible uniform substrings of a string?

I have a string like

aaabbbbcca

And I'd like to parse all possible uniform substrings from that. So my expected substrings for this string are

['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']

I tried the following

import re

print(re.findall(r"([a-z])(?=\1*)", "aaabbbbcca"))
# Output: ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'a']

Is it possible through regular expressions? If yes, then how?

CodePudding user response：

You can achieve what you need without a regex here:

result = []
text = "aaabbbbcca"
prev = ''
for c in text:
  if c == prev:
    result.append(result[-1] + c)
  else:
    result.append(c)
    prev = c
 
print(result)
# => ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']

See the Python demo.

In short, iterate over the string: when the current char differs from the previous one, append it as a new item; otherwise, append a new item equal to the last item in the result list with the same char concatenated to it.
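The same non-regex idea can also be sketched with `itertools.groupby`, which is a swapped-in standard-library helper rather than what the loop above uses. The function name `uniform_substrings` is hypothetical, chosen only for illustration.

```python
from itertools import groupby


def uniform_substrings(text):
    """Return every prefix of each run of identical characters, in order."""
    result = []
    # groupby yields one (char, iterator) pair per run of equal characters.
    for char, run in groupby(text):
        length = len(list(run))
        # A run of length n contributes char*1, char*2, ..., char*n.
        result.extend(char * n for n in range(1, length + 1))
    return result


print(uniform_substrings("aaabbbbcca"))
# => ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']
```

Because each run is handled as a unit, this version does not need to track the previous character by hand.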

With regex, the best you can do is

import re
text = "aaabbbbcca"
print( [x.group(1) for x in re.finditer(r'(?=((.)\2*))', text)] )
# => ['aaa', 'aa', 'a', 'bbbb', 'bbb', 'bb', 'b', 'cc', 'c', 'a']

See this Python demo. Here, (?=((.)\2*)) matches any location inside the string that is immediately followed by any one char (other than line break chars, unless you use the re.DOTALL option) and then zero or more occurrences of the same char. The whole run is captured into Group 1 and the single char into Group 2.
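To see why the lookahead produces one entry per position, a small sketch can pair each match's start offset with what Group 1 captured there. The helper name `lookahead_matches` is hypothetical and only serves this illustration.

```python
import re


def lookahead_matches(text):
    """Pair each zero-width match position with the run captured in group 1."""
    # Every match is empty; group 1 holds the rest of the run that
    # starts at that offset, so longer pieces come before shorter ones.
    return [(m.start(), m.group(1)) for m in re.finditer(r'(?=((.)\2*))', text)]


for start, run in lookahead_matches("aaabbbbcca"):
    print(start, run)
# First lines printed:
# 0 aaa
# 1 aa
# 2 a
# 3 bbbb
```

Each position inside a run sees only the remainder of that run, which is why the pieces come out longest first.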

CodePudding user response：

You can use a regex to find streaks of the same character, and then some Python on top to build the smaller streaks.

import re

s = 'aaabbbbcca'
matches = (m.group() for m in re.finditer(r'([a-z])\1*', s))
result = [m[:i] for m in matches for i in range(1, len(m) + 1)]
print(result)
# => ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']

CodePudding user response：

I think this particular problem can be solved with a regex. The approach is based on this answer, where parts of numbers are extracted, and the explanation is the same. Each match itself is empty (group 0), and the lookahead contains a capturing group (group 1) that captures a sequence of a, b or c of length at least 1. Afterward, we simply build a list of the strings captured in group 1.

import re 

s = "aaabbbbcca"
matches = re.finditer(r'(?=(a{1,}|b{1,}|c{1,}))',s)
results = [match.group(1) for match in matches]
print(results)

Output:

['aaa', 'aa', 'a', 'bbbb', 'bbb', 'bb', 'b', 'cc', 'c', 'a']

The output contains the same values as requested, but not in the same order.
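If the requested order matters, one possible sketch is to reverse the pieces within each run after the regex step. It relies on the observation that neighbouring runs always differ in their character, so consecutive pieces can be grouped by their first char. The function name `uniform_substrings_ordered` is hypothetical, and `itertools.groupby` is an addition not used in the answer above.

```python
import re
from itertools import groupby


def uniform_substrings_ordered(text):
    """Lookahead pieces, rearranged so each run goes from shortest to longest."""
    pieces = [m.group(1) for m in re.finditer(r'(?=(a{1,}|b{1,}|c{1,}))', text)]
    ordered = []
    # Consecutive pieces that start with the same char belong to one run.
    for _, run in groupby(pieces, key=lambda piece: piece[0]):
        # Within a run the regex yields longest first, so reverse it.
        ordered.extend(reversed(list(run)))
    return ordered


print(uniform_substrings_ordered("aaabbbbcca"))
# => ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']
```

Like the pattern it reuses, this sketch only recognises the characters a, b and c.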