Combining consecutive matches and separating non-consecutive matches with re.findall

Time:05-22

I have a string of the format:

my_string = 'hello|foo world|foo how|bar are|bar you|bar today|foo'

I want to return a list where all consecutive words that are followed by foo are grouped together in the same string, but words with a '|bar' word in between are in separate strings. If I try a lookahead with repetition:

re.findall(r'(\w+(?=\|foo\b))+', my_string)

returns

['hello', 'world', 'today']

but what I would like to return is

['hello world', 'today']

Because 'hello' and 'world' are not separated by a non-foo word.

In my real problem, the number of times sequences of words followed by 'foo' will appear in the string being searched is unknown, and 'bar' could be several different patterns.

I could solve it with a couple of replacements: first replace every run of non-foo patterns with a split indicator and split on it, then remove the foos and strip spaces:

bars_removed = re.sub(r'(\w+\|(?!foo)[a-z]+ )+', 'split_string', my_string)
only_foo_words = [re.sub(r'\|foo', '', x).strip() for x in bars_removed.split('split_string')]

which returns the desired result, but I feel like there's a way to do this using findall or maybe finditer that I'm missing.

CodePudding user response:

You cannot "exclude" text in between other texts that are captured into the same group.
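To illustrate the point, here is a minimal sketch (the pattern is mine, not from the question) that wraps each word in a repeated capturing group. Each match is one contiguous slice of the input, tags included, and the repeated group keeps only the word from its last iteration:

```python
import re

my_string = 'hello|foo world|foo how|bar are|bar you|bar today|foo'

# Illustrative pattern: a repeated group that captures each word
# and consumes the |foo tag plus any trailing whitespace.
pattern = re.compile(r'(?:(\w+)\|foo\s*)+')

# group(0) is the whole contiguous match, tags and spaces included;
# group(1) holds only the last word captured by the repeated group.
results = [(m.group(0), m.group(1)) for m in pattern.finditer(my_string)]

for whole, last_word in results:
    print(repr(whole), repr(last_word))
```

So the first match covers 'hello|foo world|foo ' as one piece, but its group only remembers 'world'. There is no way for a single group to hold 'hello world' with the tags skipped.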

You need to replace the lookahead with a consuming pattern, extract each run of consecutive matches, and then remove the |foo with a plain str.replace call as a post-processing step.

final_list = [x.replace('|foo', '') for x in re.findall(r'\w+\|foo(?:\s+\w+\|foo)*', my_string)]

See the regex demo.

Details:

  • \w+ - one or more word chars
  • \|foo - a |foo string
  • (?:\s+\w+\|foo)* - a non-capturing group matching zero or more sequences of
    • \s+ - one or more whitespaces
    • \w+\|foo - one or more word chars and then a |foo string.
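Putting it together, a self-contained sketch of the approach might look like the following. The function name group_tagged_words and its tag parameter are hypothetical, added only for illustration. The trailing \b is also an addition, borrowed from the lookahead in the question, so that a tag like |food is not treated as |foo:

```python
import re

def group_tagged_words(text, tag='foo'):
    """Return runs of consecutive words that are each followed by |<tag>."""
    marker = '|' + tag
    suffix = re.escape(marker)  # escape the pipe so it is matched literally
    # One tagged word, then zero or more whitespace-separated tagged words.
    pattern = rf'\w+{suffix}\b(?:\s+\w+{suffix}\b)*'
    # Each match is a whole run; drop the markers as a post-processing step.
    return [run.replace(marker, '') for run in re.findall(pattern, text)]

my_string = 'hello|foo world|foo how|bar are|bar you|bar today|foo'
print(group_tagged_words(my_string))
```

Because the pattern only describes the |foo words, it does not matter how many different non-foo tags appear in between: anything that is not a tagged word simply ends the current run.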