Extract multiple matches after a specific string or pattern

I have the below sentence:

we have a Newliner Chatacter in 10000 the Middle of the sentence

When I use the regex (\w+) in Python, I get the output below:

we 
have 
a 
Newliner 
Chatacter 
in 
10000 
the 
Middle 
of 
the 
sentence

But when I try to find the words that follow a specific word using the regex (?<=a )(\w+), I do not get the desired output.

What I am getting is

Newliner

When I use the regex (?<=a ).*, I get the rest of the words as one match.

What I need is the words below as separate matches:

Newliner 
Chatacter 
in 
10000 
the 
Middle 
of 
the 
sentence

CodePudding user response：

You could use a regex replacement followed by a split: trim off the leading text up to and including the word a, then split what remains.

import re

inp = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
# Remove everything up to and including the whole word "a" and the whitespace after it
words = re.sub(r'^.*\ba\b\s+', '', inp).split()
print(words)

This prints:

['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
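
One detail worth knowing: the greedy .* backtracks from the end of the string, so this pattern trims up to the last whole-word a, not the first. Here is a small sketch of the difference. The input inp2 and the lazy .*? variant are illustrative additions, not part of the answer above.

```python
import re

# Hypothetical input that contains the whole word "a" twice
inp2 = "this is a test of a regex"

# Greedy .* trims everything up to the LAST whole-word "a"
after_last = re.sub(r'^.*\ba\b\s+', '', inp2).split()

# Lazy .*? trims only up to the FIRST whole-word "a"
after_first = re.sub(r'^.*?\ba\b\s+', '', inp2).split()

print(after_last)   # ['regex']
print(after_first)  # ['test', 'of', 'a', 'regex']

# If the word never occurs, nothing is removed and every word comes back
no_anchor = re.sub(r'^.*?\ba\b\s+', '', "no match here").split()
print(no_anchor)    # ['no', 'match', 'here']
```

If an empty result is preferable when the word is missing, check with re.search first.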

CodePudding user response：

TLDR:

With Python re, you need two steps: a) find the position from which you want to start matching, and b) use Pattern.findall / Pattern.finditer with that position to get all expected matches (see the first two snippets below).
With the PyPI regex module, you can extract these matches in one go using a \G-based regex pattern.

re way

Getting multiple matches after a specific text is a common regex question, but Python re has no way to do that without additional steps. That is, you first need to find the position from which to start extracting matches, and then pass this position to the Pattern.findall() method (note that the module-level re.findall function cannot be used, as it does not accept a pos argument).
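
To make the difference concrete, here is a minimal sketch using the sentence from the question. The start index 10 is hardcoded purely for illustration (it is where "Newliner" begins).

```python
import re

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"

# The lookbehind is checked for every match, so only the word
# directly preceded by "a " qualifies
lookbehind_only = re.findall(r'(?<=a )\w+', text)
print(lookbehind_only)  # ['Newliner']

# A compiled pattern accepts a start position (pos) as its second argument
word_rx = re.compile(r'\w+')
from_pos = word_rx.findall(text, 10)
print(from_pos)

# The module-level re.findall takes flags as its third argument, not pos.
# Slicing is the closest equivalent, although anchors and lookbehinds
# then only see the slice, not the full string
from_slice = re.findall(r'\w+', text[10:])
print(from_slice)
```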

To accomplish the first step, you can use either a regex (if you need to start matching from the first/last/nth occurrence of some pattern) or a simple literal-string approach (with slicing, str.find(), etc.).
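
For the nth-occurrence case, one possible sketch combines re.finditer with itertools.islice. The helper names end_of_nth_match and words_after_nth are made up for this illustration and are not part of the re module.

```python
import re
from itertools import islice

WORD_RX = re.compile(r'\w+')

def end_of_nth_match(pattern, text, n=1):
    """Return the end offset of the n-th (1-based) match of pattern, or None."""
    match = next(islice(re.finditer(pattern, text), n - 1, None), None)
    return match.end() if match else None

def words_after_nth(pattern, text, n=1):
    """Return all words found after the n-th match of pattern."""
    pos = end_of_nth_match(pattern, text, n)
    # Pattern.findall needs an int for pos, so handle "not found" explicitly
    return [] if pos is None else WORD_RX.findall(text, pos)

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
print(words_after_nth(r'\bthe\b', text, 1))  # ['Middle', 'of', 'the', 'sentence']
print(words_after_nth(r'\bthe\b', text, 2))  # ['sentence']
print(words_after_nth(r'\bthe\b', text, 3))  # []
```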

Here, it looks like you need to start extracting matches after a whole word a. In this case, you can use a regex based approach:

import re

def extract_all_after_pattern(search_after_rx, reg, text):
    # Step 1: locate the first match of the "search after" pattern
    search_after = re.search(search_after_rx, text)
    if search_after:
        # print(f"Found at {search_after.start()}, searching from {search_after.end()}, i.e. in '{text[search_after.end():]}'")
        # Found at 8, searching from 9, i.e. in ' Newliner Chatacter in 10000 the Middle of the sentence'
        # Step 2: collect all matches starting right after that match
        return reg.findall(text, search_after.end())
    else:
        return []

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
search_after_rx = r'\ba\b' # Or, r'a\s', r'\sa\s', r'(?<!\S)a(?!\S)', etc.
reg = re.compile(r'\w+')
print(extract_all_after_pattern(search_after_rx, reg, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']


If you know the string you want to start extracting after is not a pattern, but a literal string, you can use a "simpler" approach:

import re

def extract_all_after_string(search_after, reg, text):
    # Step 1: locate the first occurrence of the literal string
    start = text.find(search_after)
    if start >= 0:
        # print(f"Found at {start}, looking from {start + len(search_after)}, i.e. in '{text[start + len(search_after):]}'")
        # Found at 8, looking from 10, i.e. in 'Newliner Chatacter in 10000 the Middle of the sentence'
        # Step 2: collect all matches starting right after that string
        return reg.findall(text, start + len(search_after))
    return []

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
search_after = "a "
reg = re.compile(r'\w+')
print(extract_all_after_string(search_after, reg, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']

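
When the words are simply whitespace-separated, the literal-string case may not need re at all. This sketch uses str.partition and assumes " a " (with surrounding spaces) is an acceptable delimiter. Note that it would not match if a were the very first word of the text.

```python
text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"

# partition splits at the first occurrence of the separator and returns
# (head, separator, tail); the separator is '' when it is not found
head, sep, tail = text.partition(" a ")
words = tail.split() if sep else []
print(words)
# ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']

# Separator absent: an empty list rather than the whole sentence
_, sep2, tail2 = text.partition(" zzz ")
missing = tail2.split() if sep2 else []
print(missing)  # []
```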

PyPI regex way

To extract matches after a certain string with a single regex, you need to install the PyPI regex module (run pip install regex in your terminal) and then use a \G-based regex (see "Continuing at The End of The Previous Match"):

import regex
text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
pattern = r'(?:\G(?!^)|\ba\b)\W*(\w+)'
print(regex.findall(pattern, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']

Details:

(?:\G(?!^)|\ba\b) - either the end of the previous match or a whole word a
\W* - zero or more non-word chars (this must be used to get to the next word char chunk)
(\w+) - Group 1 (the value returned by regex.findall): one or more word chars.

CodePudding user response：

Ask yourself: do you really need regular expressions? A simple generator function may be enough.

def words_after_stop_word(sentence, stop_word=None):
    for w in sentence.split():
        if stop_word is None:
            yield w
        elif w == stop_word:
            stop_word = None

sentence = 'we have a Newliner Chatacter in 10000 the Middle of the sentence'
for w in words_after_stop_word(sentence, 'a'):
    print(w)

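
The same idea can also be sketched with itertools. The function name words_after is made up for this illustration. Like the generator above, it returns nothing when the stop word never appears.

```python
from itertools import dropwhile, islice

def words_after(sentence, stop_word):
    """Return the words that follow the first occurrence of stop_word."""
    words = sentence.split()
    # dropwhile skips words until stop_word is seen;
    # islice(..., 1, None) then skips stop_word itself
    return list(islice(dropwhile(lambda w: w != stop_word, words), 1, None))

sentence = 'we have a Newliner Chatacter in 10000 the Middle of the sentence'
print(words_after(sentence, 'a'))
# ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
print(words_after(sentence, 'missing'))  # []
```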