I have the below sentence:
we have a Newliner Chatacter in 10000 the Middle of the sentence
When using the regex (\w+) in Python, I get the output below:
we
have
a
Newliner
Chatacter
in
10000
the
Middle
of
the
sentence
But when I try to find words between 2 words using the regex (?<=a )(\w+), I do not get the desired output.
What I get is:
Newliner
When I use the regex (?<=a ).*, I get the rest of the words as one group.
What I need is the words below as separate groups:
Newliner
Chatacter
in
10000
the
Middle
of
the
sentence
CodePudding user response:
You could use a regex replacement followed by a split. We can trim off the leading text up to and including the a, then split what remains.
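import re

inp = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
# Strip everything up to the whole word "a" and the whitespace after it, then split on whitespace.
words = re.sub(r'^.*\ba\b\s+', '', inp).split()
print(words)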
This prints:
['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
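One caveat worth noting: `.*` is greedy, so the pattern above trims up to the last whole-word a in the string. If the sentence may contain several, a lazy `.*?` stops at the first one instead. A minimal sketch of the difference, where the sample string and the helper name `words_after` are illustrative assumptions and not part of the answer above:

```python
import re

def words_after(inp, lazy=True):
    # Lazy ".*?" stops at the first whole-word "a"; greedy ".*" runs to the last one.
    prefix = r'^.*?\ba\b\s+' if lazy else r'^.*\ba\b\s+'
    return re.sub(prefix, '', inp).split()

sample = "a b a c d"  # made-up input with two standalone "a" words
print(words_after(sample, lazy=True))   # ['b', 'a', 'c', 'd']
print(words_after(sample, lazy=False))  # ['c', 'd']
```

For the sentence in the question both variants give the same result, since it contains only one standalone a.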
CodePudding user response:
TLDR:
- With Python re, you need to do two steps: a) get the location from where you want to start matching, and then b) use Pattern.findall / Pattern.finditer to get all expected matches (see the first two snippets below)
- With the PyPi regex module, you can extract these matches in one go using a \G based regex pattern.
re way
Getting multiple matches after a specific text is a common regex question, but Python re has no way to do that without additional steps. That is, you first need to find the position from which to start extracting matches, and then pass this position to the Pattern.findall() method (note that it can't be the module-level re.findall function, as it does not accept a pos argument).
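To make the pos argument concrete, here is a small sketch. The start index 10 is hard-coded purely for illustration (it is where Newliner begins in the sample sentence); the snippets further down compute it instead.

```python
import re

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
word_rx = re.compile(r'\w+')

# A compiled pattern takes an optional start position as its second argument.
after_pos = word_rx.findall(text, 10)
print(after_pos)
# ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']

# For this simple pattern, slicing the string first gives the same words.
print(after_pos == re.findall(r'\w+', text[10:]))  # True
```

Slicing and pos are not always interchangeable: with pos, anchors and lookbehinds still see the full string, which is usually what you want here.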
To accomplish the first step, you can use either a regex (if you need to start matching from the first/last/nth occurrence of some pattern) or a simple hardcoded/literal string approach (either with slicing or using str.find(), etc.).
Here, it looks like you need to start extracting matches after a whole word a. In this case, you can use a regex-based approach:
import re

def extract_all_after_pattern(search_after_rx, reg, text):
    # Step 1: locate the text after which matching should start.
    search_after = re.search(search_after_rx, text)
    if search_after:
        # print(f"Found at {search_after.start()}, searching from {search_after.end()}, i.e. in '{text[search_after.end():]}'")
        # Found at 8, searching from 9, i.e. in ' Newliner Chatacter in 10000 the Middle of the sentence'
        # Step 2: collect all matches, starting right after that match.
        return reg.findall(text, search_after.end())
    return []

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
search_after_rx = r'\ba\b' # Or, r'a\s', r'\sa\s', r'(?<!\S)a(?!\S)', etc.
reg = re.compile(r'\w+')
print(extract_all_after_pattern(search_after_rx, reg, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
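If you also need the position of each word, Pattern.finditer accepts the same pos argument. A sketch along the same lines; the helper name `iter_matches_after` is an illustrative assumption, not something defined elsewhere in this answer:

```python
import re

def iter_matches_after(search_after_rx, reg, text):
    """Yield match objects for reg found after the first search_after_rx match."""
    search_after = re.search(search_after_rx, text)
    if search_after:
        # Pattern.finditer accepts the same pos argument as Pattern.findall.
        yield from reg.finditer(text, search_after.end())

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
spans = [(m.group(), m.start())
         for m in iter_matches_after(r'\ba\b', re.compile(r'\w+'), text)]
print(spans[:2])  # [('Newliner', 10), ('Chatacter', 19)]
```

The reported offsets are relative to the full string, not to the start position.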
If you know the string you want to start extracting after is not a pattern, but a literal string, you can use a "simpler" approach:
import re

def extract_all_after_string(search_after, reg, text):
    # str.find returns -1 when the literal string is not present.
    start = text.find(search_after)
    if start >= 0:
        # print(f"Found at {start}, looking from {start + len(search_after)}, i.e. in '{text[start + len(search_after):]}'")
        # Found at 8, looking from 10, i.e. in 'Newliner Chatacter in 10000 the Middle of the sentence'
        return reg.findall(text, start + len(search_after))
    return []

text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
search_after = "a "
reg = re.compile(r'\w+')
print(extract_all_after_string(search_after, reg, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
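Keep in mind that a literal like "a " is not word-aware: it also matches the end of any word that finishes with a. The sketch below uses a made-up input to contrast the literal search with the whole-word regex search:

```python
import re

text = "banana a b c"  # made-up input: "banana" also ends with "a" + space
reg = re.compile(r'\w+')

# Literal search: the first "a " is the tail of "banana" (index 5).
literal_start = text.find("a ") + len("a ")
print(reg.findall(text, literal_start))  # ['a', 'b', 'c']

# Whole-word search: \ba\b only matches the standalone "a" (index 7).
word_match = re.search(r'\ba\b', text)
print(reg.findall(text, word_match.end()))  # ['b', 'c']
```

So the literal approach is only "simpler" when the search string cannot occur inside other words.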
PyPi regex way
To extract matches after a certain string with a single regex, you need to install the PyPi regex module (run pip install regex in your terminal) and then use a \G based regex (see "Continuing at The End of The Previous Match"):
import regex
text = "we have a Newliner Chatacter in 10000 the Middle of the sentence"
pattern = r'(?:\G(?!^)|\ba\b)\W*(\w+)'
print(regex.findall(pattern, text))
# => ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
Details:
- (?:\G(?!^)|\ba\b) - either the end of the previous match or a whole word a
- \W* - zero or more non-word chars (this must be used to get to the next word char chunk)
- (\w+) - Group 1 (the value returned by regex.findall): one or more word chars.
CodePudding user response:
Ask yourself: do you really need regular expressions? Maybe a generator function will be enough.
def words_after_stop_word(sentence, stop_word=None):
    for w in sentence.split():
        if stop_word is None:
            # The stop word has been seen (or none was given): emit the word.
            yield w
        elif w == stop_word:
            # Found the stop word: emit everything that follows it.
            stop_word = None

sentence = 'we have a Newliner Chatacter in 10000 the Middle of the sentence'
for w in words_after_stop_word(sentence, 'a'):
    print(w)
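The same idea can also be written with the standard library's itertools, if you prefer not to track state by hand. This is only a sketch of an alternative; the function name `words_after` is an illustrative assumption:

```python
from itertools import dropwhile, islice

def words_after(sentence, stop_word):
    # dropwhile skips words until the stop word; it still yields the stop word itself.
    from_stop = dropwhile(lambda w: w != stop_word, sentence.split())
    # islice(..., 1, None) drops that first item, leaving only the words after it.
    return list(islice(from_stop, 1, None))

sentence = 'we have a Newliner Chatacter in 10000 the Middle of the sentence'
print(words_after(sentence, 'a'))
# ['Newliner', 'Chatacter', 'in', '10000', 'the', 'Middle', 'of', 'the', 'sentence']
```

If the stop word never occurs, the result is an empty list.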