I'm currently trying to extract 4 words after "our", but keep getting words after "hour" and "your" as well.
For example, the column contains the text: "my family will send an email in 2 hours when we arrive at."
What I want: nan (since there is no "our")
What I get: when we arrive at (because "hours" has "our" in it)
I tried the following code and still have no luck.
our = r'our\W+(?P<after>(?:\w+\W+){,4})'
Reviews_C['Review_for_Fam'] = Reviews_C.ReviewText2.str.extract(our, expand=True)
Can you please help?
Thank you!
CodePudding user response:
You need to make sure "our" is with space boundaries, like this:
our = r'(?:^|\s+)our(?:\s+)?\W+(?P<after>(?:\w+\W+){,4})'
Specifically, (?:^|\s+)our(?:\s+)? is the part to adjust. The extra groups are non-capturing so that str.extract returns only the "after" column.
This example only handles whitespace and the start of the string, so you may need to extend it to cover quotes or other special characters.
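Another option, not taken from the answer above, is the `\b` word-boundary anchor, which avoids listing every character that may come before "our". The sketch below uses only the standard `re` module. The names `OUR_PATTERN` and `words_after_our` are made up for illustration, and the same pattern string should also be usable with `Series.str.extract`, though that is not tested here.

```python
import re

# \b asserts a word boundary, so the "our" inside "hours" or "your" is skipped.
OUR_PATTERN = re.compile(r"\bour\W+(?P<after>(?:\w+\W+){,4})")

def words_after_our(text):
    """Return up to four words following the standalone word "our", or None."""
    match = OUR_PATTERN.search(text)
    return match.group("after").strip() if match else None

# The sentence from the question has no standalone "our".
print(words_after_our("my family will send an email in 2 hours when we arrive at."))
# A sentence that does contain it.
print(words_after_our("we love our new house by the lake a lot"))
```

Like the original pattern, this only counts a word when it is followed by at least one non-word character, so a word at the very end of the text is not captured.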
CodePudding user response:
I'm surprised to see regex used for this, since it can add unneeded complexity. Could something like this work?
def extract_next_words(sentence):
    # split the sentence into words
    words = sentence.split()
    # "our" is absent (e.g. only "hours" or "your" appear): nothing to extract
    if "our" not in words:
        return None
    # find the index of the first "our"
    index = words.index("our")
    # extract the next 4 words
    next_words = words[index + 1:index + 5]
    # join the words into a string
    return " ".join(next_words)
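A plain `split()` keeps punctuation and case, so "our," or "Our" would not be found. A minimal sketch of a more tolerant variant follows, assuming that tokenizing on `\w+` is acceptable for the review text (contractions such as "don't" are split in two). The function name `next_words_after` and its parameters are hypothetical.

```python
import re

def next_words_after(text, target="our", count=4):
    """Return up to `count` words after the first standalone `target`, or None."""
    # \w+ tokenization drops punctuation, so "our," still counts as "our".
    words = re.findall(r"\w+", text)
    # Compare case-insensitively, but return the words as originally written.
    lowered = [w.lower() for w in words]
    if target not in lowered:
        return None
    index = lowered.index(target)
    return " ".join(words[index + 1:index + 1 + count])

print(next_words_after("Our, friends and family came over today"))
print(next_words_after("my family will send an email in 2 hours when we arrive at."))
```

Because whole tokens are compared, "hours" and "your" never match "our".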