I am trying to extract the 3 words before and after a given word using regex in Python. It works for most cases, but it fails when the given word occurs twice within the 3-word window, as in the code snippet below (the given word is "hello").
import re
new_text = "I am going to say hello and hello to him"
re.findall(r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,3})(hello)((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,3})", new_text)
Expected Output:
[('going to say ', 'hello', ' and hello to'), ('say hello and ', 'hello', ' to him')]
Actual Output:
[('going to say ', 'hello', ' and hello to')]
From my research, this happens because the regex consumes the text it matches, so the second "hello" is no longer available for another match. I need to capture the surrounding region because I will be processing it further.
Any advice on how to proceed will be greatly appreciated (Regex or non-regex).
Thanks!
CodePudding user response:
You can match the following regular expression:
r'(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))'
There are three capture groups. They contain:
- a string of three words preceding 'hello'
- the word 'hello'
- a string of one to three words following 'hello', as many as possible.
There are two matches in the string:
"I am going to say hello and hello to him"
1st match (zero-width, before 'g' in 'going')
- Capture group 1: "going to say"
- Capture group 2: "hello"
- Capture group 3: "and hello to"
2nd match (zero-width, before 's' in 'say')
- Capture group 1: "say hello and"
- Capture group 2: "hello"
- Capture group 3: "to him"
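As a quick check of the two matches listed above, here is a minimal sketch that runs the pattern with `re.finditer` and records where each zero-width match starts along with its capture groups. The variable names (`text`, `pattern`, `results`) are illustrative only.

```python
import re

text = "I am going to say hello and hello to him"

# The whole pattern sits inside a lookahead, so every match is zero-width:
# nothing is consumed, and overlapping contexts can each be reported.
pattern = re.compile(r'(?=\b((?:\w+ +){2}\w+) +(hello) +((?:\w+ +){0,2}\w+\b))')

# Collect the start offset of each match together with its three groups.
results = [(m.start(), m.groups()) for m in pattern.finditer(text)]

for start, groups in results:
    print(start, groups)
```

The offsets correspond to the 'g' in 'going' and the 's' in 'say'. Because the lookahead consumes nothing, the second "hello" remains available when the scan continues.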
Note that saving 'hello' to a capture group is really unnecessary because there will not be a match if 'hello' is not present in its required position. Observe also that I have constructed the regular expression in such a way that all capture groups begin and end with a word character (rather than with a space as shown in the question).
The regular expression can be broken down as follows. (Note that I show a space as a character class containing one space, merely to make the space character visible to the reader.)
(?=               # begin positive lookahead
  \b              # match word boundary
  (               # begin capture group 1
    (?:\w+[ ]+)   # match >= 1 word chars followed by >= 1 spaces
    {2}           # execute preceding non-capture group 2 times
    \w+           # match >= 1 word chars
  )               # end capture group 1
  [ ]+            # match >= 1 spaces
  (hello)         # match literal and save to capture group 2
  [ ]+            # match >= 1 spaces
  (               # begin capture group 3
    (?:\w+[ ]+)   # match >= 1 word chars followed by >= 1 spaces
    {0,2}         # execute preceding non-capture group 0-2 times
    \w+           # match >= 1 word chars
    \b            # match a word boundary
  )               # end capture group 3
)                 # end positive lookahead
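Because spaces are written as `[ ]` in this breakdown, a commented pattern of this kind can be compiled with `re.VERBOSE`. The sketch below assumes that approach and wraps it in a hypothetical helper, `context_windows`, which is not part of the original answer. It uses `re.escape` so that any target word can be substituted for 'hello', and it keeps each quantifier next to its group.

```python
import re

def context_windows(text, word):
    """Return (before, word, after) tuples for each occurrence of `word`
    preceded by three words and followed by one to three words.
    Hypothetical helper for illustration only."""
    pattern = re.compile(
        r"""
        (?=                        # begin positive lookahead
          \b                       # word boundary
          ((?:\w+[ ]+){2}\w+)      # group 1: the three preceding words
          [ ]+                     # one or more spaces
          (""" + re.escape(word) + r""")   # group 2: the target word
          [ ]+                     # one or more spaces
          ((?:\w+[ ]+){0,2}\w+\b)  # group 3: one to three following words
        )                          # end positive lookahead
        """,
        re.VERBOSE,
    )
    return pattern.findall(text)

windows = context_windows("I am going to say hello and hello to him", "hello")
print(windows)
```

As with the original pattern, an occurrence with fewer than three words before it produces no match. If shorter leading contexts are needed, the `{2}` quantifier would have to be relaxed.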