I am trying to write a regex that matches the individual words, without the surrounding blank spaces, between two given strings. At the moment I have this one:
(?<=STR1)(?:\s*)(.*?)(?:\s*)(?=STR2)
I want to apply it to the following text:
WORD0 STR1 WORD1 WORD2 WORD3
WORD4 WORD5 STR2 WORD6
I want a regex that matches WORD1, WORD2, WORD3, WORD4 and WORD5.
PS: I am working with Python. Thank you.
CodePudding user response:
You cannot do that with re because 1) it does not support lookbehind patterns of unknown length and 2) it has no support for the \G operator, which can be used to match strings in between two strings.
So, what you can do is pip install regex, and then use
import regex
text = "WORD0 STR1 WORD1 WORD2 WORD3 \nWORD4 WORD5 STR2 WORD6"
print( regex.findall(r"(?<=STR1.*)\w+(?=.*STR2)", text, regex.DOTALL) )
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
Details:
(?<=STR1.*) - a positive lookbehind matching STR1 and any zero or more chars immediately to the left of the current location
\w+ - one or more word chars
(?=.*STR2) - a positive lookahead matching any zero or more chars and STR2 immediately to the right of the current location.
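If installing a third-party package is not an option, the following is a minimal sketch that uses only the standard re module. It first confirms that re rejects the variable-length lookbehind, then works around it in two steps instead of one pattern: capture everything between the delimiters, then pull the words out of the captured text. The helper name words_between is hypothetical, and the sketch assumes you want the span from the first STR1 to the nearest STR2 after it, which may differ from the single-pattern behavior when the delimiters occur more than once.

```python
import re

text = "WORD0 STR1 WORD1 WORD2 WORD3 \nWORD4 WORD5 STR2 WORD6"

# re refuses to compile a lookbehind whose width is not fixed.
try:
    re.compile(r"(?<=STR1.*)\w+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)
# => False


def words_between(s, start="STR1", end="STR2"):
    """Hypothetical helper: words between the first `start` and the next `end`."""
    # Step 1: lazily capture the text between the two delimiters.
    m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), s, re.DOTALL)
    # Step 2: extract the words from the captured text only.
    return re.findall(r"\w+", m.group(1)) if m else []


print(words_between(text))
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
```

The helper returns an empty list when either delimiter is missing.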
CodePudding user response:
Assuming 'STR1' and 'STR2' are known to be present, you can write the following:
import re
text = "WORD0 STR1 WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"
rgx = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'
re.findall(rgx, text, re.S)
#=> ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
re.S (same as re.DOTALL) causes periods to match all characters, including line terminators.
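To see why the flag matters here, this small sketch runs the same pattern with and without re.S on a string that contains a newline between the two delimiters. The sample string and variable names are made up for illustration.

```python
import re

sample = "STR1 WORD1\nWORD2 STR2"

# Without the flag, '.' stops at the newline, so the pattern cannot span both lines.
no_flag = re.findall(r"STR1.*STR2", sample)
print(no_flag)
# => []

# With re.S, '.' also matches the newline.
with_flag = re.findall(r"STR1.*STR2", sample, re.S)
print(with_flag)
# => ['STR1 WORD1\nWORD2 STR2']
```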
The regular expression can be broken down as follows.
\b # match a word boundary
(?! # begin a negative lookahead
.* # match zero or more characters
\bSTR1\b # match 'STR1' with word boundaries
) # end negative lookahead
\w+ # match one or more word characters
(?= # begin a positive lookahead
.* # match zero or more characters
\bSTR2\b # match 'STR2' with word boundaries
) # end positive lookahead
Note that the negative lookahead ensures that the matched word (\w+) is not followed by 'STR1'; since 'STR1' is assumed to be present, the word must therefore be preceded by that string.
Depending on requirements, \w+ might be replaced with [A-Z]+\d or something else.
Also note that the word boundary (\b) at the beginning of the expression is to avoid matching 'TR1'.
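The effect of the leading \b can be checked with a short sketch that runs the pattern with and without it on the sample text. The variable names are only for illustration.

```python
import re

text = "WORD0 STR1 WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"

with_boundary = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'
without_boundary = r'(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'

# With \b, a match can only start at a word boundary.
print(re.findall(with_boundary, text, re.S))
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']

# Without \b, a match may start inside 'STR1', where no 'STR1' lies ahead.
print(re.findall(without_boundary, text, re.S))
# => ['TR1', 'WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
```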