Regular Expressions: Match words between two given strings (no blank spaces or similar)

I am trying to write a regex that captures the words between two given strings, without capturing the blank spaces. At the moment I have this one:

(?<=STR1)(?:\s*)(.*?)(?:\s*)(?=STR2)

I want to use it to get the following results:

WORD0 STR1    WORD1 WORD2 WORD3  
WORD4 WORD5 STR2 WORD6

I want a regex that matches WORD1, WORD2, WORD3, WORD4 and WORD5.

PS: I am working with Python. Thank you.

CodePudding user response：

You cannot do that with re because 1) it does not support variable-length lookbehind patterns and 2) it has no support for the \G operator, which can be used to match strings in between two strings.
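The first limitation is easy to confirm: the standard re module rejects a lookbehind that contains `.*` when the pattern is compiled. A minimal check, with illustrative variable names:

```python
import re

# A lookbehind containing `.*` has no fixed width, so `re` refuses to compile it.
try:
    re.compile(r"(?<=STR1.*)\w+")
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)  # the error text mentions a fixed-width requirement

print(compiled)
```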

So, what you can do is install the third-party regex package (pip install regex), and then use

import regex
text = "WORD0 STR1    WORD1 WORD2 WORD3  \nWORD4 WORD5 STR2 WORD6"
print( regex.findall(r"(?<=STR1.*)\w+(?=.*STR2)", text, regex.DOTALL) )
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']

Details:

(?<=STR1.*) - a positive lookbehind matching STR1 and any zero or more chars immediately to the left of the current location
\w+ - one or more word chars
(?=.*STR2) - a positive lookahead matching any zero or more chars and STR2 immediately to the right of the current location.
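If installing a third-party package is not an option, the same words can probably be collected with the standard re module in two steps instead of one pattern. This is only a sketch of an alternative, not part of the answer above; it assumes the span of interest runs from the first STR1 to the next STR2, and the names `between` and `words` are illustrative.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3  \nWORD4 WORD5 STR2 WORD6"

# Step 1: capture everything between the first STR1 and the next STR2.
# re.DOTALL lets `.` cross the line break inside the span.
between = re.search(r"STR1(.*?)STR2", text, re.DOTALL)

# Step 2: pull the individual words out of the captured span.
words = re.findall(r"\w+", between.group(1)) if between else []

print(words)
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
```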

CodePudding user response：

Assuming 'STR1' and 'STR2' are known to be present, you can write the following:

import re

# `text` rather than `str`, to avoid shadowing the built-in type
text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"

rgx = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'

re.findall(rgx, text, re.S)
  #=> ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']

re.S (same as re.DOTALL) causes periods to match all characters, including line terminators.
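To see why the flag matters for this input, here is a small comparison of the same pattern with and without re.S. The names `with_flag` and `without_flag` are illustrative. Without the flag, `.*` stops at the line break, so only words that share a line with 'STR2' can satisfy the positive lookahead.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"
rgx = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'

with_flag = re.findall(rgx, text, re.S)   # `.` also matches '\n'
without_flag = re.findall(rgx, text)      # `.` stops at '\n'

print(with_flag)     # => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
print(without_flag)  # => ['WORD4', 'WORD5']
```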


The regular expression can be broken down as follows.

\b          # match a word boundary
(?!         # begin a negative lookahead
  .*        # match zero or more characters
  \bSTR1\b  # match 'STR1' with word boundaries
)           # end negative lookahead
\w+         # match one or more word characters
(?=         # begin a positive lookahead
  .*        # match zero or more characters
  \bSTR2\b  # match 'STR2' with word boundaries
)           # end positive lookahead
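The breakdown can also be kept as runnable code by compiling the pattern with re.VERBOSE, which ignores whitespace and `#` comments inside the pattern. This is a sketch: combining re.VERBOSE with re.DOTALL is a choice made here so that the commented form behaves like the one-line pattern with re.S.

```python
import re

rgx = re.compile(r"""
    \b            # match a word boundary
    (?!           # begin a negative lookahead
      .*          # match zero or more characters
      \bSTR1\b    # match 'STR1' with word boundaries
    )             # end negative lookahead
    \w+           # match one or more word characters
    (?=           # begin a positive lookahead
      .*          # match zero or more characters
      \bSTR2\b    # match 'STR2' with word boundaries
    )             # end positive lookahead
""", re.VERBOSE | re.DOTALL)

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"
matches = rgx.findall(text)

print(matches)
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
```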

Note that the negative lookahead ensures that the matched word (\w+) is not followed by 'STR1'. Since 'STR1' is assumed to be present, the word must therefore be preceded by that string.

Depending on requirements, \w+ might be replaced with [A-Z]+\d+ or something else.

Also note that the word boundary (\b) at the beginning of the expression is to avoid matching 'TR1'.
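A quick way to check the effect of the leading \b is to run the pattern with and without it on the sample text. The names `with_boundary` and `without_boundary` are illustrative. Without the boundary, a match can start inside 'STR1', where no 'STR1' lies ahead any more.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"

with_boundary = re.findall(r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)', text, re.S)
without_boundary = re.findall(r'(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)', text, re.S)

print(with_boundary)
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
print(without_boundary)
# => ['TR1', 'WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']
```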