I'm using Python regex to parse an email that hits my incoming server, and the text I need can appear in several variations. The difficulty is that every variation must land in capture group 1. The regex I've come up with handles most of the cases, but the final one eludes me:
from [LOC ]*[loc ]*0*(\d* [A-Z]{2} [A-Z]* *[A-Z]*)
Here are the variations of the string that I'll have to pull my information out of. The part I want is the number without its leading zeroes, the two-character abbreviation, and the location name:
"blah blah blah from LOC 0704 <2Char ABV> <LOCATION NAME>
some URL after a return line"
"blah blah blah from loc 0704 <2Char ABV> <LOCATION NAME>
some URL after a return line"
"blah blah blah from 0704 <2Char ABV> <LOCATION NAME>
some URL after a return line"
"blah blah blah from LOC 704 <2Char ABV> <LOCATION NAME>
some URL after a return line"
"blah blah blah from LOC 0004 <2Char ABV> <LOCATION NAME WITH A SPACE IN IT>
some URL after a return line"
"blah blah blah from LOC 0004 <2Char ABV> <LOCATION NAME WITH A SPACE IN IT>
[0004]
some URL after a return line"
Matching starts at "from", followed by an optional "LOC" or "loc", and then a number of up to four digits that may have leading zeroes. The capture should begin at the first nonzero digit and run to the end of the location name.
The last condition is that I want to capture all of these things but exclude the bracketed number at the end of the string, along with any whitespace before it, whether or not that bracketed number is present. I feel like I'm so close here but missing something important.
CodePudding user response:
Using a case insensitive match, you might use:
\bfrom\s+(?:loc\s+)?0*(\d+\s+[A-Z]{2}\s+[A-Z]+(?:\s+[^][\s]+)+)
Explanation
\bfrom\s+    Match from at a word boundary, then 1+ whitespace chars
(?:loc\s+)?    Optionally match loc and 1+ whitespace chars
0*    Match optional zeroes
(    Capture group 1
\d+\s+    Match 1+ digits and 1+ whitespace chars
[A-Z]{2}\s+[A-Z]+    Match 2 chars A-Z, 1+ whitespace chars, and 1+ chars A-Z
(?:\s+[^][\s]+)+    Repeat 1+ times matching 1+ whitespace chars followed by 1+ non-whitespace chars other than [ and ]
)    Close group 1
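As a quick check, here is a minimal Python sketch of that pattern, assuming re.IGNORECASE for the case-insensitive match. The sample text (abbreviation NY, location NEW YORK) is made up for illustration and is not from the real emails.

```python
import re

# Pattern from the answer; IGNORECASE lets both "LOC" and "loc" match.
pattern = re.compile(
    r"\bfrom\s+(?:loc\s+)?0*(\d+\s+[A-Z]{2}\s+[A-Z]+(?:\s+[^][\s]+)+)",
    re.IGNORECASE,
)

# Hypothetical sample line with the bracketed number on the same line.
text = "blah blah blah from LOC 0004 NY NEW YORK [0004]"

match = pattern.search(text)
captured = match.group(1) if match else None
print(captured)  # the leading zeroes and the bracketed number are left out
```

Because [^][\s] cannot match a square bracket, the repetition stops before [0004], and the whitespace in front of it is given back as well.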
To match spaces without newlines, the \s can be replaced with [^\S\n]
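The difference is easy to see on a two-line string. This is only a small sketch with a made-up sample, using simplified patterns rather than the full one:

```python
import re

text = "NEW YORK\nhttp://example.com"  # hypothetical two-line sample

# \s+ also matches the newline, so the match spills onto the URL line.
with_s = re.match(r"[A-Z]+(?:\s+\S+)+", text).group()

# [^\S\n]+ matches whitespace except a newline, so the match stays on line one.
same_line = re.match(r"[A-Z]+(?:[^\S\n]+\S+)+", text).group()

print(repr(with_s))
print(repr(same_line))
```

The character class reads as "not a non-whitespace char and not a newline", which leaves spaces, tabs, and other whitespace apart from \n.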
To not match newlines, and to match only up to (but not including) http or data between square brackets:
from[^\S\n]+(?:loc\s+)?0*(\d+[^\S\n]+[A-Z]{2}[^\S\n]+[A-Z]+(?:(?!\s*(?:\[[^][]*]|https?://)).)*)
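A hedged sketch of how this last pattern could be run against the six variations from the question. The helper name extract_location, the placeholder URL, and the TX AUSTIN / NY NEW YORK values are assumptions made up for the example.

```python
import re

# Final pattern from the answer, split across lines for readability.
PATTERN = re.compile(
    r"from[^\S\n]+(?:loc\s+)?0*"
    r"(\d+[^\S\n]+[A-Z]{2}[^\S\n]+[A-Z]+"
    r"(?:(?!\s*(?:\[[^][]*]|https?://)).)*)",
    re.IGNORECASE,
)


def extract_location(body):
    """Return group 1 of the first match, or None when nothing matches."""
    m = PATTERN.search(body)
    return m.group(1) if m else None


url = "https://example.com/ticket"  # placeholder for the URL on the next line
samples = [
    "blah blah blah from LOC 0704 TX AUSTIN\n" + url,
    "blah blah blah from loc 0704 TX AUSTIN\n" + url,
    "blah blah blah from 0704 TX AUSTIN\n" + url,
    "blah blah blah from LOC 704 TX AUSTIN\n" + url,
    "blah blah blah from LOC 0004 NY NEW YORK\n" + url,
    "blah blah blah from LOC 0004 NY NEW YORK\n[0004]\n" + url,
]

results = [extract_location(s) for s in samples]
for r in results:
    print(r)
```

The negative lookahead is checked before every character, so the capture ends as soon as optional whitespace followed by a bracketed value or a URL comes next. Since . does not match a newline by default, the capture also never leaves the line it started on.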