This might seem to be a repetitive question here but I have tried all other SO posts and the suggestions are not working for me.
Basically, I want to exclude strings that contain a particular substring, whether at the beginning, in the middle, or at the end.
Here is an example,
Max_Num_HR, HR_Max_Num, Max_HR_Num
I want to exclude the strings that contain either _HR (at the end), HR_ (at the beginning) or _HR_ (in between).
What I have tried so far:
r"(^((?!HR_).*))(?<!_HR)$"
This successfully excludes strings that have HR_ (at the beginning) and _HR (at the end), but not _HR_ (in between).
I have looked at How to exclude a string in the middle of a RegEx string?
But their solution did not seem to work for me.
I understand that the first segment of my code, (^((?!HR_).*)), will exclude everything that contains HR_ since I have a ^ at the beginning followed by a negative lookahead. The second segment, (?<!_HR)$, will begin at the end of the string and perform a negative lookbehind to check that _HR is not at the end. Following this train of thought, I tried including (?!_HR_) in between the two segments, but to no avail.
So, how do I get it to exclude all three (HR_, _HR_ and _HR), considering Max_Num_HR, HR_Max_Num, Max_HR_Num as the test case?
CodePudding user response:
The pattern is missing the assertion for _HR_ somewhere in the string.
You can put the negative lookbehind that asserts not _HR at the end after the dollar sign, like $(?<!_HR), to prevent some backtracking over the .+
Note that if you only need a match, you don't need the capture groups.
^(?!HR_)(?!.*_HR_).+$(?<!_HR)
^ Start of string
(?!HR_) Assert not HR_ at the start
(?!.*_HR_) Assert not _HR_ anywhere in the string
.+$ Match 1+ chars (so an empty string does not match), and assert end of string
(?<!_HR) Assert not _HR to the left
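A quick way to check this pattern against the test case from the question is a small Python sketch. It assumes the quantifier in the pattern is `.+`, and the helper name `is_allowed` and the extra sample `Max_Num` are made up for illustration; they are not from the question.

```python
import re

# Pattern from the answer above, assuming the quantifier is ".+".
PATTERN = re.compile(r"^(?!HR_)(?!.*_HR_).+$(?<!_HR)")


def is_allowed(name: str) -> bool:
    """Return True if the name survives the exclusion (hypothetical helper)."""
    return PATTERN.search(name) is not None


# The three names from the question plus one name without any HR part.
samples = ["Max_Num_HR", "HR_Max_Num", "Max_HR_Num", "Max_Num"]
for name in samples:
    print(f"{name}: {'kept' if is_allowed(name) else 'excluded'}")
```

With these inputs, the three names from the question should be excluded and only `Max_Num` should be kept.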
CodePudding user response:
One way to avoid matching strings that contain 'HR_' at the beginning, '_HR_' in the middle or '_HR' at the end is to match a regular expression consisting of a beginning-of-string anchor followed by a negative lookahead, followed by .*:
^(?!HR_|.+_HR_.|.+_HR$).*
Note that lines containing '_HR_' at the beginning or end are matched, as per the specification.
The negative lookahead reads, "do not match 'HR_' at the beginning of the string, or '_HR_' when preceded by at least one character and followed by at least one character, or '_HR' at the end of the string."
The entire string is matched if and only if the negative lookahead succeeds.
The negative lookahead could of course be replaced by three negative lookaheads:
^(?!HR_)(?!.+_HR_.)(?!.+_HR$).*
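To see that the single-lookahead and three-lookahead forms behave the same, here is a minimal Python sketch. It assumes the quantifiers are `.+` as described in the explanation above; the names `SINGLE`, `SPLIT` and `matches`, and the samples beyond the three from the question, are hypothetical.

```python
import re

# Both forms of the pattern, assuming ".+" as the quantifier in each alternative.
SINGLE = re.compile(r"^(?!HR_|.+_HR_.|.+_HR$).*")
SPLIT = re.compile(r"^(?!HR_)(?!.+_HR_.)(?!.+_HR$).*")


def matches(pattern: re.Pattern, name: str) -> bool:
    """Return True if the pattern matches at the start of the name (hypothetical helper)."""
    return pattern.match(name) is not None


samples = [
    "Max_Num_HR",  # '_HR' at the end: excluded
    "HR_Max_Num",  # 'HR_' at the beginning: excluded
    "Max_HR_Num",  # '_HR_' in the middle: excluded
    "Max_Num",     # no HR part: matched
    "_HR_Max",     # '_HR_' at the very beginning: matched, per the note above
    "Max_HR_",     # '_HR_' at the very end: matched, per the note above
]
for name in samples:
    # The two patterns are expected to agree on every sample.
    print(name, matches(SINGLE, name), matches(SPLIT, name))
```

The last two samples illustrate the note above: '_HR_' only counts as "in the middle" when there is at least one character on each side of it.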