I am developing a Python script that extracts hostnames from application-specific entries so I can see what hostnames are being used within the application. The naming convention used in the application specifies the OS/platform (Windows in this case) followed by an underscore and then the hostname, then another _ and the asset ID (e.g. win_host123_84500). The regex I am using is as follows:
(?<=win_).*?(?=_)
I am using Python's native re module to do this, and the regular expression I created works in simple cases. For example, the above regex matches win_host123_84500 and extracts the hostname (host123). However, it also matches win_dev_host123_84500, and in that case the value I get is "dev" instead of the hostname. This is an example code snippet:
import re
hostname = re.search(r'(?<=win_).*?(?=_)', 'win_host123_12345')
print(hostname.group())  # prints host123
To account for those cases where the OS is followed by the "dev_" string, I tried combining multiple regexes in the pattern given to the re.search method, but it still fails to extract the correct value. For example, the code below still returns "dev".
hostname = re.search(r'(?<=win_dev_).*?(?=_)|(?<=win_).*?(?=_)', 'win_dev_host123_12345')
print(hostname.group())  # prints dev
According to the documentation on the re module, the alternation operator (|) should not continue evaluating the remaining patterns once it finds a match. If the first pattern I am providing matches the string, why does it appear to be matching the second pattern? I do not see anything in the documentation that mentions how multiple positive lookbehinds are handled within the search method. Is there anything I am forgetting to specify within the regexes?
CodePudding user response:
The reason for your results is that the match is always the one at the leftmost position in the string (the pattern is tried at each position, from left to right). That's why in
(?<=win_dev_).*?(?=_)|(?<=win_).*?(?=_)
the second branch always wins: (?<=win_) succeeds right after "win_", which is an earlier position than the one where (?<=win_dev_) can succeed (right after "win_dev_").
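To make the leftmost-match behaviour visible, here is a small sketch that runs each branch on its own and then the combined pattern, printing where each match starts. The variable names (s, first, second, both) are only illustrative.

```python
import re

s = 'win_dev_host123_12345'

# Each branch on its own, then the alternation of both.
first = re.search(r'(?<=win_dev_).*?(?=_)', s)
second = re.search(r'(?<=win_).*?(?=_)', s)
both = re.search(r'(?<=win_dev_).*?(?=_)|(?<=win_).*?(?=_)', s)

print(first.start(), first.group())    # 8 host123
print(second.start(), second.group())  # 4 dev
print(both.start(), both.group())      # 4 dev
```

At index 4 the first branch fails but the second succeeds, so the search stops there and never reaches index 8, where the first branch would have matched.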
You can write (?:(?<=win_dev_)|(?<=win_)(?!dev_))[^_]+(?=_)
but it's clearly a bad idea! Instead, remove the lookarounds and use a capture group:
import re
# optional "dev_" after "win_", then capture everything up to the next underscore
pat = r'win_(?:dev_)?([^_]+)_'
result = re.search(pat, yourstring)
if result:
    hostname = result[1]
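As a quick check of the capture-group approach, the sketch below wraps it in a helper and runs it over a few sample entries. The helper name extract_hostname, the constant HOSTNAME_RE and the "linux_" entry are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Optional "dev_" after "win_", then capture up to the next underscore.
HOSTNAME_RE = re.compile(r'win_(?:dev_)?([^_]+)_')

def extract_hostname(entry):
    """Return the captured hostname, or None if the entry does not match."""
    m = HOSTNAME_RE.search(entry)
    return m[1] if m else None

for entry in ('win_host123_84500', 'win_dev_host123_84500', 'linux_host123_84500'):
    print(entry, '->', extract_hostname(entry))
```

The first two entries should both yield host123, and the non-Windows entry yields None, which is why the result is checked before indexing into it.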
CodePudding user response:
There are three ways to do this. @CasimiretHippolyte explains the first two very nicely.
The best of them is the capture group in his code sample, and it is the recommended one.
All engines (fixed lookbehinds)
(?:win_(?:dev_)?)(.*?)_
https://regex101.com/r/uFKB6A/1
(?:(?<=win_(?!dev_))|(?<=win_dev_)).*?(?=_)
https://regex101.com/r/HDvZYa/1
ECMAScript (variable lookbehinds)
(?<=win_(?:dev_)?(?!dev_)).*?(?=_)
https://regex101.com/r/IPgI8C/1
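And as was pointed out, when looking behind, the shorter prefix (win_) is always
reached first as the engine scans left to right, so its branch matches before the longer one (win_dev_) is ever tried.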
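Since the question is about Python, it may help to see how the re module treats the two lookbehind variants listed above. The sketch below assumes the standard library re module (not the third-party regex package): it tries to compile the variable-width lookbehind, which re is expected to reject, and then uses the fixed-width version. The names variable, fixed and supported are illustrative.

```python
import re

# Variable-width lookbehind (the ECMAScript variant): "dev_" is optional inside it.
variable = r'(?<=win_(?:dev_)?(?!dev_)).*?(?=_)'
try:
    re.compile(variable)
    supported = True
except re.error as exc:
    supported = False
    print('re rejects it:', exc)

# Fixed-width variant: each lookbehind has a single, known length.
fixed = re.compile(r'(?:(?<=win_(?!dev_))|(?<=win_dev_)).*?(?=_)')
for entry in ('win_host123_84500', 'win_dev_host123_84500'):
    print(entry, '->', fixed.search(entry).group())
```

The fixed-width version works here, but the capture-group pattern remains the simpler choice.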