I'm trying to write a pattern that finds all substrings of this format in a text:
system:microsoft
flow:to_server
vho:file-was-closed
heur250:unknown.file
I also want to exclude substrings where the part before or after the ':' consists only of digits:
03:00
file:123
I don't want to match substrings where the part before the ':' is equal to mailto:
mailto:user
And I don't want to match substrings where the part before or after the ':' ends with certain extensions such as jpg or png:
cid:image003.png
I've written this pattern, but it doesn't work properly:
pattern = r'(?!^\d+$)(?!mailto)[\w\d\.-]+:[\w\d\.-(?!(jpg|png))]+'
Could you help me fix it and explain what I'm doing wrong?
CodePudding user response:
Can you try:
(?<!\S)(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))[\w.-]+:[\w.-]+(?!\S)
Admittedly, the last part of the pattern could be more specific so that things like ...:... are not treated as valid, but that's up to you I guess.
(?<!\S)
- Assert that the position is not preceded by a non-whitespace char;
(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))
- A negative lookahead with alternation: avoid 'mailto:', and avoid a trailing '.jpg' or '.png' or only digits on either side of the colon;
[\w.-]+:[\w.-]+
- Match at least one character from the given class on either side of the colon;
(?!\S)
- Assert that the position is not followed by a non-whitespace char.
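As a rough check, the pattern above can be run in Python. This is only a sketch: the sample string is an assumption (the question's examples joined by single spaces), and the names WS_PATTERN and sample are made up for illustration. Because the lookahead contains a capturing group, re.findall would return that group's text rather than the whole match, so the sketch uses re.finditer and group(0) instead.

```python
import re

# Pattern from the answer above, with its quantifiers written out.
WS_PATTERN = re.compile(
    r'(?<!\S)(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))'
    r'[\w.-]+:[\w.-]+(?!\S)'
)

# Assumed sample: the examples from the question, separated by spaces.
sample = (
    "system:microsoft flow:to_server vho:file-was-closed "
    "heur250:unknown.file file.png:word 03:00 file:123 "
    "mailto:user cid:image003.png"
)

# group(0) is the whole match; the capture group inside the lookahead is ignored.
matches = [m.group(0) for m in WS_PATTERN.finditer(sample)]
print(matches)
```

With this sample, only the four wanted key:value tokens should be printed.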
CodePudding user response:
If your matches are delimited by whitespace (or the start/end of the string), you can use
(?<!\S)(?!mailto:|\d+:)[\w.-]+(?<!\.jpg|\.png):(?!\d+(?!\S))[\w.-]+(?!\S)(?<!\.jpg|\.png)
Details:
(?<!\S)
- left-hand whitespace boundary
(?!mailto:|\d+:)
- immediately to the right, there can be no 'mailto:' and no run of one or more digits followed by a ':' char
[\w.-]+
- one or more word, '.' or '-' chars
(?<!\.jpg|\.png)
- no '.jpg' or '.png' is allowed immediately to the left
:
- a colon
(?!\d+(?!\S))
- the part after the colon must not consist only of digits up to the next whitespace or the end of the string
[\w.-]+
- one or more word, '.' or '-' chars
(?!\S)
- right-hand whitespace boundary
(?<!\.jpg|\.png)
- no '.jpg' or '.png' is allowed immediately to the left.
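A minimal Python sketch of the whitespace-bounded pattern, assuming the input tokens are separated by single spaces (the sample string and the names BOUNDED and tokens are illustrative, not from the question). The two lookbehind alternatives have the same width, which is what lets Python's re module accept them, and since the pattern has no capturing groups, re.findall returns whole matches.

```python
import re

# Whitespace-bounded pattern from this answer, with quantifiers written out.
BOUNDED = re.compile(
    r'(?<!\S)(?!mailto:|\d+:)[\w.-]+(?<!\.jpg|\.png):'
    r'(?!\d+(?!\S))[\w.-]+(?!\S)(?<!\.jpg|\.png)'
)

# Assumed sample: the examples from the question, separated by spaces.
tokens = (
    "system:microsoft flow:to_server vho:file-was-closed "
    "heur250:unknown.file file.png:word 03:00 file:123 "
    "mailto:user cid:image003.png"
)

# No capturing groups, so findall yields the full matched substrings.
found = BOUNDED.findall(tokens)
print(found)
```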
If your matches can appear in any context, you can use a solution like
import re
text = "system:microsoft flow:to_server vho:file-was-closed heur250:unknown.file, file.png:word, 03:00, file:123, mailto:user, cid:image003.png"
# Every alternative except the last consumes an unwanted substring; only the last one captures into Group 1.
pattern = r'\bmailto:[\w.-]+|\b\d+:[\w.-]+|[\w.-]+:\d+|[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])|[\w.-]*\.(?:jpg|png):[\w.-]+|([\w.-]+:[\w.-]+)'
# re.findall returns Group 1, which is '' whenever one of the filtering alternatives matched.
print([x for x in re.findall(pattern, text) if x != ''])
Output:
['system:microsoft', 'flow:to_server', 'vho:file-was-closed', 'heur250:unknown.file']
Note that this solution is based on the "best regex trick ever". Details:
\bmailto:[\w.-]+|
- the whole word 'mailto:' and then one or more word, '.' or '-' chars, or
\b\d+:[\w.-]+|
- a word boundary, one or more digits, ':', and then one or more word, '.' or '-' chars, or
[\w.-]+:\d+|
- one or more word, '.' or '-' chars, ':', one or more digits, or
[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])|
- one or more word, '.' or '-' chars, ':', zero or more word, '.' or '-' chars, then '.' and 'jpg' or 'png' not followed by a word, '.' or '-' char, or
[\w.-]*\.(?:jpg|png):[\w.-]+|
- zero or more word, '.' or '-' chars, '.', 'jpg' or 'png', ':', and then one or more word, '.' or '-' chars, or
([\w.-]+:[\w.-]+)
- Group 1 (we'll output this value only): one or more word, '.' or '-' chars, ':', and one or more word, '.' or '-' chars.
All the parts before the last Group 1 pattern are there to filter out unwelcome matches.
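The same idea can be sketched with the filtering alternatives kept in a list, which may make it easier to add or remove exclusions later. The names REJECT, KEEP, PATTERN and find_pairs are hypothetical helpers introduced here for illustration; the sub-patterns themselves are the ones from the answer. Using re.finditer and checking group(1) avoids having to filter out empty strings afterwards.

```python
import re

# Alternatives that consume unwanted substrings (none of them captures).
REJECT = [
    r'\bmailto:[\w.-]+',                        # mailto:...
    r'\b\d+:[\w.-]+',                           # digits only before the colon
    r'[\w.-]+:\d+',                             # digits after the colon
    r'[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])',  # extension after the colon
    r'[\w.-]*\.(?:jpg|png):[\w.-]+',            # extension before the colon
]
# The only capturing alternative: what we actually want to keep.
KEEP = r'([\w.-]+:[\w.-]+)'

PATTERN = re.compile('|'.join(REJECT + [KEEP]))

def find_pairs(text):
    """Return the Group 1 values; rejected alternatives leave Group 1 unset."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

text = ("system:microsoft flow:to_server vho:file-was-closed "
        "heur250:unknown.file, file.png:word, 03:00, file:123, "
        "mailto:user, cid:image003.png")
result = find_pairs(text)
print(result)
```

The order of the alternatives matters: the regex engine tries them left to right at each position, so the rejecting patterns have to come before the capturing one.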