Regex: find substrings with a pattern with some conditions

Time:03-19

I am trying to write a pattern that finds all substrings of the following format in a text:

system:microsoft,
flow:to_server,
vho:file-was-closed,
heur250:unknown.file

I also want to exclude substrings where the part before or after the : consists only of digits:

03:00
file:123

I don't want to match substrings where the part before the : is mailto:

mailto:user

And I don't want to match substrings where the part before or after the : ends with an extension such as jpg or png:

cid:image003.png

I've written the pattern but it doesn't work properly.

pattern = r'(?!^\d+$)(?!mailto)[\w\d\.-]+:[\w\d\.-(?!(jpg|png))]+'

Could you help me fix it and explain what I am doing wrong?

CodePudding user response:

Can you try:

(?<!\S)(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))[\w.-]+:[\w.-]+(?!\S)

See an online demo. Admittedly, the last part of the pattern could be more specific so that things like ...:... are not treated as valid, but that's up to you, I guess.


  • (?<!\S) - Assert position is not preceded by a non-whitespace;
  • (?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$)) - A negative lookahead with alternation: avoid 'mailto', and avoid a trailing '.jpg' or '.png' or digits only on either side of the colon;
  • [\w.-]+:[\w.-]+ - The pattern to match: at least one character from the given class on either side of the colon;
  • (?!\S) - Assert position is not followed by a non-whitespace char.
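To try the pattern outside the online demo, here is a minimal Python sketch. The sample string is an assumption: it is just the tokens from this thread separated by whitespace, since the pattern relies on whitespace boundaries (a trailing comma right after a token would prevent a match).

```python
import re

# Pattern from the answer above, with the "+" quantifiers written out.
PATTERN = re.compile(
    r'(?<!\S)(?!mailto|(?:\S*:)?(?:\d+|\S*\.(?:jp|pn)g)([\s:]|$))'
    r'[\w.-]+:[\w.-]+(?!\S)'
)

# Hypothetical sample: tokens from this thread, separated by whitespace.
sample = ("system:microsoft flow:to_server vho:file-was-closed "
          "heur250:unknown.file 03:00 file:123 mailto:user "
          "cid:image003.png file.png:word")

# The lookahead contains a capturing group, so re.findall would return that
# (empty) group instead of the whole match; finditer + group(0) avoids this.
matches = [m.group(0) for m in PATTERN.finditer(sample)]
print(matches)
```

On this sample only the four wanted substrings should be printed; the digit-only, mailto and .png tokens are rejected by the negative lookahead.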

CodePudding user response:

If your matches are inside whitespace boundaries, you can use

(?<!\S)(?!mailto:|\d+:)[\w.-]+(?<!\.jpg|\.png):(?!\d+(?!\S))[\w.-]+(?!\S)(?<!\.jpg|\.png)

See the regex demo.

Details:

  • (?<!\S) - left-hand whitespace boundary
  • (?!mailto:|\d+:) - immediately to the right, there can be no mailto: and no run of one or more digits followed by a : char
  • [\w.-]+ - one or more word, . or - chars
  • (?<!\.jpg|\.png) - no .jpg or .png is allowed immediately to the left
  • : - a colon
  • (?!\d+(?!\S)) - immediately to the right, there can be no run of digits that extends to the next whitespace or the end of the string
  • [\w.-]+ - one or more word, . or - chars
  • (?!\S) - right-hand whitespace boundary
  • (?<!\.jpg|\.png) - no .jpg or .png immediately to the left are allowed.
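As a quick check of the whitespace-boundary pattern, here is a small Python sketch. The sample string is made up from the tokens in this thread. Python's re accepts the (?<!\.jpg|\.png) lookbehind because both alternatives have the same fixed width.

```python
import re

# Whitespace-boundary pattern from the answer, split over two lines for readability.
PATTERN = re.compile(
    r'(?<!\S)(?!mailto:|\d+:)[\w.-]+(?<!\.jpg|\.png)'
    r':(?!\d+(?!\S))[\w.-]+(?!\S)(?<!\.jpg|\.png)'
)

# Hypothetical sample: tokens from this thread, separated by whitespace.
sample = ("system:microsoft flow:to_server vho:file-was-closed "
          "heur250:unknown.file 03:00 file:123 mailto:user "
          "cid:image003.png file.png:word")

# There are no capturing groups here, so findall returns the full matches.
matches = PATTERN.findall(sample)
print(matches)
```

Note that a value such as file:12ab would still match, because the part after the colon is rejected only when it consists of digits alone.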

If your matches are located in any context you can use a solution like

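import re
# Sample text containing both wanted and unwanted "key:value" substrings
text = "system:microsoft flow:to_server vho:file-was-closed heur250:unknown.file,  file.png:word, 03:00,  file:123, mailto:user, cid:image003.png"
# Every alternative except the last consumes an unwanted match; only the last one is captured in Group 1
pattern = r'\bmailto:[\w.-]+|\b\d+:[\w.-]+|[\w.-]+:\d+|[\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])|[\w.-]*\.(?:jpg|png):[\w.-]+|([\w.-]+:[\w.-]+)'
# findall returns Group 1, which is '' whenever an unwanted alternative matched, so drop the empty strings
print([x for x in re.findall(pattern, text) if x != ''])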

See this Python demo.

Output:

['system:microsoft', 'flow:to_server', 'vho:file-was-closed', 'heur250:unknown.file']

Note that this solution is based on the "best regex trick ever". Details:

  • \bmailto:[\w.-]+| - whole word mailto: and then one or more word, . or - chars, or
  • \b\d+:[\w.-]+| - word boundary, one or more digits, :, and then one or more word, . or - chars, or
  • [\w.-]+:\d+| - one or more word, . or - chars, :, one or more digits, or
  • [\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])| - one or more word, . or - chars, :, zero or more word, . or - chars, then . and jpg or png not followed by a word, . or - char, or
  • [\w.-]*\.(?:jpg|png):[\w.-]+| - zero or more word, . or - chars, ., jpg or png, :, and then one or more word, . or - chars, or
  • ([\w.-]+:[\w.-]+) - Group 1 (we'll output this value only): one or more word, . or - chars, :, and one or more word, . or - chars.

All the parts before the last Group 1 pattern are there to filter out unwelcome matches.
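The same "match what you don't want, capture what you do" idea can also be written with re.VERBOSE so each alternative carries its own comment. This is only a sketch: the names FILTER_RE and wanted_pairs are made up for illustration, and the alternatives are the ones listed above.

```python
import re

# Each "discard" alternative consumes an unwanted substring without capturing it;
# only the final alternative fills Group 1.
FILTER_RE = re.compile(r"""
    \bmailto:[\w.-]+                          # discard: mailto:...
  | \b\d+:[\w.-]+                             # discard: digits before the colon
  | [\w.-]+:\d+                               # discard: digits after the colon
  | [\w.-]+:[\w.-]*\.(?:jpg|png)(?![\w.-])    # discard: image extension after the colon
  | [\w.-]*\.(?:jpg|png):[\w.-]+              # discard: image extension before the colon
  | ([\w.-]+:[\w.-]+)                         # keep: Group 1
""", re.VERBOSE)

def wanted_pairs(text):
    """Return only the substrings captured by Group 1."""
    # group(1) is None when one of the discard alternatives matched.
    return [m.group(1) for m in FILTER_RE.finditer(text) if m.group(1)]

text = ("system:microsoft flow:to_server vho:file-was-closed heur250:unknown.file,"
        "  file.png:word, 03:00,  file:123, mailto:user, cid:image003.png")
result = wanted_pairs(text)
print(result)
```

Using finditer and checking group(1) replaces the `x != ''` filter: an unwanted match leaves Group 1 unset, so it is skipped.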
