RegEx: "don't return the rest" if "this condition"?

Time:10-12

I'm using RegEx to search a multiline string that contains a list of file paths.

The goal is: if the match is in a folder name, return only that folder's path (do not return any subfolders, even if they match too). If the match is in the file name, return the whole line (the full file path).

The pattern I'm currently using returns the whole string: .*([^\\]*(John|Smith|Junior)){2}.*

Desired returned string:

C:\temp\John Smith Junior\file.pdf -> C:\temp\John Smith Junior\
C:\temp\John Smith Junior\John Smith Junior\file.pdf -> C:\temp\John Smith Junior\
C:\temp\John Smith Junior file.pdf -> C:\temp\John Smith Junior file.pdf

I tried adding things like [\\n] or (\|\n) or (?!=. \) to the end of the pattern, but that doesn't work exactly as I want. Thanks for any help!

Demo: https://regex101.com/r/98d6Ed/1


CodePudding user response:

(John|Smith|Junior) is an alternation: it matches any one of the alternatives John, Smith or Junior.

If you want to match the whole string John Smith Junior you can use that in the pattern instead.


In Python re, you could use a conditional to test for a \ after the first occurrence of Junior.

If it is there, the match ends at that backslash; otherwise the pattern matches the rest of the string.

^.*?\bJunior\b(\\)?(?(1)|.*)
  • ^ Start of string
  • .*?\bJunior\b Match the first occurrence of Junior
  • (\\)? Optionally capture \ in group 1
  • (?(1)|.*) Conditional: test whether group 1 exists using (?(1). If it does, the match ends there; otherwise match the rest of the string using .*


import re

strings = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf"
]

for s in strings:
    m = re.match(r".*?\bJunior\b(\\)?(?(1)|.*)", s)
    if m:
        print(m.group())

Output

C:\temp\John Smith Junior\
C:\temp\John Smith Junior\
C:\temp\John Smith Junior file.pdf
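
Since the question mentions a multiline string, the same pattern can probably be applied to all lines in one pass. The following is a minimal sketch, assuming one path per line; the variable names and the fourth, non-matching path are made up for illustration.

```python
import re

# Sample input: one path per line, as described in the question.
# The last path is a hypothetical non-matching line added for illustration.
text = "\n".join([
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf",
    r"C:\temp\other\file.pdf",
])

# re.MULTILINE lets ^ match at the start of every line instead of only at
# the start of the whole string. The dot still does not cross newlines.
pattern = re.compile(r"^.*?\bJunior\b(\\)?(?(1)|.*)", re.MULTILINE)

matches = [m.group() for m in pattern.finditer(text)]
for match in matches:
    print(match)
```

Because ^ is anchored to line starts, the remainder of a line after a folder match is skipped rather than matched a second time, and lines without Junior produce no match at all.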

Using all 3 alternatives instead of only Junior:

^.*?\\[^\\]*(?:John|Smith|Junior)\b(\\)?(?(1)|.*)
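
As a quick check, this pattern can be run against the sample paths from the question. This is only a sketch: the fourth path, with a folder that contains just one of the names, is a hypothetical addition, and the variable names are illustrative.

```python
import re

# The three-alternative pattern from above.
pattern = re.compile(r"^.*?\\[^\\]*(?:John|Smith|Junior)\b(\\)?(?(1)|.*)")

paths = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf",
    r"C:\temp\Smith\file.pdf",  # hypothetical: only one name in the folder
]

results = []
for p in paths:
    m = pattern.match(p)
    if m:
        results.append(m.group())
        print(m.group())
```

The first two paths are cut at the first matching folder, the third is returned whole because the name is in the file name, and the fourth is cut after the Smith folder.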


CodePudding user response:

I would suggest not using a regexp at all and just using the excellent pathlib module.

from pathlib import PureWindowsPath

lines = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf"
]

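def first_match(path, parts):
    # path.parents runs from the nearest folder up to the drive root;
    # reversing it checks the shortest (outermost) folder path first.
    for parent in reversed(path.parents):
        if any(part in str(parent) for part in parts):
            return parent
    return None

parts = ('John', 'Smith', 'Junior')

for line in lines:
    path = PureWindowsPath(line)
    directory_match = first_match(path, parts)
    if directory_match:
        # A folder name matched: print that folder only.
        print(directory_match)
    elif any(part in path.name for part in parts):
        # No folder matched, but the file name did: print the full path.
        print(path)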

A third option would be to use pathlib to parse the path into directories and filenames, as above, and then use a regexp to match, e.g. simply (John|Smith|Junior).
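
One way this third option might look is sketched below. The function name shortest_match and the constant NAME_RE are hypothetical, and the sketch assumes Windows-style paths as in the question.

```python
import re
from pathlib import PureWindowsPath

# Simple alternation, applied to one path component at a time.
NAME_RE = re.compile(r"John|Smith|Junior")

def shortest_match(line):
    """Return the first matching folder path, else the full path if the
    file name matches, else None. (Hypothetical helper for illustration.)"""
    path = PureWindowsPath(line)
    # path.parts looks like ('C:\\', 'temp', 'John Smith Junior', 'file.pdf');
    # every component except the last one is treated as a folder.
    for i, part in enumerate(path.parts[:-1]):
        if NAME_RE.search(part):
            return str(PureWindowsPath(*path.parts[:i + 1]))
    if NAME_RE.search(path.name):
        return str(path)
    return None

for line in [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf",
]:
    print(shortest_match(line))
```

Note that pathlib drops the trailing backslash, so folder results come back as C:\temp\John Smith Junior rather than C:\temp\John Smith Junior\.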
