I'm using a regex to search a multiline string that contains a list of file paths.
The goal: if the match is in a folder name, return only that folder's path (and do not return any subfolder that also matches). If the match is in the file name, return the whole line (the full file path).
The pattern I'm currently using returns the whole string: .*([^\\]*(John|Smith|Junior)){2}.*
Desired returned string:
C:\temp\John Smith Junior\file.pdf -> C:\temp\John Smith Junior\
C:\temp\John Smith Junior\John Smith Junior\file.pdf -> C:\temp\John Smith Junior\
C:\temp\John Smith Junior file.pdf -> C:\temp\John Smith Junior file.pdf
I tried adding things like [\\n] or (\|\n) or (?!=. \) to the end of the pattern, but none of them works quite the way I want. Thanks for any help!
Demo: https://regex101.com/r/98d6Ed/1
CodePudding user response:
(John|Smith|Junior) is an alternation, which matches only one of the alternatives John, Smith or Junior.
If you want to match the whole string John Smith Junior, you can use that in the pattern instead.
In Python re, you could use a conditional to test for a \ after the first occurrence of Junior.
If it is there, then that is the match; otherwise match the rest of the string (.* matches any character, including \).
^.*?\bJunior\b(\\)?(?(1)|.*)
^ : Start of string
.*?\bJunior\b : Match the first occurrence of Junior
(\\)? : Optionally capture \ in group 1
(?(1)|.*) : Conditional, test for the existence of group 1 using (?(1). If it exists, then that is the match, else match the rest of the string using .*
import re

strings = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf"
]

for s in strings:
    # re.match anchors at the start of the string, so ^ is not needed here
    m = re.match(r".*?\bJunior\b(\\)?(?(1)|.*)", s)
    if m:
        print(m.group())
Output
C:\temp\John Smith Junior\
C:\temp\John Smith Junior\
C:\temp\John Smith Junior file.pdf
Using all 3 alternatives instead of only Junior:
^.*?\\[^\\]*(?:John|Smith|Junior)\b(\\)?(?(1)|.*)
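As a rough check, the three-alternative pattern can be run against the sample paths from the question. This is only a sketch: the helper name shortest_match and the extra "docs" path are made up for illustration and are not part of the original answer.

```python
import re

# The three-alternative pattern from above, compiled once.
PATTERN = re.compile(r"^.*?\\[^\\]*(?:John|Smith|Junior)\b(\\)?(?(1)|.*)")

def shortest_match(line):
    """Return the matched text for one path, or None if no name occurs."""
    m = PATTERN.match(line)
    return m.group() if m else None

paths = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf",
    r"C:\temp\other\file.pdf",                  # no name at all
    r"C:\temp\John Smith Junior docs\file.pdf", # hypothetical edge case
]

for p in paths:
    print(shortest_match(p))
```

One caveat with this approach: the conditional only looks at the character immediately after the last matched name. For a folder whose name continues after that point (the hypothetical "John Smith Junior docs" above), the pattern falls through to .* and returns the whole line rather than the folder path.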
CodePudding user response:
I would suggest not using a regexp and instead using the excellent pathlib module.
from pathlib import PureWindowsPath

lines = [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf"
]

def first_match(path, parts):
    # Walk from the root down, so the shallowest matching folder wins
    for parent in reversed(path.parents):
        if any(part in str(parent) for part in parts):
            return parent
    return None

for line in lines:
    path = PureWindowsPath(line)
    parts = ('John', 'Smith', 'Junior')
    directory_match = first_match(path, parts)
    if directory_match:
        print(directory_match)
    else:
        # No folder matched: fall back to the file name
        if any(part in path.name for part in parts):
            print(path)
A third option would be to use pathlib to split the path into directories and a file name, as above, and then use a regexp to match each component, e.g. simply (John|Smith|Junior).
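A minimal sketch of that third option, assuming the regexp should be tested against each folder's own name and then against the file name. The names NAME_RE and match_path are made up for this example.

```python
import re
from pathlib import PureWindowsPath

# Simple alternation, applied to one path component at a time.
NAME_RE = re.compile(r"John|Smith|Junior")

def match_path(line):
    """Return the shallowest matching folder, else the full path if the
    file name matches, else None."""
    path = PureWindowsPath(line)
    # path.parents runs from the deepest folder up to the root; reverse it
    # so that the first (outermost) matching folder is returned.
    for parent in reversed(path.parents):
        if NAME_RE.search(parent.name):
            return str(parent)
    if NAME_RE.search(path.name):
        return str(path)
    return None

for line in [
    r"C:\temp\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior\John Smith Junior\file.pdf",
    r"C:\temp\John Smith Junior file.pdf",
]:
    print(match_path(line))
```

Like the pathlib code above, this returns the folder path without a trailing backslash. PureWindowsPath only parses the string, so the sketch does not touch the file system and should behave the same on any operating system.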