This is the structure of the files:
(number)/firstdirectory/unimportant/unimportant/lastdirectory.DAT
I need to write a regex that will place the number, the first directory, and the last directory in groups 1, 2, and 3 respectively.
Examples of other files (the ones I use to test):
(1)/Downloads/Maps/Map of Places.pdf
(25)/Publications/1995Publications.pdf
(31)/Table-of-Contents.pdf
This is what I have:
import re
reggie = r"^.* \(([0-9]*)\)(.*)\/([^\/]*)\.(.*)$"
with open('test2.txt') as f:
    lines = f.readlines()
for line in lines:
    match = re.search(reggie, line)
    if match:
        num = match.group(1)
        sub = match.group(2)
        file = match.group(3)
        print(num, sub, file)
What I hope to get is:
1 Downloads Map of Places
25 Publications 1995Publications
31 Table-of-Contents (assumes there's no first directory and just takes the last)
What I end up getting is:
1 /Downloads/Maps Map of Places
25 /Publications 1995Publications
31 Table-of-Contents
It's very close. The only problem is that when there are more than two directories, the middle ones are captured together with the first one, and there's an unnecessary forward slash before the first directory.
I've been thinking about this for a couple of hours, and I'm stumped. My best attempt was to force a forward slash after the number, to remove the unnecessary one in the output, and then add an optional slash after the first directory for cases where there are more than two directories.
Like this:
reggie = r"^.*\(([0-9]*)\)\/(.*)\/*([^\/]*)\.(.*)$"
However, with this, all the directories merge into one and there is no last directory.
Any help would be appreciated. It seems like there should be a simple solution, but I must be looking at it all wrong.
CodePudding user response:
First of all, regex is not the best tool for this; pathlib should be used instead.
Here is the regex solution if you do wish to use it anyway:
import re
# Group 1: the number; group 2: the first directory (optional); group 3: the file name without its extension
regex = re.compile(r"\((\d+)\)(?:/([^/]+))?.*/([^\.]+)\..*$")
paths = ["(1)/Downloads/Maps/Map of Places.pdf","(25)/Publications/1995Publications.pdf","(31)/Table-of-Contents.pdf"]
for path in paths:
    print(regex.match(path).groups())
Output:
('1', 'Downloads', 'Map of Places')
('25', 'Publications', '1995Publications')
('31', None, 'Table-of-Contents')
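To see why the pattern in the question swallowed the middle directories, it may help to run both patterns side by side. The sketch below is only an illustration: `greedy` is a simplified version of the question's first pattern (the leading `^.* ` is dropped, on the assumption that the sample paths start with the parenthesis), and `bounded` is the pattern from this answer. The variable names are made up for the example.

```python
import re

path = "(1)/Downloads/Maps/Map of Places.pdf"

# Simplified form of the question's pattern (assumption: no text before the number).
# The greedy (.*) runs all the way to the last slash, so it captures every directory.
greedy = re.compile(r"\((\d+)\)(.*)/([^/]*)\.(.*)$")

# Pattern from this answer: [^/]+ cannot cross a slash, so group 2 stops
# after the first directory, and the uncaptured .* skips the middle ones.
bounded = re.compile(r"\((\d+)\)(?:/([^/]+))?.*/([^.]+)\..*$")

greedy_groups = greedy.search(path).groups()
bounded_groups = bounded.search(path).groups()

print(greedy_groups)   # ('1', '/Downloads/Maps', 'Map of Places', 'pdf')
print(bounded_groups)  # ('1', 'Downloads', 'Map of Places')
```

The difference is that a character class that excludes `/` gives the group a hard boundary, whereas `.*` only gives back characters when the rest of the pattern forces it to.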
CodePudding user response:
Instead of using a regex, you should use pathlib. It is more reliable and handles the path conventions of different operating systems:
import pathlib
paths = ["(1)/Downloads/Maps/Map of Places.pdf","(25)/Publications/1995Publications.pdf","(31)/Table-of-Contents.pdf"]
for path in map(pathlib.PurePath, paths):  # Convert all paths to PurePaths
    path_parts = path.parts
    number = path_parts[0]
    filename = path.stem
    root_directory = path_parts[1] if len(path_parts) > 2 else None
    print((number, root_directory, filename))
Output:
('(1)', 'Downloads', 'Map of Places')
('(25)', 'Publications', '1995Publications')
('(31)', None, 'Table-of-Contents')
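Note that the number above still carries its parentheses, while the question asks for the bare number. One possible way to drop them is sketched below. The helper name `split_path` is hypothetical, and `PurePosixPath` is used on the assumption that the paths always use forward slashes, as in the samples.

```python
from pathlib import PurePosixPath


def split_path(raw):
    """Return (number, first_directory_or_None, file_stem) for one path string."""
    path = PurePosixPath(raw)
    parts = path.parts
    # Remove the surrounding parentheses from the leading "(number)" component.
    number = parts[0].strip("()")
    # A first directory exists only if something sits between the number and the file.
    first_dir = parts[1] if len(parts) > 2 else None
    return number, first_dir, path.stem


paths = [
    "(1)/Downloads/Maps/Map of Places.pdf",
    "(25)/Publications/1995Publications.pdf",
    "(31)/Table-of-Contents.pdf",
]
for raw in paths:
    print(split_path(raw))
```

For the three sample paths this should give the same tuples as the regex answer, with `None` where there is no first directory.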