Regex to split file paths into groups?

Time:05-19

This is the structure of the files:

(number)/firstdirectory/unimportant/unimportant/lastdirectory.DAT

I need to write a regex that places the number, the first directory, and the last path component (the file name without its extension) in groups 1, 2, and 3 respectively.

Examples of other files (the ones I use to test):

(1)/Downloads/Maps/Map of Places.pdf
(25)/Publications/1995Publications.pdf
(31)/Table-of-Contents.pdf

This is what I have:

import re

reggie = r"^.* \(([0-9]*)\)(.*)\/([^\/]*)\.(.*)$"


with open('test2.txt') as f:
    lines = f.readlines()

for line in lines:
    match = re.search(reggie, line)
    if match:
        num = match.group(1)
        sub = match.group(2)
        file = match.group(3)
        print(num, sub, file)

What I hope to get is:

    1 Downloads Map of Places
    25 Publications 1995Publications
    31 Table-of-Contents (assumes there's no first directory and just takes the last)

What I end up getting is:

    1 /Downloads/Maps Map of Places
    25 /Publications 1995Publications
    31  Table-of-Contents

It's very close. The only problem is that when there are more than two directories, the middle ones are included with the first one, and there's an unnecessary forward slash before the first directory.

I've been thinking about this for a couple of hours, and I'm stumped. My best attempt was to force a forward slash after the number to remove the unnecessary one from the output, then add an optional slash after the first directory for cases where there are more than two directories.

Like this:

    reggie = r"^.*\(([0-9]*)\)\/(.*)\/*([^\/]*)\.(.*)$"

However, with this, all the directories merge into one and there is no last directory.

Any help would be appreciated. It seems like there should be a simple solution, but I must be looking at it all wrong.

CodePudding user response:

First of all, regex is not the best tool for this. pathlib should be used instead.

Here is the regex solution if you do wish to use it anyway:

import re
regex = re.compile(r"\((\d+)\)(?:/([^/]+))?.*/([^\.]+)\..*$")
paths = ["(1)/Downloads/Maps/Map of Places.pdf","(25)/Publications/1995Publications.pdf","(31)/Table-of-Contents.pdf"]
for path in paths:
    print(regex.match(path).groups())

Output:

('1', 'Downloads', 'Map of Places')
('25', 'Publications', '1995Publications')
('31', None, 'Table-of-Contents')
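Note that `regex.match(path).groups()` raises an `AttributeError` for any line that does not match, because `match` returns `None`. A slightly more defensive variant might look like the sketch below. It uses the same pattern, split across lines with named groups. The function name `split_with_regex` and the group names are made up for illustration.

```python
import re

# Same pattern as above, written piece by piece with named groups.
# The group names are illustrative, not part of the original answer.
PATH_RE = re.compile(
    r"\((?P<number>\d+)\)"      # the number inside the parentheses
    r"(?:/(?P<first>[^/]+))?"   # optional first directory
    r".*/"                      # any middle directories, up to the last slash
    r"(?P<name>[^.]+)"          # file name up to its first dot
    r"\..*$"                    # the extension
)


def split_with_regex(path):
    """Return (number, first_directory, name), or None if the path does not match."""
    match = PATH_RE.match(path)
    if match is None:
        return None
    return match.group("number", "first", "name")


for path in ["(1)/Downloads/Maps/Map of Places.pdf", "(31)/Table-of-Contents.pdf"]:
    print(split_with_regex(path))
```

Because the name group stops at the first dot, a file such as `archive.tar.gz` would come back as `archive`.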

CodePudding user response:

Instead of using a regex, you should use pathlib. It is more reliable and handles the path conventions of different operating systems:

import pathlib
paths = ["(1)/Downloads/Maps/Map of Places.pdf","(25)/Publications/1995Publications.pdf","(31)/Table-of-Contents.pdf"]
for path in map(pathlib.PurePath, paths):  # Convert all paths to PurePaths
    path_parts = path.parts
    number = path_parts[0]
    filename = path.stem
    root_directory = path_parts[1] if len(path_parts) > 2 else None
    print((number, root_directory, filename))

Output:

('(1)', 'Downloads', 'Map of Places')
('(25)', 'Publications', '1995Publications')
('(31)', None, 'Table-of-Contents')
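The number above still carries its parentheses, whereas the question asks for the bare number. One way to get that, as a sketch: strip the parentheses from the first path component. The function name `split_with_pathlib` is made up for illustration, and `PurePosixPath` is chosen here on the assumption that the paths always use forward slashes.

```python
from pathlib import PurePosixPath


def split_with_pathlib(path):
    """Return (number, first_directory, name) for a '(N)/dir/.../file.ext' path."""
    pure = PurePosixPath(path)
    parts = pure.parts
    number = parts[0].strip("()")                 # "(25)" -> "25"
    first = parts[1] if len(parts) > 2 else None  # None when the file sits directly under the number
    return number, first, pure.stem


for path in ["(1)/Downloads/Maps/Map of Places.pdf", "(31)/Table-of-Contents.pdf"]:
    print(split_with_pathlib(path))
```

Unlike the regex, `stem` removes only the final suffix, so `archive.tar.gz` would come back as `archive.tar`.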