I have a pandas DataFrame in Python with the following pattern:
file_path |
---|
/home |
/home/folder1 |
/home/folder1/file1.xlsx |
/home/folder1/file2.xlsx |
/home/folder2 |
/home/folder2/date |
/home/folder2/date/dates.txt |
/home/folder3 |
I would like to get the parent path in a new column. If there is no parent, it should be called "ROOT":
file_path | parent_path |
---|---|
/home | ROOT |
/home/folder1 | /home |
/home/folder1/file1.xlsx | /home/folder1 |
/home/folder1/file2.xlsx | /home/folder1 |
/home/folder2 | /home |
/home/folder2/date | /home/folder2 |
/home/folder2/date/dates.txt | /home/folder2/date |
/home/folder3 | /home |
My attempt:
import re
import pandas as pd
df = pd.DataFrame(["/home", "/home/folder1", "/home/folder1/file1.xlsx",
"/home/folder1/file1.xlsx", "/home/folder1/file2.xlsx", "/home/folder2",
"/home/folder2/date", "/home/folder2/date/dates.txt", "/home/folder3"], columns=["file_path"])
# Get list
file_paths = df.file_path.unique()
def match_parent(x, file_paths):
    x = x.split('/')
    levels = len(x)
    # Check that parent contains all elements of x and the length is 1 less
I was thinking of writing a function that, for each row:
computes the number of path components and finds the rows that have exactly one component fewer than the current row, AND
checks that all of the preceding components match exactly.
How can I do that?
CodePudding user response:
Use pathlib.Path.parent to extract the parent, as follows:
import pandas as pd
import pathlib
df = pd.DataFrame(["/home", "/home/folder1", "/home/folder1/file1.xlsx",
"/home/folder1/file1.xlsx", "/home/folder1/file2.xlsx", "/home/folder2",
"/home/folder2/date", "/home/folder2/date/dates.txt", "/home/folder3"], columns=["file_path"])
df["parent"] = df["file_path"].apply(lambda x: pathlib.Path(x).parent)
print(df)
Output:
                      file_path              parent
0                         /home                   /
1                 /home/folder1               /home
2      /home/folder1/file1.xlsx       /home/folder1
3      /home/folder1/file1.xlsx       /home/folder1
4      /home/folder1/file2.xlsx       /home/folder1
5                 /home/folder2               /home
6            /home/folder2/date       /home/folder2
7  /home/folder2/date/dates.txt  /home/folder2/date
8                 /home/folder3               /home
To match the exact expected output, replace a parent equal to the root "/" with "ROOT":
df["parent"] = df["file_path"].apply(lambda x: res if (res := pathlib.Path(x).parent) != pathlib.Path("/") else "ROOT")
print(df)
Output:
                      file_path              parent
0                         /home                ROOT
1                 /home/folder1               /home
2      /home/folder1/file1.xlsx       /home/folder1
3      /home/folder1/file1.xlsx       /home/folder1
4      /home/folder1/file2.xlsx       /home/folder1
5                 /home/folder2               /home
6            /home/folder2/date       /home/folder2
7  /home/folder2/date/dates.txt  /home/folder2/date
8                 /home/folder3               /home
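Note that the parent column above holds pathlib.Path objects mixed with the string "ROOT", and that pathlib.Path follows the conventions of the operating system the code runs on. If you would rather have plain strings and the same result on every platform, a variant along the following lines might work. It is only a sketch: the helper name parent_or_root and its root_label parameter are made up for illustration, the column name parent_path is taken from the question, and it assumes every value is an absolute POSIX-style path starting with a single "/".

```python
from pathlib import PurePosixPath

import pandas as pd


def parent_or_root(path: str, root_label: str = "ROOT") -> str:
    """Return the parent of a POSIX-style path as a plain string.

    Hypothetical helper: paths whose parent is the root "/" get
    root_label instead. Assumes absolute paths with a single leading "/".
    """
    # PurePosixPath only manipulates the string; it never touches the filesystem
    parent = PurePosixPath(path).parent
    return root_label if str(parent) == "/" else str(parent)


# A shortened version of the sample data from the question
df = pd.DataFrame(
    {
        "file_path": [
            "/home",
            "/home/folder1",
            "/home/folder1/file1.xlsx",
            "/home/folder2/date/dates.txt",
        ]
    }
)

# map applies the helper to each value, so the new column contains only strings
df["parent_path"] = df["file_path"].map(parent_or_root)
print(df)
```

Because the new column contains only strings, it can be compared against or merged with file_path directly, without converting Path objects first.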