python pandas regex find pattern from another row

Time:07-20

I have a python pandas dataframe with the following pattern:

file_path
/home
/home/folder1
/home/folder1/file1.xlsx
/home/folder1/file2.xlsx
/home/folder2
/home/folder2/date
/home/folder2/date/dates.txt
/home/folder3

I would like to put each path's parent in a new column; if a path has no parent, the value should be "ROOT":

file_path                     parent_path
/home                         ROOT
/home/folder1                 /home
/home/folder1/file1.xlsx      /home/folder1
/home/folder1/file2.xlsx      /home/folder1
/home/folder2                 /home
/home/folder2/date            /home/folder2
/home/folder2/date/dates.txt  /home/folder2/date
/home/folder3                 /home

My attempt:

import re
import pandas as pd

df = pd.DataFrame(["/home", "/home/folder1", "/home/folder1/file1.xlsx", 
"/home/folder1/file1.xlsx", "/home/folder1/file2.xlsx", "/home/folder2", 
"/home/folder2/date", "/home/folder2/date/dates.txt", "/home/folder3"], columns=["file_path"])

# Collect the unique paths to search for parents
file_paths = df.file_path.unique()

def match_parent(x, file_paths):
    x = x.split('/')
    levels = len(x)
    # Check that parent contains all elements of x and the length is 1 less

I was thinking of writing a function that, for each row:

  1. Counts the path components and looks for paths that have exactly one component fewer than the current row, AND

  2. Checks that all of those leading components are exactly the same as in the current row.

How can I do that?

CodePudding user response:

Use pathlib.Path.parent to extract the parent, as follows:

import pandas as pd
import pathlib

df = pd.DataFrame(["/home", "/home/folder1", "/home/folder1/file1.xlsx",
                   "/home/folder1/file1.xlsx", "/home/folder1/file2.xlsx", "/home/folder2",
                   "/home/folder2/date", "/home/folder2/date/dates.txt", "/home/folder3"], columns=["file_path"])


df["parent"] = df["file_path"].apply(lambda x: pathlib.Path(x).parent)
print(df)

Output

                      file_path              parent
0                         /home                   /
1                 /home/folder1               /home
2      /home/folder1/file1.xlsx       /home/folder1
3      /home/folder1/file1.xlsx       /home/folder1
4      /home/folder1/file2.xlsx       /home/folder1
5                 /home/folder2               /home
6            /home/folder2/date       /home/folder2
7  /home/folder2/date/dates.txt  /home/folder2/date
8                 /home/folder3               /home
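Row 0 shows "/" because pathlib treats the filesystem root as the parent of any top-level path, and the root is its own parent. The short sketch below illustrates that behaviour; it uses PurePosixPath (an assumption, not part of the answer above) so the result is the same on Windows, where pathlib.Path would print backslashes.

```python
from pathlib import PurePosixPath

nested = PurePosixPath("/home/folder2/date/dates.txt")
top = PurePosixPath("/home")
root = PurePosixPath("/")

# A path is stored as a tuple of components; .parent drops the last one.
print(nested.parts)   # ('/', 'home', 'folder2', 'date', 'dates.txt')
print(nested.parent)  # /home/folder2/date

# A top-level path has the root as its parent, never an empty value.
print(top.parent)     # /

# The root is its own parent, so walking upwards stops here.
print(root.parent == root)  # True
```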

To match the expected output exactly, replace a parent of "/" with "ROOT" (the := operator requires Python 3.8 or later):

df["parent"] = df["file_path"].apply(lambda x: res if (res := pathlib.Path(x).parent) != pathlib.Path("/") else "ROOT")
print(df)

Output

                      file_path              parent
0                         /home                ROOT
1                 /home/folder1               /home
2      /home/folder1/file1.xlsx       /home/folder1
3      /home/folder1/file1.xlsx       /home/folder1
4      /home/folder1/file2.xlsx       /home/folder1
5                 /home/folder2               /home
6            /home/folder2/date       /home/folder2
7  /home/folder2/date/dates.txt  /home/folder2/date
8                 /home/folder3               /home
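
For comparison, the function outlined in the question (find a path with one component fewer whose components all match) could be sketched roughly as follows. This is only an illustration of that idea, not part of the answer above; the body of match_parent and the parent_path column name are assumptions based on the question's description.

```python
import pandas as pd

def match_parent(path, known_paths):
    """Return the known path one level above `path`, or "ROOT" if none exists."""
    parts = path.split("/")
    for candidate in known_paths:
        cand_parts = candidate.split("/")
        # Step 1: the candidate has exactly one component fewer.
        # Step 2: all of its components equal the leading components of `path`.
        if len(cand_parts) == len(parts) - 1 and cand_parts == parts[:-1]:
            return candidate
    return "ROOT"

df = pd.DataFrame(
    ["/home", "/home/folder1", "/home/folder1/file1.xlsx",
     "/home/folder1/file1.xlsx", "/home/folder1/file2.xlsx", "/home/folder2",
     "/home/folder2/date", "/home/folder2/date/dates.txt", "/home/folder3"],
    columns=["file_path"],
)

# Unique paths that are allowed to act as parents
file_paths = df["file_path"].unique()

df["parent_path"] = df["file_path"].apply(match_parent, args=(file_paths,))
print(df)
```

Unlike the pathlib version, this only reports a parent that actually appears in the file_path column; a path whose parent is missing from the data would also get "ROOT". It also scans every known path for each row, so for large frames the pathlib approach is likely the better choice.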