Extracting Data From Pandas DataFrame-CodePudding

I have two pandas dataframes, df1 and df2, each with a single column of file paths. I want to find the files that share the same name in both dataframes and put the matching paths side by side in two columns of a new dataframe. The file names should be taken from df1 and matched against df2 (df2 has more files than df1). The part of each path that begins with the letter s (for example s537061) is the common alphanumeric identifier, and the two dataframes should be matched on it.

df1["Text_File_Location"] =

0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt

df2["Image_File_Location"] =

0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg

CodePudding user response:

In Python 3.4 and later, you can use pathlib to work with file paths conveniently. Extract the file name without its extension (the "stem") from each path in df1, and the name of the parent folder from each path in df2. Then do an inner merge on those names, which keeps only the rows whose name appears in both dataframes.
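To see what the two pathlib attributes return, here is a minimal sketch on one path from each dataframe in the question. The variable names text_path, image_path, text_key and image_key are only illustrative.

```python
from pathlib import Path

# One sample path from each dataframe in the question
text_path = Path("/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt")
image_path = Path(
    "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/"
    "s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg"
)

# .stem is the final path component without its extension
text_key = text_path.stem

# .parent is the containing folder; .name is that folder's own name
image_key = image_path.parent.name

print(text_key, image_key)  # s537061 s537061
```

Both expressions yield the same identifier, which is what makes a merge on it possible.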

import pandas as pd
from pathlib import Path

df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)

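# Key for df1: file name without extension, e.g. "s537061"
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
# Key for df2: name of the folder that contains the image, e.g. "s537061"
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)

# Inner merge keeps only the names present in both dataframes
df3 = pd.merge(df1, df2, on="name", how="inner")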
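The question asks for the matched paths in two columns, while the merged frame also carries the helper column name. A minimal sketch of that last step follows, repeating the sample data so it runs on its own. The name result is an assumption for illustration, not part of the original answer.

```python
import pandas as pd
from pathlib import Path

df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)

# Build the shared key on both sides, as in the answer
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)

df3 = pd.merge(df1, df2, on="name", how="inner")

# Keep only the two path columns; the "foo" row has no match and is absent
result = df3[["Text_File_Location", "Image_File_Location"]]
print(result.shape)  # (2, 2)
```

If rows from df1 without a matching image should be kept as well, how="left" in pd.merge would retain them with NaN in Image_File_Location.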