I have the following scenario: thousands of CSV files spread across multiple folders, following this path pattern: data-output/region/month/CustomerName.csv.
For each dataframe built from a CSV file, I need to add the region as a new column and the CustomerName (the name of the CSV file itself) as another new column.
After that I need to combine all of the CSV files into a single dataframe. That last step I know how to do; what is troubling me is how to add columns from the information that lives in the folder path.
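For the combining step, what I currently do is roughly the following (a simplified sketch, assuming I already have a flat list of the CSV paths; the paths here are just examples):

import pandas as pd

# in reality these paths come from walking data-output/, hard-coded here only to illustrate
paths = [
    "data-output/EU/2022-10/Filename1.csv",
    "data-output/US/2022-10/Filename2.csv",
]
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)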
This is how the final dataframe should look:
date        user_id  time_spent  region  CustomerName
2022-10-01  11111    456         EU      Filename1
2022-10-02  22222    789         US      Filename2
2022-10-03  33333    123         UK      Filename3
Any help would be greatly appreciated.
CodePudding user response:
If I'm understanding correctly, something like this should work. Not tested, but it should be very close.
from glob import glob
import os

import pandas as pd


def concatenate_csvs(file_path: str) -> pd.DataFrame:
    # Paths look like data-output/region/month/CustomerName.csv
    files = sorted(glob(os.path.join(file_path, "*", "*", "*.csv")))
    new_files = []
    for file in files:
        parts = os.path.normpath(file).split(os.sep)
        region = parts[-3]                          # the region folder
        customer = os.path.splitext(parts[-1])[0]   # the file name without .csv
        df = pd.read_csv(file)
        df.insert(loc=len(df.columns), column="region", value=region)
        df.insert(loc=len(df.columns), column="customer", value=customer)
        new_files.append(df)
    df = pd.concat(new_files, ignore_index=True)
    # Drop stray header rows that were read in as data
    df = df[df.iloc[:, 0] != df.columns[0]]
    return (
        df.drop_duplicates(keep="first")
        .dropna(axis=0, how="all")
        .fillna(0)
        .reset_index(drop=True)
    )


path = "data-output"
print(concatenate_csvs(path))
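If it helps, pathlib can make the path handling a bit easier to read. A minimal sketch under the same assumptions (the data-output/region/month/CustomerName.csv layout; concatenate_csvs_pathlib is just a name I made up):

from pathlib import Path

import pandas as pd


def concatenate_csvs_pathlib(root: str) -> pd.DataFrame:
    frames = []
    for csv_path in sorted(Path(root).glob("*/*/*.csv")):
        df = pd.read_csv(csv_path)
        df["region"] = csv_path.parts[-3]   # the region folder
        df["customer"] = csv_path.stem      # the file name without .csv
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


print(concatenate_csvs_pathlib("data-output"))

parts[-3] works here because every file sits exactly three levels below the root (region/month/file), so that part is always the region folder, and .stem gives the file name without the .csv extension.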