I have the following scenario: thousands of CSV files spread across multiple folders, following this path pattern: data-output/region/month/CustomerName.csv.
For each dataframe built from a CSV file, I need to add the region as a new column and the CustomerName (the name of the CSV file itself) as another new column.
After that I need to combine all of the CSV files into a single dataframe. That last step I know how to do; what is troubling me is how to add columns from the information that lives in the folder path.
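For the combining step, what I currently do is roughly the following (a simplified sketch, assuming I already have a flat list of the CSV paths; the paths here are just examples):

import pandas as pd

# in reality these paths come from walking data-output/, hard-coded here only to illustrate
paths = [
    "data-output/EU/2022-10/Filename1.csv",
    "data-output/US/2022-10/Filename2.csv",
]
combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)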
This is how the final dataframe should look:
date        user_id  time_spent  region  CustomerName
2022-10-01  11111    456         EU      Filename1
2022-10-02  22222    789         US      Filename2
2022-10-03  33333    123         UK      Filename3
Any help would be greatly appreciated.
CodePudding user response:
If I'm understanding correctly, something like this should work. Not tested, but it should be very close.
from glob import glob
import os

import pandas as pd


def concatenate_csvs(file_path: str) -> pd.DataFrame:
    # Paths look like data-output/region/month/CustomerName.csv
    files = sorted(glob(os.path.join(file_path, "*", "*", "*.csv")))
    new_files = []
    for file in files:
        parts = os.path.normpath(file).split(os.sep)
        region = parts[-3]                          # the region folder
        customer = os.path.splitext(parts[-1])[0]   # the file name without .csv
        df = pd.read_csv(file)
        df.insert(loc=len(df.columns), column="region", value=region)
        df.insert(loc=len(df.columns), column="customer", value=customer)
        new_files.append(df)
    df = pd.concat(new_files, ignore_index=True)
    # Drop stray header rows that were read in as data
    df = df[df.iloc[:, 0] != df.columns[0]]
    return (
        df.drop_duplicates(keep="first")
        .dropna(axis=0, how="all")
        .fillna(0)
        .reset_index(drop=True)
    )


path = "data-output"
print(concatenate_csvs(path))
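If it helps, pathlib can make the path handling a bit easier to read. A minimal sketch under the same assumptions (the data-output/region/month/CustomerName.csv layout; concatenate_csvs_pathlib is just a name I made up):

from pathlib import Path

import pandas as pd


def concatenate_csvs_pathlib(root: str) -> pd.DataFrame:
    frames = []
    for csv_path in sorted(Path(root).glob("*/*/*.csv")):
        df = pd.read_csv(csv_path)
        df["region"] = csv_path.parts[-3]   # the region folder
        df["customer"] = csv_path.stem      # the file name without .csv
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


print(concatenate_csvs_pathlib("data-output"))

parts[-3] works here because every file sits exactly three levels below the root (region/month/file), so that part is always the region folder, and .stem gives the file name without the .csv extension.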