I am trying to concat multiple CSVs that live in subfolders of my parent directory into a data frame, while also adding a new filename column.
/ParentDirectory
│
│
├───SubFolder 1
│ test1.csv
│
├───SubFolder 2
│ test2.csv
│
├───SubFolder 3
│ test3.csv
│ test4.csv
│
├───SubFolder 4
│ test5.csv
I can do something like this to concat all the CSVs into a single data frame
import pandas as pd
import glob
files = glob.glob('/ParentDirectory/**/*.csv', recursive=True)
df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)
But is there a way to also add the filename of each file as a column to the final data frame, or do I have to loop through each individual file first before concatenating the final data frame? Output should look like:
Col1 Col2 file_name
0 AAAA XYZ test1.csv
1 BBBB XYZ test1.csv
2 CCCC RST test1.csv
3 DDDD XYZ test2.csv
4 AAAA WXY test3.csv
5 CCCC RST test4.csv
6 DDDD XTZ test4.csv
7 AAAA TTT test4.csv
8 CCCC RRR test4.csv
9 AAAA QQQ test4.csv
CodePudding user response:
you can assign the file_names on the fly:
from pathlib import Path
df = pd.concat([pd.read_csv(fp).assign(file_name=Path(fp).name)
for fp in files], ignore_index=True)
where pathlib.Path helps to extract the basename of the file from the path.
CodePudding user response:
A possible solution (you may need to replace /
in the code below by the appropriate slash for your operating system):
df = pd.concat([pd.read_csv(fp).assign(file_name=str.rsplit(
fp, '/', 1)[-1]) for fp in files], ignore_index=True)