I'm reading a number of CSV files into Python using glob matching and would like to add the filename as a column in each of the dataframes. I'm currently matching on a pattern and then using a generator to read in the files, like so:
base_list_of_files = glob.glob(matching_pattern)
loaded_csv_data_frames = (pd.read_csv(csv, encoding='latin-1') for csv in base_list_of_files)
for idx, df in enumerate(loaded_csv_data_frames):
    df['file_origin'] = base_list_of_files[idx]
combined_data = pd.concat(loaded_csv_data_frames)
However, I get the error ValueError: No objects to concatenate when I come to do the concatenation. Why does adding the column iteratively break the list of dataframes?
Answer:
A generator can only be iterated once. When it is exhausted it raises a StopIteration exception, which a for loop handles automatically. If you try to consume it again it just raises StopIteration immediately, as demonstrated here:
def consume(gen):
    while True:
        try:
            print(next(gen))
        except StopIteration:
            print("Stop iteration")
            break
>>> gen = (i for i in range(2))
>>> consume(gen)
0
1
Stop iteration
>>> consume(gen)
Stop iteration
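That's why you get the ValueError: your for loop has already exhausted loaded_csv_data_frames, so pd.concat receives an empty iterator when you try to use it a second time. The DataFrames you modified inside the loop are not stored anywhere either, so the added column is lost as well.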
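One way around this, sketched below, is to attach the column while each file is being read and to collect the results in a list, so that nothing has to be iterated twice. The function name load_with_origin and its parameters are hypothetical, chosen only for illustration; the latin-1 encoding is taken from the question.

```python
import glob

import pandas as pd


def load_with_origin(matching_pattern, encoding="latin-1"):
    """Read every CSV matching the pattern and tag each row with its source file.

    Hypothetical helper: the name and signature are assumptions, not part of
    the original question or answer.
    """
    base_list_of_files = glob.glob(matching_pattern)
    # A list (not a generator) keeps the DataFrames alive, and assign()
    # adds the column at read time, so no second pass is needed.
    frames = [
        pd.read_csv(path, encoding=encoding).assign(file_origin=path)
        for path in base_list_of_files
    ]
    # pd.concat still raises ValueError if no file matched the pattern.
    return pd.concat(frames, ignore_index=True)
```

Calling load_with_origin(matching_pattern) should then return a single DataFrame with a file_origin column. If you prefer to keep your original loop, wrapping the generator expression in list(...) before the loop should have a similar effect.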
I cannot replicate your example, but here is something that should be similar enough:
df1 = pd.DataFrame(0, columns=["a", "b"], index=[0, 1])
df2 = pd.DataFrame(1, columns=["a", "b"], index=[0, 1])
loaded_csv_data_frames = iter((df1, df2)) # Pretend that these are read from a csv file
base_list_of_files = iter(("df1.csv", "df2.csv")) # Pretend these file names come from glob
You can add the file of origin as a key when you concatenate. Pass names as well to label the index levels.
>>> df = pd.concat(loaded_csv_data_frames, keys=base_list_of_files, names=["file_origin", "index"])
>>> df
                   a  b
file_origin index
df1.csv     0      0  0
            1      0  0
df2.csv     0      1  1
            1      1  1
If you want file_origin to be one of your columns, just reset the first level of the index.
>>> df.reset_index("file_origin")
       file_origin  a  b
index
0          df1.csv  0  0
1          df1.csv  0  0
0          df2.csv  1  1
1          df2.csv  1  1
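The two steps can also be combined into one expression. The sketch below assumes you are happy to hold the frames in a dict keyed by file name (pd.concat accepts a mapping and uses its keys as the outer index level); the variable names frames_by_file and combined are made up for this example, and the in-memory DataFrames again stand in for files read from disk.

```python
import pandas as pd

# Stand-ins for DataFrames read from CSV files, keyed by their file names.
frames_by_file = {
    "df1.csv": pd.DataFrame(0, columns=["a", "b"], index=[0, 1]),
    "df2.csv": pd.DataFrame(1, columns=["a", "b"], index=[0, 1]),
}

# The dict keys become the outer index level, which is then moved into a column.
combined = (
    pd.concat(frames_by_file, names=["file_origin", "index"])
    .reset_index(level="file_origin")
)
```

With real files, the dict could be built with a comprehension over the glob results, for example {path: pd.read_csv(path) for path in base_list_of_files}.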