Adding Column to Pandas Dataframes from Python Generator

Time:10-22

I'm reading a number of CSV files into Python using glob pattern matching and would like to add the filename as a column in each of the dataframes. I'm currently matching on a pattern and then using a generator to read in the files, like so:

base_list_of_files = glob.glob(matching_pattern)

loaded_csv_data_frames = (pd.read_csv(csv, encoding= 'latin-1') for csv in base_list_of_files)    

for idx, df in enumerate(loaded_csv_data_frames):

    df['file_origin'] = base_list_of_files[idx]

combined_data = pd.concat(loaded_csv_data_frames)

However, I get the error ValueError: No objects to concatenate when I come to do the concatenation. Why does adding the column iteratively break the list of dataframes?

CodePudding user response:

A generator can only be iterated over once. When it runs out of items it raises a StopIteration exception, which a for loop handles automatically. Any later attempt to consume it just raises StopIteration again, as demonstrated here:

def consume(gen):
    while True:
        try:
            print(next(gen))
        except StopIteration:
            # Raised once the generator has no items left
            print("Stop iteration")
            break
>>> gen = (i for i in range(2))
>>> consume(gen)
0
1
Stop iteration
>>> consume(gen)
Stop iteration

That's why you get the ValueError when you try to use loaded_csv_data_frames a second time: the for loop has already exhausted the generator, so pd.concat receives nothing to concatenate.
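The same failure can be reproduced without any CSV files. The following is only a minimal sketch: the one-row frames and the placeholder.csv value are made up for illustration and stand in for the data read by pd.read_csv in the question.

```python
import pandas as pd

# Hypothetical stand-in for the generator of frames read from CSV files.
frames = (pd.DataFrame({"a": [i]}) for i in range(2))

# First pass: the for loop consumes every frame the generator will ever yield.
for df in frames:
    df["file_origin"] = "placeholder.csv"  # made-up value, for illustration only

# Second pass: nothing is left, so the list built from the generator is empty.
leftover = list(frames)
print(leftover)

error_message = None
try:
    pd.concat(leftover)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Note that the frames modified inside the loop are not kept anywhere either, so the new column is lost along with them.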

I cannot replicate your example, but here is something that should be similar enough:

import pandas as pd

df1 = pd.DataFrame(0, columns=["a", "b"], index=[0, 1])
df2 = pd.DataFrame(1, columns=["a", "b"], index=[0, 1])
loaded_csv_data_frames = iter((df1, df2))  # Pretend that these are read from a csv file
base_list_of_files = ["df1.csv", "df2.csv"]  # Pretend these file names come from glob, which returns a list

You can pass the file names as keys when you concatenate, which puts the file of origin into the outer level of the index. Pass names as well to give titles to your index levels.

>>> df = pd.concat(loaded_csv_data_frames, keys=base_list_of_files, names=["file_origin", "index"])
>>> df
                  a   b
file_origin index       
df1.csv     0     0   0
            1     0   0
df2.csv     0     1   1
            1     1   1

If you want file_origin to be one of your columns, just reset the first level of the index.

>>> df.reset_index("file_origin")
    file_origin a   b
index           
0   df1.csv     0   0
1   df1.csv     0   0
0   df2.csv     1   1
1   df2.csv     1   1
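Applied to the original glob workflow, another option is to attach the column as each file is read, so that the generator is consumed only once, by pd.concat. This is a minimal sketch rather than the keys-based method shown above: it uses DataFrame.assign instead, and the temporary directory, the file names one.csv and two.csv, and their contents are all made up so that the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
for name, value in (("one.csv", 0), ("two.csv", 1)):
    pd.DataFrame({"a": [value, value], "b": [value, value]}).to_csv(
        os.path.join(tmp_dir, name), index=False
    )

matching_pattern = os.path.join(tmp_dir, "*.csv")
# glob does not guarantee an order, so sort for a reproducible result.
base_list_of_files = sorted(glob.glob(matching_pattern))

# Add the column while each frame is created. The generator is then consumed
# exactly once, by pd.concat, and every frame already carries its file name.
loaded_csv_data_frames = (
    pd.read_csv(csv, encoding="latin-1").assign(file_origin=csv)
    for csv in base_list_of_files
)
combined_data = pd.concat(loaded_csv_data_frames, ignore_index=True)
print(combined_data)
```

If the frames need to be looped over more than once, building a list with a list comprehension instead of a generator expression avoids the exhaustion problem, at the cost of holding every frame in memory at the same time.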