Joining DataFrame in a for loop

Time:12-16

I am working on a script that reads a text file into a pandas DataFrame. The file can contain a varying set of columns and rows. After some operations on the data, the results from all files need to be combined into a single DataFrame for output to an Excel document.

My code works for a single file but now I need to iterate over all of the files.

This seems like it should be very easy to do, but I've tried every pandas function I can find for this and nothing has worked.

Here is the basic structure:

import glob
import pandas as pd
# ...
inputFiles = glob.glob('*.rep')

for filename in inputFiles:
    df = pd.read_csv(filename, sep=' ')
    # DF MODIFICATIONS...
    # Need to send a new df here to avoid overwriting on loop

Example of inputs/desired output:

#file1.rep:
columnA columnB columnC
val1 val2 val3
#file2.rep:
columnA columnB columnX
val4 val5 val6

#resulting dataframe:
columnA columnB columnC columnX
val1    val2    val3    NaN
val4    val5    NaN     val6

I tried append, add, combine, join, and concat, but none of them worked. Am I just using one of them improperly?

CodePudding user response:

Try appending all the dataframes to a list and then using pd.concat (with axis=0, which is the default) to combine them:

import glob
import pandas as pd
# ...
inputFiles = glob.glob('*.rep')

dfs = []
for filename in inputFiles:
    df = pd.read_csv(filename, sep=' ')
    # DF MODIFICATIONS...
    dfs.append(df)

full_df = pd.concat(dfs)
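
To check the behaviour without real files, here is a minimal sketch that feeds the question's two example inputs through the same loop. The in-memory `samples` dict and the use of `io.StringIO` are assumptions made for illustration, not part of the original script. Passing `ignore_index=True` is optional. It renumbers the rows instead of repeating each file's own index.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for file1.rep and file2.rep from the question.
samples = {
    "file1.rep": "columnA columnB columnC\nval1 val2 val3\n",
    "file2.rep": "columnA columnB columnX\nval4 val5 val6\n",
}

dfs = []
for name, text in samples.items():
    # read_csv accepts a file-like object as well as a path.
    df = pd.read_csv(io.StringIO(text), sep=" ")
    dfs.append(df)

# Rows are stacked; columns missing from a given file are filled with NaN.
# ignore_index=True gives the result a fresh 0..n-1 index.
full_df = pd.concat(dfs, ignore_index=True)
print(full_df)
```

Columns that appear in only one file (columnC and columnX here) end up as NaN in the rows that came from the other file, which matches the desired output in the question.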

CodePudding user response:

Consider wrapping your per-file processing in a function. Then run pandas.concat on the output of a list comprehension:

def process_df(filename):
    df = pd.read_csv(filename, sep=' ')

    # DF MODIFICATIONS...

    return df

final_df = pd.concat(
    [process_df(f) for f in inputFiles]
)
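
As a rough end-to-end sketch of this approach, the snippet below writes two small sample files to a temporary directory, globs them, and concatenates the processed results. The sample file contents, the `source` column, and the temporary directory are assumptions added for illustration. The real "DF MODIFICATIONS" step would go where the `source` column is set.

```python
import glob
import os
import tempfile

import pandas as pd


def process_df(filename):
    """Read one space-separated file and return the processed DataFrame."""
    df = pd.read_csv(filename, sep=" ")
    # Hypothetical modification: remember which file each row came from.
    df["source"] = os.path.basename(filename)
    return df


with tempfile.TemporaryDirectory() as tmp:
    # Sample inputs mirroring the question's file1.rep and file2.rep.
    contents = {
        "file1.rep": "columnA columnB columnC\nval1 val2 val3\n",
        "file2.rep": "columnA columnB columnX\nval4 val5 val6\n",
    }
    for name, text in contents.items():
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(text)

    # sorted() makes the row order predictable; glob alone does not guarantee it.
    input_files = sorted(glob.glob(os.path.join(tmp, "*.rep")))
    final_df = pd.concat(
        [process_df(f) for f in input_files],
        ignore_index=True,
    )

print(final_df)
```

From here, `final_df.to_excel(...)` could write the combined result to a workbook, provided an Excel writer engine such as openpyxl is installed.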