I am working on a script that reads a text file into a pandas DataFrame that can contain a variety of columns and rows. Then, some operations are made on the data, and it needs to sum it all up into a single DataFrame for output to an excel document.
My code works for a single file but now I need to iterate over all of the files.
This seems like it should be very easy to do but I've tried all of the pandas functions I can find to accomplish this but nothing works.
Here is the basic structure:
import glob
import pandas as pd
# ...
inputFiles = glob.glob('*.rep')
for filename in inputFiles:
df = pd.read_csv(filename, sep = ' ')
# DF MODIFICATIONS...
# Need to send a new df here to avoid overwriting on loop
Example of inputs/desired output:
#file1.rep:
columnA columnB columnC
val1 val2 val3
#file2.rep:
columnA columnB columnX
val4 val5 val6
#resulting dataframe:
columnA columnB columnC columnX
val1 val2 val3 NaN
val4 val5 NaN val6
I tried append, add, combine, join, concat, and none of them have worked. Am I just using one of these improperly?
CodePudding user response:
Try appending all the dataframes to a list and then using pd.concat
(with axis=0
, which is the default) to combine them:
import glob
import pandas as pd
# ...
inputFiles = glob.glob('*.rep')
dfs = []
for filename in inputFiles:
df = pd.read_csv(filename, sep = ' ')
# DF MODIFICATIONS...
dfs.append(df)
full_df = pd.concat(dfs)
CodePudding user response:
Consider generalizing your process in a defined method. Then run pandas.concat
on output of a list comprehension:
def process_df(filename):
df = pd.read_csv(filename, sep = ' ')
# DF MODIFICATIONS...
return df
final_df = pd.concat(
[process_df(f) for f in inputFiles]
)