I would like to loop through a bunch of .txt files, processing each one (removing columns, renaming, handling NaNs, etc.) to get an end dataframe output df1, which has date, lat, lon, and variable columns. Over the loop, I would like to build df_all, containing the information from all the files (most likely in date order).
However, each of my dataframes is a different length, and there is the possibility of them sharing the same date and lat/lon values.
I have written code to read in and process files individually, but I'm stuck on how to turn this into a larger loop (via concat/append...?).
I am trying to end up with one large dataframe (df_all) that contains all the 'scattered' information from the different files (the df1 outputs). In addition, if there is a conflicting date and lat/lon, I would like to take the mean. Is this possible to do in python/pandas?
Any help at all on any of the multiple issues would be greatly appreciated! Or ideas on how to go about this.
CodePudding user response:
Here are fake tables that are read in by a for-loop and concatenated into one big table. Once all rows are in that single big table, you can group together the rows that share the same value in the A column and take the mean of the B and C columns as an example. You should be able to run this chunk of code yourself, and I hope it gives you keywords to use when searching for other questions similar to yours!
import pandas as pd

# Making fake table read-ins. You'd be using pd.read_csv or similar
def fake_read_table(name):
    small_df1 = pd.DataFrame({'A': {0: 5, 1: 1, 2: 3, 3: 1}, 'B': {0: 4, 1: 4, 2: 4, 3: 4}, 'C': {0: 2, 1: 1, 2: 4, 3: 1}})
    small_df2 = pd.DataFrame({'A': {0: 4, 1: 5, 2: 1, 3: 4, 4: 3, 5: 2, 6: 5, 7: 1}, 'B': {0: 3, 1: 1, 2: 1, 3: 1, 4: 5, 5: 1, 6: 4, 7: 2}, 'C': {0: 4, 1: 1, 2: 5, 3: 2, 4: 4, 5: 4, 6: 5, 7: 2}})
    small_df3 = pd.DataFrame({'A': {0: 2, 1: 2, 2: 4, 3: 3, 4: 1, 5: 4, 6: 5}, 'B': {0: 1, 1: 2, 2: 3, 3: 1, 4: 3, 5: 5, 6: 4}, 'C': {0: 5, 1: 2, 2: 3, 3: 3, 4: 5, 5: 4, 6: 5}})
    if name == '1.txt':
        return small_df1
    if name == '2.txt':
        return small_df2
    if name == '3.txt':
        return small_df3

# Start here
txt_paths = ['1.txt', '2.txt', '3.txt']
big_df = pd.DataFrame()
for txt_path in txt_paths:
    small_df = fake_read_table(txt_path)
    # .. do some processing you need to do somewhere in here ..
    big_df = pd.concat((big_df, small_df))

# Taking the average B and C values for rows that have the same A value
agg_df = big_df.groupby('A').agg(
    mean_B=('B', 'mean'),
    mean_C=('C', 'mean'),
).reset_index()
print(agg_df)
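Applied to your actual case, the same pattern just groups on several key columns at once. This is a minimal sketch assuming your processed df1 frames have columns named date, lat, lon, and var (adjust the names to whatever your real files produce):

```python
import pandas as pd

# Stand-in for the concatenated df_all; 'date', 'lat', 'lon', 'var'
# are assumed column names, not from your files.
# The first two rows share the same date/lat/lon, so they conflict.
df_all = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'lat':  [10.0, 10.0, 12.5],
    'lon':  [20.0, 20.0, 22.5],
    'var':  [1.0, 3.0, 5.0],
})

# Group on all three key columns; conflicting rows are averaged,
# unique rows pass through unchanged. Sorting by date gives date order.
df_all = (
    df_all.groupby(['date', 'lat', 'lon'], as_index=False)['var']
          .mean()
          .sort_values('date')
)
print(df_all)
```

Here the two 2020-01-01 rows collapse into one row with var = 2.0. If you have several variable columns, `.mean()` on the whole group (without selecting `['var']`) averages all of them at once.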