I have a dataframe as follows,(but much bigger):
import pandas as pd
panda_list = [pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0], 'sent_a.3': [0, 6, 0, 8],'sent_a.4': [1, 1, 8, 6],'ID':['id_1_2','id_1_3','id_1_4','id_1_4']}),
pd.DataFrame({'sent_a.1': [0, 3], 'sent_a.2': [0, 2], 'sent_a.3': [0, 6],'sent_a.4': [1, 1],'ID':['id_2_1','id_2_2']}),
pd.DataFrame({'sent_a.1': [0, 3, 2, 1], 'sent_a.2': [0, 1, 4, 0], 'sent_a.3': [0, 6, 0, 8], 'sent_a.4': [1, 1, 8, 6], 'ID': ['id_1_2', 'id_1_3', 'id_1_4', 'id_1_4']}),
]
I would like to combine the pandas on similar columns into one dataframe, so the output would look like this,
desired output:
print(new_df)
sent_a.1 sent_a.2 sent_a.3 sent_a.4 ID
0 0,0,0 0,0,0 0,0,0 1,1,1 id_1_2,id_2_1,id_3_1
1 3,3,3 1,2,1 6,6,6 1,1,1 id_1_3,id_2_2,id_3_2
2 2,NaN,2 4,NaN,4 0,NaN,0 8,NaN,8 id_1_4,NaN,id_3_3
3 1,NaN,1 0,NaN,0 8,NaN,8 6,NaN,6 id_1_4,NaN,id_3_4
So I have tried the following for example for the first columns but since the sizes are not the same, it send an error,
new_df=pd.DataFrame()
new_df['sent_a.1']=panda_list[0]['sent_a.1'] ',' panda_list[1]['sent_a.1'] ',' panda_list[2]['sent_a.1']
Note:
I do not want to concatenate or merge the columns though, i want to combine them. please look at the desired output.
CodePudding user response:
IIUC, you can concat
and aggregate (assuming here as list).
dropping NaNs:
out = pd.concat(panda_list).groupby(level=0).agg(list)
output:
sent_a.1 sent_a.2 sent_a.3 sent_a.4 ID
0 [0, 0, 0] [0, 0, 0] [0, 0, 0] [1, 1, 1] [id_1_2, id_2_1, id_1_2]
1 [3, 3, 3] [1, 2, 1] [6, 6, 6] [1, 1, 1] [id_1_3, id_2_2, id_1_3]
2 [2, 2] [4, 4] [0, 0] [8, 8] [id_1_4, id_1_4]
3 [1, 1] [0, 0] [8, 8] [6, 6] [id_1_4, id_1_4]
To aggregate as string, ensure to convert to string first:
out = pd.concat(panda_list).astype(str).groupby(level=0).agg(','.join)
output:
sent_a.1 sent_a.2 sent_a.3 sent_a.4 ID
0 0,0,0 0,0,0 0,0,0 1,1,1 id_1_2,id_2_1,id_1_2
1 3,3,3 1,2,1 6,6,6 1,1,1 id_1_3,id_2_2,id_1_3
2 2,2 4,4 0,0 8,8 id_1_4,id_1_4
3 1,1 0,0 8,8 6,6 id_1_4,id_1_4
ensuring not to lose NaNs
concatenate on the other axis and perform a double transpose
out = (
pd.concat(panda_list, axis=1).astype(str)
.T.groupby(level=0, sort=False)
.agg(','.join).T
)
output:
sent_a.1 sent_a.2 sent_a.3 sent_a.4 ID
0 0,0.0,0 0,0.0,0 0,0.0,0 1,1.0,1 id_1_2,id_2_1,id_1_2
1 3,3.0,3 1,2.0,1 6,6.0,6 1,1.0,1 id_1_3,id_2_2,id_1_3
2 2,nan,2 4,nan,4 0,nan,0 8,nan,8 id_1_4,nan,id_1_4
3 1,nan,1 0,nan,0 8,nan,8 6,nan,6 id_1_4,nan,id_1_4