How to average across two dataframes-CodePudding

I have two dataframes:

{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}
{'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}}

For every 'id' that exists in both dataframes I would like to compute the average of their values in 'total' and have that in a new dataframe.

I tried:

 pd.merge(df1, df2, on="id")

with the hope that I could then do:

merged_df[['total']].mean(axis=1)

but it doesn't work at all.

How can you do this?

CodePudding user response：

You could use:

df1.merge(df2, on='id').set_index('id').mean(axis=1).reset_index(name='total')

Or, if you have many columns, a more generic approach:

(df1.merge(df2, on='id', suffixes=(None, '_other')).set_index('id')
    .rename(columns=lambda x: x.removesuffix('_other')) # requires python 3.9 
    .groupby(axis=1, level=0)
    .mean().reset_index()
)

Output:

        id  total
0  1548638   17.0
1  1953603   38.0
2  1956216   59.0
3  1962245   40.0
4  1981386   40.0
5  1981773   40.0
6  2004787   80.0
7  2017418   44.0
8  2020989   51.0
9  2045043   46.0

CodePudding user response：

You can do like the below:

df1 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})
df2 = pd.DataFrame({'id': {4: 1548638, 6: 1953603, 7: 1956216, 8: 1962245, 9: 1981386, 10: 1981773, 11: 2004787, 13: 2017418, 14: 2020989, 15: 2045043}, 'total': {4: 17, 6: 38, 7: 59, 8: 40, 9: 40, 10: 40, 11: 80, 13: 44, 14: 51, 15: 46}})

merged_df = df1.merge(df2, on='id')
merged_df['total_mean'] = merged_df.filter(regex='total').mean(axis=1)
print(merged_df)

Output:

        id  total_x  total_y  total_mean
0  1548638       17       17        17.0
1  1953603       38       38        38.0
2  1956216       59       59        59.0
3  1962245       40       40        40.0
4  1981386       40       40        40.0
5  1981773       40       40        40.0
6  2004787       80       80        80.0
7  2017418       44       44        44.0
8  2020989       51       51        51.0
9  2045043       46       46        46.0