Calculating the mean values of duplicate entries in a dataframe and placing them in the original dataframe

Suppose we have a dataframe like this:

    reviewerId      productId   overall
0   A1REUF3A1YCPHM  0001713353  5.0
1   AVP0HXC9FG790   0001713353  5.0
2   A324TTUBKTN73A  0001713353  2.0
3   A2RE7WG349NV5D  0001713353  4.0
...
16  A1IG9N5URR82EB  0001061240  5.0
17  A2CVLIZ9ELU88   0001061240  1.0
18  A2LGACKSC0MALY  0001061240  5.0
19  A6EQG0P75KHJ    0001061240  3.0

Now we group the rows by productId and compute the average rating for each product with this code:

df_final = df.groupby(['productId'], as_index=False)['overall'].mean()
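
To make the shape of that result concrete, here is a minimal sketch on a small made-up frame. The reviewer IDs and ratings below are hypothetical, not the data from the question. It suggests that the aggregation yields one row per productId rather than one row per review:

```python
import pandas as pd

# Hypothetical sample data; productId is stored as a string so the
# leading zeros are preserved.
df = pd.DataFrame({
    "reviewerId": ["r1", "r2", "r3", "r4"],
    "productId": ["0001713353", "0001713353", "0001061240", "0001061240"],
    "overall": [5.0, 4.0, 1.0, 3.0],
})

# Aggregating collapses each group to a single row holding its mean.
df_final = df.groupby(["productId"], as_index=False)["overall"].mean()
print(df_final)
```

Because df_final has fewer rows than df, it cannot be assigned to a new column of df directly.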

Now I want to add a column named 'average' that holds the group average on every row of the original 'df', not 'df_final', like this:

    reviewerId      productId   overall  average
0   A1REUF3A1YCPHM  0001713353  5.0      4.75
1   AVP0HXC9FG790   0001713353  5.0      4.75
2   A324TTUBKTN73A  0001713353  2.0      4.75
3   A2RE7WG349NV5D  0001713353  4.0      4.75
...
16  A1IG9N5URR82EB  0001061240  5.0      4.5
17  A2CVLIZ9ELU88   0001061240  1.0      4.5
18  A2LGACKSC0MALY  0001061240  5.0      4.5
19  A6EQG0P75KHJ    0001061240  3.0      4.5

Note that the dataframe has over 20 million rows, so I am looking for an efficient way to do this.

CodePudding user response:

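Use transform. It returns a result with the same index as the original dataframe, so each row receives the mean of its own group (as_index=False is not needed here, since transform does not collapse the groups):

df['average'] = df.groupby('productId')['overall'].transform('mean')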
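
As a rough check of this approach, the sketch below runs transform on a small made-up frame (the IDs and ratings are hypothetical) and compares it with an aggregate-then-merge route. The merge variant and the column name average_merge are included only for comparison and are assumptions, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data, not the data from the question.
df = pd.DataFrame({
    "reviewerId": ["r1", "r2", "r3", "r4"],
    "productId": ["0001713353", "0001713353", "0001061240", "0001061240"],
    "overall": [5.0, 4.0, 1.0, 3.0],
})

# transform('mean') returns a Series aligned to df's index, so it can be
# assigned as a new column without any join.
df["average"] = df.groupby("productId")["overall"].transform("mean")

# Comparison route: aggregate to one row per product, then merge back.
means = (
    df.groupby("productId", as_index=False)["overall"]
    .mean()
    .rename(columns={"overall": "average_merge"})
)
merged = df.merge(means, on="productId", how="left")

print(merged[["productId", "overall", "average", "average_merge"]])
```

Both columns should hold the same values. transform avoids building and joining an intermediate frame, which is likely to matter at 20 million rows, though it is worth timing both on the real data.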