Suppose we have a DataFrame like this:
reviewerId productId overall
0 A1REUF3A1YCPHM 0001713353 5.0
1 AVP0HXC9FG790 0001713353 5.0
2 A324TTUBKTN73A 0001713353 2.0
3 A2RE7WG349NV5D 0001713353 4.0
...
16 A1IG9N5URR82EB 0001061240 5.0
17 A2CVLIZ9ELU88 0001061240 1.0
18 A2LGACKSC0MALY 0001061240 5.0
19 A6EQG0P75KHJ 0001061240 3.0
Now we group the rows by product and compute the average rating with this code:
df_final = df.groupby(['productId'], as_index=False)['overall'].mean()
Now I want to add a column named 'average' to the original 'df' (not 'df_final'), so that every row carries the average for its product, like this:
reviewerId productId overall average
0 A1REUF3A1YCPHM 0001713353 5.0 4.75
1 AVP0HXC9FG790 0001713353 5.0 4.75
2 A324TTUBKTN73A 0001713353 2.0 4.75
3 A2RE7WG349NV5D 0001713353 4.0 4.75
...
16 A1IG9N5URR82EB 0001061240 5.0 4.5
17 A2CVLIZ9ELU88 0001061240 1.0 4.5
18 A2LGACKSC0MALY 0001061240 5.0 4.5
19 A6EQG0P75KHJ 0001061240 3.0 4.5
Note that we have over 20 million rows, so I am looking for the most efficient way to do this.
CodePudding user response:
Use transform:
df['average'] = df.groupby('productId')['overall'].transform('mean')
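To see what this does, here is a minimal sketch on a small made-up frame (the IDs and ratings are hypothetical, not the data from the question). transform('mean') returns one value per original row, aligned to the index of df, so it can be assigned straight back as a new column. The second approach, mapping the per-product means back with map, is shown only as an alternative that should give the same result; which one is faster on 20 million rows is something you would need to time on your own data.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
df = pd.DataFrame({
    "reviewerId": ["R1", "R2", "R3", "R4", "R5", "R6"],
    "productId": ["P1", "P1", "P1", "P1", "P2", "P2"],
    "overall": [5.0, 5.0, 2.0, 4.0, 5.0, 1.0],
})

# transform('mean') broadcasts each group's mean back to every row of that group,
# so the result has the same length and index as df.
df["average"] = df.groupby("productId")["overall"].transform("mean")

# Alternative: compute one mean per product, then look it up for each row.
means = df.groupby("productId")["overall"].mean()
df["average_via_map"] = df["productId"].map(means)

print(df)
```

Both columns hold the per-product mean repeated on every row of that product, which is the layout asked for in the question.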