Adding a summarised column back into a dataframe in python-CodePudding

as part of some data cleansing, i want to add the mean of a variable back into a dataframe to use if the variable is missing for a particular observation. so i've calculated my averages as follows

avg=all_data2.groupby("portfolio")"[sales"].mean().reset_index(name="sales_mean")

now I wanted to add that back into my original dataframe using a left join, but it doesnt appear to be working. what format is my avg now? I thought it would be a dataframe but is it something else?

CodePudding user response：

if you want add new column, you can use like this

df['sales_mean']=df[['sales_1','sales_2']].mean(axis=1)

CodePudding user response：

UPDATED

This is probably the most succinct way to do it:

all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))

This is another way to do it:

all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))

Output:

   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3    NaN
   portfolio  sales
0          1   10.0
1          1   20.0
2          2   30.0
3          2   40.0
4          3   50.0
5          3   60.0
6          3   55.0

To answer your the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?", avg is indeed a dataframe but using it not be the most direct way to update missing data in the original dataframe. The dataframe avg looks like this for the sample input data above:

   portfolio  sales_mean
0          1        15.0
1          2        35.0
2          3        55.0

A related SO question that you may find helpful is here.