as part of some data cleansing, i want to add the mean of a variable back into a dataframe to use if the variable is missing for a particular observation. so i've calculated my averages as follows
avg=all_data2.groupby("portfolio")"[sales"].mean().reset_index(name="sales_mean")
now I wanted to add that back into my original dataframe using a left join, but it doesnt appear to be working. what format is my avg now? I thought it would be a dataframe but is it something else?
CodePudding user response:
if you want add new column, you can use like this
df['sales_mean']=df[['sales_1','sales_2']].mean(axis=1)
CodePudding user response:
UPDATED
This is probably the most succinct way to do it:
all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))
This is another way to do it:
all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))
Output:
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 NaN
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 55.0
To answer your the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?", avg
is indeed a dataframe but using it not be the most direct way to update missing data in the original dataframe. The dataframe avg
looks like this for the sample input data above:
portfolio sales_mean
0 1 15.0
1 2 35.0
2 3 55.0
A related SO question that you may find helpful is here.