Home > Net >  Creating a mean column in a dataframe dependent on other variables of the dataframe in pandas
Creating a mean column in a dataframe dependent on other variables of the dataframe in pandas

Time:10-18

I have a code that is roughly like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Group':['a','a','b','b','b','c','c'], 'Label':[0,1,0,1,1,0,1], 'Num':[1,2,3,4,5,6,7]})

I would like to have a a new column that is the mean of Num, but only those with class label 1. However, this mean should be applied to all rows with label 1 only, with the rest being 0/NaN. The output should be like this: mean = [0,2,0,4.5,4.5,0,7]

Also how would it be if instead of 0/Nan, you just apply that mean to all values of that group? ie mean = [2,2,4.5,4.5,4.5,7,7]

Thanks a lot

CodePudding user response:

NaNs/0

If NaNs are OK, just slice before applying a groupby mean:

df['mean'] = df[df['Label'].eq(1)].groupby('Group')['Num'].transform('mean')

output:

  Group  Label  Num  mean
0     a      0    1   NaN
1     a      1    2   2.0
2     b      0    3   NaN
3     b      1    4   4.5
4     b      1    5   4.5
5     c      0    6   NaN
6     c      1    7   7.0

If you prefer 0, you can fillna(0)

output on all rows

To get the output on all rows.

mask the values in Num when Label is 1 to change them into NaNs, groupby the Group values and transform all rows with the mean of the group.

df['mean'] = (df['Num'].mask(df['Label'].ne(1))
                       .groupby(df['Group']).transform('mean'))

output:

  Group  Label  Num  mean
0     a      0    1   2.0
1     a      1    2   2.0
2     b      0    3   4.5
3     b      1    4   4.5
4     b      1    5   4.5
5     c      0    6   7.0
6     c      1    7   7.0
  • Related