Home > Software design >  Sum absolute values of one column using a groupby() of another column
Sum absolute values of one column using a groupby() of another column

Time:12-28

I have the following toy DataFrame named df:

df = pd.DataFrame({'foo' : ['red', 'red', 'red', 'blue', 'blue', 'blue', 'green', 'green', 'green'], 
                     'bar'  : [10, -5, 7, 14, 20, 3, 40, -100, 75]})

foo    bar   
red     10
red     -5
red     7
blue    14
blue    -20
blue    3 
green   40
green   -100
green   75

I would like to sum the absolute values of the bar column using a groupby() of the foo column. Here is my attempt:

df['baz'] = df.groupby('foo').bar.apply(lambda x: x.abs().sum())

I would expect to see the following:

foo    bar     baz
red     10     22
red     -5     22
red     7      22
blue    14     37
blue    -20    37
blue    3      37
green   40     215
green   -100   215
green   75     215

Unfortunately, I'm getting NaN in the newly created baz column.

Why is this approach resulting in NaN values? Separately (and notwithstanding the bug in my code), is there a more elegant (i.e. more Pythonic) way to do this?

CodePudding user response:

pandas aligns on index automatically, so your code does not work because the index of your groupby result is different than the index of your original dataframe. Consequently, it returns nan.

Having said that, you can set the index to be the 'foo' column, using set_index(), and retry your code:

df.set_index('foo',inplace=True)
df['baz'] = df.groupby('foo').bar.apply(lambda x: x.abs().sum())

print(df)

       bar  baz
foo            
red     10   22
red     -5   22
red      7   22
blue    14   37
blue    20   37
blue     3   37
green   40  215
green -100  215
green   75  215

A more pythonic way, as you say I think would be to use transform, instead of setting the index and distort your original dataframe. Therefore, my recommendation would be keep your dataframe as it is, and change your code to:

df['baz'] = df.groupby('foo').bar.transform(lambda x: x.abs().sum())

print(df)

     foo  bar  baz
0    red   10   22
1    red   -5   22
2    red    7   22
3   blue   14   37
4   blue   20   37
5   blue    3   37
6  green   40  215
7  green -100  215
8  green   75  215

CodePudding user response:

Try with

df['baz'] = df.bar.abs().groupby(df['foo']).transform('sum')
  • Related