I am having trouble making the transition from using R data.table to using Pandas for data munging.
Specifically, I am trying to assign the results of aggregations back into the original df as new columns. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.
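(By transform I mean the usual single-column broadcast shown below; it only sees one column at a time, so I can't see how to hand it hours and price together. This is a toy example, not my real data.)
import pandas as pd

# Illustrative only: transform broadcasts a group aggregate of one column back onto the rows.
toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})
toy['x_sum'] = toy.groupby('g')['x'].transform('sum')   # fine for one column, but only one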
To illustrate, I'm trying to replicate what I would do in R with:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr', `:=`(vwap_ish = sum(hours * price),
                       avg_id   = mean(id)),
   .(node, term)]
df
#     id term node hours price vwap_ish avg_id
#  1:  1  qtr    A   300   107    90600      2
#  2:  2  qtr    A   300   104    90600      2
#  3:  3  qtr    A   300    91    90600      2
#  4:  4  qtr    B   300    89    95400      5
#  5:  5  qtr    B   300   113    95400      5
#  6:  6  qtr    B   300   116    95400      5
#  7:  7  mth    A    50   110       NA     NA
#  8:  8  mth    A   100   119       NA     NA
#  9:  9  mth    A   150    99       NA     NA
# 10: 10  mth    B    50   111       NA     NA
# 11: 11  mth    B   100   106       NA     NA
# 12: 12  mth    B   150   108       NA     NA
Using Pandas, I can create an object from df that contains all rows of the original df along with the aggregations:
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df1
"""
                 id term node  hours  price  vwap_ish  avg_id
node term
A    mth  6       7  mth    A     50    110     32250     8.0
          7       8  mth    A    100    119     32250     8.0
          8       9  mth    A    150     99     32250     8.0
     qtr  0       1  qtr    A    300    107     90600     2.0
          1       2  qtr    A    300    104     90600     2.0
          2       3  qtr    A    300     91     90600     2.0
B    mth  9      10  mth    B     50    111     32350    11.0
          10     11  mth    B    100    106     32350    11.0
          11     12  mth    B    150    108     32350    11.0
     qtr  3       4  qtr    B    300     89     95400     5.0
          4       5  qtr    B    300    113     95400     5.0
          5       6  qtr    B    300    116     95400     5.0
"""
This doesn't really get me what I want because (a) it re-orders the rows and adds index levels, and (b) it has calculated the aggregation for all rows.
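(The re-ordering in (a) comes from the group keys being prepended as index levels; a quick check, at least on the pandas version I'm using:)
df1.index.names
# FrozenList(['node', 'term', None])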
I can get the subset easily enough with
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df2
"""
                id term node  hours  price  vwap_ish  avg_id
node term
A    qtr  0      1  qtr    A    300    107     90600     2.0
          1      2  qtr    A    300    104     90600     2.0
          2      3  qtr    A    300     91     90600     2.0
B    qtr  3      4  qtr    B    300     89     95400     5.0
          4      5  qtr    B    300    113     95400     5.0
          5      6  qtr    B    300    116     95400     5.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
I've also tried a few variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
In writing this I realised I could take my first approach and then just do
df.loc[df.term != 'qtr', ['vwap_ish','avg_id']] = np.nan
but this seems quite hacky. It means I have to use new columns rather than overwrite existing ones on the filtered rows, and if the aggregation function were to break on, say, the term='mth' rows, that would be a problem too.
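Spelled out, that hack would look something like this (just a sketch; droplevel strips the group-key index levels that apply prepends in my pandas version, and the names here are mine):
df_hack = (df.groupby(['node', 'term'])
             .apply(lambda gp: gp.assign(vwap_ish=(gp.hours * gp.price).sum(),
                                         avg_id=np.mean(gp.id)))
             .droplevel(['node', 'term'])   # drop the prepended (node, term) index levels
             .sort_index())                 # back to the original row order
df_hack.loc[df_hack.term != 'qtr', ['vwap_ish', 'avg_id']] = np.nan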
I'd really appreciate any help with this. It has been a very steep learning curve making the transition from data.table to Pandas, and so much that I would do in a one-liner is taking me hours to figure out.
CodePudding user response:
You can add the group_keys=False parameter to remove the MultiIndex, so the left join works well:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
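The bare merge above joins on every column df and df2 have in common (id, term, node, hours, price), which is safe here because each row is unique; written out explicitly it would be:
df = df.merge(df2, how='left', on=['id', 'term', 'node', 'hours', 'price'])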
Solution without left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (
    df[m].groupby(['node','term'], group_keys=False)
         .apply(lambda gp: gp.assign(
             vwap_ish = (gp.hours * gp.price).sum(),
             avg_id = np.mean(gp.id)
         ))
)
An improved solution: use named aggregation and create the vwap_ish column before the groupby, which can improve performance:
df2 = (df[df.term == 'qtr']
         .assign(vwap_ish = lambda x: x.hours * x.price)
         .groupby(['node','term'], as_index=False)
         .agg(vwap_ish=('vwap_ish','sum'),
              avg_id=('id','mean')))
df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
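Here df2 only shares the node and term columns with df, so those are the join keys; spelling them out is equivalent if you prefer to be explicit:
df = df.merge(df2, how='left', on=['node', 'term'])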
CodePudding user response:
With Pandas and groupby, one option is to break it into individual steps, avoiding apply (worthwhile if you are keen on performance):
Compute the product of hours and price before grouping:
temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)
Get the groupby object after filtering on term:
temp = (temp
        .loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
        .groupby([df.node, df.term])
       )
Assign the aggregated values back with transform; pandas will take care of the index alignment:
(df
 .assign(vwap_ish = temp.vwap_ish.transform('sum'),
         avg_id = temp.avg_id.transform('mean'))
)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
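Note that this last step builds and returns a new DataFrame rather than modifying df in place; assign it back if you want to keep the new columns:
df = df.assign(vwap_ish = temp.vwap_ish.transform('sum'),
               avg_id = temp.avg_id.transform('mean'))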