How can I assign the result of a filtered, grouped aggregation as a new column in the original Pandas DataFrame?

Time:05-11

I am having trouble making the transition from using R data.table to using Pandas for data munging.

Specifically, I am trying to assign the results of aggregations back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.
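To make that concern concrete, here is a minimal sketch (assuming a toy two-row frame) of why a plain transform falls short: the function is called once per column, so it only ever sees a single Series and cannot combine hours with price.

```python
import io
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104""")
df = pd.read_csv(data)

# transform applies the function to each column independently,
# so the lambda receives one Series at a time and cannot
# compute the cross-column product hours * price
out = df.groupby('node')[['hours', 'price']].transform(lambda s: s.sum())
# hours and price are summed separately, not multiplied together
```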

To illustrate, I'm trying to replicate what I would do in R with:

library(data.table)

df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))

df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
                        avg_id = mean(id) ),
    .(node, term)]

df

# id term node hours price vwap_ish avg_id
# 1:  1  qtr    A   300   107    90600      2
# 2:  2  qtr    A   300   104    90600      2
# 3:  3  qtr    A   300    91    90600      2
# 4:  4  qtr    B   300    89    95400      5
# 5:  5  qtr    B   300   113    95400      5
# 6:  6  qtr    B   300   116    95400      5
# 7:  7  mth    A    50   110       NA     NA
# 8:  8  mth    A   100   119       NA     NA
# 9:  9  mth    A   150    99       NA     NA
# 10: 10  mth    B    50   111       NA     NA
# 11: 11  mth    B   100   106       NA     NA
# 12: 12  mth    B   150   108       NA     NA

Using Pandas, I can create an object from df that contains all rows of the original df with the aggregations appended as new columns:

import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")

df = pd.read_csv(data)

df1 = df.groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df1

"""
              id term node  hours  price  vwap_ish  avg_id
node term
A    mth  6    7  mth    A     50    110     32250     8.0
          7    8  mth    A    100    119     32250     8.0
          8    9  mth    A    150     99     32250     8.0
     qtr  0    1  qtr    A    300    107     90600     2.0
          1    2  qtr    A    300    104     90600     2.0
          2    3  qtr    A    300     91     90600     2.0
B    mth  9   10  mth    B     50    111     32350    11.0
          10  11  mth    B    100    106     32350    11.0
          11  12  mth    B    150    108     32350    11.0
     qtr  3    4  qtr    B    300     89     95400     5.0
          4    5  qtr    B    300    113     95400     5.0
          5    6  qtr    B    300    116     95400     5.0
"""

This doesn't really get me what I want because a) it re-orders the rows and adds a MultiIndex, and b) it has calculated the aggregation for all rows, not just the filtered ones.

I can get the subset easily enough with


df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df2

"""
             id term node  hours  price  vwap_ish  avg_id
node term                                                
A    qtr  0   1  qtr    A    300    107     90600     1.0
          1   2  qtr    A    300    104     90600     1.0
          2   3  qtr    A    300     91     90600     1.0
B    qtr  3   4  qtr    B    300     89     95400     4.0
          4   5  qtr    B    300    113     95400     4.0
          5   6  qtr    B    300    116     95400     4.0
"""

but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.

I've tried:

df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

And also a few variations of .merge and .join. For example:

df.merge(df2, how='left')

ValueError: 'term' is both an index level and a column label, which is ambiguous.

and

df.merge(df2, how='left', on=df.columns)

KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')

In writing this I realised I could take my first approach and then just do

df.loc[df.term != 'qtr', ['vwap_ish', 'avg_id']] = np.nan

but this seems quite hacky. It means I have to create new columns rather than overwrite existing ones on just the filtered rows, and if the aggregation function were to break on, say, term == 'mth', that would be problematic too.
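For reference, a runnable sketch of that fallback (aggregate every group, then blank the rows outside the filter), on a cut-down version of the data and with group_keys=False to keep a flat index:

```python
import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
7,mth,A,50,110""")
df = pd.read_csv(data)

# aggregate within every (node, term) group, keeping the original flat index
df1 = df.groupby(['node', 'term'], group_keys=False).apply(
        lambda gp: gp.assign(vwap_ish=(gp.hours * gp.price).sum(),
                             avg_id=gp.id.mean())).sort_index()

# then blank out the rows that should not carry the aggregates
df1.loc[df1.term != 'qtr', ['vwap_ish', 'avg_id']] = np.nan
```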

I'd really appreciate any help with this as it's been a very steep learning curve to try to make the transition from data.table to Pandas and there's so much I would do in a one-liner that is taking me hours to figure out.

CodePudding user response:

You can add the group_keys=False parameter to remove the MultiIndex, so the left join works as expected:

df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
        lambda gp: gp.assign(
                vwap_ish = (gp.hours * gp.price).sum(),
                avg_id = np.mean(gp.id)
                )
        )

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
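One caveat worth noting: with no on= argument, merge joins on every column the two frames share, which here includes id, hours and price, so the join is effectively row-by-row. To join on the group keys alone, an explicit key list plus de-duplication works too (a sketch on a cut-down version of the data):

```python
import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
7,mth,A,50,110""")
df = pd.read_csv(data)

df2 = df[df.term == 'qtr'].groupby(['node', 'term'], group_keys=False).apply(
        lambda gp: gp.assign(vwap_ish=(gp.hours * gp.price).sum(),
                             avg_id=gp.id.mean()))

# join on the group keys only: keep just the keys and the aggregates,
# and drop duplicates so the left join does not multiply rows
agg_cols = df2[['node', 'term', 'vwap_ish', 'avg_id']].drop_duplicates()
df = df.merge(agg_cols, how='left', on=['node', 'term'])
```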

Solution without left join:

m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
                                        .apply(lambda gp: gp.assign(
                                                     vwap_ish = (gp.hours * gp.price).sum(),
                                                     avg_id = np.mean(gp.id)
                                                      )
                                               ))

An improved solution with named aggregation; creating the vwap_ish column before the groupby can also improve performance:

df2 = (df[df.term == 'qtr']
         .assign(vwap_ish = lambda x: x.hours * x.price)
         .groupby(['node','term'], as_index=False)
         .agg(vwap_ish=('vwap_ish','sum'),
              avg_id=('id','mean')))

df = df.merge(df2, how='left')
print (df)
    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN

CodePudding user response:

With Pandas and groupby, one option is to break it into individual steps, avoiding apply (worthwhile if you are keen on performance):

Compute the product of hours and price before grouping:

temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)

Get the groupby object after filtering term:

temp = (temp
        .loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
        .groupby([df.node, df.term])
        )

Assign back the aggregated values with transform; pandas will take care of the index alignment:

(df
.assign(vwap_ish = temp.vwap_ish.transform('sum'), 
        avg_id = temp.avg_id.transform('mean'))
)

    id term node  hours  price  vwap_ish  avg_id
0    1  qtr    A    300    107   90600.0     2.0
1    2  qtr    A    300    104   90600.0     2.0
2    3  qtr    A    300     91   90600.0     2.0
3    4  qtr    B    300     89   95400.0     5.0
4    5  qtr    B    300    113   95400.0     5.0
5    6  qtr    B    300    116   95400.0     5.0
6    7  mth    A     50    110       NaN     NaN
7    8  mth    A    100    119       NaN     NaN
8    9  mth    A    150     99       NaN     NaN
9   10  mth    B     50    111       NaN     NaN
10  11  mth    B    100    106       NaN     NaN
11  12  mth    B    150    108       NaN     NaN
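The steps above can also be condensed into a short chain (a sketch on a cut-down version of the data): filter first, then rely on index alignment in assign so the rows outside the filter come out as NaN.

```python
import io
import numpy as np
import pandas as pd

data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
7,mth,A,50,110""")
df = pd.read_csv(data)

# build the product column on the filtered rows only, then group;
# assign aligns the transformed Series back on the original index,
# leaving the unfiltered rows as NaN
grp = (df[df.term.eq('qtr')]
       .assign(vwap_ish=df.hours * df.price)
       .groupby(['node', 'term']))

out = df.assign(vwap_ish=grp.vwap_ish.transform('sum'),
                avg_id=grp.id.transform('mean'))
```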