DataFrame groupby and divide by group sum

In order to build stock portfolios for a backtest, I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and shares-outstanding observations. Each day I assign the stocks to quantiles based on price, and all stocks in the same quantile on that day form one portfolio:

import pandas as pd

d = {'date' : ['20220101', '20220101', '20220101', '20220101',
               '20220102', '20220102', '20220102', '20220102'],
     'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
     'shrs' : [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data=d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df['me'] = df['price'] * df['shrs']  # market capitalization
df['rank'] = df.groupby('date')['price'].transform(
    lambda x: pd.qcut(x, 2, labels=range(1, 3), duplicates='drop'))
df
            price  shrs     me rank
date                               
2022-01-01    1.0   100  100.0    1
2022-01-01    1.2   100  120.0    1
2022-01-01    1.3   100  130.0    2
2022-01-01    1.5   100  150.0    2
2022-01-02    1.7   100  170.0    2
2022-01-02    2.0   100  200.0    2
2022-01-02    1.5   100  150.0    1
2022-01-02    1.0   100  100.0    1

In the next step I group by 'date' and 'rank' and divide each observation's market cap by the group's total market cap in order to obtain the stock's weight in the portfolio:

df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
            price  shrs     me rank    weight
date                                         
2022-01-01    1.0   100  100.0    1  0.454545
2022-01-01    1.2   100  120.0    1  0.545455
2022-01-01    1.3   100  130.0    2  0.464286
2022-01-01    1.5   100  150.0    2  0.535714
2022-01-02    1.7   100  170.0    2  0.600000
2022-01-02    2.0   100  200.0    2  0.400000
2022-01-02    1.5   100  150.0    1  0.459459
2022-01-02    1.0   100  100.0    1  0.540541

Now comes the problem. On my test df this works perfectly fine. However, on the real data (a DataFrame of shape 160000 x 21) the calculation takes forever and I always end up interrupting the Jupyter kernel at some point. Is there a more efficient way to do this? What am I missing? Interestingly, colleagues run the same code on similar DataFrames and for them it takes only seconds.

CodePudding user response:

Use GroupBy.transform with 'sum' to get a Series of group sums aligned to the original index, and divide the me column by it. Because transform returns one value per row, no apply, sort, or merge is needed:

df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
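To get a feel for the difference at roughly the question's scale, here is a timing sketch on synthetic data (illustrative only: the column values, group counts, and sizes are made up, and exact numbers depend on hardware and pandas version):

import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 160_000
big = pd.DataFrame({'date': rng.integers(0, 250, n),   # ~250 trading days
                    'rank': rng.integers(1, 3, n),     # two portfolios per day
                    'me': rng.uniform(50, 500, n)})

t0 = time.perf_counter()
big['me'].div(big.groupby(['date', 'rank'])['me'].transform('sum'))
print('transform:', time.perf_counter() - t0)

t0 = time.perf_counter()
big.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum())
print('apply:    ', time.perf_counter() - t0)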

CodePudding user response:

It might not be the most elegant solution, but if you run into performance issues you can split the work into several steps: store the grouped sums of me in a Series, merge them back, and then divide:

temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')    # 'weight' holds the group sum for now
df = df.merge(temp, on=['date', 'rank'])  # merging moves 'date' out of the index
df['weight'] = df['me'] / df['weight']    # overwrite the sum with the actual weight
df.set_index('date', inplace=True)
df

which should lead to the output:

            price  shrs     me rank    weight
date                                         
2022-01-01    1.0   100  100.0    1  0.454545
2022-01-01    1.2   100  120.0    1  0.545455
2022-01-01    1.3   100  130.0    2  0.464286
2022-01-01    1.5   100  150.0    2  0.535714
2022-01-02    1.7   100  170.0    2  0.459459
2022-01-02    2.0   100  200.0    2  0.540541
2022-01-02    1.5   100  150.0    1  0.600000
2022-01-02    1.0   100  100.0    1  0.400000
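The intermediate apply can also be skipped by aggregating directly, which tends to be both clearer and faster. A sketch of the same idea, starting from the original df (the column name group_me is just for illustration):

sums = df.groupby(['date', 'rank'])['me'].sum().reset_index(name='group_me')
df = df.reset_index().merge(sums, on=['date', 'rank']).set_index('date')
df['weight'] = df['me'] / df['group_me']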