In order to build stock portfolios for a backtest I am trying to get the market capitalization (me) weight of each stock within its portfolio. For test purposes I built the following DataFrame of price and return observations. Every day I am assigning the stocks to quantiles based on price and all stocks in the same quantile that day will be in one portfolio:
d = {'date' : ['202211', '202211', '202211','202211', '202212', '202212', '202212', '202212'],
'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 1],
'shrs' : [100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data = d)
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index, format='%Y%m%d')
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(lambda x: pd.qcut(x, 2, labels=range(1,3), duplicates='drop'))
df
price shrs me rank
date
2022-01-01 1.0 100 100.0 1
2022-01-01 1.2 100 120.0 1
2022-01-01 1.3 100 130.0 2
2022-01-01 1.5 100 150.0 2
2022-01-02 1.7 100 170.0 2
2022-01-02 2.0 100 200.0 2
2022-01-02 1.5 100 150.0 1
2022-01-02 1.0 100 100.0 1
In the next step I am grouping by 'date' and 'rank' and divide each observation's market cap by the sum of the groups market cap in order to obtain the stocks weight in the portfolio:
df['weight'] = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'] / x['me'].sum()).sort_index()
print(df)
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.600000
2022-01-02 2.0 100 200.0 2 0.400000
2022-01-02 1.5 100 150.0 1 0.459459
2022-01-02 1.0 100 100.0 1 0.540541
Now comes the flaw. On my test df this works perfectly fine. However on the real data (DataFrame with shape 160000 x 21) the calculations take endless and I always have to interrupt the Jupyter Kernel at some point. Is there a more efficient way to do this? What am I missing? Interestingly I am using the same code as some colleagues on similar DataFrames and for them it takes seconds only.
CodePudding user response:
Use GroupBy.transform
with sum
for new Series
and use it for divide me
column:
df['weight'] = df['me'].div(df.groupby(['date', 'rank'])['me'].transform('sum'))
CodePudding user response:
It might not be the most elegant solution, but if you run into performance issue you can try to split it into multiple parts, but storing the groupped value of me
in a Series and then merge it back
temp = df.groupby(['date', 'rank'], group_keys=False).apply(lambda x: x['me'].sum())
temp = temp.reset_index(name='weight')
df = df.merge(temp, on=['date', 'rank'])
df['weight'] = df['me'] / df['weight']
df.set_index('date', inplace=True)
df
which should lead to the output:
price shrs me rank weight
date
2022-01-01 1.0 100 100.0 1 0.454545
2022-01-01 1.2 100 120.0 1 0.545455
2022-01-01 1.3 100 130.0 2 0.464286
2022-01-01 1.5 100 150.0 2 0.535714
2022-01-02 1.7 100 170.0 2 0.459459
2022-01-02 2.0 100 200.0 2 0.540541
2022-01-02 1.5 100 150.0 1 0.600000
2022-01-02 1.0 100 100.0 1 0.400000