I am trying to group my DataFrame and shift some columns by 1 day. Code for the test df:
import pandas as pd
import datetime as dt
d = {'date': ['202211', '202211', '202211', '202211', '202211', '202212', '202212', '202212', '202212', '202213', '202213', '202213', '202213', '202213'],
     'id': ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'e'],
     'price': [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 2, 1.1, 2, 1.5, 0.8, 1.3, 1.5],
     'shrs': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.set_index('date', inplace=True)
df["me"] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(
    lambda x: pd.qcut(x, 2, labels=range(1, 3), duplicates='drop'))
df['weight'] = df['me'] / (df.groupby(['date', 'rank'])['me'].transform('sum'))
df['ew'] = 1 / (df.groupby(['date', 'rank'])['price'].transform('count'))
df.sort_values(['id', 'date'], inplace=True)
print(df)
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 1 0.285714 0.333333
2022-01-02 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 a 2.0 100 200.0 2 1.000000 1.000000
2022-01-01 b 1.2 100 120.0 1 0.342857 0.333333
2022-01-02 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-03 b 1.5 100 150.0 1 0.294118 0.250000
2022-01-01 c 1.3 100 130.0 1 0.371429 0.333333
2022-01-02 c 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 c 0.8 100 80.0 1 0.156863 0.250000
2022-01-01 d 1.5 100 150.0 2 0.468750 0.500000
2022-01-02 d 1.1 100 110.0 1 0.423077 0.500000
2022-01-03 d 1.3 100 130.0 1 0.254902 0.250000
2022-01-01 e 1.7 100 170.0 2 0.531250 0.500000
2022-01-03 e 1.5 100 150.0 1 0.294118 0.250000
The following code almost gives me what I want. But since my data is not consistent and some days may be skipped (see the observations for id "e"), I cannot simply use shift(1); I need to take the date frequency into account as well.
df['rank'] = df.groupby('id')['rank'].shift(1)
df['weight'] = df.groupby('id')['weight'].shift(1)
df['ew'] = df.groupby('id')['ew'].shift(1)
print(df)
results in:
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 2 0.531250 0.500000
The desired outcome would be (note the observations for id "e"):
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 NaN NaN NaN
I did not manage to simply use freq='d' here. What could be the easiest solution?
CodePudding user response:
One quick and dirty solution that works with your data is to unstack() and stack() the id column to create nulls for the gaps, and then drop the nulls at the end:
In [50]: df = df.set_index('id', append=True).unstack().stack(dropna=False).reset_index(level=1).sort_values('id')
In [51]: df['rank'] = df.groupby('id')['rank'].shift(1)
...: df['weight'] = df.groupby('id')['weight'].shift(1)
...: df['ew'] = df.groupby('id')['ew'].shift(1)
In [52]: print(df[df['price'].notnull()])
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100.0 100.0 NaN NaN NaN
2022-01-02 a 2.0 100.0 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100.0 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100.0 120.0 NaN NaN NaN
2022-01-02 b 1.5 100.0 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100.0 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100.0 130.0 NaN NaN NaN
2022-01-02 c 2.0 100.0 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100.0 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100.0 150.0 NaN NaN NaN
2022-01-02 d 1.1 100.0 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100.0 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100.0 170.0 NaN NaN NaN
2022-01-03 e 1.5 100.0 150.0 NaN NaN NaN
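As a side note, the same dense date-by-id grid can be built with reindex instead of the unstack()/stack() round trip, which some may find more explicit. A minimal sketch, assuming df is still the question's frame (date index, id column, before the shift):
# Build the full (date, id) grid explicitly; missing pairs become all-NaN rows
full = pd.MultiIndex.from_product([df.index.unique(), df['id'].unique()],
                                  names=['date', 'id'])
df = (df.set_index('id', append=True)
        .reindex(full)
        .reset_index(level='id')
        .sort_values('id'))
The grouped shift(1) then works unchanged, and the filler rows can be dropped afterwards with df[df['price'].notnull()].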
CodePudding user response:
You can resample on a groupby to generate a daily series:
# Resample to produce daily data for each id
shifted = df.groupby("id").resample("1D").first().groupby(level=0).shift()
# `shifted` and `df` need the same (id, date) MultiIndex for the assignment to align
df = df.set_index("id", append=True).swaplevel()
# Rely on pandas' automatic row alignment to handle the assignment
cols = ["rank", "weight", "ew"]
df[cols] = shifted[cols]
df.reset_index(0, inplace=True)
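If materializing a dense daily grid is a concern (many ids, long gaps), a merge-based variant of the same idea avoids it: relabel each row with the following day's date and join it back on (date, id), so a day with no previous-day observation simply finds no match. A minimal sketch, assuming df is the question's original frame (date index, before any shifting):
cols = ['rank', 'weight', 'ew']
# Yesterday's rows, relabelled with today's date
prev = df.reset_index()[['date', 'id'] + cols]
prev['date'] = prev['date'] + pd.Timedelta(days=1)
# Left-join: rows whose previous day is missing get NaN
df = (df.drop(columns=cols)
        .reset_index()
        .merge(prev, on=['date', 'id'], how='left')
        .set_index('date'))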