DataFrame groupby and shifting values based on freq

I am trying to group my DataFrame and shift some columns by one day. Code for the test df:

    import pandas as pd
    import datetime as dt

    d = {'date' : ['202211', '202211', '202211','202211','202211', '202212', '202212', '202212', '202212', '202213', '202213', '202213', '202213', '202213'],
         'id' : ['a', 'b', 'c', 'd','e', 'a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'e'],
         'price' : [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 2, 1.1, 2, 1.5, 0.8, 1.3, 1.5],
         'shrs' : [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}
    df = pd.DataFrame(data = d)
    df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
    df.set_index('date', inplace=True)
    df["me"] = df['price'] * df['shrs']
    df['rank'] = df.groupby('date')['price'].transform(
        lambda x: pd.qcut(x, 2, labels=range(1, 3), duplicates='drop'))
    df['weight'] = df['me'] / (df.groupby(['date', 'rank'])['me'].transform('sum'))
    df['ew'] = 1 / (df.groupby(['date', 'rank'])['price'].transform('count'))
    df.sort_values(['id', 'date'], inplace=True)
    print(df)
   
               id  price  shrs     me rank    weight        ew
    date                                                      
    2022-01-01  a    1.0   100  100.0    1  0.285714  0.333333
    2022-01-02  a    2.0   100  200.0    2  0.500000  0.500000
    2022-01-03  a    2.0   100  200.0    2  1.000000  1.000000
    2022-01-01  b    1.2   100  120.0    1  0.342857  0.333333
    2022-01-02  b    1.5   100  150.0    1  0.576923  0.500000
    2022-01-03  b    1.5   100  150.0    1  0.294118  0.250000
    2022-01-01  c    1.3   100  130.0    1  0.371429  0.333333
    2022-01-02  c    2.0   100  200.0    2  0.500000  0.500000
    2022-01-03  c    0.8   100   80.0    1  0.156863  0.250000
    2022-01-01  d    1.5   100  150.0    2  0.468750  0.500000
    2022-01-02  d    1.1   100  110.0    1  0.423077  0.500000
    2022-01-03  d    1.3   100  130.0    1  0.254902  0.250000
    2022-01-01  e    1.7   100  170.0    2  0.531250  0.500000
    2022-01-03  e    1.5   100  150.0    1  0.294118  0.250000

The following code gets me almost what I want. But because my data is not consistent and some days may be missing (see the observations for id "e"), I cannot simply use shift(1); the shift also needs to take the date frequency into account.

    df['rank'] = df.groupby('id')['rank'].shift(1)
    df['weight'] = df.groupby('id')['weight'].shift(1)
    df['ew'] = df.groupby('id')['ew'].shift(1)
    print(df)

results in:

               id  price  shrs     me rank    weight        ew
    date                                                      
    2022-01-01  a    1.0   100  100.0  NaN       NaN       NaN
    2022-01-02  a    2.0   100  200.0    1  0.285714  0.333333
    2022-01-03  a    2.0   100  200.0    2  0.500000  0.500000
    2022-01-01  b    1.2   100  120.0  NaN       NaN       NaN
    2022-01-02  b    1.5   100  150.0    1  0.342857  0.333333
    2022-01-03  b    1.5   100  150.0    1  0.576923  0.500000
    2022-01-01  c    1.3   100  130.0  NaN       NaN       NaN
    2022-01-02  c    2.0   100  200.0    1  0.371429  0.333333
    2022-01-03  c    0.8   100   80.0    2  0.500000  0.500000
    2022-01-01  d    1.5   100  150.0  NaN       NaN       NaN
    2022-01-02  d    1.1   100  110.0    2  0.468750  0.500000
    2022-01-03  d    1.3   100  130.0    1  0.423077  0.500000
    2022-01-01  e    1.7   100  170.0  NaN       NaN       NaN
    2022-01-03  e    1.5   100  150.0    2  0.531250  0.500000

The desired outcome would be (note the rows for id "e"):

               id  price  shrs     me rank    weight        ew
    date                                                      
    2022-01-01  a    1.0   100  100.0  NaN       NaN       NaN
    2022-01-02  a    2.0   100  200.0    1  0.285714  0.333333
    2022-01-03  a    2.0   100  200.0    2  0.500000  0.500000
    2022-01-01  b    1.2   100  120.0  NaN       NaN       NaN
    2022-01-02  b    1.5   100  150.0    1  0.342857  0.333333
    2022-01-03  b    1.5   100  150.0    1  0.576923  0.500000
    2022-01-01  c    1.3   100  130.0  NaN       NaN       NaN
    2022-01-02  c    2.0   100  200.0    1  0.371429  0.333333
    2022-01-03  c    0.8   100   80.0    2  0.500000  0.500000
    2022-01-01  d    1.5   100  150.0  NaN       NaN       NaN
    2022-01-02  d    1.1   100  110.0    2  0.468750  0.500000
    2022-01-03  d    1.3   100  130.0    1  0.423077  0.500000
    2022-01-01  e    1.7   100  170.0  NaN       NaN       NaN
    2022-01-03  e    1.5   100  150.0  NaN       NaN       NaN

I did not manage to make this work by simply passing freq='d'. What would be the easiest solution?
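
For reference: when shift is given a freq, it moves the index labels rather than the rows, so the shifted values have to be re-aligned on both date and id before they can be assigned back. A minimal sketch of that date-relabelling idea, starting again from the unshifted df built above (not the one already overwritten by shift(1)):

    # Relabel each row's date as date + 1 day, then left-merge back on
    # (date, id); rows with no observation on the previous day get NaN.
    cols = ['rank', 'weight', 'ew']
    prev = df.reset_index()[['date', 'id'] + cols]
    prev = prev.assign(date=prev['date'] + pd.Timedelta(days=1))
    df = (df.drop(columns=cols)
            .reset_index()
            .merge(prev, on=['date', 'id'], how='left')
            .set_index('date'))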

CodePudding user response:

One quick and dirty solution that works with your data is to unstack() and stack() the id column to create nulls for the gaps and then drop the nulls at the end:

In [50]: df = df.set_index('id', append=True).unstack().stack(dropna=False).reset_index(level=1).sort_values('id')

In [51]: df['rank'] = df.groupby('id')['rank'].shift(1)
    ...: df['weight'] = df.groupby('id')['weight'].shift(1)
    ...: df['ew'] = df.groupby('id')['ew'].shift(1)

In [52]: print(df[df['price'].notnull()])
           id  price   shrs     me rank    weight        ew
date
2022-01-01  a    1.0  100.0  100.0  NaN       NaN       NaN
2022-01-02  a    2.0  100.0  200.0    1  0.285714  0.333333
2022-01-03  a    2.0  100.0  200.0    2  0.500000  0.500000
2022-01-01  b    1.2  100.0  120.0  NaN       NaN       NaN
2022-01-02  b    1.5  100.0  150.0    1  0.342857  0.333333
2022-01-03  b    1.5  100.0  150.0    1  0.576923  0.500000
2022-01-01  c    1.3  100.0  130.0  NaN       NaN       NaN
2022-01-02  c    2.0  100.0  200.0    1  0.371429  0.333333
2022-01-03  c    0.8  100.0   80.0    2  0.500000  0.500000
2022-01-01  d    1.5  100.0  150.0  NaN       NaN       NaN
2022-01-02  d    1.1  100.0  110.0    2  0.468750  0.500000
2022-01-03  d    1.3  100.0  130.0    1  0.423077  0.500000
2022-01-01  e    1.7  100.0  170.0  NaN       NaN       NaN
2022-01-03  e    1.5  100.0  150.0  NaN       NaN       NaN
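
The trick is that the stacked frame keeps the filler rows (price is null for id "e" on 2022-01-02), which is what makes the positional shift(1) date-correct. A quick check, continuing the same session:

In [53]: # the filler row absorbs the shifted values, so id "e"
    ...: # correctly ends up with NaN on 2022-01-03
    ...: print(df[df['id'] == 'e'])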

CodePudding user response:

You can resample on a groupby to generate a daily series:

    # Resample within each id group to produce a daily series per id
    shifted = df.groupby("id").resample("1D").first().groupby(level=0).shift()

    # `shifted` and `df` must have indexes of the same shape,
    # so give `df` the same (id, date) MultiIndex
    df = df.set_index("id", append=True).swaplevel()

    # Rely on pandas' automatic row alignment to handle the assignment
    cols = ["rank", "weight", "ew"]
    df[cols] = shifted[cols]
    df.reset_index(0, inplace=True)
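
Sorting the result back by id and date should reproduce the desired table, with NaN in the shifted columns for both of id "e"'s rows:

    # 'date' is the index name, so sort_values can combine it with the id column
    print(df.sort_values(['id', 'date']))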