Python: calculate std in the filtered rows-CodePudding

Original data

import pandas as pd
import datetime
import random
df = pd.DataFrame(columns=["date", "7 days return"])
date = datetime.datetime.strptime("2019-01-04", '%Y-%m-%d')
for i in range(365*3):
    df = df.append({"date": date   datetime.timedelta(i), "7 days return":random.uniform(0,1)}, ignore_index=True)
df

Out:

date	7 days return
2019-01-04	0.780342
2019-01-05	0.7238
2019-01-06	0.02408
2019-01-07	0.188626
2019-01-08	0.255635
...	...
2021-12-26	0.833696
2021-12-27	0.806889
2021-12-28	0.902588
2021-12-29	0.833696
2021-12-30	0.806889
2021-12-31	0.902588
2022-01-01	0.182522
2022-01-02	0.612269

I need to calculate the standard deviations in the recent 365 rows but only use the last row of every 7 rows. For example, if I want to calculate the number for the date 2022-01-02, I only use the rows indexed with 2022-01-02, 2021-12-26, 2021-12-19, 2021-12-12, 2021-12-05 and so on in the recent 1 year. If calculate the number for the date 2022-01-01, I only need the rows indexed with 2022-01-01, 2021-12-25, 2021-12-18, 2021-12-11 and so on in the recent 1 year.

To achieve such results, I coded as followed:

a=[]
for i in range(0,len(df),7):
    a.append(i)
Volatility1Y=df.iloc[a].transform(lambda x:x.rolling(52).std())

But it outputed the results also every 7 days, which is different from the results I need. I need to calculate the standard deviation everyday.

Could you please help me with this?

CodePudding user response：

Are you looking for:

volatility_1y = (
    df.groupby(df["date"].dt.dayofweek)["7 days return"]
    .rolling(52).std()
    .droplevel(level=0).sort_index()
)

Essentially: Use .rolling(52) on sub-series of df["7 days return"] defined by the day of the week. (If so, you might want to look into the min_periods keyword argument of .rolling because otherwise you'll get a lot of NaNs.)

Btw., you can create your dataframe much more efficient with something like (.append is depreciated since version 1.4.0):

from random import random

df = pd.DataFrame({
    "date": pd.date_range(start="2019-01-04", periods=365 * 3),
    "7 days return": [random() for _ in range(365 * 3)]
})

CodePudding user response：

If I understand you correct:

a = []
for i in range(7):
    # for each day of week
    df_temp = df[i::7].copy()
    # calculate rolling std
    df_temp['var'] = df_temp['7 days return'].rolling(52).std()
    a.append(
        df_temp
    )
    
result = pd.concat(a, ignore_index=True)

# sort in the end
result.sort_values('date')