I have a DataFrame that looks something like this:
            daily_return  year month day
date
2018-12-27           NaN  2018    12  27
2018-12-28      1.020245  2018    12  28
2018-12-31      1.000650  2018    12  31
2019-01-02      1.020473  2019    01  02
2019-01-03      1.009129  2019    01  03
...                  ...   ...   ...  ..
2023-01-20      1.001087  2023    01  20
2023-01-23      0.998455  2023    01  23
2023-01-24      1.002491  2023    01  24
2023-01-25      1.004482  2023    01  25
2023-01-26      1.004953  2023    01  26
For each date (each row), I want to calculate the year-to-date standard deviation of the daily_return column.
Examples:
- For the row '2018-12-31', the result would be the standard deviation of the daily_return values from '2018-12-01' up until '2018-12-31': [NaN, 1.020245, 1.000650].
- For the row '2019-01-03', we are referring to the year 2019. Therefore, the standard deviation should only consider values from '2019-01-01' up until the current row's date, '2019-01-03': [1.020473, 1.009129].
- For the row '2023-01-23', the result should use all daily_return values from '2023-01-01' up until '2023-01-23'.
- For the row '2023-01-24', the result should use all daily_return values from '2023-01-01' up until '2023-01-24'.
- ...
I managed to make something similar work for a month-to-date cumprod by using:
df.groupby(['month', 'year'])['daily_return'].cumprod()
date
2018-12-27 NaN
2018-12-28 1.020245
2018-12-31 1.020909
2019-01-02 1.020473
2019-01-03 1.029789
...
2023-01-20 0.999914
2023-01-23 0.998369
2023-01-24 1.000856
2023-01-25 1.005342
2023-01-26 1.010322
You can see it works properly since, for every date, it gives me the month-to-date cumprod.
However, whenever I try grouping by year and calculating the std, it only calculates the std for the entire year, not the year-to-date value for each row:
df.groupby(['year'])['daily_return'].std()
year
2018 0.013856
2019 0.007128
2020 0.021325
2021 0.010358
2022 0.009854
2023 0.006444
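The difference between the two calls above can be reproduced on a small frame. This is only a sketch with made-up data (the `toy` frame and its `key` and `x` columns are hypothetical, not part of the original data): on a groupby, cumprod is a cumulative operation that returns one value per input row, while std is a reduction that returns one value per group.

```python
import pandas as pd

# Hypothetical toy data with two groups, "a" and "b".
toy = pd.DataFrame(
    {"key": ["a", "a", "a", "b", "b"], "x": [1.0, 2.0, 4.0, 3.0, 5.0]}
)

grouped = toy.groupby("key")["x"]

# Cumulative operation: the result has the same length and index as the input.
per_row = grouped.cumprod()

# Reduction: the result has one entry per group, indexed by the group key.
per_group = grouped.std()

print(len(per_row), len(per_group))
```

So a per-row, year-to-date std needs a windowed calculation within each group rather than a plain reduction.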
Answer:
If you first sort the index so that the rows are in chronological order, you can then group by year, use an expanding window within each group, and apply the .std() function:
import pandas as pd
# may need to convert the index to datetime if it isn't already
df.index = pd.to_datetime(df.index, format="%Y-%m-%d")
# sort the index
df.sort_index(inplace=True)
# calculate YTD standard deviation
df["YTD std"] = df.groupby(df.index.year)['daily_return'].expanding().std().droplevel(0)
df
#Out:
#            daily_return  year  month  day   YTD std
#date
#2018-12-27           NaN  2018     12   27       NaN
#2018-12-28      1.020245  2018     12   28       NaN
#2018-12-31      1.000650  2018     12   31  0.013856
#2019-01-02      1.020473  2019      1    2       NaN
#2019-01-03      1.009129  2019      1    3  0.008021
#2023-01-20      1.001087  2023      1   20       NaN
#2023-01-23      0.998455  2023      1   23  0.001861
#2023-01-24      1.002491  2023      1   24  0.002049
#2023-01-25      1.004482  2023      1   25  0.002533
#2023-01-26      1.004953  2023      1   26  0.002650
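As a quick sanity check, the approach can be run on just the first five sample rows from the question (the full data set is not available here, so this is a sketch, not a reproduction of the complete output). It also shows a possible variant using transform, which keeps the original index so droplevel is not needed. The names `ytd_std`, `via_droplevel` and `manual_2019` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# First five sample rows copied from the question.
df = pd.DataFrame(
    {"daily_return": [np.nan, 1.020245, 1.000650, 1.020473, 1.009129]},
    index=pd.to_datetime(
        ["2018-12-27", "2018-12-28", "2018-12-31", "2019-01-02", "2019-01-03"]
    ),
)
df.index.name = "date"
df = df.sort_index()

# Approach from the answer: expanding window inside each year,
# then drop the extra year level that groupby adds to the index.
via_droplevel = (
    df.groupby(df.index.year)["daily_return"].expanding().std().droplevel(0)
)

# Variant: transform returns a result aligned to the original index.
ytd_std = df.groupby(df.index.year)["daily_return"].transform(
    lambda s: s.expanding().std()
)

# Manual check for 2019-01-03: sample std (ddof=1) of the two 2019 returns.
manual_2019 = np.std([1.020473, 1.009129], ddof=1)

print(ytd_std)
print(manual_2019)
```

Note that .std() here is the sample standard deviation (ddof=1 by default), which is why the first non-NaN observation of each year yields NaN: a single value has no sample standard deviation.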