I have data in a DataFrame and need to calculate the RMSE of a column over a fixed window of months, comparing each past year against the same window in the current year. I cannot figure out how to set up the sequencing by year. For example, for each start year I need the RMSE of the "Variation" column over the window from month == 5 through month == 2 of the following year, and then print all the RMSE values by start year. My data looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
1992-04-01 00:00:00 4 12.265100 ... -1.074586 NaN
1992-05-01 00:00:00 5 12.533220 ... -1.017388 0.057198
1992-06-01 00:00:00 6 12.491247 ... -1.117406 -0.100018
1992-07-01 00:00:00 7 12.113165 ... -1.401221 -0.283815
1992-08-01 00:00:00 8 11.846904 ... -1.359026 0.042195
1992-09-01 00:00:00 9 11.526178 ... -0.299250 1.059776
1992-10-01 00:00:00 10 11.555834 ... -0.628162 -0.328911
1992-11-01 00:00:00 11 11.746104 ... -1.116374 -0.488213
1992-12-01 00:00:00 12 11.891824 ... -0.143343 0.973031
1993-01-01 00:00:00 1 11.997252 ... -0.486450 -0.343107
1993-02-01 00:00:00 2 12.028855 ... -0.862971 -0.376521
1993-03-01 00:00:00 3 12.063974 ... -0.596869 0.266102
1993-04-01 00:00:00 4 12.265100 ... -0.923695 -0.326826
1993-05-01 00:00:00 5 12.533220 ... 0.322987 1.246682
1993-06-01 00:00:00 6 12.491247 ... -0.478567 -0.801554
1993-07-01 00:00:00 7 12.113165 ... -0.274119 0.204448
1993-08-01 00:00:00 8 11.846904 ... -0.707968 -0.433849
1993-09-01 00:00:00 9 11.526178 ... 0.167246 0.875214
1993-10-01 00:00:00 10 11.555834 ... -0.089410 -0.256656
1993-11-01 00:00:00 11 11.746104 ... -1.046461 -0.957050
1993-12-01 00:00:00 12 11.891824 ... -1.293175 -0.246714
1994-01-01 00:00:00 1 11.997252 ... -1.505133 -0.211959
1994-02-01 00:00:00 2 12.028855 ... -0.610121 0.895012
1994-03-01 00:00:00 3 12.063974 ... -0.974184 -0.364063
1994-04-01 00:00:00 4 12.265100 ... -1.077609 -0.103424
The observed data from the current year looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
2021-05-01 00:00:00 5 12.533220 ... -0.935899 0.206586
2021-06-01 00:00:00 6 12.491247 ... -0.647261 0.288638
2021-07-01 00:00:00 7 12.113165 ... -0.711730 -0.064469
2021-08-01 00:00:00 8 11.846904 ... -0.482306 0.229424
2021-09-01 00:00:00 9 11.526178 ... -0.116989 0.365317
2021-10-01 00:00:00 10 11.555834 ... 0.319614 0.436603
2021-11-01 00:00:00 11 11.746104 ... 0.880379 0.560765
2021-12-01 00:00:00 12 11.891824 ... 0.630541 -0.249838
2022-01-01 00:00:00 1 11.997252 ... -0.151507 -0.782048
2022-02-01 00:00:00 2 12.028855 ... -0.237398 -0.085891
The result should look something like the table below. I've tried using groupby to calculate the RMSE, but I'm not sure how to give groupby a range of dates.
year RMSE Variation
1992 number
1993 number
1994 number
.. ..
2020 number
thank you,
CodePudding user response:
First, pre-process the dataframe of previous years. Get the year label by subtracting four months from each date and taking the year component, so that January and February are grouped with the preceding May to December. Then drop March and April, which fall outside the window.
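import pandas as pd
# Assumes 'date' is a regular datetime column; if it is the index, run df_prev = df_prev.reset_index() first.
# Shift each date back four months so May..Feb share one start-year label (the result keeps df_prev's index).
df_prev['year'] = (df_prev['date'] - pd.DateOffset(months=4)).dt.year
# March and April are outside the May-to-February window.
df_prev = df_prev[~df_prev['month'].isin([3, 4])]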
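To check the labeling step, here is a minimal sketch on a made-up frame. The name `demo` and its dates are hypothetical, `date` is assumed to be a regular column, and the four-month shift is done with `pd.DateOffset`.

```python
import pandas as pd

# Hypothetical miniature of df_prev: one row per month, 'date' as a column.
dates = pd.date_range("1992-04-01", "1994-04-01", freq="MS")
demo = pd.DataFrame({"date": dates, "month": dates.month})

# Shifting back four months keeps May..Dec in their own year and moves
# Jan..Feb to the previous year, so each May-to-February run shares a label.
demo["year"] = (demo["date"] - pd.DateOffset(months=4)).dt.year

# March and April fall outside the May-to-February window.
demo = demo[~demo["month"].isin([3, 4])]

# Each complete window should contribute exactly ten months.
months_per_year = demo.groupby("year")["month"].count()
print(months_per_year)
```

Counting rows per label like this is a quick way to spot start years with an incomplete window before computing anything on them.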
Then convert df_prev into a matrix with years as columns and months as the index, and convert the table for this year into a series with month as the index.
df_prev_vari = df_prev.set_index(['month', 'year'])[['Variation']].unstack().droplevel(0, axis=1)
df_this_vari = df_this.set_index('month')['Variation']
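To see what the reshaped table looks like, here is a small sketch with invented numbers (`long_df` and its values are hypothetical). Selecting the `Variation` column as a Series before `unstack()` gives plain year columns directly, so no `droplevel` is needed in this variant.

```python
import pandas as pd

# Hypothetical long-format data: two start years, three months each.
long_df = pd.DataFrame({
    "year": [1992, 1992, 1992, 1993, 1993, 1993],
    "month": [5, 6, 7, 5, 6, 7],
    "Variation": [0.1, -0.2, 0.3, 0.4, 0.0, -0.1],
})

# Rows become months, columns become start years.
wide = long_df.set_index(["month", "year"])["Variation"].unstack()
print(wide)
```

Note that `unstack()` raises an error if a (month, year) pair appears more than once, so this assumes one row per month per start year.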
Having month as the common index for both objects lets us subtract one from the other with the rows matched by month, then square the differences, take the mean per year, and take the square root.
(df_prev_vari.sub(df_this_vari, axis=0)**2).mean()**.5
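One thing to watch: `mean()` skips NaN by default, so a start year with missing months still gets an RMSE, just over fewer months. The sketch below uses hypothetical stand-ins (`prev`, `this`) for the two reshaped objects and shows one way to blank out incomplete years and get the result into the requested two-column layout.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: months as index, start years as columns.
prev = pd.DataFrame(
    {1992: [1.0, 2.0, 3.0], 1993: [0.0, 0.0, np.nan]},
    index=pd.Index([5, 6, 7], name="month"),
)
# Hypothetical current-year series on the same month index.
this = pd.Series([0.0, 0.0, 0.0], index=pd.Index([5, 6, 7], name="month"))

# Align on month, then square the differences.
squared_error = prev.sub(this, axis=0) ** 2

# mean() skips NaN, so 1993 is averaged over two months only.
rmse = squared_error.mean() ** 0.5

# Optional: keep the RMSE only for start years that cover every month.
complete = prev.notna().all()
rmse_complete = rmse.where(complete)

# Reshape into a two-column table: year, RMSE Variation.
result = rmse.rename("RMSE Variation").rename_axis("year").reset_index()
print(result)
```

Whether to keep or drop partial years depends on the use case; the masking step is only a suggestion.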