Including a lag specification in a pandas merge based on datetime column

Time:08-13

I am merging columns from one dataframe into a larger one based on a date column, using this code: df_final = pd.merge(df_final, pmms_df, how='left', on='PredictionDate')

pmms_df looks like this:

      PredictionDate    U.S. 30 yr FRM  U.S. 15 yr FRM
0      2014-12-31            3.87          3.15
1      2015-01-01            3.87          3.15
2      2015-01-02            3.87          3.15
3      2015-01-03            3.87          3.15
4      2015-01-04            3.87          3.15
               ...  ... ... ...
2769    2022-07-31           5.30          4.58
2770    2022-08-01           4.99          4.26
2771    2022-08-02           4.99          4.26
2772    2022-08-03           4.99          4.26
2773    2022-08-04           4.99          4.26

and df_final is a large dataframe with roughly 20,000 rows and 61 columns, so I am only including the relevant columns here, post-merge:

      PredictionDate    U.S. 30 yr FRM  U.S. 15 yr FRM
0      2022-03-09            3.85           3.09
1      2022-04-11            5.00           4.17
2      2022-05-10            5.30           4.48
3      2022-06-09            5.23           4.38
4      2021-04-09            3.13           2.42
... ... ... ...
20528   2022-01-11           3.45           2.62
20529   2022-02-09           3.69           2.93
20530   2022-03-09           3.85           3.09
20531   2022-04-11           5.00           4.17
20532   2022-05-10           5.30           4.48

df_final has only one date per month, so the merge finds that date's row in pmms_df and brings the U.S. 30 yr and 15 yr FRM values for that day into df_final as new columns. However, I would also like to add a column to df_final for each of the 30 yr and 15 yr FRM that holds the pmms_df value from 30 days earlier. The desired output would look something like this:

       PredictionDate   U.S. 30 yr FRM  U.S. 15 yr FRM  30yrLag 15yrLag
0      2022-03-09            3.85           3.09           3.72  3.12
1      2022-04-11            5.00           4.17           5.05  4.15
2      2022-05-10            5.30           4.48           5.32  4.58
3      2022-06-09            5.23           4.38            .     .
4      2021-04-09            3.13           2.42            .     .
... ... ... ...
20528   2022-01-11           3.45           2.62            .     .
20529   2022-02-09           3.69           2.93            .     .
20530   2022-03-09           3.85           3.09            .     .
20531   2022-04-11           5.00           4.17            .     .
20532   2022-05-10           5.30           4.48            .     .

So the idea is that those last two columns would contain the 30 yr and 15 yr values from pmms_df for the date 30 days before the date the row was merged on. The values I included here for 30yrLag and 15yrLag stand for the pmms_df values from 30 days before the date in PredictionDate in the final dataframe.

CodePudding user response:

The fix was to compute the lag first and then merge, instead of trying to do both in a single step.
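A minimal sketch of "lag first, then merge", assuming PredictionDate is already a datetime column in both frames and that pmms_df has one row per calendar day. The toy data below is made up for illustration, and lag_df is a hypothetical helper name; this is one way to do it, not necessarily the exact code the answer linked to.

```python
import pandas as pd

# Toy stand-in for pmms_df: one row per calendar day (values are made up).
dates = pd.date_range("2022-01-01", "2022-03-31", freq="D")
pmms_df = pd.DataFrame({
    "PredictionDate": dates,
    "U.S. 30 yr FRM": [3.0 + 0.01 * i for i in range(len(dates))],
    "U.S. 15 yr FRM": [2.0 + 0.01 * i for i in range(len(dates))],
})

# Toy stand-in for df_final: one date per month.
df_final = pd.DataFrame({
    "PredictionDate": pd.to_datetime(["2022-03-09", "2022-02-09", "2022-01-11"]),
})

# Step 1: build the lagged table. Moving every date forward by 30 days means
# a row keyed on date D now carries the rates observed on D minus 30 days.
lag_df = (
    pmms_df
    .rename(columns={"U.S. 30 yr FRM": "30yrLag", "U.S. 15 yr FRM": "15yrLag"})
    .assign(PredictionDate=lambda d: d["PredictionDate"] + pd.Timedelta(days=30))
)

# Step 2: merge the same-day rates, then the lagged rates, on the same key.
df_final = df_final.merge(pmms_df, how="left", on="PredictionDate")
df_final = df_final.merge(lag_df, how="left", on="PredictionDate")

print(df_final)
```

Rows whose date minus 30 days falls outside pmms_df (here 2022-01-11) get NaN in the lag columns. If pmms_df were missing some days, pd.merge_asof with a backward direction might be a better fit than an exact-match merge.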
