Home > Back-end >  Index a dataframe based on a date column
Index a dataframe based on a date column

Time:08-30

I'd like to do a manual train-test-split for a random forest or linear regression on a dataframe called whatever_df based on the Date column. I would use this column to select all rows that have dates earlier than 3 months before the most recent one in the column to make a new dataframe called train_df with those older dates, and a test_df with all the dates within the latest 3 months. In raw format the dataframe looks like:

        PC1         PC2         Date
0      -0.319258    -0.042817   2019-05-24
1      -0.246079    0.131233    2019-05-24
2      -0.037325    0.562841    2019-05-24
3      -0.080725    0.594007    2019-05-24
4       0.133341    0.322822    2019-05-24
             ...    ...   ...   ...
3607    -3.583419   3.766158    2022-06-26
3608    -3.305263   4.019327    2022-06-26
3609    -2.913036   4.854316    2022-06-26
3610    -2.755733   4.873996    2022-06-26
3611    -2.535929   4.582312    2022-06-26

So what I'd want for train_df would be all the rows where Date is up through March 2022 inclusive and test_df would be all the rows for March 2022-June 2022. I know I could just hardcode this but I would like a dynamic way to select rows based on month values, and I know with datetime formatted columns I could find the newest date with just max(df['Date']) but I'm not sure how to say in Python subtract 3 months from that

I have tried this:

whatever_df['Month'] = pd.DatetimeIndex(whatever_df['Date']).month

To get

     PC1        PC2         Date      Month
0   -0.319258   -0.042817   2019-05-24  5
1   -0.246079   0.131233    2019-05-24  5
2   -0.037325   0.562841    2019-05-24  5
3   -0.080725   0.594007    2019-05-24  5
4   0.133341    0.322822    2019-05-24  5
... ... ... ... ...
3607    -3.583419   3.766158    2022-06-26  6
3608    -3.305263   4.019327    2022-06-26  6
3609    -2.913036   4.854316    2022-06-26  6
3610    -2.755733   4.873996    2022-06-26  6
3611    -2.535929   4.582312    2022-06-26  6

And I know I could use:

from datetime import datetime
from dateutil.relativedelta import relativedelta

To subtract 3 months from the Month column, however I'm not sure how that would work with January and February. Any ideas?

CodePudding user response:

Instead of trying to use dateutil's features, a PyData type can be used directly

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"date": pd.Series("2022-08-30", dtype="datetime[ns]")})
>>> df["date"]
0   2022-08-30
Name: date, dtype: datetime64[ns]
>>> df["date"]   np.timedelta64(2, 'M')
0   2022-10-29 20:58:12
Name: date, dtype: datetime64[ns]

CodePudding user response:

Solution

train_df = whatever_df[whatever_df['Date'] <= max(whatever_df['Date']) - relativedelta(months = 3)]

test_df = whatever_df[whatever_df['Date'] > max(whatever_df['Date']) - relativedelta(months = 3)]
  • Related