I'd like to do a manual train-test-split for a random forest or linear regression on a dataframe called whatever_df
based on the Date
column. I would use this column to select all rows that have dates earlier than 3 months before the most recent one in the column to make a new dataframe called train_df
with those older dates, and a test_df
with all the dates within the latest 3 months. In raw format the dataframe looks like:
PC1 PC2 Date
0 -0.319258 -0.042817 2019-05-24
1 -0.246079 0.131233 2019-05-24
2 -0.037325 0.562841 2019-05-24
3 -0.080725 0.594007 2019-05-24
4 0.133341 0.322822 2019-05-24
... ... ... ...
3607 -3.583419 3.766158 2022-06-26
3608 -3.305263 4.019327 2022-06-26
3609 -2.913036 4.854316 2022-06-26
3610 -2.755733 4.873996 2022-06-26
3611 -2.535929 4.582312 2022-06-26
So what I'd want for train_df
would be all the rows where Date
is up through March 2022 inclusive and test_df
would be all the rows for March 2022-June 2022. I know I could just hardcode this but I would like a dynamic way to select rows based on month values, and I know with datetime
formatted columns I could find the newest date with just max(df['Date'])
but I'm not sure how to say in Python subtract 3 months from that
I have tried this:
whatever_df['Month'] = pd.DatetimeIndex(whatever_df['Date']).month
To get
PC1 PC2 Date Month
0 -0.319258 -0.042817 2019-05-24 5
1 -0.246079 0.131233 2019-05-24 5
2 -0.037325 0.562841 2019-05-24 5
3 -0.080725 0.594007 2019-05-24 5
4 0.133341 0.322822 2019-05-24 5
... ... ... ... ...
3607 -3.583419 3.766158 2022-06-26 6
3608 -3.305263 4.019327 2022-06-26 6
3609 -2.913036 4.854316 2022-06-26 6
3610 -2.755733 4.873996 2022-06-26 6
3611 -2.535929 4.582312 2022-06-26 6
And I know I could use:
from datetime import datetime
from dateutil.relativedelta import relativedelta
To subtract 3 months from the Month
column, however I'm not sure how that would work with January and February. Any ideas?
CodePudding user response:
Instead of trying to use dateutil
's features, a PyData type can be used directly
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"date": pd.Series("2022-08-30", dtype="datetime[ns]")})
>>> df["date"]
0 2022-08-30
Name: date, dtype: datetime64[ns]
>>> df["date"] np.timedelta64(2, 'M')
0 2022-10-29 20:58:12
Name: date, dtype: datetime64[ns]
CodePudding user response:
Solution
train_df = whatever_df[whatever_df['Date'] <= max(whatever_df['Date']) - relativedelta(months = 3)]
test_df = whatever_df[whatever_df['Date'] > max(whatever_df['Date']) - relativedelta(months = 3)]