I have a dataset at daily granularity covering 4 years: 2018, 2019, 2020 and 2021. There is also some data available for Q1 2022, which I will use as unseen data for model testing. I want to use K-fold to create datasets per year, so that I can loop through each fold, train a model and generate error metrics.
Here is what I am trying to do -
Training data - 2018-01-01 to 2021-12-31
Unseen data - 2022-01-01 to 2022-03-31
From the training data, I want to generate the folds as below -
iteration 1 -
training data - 2018-01-01 to 2018-12-31, validation data - 2019-01-01 to 2019-03-31
iteration 2 -
training data - 2019-01-01 to 2019-12-31, validation data - 2020-01-01 to 2020-03-31
iteration 3 -
training data - 2020-01-01 to 2020-12-31, validation data - 2021-01-01 to 2021-03-31
Once I create these sets, I can use the training data for training and the validation data for evaluation. How can I do this in pandas?
Here is the sample data (other fields are hidden for confidential purposes) -
CodePudding user response:
Scikit-learn's TimeSeriesSplit would allow you to generate contiguous train and test folds of a defined size. TimeSeriesSplit(max_train_size=365, test_size=91) will produce train folds of one year and test folds of (approximately) one quarter (note that the folds will drift away from calendar years slightly).
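As a rough sketch of how that could look, the snippet below builds a hypothetical daily DataFrame (the `value` column and the synthetic dates are placeholders, not your data) and prints the date range of each fold. It assumes one row per day, sorted by date, and `n_splits=3` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the real data: one row per day, sorted by date.
dates = pd.date_range("2018-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# Each fold: at most 365 training rows, followed directly by 91 test rows.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=365, test_size=91)

folds = []
for train_idx, test_idx in tscv.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    folds.append((train, test))
    print(
        "train:", train.index.min().date(), "to", train.index.max().date(),
        "| test:", test.index.min().date(), "to", test.index.max().date(),
    )
```

Because the splitter counts rows rather than dates, the test folds are taken from the end of the data and step back in 91-row increments, so they will not line up with Q1 of each year.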
This should work for you if, as you suggest, it isn't critical to only test on Q1 of each year. If you prefer to only test on Q1, you should be able to do so with a list comprehension and pandas's datetime indexing, like:
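import numpy as np
import pandas as pd
years = np.arange(2018, 2021)  # 2018, 2019, 2020
df = df.set_index('created_date')  # pass drop=False to also keep created_date as a column
df.index = pd.to_datetime(df.index)  # if it isn't already datetime
df = df.sort_index()  # date-range slicing expects a sorted index
cv_splits = [(df.loc[f"{year}"], df.loc[f"{year + 1}-01":f"{year + 1}-03"]) for year in years]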
This should give you a list of tuples, each containing first all the samples from a single year, then all the samples from the first quarter of the following year.
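To loop over those folds and collect an error metric per fold, something along these lines might work. This is only a sketch: the `target` column, the synthetic data, and the "predict the training mean" baseline are all placeholders for your own columns and model.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data; "target" is an assumed column name.
dates = pd.date_range("2018-01-01", "2021-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.normal(100, 10, len(dates))}, index=dates)

# One full calendar year for training, Q1 of the next year for validation.
years = range(2018, 2021)
cv_splits = [
    (df.loc[f"{year}"], df.loc[f"{year + 1}-01":f"{year + 1}-03"])
    for year in years
]

results = []
for train, valid in cv_splits:
    # Placeholder "model": predict the training mean for every validation day.
    prediction = train["target"].mean()
    mae = (valid["target"] - prediction).abs().mean()
    results.append({
        "train_year": train.index[0].year,
        "n_train": len(train),
        "n_valid": len(valid),
        "mae": mae,
    })

metrics = pd.DataFrame(results)
print(metrics)
```

Swapping the baseline for a real model only changes the two lines inside the loop that compute `prediction` and `mae`; the Q1 2022 data stays out of the loop entirely and is used once at the end.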