I have a data frame that I want to split by year rather than at random: 2013 to 2018 should go into the training set and 2019 to 2022 into the testing set. I have tried the code below, but it keeps selecting the dates at random. Can anyone please help?
Here is my code.
library(caTools)  # provides sample.split()
split <- sample.split(cement_index$CCYY, SplitRatio = 0.7)
train = subset(cement_index, split == TRUE)
test = subset(cement_index, split == FALSE)
CodePudding user response:
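You need to create the two groups you mention. Note that sample.split() still samples rows at random; passing the groups only keeps the proportion of each period the same in both sets: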
groups <- ifelse(cement_index$CCYY < 2019, "2013-2018", "2019-2022")
split <- sample.split(groups, SplitRatio = 0.7)
train = subset(cement_index, split == TRUE)
test = subset(cement_index, split == FALSE)
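If the goal is a purely chronological split with no sampling at all, plain logical indexing on the year column may be enough. The following is a minimal sketch, not the asker's actual data: the `cement_index` data frame here is a hypothetical stand-in with made-up `index` values, and it assumes `CCYY` is a numeric year.

```r
# Hypothetical stand-in for cement_index; the `index` values are made up.
cement_index <- data.frame(CCYY = 2013:2022,
                           index = seq(10, 100, by = 10))

# Deterministic split by year: every row up to 2018 goes to train,
# every row from 2019 onwards goes to test. Nothing is sampled.
train <- cement_index[cement_index$CCYY <= 2018, ]
test  <- cement_index[cement_index$CCYY >= 2019, ]
```

Because no random sampling is involved, running this repeatedly gives the same two sets every time.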
CodePudding user response:
A different approach would be to split the data into two subsets by year using split() and cut(). You can use the following code (I created random numbers for your index):
df <- data.frame(CCYY = 2013:2022,
                 index = sample(1:10, 10))
# Cut the years into the intervals (2012, 2018] and (2018, 2022], then split on them
parts <- split(df, cut(df$CCYY, c(2012, 2018, 2022), include.lowest = FALSE))
train <- parts$`(2012,2018]`
test <- parts$`(2018,2022]`
Output train:
  CCYY index
1 2013     8
2 2014     1
3 2015     3
4 2016     7
5 2017     5
6 2018     9
Output test:
   CCYY index
7  2019     6
8  2020    10
9  2021     2
10 2022     4
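A year-based cut does not necessarily give the 0.7 ratio from the question, so it can be worth checking what share of rows actually lands in each set. The sketch below assumes one row per year, as in the example above; the `index` values are made up, and `train_share` and `overlap` are names introduced only for illustration.

```r
# Made-up data: one row per year, as in the example above.
df <- data.frame(CCYY = 2013:2022, index = 1:10)

# Same chronological cut: (2012, 2018] and (2018, 2022].
parts <- split(df, cut(df$CCYY, c(2012, 2018, 2022)))
train <- parts[[1]]
test  <- parts[[2]]

# Share of rows that ended up in the training set (0.6 for this toy data).
train_share <- nrow(train) / nrow(df)

# Years present in both sets; this should be empty for a clean split.
overlap <- intersect(train$CCYY, test$CCYY)
```

With real data that has several rows per year, the share will depend on how many observations each year contributes.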