I am working on a random forest model for a medical prediction project. The dataset contains patient information, including features, diagnosis, and a patient ID (unique for each patient). Instead of splitting the dataset solely by proportion (i.e., 75% of the data to train, 25% to test), I want to split it by patient (i.e., randomly select patients) while still satisfying the 75%/25% train/test ratio. Can anyone suggest how I can achieve this? Thanks in advance!
CodePudding user response:
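You can use initial_split() from the rsample package; its default prop = 3/4 already gives a 75/25 split. In this example, the column am serves as the stratum, so the training and test sets keep roughly the same proportion of each am value as the original data. In your case the stratum would normally be the diagnosis rather than the patient ID: stratifying balances a variable across the two sets, but it does not keep one patient's rows together.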
library(tidyverse)
library(rsample)
# Stratified 75/25 split (prop = 3/4 is the default, shown here explicitly)
my_split <- initial_split(mtcars, prop = 3/4, strata = am)
train_data <- training(my_split)
test_data <- testing(my_split)
# Original data
mtcars %>%
  count(am) %>%
  mutate(prop = n / sum(n))
  am  n    prop
1  0 19 0.59375
2  1 13 0.40625
# Training data
train_data %>%
  count(am) %>%
  mutate(prop = n / sum(n))
  am  n prop
1  0 15  0.6
2  1 10  0.4
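If some patients have more than one row and you want every row of a patient to land in the same set, one option is to sample the unique patient IDs and then filter on them. The following is only a minimal base R sketch: the data frame patients and its columns patient_id, feature, and diagnosis are made up for illustration and stand in for your real data.

```r
# Hypothetical data: 40 patients with 1 to 5 records each (all names are illustrative)
set.seed(42)
n_patients <- 40
records_per_patient <- sample(1:5, n_patients, replace = TRUE)
n_records <- sum(records_per_patient)
patients <- data.frame(
  patient_id = rep(seq_len(n_patients), times = records_per_patient),
  feature    = rnorm(n_records),
  diagnosis  = rbinom(n_records, 1, 0.3)
)

# Sample 75% of the unique patient IDs for training
ids <- unique(patients$patient_id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)))

# Every record follows its patient into exactly one set
train_data <- patients[patients$patient_id %in% train_ids, ]
test_data  <- patients[!patients$patient_id %in% train_ids, ]

# Sanity check: no patient appears in both sets
stopifnot(length(intersect(train_data$patient_id, test_data$patient_id)) == 0)
```

Note that the 75/25 ratio here applies to patients, so the ratio of rows will only be approximately 75/25 when patients have different numbers of records. Recent versions of rsample also provide group_initial_split(), which takes a group argument and should do the same thing within the rsample workflow; check that your installed version has it.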