can someone explain to me why the value of split is false in the test set?

Time:06-23

Can someone explain why split == FALSE is used for the test set?

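library(caTools)  # provides sample.split
# split is a logical vector: TRUE marks training rows, FALSE marks test rows
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)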

CodePudding user response:

I assume this code comes from the caTools documentation or a tutorial that uses it. Try running the first line on its own and printing split; it should start to make sense.

Basically, caTools::sample.split returns a random logical vector (TRUEs and FALSEs) with one entry per element of its first argument, so one entry per row of the dataset, with TRUE appearing in roughly the given ratio. Let's take the iris dataset as an example (it has 150 rows):

split = sample.split(iris$Sepal.Length, SplitRatio = 2/3)

The result is a 150-element logical vector in which about 2/3 of the entries (around 100) are TRUE and about 1/3 (around 50) are FALSE.

Next, the subset function picks out every row i of iris where split[i] == TRUE to build the training set, and every row i where split[i] == FALSE to build the test set.

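That is why you use split == TRUE for the training set and split == FALSE for the test set. FALSE does not mean anything went wrong; it simply marks the rows that were not selected for training, so the two sets never overlap and together cover the whole dataset.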
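
If you want to see the idea without caTools installed, here is a minimal base R sketch. It builds a logical mask by hand with sample() as a stand-in for sample.split, so it is not how caTools does it internally; the names mask and n_train and the seed value are just assumptions for this illustration.

```r
# Base R stand-in for the logical mask (not caTools' own algorithm)
set.seed(42)                          # arbitrary seed so the split is reproducible
n <- nrow(iris)                       # 150 rows
n_train <- round(n * 2 / 3)           # 100 rows go to training

# 100 TRUEs and 50 FALSEs, shuffled into random order
mask <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))

training_set <- subset(iris, mask == TRUE)    # rows where mask is TRUE
test_set     <- subset(iris, mask == FALSE)   # all the remaining rows

table(mask)          # how many FALSE and TRUE entries the mask holds
nrow(training_set)   # 100
nrow(test_set)       # 50
```

Every row lands in exactly one of the two sets, which is the whole point of filtering on TRUE for one and FALSE for the other.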
