Split Train & Test Sets but Indexed Input Differs from Subscript by 1: why?

I've split my data into training and testing sets, but I keep receiving this error:

! Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
x Input has size 4067 but subscript split_data_table == 0 has size 4066.

My data is named "JFK_weather_clean2". To execute the split, I did:

set.seed(1234)
split_data_table <- sample(c(rep(0, 0.8 * nrow(JFK_weather_clean2)), rep(1, 0.2 * nrow(JFK_weather_clean2))))

table(split_data_table) returns:

0	1
3253	813

From there I tried to create the training set:

training_set <- JFK_weather_clean2[split_data_table == 0, ]

As you have probably noticed, my input data has 4,067 rows (which I think includes the header row), whereas the subscript has size 4,066. I assume the issue involves the header row, but I don't know what to change in my sample() code. Thanks for any help!

CodePudding user response：

The header row is not the cause: nrow() counts only data rows. The problem is the rep function, which you used to build the split labels. When its times argument is not a whole number, rep silently drops the fractional part. The documentation of rep says this about times:

A double vector is accepted, other inputs being coerced to an integer or double vector.

In practice, a positive non-integer value is truncated to the largest integer not greater than it, not rounded to the nearest one. For example, mtcars has 32 rows, 80% of which is 25.6, but rep uses 25, not 26.

0.8 * nrow(mtcars)
# [1] 25.6
length(rep(0, 0.8 * nrow(mtcars)))
# [1] 25

If you apply your code to mtcars, you get 31 labels in total (25 zeros and 6 ones), not the expected 32.

length(c(rep(0, 0.8 * nrow(mtcars)), rep(1, 0.2 * nrow(mtcars))))
# [1] 31
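The same arithmetic explains the numbers in the question. The sketch below uses only the row count reported in the error message (4067); the variable names are placeholders chosen for illustration, not part of the original code.

```r
# Row count taken from the error message in the question.
n <- 4067

n_train <- 0.8 * n   # 3253.6
n_test  <- 0.2 * n   # 813.4

# rep() drops the fractional part of `times`,
# so only 3253 zeros and 813 ones are created.
labels <- c(rep(0, n_train), rep(1, n_test))

table(labels)
length(labels)
# [1] 4066
```

The label vector is one element shorter than the data, which is exactly the size mismatch the error reports.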

This truncation in rep causes no problem when each share of the rows works out to a whole number, as in iris, which has 150 rows, so 80% of it is exactly 120.

length(c(rep(0, 0.8 * nrow(iris)), rep(1, 0.2 * nrow(iris))))
# [1] 150

One solution that gives the correct total is to wrap the times argument of rep in round, so that each count is rounded to the nearest integer instead of truncated.

length(c(rep(0, round(0.8 * nrow(mtcars))), rep(1, round(0.2 * nrow(mtcars)))))
# [1] 32
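Rounding both counts independently works for an 80/20 split, but it is not guaranteed to add up for every ratio (for example, round(0.5 * 5) + round(0.5 * 5) is not 5). A slightly more defensive sketch is to round only the training count and take the remainder as the test count. Here df is a hypothetical stand-in with the same number of rows as JFK_weather_clean2, and n_train, n_test, and split_labels are illustrative names.

```r
set.seed(1234)

# Hypothetical stand-in for JFK_weather_clean2 (same number of rows).
df <- data.frame(id = seq_len(4067))

n       <- nrow(df)
n_train <- round(0.8 * n)   # 3254
n_test  <- n - n_train      # 813; the remainder, so the counts always sum to n

# Build the labels with whole-number counts, then shuffle them.
split_labels <- sample(rep(c(0, 1), times = c(n_train, n_test)))

# drop = FALSE keeps the result a data frame even with a single column.
training_set <- df[split_labels == 0, , drop = FALSE]
testing_set  <- df[split_labels == 1, , drop = FALSE]

nrow(training_set) + nrow(testing_set) == n
# [1] TRUE
```

Because the label vector now has exactly nrow(df) elements, the logical subscript matches the size of the indexed input and the subsetting error should no longer occur.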