I've split my data into training and testing sets, but I keep receiving this error:
! Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
x Input has size 4067 but subscript split_data_table == 0 has size 4066.
My data is named "JFK_weather_clean2". To execute the split, I did:
set.seed(1234)
split_data_table <- sample(c(rep(0, 0.8 * nrow(JFK_weather_clean2)), rep(1, 0.2 * nrow(JFK_weather_clean2))))
table(split_data_table)
results:
0 | 1 |
---|---|
3253 | 813 |
From there I tried to create the training set:
training_set <- JFK_weather_clean2[split_data_table == 0, ]
As you have probably noticed, my input data comprises 4,067 rows (a count that includes the header row), whereas the subscript has size 4,066. I am assuming this issue involves the header row, but I don't know what correction to make in my sample() code. Thanks for any help!
CodePudding user response:
The cause of your problem is the times argument of the rep function, which you used to split the data. rep coerces the value passed to times into an integer or double vector, as its documentation explains:
A double vector is accepted, other inputs being coerced to an integer or double vector.
As a result, a non-integer times value is truncated to the largest integer not greater than the input rather than rounded to the nearest one. For example, mtcars has 32 rows, of which 80% is 25.6, but rep treats that as 25, not 26.
0.8 * nrow(mtcars)
# [1] 25.6
length(rep(0, 0.8 * nrow(mtcars)))
# [1] 25
If you apply your code to split mtcars, you will get 31 rows in total, not 32 as expected.
length(c(rep(0, 0.8 * nrow(mtcars)), rep(1, 0.2 * nrow(mtcars))))
# [1] 31
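The same arithmetic accounts for the sizes in the question, so the missing element comes from truncation rather than from the header row (nrow() counts data rows only, not the column names). A minimal sketch, assuming only the row count of 4067 reported in the error message; n and split_vec are illustrative names:

```r
n <- 4067                      # row count reported in the error message

0.8 * n                        # 3253.6, which rep() truncates to 3253
0.2 * n                        # 813.4, which rep() truncates to 813

split_vec <- c(rep(0, 0.8 * n), rep(1, 0.2 * n))
length(split_vec)              # one element short of n
# [1] 4066
```

These counts match the 3253 and 813 shown by table(split_data_table) in the question.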
This truncation in rep is not a problem when both split sizes are whole numbers, as with iris, which has 150 rows, so 80% of it is exactly 120.
length(c(rep(0, 0.8 * nrow(iris)), rep(1, 0.2 * nrow(iris))))
# [1] 150
To get the correct total number of rows, wrap the value passed to the times argument of rep in round.
length(c(rep(0, round(0.8 * nrow(mtcars))), rep(1, round(0.2 * nrow(mtcars)))))
# [1] 32
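Rounding both counts separately works for this 80/20 split. One way to guarantee that the two counts always add up to the number of rows is to round only the training count and take the remainder as the test count. A minimal sketch, assuming a stand-in data frame with 4067 rows in place of JFK_weather_clean2; the names dat, n_train, n_test, and testing_set are illustrative and not from the question:

```r
set.seed(1234)

# Stand-in for JFK_weather_clean2, with the same number of rows (4067).
dat <- data.frame(id = seq_len(4067))

n       <- nrow(dat)
n_train <- round(0.8 * n)   # 3254
n_test  <- n - n_train      # 813; the remainder, so the counts sum to n

# Shuffle the 0/1 labels, exactly as in the question.
split_data_table <- sample(c(rep(0, n_train), rep(1, n_test)))

# drop = FALSE keeps the result a data frame even with a single column.
training_set <- dat[split_data_table == 0, , drop = FALSE]
testing_set  <- dat[split_data_table == 1, , drop = FALSE]

nrow(training_set) + nrow(testing_set) == n
# [1] TRUE
```

Because split_data_table now has exactly nrow(dat) elements, the logical subscript matches the size of the input and the subsetting error should no longer occur.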