I am a noob at programming, sorry if this is a silly question.
My supervisor doesn't seem to trust the set.seed() function in R, because every seed value yields a different output (with different training and test sets). She therefore asked me to specify the row ranges for my training and test datasets explicitly.
I am fitting a binary logistic regression model in R with a sample size of 1790. There are 8 independent variables in my model. I want to do a 70/30 split for training and test data. The first time, I did it using these lines of code:
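RLV <- read.csv(file.choose(), header = TRUE)
set.seed(123)  # fix the random number generator so the split is reproducible
# randomly label each row 1 (about 70%) or 2 (about 30%)
index <- sample(2, nrow(RLV), replace = TRUE, prob = c(0.7, 0.3))
train <- RLV[index == 1, ]
test <- RLV[index == 2, ]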
But if I change 123 to, say, 1234, the output is similar but not identical to the previous one (and yes, I know that's the point). My supervisor wants me to train on the data obtained on Day 1 and Day 2 and to test (validate) on the data from Day 3, which was my initial plan as well.
Thus after intense brainstorming I came up with these lines of code...
RLV <- read.csv(file.choose(), header = TRUE)
train <- RLV[1:1253, ]    # rows 1 to 1253, all columns
test <- RLV[1254:1790, ]  # rows 1254 to 1790, all columns
head(test)
I want all the rows from 1 to 1253 (with all columns) in my training dataset and rows 1254 to 1790 in my test (validation) dataset.
I checked with the head() function and it does seem to work, but I am on the fence here. Can someone please clarify how this works, or whether it's even right (lol)? I just want to complete this project without any hassle.
Thanks a bunch.
CodePudding user response:
As you said: it does work. head() shows you the first six rows of a data frame, so head(test) should return rows 1254, 1255, 1256, 1257, 1258, and 1259 of your 'test set'.
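If you want more reassurance than head() gives, a few extra checks can confirm that the two ranges cover every row exactly once. This is only a sketch: the data frame below is a made-up stand-in for RLV with the same number of rows, and the columns id and x are hypothetical.

```r
# Hypothetical stand-in for RLV: 1790 rows, columns invented for illustration
RLV <- data.frame(id = 1:1790, x = (1:1790) %% 7)

train <- RLV[1:1253, ]
test  <- RLV[1254:1790, ]

nrow(train)               # number of training rows
nrow(test)                # number of test rows
nrow(train) / nrow(RLV)   # share of rows used for training
head(rownames(test))      # row names keep the original row positions

# No row should appear in both sets, and together they should cover all rows
length(intersect(rownames(train), rownames(test)))
nrow(train) + nrow(test) == nrow(RLV)
```

Because the subsets keep the original row names, comparing rownames(train) and rownames(test) is a quick way to see that nothing overlaps.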
It works because when you index a data frame with [,], everything before the comma specifies row restrictions and everything after the comma specifies column restrictions. Leaving the part after the comma empty keeps all columns. You indexed by row number. It would also be possible to index by a logical vector. For example, RLV[RLV$Day %in% 1:2,] would give you all cases from RLV where the (hypothetical) column Day holds the value 1 or 2.
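Here is a rough sketch of what a split by day could look like with logical indexing. It assumes your data has a column recording the day; the column name Day, the outcome column, and the per-day row counts below are all made up so that the toy data has 1790 rows.

```r
# Hypothetical data: the Day column and the per-day counts are assumptions
RLV <- data.frame(
  Day     = rep(1:3, times = c(600, 653, 537)),
  outcome = rep(c(0, 1), length.out = 1790)
)

# Rows where the condition is TRUE are kept; all columns are kept
train <- RLV[RLV$Day %in% 1:2, ]   # Day 1 and Day 2
test  <- RLV[RLV$Day == 3, ]       # Day 3

table(train$Day)   # only days 1 and 2 should appear
table(test$Day)    # only day 3 should appear
```

Compared with hard-coded row numbers, this version does not depend on the rows being sorted by day, and it still works if rows are added or removed later.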
If this doesn't answer your question(s), please specify what you mean by "how this works" and "if it's even right" ;)