How to create a test and train dataset in R by specifying the range in the data set instead of using

Time:08-03

I am a noob at programming, sorry if this is a silly question.

My supervisor doesn't seem to trust the set.seed() function in R, since every seed value yields a different output (with different test and train sets). So she asked me to specify the row range for my training and test datasets.

I am fitting a binary logistic regression model in R with a sample size of 1790. There are 8 independent variables in my model. I want to do a 70/30 split for train and test data. I did it using these lines of code the first time:

RLV <- read.csv(file.choose(), header = TRUE)

set.seed(123)
# Assign each row to group 1 (train) or 2 (test) with probability 0.7 / 0.3
index <- sample(2, nrow(RLV), replace = TRUE, prob = c(0.7, 0.3))
train <- RLV[index == 1, ]
test <- RLV[index == 2, ]

But if I change 123 to, say, 1234, the output is similar but not identical to the previous one (and yes, I know that's the point). My supervisor, however, wants me to train on the data obtained on Day 1 and Day 2 and test (validate) on the data from Day 3 (that was my initial plan as well).

Thus after intense brainstorming I came up with these lines of code...

RLV <- read.csv(file.choose(), header = TRUE)

# Rows 1 to 1253 go to train, rows 1254 to 1790 go to test (all columns kept)
train <- RLV[1:1253, ]
test <- RLV[1254:1790, ]
head(test)

I want all the rows from 1 to 1253 (all columns too) in my train dataset and from 1254 to 1790 in my test(validation) dataset.

I checked using the head function and it does seem to work. But I am on the fence here. Can someone please clarify how this works? Or whether it's even right (lol)? I just want to complete this project without any hassle.

Thanks a bunch.

CodePudding user response:

As you said: It does work. head() shows you the first six rows of a dataframe. So you should get rows 1254, 1255, 1256, 1257, 1258, and 1259 from your 'test set' after head(test).
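To convince yourself, here is a minimal sketch on a made-up data frame with the same number of rows as yours. The columns id and y are placeholders I invented for illustration, not your real variables:

```r
# Hypothetical stand-in for the real data: 1790 rows, two made-up columns.
RLV <- data.frame(
  id = seq_len(1790),
  y  = rep(c(0, 1), length.out = 1790)
)

# Same row-range split as in the question.
train <- RLV[1:1253, ]
test  <- RLV[1254:1790, ]

nrow(train)              # 1253
nrow(test)               # 537
rownames(head(test))     # "1254" "1255" "1256" "1257" "1258" "1259"
nrow(train) / nrow(RLV)  # 0.7
```

The row names of the subset are carried over from the original data frame, which is why head(test) starts at 1254 rather than at 1.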

It works because if you index a dataframe with [,], everything before the comma specifies row restrictions and everything after the comma specifies column restrictions. You indexed by row number. It would also be possible to index by a logical vector. For example, RLV[RLV$Day %in% 1:2,] would give you all cases from RLV where (the hypothetical) column Day holds the value 1 or 2.
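If your file really does have a day column, splitting on it may be safer than hard-coding row numbers, because the row-range version is only correct when the file is sorted by day. A rough sketch, assuming a column called Day with values 1, 2 and 3 (the column name and the per-day counts below are made up for illustration):

```r
# Hypothetical data: 1790 rows with an invented Day column.
# The per-day counts (600, 653, 537) are assumptions, not taken from the real file.
RLV <- data.frame(
  id  = seq_len(1790),
  Day = rep(1:3, times = c(600, 653, 537))
)

# Logical indexing: keep the rows for which the condition is TRUE.
train <- RLV[RLV$Day %in% 1:2, ]  # Day 1 and Day 2
test  <- RLV[RLV$Day == 3, ]      # Day 3 only

# Sanity checks: every row lands in exactly one of the two sets.
nrow(train) + nrow(test) == nrow(RLV)
table(train$Day)
table(test$Day)
```

This version keeps working even if the rows get reordered or the number of observations per day changes, which a fixed range like 1:1253 would not.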

If this doesn't answer your question(s), please specify what you mean by "how this works" and "if it's even right" ;)
