I would like to analyze the data set data(wine), which is available in the R package gclus.
How can I split the data set according to the proportions 70:30 into a training and a test set?
CodePudding user response:
You can split your data like this:
library(gclus)
data("wine")
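# Number of training rows: 70% of the data, rounded down
sample_size <- floor(0.70 * nrow(wine))
# Fix the seed so the random split is reproducible
set.seed(123)
# Draw the training row indices without replacement
train_index <- sample(seq_len(nrow(wine)), size = sample_size)
train <- wine[train_index, ]
# Every row that was not drawn goes to the test set
test <- wine[-train_index, ]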
Checking the sizes of the datasets:
> nrow(wine)
[1] 178
> nrow(train)
[1] 124
> nrow(test)
[1] 54
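If you need to repeat the split, the same steps can be wrapped in a small helper. This is only a sketch: split_train_test and its prop and seed arguments are names made up here, not part of gclus or base R.

```r
library(gclus)
data("wine")

# Hypothetical helper: split a data frame into train/test by row proportion.
split_train_test <- function(df, prop = 0.70, seed = NULL) {
  stopifnot(is.data.frame(df), prop > 0, prop < 1)
  if (!is.null(seed)) set.seed(seed)  # note: this resets the global RNG state
  n <- nrow(df)
  train_index <- sample(seq_len(n), size = floor(prop * n))
  # setdiff() stays correct even if train_index is empty,
  # whereas df[-integer(0), ] would select zero rows.
  test_index <- setdiff(seq_len(n), train_index)
  list(train = df[train_index, , drop = FALSE],
       test  = df[test_index, , drop = FALSE])
}

parts <- split_train_test(wine, prop = 0.70, seed = 123)

# Sanity checks: the two sets are disjoint and together cover every row.
stopifnot(
  length(intersect(rownames(parts$train), rownames(parts$test))) == 0,
  nrow(parts$train) + nrow(parts$test) == nrow(wine)
)
```

parts$train and parts$test have the same 124 and 54 rows counts as above, since the sizes depend only on prop and not on the seed.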
CodePudding user response:
Here is an alternative to @Quinten's very good approach:
First we create an id for each row, then use sample_frac() to draw the training rows, and finally anti_join() the original wine with train_wine to get the test rows:
#install.packages("gclus")
library(gclus)
library(dplyr)
data("wine")
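# Add a unique row id to join on later
wine <- wine %>%
  mutate(id = row_number())
# Fix the seed so the random draw is reproducible
set.seed(123)
train_wine <- wine %>%
  sample_frac(.70)
# Keep the rows of wine whose id is not in train_wine
test_wine <- anti_join(wine, train_wine, by = 'id')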
nrow(train_wine)
nrow(test_wine)
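As a side note, sample_frac() is marked as superseded since dplyr 1.0.0 in favour of slice_sample(prop = ). Below is a minimal sketch of the same idea with slice_sample(), assuming dplyr 1.0.0 or later is installed. The seed value and the wine_id name are arbitrary choices made here for illustration.

```r
library(gclus)
library(dplyr)

data("wine")
set.seed(123)  # assumed seed, only to make the draw repeatable

# Keep the original wine untouched and add the id to a copy
wine_id <- wine %>%
  mutate(id = row_number())

# Draw 70% of the rows without replacement
train_wine <- wine_id %>%
  slice_sample(prop = 0.70)

# Rows whose id was not drawn form the test set
test_wine <- anti_join(wine_id, train_wine, by = "id")

c(train = nrow(train_wine), test = nrow(test_wine))
```

The exact row counts can differ by one between the approaches, depending on how each function rounds 0.70 * nrow(wine), so it is worth checking nrow() on both parts.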