Discriminant analysis splitting data

I would like to analyze the wine data set, which is loaded with data(wine) from the R package gclus. How can I split it into a training set and a test set in a 70:30 ratio?

CodePudding user response:

You can split your data like this:

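library(gclus)
data("wine")
# Number of rows in the training set (70% of the data, rounded down)
sample_size <- floor(0.70 * nrow(wine))
# Fix the seed so the random split is reproducible
set.seed(123)
# Draw the training row indices without replacement
train_index <- sample(seq_len(nrow(wine)), size = sample_size)
train <- wine[train_index, ]
# All remaining rows form the test set
test <- wine[-train_index, ]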

Checking the sizes of the datasets:

> nrow(wine)
[1] 178
> nrow(train)
[1] 124
> nrow(test)
[1] 54
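
If you need the split more than once, the same steps can be wrapped in a small function. This is only a minimal sketch in base R: split_train_test is a made-up name (it is not part of gclus or base R), and the prop and seed arguments are my own assumptions about what is convenient.

```r
# Hypothetical helper (not from any package): random train/test split by row.
split_train_test <- function(data, prop = 0.70, seed = NULL) {
  stopifnot(is.data.frame(data), prop > 0, prop < 1)
  # Note: set.seed() changes the global RNG state for the rest of the session.
  if (!is.null(seed)) set.seed(seed)
  n_train <- floor(prop * nrow(data))
  # Guard against an empty training set, where negative indexing would fail.
  stopifnot(n_train >= 1)
  train_index <- sample(seq_len(nrow(data)), size = n_train)
  list(
    train = data[train_index, , drop = FALSE],
    test  = data[-train_index, , drop = FALSE]
  )
}
```

For the data in the question, split_train_test(wine, prop = 0.70, seed = 123) should give the same 124 and 54 rows as above, since it makes the same set.seed() and sample() calls.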

CodePudding user response:

Here is an alternative to @Quinten's very good approach: first we create an id for each row, then draw the training set with sample_frac(), and finally anti_join() the original wine with train_wine to get the test set:

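#install.packages("gclus")
library(gclus)
library(dplyr)
data("wine")

# Add a unique row id so the sampled rows can be matched later
wine <- wine %>%
  mutate(id = row_number())

# Fix the seed so the random split is reproducible (any seed works)
set.seed(123)

# Randomly draw 70% of the rows for training
train_wine <- wine %>%
  sample_frac(.70)

# The test set is every row whose id is not in the training set
test_wine <- anti_join(wine, train_wine, by = 'id')

# Check the sizes of the two sets
nrow(train_wine)
nrow(test_wine)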
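
Since the goal is discriminant analysis, you may also want each class to appear in roughly the same proportion in both sets. The following is a hedged sketch of such a stratified split, not something the question asked for. It uses slice_sample(), which superseded sample_frac() in dplyr 1.0.0. stratified_split and its arguments are hypothetical names, and I am assuming the class labels of wine are in a column called Class.

```r
library(dplyr)

# Hypothetical helper: sample `prop` of the rows within each class.
stratified_split <- function(data, class_col, prop = 0.70) {
  # Row id used to find the rows that were not sampled
  data_id <- data %>%
    mutate(id = row_number())

  # Sample within each level of the class column
  train <- data_id %>%
    group_by(across(all_of(class_col))) %>%
    slice_sample(prop = prop) %>%
    ungroup()

  # Everything not drawn for training goes to the test set
  test <- anti_join(data_id, train, by = "id")

  list(train = train, test = test)
}
```

With the data from the question you would call set.seed(123) and then stratified_split(wine, "Class", prop = 0.70). Comparing table(parts$train$Class) with table(parts$test$Class) on the returned list shows how the classes are distributed.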