How can I split a dataset into two while ensuring that they are balanced across a column with catego-CodePudding

I'm testing the accuracy of an imputation model using training and test datasets. The model I'm running uses a categorical variable. Unfortunately, when I randomly split the dataset and run a model on the training set, I am unable to estimate a coefficient for some categorical variables which are present in the test dataset. I would like to split the data while ensuring that all categorical variables are present in both the training and test datasets. Is there an easy way to do this in R?

In the simulated data below, this would require the same sets of letters to be present in both datasets, so that I can test the accuracy of the model in the test dataset.


chars<-c("A","B","C","D")

complete_data<-data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))

In my dataset, the problem is a little trickier as some of the categorical variables are extremely scarce.

EDIT:

Thanks for the responses. I ended up looking up stratified sampling as Antimon suggested and came across the caret package which apparently works as well.

library(caret)
train.index <- createDataPartition(complete_data$v3, p = .7, list = FALSE)
train <- complete_data[ train.index,]
test  <- complete_data[-train.index,]

CodePudding user response：

There are several ways of doing this. You need to divide your data by v3 and then split each group randomly:

chars <- c("A","B","C","D")
complete_data <- data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))

Now we'll use the by() function to split the data into groups by v3 and draw a random sample of half of the rownames in each group:

test <- as.numeric(unlist(by(complete_data, complete_data$v3, function(x) sample(rownames(x), length(rownames(x))/2))))
train_test <- rep("train", nrow(complete_data))
train_test[test] <- "test"
table(complete_data$v3, train_test)
#    train_test
#     test train
#   A   11    12
#   B   12    13
#   C   13    14
#   D   12    13

Now complete_data[train_test=="train", ] is your training set and complete_data[train_test=="test", ] is your test set.

CodePudding user response：

This can be achieved quite simply.

library(tidyverse)

chars<-c("A","B","C","D")

complete_data <- tibble(v1=rnorm(100,2,100), 
                        v2=rnorm(100,1,100), 
                        v3=sample(chars, 100, replace=TRUE))


propCategory = function(data, category, prop){
  category = enquo(category)
  cat1 = data %>% pull(!!category)
  unlist(sapply(as.list(unique(cat1)), function(x) {sample(which(cat1==x), sum(cat1==x)*prop)}))
}
complete_data %>% propCategory(v3, .2)

output

 [1]  98  35  20  78  40  70  87   3  86  38  22 100  80  93  47   5  24  29  26

As you can see, my propCategory function returns the axial indexes. But let's check if they contain what you need. First, let's check the training indexes.

train = complete_data %>% propCategory(v3, .75)

complete_data[train,] %>% distinct(v3)
complete_data[train,] %>% nrow()

output

> complete_data[train,] %>% distinct(v3)
# A tibble: 4 x 1
  v3   
  <chr>
1 B    
2 A    
3 D    
4 C    
> complete_data[train,] %>% nrow()
[1] 74

Now it's time for the test indexes.

complete_data[-train,] %>% distinct(v3)
complete_data[-train,] %>% nrow()

output

> complete_data[-train,] %>% distinct(v3)
# A tibble: 4 x 1
  v3   
  <chr>
1 B    
2 A    
3 D    
4 C    
> complete_data[-train,] %>% nrow()
[1] 26

As you can see, both the training and test data include each of your categories.

A little note about the prop parameter. My propCategory function was written in such a way that for each value from the variable category it returns the number of randomly selected indices with prop * (the number of saved values of the categorical variable).

Take a good look at the results below.

complete_data %>% group_by(v3) %>% 
  summarise(n = n(), prop = n()/nrow(.))

complete_data[train,] %>% group_by(v3) %>% 
  summarise(n = n(), prop = n()/nrow(.))

complete_data[-train,] %>% group_by(v3) %>% 
  summarise(n = n(), prop = n()/nrow(.))

output

> complete_data %>% group_by(v3) %>% 
    summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
  v3        n  prop
  <chr> <int> <dbl>
1 A        26  0.26
2 B        35  0.35
3 C        24  0.24
4 D        15  0.15
> complete_data[train,] %>% group_by(v3) %>% 
    summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
  v3        n  prop
  <chr> <int> <dbl>
1 A        19 0.257
2 B        26 0.351
3 C        18 0.243
4 D        11 0.149
> complete_data[-train,] %>% group_by(v3) %>% 
    summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
  v3        n  prop
  <chr> <int> <dbl>
1 A         7 0.269
2 B         9 0.346
3 C         6 0.231
4 D         4 0.154