I'm testing the accuracy of an imputation model using training and test datasets. The model I'm running uses a categorical variable. Unfortunately, when I randomly split the dataset and run a model on the training set, I am unable to estimate a coefficient for some categorical variables which are present in the test dataset. I would like to split the data while ensuring that all categorical variables are present in both the training and test datasets. Is there an easy way to do this in R?
In the simulated data below, this would require the same sets of letters to be present in both datasets, so that I can test the accuracy of the model in the test dataset.
chars<-c("A","B","C","D")
complete_data<-data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
In my dataset, the problem is a little trickier as some of the categorical variables are extremely scarce.
EDIT:
Thanks for the responses. I ended up looking up stratified sampling as Antimon suggested and came across the caret package which apparently works as well.
library(caret)
train.index <- createDataPartition(complete_data$v3, p = .7, list = FALSE)
train <- complete_data[ train.index,]
test <- complete_data[-train.index,]
CodePudding user response:
There are several ways of doing this. You need to divide your data by v3
and then split each group randomly:
chars <- c("A","B","C","D")
complete_data <- data.frame(v1=rnorm(100,2,100), v2=rnorm(100,1,100), v3=sample(chars, 100, replace=TRUE))
Now we'll use the by()
function to split the data into groups by v3
and draw a random sample of half of the rownames in each group:
test <- as.numeric(unlist(by(complete_data, complete_data$v3, function(x) sample(rownames(x), length(rownames(x))/2))))
train_test <- rep("train", nrow(complete_data))
train_test[test] <- "test"
table(complete_data$v3, train_test)
# train_test
# test train
# A 11 12
# B 12 13
# C 13 14
# D 12 13
Now complete_data[train_test=="train", ]
is your training set and complete_data[train_test=="test", ]
is your test set.
CodePudding user response:
This can be achieved quite simply.
library(tidyverse)
chars<-c("A","B","C","D")
complete_data <- tibble(v1=rnorm(100,2,100),
v2=rnorm(100,1,100),
v3=sample(chars, 100, replace=TRUE))
propCategory = function(data, category, prop){
category = enquo(category)
cat1 = data %>% pull(!!category)
unlist(sapply(as.list(unique(cat1)), function(x) {sample(which(cat1==x), sum(cat1==x)*prop)}))
}
complete_data %>% propCategory(v3, .2)
output
[1] 98 35 20 78 40 70 87 3 86 38 22 100 80 93 47 5 24 29 26
As you can see, my propCategory
function returns the axial indexes. But let's check if they contain what you need.
First, let's check the training indexes.
train = complete_data %>% propCategory(v3, .75)
complete_data[train,] %>% distinct(v3)
complete_data[train,] %>% nrow()
output
> complete_data[train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[train,] %>% nrow()
[1] 74
Now it's time for the test indexes.
complete_data[-train,] %>% distinct(v3)
complete_data[-train,] %>% nrow()
output
> complete_data[-train,] %>% distinct(v3)
# A tibble: 4 x 1
v3
<chr>
1 B
2 A
3 D
4 C
> complete_data[-train,] %>% nrow()
[1] 26
As you can see, both the training and test data include each of your categories.
A little note about the prop
parameter.
My propCategory
function was written in such a way that for each value from the variable category it returns the number of randomly selected indices with prop * (the number of saved values of the categorical variable).
Take a good look at the results below.
complete_data %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
complete_data[-train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
output
> complete_data %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 26 0.26
2 B 35 0.35
3 C 24 0.24
4 D 15 0.15
> complete_data[train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 19 0.257
2 B 26 0.351
3 C 18 0.243
4 D 11 0.149
> complete_data[-train,] %>% group_by(v3) %>%
summarise(n = n(), prop = n()/nrow(.))
# A tibble: 4 x 3
v3 n prop
<chr> <int> <dbl>
1 A 7 0.269
2 B 9 0.346
3 C 6 0.231
4 D 4 0.154