Randomly subsampling a dataframe without replacement in a specific column with R

I have a dataframe with this structure:

> df
factor  y  x
1       2  0
1       3  0
1       1  0
1       2  0
2       3  0
2       1  0
2       3  1
3       4  1
3       3  1 
3       6  3
3       5  2
4       4  1
4       7  8
4       2  1
2       5  3

In the actual dataset, I have 200 rows and different variables: several continuous variables and a factor variable with 70 levels with up to 4 observations each.

I would like to randomly split my entire dataframe into 4 groups of equal size, sampling without replacement within each group with respect to the factor variable only. In other words, each level of the factor variable should occur no more than once per group.

I've tried different solutions. For instance, I tried by sampling the "factor" variable into four groups without replacements as follows:

factor1 <- as.character(df$factor)

set.seed(123)
group1 <- sample(factor1, 35, replace = FALSE)

factor2 <- setdiff(factor1, group1) 
group2 <- sample(factor2, 35,replace = FALSE) 

# and the same for "group3" and "group4"

but then I don't know how to associate the group vectors (group1, group2, etc.) to the other variables in my df ('x' and 'y').

I've also tried with:

group1 <- sample_n(df, 35, replace = FALSE)

but this solution fails as well since my dataframe doesn't include duplicated rows. The only duplicated values are in the factor variable.

Finally, I tried to use the solution proposed in reply to a similar question here, adapted to my case:

random.groups <- function(n.items = 200L, n.groups = 4L,
                          factor = rep(1L, n.items)) {

  splitted.items  <- split(seq.int(n.items), factor)

  shuffled <- lapply(splitted.items, sample)

  1L + (order(unlist(shuffled)) %% n.groups)
}

df$groups <- random.groups(nrow(df), n.groups = 4)

However, the resulting 4 groups include duplicated values for the factor variable, so something is not working properly.

I would really appreciate any idea or suggestion to solve this problem!

CodePudding user response：

A data.table solution demonstrated with a slightly larger dataset:

library(data.table)

dt <- setorder(data.table(factor = sample(1:10, 44, TRUE), x = runif(44), y = runif(44)), factor)
numGroups <- 4L
maxFactor <- max(dt$factor)
dt2 <- setorder(
        setorder(
          dt[sample(1:.N, .N)], # randomly reorder the data
          factor                # sort by factor
        )[, temp := cumsum(.I > 0), by = factor] # create a column to count the occurrence of each factor
        [temp <= numGroups]                              # remove rows that can't go in a group due to factor exclusion
        [sample(1:.N, .N) <= (.N %/% numGroups)*numGroups]       # randomly remove excess rows (keep the group sizes equal)
        [, grp := c(replicate(.N/numGroups, sample(1:numGroups, numGroups)))], # randomly assign each row a group
        grp # sort by group for table readability
        )[, temp := NULL] # remove the temporary column
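
Whichever approach is used, it may be worth checking the result against the requirement in the question (each factor level at most once per group). A minimal base R sketch, where `check_groups` is a hypothetical helper name and the small vectors are made-up illustration data:

```r
# Hypothetical helper: TRUE if no factor level occurs more than once in any group
check_groups <- function(factor, group) {
  counts <- table(factor, group)   # rows: factor levels, columns: groups
  all(counts <= 1L)
}

# Made-up illustration data
f          <- c(1, 1, 2, 2, 3)
ok_groups  <- c(1, 2, 1, 2, 3)   # every level appears at most once per group
bad_groups <- c(1, 1, 2, 3, 3)   # level 1 appears twice in group 1

check_groups(f, ok_groups)
check_groups(f, bad_groups)
```

For the data.table result above, the call would be `check_groups(dt2$factor, dt2$grp)`, and `table(dt2$grp)` shows the group sizes.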

CodePudding user response：

One way is to group by factor, create a variable holding the number of rows in each factor level, and arrange by that size and by factor. Then you assign a group to each first, second, third and fourth row in turn. You can then filter on this group variable.

library(dplyr)

# Example data with the same factor column as in the question
df <- tibble(factor = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,2),
             x = floor(runif(15, min = 0, max = 20)),
             y = floor(runif(15, min = 211, max = 305)))

df <- df %>%
  group_by(factor) %>%
  mutate(size = n()) %>%             # number of rows in each factor level
  arrange(desc(size), factor) %>%    # largest levels first, rows of a level adjacent
  ungroup() %>%
  # cycle through A, B, C, D so that up to 4 adjacent rows never share a group
  mutate(group = c("A", "B", "C", "D")[(row_number() - 1) %% 4 + 1])
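
The same cycling idea can be combined with a random shuffle, so that which row lands in which group is not fixed by the original row order. The following is only a base R sketch under the assumption that no factor level has more rows than there are groups; the names `n_groups`, `shuffled` and `level_order` and the seed are illustrative choices, not part of the answer above:

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Example data copied from the question
df <- data.frame(
  factor = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 2),
  y      = c(2, 3, 1, 2, 3, 1, 3, 4, 3, 6, 5, 4, 7, 2, 5),
  x      = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 2, 1, 8, 1, 3)
)
n_groups <- 4L

# The cycling trick only works if no level has more rows than there are groups
stopifnot(max(table(df$factor)) <= n_groups)

# Shuffle the rows, then put the levels in a random order.
# order() keeps ties in their current (shuffled) order, so rows of the same
# level end up adjacent but randomly arranged within the level.
shuffled    <- df[sample(nrow(df)), ]
level_order <- sample(unique(df$factor))
shuffled    <- shuffled[order(match(shuffled$factor, level_order)), ]

# Cycle 1, 2, 3, 4, 1, 2, ...: any n_groups adjacent rows get distinct labels
shuffled$group <- (seq_len(nrow(shuffled)) - 1L) %% n_groups + 1L

table(shuffled$factor, shuffled$group)  # every cell should be 0 or 1
table(shuffled$group)                   # group sizes differ by at most one
```

Because the labels simply cycle over the sorted rows, the groups are equal in size only when the number of rows is a multiple of the number of groups; otherwise the sizes differ by one. A single group can then be extracted with, for example, `shuffled[shuffled$group == 1, ]`.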