I have a dataframe with this structure:
> df
factor y x
1 2 0
1 3 0
1 1 0
1 2 0
2 3 0
2 1 0
2 3 1
3 4 1
3 3 1
3 6 3
3 5 2
4 4 1
4 7 8
4 2 1
2 5 3
In the actual dataset, I have 200 rows and different variables: several continuous variables and a factor variable with 70 levels with up to 4 observations each.
I would like to randomly subsample my entire dataframe into 4 groups of equal size without replacements within each group exclusively in the factor variable. In other words, I would like to have each level of the factor variable occurring not more than once per group.
I've tried different solutions. For instance, I tried by sampling the "factor" variable into four groups without replacements as follows:
factor1 <- as.character(df$factor)
set.seed(123)
group1 <- sample(factor, 35,replace = FALSE)
factor2 <- setdiff(factor1, group1)
group2 <- sample(factor2, 35,replace = FALSE)
# and the same for "group3" and "group4"
but then I don't know how to associate the group vectors (group1, group2, etc.) to the other variables in my df ('x' and 'y').
I've also tried with:
group1 <- sample_n(df, 35, replace = FALSE)
but this solution fails as well since my dataframe doesn't include duplicated rows. The only duplicated values are in the factor variable.
Finally, I tried to use the solution proposed in reply to a similar question here, adapted to my case:
random.groups <- function(n.items = 200L, n.groups = 4L,
factor = rep(1L, n.items)) {
splitted.items <- split(seq.int(n.items), factor)
shuffled <- lapply(splitted.items, sample)
1L (order(unlist(shuffled)) %% n.groups)
}
df$groups <- random.groups(nrow(df), n.groups = 4)
However, the resulting 4 groups include duplicated values for the factor variable, so something is not working properly.
I would really appreciate any idea or suggestion to solve this problem!
CodePudding user response:
A data.table
solution demonstrated with a slightly larger dataset:
library(data.table)
dt <- setorder(data.table(factor = sample(1:10, 44, TRUE), x = runif(44), y = runif(44)), factor)
numGroups <- 4L
maxFactor <- max(dt$factor)
dt2 <- setorder(
setorder(
dt[sample(1:.N, .N)], # randomly reorder the data
factor # sort by factor
)[, temp := cumsum(.I > 0), by = factor] # create a column to count the occurrence of each factor
[temp <= numGroups] # remove rows that can't go in a group due to factor exclusion
[sample(1:.N, .N) <= (.N %/% numGroups)*numGroups] # randomly remove excess rows (keep the group sizes equal)
[, grp := c(replicate(.N/numGroups, sample(1:numGroups, numGroups)))], # randomly assign each row a group
grp # sort by group for table readability
)[, temp := NULL] # remove the temporary column
CodePudding user response:
One way is to group by factor, create a variable of factro's length, arrange by size and length. At the end, you assign a group to each first, second, third and fourth row. You can then filter out using this variable.
library(dplyr)
df <- data_frame(factor = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,2),
x = floor(runif(15, min=0, max=20)),
y = floor(runif(15, min=211, max=305)))
df <- df %>% group_by(factor) %>% mutate(size = length(factor)) %>% arrange(desc(size), factor) %>%
ungroup() %>% mutate(group = ifelse(row_number() %% 4 == 1, "A",
ifelse(row_number() %% 4 == 2, "B",
ifelse(row_number() %% 4 == 3, "C", "D"))))