New column with random boolean values while controlling the ratio of TRUE/FALSE per category

In R I've got a dataset like this one:

df <- data.frame(
  ID = c(1:30),
  x1 = seq(0, 1, length.out = 30),
  x2 = seq(100, 3000, length.out = 30),
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

Now I want to add a new column with randomized boolean values, but within each category the number of TRUE and FALSE values should be equal (in the data frame above, that means 5 TRUEs and 5 FALSEs in each of the 3 categories). How can I do this?

CodePudding user response：

You can sample a vector of TRUE and FALSE values without replacement, which gives you a randomized but balanced set of values for a column in your data frame. For a single category of 10 rows:

sample(rep(c(TRUE, FALSE), each = 5), 10, replace = FALSE)
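The line above shuffles the values for one group of 10 rows. To apply the same idea to every category at once, here is a minimal base R sketch using ave(). It assumes a data frame shaped like the one in the question; the column name new_col is only an illustrative choice.

```r
df <- data.frame(
  ID = 1:30,
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(42)  # only so the shuffle is reproducible

# ave() splits its first argument by category, applies FUN to each piece,
# and writes the results back in the original row order.
df$new_col <- ave(
  logical(nrow(df)), df$category,
  FUN = function(x) sample(rep(c(TRUE, FALSE), length.out = length(x)))
)

# Number of TRUE values per category
tapply(df$new_col, df$category, sum)
```

Starting from a logical vector keeps the result logical, so no conversion from character or integer is needed afterwards.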

CodePudding user response：

Based on Yacine Hajji's answer:

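addRandomBool <- function(df, p){
  
  # number of TRUE values for this group, rounded up
  n <- ceiling(nrow(df) * p)
  # shuffle n TRUEs and the remaining FALSEs (logical, not character)
  df$bool <- sample(rep(c(TRUE, FALSE), times = c(n, nrow(df) - n)))
  
  df
}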

Reduce(rbind, lapply(split(df, df$category), addRandomBool, p = 0.5))

where the parameter p determines the proportion of TRUE values in each category (the resulting count is rounded up).
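Because n is computed with ceiling(), a p that does not divide the group size evenly is rounded in favour of TRUE. A small sketch to illustrate, assuming a logical (not character) version of the function above and groups of 10 rows:

```r
df <- data.frame(
  ID = 1:30,
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

# Same idea as addRandomBool above, restated so the snippet runs on its own
addRandomBool <- function(df, p) {
  n <- ceiling(nrow(df) * p)  # number of TRUE values, rounded up
  df$bool <- sample(rep(c(TRUE, FALSE), times = c(n, nrow(df) - n)))
  df
}

set.seed(42)  # only so the shuffle is reproducible
res <- Reduce(rbind, lapply(split(df, df$category), addRandomBool, p = 0.25))

# 10 * 0.25 = 2.5, so each category gets ceiling(2.5) = 3 TRUE values
tapply(res$bool, res$category, sum)
```

If rounding down is preferred, floor() could be swapped in for ceiling().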

CodePudding user response：

This samples within each group from a vector of 5 TRUE and 5 FALSE values without replacement. It assumes that there are always exactly 10 records per group.

library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1:30),
  x1 = seq(0, 1, length.out = 30),
  x2 = seq(100, 3000, length.out = 30),
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(pi)

df %>% 
  group_by(category) %>% 
  nest() %>% 
  mutate(data = lapply(data, 
                       function(df){ # Function to sample and assign the new_col
                         df$new_col <- sample(rep(c(FALSE, TRUE), 
                                                  each = 5), 
                                              size = 10, 
                                              replace = FALSE)
                         df
                       })) %>% 
  unnest(cols = "data")

This next example is a little more generalized, but still assumes an (approximately) even split of TRUE and FALSE within each group. It can accommodate variable group sizes, and even groups with an odd number of records (in which case the group gets one more FALSE than TRUE).

library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1:30),
  x1 = seq(0, 1, length.out = 30),
  x2 = seq(100, 3000, length.out = 30),
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(pi)

df %>% 
  group_by(category) %>% 
  nest() %>% 
  mutate(data = lapply(data, 
                       function(df){
                         df$new_col <- sample(rep(c(FALSE, TRUE), 
                                                  length.out = nrow(df)), 
                                              size = nrow(df), 
                                              replace = FALSE)
                         df
                       })) %>% 
  unnest(cols = "data")
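The behaviour for odd group sizes follows from how rep() recycles its input when given length.out. A quick check, using a hypothetical group of 5 rows:

```r
# rep() cycles through FALSE, TRUE and stops after length.out elements,
# so an odd length ends on FALSE and yields one extra FALSE.
pool <- rep(c(FALSE, TRUE), length.out = 5)
pool

n_true  <- sum(pool)   # count of TRUE values in the pool
n_false <- sum(!pool)  # count of FALSE values in the pool

# Swapping the order to c(TRUE, FALSE) gives the extra value to TRUE instead.
pool_true_first <- rep(c(TRUE, FALSE), length.out = 5)
```

sample() only shuffles this pool, so the counts per group are fixed before any randomness is involved.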

Maintaining Column Order

A couple of options to maintain the column order:

First, you can save the column order before the group_by() / nest() step, and then use select() to restore that order when you're done.

set.seed(pi)

orig_col <- names(df)  # original column order

df %>% 
  group_by(category) %>% 
  nest() %>% 
  mutate(data = lapply(data, 
                       function(df){
                         df$new_col <- sample(rep(c(FALSE, TRUE), 
                                                  length.out = nrow(df)), 
                                              size = nrow(df), 
                                              replace = FALSE)
                         df
                       })) %>% 
  unnest(cols = "data") %>% 
  select(all_of(c(orig_col, "new_col")))   # Restore the column order

Or you can use a base R solution that doesn't change the column order in the first place:

df <- split(df, df["category"])
df <- lapply(df, 
             function(df){
               df$new_col <- sample(rep(c(FALSE, TRUE), 
                                        length.out = nrow(df)), 
                                    size = nrow(df), 
                                    replace = FALSE)
               df
             })
do.call("rbind", c(df, list(make.row.names = FALSE)))

There are likely a dozen other ways to do this, and probably more efficient ways that I'm not thinking of.
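One of those other ways, as a sketch: a grouped mutate() can do the sampling directly, without nest() and unnest(), and it leaves the existing columns where they are. This assumes dplyr is installed and the same roughly 50/50 split as above; the name res is only illustrative.

```r
library(dplyr)

df <- data.frame(
  ID = c(1:30),
  x1 = seq(0, 1, length.out = 30),
  x2 = seq(100, 3000, length.out = 30),
  category = gl(3, 10, labels = c("NEGATIVE", "NEUTRAL", "POSITIVE"))
)

set.seed(42)  # only so the shuffle is reproducible

res <- df %>%
  group_by(category) %>%
  # n() is the size of the current group, so this works for any group size
  mutate(new_col = sample(rep(c(FALSE, TRUE), length.out = n()))) %>%
  ungroup()

# TRUE/FALSE counts per category
count(res, category, new_col)
```

The new column is appended at the end, so no extra step is needed to restore the column order.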