How to limit the size of the partition per class in R (starting from an imbalanced dataset)

I'd like to partition an imbalanced dataset. It has 7 classes, some with 10,000 samples and some with only 500. I'd like to cap the data frame at roughly 500 observations per class (500 observations * 7 classes = 3,500 rows), so that every class has around 500 observations instead of some having 10,000, others 2,000, and so on.

Currently, I'm doing the partition with the caret package like this:

library(caret)

# Stratified 70/30 split on the class label
index <- createDataPartition(snv_data$clade, p = .70, list = FALSE)
train <- snv_data[index, ]
test <- snv_data[-index, ]

How could I make it so that the new data frame has a somewhat even distribution of observations per class?

Thank you in advance!

Edit:

I've come up with a not so "desirable" solution, but it worked as intended:

# i = 1
# 
# 
# for (row in 1:nrow(snv_data)) {
#   
#   if (snv_data$clade[row] == "20I (Alpha, V1)" ) {
#     
#     snv_data = snv_data[-c(row),]
#     i = i + 1
#     if (i == 4600) {
#       stop("end")
#     }
# 
#   }
#   
# }
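
For reference, the same trimming can be done without a row-by-row loop, which also avoids the index shifting that happens when rows are deleted in the middle of an iteration. This is only a minimal sketch on a toy stand-in for snv_data: the toy data frame, the helper name drop_first_n, and the count of 4 are all hypothetical.

```r
# Toy stand-in for snv_data (hypothetical values, for illustration only)
toy <- data.frame(
  clade = c(rep("20I (Alpha, V1)", 6), rep("other", 3)),
  value = 1:9
)

# Drop the first `n_drop` rows that belong to class `target`, in one step
drop_first_n <- function(df, target, n_drop) {
  hits <- which(df$clade == target)     # row positions of the target class
  to_drop <- head(hits, n_drop)         # at most n_drop of them
  if (length(to_drop) == 0) return(df)  # nothing to remove
  df[-to_drop, , drop = FALSE]
}

trimmed <- drop_first_n(toy, "20I (Alpha, V1)", 4)
table(trimmed$clade)
```

Because the rows to drop are computed once up front, no row is skipped and the result does not depend on the order of deletion.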

CodePudding user response:

Consider a base R solution using by() (an object-oriented wrapper for tapply() that works on data frames) to run an operation on each subset of rows split by a factor:

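# SAMPLE ROWS OF DATA FRAME (DEFAULTS TO 500 OBS)
# CLASSES WITH FEWER THAN n ROWS ARE KEPT WHOLE INSTEAD OF RAISING AN ERROR
run_sample <- function(sub, n = 500) {
    obs <- sample.int(nrow(sub), min(n, nrow(sub)))
    return(sub[obs, , drop = FALSE])
}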

# LIST OF DATA FRAMES (EACH 500 OBS)
class_dfs <- by(snv_data, snv_data$clade, run_sample)

# COMPILED DATA FRAME (500 x NUM OF CLASSES)
balanced_data <- do.call(rbind, unname(class_dfs))
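
To check the approach end to end, here is a small self-contained sketch on synthetic data. The data frame toy_data, the class sizes, the seed, the helper name cap_sample, and the cap of 5 rows per class are all made up for illustration. It caps each class at n rows and leaves smaller classes untouched rather than failing.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic, imbalanced stand-in for snv_data (hypothetical sizes)
toy_data <- data.frame(
  clade = rep(c("A", "B", "C"), times = c(20, 8, 3)),
  x = rnorm(31)
)

# Sample at most n rows from one class subset
cap_sample <- function(sub, n = 5) {
  sub[sample.int(nrow(sub), min(n, nrow(sub))), , drop = FALSE]
}

# Split by class, sample each piece, then stack the pieces back together
pieces <- by(toy_data, toy_data$clade, cap_sample)
balanced_toy <- do.call(rbind, unname(pieces))

table(balanced_toy$clade)  # A and B capped at 5; C keeps its 3 rows
```

The balanced data frame can then be passed to createDataPartition as in the question, so the 70/30 split is drawn from classes of similar size.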