I'd like to partition an imbalanced dataset. My current dataset has 7 classes, some with 10,000 samples and some with 500. I'd like to limit the data frame to about 500 observations per class (500 observations * 7 classes), so that all classes have around 500 observations instead of some having 10,000, others 2,000, etc.
Currently, I'm doing the partition with the caret package like this:
library(caret)
index <- createDataPartition(snv_data$clade, p = .70, list = FALSE)
train <- snv_data[index,]
test <- snv_data[-index,]
How could I make it so that the new data frame has a somewhat even distribution of observations per class?
Thank you in advance!
Edit:
I've come up with a not so "desirable" solution, but it worked as intended:
# i = 1
#
#
# for (row in 1:nrow(snv_data)) {
#
# if (snv_data$clade[row] == "20I (Alpha, V1)" ) {
#
# snv_data = snv_data[-c(row),]
# i = i + 1
# if (i == 4600) {
# stop("end")
# }
#
# }
#
# }
Answer:
Consider a base R solution using by (a wrapper to tapply) to run operations across factor-split subsets:
# SAMPLE ROWS OF DATA FRAME (DEFAULTS TO 500 OBS)
run_sample <- function(sub, n = 500) {
  # cap at the class size so classes with fewer than n rows do not raise an error
  obs <- sample.int(nrow(sub), min(n, nrow(sub)))
  return(sub[obs, ])
}
# LIST OF DATA FRAMES (EACH UP TO 500 OBS)
class_dfs <- by(snv_data, snv_data$clade, run_sample)
# COMPILED DATA FRAME (UP TO 500 x NUM OF CLASSES)
balanced_data <- do.call(rbind, unname(class_dfs))
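To check the result, here is a minimal sketch on made-up data. The toy snv_data below, its class labels "A", "B", "C", its value column, and its class sizes are assumptions for illustration only and do not come from the real dataset. It caps each class at 500 rows and compares the class counts before and after:

```r
set.seed(42)  # fixed seed so the random draw is reproducible

# Hypothetical toy data standing in for snv_data: three classes of unequal size
snv_data <- data.frame(
  clade = factor(rep(c("A", "B", "C"), times = c(10000, 2000, 500))),
  value = rnorm(12500)
)

# Sample up to n rows from one class subset
run_sample <- function(sub, n = 500) {
  obs <- sample.int(nrow(sub), min(n, nrow(sub)))
  sub[obs, , drop = FALSE]
}

# One sampled data frame per class, then stacked into a single data frame
class_dfs <- by(snv_data, snv_data$clade, run_sample)
balanced_data <- do.call(rbind, unname(class_dfs))

# Class counts before and after balancing
table(snv_data$clade)
table(balanced_data$clade)
```

The balanced data frame can then be passed to createDataPartition in place of snv_data, as in the question, to get the train/test split.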