I would like to build a new dataframe from an existing one: keep every row that contains a particular value (in this case "setosa"), then randomly sample from the rows that don't contain "setosa" until the new dataframe has a 50/50 split between "setosa" rows and non-"setosa" rows.
Here is what I have so far.
data(iris)
dataset_new <- iris %>%
ifelse(Species == "setosa", rbind()?, nrow()?)
CodePudding user response:
You can subset the rows that are setosa, and sample rows from the non-setosa species, with a sample size equal to the number of setosa rows.
#setosa rows
seto <- iris[iris$Species == "setosa",]
#random subsample of non-setosa rows of size n("setosa")
nonseto <- iris[sample(which(iris$Species != "setosa"), size = length(which(iris$Species == "setosa"))),]
dataset_new <- rbind(seto, nonseto)
CodePudding user response:
A one-liner using base R only:
iris[c(which(iris$Species == "setosa"),
sample(which(iris$Species != "setosa"),
size = sum(iris$Species == "setosa"))), ]
Explanation: c() concatenates two sets of row numbers: every row whose Species value equals "setosa", and a random sample of the rows that have Species != "setosa". The sample has the same size as the count of rows with Species == "setosa" and is drawn without replacement, which is the default for sample().
CodePudding user response:
A solution using {dplyr}:
library(dplyr)
df_setosa <- iris %>%
filter(Species == "setosa")
dataset_new <- iris %>%
filter(Species != "setosa") %>%
slice_sample(n = nrow(df_setosa)) %>%
bind_rows(df_setosa)
# check results
dataset_new %>%
count(Species == "setosa")
# Species == "setosa" n
# 1 FALSE 50
# 2 TRUE 50