How do you append rows from an existing dataframe to a new dataframe based on a column value?

I would like to append rows from an existing dataframe to a new dataframe when they contain a particular value (in this case "setosa"). From the rows that don't contain "setosa", I would like to randomly sample enough rows to get a 50/50 split between rows with "setosa" and rows with something other than "setosa".

Here is my start so far.

data(iris)

dataset_new <- iris %>%
  ifelse(Species == "setosa", rbind()?, nrow()?)

CodePudding user response:

You can subset the rows that are setosa, then randomly sample the same number of rows from the non-setosa species.

# setosa rows
seto <- iris[iris$Species == "setosa", ]

# random subsample of non-setosa rows, as many as there are setosa rows
nonseto <- iris[sample(which(iris$Species != "setosa"), size = nrow(seto)), ]

dataset_new <- rbind(seto, nonseto)
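Because the non-setosa rows are drawn at random, the result changes on every run. The sketch below repeats the subset-and-sample approach with a fixed seed and then checks the split. The seed value 42 and the name `is_setosa` are assumptions added for illustration, not part of the original answer.

```r
set.seed(42)  # assumption: any fixed seed makes the random draw repeatable

is_setosa <- iris$Species == "setosa"

# all setosa rows, plus an equally sized random sample of the remaining rows
seto <- iris[is_setosa, ]
nonseto <- iris[sample(which(!is_setosa), size = nrow(seto)), ]
dataset_new <- rbind(seto, nonseto)

# count setosa vs. non-setosa rows in the result
table(dataset_new$Species == "setosa")
```

With the built-in iris data, which has 50 rows per species, both counts should be 50.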

CodePudding user response:

One-liner solution with base R only:

iris[c(which(iris$Species == "setosa"), 
       sample(which(iris$Species != "setosa"),
              size = sum(iris$Species == "setosa"))), ]

Explanation: c() concatenates two sets of row numbers: every row whose Species equals "setosa", and a random sample of the rows where Species != "setosa". The sample has the same size as the number of "setosa" rows and is drawn without replacement, which is the default for sample().
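The same indexing idea can be wrapped in a small function for other data frames, columns, or values. This is only a sketch: `balance_by_value` and its arguments are hypothetical names, not part of the answer above. It samples positions with sample.int() rather than calling sample() on the row numbers directly, because sample() treats a single number x as the range 1:x, which would pick wrong rows if only one non-matching row existed.

```r
# Hypothetical helper (not from the original answers): keep every row where
# `column` equals `value`, plus an equally sized random sample of the other rows.
balance_by_value <- function(df, column, value) {
  match_idx <- which(df[[column]] == value)
  other_idx <- which(df[[column]] != value)
  n <- length(match_idx)

  if (n > length(other_idx)) {
    stop("not enough non-matching rows to sample without replacement")
  }

  # sample positions within other_idx, then map them back to row numbers
  picked <- other_idx[sample.int(length(other_idx), size = n)]
  df[c(match_idx, picked), ]
}

balanced <- balance_by_value(iris, "Species", "versicolor")
table(balanced$Species == "versicolor")
```

The function stops with an error when there are fewer non-matching rows than matching ones, since a 50/50 split without replacement is impossible in that case.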

CodePudding user response:

A solution using {dplyr}:

library(dplyr)

df_setosa <- iris %>%
  filter(Species == "setosa")
  
dataset_new <- iris %>%
  filter(Species != "setosa") %>%
  slice_sample(n = nrow(df_setosa)) %>%
  bind_rows(df_setosa)

# check results
dataset_new %>%
  count(Species == "setosa")
#   Species == "setosa"  n
# 1               FALSE 50
# 2                TRUE 50