Hi, I'm trying to match two data frames. I have a large data frame with a million observations, and another data frame with an ID variable and the size of the random sample to draw for each ID.

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura")
Gender <- c("male", "male", "female", "male", "female", "male", "female")

bigdf <- data.frame(Name, Gender)

ID <- c("male", "female")
samplesize <- c(1,2)
sampledf <- data.frame(ID, samplesize)

So, what I want is to match both data frames and get the following outcome (for example):

Name	Gender
Ben	male
Laura	female
Maria	female

I tried to create a function like

j <- function(x,y){
output<- filter(bigdf, Gender==x) %>% sample_n(y)
}
mapply(j, sampledf$Gender, sampledf$samplesize)

But the only thing I get is a long waiting time and a lot of empty columns. So it's obvious that I'm doing something wrong.

Any suggestions?

Thanks!

CodePudding user response：

dplyr

library(dplyr)
left_join(bigdf, sampledf, by = c(Gender = "ID")) %>%
  group_by(Gender) %>%
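  # sample(n(), k) draws k random row positions per group;
  # a bare sample(k) only permutes 1:k and would always keep the first k rows
  filter(row_number() %in% sample(n(), first(samplesize))) %>%
  ungroup() %>%
  select(-samplesize)
# One possible result (rows vary between runs):
# # A tibble: 3 × 2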
#   Name  Gender
#   <chr> <chr> 
# 1 Jon   male  
# 2 Maria female
# 3 Tina  female

base R

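merge(bigdf, sampledf, by.x = "Gender", by.y = "ID") |>
  subset(ave(samplesize, Gender,
             # sample(length(z), z[1]) picks z[1] random positions in the group;
             # sample(z[1]) alone would always select the first z[1] rows
             FUN = function(z) seq_along(z) %in% sample(length(z), z[1])) > 0,
         select = -samplesize)
# One possible result (rows vary between runs):
#   Gender  Name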
# 1 female Maria
# 2 female  Tina
# 4   male   Jon

CodePudding user response：

Another base R approach that splits based on gender then samples using lapply and rbinds it all together with do.call:

do.call(rbind, lapply(split(bigdf, bigdf$Gender), function(x)
  x[sample(1:nrow(x), sampledf[sampledf$ID == unique(x$Gender), "samplesize"]), ]))

Output:

#           Name Gender
# female.5  Tina female
# female.3 Maria female
# male      Jack   male
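
For comparison, here is a rough sketch of the per-ID function idea from the question, rewritten in base R with Map. It assumes the example data above; sample_by_id is a hypothetical helper name, and set.seed is there only so the draw can be repeated. Note that the example's sampledf has no Gender column, so the sketch passes sampledf$ID instead.

```r
# Rebuild the example data from the question.
bigdf <- data.frame(
  Name   = c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura"),
  Gender = c("male", "male", "female", "male", "female", "male", "female"),
  stringsAsFactors = FALSE
)
sampledf <- data.frame(
  ID         = c("male", "female"),
  samplesize = c(1, 2),
  stringsAsFactors = FALSE
)

# sample_by_id is a hypothetical helper, not part of any package.
sample_by_id <- function(id, n) {
  idx <- which(bigdf$Gender == id)          # row positions belonging to this ID
  # Index through sample.int so a group with a single row is handled correctly.
  bigdf[idx[sample.int(length(idx), n)], ]  # n rows, drawn without replacement
}

set.seed(42)  # assumption: seed chosen arbitrarily, only for repeatability
pieces <- Map(sample_by_id, sampledf$ID, sampledf$samplesize)
result <- do.call(rbind, pieces)

# The counts per gender should match sampledf$samplesize.
table(result$Gender)
```

Map returns one data frame per row of sampledf, so nothing is simplified into a matrix the way mapply might do by default. As written, the sketch stops with an error if a requested sample size exceeds the number of rows available for that ID.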