Hi, I'm trying to match two data frames. I have a large data frame with a million observations, and another data frame with an ID variable and the size of the random sample to draw for each ID.
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura")
Gender <- c("male", "male", "female", "male", "female", "male", "female")
bigdf <- data.frame(Name, Gender)
ID <- c("male", "female")
samplesize <- c(1,2)
sampledf <- data.frame(ID, samplesize)
So, what I want is to match both data frames and get the following outcome (for example):
Name | Gender |
---|---|
Ben | male |
Laura | female |
Maria | female |
I tried to create a function like
j <- function(x,y){
output<- filter(bigdf, Gender==x) %>% sample_n(y)
}
mapply(j, sampledf$Gender, sampledf$samplesize)
But all I get is a long wait and a lot of empty columns, so it's obvious that I'm doing something wrong.
Any suggestions?
Thanks!
CodePudding user response:
dplyr
library(dplyr)
left_join(bigdf, sampledf, by = c(Gender = "ID")) %>%
group_by(Gender) %>%
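# sample(n(), k) picks k random row positions; sample(k) alone only shuffles 1:k and keeps the first k rows
filter(row_number() %in% sample(n(), first(samplesize))) %>%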
ungroup() %>%
select(-samplesize)
# # A tibble: 3 × 2
# Name Gender
# <chr> <chr>
# 1 Jon male
# 2 Maria female
# 3 Tina female
base R
merge(bigdf, sampledf, by.x = "Gender", by.y = "ID") |>
subset(ave(samplesize, Gender,
FUN = function(z) seq_along(z) %in% sample(length(z), z[1])) > 0,
select = -samplesize)
# Gender Name
# 1 female Maria
# 2 female Tina
# 4 male Jon
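Since the rows are drawn at random, the printed output will differ from run to run, so it can help to check the per-group counts rather than the names. A minimal sketch, assuming a helper I'm calling check_sizes (a hypothetical name, not from any package) and using the desired outcome from the question as test input:

```r
# Hypothetical helper (not part of any package): TRUE if every ID in `sizes`
# appears in `result$Gender` exactly `samplesize` times.
check_sizes <- function(result, sizes) {
  counts <- table(factor(result$Gender, levels = sizes$ID))
  all(as.vector(counts) == sizes$samplesize)
}

sampledf <- data.frame(ID = c("male", "female"), samplesize = c(1, 2))

# The desired outcome from the question, typed in by hand
wanted <- data.frame(Name = c("Ben", "Laura", "Maria"),
                     Gender = c("male", "female", "female"))

check_sizes(wanted, sampledf)
```

The same call should work on the result of either approach above, as long as the Gender column is kept.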
CodePudding user response:
Another base R approach that splits based on gender, then samples using lapply, and rbinds it all together with do.call:
do.call(rbind, lapply(split(bigdf, bigdf$Gender), function(x)
x[sample(1:nrow(x), sampledf[sampledf$ID == unique(x$Gender), "samplesize"]), ]))
Output:
# Name Gender
# female.5 Tina female
# female.3 Maria female
# male Jack male
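If the one-liner is hard to read, the same split / sample / rbind idea can be written out step by step. This is only a sketch under a few assumptions of mine: the names pieces, sizes, sampled and result are made up for illustration, the seed is fixed just so the draw is repeatable, and every gender in bigdf is assumed to have a row in sampledf.

```r
# Example data copied from the question
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Jack", "Laura")
Gender <- c("male", "male", "female", "male", "female", "male", "female")
bigdf <- data.frame(Name, Gender)
sampledf <- data.frame(ID = c("male", "female"), samplesize = c(1, 2))

set.seed(42)  # assumption: fixed seed only so the draw is repeatable

# 1. Split the big data frame into one piece per gender
pieces <- split(bigdf, bigdf$Gender)

# 2. Look up the requested size for each piece, in the same order as `pieces`
sizes <- sampledf$samplesize[match(names(pieces), sampledf$ID)]

# 3. Draw k random rows from each piece
sampled <- Map(function(piece, k) piece[sample(nrow(piece), k), , drop = FALSE],
               pieces, sizes)

# 4. Stack the pieces back into one data frame and drop the composite row names
result <- do.call(rbind, sampled)
rownames(result) <- NULL
result
```

Looking the sizes up once with match() avoids filtering sampledf inside the loop, which may matter a little when there are many groups.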