Home > Software design >  How to join data with a weighted sampling?
How to join data with a weighted sampling?

Time:10-05

I'm looking to do a weighted join between two datasets:

library(tidyverse)
set.seed(1)

test.sample <- data.frame(zip=sample(1:3,50,replace = TRUE))

index.dat <- data.frame(zip=c(1,1,2,3,3,3),
                  fips=c("A1", "A2", "B", "C1", "C2","C3"),
                  prob=c(.75,.25,1,.7,.2,.1))

My expected output would be a weighted sample from the index dataset:

results1 <- c(rep("A1",14),rep("A2",4),rep("B",19,),rep("C1",9),rep("C2",3),"C3")

Ultimately trying to join zip codes that match to multiple fips codes from a probability distribution for the population.

This comment is a good description of what I'm trying to overcome: https://stackoverflow.com/a/13316857/4828653

Here's a potential solution I've come up with but given I have billions of records I need something much more performant.

test_function <- function(x) {
index.dat %>% 
filter(zip == x) %>% 
sample_n(size=1,weight=prob) %>% 
select(fips)
}

results2 <- lapply(test.sample$zip, function(x) test_function(x)) %>% 
unlist() %>% 
data.frame(fips = .)

> table(results1)
results1
A1 A2  B C1 C2 C3 
14  4 19  9  3  1 
> table(results2)
results2
A1 A2  B C1 C2 C3 
15  3 19  8  2  3 

CodePudding user response:

You can split index.dat according to the zip, to give a list of data frames for each individual zip code. If you use test.sample$zip to subset this list, you will get a list of 50 data frames with the appropriate zip codes. You can then sample the fip using the weights in the prob column of each data frame.

In your case, that would look like this:

sample_space <- split(index.dat, index.dat$zip)[test.sample$zip]

test.sample$fips <- sapply(sample_space, 
                           function(x) sample(x$fips, 1, prob = x$prob))

Now test.sample$fips will have a random fip chosen from the appropriate zip code, with the sampling done according to the relative weight. If we do a table of test.sampl$fips, we can see that the proportions are about right:

table(test.sample$fips)

#> A1 A2  B C1 C2 
#> 13  5 19 10  3 

The 18 members of zip 1 have been assigned to A1 and A2 with an (almost) 75:25 split. All members of zip 2 are given a B, as expected, and the 13 members of zip 3 have been assigned appropriately (though by chance no C3s have been selected due to its low probability)

If test.sample had 5000 rows, we would see that the proportions are much closer to the expected weightings due to the law of large numbers:

#>   A1   A2    B   C1   C2   C3 
#> 1257  419 1687 1153  325  159 
  • Related