R code to randomly allocate data based on probability/proportion-CodePudding

I'm still relatively new to R and am having trouble figuring out how to randomly allocate data based on a known proportion. Or, if there is another way to crosswalk this data - I'd love to learn that, also.

I have health data at the zip code level that I need to allocate to the census tract level. My health data is in this form:

patzip     dx_prin    patid
95437      L03111     0123
95437      L03111     0124
95437      L02112     0125
95437      L03114     0126
95437      L01001     0127
95437      L02112     0128
...

where each row is a unique patient visit and "patzip" is the patient zip code.

I have a crosswalk file from HUD that has all the California zip codes and the proportion of residential addresses within that zipcode that fall within each California census tract. That data is in this form:

zip     tract         res_ratio
95437   06045010300   0.2113775795
95437   06045010400   0.4151701060
95437   06045010200   0.0072504183
95437   06045010500   0.2339096486
95437   06045011002   0.1322922476
...

What I'd like to do is randomly allocate each person in my health dataset to a census tract based on the "res_ratio" value. So, 21% of the patients with zip code 95437 would go to the first census tract, 41.5% to the second census tract, and so on. I feel like the answer to this problem is likely simple, but I've been stuck for a few days. I'd really appreciate any help or advice!

CodePudding user response：

Your answer will depend to some extent how you assign rows from the health data when the proportions within the tracts don't result in integers when multiplied by the number of rows. Here I've ensured that new_n will be sum to n in the desired_split data frame so that the number of rows will be equal to the number of rows in the health dataset. In that case, after sorting them equivalently, they can be joined naively (with bind_cols()). The random assignment occurs because they have already been randomly sorted within zip code (with slice_sample()).

library(dplyr)
library(tidyr)

health_data <- tribble(
  ~patzip, ~dx_prin, ~patid,
  95437, "L03111", 0123,
  95437, "L03111", 0124,
  95437, "L02112", 0125,
  95437, "L03114", 0126,
  95437, "L01001", 0127,
  95437, "L02112", 0128,
  95438, "L03111", 0123,
  95438, "L03111", 0124,
  95438, "L02112", 0125,
  95438, "L03114", 0126,
  95439, "L01001", 0127,
  95439, "L02112", 0128
)
tract_data <- tribble(
  ~zip, ~tract, ~res_ratio,
  95437, "06045010300", 0.2113775795,
  95437, "06045010400", 0.4151701060,
  95437, "06045010200", 0.0072504183,
  95437, "06045010500", 0.2339096486,
  95437, "06045011002", 0.1322922476,
  95438, "06045320200", 0.4239835343,
  95438, "06042345401", 0.1829403728,
  95438, "06044304322", 0.3930761000,
  95439, "06045234002", 0.2500000000,
  95439, "06045123003", 0.7500000000
)

desired_split <- health_data %>% 
  count(patzip) %>% 
  right_join(tract_data, by = c("patzip" = "zip")) %>% 
  group_by(patzip) %>% 
  mutate(frac = n * res_ratio,
         cumsum_round = round(cumsum(frac)),
         new_n = cumsum_round - lag(cumsum_round, default = 0)) %>% 
  uncount(weights = new_n) %>% 
  select(zip = patzip, tract) %>% 
  arrange(zip)

assigned_tracts <- health_data %>% 
  slice_sample(prop = 1) %>% 
  arrange(patzip) %>% 
  bind_cols(desired_split)

assigned_tracts
#> # A tibble: 12 × 5
#>    patzip dx_prin patid   zip tract      
#>     <dbl> <chr>   <dbl> <dbl> <chr>      
#>  1  95437 L02112    125 95437 06045010300
#>  2  95437 L03111    124 95437 06045010400
#>  3  95437 L03111    123 95437 06045010400
#>  4  95437 L01001    127 95437 06045010400
#>  5  95437 L03114    126 95437 06045010500
#>  6  95437 L02112    128 95437 06045011002
#>  7  95438 L03111    123 95438 06045320200
#>  8  95438 L02112    125 95438 06045320200
#>  9  95438 L03111    124 95438 06044304322
#> 10  95438 L03114    126 95438 06044304322
#> 11  95439 L01001    127 95439 06045123003
#> 12  95439 L02112    128 95439 06045123003

^{Created on 2022-05-31 by the reprex package (v2.0.1)}

CodePudding user response：

You can do this by:

On the px dataset, assign a random uniform number to each patient
On the hud dataset, create a lower/upper bound for each tract within a zip code
Do a non-equi join using data.table.

Here is the code:

# load data.table library
library(data.table)

# set both of the data frames to data.table
setDT(px)
setDT(hud)

# add a random uniform variate to each row of `px`
px[, rnd:=runif(.N)]

# add a lower and upper bound on the `hud` using `cumsum`, by zip
hud[, `:=`(lower = cumsum(res_ratio)-res_ratio, upper=cumsum(res_ratio)), by=zip]

# join the datatables, using a non-equi join
px[hud, on=.(patzip=zip, rnd>=lower, rnd<upper)]

To illustrate, I take your patient data (call it px) and I randomly re-sample 10,000 rows from it:

set.seed(123)
px <- px[sample(1:6, 10000, replace=T)]

Now, running the above code returns an assignment of tract to these 10,000 patients, in the following proportion:

        tract   prop
1: 6045010300 0.2144
2: 6045010400 0.4109
3: 6045010200 0.0070
4: 6045010500 0.2421
5: 6045011002 0.1256