I'm still relatively new to R and am having trouble figuring out how to randomly allocate data based on a known proportion. Or, if there is another way to crosswalk this data - I'd love to learn that, also.
I have health data at the zip code level that I need to allocate to the census tract level. My health data is in this form:
patzip dx_prin patid
95437 L03111 0123
95437 L03111 0124
95437 L02112 0125
95437 L03114 0126
95437 L01001 0127
95437 L02112 0128
...
where each row is a unique patient visit and "patzip" is the patient zip code.
I have a crosswalk file from HUD that has all the California zip codes and the proportion of residential addresses within that zipcode that fall within each California census tract. That data is in this form:
zip tract res_ratio
95437 06045010300 0.2113775795
95437 06045010400 0.4151701060
95437 06045010200 0.0072504183
95437 06045010500 0.2339096486
95437 06045011002 0.1322922476
...
What I'd like to do is randomly allocate each person in my health dataset to a census tract based on the "res_ratio" value. So, 21% of the patients with zip code 95437 would go to the first census tract, 41.5% to the second census tract, and so on. I feel like the answer to this problem is likely simple, but I've been stuck for a few days. I'd really appreciate any help or advice!
CodePudding user response:
Your answer will depend to some extent how you assign rows from the health data when the proportions within the tracts don't result in integers when multiplied by the number of rows. Here I've ensured that new_n
will be sum to n
in the desired_split
data frame so that the number of rows will be equal to the number of rows in the health dataset. In that case, after sorting them equivalently, they can be joined naively (with bind_cols()
). The random assignment occurs because they have already been randomly sorted within zip code (with slice_sample()
).
library(dplyr)
library(tidyr)
health_data <- tribble(
~patzip, ~dx_prin, ~patid,
95437, "L03111", 0123,
95437, "L03111", 0124,
95437, "L02112", 0125,
95437, "L03114", 0126,
95437, "L01001", 0127,
95437, "L02112", 0128,
95438, "L03111", 0123,
95438, "L03111", 0124,
95438, "L02112", 0125,
95438, "L03114", 0126,
95439, "L01001", 0127,
95439, "L02112", 0128
)
tract_data <- tribble(
~zip, ~tract, ~res_ratio,
95437, "06045010300", 0.2113775795,
95437, "06045010400", 0.4151701060,
95437, "06045010200", 0.0072504183,
95437, "06045010500", 0.2339096486,
95437, "06045011002", 0.1322922476,
95438, "06045320200", 0.4239835343,
95438, "06042345401", 0.1829403728,
95438, "06044304322", 0.3930761000,
95439, "06045234002", 0.2500000000,
95439, "06045123003", 0.7500000000
)
desired_split <- health_data %>%
count(patzip) %>%
right_join(tract_data, by = c("patzip" = "zip")) %>%
group_by(patzip) %>%
mutate(frac = n * res_ratio,
cumsum_round = round(cumsum(frac)),
new_n = cumsum_round - lag(cumsum_round, default = 0)) %>%
uncount(weights = new_n) %>%
select(zip = patzip, tract) %>%
arrange(zip)
assigned_tracts <- health_data %>%
slice_sample(prop = 1) %>%
arrange(patzip) %>%
bind_cols(desired_split)
assigned_tracts
#> # A tibble: 12 × 5
#> patzip dx_prin patid zip tract
#> <dbl> <chr> <dbl> <dbl> <chr>
#> 1 95437 L02112 125 95437 06045010300
#> 2 95437 L03111 124 95437 06045010400
#> 3 95437 L03111 123 95437 06045010400
#> 4 95437 L01001 127 95437 06045010400
#> 5 95437 L03114 126 95437 06045010500
#> 6 95437 L02112 128 95437 06045011002
#> 7 95438 L03111 123 95438 06045320200
#> 8 95438 L02112 125 95438 06045320200
#> 9 95438 L03111 124 95438 06044304322
#> 10 95438 L03114 126 95438 06044304322
#> 11 95439 L01001 127 95439 06045123003
#> 12 95439 L02112 128 95439 06045123003
Created on 2022-05-31 by the reprex package (v2.0.1)
CodePudding user response:
You can do this by:
- On the
px
dataset, assign a random uniform number to each patient - On the
hud
dataset, create a lower/upper bound for each tract within a zip code - Do a non-equi join using data.table.
Here is the code:
# load data.table library
library(data.table)
# set both of the data frames to data.table
setDT(px)
setDT(hud)
# add a random uniform variate to each row of `px`
px[, rnd:=runif(.N)]
# add a lower and upper bound on the `hud` using `cumsum`, by zip
hud[, `:=`(lower = cumsum(res_ratio)-res_ratio, upper=cumsum(res_ratio)), by=zip]
# join the datatables, using a non-equi join
px[hud, on=.(patzip=zip, rnd>=lower, rnd<upper)]
To illustrate, I take your patient data (call it px
) and I randomly re-sample 10,000 rows from it:
set.seed(123)
px <- px[sample(1:6, 10000, replace=T)]
Now, running the above code returns an assignment of tract
to these 10,000 patients, in the following proportion:
tract prop
1: 6045010300 0.2144
2: 6045010400 0.4109
3: 6045010200 0.0070
4: 6045010500 0.2421
5: 6045011002 0.1256